The process of inferring 3D structure from images alone is called photogrammetry. Structure can be recovered from vision when the object in question is viewed from more than one angle. As the viewing angle changes, things that are closer appear to move at a different rate than things that are farther away. You have probably noticed this effect when travelling on a highway; it is called parallax.
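To put rough numbers on the highway example, here is a minimal sketch. It assumes a viewer who moves sideways by some baseline while looking at a point that started out straight ahead; the function name and the distances are made up for illustration.

```python
import math


def parallax_shift_deg(baseline_m, distance_m):
    """Apparent angular shift, in degrees, of a point that was straight
    ahead before the viewer moved sideways by baseline_m.

    Hypothetical helper for illustration: simple right-triangle geometry.
    """
    return math.degrees(math.atan2(baseline_m, distance_m))


# The viewer moves 10 m sideways (illustrative values).
near = parallax_shift_deg(10.0, 20.0)    # roadside sign 20 m away
far = parallax_shift_deg(10.0, 2000.0)   # hill 2 km away

print(f"near object shifts by {near:.2f} degrees")
print(f"far object shifts by {far:.2f} degrees")
```

With these values the nearby sign shifts by about 26.6 degrees while the distant hill shifts by about 0.29 degrees, which is why the roadside seems to rush past while the horizon barely moves.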

If the same point is detected in two or more images, its location can be triangulated by tracing the direction from which it was viewed in each image. Doing this manually is feasible, but excruciatingly tedious. Nevertheless, it has been done since the 1800s with aerial imagery taken from kites and balloons to map the topography of large areas.
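The triangulation step can be sketched as follows, assuming the position of each camera and the viewing direction toward the point are already known. Measured rays rarely intersect exactly, so this sketch returns the midpoint of the shortest segment joining the two rays; that midpoint method is one simple choice among several, and all names and coordinates here are hypothetical.

```python
def dot(u, v):
    """Dot product of two 3D vectors given as tuples."""
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Estimate a 3D point from two viewing rays.

    c1, c2: camera centres; d1, d2: direction of each ray toward the point.
    Returns the midpoint of the shortest segment between the two rays.
    """
    w = tuple(a - b for a, b in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        # Parallel rays carry no depth information.
        raise ValueError("rays are parallel; cannot triangulate")
    s = (b * e - c * d) / denom  # distance parameter along ray 1
    t = (a * e - b * d) / denom  # distance parameter along ray 2
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((x + y) / 2 for x, y in zip(p1, p2))


# Two cameras 1 m apart, both looking at the same point (illustrative values).
point = triangulate_midpoint(
    (0.0, 0.0, 0.0), (0.5, 0.2, 4.0),
    (1.0, 0.0, 0.0), (-0.5, 0.2, 4.0),
)
print(point)
```

In this example the two rays meet exactly, so the estimate lands on (0.5, 0.2, 4.0) up to floating-point rounding. With noisy directions the rays would pass close to each other without touching, and the midpoint gives a reasonable compromise.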

Fortunately for us, computational power and computer vision algorithms have advanced enough that individual points can be matched by machines rather than by human labor. Once points have been matched, all that is left is some trigonometry to get a precise location in three dimensions. If every pixel in an image is traced back to its origin in 3D, then a surface can be reconstructed.
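As a rough illustration of what automated matching can look like, the sketch below pairs points by comparing descriptor vectors, that is, short lists of numbers summarising the image patch around each point. It uses nearest-neighbour search with a ratio test, which is one common approach and not necessarily what any particular photogrammetry package does. The descriptors are toy values invented for the example.

```python
import math


def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b.

    A match is kept only if the best candidate is clearly closer than the
    second best (ratio test), which discards ambiguous points.
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the ratio test needs at least two candidates
        (best, j), (second, _) = dists[0], dists[1]
        if best < ratio * second:
            matches.append((i, j))
    return matches


# Toy 2D descriptors (real ones typically have many more dimensions).
image_a = [(0.0, 1.0), (5.0, 5.0), (2.0, 2.0)]
image_b = [(5.1, 4.9), (0.1, 1.0), (2.1, 2.0), (2.0, 2.1)]

matches = match_descriptors(image_a, image_b)
print(matches)
```

Here the first two points in image_a each have one clear counterpart in image_b, while the third has two equally good candidates and is dropped. Each surviving pair can then be handed to the triangulation step to recover a 3D location.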