By looking at photographs and drawing on past experiences, humans can often perceive depth in images that are themselves perfectly flat. However, getting computers to do the same has proven to be quite difficult.
The problem is hard for several reasons, one being that information is inevitably lost when a three-dimensional scene is reduced to a two-dimensional (2D) representation. There are well-established strategies for recovering 3D information from multiple 2D images, but each has limitations. A new approach called “virtual correspondence,” developed by researchers at MIT and other institutions, can get around some of these shortcomings and succeed in cases where conventional methods fail.
Existing methods that reconstruct 3D scenes from 2D images rely on images that contain some of the same features. Virtual correspondence is a 3D reconstruction method that works even with images taken from extremely different views that do not show the same features.
The standard approach, called “structure from motion,” is modeled on a key aspect of human vision. Because our eyes are separated from each other, they each provide a slightly different view of an object. A triangle can be formed whose sides consist of the line segment connecting the two eyes, plus the line segments connecting each eye to a common point on the object in question. Knowing the angles in the triangle and the distance between the eyes, it is possible to work out the distance to that point using elementary geometry – although the human visual system, of course, can make rough judgments about distance without having to go through arduous trigonometric calculations. This same basic idea, known as triangulation or parallax, has been exploited by astronomers for centuries to calculate the distance to faraway stars.
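The “elementary geometry” mentioned above can be written out in a few lines. What follows is only a minimal sketch, assuming the two interior angles of the triangle (measured at each eye, between the baseline and the line of sight) are known; the function name and its parameters are illustrative and do not come from the researchers’ work.

```python
import math

def distance_from_baseline(baseline, angle_left, angle_right):
    """Perpendicular distance from the baseline to the observed point.

    baseline    -- distance between the two viewpoints (any length unit)
    angle_left  -- interior angle at the left viewpoint, in radians
    angle_right -- interior angle at the right viewpoint, in radians
    (All names here are hypothetical, chosen for illustration.)
    """
    # The three angles of a triangle sum to pi, so this is the angle at the point.
    apex = math.pi - angle_left - angle_right
    if apex <= 0:
        raise ValueError("the two lines of sight never meet")
    # Law of sines gives the length of one side; multiplying by the sine of
    # the adjacent angle projects it onto the perpendicular to the baseline.
    return baseline * math.sin(angle_left) * math.sin(angle_right) / math.sin(apex)

# Two viewpoints 2 units apart, each seeing the point at 45 degrees:
# the point sits about 1 unit in front of the baseline.
print(distance_from_baseline(2.0, math.pi / 4, math.pi / 4))
```

The formula also shows why nearby objects are easier to judge than distant ones: as the point recedes, the angle at the point shrinks toward zero and small errors in the measured angles produce large errors in the distance.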
Triangulation is a key element of structure from motion. Suppose you have two images of an object – a sculpted figure of a rabbit, for example – one taken from the left side of the figure and one from the right. The first step would be to find points or pixels on the surface of the rabbit that the two images share. A researcher could go from there to determine the “poses” of the two cameras – the positions from which the photos were taken and the direction in which each camera was facing. Knowing the distance between the cameras and the way they were oriented, one could then triangulate to calculate the distance to a selected point on the rabbit. And if enough common points are identified, it might be possible to get a detailed sense of the overall shape of the object (or “rabbit”).
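The last step described above, triangulating a point once the camera poses are known, can be sketched as follows. This is a simplified illustration rather than the pipeline any particular system uses: it assumes each camera contributes a position and a viewing ray toward the matched point, and it takes the midpoint of the shortest segment between the two rays, since real rays rarely intersect exactly. The function name `triangulate_midpoint` is hypothetical.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triangulate_midpoint(cam1, dir1, cam2, dir2):
    """Estimate a 3D point from two camera positions and viewing rays.

    cam1, cam2 -- camera positions as (x, y, z)
    dir1, dir2 -- ray directions from each camera toward the matched point
    Returns the midpoint of the closest approach between the two rays.
    """
    w = [a - b for a, b in zip(cam1, cam2)]
    a, b, c = _dot(dir1, dir1), _dot(dir1, dir2), _dot(dir2, dir2)
    d, e = _dot(dir1, w), _dot(dir2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth cannot be recovered")
    s = (b * e - c * d) / denom  # distance parameter along the first ray
    t = (a * e - b * d) / denom  # distance parameter along the second ray
    p1 = [o + s * k for o, k in zip(cam1, dir1)]
    p2 = [o + t * k for o, k in zip(cam2, dir2)]
    return tuple((x + y) / 2 for x, y in zip(p1, p2))

# Two cameras 2 units apart, both looking at a point 5 units in front of them.
# The estimate recovers the point (0, 0, 5).
print(triangulate_midpoint((-1, 0, 0), (1, 0, 5), (1, 0, 0), (-1, 0, 5)))
```

Repeating this for every matched point yields a cloud of 3D points that traces the surface of the object, which is why the number of shared points matters so much.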
“Tremendous progress has been made with this technique,” comments Wei-Chiu Ma, a doctoral student in MIT’s Department of Electrical Engineering and Computer Science (EECS), “and people are now matching pixels with ever-increasing precision.” As long as we can observe the same point or points in different images, we can use existing algorithms to determine the relative positions of the cameras. But the approach only works if the two images largely overlap. If the input images have very different points of view – and therefore contain few or no points in common – he adds, “the system may fail.”
In the summer of 2020, Ma came up with a new way of doing things that could dramatically expand the scope of structure from motion. MIT was closed at the time due to the pandemic, and Ma was at home in Taiwan, relaxing on the couch. As he looked at the palm of his hand, and at his fingertips in particular, it occurred to him that he could clearly picture his fingernails, even though they were not visible to him.
This was the inspiration for the notion of virtual correspondence, which Ma later pursued with his advisor, Antonio Torralba, an EECS professor and researcher at the Computer Science and Artificial Intelligence Laboratory, along with Anqi Joyce Yang and Raquel Urtasun of the University of Toronto and Shenlong Wang of the University of Illinois. “We want to incorporate human knowledge and reasoning into our existing 3D algorithms,” says Ma, referring to the same reasoning that allowed him to look at his fingertips and conjure up the fingernails on the other side – the side that he couldn’t see.
Structure from motion works when two images have points in common, because a triangle can then always be drawn connecting the cameras to the common point, and depth information can be gleaned from it. Virtual correspondence offers a way to go further. Suppose, again, that a photo is taken of the left side of a rabbit and another photo is taken of the right side. The first photo might reveal a spot on the rabbit’s left leg. But since light travels in a straight line, one could use general knowledge of rabbit anatomy to work out where a ray running from the camera to the leg would emerge on the other side of the rabbit. That exit point may be visible in the other image (taken from the right side), and if so, it could be used via triangulation to calculate distances in the third dimension.
Virtual correspondence, in other words, takes a point from the first image on the rabbit’s left flank and connects it to a point on the rabbit’s unseen right flank. “The advantage here is that you don’t need overlapping images to proceed,” notes Ma. “By looking through the object and exiting the other end, this technique provides points in common to work with that were not initially available.” And in this way, the constraints imposed on the conventional method can be circumvented.
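The idea of following a ray through an object and finding where it comes out can be illustrated with a toy example. The sketch below is not the researchers’ method: it assumes the object’s shape is already known and, for simplicity, stands in a sphere for the rabbit. It follows a ray from the left camera through the sphere, finds the exit point on the far side, and checks whether that exit point faces a second camera on the right. All function names are hypothetical.

```python
import math

def ray_sphere_hits(origin, direction, center, radius):
    """Return (entry, exit) points where a ray crosses a sphere, or None."""
    norm = math.sqrt(sum(k * k for k in direction))
    d = [k / norm for k in direction]            # unit ray direction
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(x * y for x, y in zip(oc, d))
    c = sum(x * x for x in oc) - radius ** 2
    disc = b * b - c
    if disc < 0:
        return None                              # the ray misses the sphere
    root = math.sqrt(disc)
    t_in, t_out = -b - root, -b + root
    if t_out < 0:
        return None                              # the sphere is behind the camera
    def point_at(t):
        return tuple(o + t * k for o, k in zip(origin, d))
    return point_at(t_in), point_at(t_out)

def faces_camera(point, center, camera):
    """True if the sphere's surface at `point` is turned toward `camera`."""
    normal = [p - c for p, c in zip(point, center)]
    to_camera = [k - p for k, p in zip(camera, point)]
    return sum(n * v for n, v in zip(normal, to_camera)) > 0

# Stand-in object: a unit sphere at the origin (an assumption for illustration).
left_cam, right_cam = (-5.0, 0.0, 0.0), (5.0, 0.0, 0.0)
entry, exit_point = ray_sphere_hits(left_cam, (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)

# The left camera sees `entry`; the right camera can see `exit_point`.
# Both lie on the same ray from the left camera, so the pair acts as a
# match between two images that share no visible surface.
print(entry, exit_point, faces_camera(exit_point, (0.0, 0.0, 0.0), right_cam))
```

A sphere makes the exit point trivial to compute. For a real object, the shape has to come from prior knowledge of what such objects look like.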
One might wonder how much prior knowledge is needed for this to work, because if you had to know the shape of everything in the image from the start, no math would be needed. The trick Ma and his colleagues employ is to use certain familiar objects in an image, such as the human form, to serve as an “anchor.” They have devised methods that use our knowledge of the human form to help determine camera poses and, in some cases, infer depth within the image. Additionally, Ma explains, “the prior knowledge and common sense built into our algorithms are first captured and encoded by neural networks.”
The team’s ultimate goal is much more ambitious, says Ma. “We want to create computers that can understand the world in three dimensions, just like humans.” This objective is still far from being achieved, he acknowledges. “But to go beyond where we are today and build a system that acts like humans, we need a more nurturing framework. In other words, we need to develop computers that can not only interpret still images, but also understand short video clips and possibly feature films.”
A scene from the movie “Good Will Hunting” demonstrates what he has in mind. The audience sees Matt Damon and Robin Williams from behind, sitting on a bench overlooking a pond in Boston’s Public Garden. The next shot, taken from the opposite side, offers frontal (albeit fully clothed) views of Damon and Williams with an entirely different background. Anyone who watches the film knows immediately that they are watching the same two people, even though the two shots have nothing in common. Computers can’t make that conceptual leap just yet, but Ma and his colleagues are working hard to make these machines more adept and, at least as far as vision is concerned, more like us.
The team’s work will be presented next week at the Conference on Computer Vision and Pattern Recognition.