Given a still image, CSAIL deep-learning system generates videos that predict what will happen next in a scene.
Living in a dynamic physical world, it’s easy to forget how effortlessly we understand our surroundings. With minimal thought, we can figure out how scenes change and objects interact.
But what’s second nature for us is still a huge problem for machines. With the limitless number of ways that objects can move, teaching computers to predict future actions can be difficult.
Recently, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have moved a step closer, developing a deep-learning algorithm that, given a still image from a scene, can create a brief video that simulates the future of that scene.
Trained on 2 million unlabeled videos that include a year’s worth of footage, the algorithm generated videos that human subjects deemed to be realistic 20 percent more often than a baseline model.
The team says that future versions could be used for everything from improved security tactics to safer self-driving cars. According to CSAIL PhD student and first author Carl Vondrick, the algorithm can also help machines recognize people’s activities without expensive human annotations.
“These videos show us what computers think can happen in a scene,” says Vondrick. “If you can predict the future, you must have understood something about the present.”
Vondrick wrote the paper with MIT professor Antonio Torralba and Hamed Pirsiavash, a former CSAIL postdoc who is now a professor at the University of Maryland Baltimore County (UMBC). The work will be presented at next week’s Neural Information Processing Systems (NIPS) conference in Barcelona.
How it works
Multiple researchers have tackled similar topics in computer vision, including MIT Professor Bill Freeman, whose new work on “visual dynamics” also creates future frames in a scene. But where his model focuses on extrapolating videos into the future, Torralba’s model can also generate completely new videos that haven’t been seen before.
Previous systems build up scenes frame by frame, which creates a large margin for error. In contrast, this work focuses on processing the entire scene at once, with the algorithm generating as many as 32 frames from scratch per second.
“Building up a scene frame-by-frame is like a big game of ‘Telephone,’ which means that the message falls apart by the time you go around the whole room,” says Vondrick. “By instead trying to predict all frames simultaneously, it’s as if you’re talking to everyone in the room at once.”
Of course, there’s a trade-off to generating all frames simultaneously: While the predictions become more accurate, the computer model also becomes more complex for longer videos. Nevertheless, this complexity may be worth it for sharper predictions.
To create multiple frames, researchers taught the model to generate the foreground separate from the background, and to then place the objects in the scene to let the model learn which objects move and which objects don’t.
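One way to picture the foreground/background split is as a per-pixel blend. The sketch below is an illustration only, not the team's code: it assumes a hypothetical per-pixel mask (1 for moving foreground, 0 for static background), and the names composite_frame and composite_video are invented for this example.

```python
def composite_frame(foreground, background, mask):
    """Blend one frame: mask near 1 keeps the moving foreground pixel,
    mask near 0 keeps the static background pixel."""
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]


def composite_video(foreground_frames, background, mask_frames):
    """Reuse a single static background for every frame of the clip."""
    return [
        composite_frame(fg, background, mask)
        for fg, mask in zip(foreground_frames, mask_frames)
    ]


# Tiny 1x2 "video" with two frames: the left pixel moves, the right is static.
background = [[0.2, 0.8]]
foreground_frames = [[[1.0, 0.0]], [[0.5, 0.0]]]
mask_frames = [[[1.0, 0.0]], [[1.0, 0.0]]]
video = composite_video(foreground_frames, background, mask_frames)
```

In this toy clip the left pixel changes from frame to frame while the right pixel always shows the background value, which is the kind of moving-versus-static distinction the paragraph describes.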
The team used a deep-learning method called “adversarial learning” that involves training two competing neural networks. One network generates video, and the other discriminates between the real and generated videos. Over time, the generator learns to fool the discriminator.
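The tug-of-war between the two networks can be sketched with their loss functions alone. The example below is a minimal illustration assuming the standard cross-entropy form of adversarial training; the article does not give the exact objective, and the discriminator outputs shown are made-up numbers, not results.

```python
import math


def binary_cross_entropy(prob, target):
    """Cross-entropy between a predicted probability and a 0/1 target."""
    eps = 1e-12  # keeps log() away from zero
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def discriminator_loss(scores_on_real, scores_on_generated):
    """Low when real clips score near 1 and generated clips score near 0."""
    real = sum(binary_cross_entropy(p, 1) for p in scores_on_real)
    fake = sum(binary_cross_entropy(p, 0) for p in scores_on_generated)
    return (real + fake) / (len(scores_on_real) + len(scores_on_generated))


def generator_loss(scores_on_generated):
    """Low when the discriminator mistakes generated clips for real ones."""
    total = sum(binary_cross_entropy(p, 1) for p in scores_on_generated)
    return total / len(scores_on_generated)


# Hypothetical discriminator outputs: probability that a clip is real.
early = {"real": [0.9, 0.8], "generated": [0.1, 0.2]}  # fakes are easy to spot
later = {"real": [0.6, 0.5], "generated": [0.5, 0.6]}  # fakes fool the judge
```

Comparing generator_loss(early["generated"]) with generator_loss(later["generated"]) shows the generator's loss falling as its clips become harder to tell apart, while the discriminator's loss rises over the same period.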
From that, the model can create videos resembling scenes from beaches, train stations, hospitals, and golf courses. For example, the beach model produces beaches with crashing waves, and the golf model has people walking on grass.
Testing the scene
The team compared the videos against a baseline of generated videos and asked subjects which they thought were more realistic. From over 13,000 opinions of 150 users, subjects chose the generative model videos 20 percent more often than the baseline.
Vondrick stresses that the model still lacks some fairly simple common-sense principles. For example, it often doesn’t understand that objects are still there when they move, like when a train passes through a scene. The model also tends to make humans and objects look much larger than they are in reality.
Another limitation is that the generated videos are just one and a half seconds long, which the team hopes to be able to increase in future work. The challenge is that this requires tracking longer dependencies to ensure that the scene still makes sense over longer time periods. One way to do this would be to add human supervision.
“It’s difficult to aggregate accurate information across long time periods in videos,” says Vondrick. “If the video has both cooking and eating activities, you have to be able to link those two together to make sense of the scene.”
These types of models aren’t limited to predicting the future. Generative videos can be used for adding animation to still images, like the animated newspaper from the Harry Potter books. They could also help detect anomalies in security footage and compress data for storing and sending longer videos.
“In the future, this will let us scale up vision systems to recognize objects and scenes without any supervision, simply by training them on video,” says Vondrick.
Learn more: Creating videos of the future
Physicians have long used visual judgment of medical images to determine the course of cancer treatment. A new program package from Fraunhofer researchers reveals changes in images and facilitates this task using deep learning.
The experts will demonstrate this software in Chicago from November 27 to December 2 at RSNA, the world’s largest radiology meeting.
Has a tumor shrunk during the course of treatment over several months, or have new tumors developed? To answer questions like these, physicians often perform CT and MRI scans. Tumors are usually evaluated only visually, and new tumors are often overlooked.
“Our program package increases confidence during tumor measurement and follow-up,” explains Mark Schenk from the Fraunhofer Institute for Medical Image Computing MEVIS in Bremen, Germany. “The software can, for example, determine how the volume of a tumor changes over time and supports the detection of new tumors.” The package consists of modular processing components and can help medical technology manufacturers automate progress monitoring.
The computer learns on its own
The package is unique in its use of deep learning, a new type of machine learning that reaches far beyond existing approaches. This method is helpful for image segmentation, during which experts designate exact organ outlines. Existing computer segmentation programs seek clearly defined image features such as certain gray values.
“However, this can often lead to errors,” according to Fraunhofer researcher Markus Harz. “The software assigns areas to the liver that do not belong to the organ.” These errors must be corrected by physicians, a process which can often be quite time-consuming.
The new deep learning approaches promise improved results and should save physicians valuable time. To demonstrate their self-learning methods, Fraunhofer scientists trained the software with CT liver images from 149 patients. Results showed that the more data the program analyzed, the better it could automatically identify liver contours.
Finding hidden metastases
A further application of the approach is image registration, in which software aligns images from different patient visits so that physicians can easily compare them. Machine learning can aid the particularly difficult task of locating bone metastases in images of the torso, in which hip bones, ribs, and spine are visible. Currently, these metastases are often overlooked due to time constraints in clinical practice. Deep learning methods can help reliably discover metastases and thus improve treatment outcomes.
Researchers focus on a combination of classical approaches and machine learning: “We wish to harness existing expertise to implement deep learning as effectively and reliably as possible,” stresses Harz. Fraunhofer MEVIS builds upon years of experience in practical application: for example, the algorithms for highly precise lung image registration have been integrated into several commercial medical software applications.
Learn more: Machine learning to help physicians
New training technique would reveal the basis for machine-learning systems’ decisions.
In recent years, the best-performing systems in artificial-intelligence research have come courtesy of neural networks, which look for patterns in training data that yield useful predictions or classifications. A neural net might, for instance, be trained to recognize certain objects in digital images or to infer the topics of texts.
But neural nets are black boxes. After training, a network may be very good at classifying data, but even its creators will have no idea why. With visual data, it’s sometimes possible to automate experiments that determine which visual features a neural net is responding to. But text-processing systems tend to be more opaque.
At the Association for Computational Linguistics’ Conference on Empirical Methods in Natural Language Processing, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) will present a new way to train neural networks so that they provide not only predictions and classifications but rationales for their decisions.
“In real-world applications, sometimes people really want to know why the model makes the predictions it does,” says Tao Lei, an MIT graduate student in electrical engineering and computer science and first author on the new paper. “One major reason that doctors don’t trust machine-learning methods is that there’s no evidence.”
“It’s not only the medical domain,” adds Regina Barzilay, the Delta Electronics Professor of Electrical Engineering and Computer Science and Lei’s thesis advisor. “It’s in any domain where the cost of making the wrong prediction is very high. You need to justify why you did it.”
“There’s a broader aspect to this work, as well,” says Tommi Jaakkola, an MIT professor of electrical engineering and computer science and the third coauthor on the paper. “You may not want to just verify that the model is making the prediction in the right way; you might also want to exert some influence in terms of the types of predictions that it should make. How does a layperson communicate with a complex model that’s trained with algorithms that they know nothing about? They might be able to tell you about the rationale for a particular prediction. In that sense it opens up a different way of communicating with the model.”
Neural networks are so called because they mimic — approximately — the structure of the brain. They are composed of a large number of processing nodes that, like individual neurons, are capable of only very simple computations but are connected to each other in dense networks.
In a process referred to as “deep learning,” training data is fed to a network’s input nodes, which modify it and feed it to other nodes, which modify it and feed it to still other nodes, and so on. The values stored in the network’s output nodes are then correlated with the classification category that the network is trying to learn — such as the objects in an image, or the topic of an essay.
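The node-to-node flow described above can be illustrated with a very small network. This is a sketch under stated assumptions: the sigmoid squashing function, the hand-picked weights, and the two category labels are all hypothetical choices made for the example, not details from the article.

```python
import math


def layer(inputs, weights, biases):
    """Each node takes a weighted sum of its inputs and squashes it."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(node_w, inputs)) + b)))
        for node_w, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the data through every layer in turn."""
    values = inputs
    for weights, biases in layers:
        values = layer(values, weights, biases)
    return values


# Hypothetical hand-picked settings: 2 inputs -> 2 hidden nodes -> 1 output node.
network = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[2.0, -2.0]], [0.0]),
]
score = forward([1.0, 0.0], network)[0]
label = "category A" if score > 0.5 else "category B"
```

Training would adjust the numbers in network automatically; here they are fixed by hand so the flow from input to output value to category is easy to follow.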
Over the course of the network’s training, the operations performed by the individual nodes are continuously modified to yield consistently good results across the whole set of training examples. By the end of the process, the computer scientists who programmed the network often have no idea what the nodes’ settings are. Even if they do, it can be very hard to translate that low-level information back into an intelligible description of the system’s decision-making process.
In the new paper, Lei, Barzilay, and Jaakkola specifically address neural nets trained on textual data. To enable interpretation of a neural net’s decisions, the CSAIL researchers divide the net into two modules. The first module extracts segments of text from the training data, and the segments are scored according to their length and their coherence: The shorter the segment, and the more of it that is drawn from strings of consecutive words, the higher its score.
The segments selected by the first module are then passed to the second module, which performs the prediction or classification task. The modules are trained together, and the goal of training is to maximize both the score of the extracted segments and the accuracy of prediction or classification.
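The scoring idea can be sketched as a penalty on a word-by-word selection. The version below is an illustration under assumptions: the article states only that shorter and more contiguous segments score higher, so the specific penalty (a count of selected words plus a count of switches between selected and unselected words), the weights, and the function names are hypothetical.

```python
def rationale_penalty(selected, length_weight=1.0, gap_weight=2.0):
    """Penalty for a selection given as a 0/1 flag per word.

    Fewer selected words and fewer switches between selected and
    unselected words both lower the penalty (that is, raise the score).
    """
    num_selected = sum(selected)
    num_switches = sum(1 for a, b in zip(selected, selected[1:]) if a != b)
    return length_weight * num_selected + gap_weight * num_switches


def joint_cost(prediction_error, selected):
    """Combined training cost: prediction error plus the rationale penalty."""
    return prediction_error + rationale_penalty(selected)


def extract(words, selected):
    """Return the words the first module would pass to the second module."""
    return [w for w, s in zip(words, selected) if s]


words = "pours a tan colored head about half an inch thick".split()
contiguous = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]  # "tan colored head"
scattered = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # three isolated words
```

Both selections pick three words, but the contiguous one incurs a smaller penalty, so at equal prediction error joint training would favor it.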
One of the data sets on which the researchers tested their system is a group of reviews from a website where users evaluate different beers. The data set includes the raw text of the reviews and the corresponding ratings, using a five-star system, on each of three attributes: aroma, palate, and appearance.
What makes the data attractive to natural-language-processing researchers is that it’s also been annotated by hand, to indicate which sentences in the reviews correspond to which scores. For example, a review might consist of eight or nine sentences, and the annotator might have highlighted those that refer to the beer’s “tan-colored head about half an inch thick,” “signature Guinness smells,” and “lack of carbonation.” Each sentence is correlated with a different attribute rating.
As such, the data set provides an excellent test of the CSAIL researchers’ system. If the first module has extracted those three phrases, and the second module has correlated them with the correct ratings, then the system has identified the same basis for judgment that the human annotator did.
In experiments, the system’s agreement with the human annotations was 96 percent and 95 percent, respectively, for ratings of appearance and aroma, and 80 percent for the more nebulous concept of palate.
In the paper, the researchers also report testing their system on a database of free-form technical questions and answers, where the task is to determine whether a given question has been answered previously.
In unpublished work, they’ve applied it to thousands of pathology reports on breast biopsies, where it has learned to extract text explaining the bases for the pathologists’ diagnoses. They’re even using it to analyze mammograms, where the first module extracts sections of images rather than segments of text.
“There’s a lot of hype now — and rightly so — around deep learning, and specifically deep learning for natural-language processing,” says Byron Wallace, an assistant professor of computer and information science at Northeastern University. “But a big drawback for these models is that they’re often black boxes. Having a model that not only makes very accurate predictions but can also tell you why it’s making those predictions is a really important aim.”
“As it happens, we have a paper that’s similar in spirit being presented at the same conference,” Wallace adds. “I didn’t know at the time that Regina was working on this, and I actually think hers is better. In our approach, during the training process, while someone is telling us, for example, that a movie review is very positive, we assume that they’ll mark a sentence that gives you the rationale. In this way we train the deep-learning model to extract these rationales. But they don’t make this assumption, so their model works without using direct annotations with rationales, which is a very nice property.”
Learn more: Making computers explain themselves
MIT researchers have developed a new chip designed to implement neural networks. It is 10 times as efficient as a mobile GPU, so it could enable mobile devices to run powerful artificial-intelligence algorithms locally, rather than uploading data to the Internet for processing.
In recent years, some of the most exciting advances in artificial intelligence have come courtesy of convolutional neural networks, large virtual networks of simple information-processing units, which are loosely modeled on the anatomy of the human brain. Neural networks are typically implemented using graphics processing units (GPUs), special-purpose graphics chips found in all computing devices with screens. A mobile GPU, of the type found in a cell phone, might have almost 200 cores, or processing units, making it well suited to simulating a network of distributed processors.
Deep learning has already had a huge impact on computer vision and speech recognition, and it’s making inroads in areas as computer-unfriendly as cooking. Now a new startup led by University of Toronto professor Brendan Frey wants to cause similar reverberations in genomic medicine.
Deep Genomics plans to identify gene variants and mutations never before observed or studied and find how these link to various diseases. And through this work the company believes it can help usher in a new era of personalized medicine.
Genomic research is hard. Scientists still know relatively little about our genes and how they interrelate. But Frey and others in the field now know enough that they can equip machines to do the heavy lifting. And there’s an awful lot of this heavy lifting to do. “Genomics is no longer about small datasets,” Frey tells Gizmag. “It’s now about very, very large datasets.”
For context, the first effort to sequence a full human genome took 13 years – running from 1990 to 2003. There are now many companies working to sequence many genomes at a time. The largest of these is called Illumina. “Illumina,” Frey says, “expects to sequence one million genomes in the next year. Each genome contains three billion letters. That’s a lot of data.”
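A rough back-of-the-envelope calculation shows the scale Frey is describing. The genome and letter counts come from his quote; the storage figure assumes 2 bits per letter, an idealized lower bound chosen for illustration that ignores quality scores, repeated reads, and compression.

```python
genomes_per_year = 1_000_000        # from the quote
letters_per_genome = 3_000_000_000  # from the quote
total_letters = genomes_per_year * letters_per_genome

# Assumption for illustration: 2 bits per letter (four possible bases).
bits_per_letter = 2
total_bytes = total_letters * bits_per_letter // 8
total_terabytes = total_bytes / 1e12

print(f"{total_letters:.1e} letters, about {total_terabytes:.0f} TB at 2 bits each")
```

Under that assumption, one year of sequencing comes to three quadrillion letters, or about 750 terabytes before any real-world overhead.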