Machine-learning system doesn’t require costly hand-annotated data.
In recent years, computers have gotten remarkably good at recognizing speech and images: Think of the dictation software on most cellphones, or the algorithms that automatically identify people in photos posted to Facebook.
But recognition of natural sounds — such as crowds cheering or waves crashing — has lagged behind. That’s because most automated recognition systems, whether they process audio or visual information, are the result of machine learning, in which computers search for patterns in huge compendia of training data. Usually, the training data first has to be annotated by hand, which is prohibitively expensive for all but the highest-demand applications.
Sound recognition may be catching up, however, thanks to researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). At the Neural Information Processing Systems conference next week, they will present a sound-recognition system that outperforms its predecessors but didn’t require hand-annotated data during training.
Instead, the researchers trained the system on video. First, existing computer vision systems that recognize scenes and objects categorized the images in the video. The new system then found correlations between those visual categories and natural sounds.
“Computer vision has gotten so good that we can transfer it to other domains,” says Carl Vondrick, an MIT graduate student in electrical engineering and computer science and one of the paper’s two first authors. “We’re capitalizing on the natural synchronization between vision and sound. We scale up with tons of unlabeled video to learn to understand sound.”
The researchers tested their system on two standard databases of annotated sound recordings, and it was between 13 and 15 percent more accurate than the best-performing previous system. On a data set with 10 different sound categories, it could categorize sounds with 92 percent accuracy, and on a data set with 50 categories it performed with 74 percent accuracy. On those same data sets, humans are 96 percent and 81 percent accurate, respectively.
“Even humans are ambiguous,” says Yusuf Aytar, the paper’s other first author and a postdoc in the lab of MIT professor of electrical engineering and computer science Antonio Torralba. Torralba is the final co-author on the paper.
“We did an experiment with Carl,” Aytar says. “Carl was looking at the computer monitor, and I couldn’t see it. He would play a recording and I would try to guess what it was. It turns out this is really, really hard. I could tell indoor from outdoor, basic guesses, but when it comes to the details — ‘Is it a restaurant?’ — those details are missing. Even for annotation purposes, the task is really hard.”
Because it takes far less power to collect and process audio data than it does to collect and process visual data, the researchers envision that a sound-recognition system could be used to improve the context sensitivity of mobile devices.
When coupled with GPS data, for instance, a sound-recognition system could determine that a cellphone user is in a movie theater and that the movie has started, and the phone could automatically route calls to a prerecorded outgoing message. Similarly, sound recognition could improve the situational awareness of autonomous robots.
“For instance, think of a self-driving car,” Aytar says. “There’s an ambulance coming, and the car doesn’t see it. If it hears it, it can make future predictions for the ambulance — which path it’s going to take — just purely based on sound.”
The researchers’ machine-learning system is a neural network, so called because its architecture loosely resembles that of the human brain. A neural net consists of processing nodes that, like individual neurons, can perform only rudimentary computations but are densely interconnected. Information — say, the pixel values of a digital image — is fed to the bottom layer of nodes, which processes it and feeds it to the next layer, which processes it and feeds it to the next layer, and so on. The training process continually modifies the settings of the individual nodes, until the output of the final layer reliably performs some classification of the data — say, identifying the objects in the image.
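The layered flow of information described above can be sketched in a few lines of NumPy. This toy network (arbitrary sizes, random untrained weights, chosen purely for illustration) just shows data passing from layer to layer, with each node applying a rudimentary computation to its inputs:

```python
import numpy as np

def relu(x):
    # Each node performs only a rudimentary computation on its inputs
    return np.maximum(0.0, x)

def forward(x, weights):
    """Feed data through successive, densely interconnected layers."""
    for W in weights:
        x = relu(W @ x)  # this layer processes the data, then passes it on
    return x

rng = np.random.default_rng(0)
# Toy 3-layer network: 8 inputs -> 16 hidden nodes -> 4 outputs
weights = [rng.standard_normal((16, 8)), rng.standard_normal((4, 16))]
out = forward(rng.standard_normal(8), weights)
```

Training would then adjust the entries of each weight matrix until the final layer's output reliably classifies the input; that optimization step is omitted here.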
Vondrick, Aytar, and Torralba first trained a neural net on two large, annotated sets of images: one, the ImageNet data set, contains labeled examples of images of 1,000 different objects; the other, the Places data set created by Torralba’s group, contains labeled images of 401 different scene types, such as a playground, bedroom, or conference room.
Once the network was trained, the researchers fed it 26 terabytes of video downloaded from the photo-sharing site Flickr. “It’s about 2 million unique videos,” Vondrick says. “If you were to watch all of them back to back, it would take you about two years.” Then they trained a second neural network on the audio from the same videos. The second network’s goal was to correctly predict the object and scene tags produced by the first network.
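This teacher–student setup can be illustrated with a deliberately tiny stand-in: random vectors play the role of audio features, random soft distributions play the role of the vision network's tags, and a single linear-softmax layer (far simpler than the researchers' actual deep network) is trained by gradient descent to match the teacher's output distribution:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n, d, k = 200, 16, 5                 # clips, audio-feature dim, visual tags
audio = rng.standard_normal((n, d))              # stand-in audio features
teacher = softmax(rng.standard_normal((n, k)))   # soft tags from "vision" net

# Student: one linear-softmax layer trained to match the teacher's tags
W = np.zeros((d, k))
for _ in range(300):
    student = softmax(audio @ W)
    grad = audio.T @ (student - teacher) / n     # cross-entropy gradient
    W -= 0.5 * grad

# Cross-entropy between teacher and trained student
loss = -np.mean(np.sum(teacher * np.log(softmax(audio @ W) + 1e-12), axis=1))
```

Before training, the student outputs a uniform distribution (cross-entropy log 5 ≈ 1.61); after training the loss drops, showing the student absorbing the teacher's tags without any hand-made labels.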
The result was a network that could interpret natural sounds in terms of image categories. For instance, it might determine that the sound of birdsong tends to be associated with forest scenes and pictures of trees, birds, birdhouses, and bird feeders.
To compare the sound-recognition network’s performance to that of its predecessors, however, the researchers needed a way to translate its language of images into the familiar language of sound names. So they trained a simple machine-learning system to associate the outputs of the sound-recognition network with a set of standard sound labels.
For that, the researchers did use a database of annotated audio — one with 50 categories of sound and about 2,000 examples. Those annotations had been supplied by humans. But it’s much easier to label 2,000 examples than to label 2 million. And the MIT researchers’ network, trained first on unlabeled video, significantly outperformed all previous networks trained solely on the 2,000 labeled examples.
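The small labeled set only has to train a very simple final-stage classifier on top of the frozen network's outputs. The article doesn't specify which simple system was used; as one plausible stand-in, here is a nearest-centroid classifier over synthetic "network output" vectors (toy sizes and clustered fake data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
k, dim = 5, 12         # toy version: 5 sound labels, 12-dim network outputs
# Fake frozen-network outputs: each label clusters around its own centre
centres = rng.standard_normal((k, dim)) * 3
labels = rng.integers(0, k, 400)
feats = centres[labels] + rng.standard_normal((400, dim)) * 0.5

# "Simple machine-learning system": one mean vector per labeled sound class
centroids = np.stack([feats[labels == c].mean(axis=0) for c in range(k)])

def predict(x):
    # Assign the standard sound label whose centroid is closest
    return np.argmin(((centroids - x) ** 2).sum(axis=1))

acc = np.mean([predict(f) == l for f, l in zip(feats, labels)])
```

Because the heavy lifting (learning useful audio features) was already done on unlabeled video, a classifier this simple, fit on a couple of thousand labeled examples, is enough to translate network outputs into sound names.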
“With the modern machine-learning approaches, like deep learning, you have many, many trainable parameters in many layers in your neural-network system,” says Mark Plumbley, a professor of signal processing at the University of Surrey. “That normally means that you have to have many, many examples to train that on. And we have seen that sometimes there’s not enough data to be able to use a deep-learning system without some other help. Here the advantage is that they are using large amounts of other video information to train the network and then doing an additional step where they specialize the network for this particular task. That approach is very promising because it leverages this existing information from another field.”
Plumbley says that both he and colleagues at other institutions have been involved in efforts to commercialize sound recognition software for applications such as home security, where it might, for instance, respond to the sound of breaking glass. Other uses might include eldercare, to identify potentially alarming deviations from ordinary sound patterns, or to control sound pollution in urban areas. “I really think that there’s a lot of potential in the sound-recognition area,” he says.
Neural networks using light could lead to superfast computing.
Neural networks are taking the world of computing by storm. Researchers have used them to create machines that are learning a huge range of skills that had previously been the unique preserve of humans—object recognition, face recognition, natural language processing, machine translation. All these skills, and more, are now becoming routine for machines.
So there is great interest in creating more capable neural networks that can push the boundaries of artificial intelligence even further. The focus of this work is on creating circuits that operate more like neurons, so-called neuromorphic chips. But how can these circuits be made significantly faster?
Today, we get an answer of sorts thanks to the work of Alexander Tait and pals at Princeton University in New Jersey. These guys have built the world’s first photonic neuromorphic chip and show that it computes at ultrafast speeds.
Optical computing has long been the great white hope of computer science. Photons have significantly more bandwidth than electrons and so can process more data more quickly. But the advantages of optical data processing systems have never outweighed the additional cost of making them, and so they have never been widely adopted.
That has started to change in some areas of computing, such as analog signal processing, which requires the kind of ultrafast data processing that only photonic chips can provide.
Now neural networks are opening up a new opportunity for photonics. “Photonic neural networks leveraging silicon photonic platforms could access new regimes of ultrafast information processing for radio, control, and scientific computing,” say Tait and co.
At the heart of the challenge is to produce an optical device in which each node has the same response characteristics as a neuron. The nodes take the form of tiny circular waveguides carved into a silicon substrate in which light can circulate. When released, this light modulates the output of a laser working at threshold, a regime in which small changes in the incoming light have a dramatic impact on the laser’s output.
Crucially, each node in the system works with a specific wavelength of light—a technique known as wavelength-division multiplexing. The light from all the nodes can be summed by total power detection before being fed into the laser. And the laser output is fed back into the nodes to create a feedback circuit with a non-linear character.
An important question is just how closely this non-linearity mimics neural behavior. Tait and co measure the output and show that it is mathematically equivalent to a device known as a continuous-time recurrent neural network. “This result suggests that programming tools for CTRNNs could be applied to larger silicon photonic neural networks,” they say.
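The continuous-time recurrent neural network the device was shown to emulate has a standard mathematical form: each node's state x relaxes toward a weighted, nonlinearly transformed sum of the other nodes plus an external input. A minimal Euler-integration sketch (with arbitrary two-node weights and inputs, not the chip's actual parameters) looks like this:

```python
import numpy as np

def simulate_ctrnn(W, inp, tau=1.0, dt=0.01, steps=2000):
    """Euler-integrate the CTRNN dynamics
    dx/dt = (-x + W @ tanh(x) + inp) / tau
    and return the settled state."""
    x = np.zeros(len(inp))
    for _ in range(steps):
        x += dt / tau * (-x + W @ np.tanh(x) + inp)
    return x

# Two mutually inhibiting nodes driven by constant inputs
W = np.array([[0.0, -1.0],
              [-1.0, 0.0]])
inp = np.array([0.5, 0.2])
state = simulate_ctrnn(W, inp)
```

At steady state the dynamics satisfy x = W·tanh(x) + inp, which is the fixed-point condition a recurrent analog computer like the photonic chip settles into; the point of the equivalence is that tools for programming such W matrices carry over to the photonic hardware.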
That’s an important result because it means the device that Tait and co have made can immediately exploit the vast range of programming nous that has been gathered for these kinds of neural networks.
They go on to demonstrate how this can be done using a network consisting of 49 photonic nodes. They use this photonic neural network to solve the mathematical problem of emulating a certain kind of differential equation and compare it to an ordinary central processing unit.
The results show just how fast photonic neural nets can be. “The effective hardware acceleration factor of the photonic neural network is estimated to be 1,960 × in this task,” say Tait and co. That’s a speed up of three orders of magnitude.
That opens the doors to an entirely new industry that could bring optical computing into the mainstream for the first time. “Silicon photonic neural networks could represent first forays into a broader class of silicon photonic systems for scalable information processing,” say Tait and co.
Of course much depends on how well the first generation of electronic neuromorphic chips perform. Photonic neural nets will have to offer significant advantages to be widely adopted and will therefore require much more detailed characterization. Clearly, there are interesting times ahead for photonics.
Learn more: World’s First Photonic Neural Network Unveiled
Machine-learning techniques that mimic human recognition and dreaming processes are being deployed in the search for habitable worlds beyond our solar system. A deep belief neural network, called RobERt (Robotic Exoplanet Recognition), has been developed by astronomers at UCL to sift through detections of light emanating from distant planetary systems and retrieve spectral information about the gases present in the exoplanet atmospheres.
RobERt will be presented at the National Astronomy Meeting (NAM) 2016 in Nottingham by Dr Ingo Waldmann on Tuesday 28th June.
“Different types of molecules absorb and emit light at specific wavelengths, embedding a unique pattern of lines within the electromagnetic spectrum,” explained Dr Waldmann, who leads RobERt’s development team. “We can take light that has been filtered through an exoplanet’s atmosphere or reflected from its cloud-tops, split it like a rainbow and then pick out the ‘fingerprint’ of features associated with the different molecules or gases. Human brains are really good at finding these patterns in spectra and labelling them from experience, but it’s a really time-consuming job and there will be huge amounts of data.
“We built RobERt to independently learn from examples and to build on his own experiences. This way, like a seasoned astronomer or a detective, RobERt has a pretty good feeling for what molecules are inside a spectrum and which are the most promising data for more detailed analysis. But what usually takes days or weeks takes RobERt mere seconds.”
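RobERt itself is a deep belief network, but the underlying task it learns can be illustrated with a much simpler stand-in: build a template of absorption lines for each molecule and score an observed spectrum by correlation against each template. The line positions and widths below are invented purely for illustration and are not real molecular line lists:

```python
import numpy as np

wavelengths = np.linspace(1.0, 2.0, 500)   # illustrative wavelength grid

def line_template(centres, width=0.01):
    """Sum of Gaussian features at a molecule's (hypothetical) line positions."""
    return sum(np.exp(-((wavelengths - c) / width) ** 2) for c in centres)

# Hypothetical fingerprints, for illustration only
templates = {"H2O": line_template([1.13, 1.38, 1.87]),
             "CH4": line_template([1.17, 1.66])}

rng = np.random.default_rng(3)
# Simulated noisy observation containing the "H2O" fingerprint
spectrum = templates["H2O"] + 0.05 * rng.standard_normal(wavelengths.size)

# Score each molecule by how strongly its fingerprint correlates
scores = {m: float(np.dot(t, spectrum)) for m, t in templates.items()}
best = max(scores, key=scores.get)
```

A trained network generalizes far beyond this kind of fixed template matching (overlapping gases, clouds, instrument noise), which is precisely why the UCL team taught RobERt the patterns from examples rather than hand-coding them.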
Berlin researchers develop a robot that can learn to navigate through its environment guided by external stimuli
Researchers of Freie Universität Berlin, of the Bernstein Fokus Neuronal Basis of Learning, and of the Bernstein Center Berlin have developed a robot that perceives environmental stimuli and learns to react to them. The scientists used the relatively simple nervous system of the honeybee as a model for its working principles. To this end, they installed a camera on a small robotic vehicle and connected it to a computer. The computer program replicated in a simplified way the sensorimotor network of the insect brain. The input data came from the camera that—akin to an eye—received and projected visual information. The neural network, in turn, operated the motors of the robot wheels—and could thus control its motion direction.
The outstanding feature of this artificial mini brain is its ability to learn by simple principles. “The network-controlled robot is able to link certain external stimuli with behavioral rules,” says Professor Martin Paul Nawrot, head of the research team and professor of neuroscience at Freie Universität Berlin. “Much like honeybees learn to associate certain flower colors with tasty nectar, the robot learns to approach certain colored objects and to avoid others.”
In the learning experiment, the scientists placed the network-controlled robot in the center of a small arena. Red and blue objects were installed on the walls. Once the robot’s camera focused on an object of the desired color—red, for instance—the scientists triggered a light flash. This signal activated a so-called reward sensor nerve cell in the artificial network. The simultaneous processing of red color and the reward then led to specific changes in those parts of the network that exercised control over the robot wheels. As a consequence, when the robot “saw” another red object, it started to move toward it. Blue items, in contrast, made it move backwards. “Just within seconds, the robot accomplishes the task of finding an object in the desired color and approaching it,” explains Nawrot. “Only a single learning trial is needed, similar to experimental observations in honeybees.”
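The one-trial, reward-modulated learning rule described here can be sketched in a few lines. This is a drastic simplification of the simulated bee network (two stimulus detectors, one scalar motor command, an invented Hebbian-style update; the negative pairing for blue is likewise a simplification), but it shows how a single coincidence of stimulus and reward can flip behavior:

```python
import numpy as np

# Sensory nodes: index 0 = "red" detector, 1 = "blue" detector
# Motor command: positive = drive toward the object, negative = back away
w = np.zeros(2)               # sensor-to-motor synaptic weights

def motor_command(stimulus):
    return float(w @ stimulus)

def learn(stimulus, reward):
    """One-trial, reward-modulated (Hebbian-style) weight change:
    the coincidence of an active stimulus and the reward signal
    strengthens or weakens the synapse driving approach."""
    w[:] += reward * stimulus

red = np.array([1.0, 0.0])
blue = np.array([0.0, 1.0])

learn(red, reward=+1.0)   # light flash = reward while "seeing" red
learn(blue, reward=-1.0)  # blue paired with a negative signal

forward_red = motor_command(red)   # > 0: approach red objects
back_blue = motor_command(blue)    # < 0: retreat from blue objects
```

A single call to `learn` per color is enough to set the approach/avoid behavior, mirroring the single learning trial the article reports.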
The current study was carried out at Freie Universität Berlin within an interdisciplinary collaboration between the research groups “Neuroinformatics” (Institute of Biology) led by Professor Martin Paul Nawrot and “Intelligent Systems and Robotics” (Institute of Computer Science) led by Professor Raúl Rojas. The scientists are now planning to expand their neural network by supplementing more learning principles. Thus, the mini brain will become even more powerful—and the robot more autonomous.