Anyone who’s been to a concert knows that something magical happens between the performers and their instruments. It transforms music from being just “notes on a page” to a satisfying experience.
A University of Washington team wondered if artificial intelligence could recreate that delight using only visual cues: a silent, top-down video of someone playing the piano. The researchers used machine learning to build a system, called Audeo, that generates audio from silent videos of piano performances. When the group tested the music Audeo created with music-recognition apps, such as SoundHound, the apps correctly identified the piece Audeo played about 86% of the time. For comparison, these apps identified the piece in the audio tracks from the source videos 93% of the time.
The researchers presented Audeo Dec. 8 at the NeurIPS 2020 conference.
“To create music that sounds like it could be played in a musical performance was previously believed to be impossible,” said senior author Eli Shlizerman, an assistant professor in both the applied mathematics and the electrical and computer engineering departments. “An algorithm needs to figure out the cues, or ‘features,’ in the video frames that are related to generating music, and it needs to ‘imagine’ the sound that’s happening in between the video frames. It requires a system that is both precise and imaginative. The fact that we achieved music that sounded pretty good was a surprise.”
Audeo uses a series of steps to decode what’s happening in the video and then translate it into music. First, it detects which keys are pressed in each video frame and assembles those detections into a diagram of key presses over time. Then it translates that diagram into a form that a music synthesizer can turn into the sound a piano would make. This second step cleans up the data and adds in more information, such as how strongly each key is pressed and for how long.
“If we attempt to synthesize music from the first step alone, we would find the quality of the music to be unsatisfactory,” Shlizerman said. “The second step is like how a teacher goes over a student composer’s music and helps enhance it.”
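To make the idea of a key-press diagram concrete, here is a minimal Python sketch of how frame-by-frame detections could be grouped into notes with a start time and a duration. It is not the researchers' method: Audeo uses trained neural networks for both steps, and this toy version only merges consecutive frames in which the same key is down. The function name roll_to_notes, the 0/1 matrix layout and the 25 frames-per-second default are assumptions made for illustration, and the sketch does not estimate how strongly a key is pressed.

```python
from typing import List, Tuple

FPS = 25  # assumed video frame rate for illustration, not a figure from the study


def roll_to_notes(roll: List[List[int]], fps: int = FPS) -> List[Tuple[int, float, float]]:
    """Turn a frame-by-key 0/1 matrix into (key, onset_sec, duration_sec) events.

    roll[t][k] is 1 if key k is detected as pressed in frame t, else 0.
    """
    notes: List[Tuple[int, float, float]] = []
    if not roll:
        return notes
    n_keys = len(roll[0])
    for key in range(n_keys):
        start = None  # frame index where the current press began, if any
        for t, frame in enumerate(roll):
            pressed = bool(frame[key])
            if pressed and start is None:
                start = t  # key goes down
            elif not pressed and start is not None:
                notes.append((key, start / fps, (t - start) / fps))  # key comes up
                start = None
        if start is not None:
            # key is still held when the clip ends
            notes.append((key, start / fps, (len(roll) - start) / fps))
    notes.sort(key=lambda n: (n[1], n[0]))  # order by onset time, then key
    return notes
```

A single misdetected frame in this sketch would split one long note into two or create a very short spurious one, which hints at why a cleanup step matters before anything is sent to a synthesizer.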
The researchers trained and tested the system using YouTube videos of the pianist Paul Barton. The training consisted of about 172,000 video frames of Barton playing music from well-known classical composers, such as Bach and Mozart. Then they tested Audeo with almost 19,000 frames of Barton playing different music from these composers and others, such as Scott Joplin.
Once Audeo has generated a transcript of the music, it’s time to give it to a synthesizer that can translate it into sound. Every synthesizer will make the music sound a little different — this is similar to changing the “instrument” setting on an electric keyboard. For this study, the researchers used two different synthesizers.
“FluidSynth makes synthesizer piano sounds that we are familiar with. These are somewhat mechanical-sounding but pretty accurate,” Shlizerman said. “We also used PerfNet, a new AI synthesizer that generates richer and more expressive music. But it also generates more noise.”
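The point that the same transcript can sound different depending on the synthesizer can be illustrated with a small Python sketch. It is neither of the synthesizers used in the study: it is a toy renderer that sums sine waves, and the names key_to_hz and render, the 8,000-samples-per-second rate and the harmonics parameter are all assumptions made for this example. Keys are numbered 0 to 87 from the lowest note, so key 48 is the A above middle C at 440 Hz.

```python
import math

SAMPLE_RATE = 8000  # assumed low sample rate to keep the example small


def key_to_hz(key: int) -> float:
    """Equal-temperament frequency of piano key 0..87 (key 48 = A4 = 440 Hz)."""
    return 440.0 * 2 ** ((key - 48) / 12)


def render(notes, sample_rate=SAMPLE_RATE, harmonics=(1.0,)):
    """Render (key, onset_sec, duration_sec) notes to a list of float samples.

    harmonics holds the amplitude of the fundamental and each overtone;
    changing it changes the timbre, much like switching synthesizers.
    """
    total = max((onset + dur for _, onset, dur in notes), default=0.0)
    out = [0.0] * int(round(total * sample_rate))
    for key, onset, dur in notes:
        freq = key_to_hz(key)
        first = int(round(onset * sample_rate))
        count = int(round(dur * sample_rate))
        for i in range(count):
            if first + i >= len(out):
                break  # guard against rounding past the end of the buffer
            t = i / sample_rate
            out[first + i] += sum(
                amp * math.sin(2 * math.pi * freq * (h + 1) * t)
                for h, amp in enumerate(harmonics)
            )
    return out


# Same note, two "instruments": a pure tone and one with an added overtone.
notes = [(48, 0.0, 0.5)]
plain = render(notes)
richer = render(notes, harmonics=(1.0, 0.5))
```

Both renderings contain the same note at the same time, but the sample values differ, which is the sense in which each synthesizer gives the transcript its own sound.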
Audeo was trained and tested only on Paul Barton’s piano videos. Future research is needed to see how well it could transcribe music for any musician or piano, Shlizerman said.
“The goal of this study was to see if artificial intelligence could generate music that was played by a pianist in a video recording — though we were not aiming to replicate Paul Barton because he is such a virtuoso,” Shlizerman said. “We hope that our study enables novel ways to interact with music. For example, one future application is that Audeo can be extended to a virtual piano with a camera recording just a person’s hands. Also, by placing a camera on top of a real piano, Audeo could potentially assist in new ways of teaching students how to play.”