
PyData the Musical Revue! Act I: (bird) Song and Dance

  • Writer: Rebecca Lovering
  • Feb 20, 2018
  • 4 min read

I lied, it's not going to be musical.

These posts are going to be about a couple of the talks I attended at PyData that were interesting but not as actionable or directly relevant as others. It's worth knowing what people are out there doing, and who knows when a fun project will spark a serious idea?

Neural Networks for the Segmentation of Vocalizations - David Nicholson and Yarden Cohen

Segmenting vocalizations means training a machine to recognize where different human speech sounds begin and end. One application of this branch of study is to make machines better at understanding where one word ends and another begins. If you use voice-to-text, you're already benefiting from the work that's been done in this area, but we have far more to do. A better grasp of the difference between the segments we expect and the segments a speaker actually produces could help us, for example, diagnose and respond appropriately to people with speech impediments who are trying to use automated voice systems. (The really important work we're talking about, of course, is better timing on karaoke videos as they highlight the words, which this research could help automate.)

While native speakers of a language have no problem separating words, it's actually really difficult to figure out directly and uniquely from acoustic patterns, which is what a machine has to use. Although our minds naturally insert "spaces" in an utterance, speakers are making vocal sounds throughout the whole thing. (In fact, if you try to be properly silent between each word in a sentence, it sounds bizarre and unnatural. Go ahead, try.) Accordingly, we need to teach machines what the patterns are that mean, "hey, new meaningful sound!"

For some context/perspective on the problem, take a look at this:

These are spectrograms of someone speaking. The red lines are tracking what are called "formants," concentrations of resonance at certain frequencies. It's actually possible to read these, to a certain extent, and make decent guesses about what sound is being made, at least for vowels, but it is incredibly difficult. (One of my finals in phonology required us to transcribe a sentence from these. There was weeping and the gnashing of teeth in the darkness.) Would it be interesting to know that each of those rows is a person saying one syllable with two sounds in it? Any idea where one sound ends and the second begins, or what they are? They're the syllables "dee," "dah," and "doo." (Easiest to diagnose is "dee" because of its significantly higher top formants.)

These cryptic diagrams are essentially a computer's only understanding of human speech. Describing relevant sounds computationally/mathematically is complicated, and right now we're only looking at samples recorded very intentionally in a lab, without a lot of other noise to cloud them. In a manner of speaking, even this spectrogram is "noise," because we want the phoneme, not the acoustic manifestation of it. This problem is essentially one of looking for signal in noise.
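If you've never made one of these yourself, here's roughly how audio becomes a spectrogram, as a minimal sketch using SciPy; the file name and window settings are placeholders for illustration, not the speakers' actual pipeline.

```python
# A minimal sketch (not the speakers' code) of turning audio into a spectrogram.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, audio = wavfile.read("syllable.wav")   # hypothetical recording
if audio.ndim > 1:                           # mix down to mono if stereo
    audio = audio.mean(axis=1)

# Short-time Fourier analysis: frequencies (Hz), time bins (s), power
freqs, times, power = spectrogram(audio, fs=rate, nperseg=512, noverlap=256)

# Networks usually see log power, which compresses the dynamic range
log_spec = np.log(power + 1e-10)
print(log_spec.shape)   # (n_frequency_bins, n_time_bins)
```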

So, having laid out the area of inquiry, this talk was largely focused not on human segments, but segments of birdsong. Birds have fascinatingly human-like utterance patterns, in that they can learn new, communicatively meaningful utterances and recombine them to create new ones throughout their lives, not just when they are plastic-brained infants. Because different species have such distinctive calls, looking at learned birdsong across species can give us some insight into the extent to which birdsong is learnable and the influence of nature versus nurture. In particular, Nicholson and Cohen chose Bengalese finches as their species of interest (songbirdscience.com).

Thankfully, birdsong segments, although very like human phonemes in some important ways, are much more acoustically distinct. They're also less complex than human speech, so they're more accessible and more easily labeled (more easily, mind you, not actually, objectively easy). Those labeled samples can then be fed into a neural network (which requires staggering quantities of labeled training data to get good at anything). To make that concrete, a sketch of what "labeled" looks like follows below.
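Here's a hedged illustration of turning segment annotations into a label for every spectrogram time bin, the kind of target a frame-labeling network learns from. The onset/offset format, the numbers, and the label values are all made up for this example, not the speakers' actual data.

```python
# Illustrative only: convert (onset, offset, label) annotations into
# one label per spectrogram time bin, with 0 meaning "silence."
import numpy as np

onsets  = np.array([0.10, 0.45, 0.90])   # seconds, hypothetical syllables
offsets = np.array([0.30, 0.70, 1.20])
labels  = np.array([1, 2, 1])            # hypothetical syllable classes

n_time_bins = 300        # columns of the spectrogram
bin_duration = 0.005     # seconds per spectrogram time bin

frame_labels = np.zeros(n_time_bins, dtype=int)
for on, off, lab in zip(onsets, offsets, labels):
    start = int(on / bin_duration)
    stop = int(off / bin_duration)
    frame_labels[start:stop] = lab

print(frame_labels[:80])
```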

The two approaches examined here were a fully convolutional neural network (CNN) and a convolutional network combined with a bidirectional LSTM (CNN-biLSTM). Convolutional neural networks assume their input will be an image, a useful assumption here because we're going to feed them spectrograms and ask for labeled spectrograms as the output. Because images are resource-intensive for a conventional neural network (the number of connections, and therefore weights, between neurons gets out of hand very quickly as the number of pixels goes up), each neuron in a CNN is connected to only a small patch of the neurons in the layer preceding it, rather than to all of them.

A typical CNN
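For the curious, here's a minimal PyTorch sketch of the fully convolutional idea, my own illustration rather than the speakers' model: a spectrogram goes in, and a class score for every time bin comes out. The layer sizes and the number of syllable classes are invented for the example.

```python
# A hedged sketch of a fully convolutional frame-labeling network.
import torch
import torch.nn as nn

class FrameLabelCNN(nn.Module):
    def __init__(self, n_freq_bins=257, n_classes=10):
        super().__init__()
        # Small convolutional layers that preserve the spectrogram's shape
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Collapse channels x frequency into features, classify each time bin
        self.classify = nn.Conv1d(32 * n_freq_bins, n_classes, kernel_size=1)

    def forward(self, spec):                # spec: (batch, 1, freq, time)
        feats = self.conv(spec)             # (batch, 32, freq, time)
        b, c, f, t = feats.shape
        feats = feats.reshape(b, c * f, t)  # (batch, 32 * freq, time)
        return self.classify(feats)         # (batch, n_classes, time)

x = torch.randn(2, 1, 257, 300)             # two fake log spectrograms
print(FrameLabelCNN()(x).shape)             # torch.Size([2, 10, 300])
```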

Because segmentation is a time-series problem, let's get an LSTM involved! An LSTM (long short-term memory) is a particular kind of recurrent neural network (RNN) that can use information across time in a way that conventional RNNs can't. A lot of language problems have long-term dependencies (think about subject and verb agreement when you have a lot of modifiers between them: "he slowly, torturously, with the unholy glee only found in antiques collectors and Pokémon masters, drew out his magnifying glass" - there are 17 tokens between the subject and the verb). LSTMs have proven great tools for retaining early information and having it influence solutions down the line. Instead of the repeating module of a standard RNN, in which every step passes its state through a single, identical transformation, an LSTM's repeating module has four interacting layers (its gates) that update the current state of the system more sensitively. It's a sugar molecule to the traditional RNN's identical graphene sheets.

A single cell of an LSTM
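And here's the same idea with a bidirectional LSTM stacked on the convolutional features, again my own sketch rather than the published CNN-biLSTM: each time bin's label can now depend on context both earlier and later in the song.

```python
# A hedged sketch of the CNN + bidirectional LSTM idea; sizes are invented.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, n_freq_bins=257, n_classes=10, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=16 * n_freq_bins, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden, n_classes)  # 2x: both directions

    def forward(self, spec):                 # spec: (batch, 1, freq, time)
        feats = self.conv(spec)              # (batch, 16, freq, time)
        b, c, f, t = feats.shape
        feats = feats.reshape(b, c * f, t).transpose(1, 2)  # (batch, time, features)
        out, _ = self.lstm(feats)            # (batch, time, 2 * hidden)
        return self.classify(out)            # (batch, time, n_classes)

x = torch.randn(2, 1, 257, 300)
print(CNNBiLSTM()(x).shape)                  # torch.Size([2, 300, 10])
```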

These two approaches go about their job differently, and the researchers thought that combining them might offer the best of all possible worlds. Sadly, neither model really distinguished itself in the test they ran, which compared how accurately each could label a random sampling of songs. The biLSTM had a slight advantage, but not a significant one. Moving forward, Nicholson and Cohen aspire to try out feedforward vs recurrent neural networks, to see if that's another place to make progress. The dream is to transfer this learning to human speech, improving automated speech recognition (ASR), detection and diagnosis of speech disorders, and the automatic segmentation of songs (human ones).
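To give a flavor of what "most accurately label" might mean, here's a toy frame-error computation, the fraction of time bins whose predicted label is wrong; the speakers' actual evaluation metrics may well differ from this simplest version.

```python
# Toy illustration of frame error: fraction of mislabeled time bins.
import numpy as np

true_labels = np.array([0, 0, 1, 1, 1, 0, 2, 2, 0, 0])
predicted   = np.array([0, 1, 1, 1, 0, 0, 2, 2, 2, 0])

frame_error = np.mean(true_labels != predicted)
print(f"frame error rate: {frame_error:.0%}")   # 30%
```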

Seeing what these tools can do was a good experience, and helped me understand them from a different angle. If you want to take a look at the project on GitHub, feel free:

