Archive for the ‘research’ Category

Dissertation done

Friday, January 1st, 2010

I’ve put the finishing touches on my dissertation. My defense was at the beginning of September, and my committee didn’t ask for (m)any changes, but there were a few things I wanted to fix up. The official “publication” date will be February 2010.

Binaural Model-Based Source Separation and Localization

When listening in noisy and reverberant environments, human listeners are able to focus on a particular sound of interest while ignoring interfering sounds. Computer listeners, however, can only perform highly constrained versions of this task. While automatic speech recognition systems and hearing aids work well in quiet conditions, source separation is necessary for them to be able to function in these challenging situations.

This dissertation introduces a system that separates more than two sound sources from reverberant, binaural mixtures based on the sources’ locations. Each source is modelled probabilistically using information about its interaural time and level differences at every frequency, with parameters learned using an expectation maximization (EM) algorithm. The system is therefore called Model-based EM Source Separation and Localization (MESSL). This EM algorithm alternates between refining its estimates of the model parameters (location) for each source and refining its estimates of the regions of the spectrogram dominated by each source. In addition to successfully separating sources, the algorithm estimates model parameters from a mixture that have direct psychoacoustic relevance and can usually only be measured for isolated sources. One of the key features enabling this separation is a novel probabilistic localization model that can be evaluated at individual time-frequency points and over arbitrarily-shaped regions of the spectrogram.
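The alternation described above can be illustrated with a toy example. The sketch below is not MESSL: it fits a two-component, one-dimensional Gaussian mixture to made-up scalar observations that stand in for a single interaural cue measured at ten time-frequency points. The function names, the initial parameters, and the data are all assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_fit(points, means, variances, weights, iterations=50):
    """Toy EM for a 1-D Gaussian mixture (an illustration, not MESSL)."""
    means, variances, weights = list(means), list(variances), list(weights)
    n_sources = len(means)
    posteriors = []
    for _ in range(iterations):
        # E-step: estimate which source dominates each point.
        posteriors = []
        for x in points:
            likes = [weights[k] * gaussian_pdf(x, means[k], variances[k])
                     for k in range(n_sources)]
            total = sum(likes)
            posteriors.append([like / total for like in likes])
        # M-step: re-estimate each source's parameters from its soft assignments.
        for k in range(n_sources):
            resp = sum(p[k] for p in posteriors)
            means[k] = sum(p[k] * x for p, x in zip(posteriors, points)) / resp
            var = sum(p[k] * (x - means[k]) ** 2 for p, x in zip(posteriors, points)) / resp
            variances[k] = max(var, 1e-6)  # floor keeps the density well defined
            weights[k] = resp / len(points)
    return means, posteriors

# Made-up observations: five points near -6 and five near +6.
observations = [-6.2, -5.8, -6.0, -5.5, -6.5, 5.9, 6.1, 6.4, 5.6, 6.0]
means, posteriors = em_fit(observations, means=[-1.0, 1.0],
                           variances=[4.0, 4.0], weights=[0.5, 0.5])

# Threshold the soft assignments into one binary mask per source.
masks = [[1 if p[k] > 0.5 else 0 for p in posteriors] for k in range(2)]
```

In this toy setting the posteriors play the role of the per-source regions, and the means play the role of the location parameters.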

The localization performance of the systems introduced here is comparable to that of humans in both anechoic and reverberant conditions, with a 40% lower mean absolute error than four comparable algorithms. When target and masker sources are mixed at similar levels, MESSL’s separations have signal-to-distortion ratios 2.0 dB higher than four comparable separation algorithms and estimated speech quality 0.19 mean opinion score units higher. When target and masker sources are mixed anechoically at very different levels, MESSL’s performance is comparable to humans’, but in similar reverberant mixtures it only achieves 20–25% of human performance. While MESSL successfully rejects enough of the direct-path portion of the masking source in reverberant mixtures to improve energy-based signal-to-noise ratio results, it has difficulty rejecting enough reverberation to improve automatic speech recognition results significantly. This problem is shared by other comparable separation systems.

Download it as a single pdf (7.6 MB)

Download separate chapters as pdfs (some of the internal links don’t work):

Boston Music Hack Day

Saturday, November 28th, 2009

I spent last weekend at the Boston Music Hack Day. It was a lot of fun. The idea was that people who are into music and web tech would get together for the weekend and build something cool.

My project was called Bowie S-S-S-Similarities. I got all of David Bowie’s music (thanks, Brian!) and lined it up from beginning to end. Then I ran my autotaggers from Musically Intelligent Machines over every 10-second clip in it. Then I measured the similarity of the tags applied to every 10-second clip with every other 10-second clip, generating a big (5000×5000) similarity matrix. I originally wanted to use some sort of google maps interface to browse through this big matrix, but I couldn’t find one, so I wrote a python wrapper to let me browse through it and listen to the clips. Bowie seemed like a good artist to pick for this because he’s put out a lot of albums and they’ve been quite varied.
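As a rough sketch of the similarity step, here is one way to turn per-clip tag scores into a clip-by-clip similarity matrix. The post does not say which similarity measure was used, so cosine similarity is an assumption here, and the three tiny tag vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two tag-score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def similarity_matrix(tag_vectors):
    """Compare every clip with every other clip (N clips give an N x N matrix)."""
    return [[cosine_similarity(a, b) for b in tag_vectors] for a in tag_vectors]

# Hypothetical autotag scores for three clips over three tags.
clips = [
    [0.9, 0.1, 0.8],
    [0.8, 0.2, 0.7],
    [0.1, 0.9, 0.0],
]
sim = similarity_matrix(clips)
```

With 5000 clips this nested loop does 25 million comparisons, so a vectorized version would be the practical choice; the plain-Python form is only meant to show the idea.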

Some fun observations from playing around with it:

  • The album Low was different from everything else, being one of Bowie’s ambient collaborations with Brian Eno.
  • The track “Kingdom Come” from the album Scary Monsters is the most self-similar track and is also the most similar to Bowie’s other tracks.
  • Using a similarity based only on genre tags makes a pretty different similarity matrix than using a similarity based on tags related to vocals.

The audience liked the hack and the presentation enough to get me a copy of Rock Band. Check out a picture of me and Frank, who also won a copy of Rock Band for his hack PartyLister, on Frank’s post.

After getting home, I read Kurt’s blog post about his hack, which used Microsoft’s Seadragon to zoom and pan around a giant visualization of artist similarity. I thought that was pretty awesome, and as it was what I had wanted to do for my hack, I made a Seadragon viewer for the Bowie similarity matrix. I got it working most of the way with the seadragon.py script to chop up the big image into a pyramid of tiles and the Seadragon AJAX code to display it. But I couldn’t get that to work on the blog here, so I used the online Seadragon builder.
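For a sense of what a pyramid of tiles involves, here is a small sketch that only computes the geometry: how many levels there are and how many tiles each needs. It assumes a Deep Zoom-style layout in which each level halves the one above it until the image is a single pixel, and a tile size of 256 pixels. Both are assumptions for illustration, not details read out of seadragon.py.

```python
import math

def pyramid_levels(width, height, tile_size=256):
    """List (level, width, height, tiles_across, tiles_down) from smallest to largest."""
    max_level = math.ceil(math.log2(max(width, height)))
    levels = []
    for level in range(max_level + 1):
        scale = 2 ** (max_level - level)  # the full-size image has scale 1
        w = math.ceil(width / scale)
        h = math.ceil(height / scale)
        levels.append((level, w, h, math.ceil(w / tile_size), math.ceil(h / tile_size)))
    return levels

# A 9000 x 9000 image, one pixel per pair of clips.
levels = pyramid_levels(9000, 9000)
total_tiles = sum(across * down for _, _, _, across, down in levels)
```

Under these assumptions most of the tiles sit in the top level, and each lower level needs about a quarter as many.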

Note that the seadragon version is using more music than my original hack (9000 clips), which should be all of Bowie’s studio albums. I’d like to add the ability to see which clips you’re mousing over and to play them, but I haven’t had time. I would also like to get the image to be in color, but I haven’t been able to get the Python Imaging Library to behave with numpy’s colormaps. And finally, it would be nice to make the similarity steerable so that you can seamlessly switch between different types of similarity or even different weights on different tags.

WASPAA 2009

Monday, September 28th, 2009

My paper was accepted to WASPAA this year. It is entitled The ideal interaural parameter mask: a bound on binaural separation systems. It proposes an upper bound on the performance of source separation algorithms like MESSL. It also includes some improvements to MESSL, namely a prior on ILD and an explicit model of reverberation, which both improve separations. In fact, they bring it quite close to the limit set by the proposed upper bound. Here’s the abstract:

We introduce the Ideal Interaural Parameter Mask as an upper bound on the performance of mask-based source separation algorithms that are based on the differences between signals from two microphones or ears. With two additions to our Model-based EM Source Separation and Localization system, its performance approaches that of the IIPM upper bound to within 0.9 dB. These additions battle the effects of reverberation by absorbing reverberant energy and by forcing the ILD estimate to be larger than it might otherwise be. An oracle reliability measure was also added, in the hope that estimating parameters from more reliable regions of the spectrogram would improve separation, but it was not consistently useful.

Reading papers

Thursday, July 9th, 2009

I’m working on the literature review chapter of the dissertation and it’s gotten me thinking. It’s a real pain to put together a good survey. It’s hard to know what papers are out there, what they say, and what’s notable about them. I’ve been using a few tools for this, but there’s a lot of room for improvement.

I’ve been using citeulike for a while, and it’s great for scraping IEEE and JASA abstracts. It can import bibtex files, but they’re harder to get linked in to pdfs and you might end up with a lot of duplicates if you’re not careful. On Neeraj’s recommendation, I’ve been trying out mendeley and it’s a bit cooler. It can read in a directory of pdfs and figure out to some extent what they are. This is most useful with popular papers, because they have some sort of fingerprinter that recognizes the same pdf from multiple users and matches up the metadata. That way, only one person has to correct each entry and others can benefit. I’m not sure if it’s been able to recognize when a pdf is the same as a bibtex entry, but it might be able to. They also seem to have a very responsive feedback system using UserVoice. And of course there’s always google scholar to actually find these papers.

But, I think these apps could be a lot more useful. Instead of just linking to the papers that cite a paper, a lot could be gained by keeping track of the “anchor text” that does the linking. This means not only noting that paper X cites paper Y, but that paper X describes paper Y in this way, uses this information from it, cites it in this context, etc.

The first thing that this would enable would be the annotation of a paper’s bibliography with the relevant parts of its text. These are all of the outgoing links from a paper. By analyzing the paper, all of the [22]s or the (Mojo, 1987)s could be associated with the right entry in the bibliography. It would give the references some much-needed context. It would also show which references in a bibliography were actually discussed and which were just mentioned in passing. This could be an application by itself. And while we’re linking the references to the bibliography, it could put in some hyperlinks, like the hyperref package in latex does, but after the fact.
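A minimal sketch of that first step, assuming the paper's text is already available as a plain string: find each [22]-style or (Mojo, 1987)-style marker and keep the sentence around it as the context for that reference. The regular expressions, the naive sentence splitter, and the sample text are all assumptions for illustration; real papers would need much more careful parsing.

```python
import re

NUMERIC = re.compile(r"\[(\d+)\]")                           # matches [22]
AUTHOR_YEAR = re.compile(r"\(([A-Z][A-Za-z-]+), (\d{4})\)")  # matches (Mojo, 1987)

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def citation_contexts(text):
    """Map each citation key to the sentences that mention it."""
    contexts = {}
    for sentence in split_sentences(text):
        keys = [m.group(1) for m in NUMERIC.finditer(sentence)]
        keys += ["{} {}".format(*m.groups()) for m in AUTHOR_YEAR.finditer(sentence)]
        for key in keys:
            contexts.setdefault(key, []).append(sentence)
    return contexts

# Made-up paper text, reusing the citation styles mentioned above.
paper_text = (
    "Source separation is hard [22]. "
    "We extend the model of (Mojo, 1987) with a prior. "
    "Results improve on [22] and [7]."
)
contexts = citation_contexts(paper_text)

# A reference cited in several sentences was probably discussed, not just mentioned.
mention_counts = {key: len(sentences) for key, sentences in contexts.items()}
```

Inverting the same mapping across many papers would give the incoming contexts for a single paper.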

The second thing that it would enable would be the annotation of a paper with all of the things that other papers have said about it. These would be all of the incoming links from other papers. It would give you some context on a new paper that you had just come across in addition to what the authors wrote in the abstract. If you wanted to be really fancy, each reader could have trusted sources of these incoming links. These opinions are like little mini reviews or summaries that have already been published, no need to solicit readers’ opinions. Instead of the first few sentences on a “papers that cite X” page on google scholar, you’d get a page of summaries, reviews, and extracts.

Both of these features would make it much easier to get introduced to a new field or to write a more balanced review of a familiar or semi-familiar field. I know it’s tough to match up bibliography entries with references and with papers themselves, and that there are some user interface issues to work out here, but it shouldn’t be that hard. Maybe crowdsourcing could help if necessary. Hopefully all of this would help allay that niggling fear of missing an important paper.

Off to Montreal

Friday, May 15th, 2009

Just when you thought I’d given up on this blog…big news! The grant I wrote to the Quebec government came through. I’m going to do a postdoc with Doug Eck at the Universite de Montreal. It should be very cool, I’ll be building up my machine learning chops working on autotagging and other music problems. I move up there in the fall. Now all I have to do is write my dissertation and learn French…

Ground truth

Monday, December 1st, 2008

The funny thing about my research, whether it’s music classification, source separation, or any other sort of machine learning task I can think of, is the difference between developing an algorithm and deploying it. It’s actually harder to develop an algorithm than it is to deploy it. To deploy an algorithm, if you’re shooting from the hip, you just need to build it and run it on the data you want to analyze. So if I want to develop a music classifier, I extract some features, train a classifier, and classify some music. To develop the algorithm, however, you need to do everything you would need to do to deploy it, but then you also need ground truth. That is to say that you need to know what answer you’re expecting before you get it, so you can tell how well you’re doing. So, paradoxically, I need to have my music already classified in order to see how well my classifier can classify it.

It was always clear to me that if you can get new ground truth data, you can do cool new things, but it took a while for it to sink in that you really can’t develop a system to solve a problem without having the problem already solved, in some sense. Of course, the power of machine learning comes from being able to extrapolate results from a small subset of labeled data to an infinite amount of as-yet-unlabeled data. I can develop music classifiers (and pick the one that does best) using a small set of already-classified music and then use it to classify as much music as I want. The question is, is that as-yet-unlabeled data really that similar to the test set? When you have enough data to know the answer to that question, you probably have enough data to do pretty well with a basic classifier.
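To make the role of ground truth concrete, here is a toy sketch: train on part of the labeled data, then score the classifier on labeled examples it has not seen. The nearest-centroid classifier, the two-dimensional features, and the labels are all made up for illustration; any classifier could stand in.

```python
def train_centroids(features, labels):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(centroids, x):
    """Pick the class whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

def accuracy(centroids, features, labels):
    """Fraction of examples whose prediction matches the ground truth."""
    hits = sum(classify(centroids, x) == y for x, y in zip(features, labels))
    return hits / len(labels)

# Made-up labeled data: this is the ground truth.
labeled = [
    ([0.9, 0.1], "rock"), ([0.8, 0.2], "rock"), ([0.7, 0.3], "rock"),
    ([0.1, 0.9], "jazz"), ([0.2, 0.8], "jazz"), ([0.3, 0.7], "jazz"),
]
train = labeled[0:2] + labeled[3:5]
test = [labeled[2], labeled[5]]  # held out, but still labeled

centroids = train_centroids([x for x, _ in train], [y for _, y in train])
test_accuracy = accuracy(centroids, [x for x, _ in test], [y for _, y in test])
```

The accuracy line is the part that cannot be written without labels, and it says nothing about unlabeled music that differs from the test set.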

As an aside, I’m always highly doubtful of claims that computers can latch onto things beyond human perception. For watermarking, sure, it’s designed so that machines can perceive it, but people can’t. But when it comes to very human-grounded ideas like similarity, I think it is impossible to try to circumvent human “subjectivity”. There really is no objective measure of whether two sounds are similar besides the consistency in subjective ratings of human listeners. I think much of the trick of developing (provably) useful algorithms is defining problems that have objective solutions and then solving them objectively.

ISMIR 2008

Friday, October 3rd, 2008

ISMIR was fun this year. I was pleasantly surprised by the quality of the papers; there were many solid experiments. I added a lot of them to my to-read list. I enjoyed hanging out with the ISMIR crowd, people I only get to see a few times a year or less. While I got a lot out of the conference, I regret not expanding my social circle more and not showing off my majorminer search demo more. I did, however, get to spend some quality time with my dad, my sister, and Joanne, and I probably spent more time in West Philly than ever before.

There was an interesting panel on Commercial Music Discovery and Recommendation, which I found a little bit discouraging. The message seemed to be that academic research and corporate development are different things and shouldn’t be confused. Elias said something to the effect that given the choice between 10x more data and an algorithm that was a few percent better, he’d take the data. Brian said that companies do what they do pretty well and academics should focus on doing things that companies can’t. Anthony didn’t foresee developments in MIR affecting him very much, predicting instead that user interface was the area that could improve the online music experience the most.

For more thorough coverage of the conference, take a look at some of the other ISMIR attendees’ blogs. Google blog search pointed me to a number of posts (in pagerank order): Paul Lamere (one of many), Elias Pampalk, Michael Good, Justin Donaldson, Jeremy ?, Kris West, Luke Barrington, Matthias Mauch, and Karin Dressler (in German).

MajorMiner good news and bad news

Wednesday, September 10th, 2008

The bad news is that I’m having some DNS issues with majorminer.com. The good news is that you can now access the same great MajorMiner game and search at majorminer.org.

The other good news is that there’s a new and improved search page. Hopefully it’s easier to understand what’s going on; I’ve even added a FAQ. If you look hard enough, you might also notice a new feature I’ve introduced: similarity browsing. For each clip that was autotagged, I computed a similarity value between its autotag vector and all of the others, finding the thirty or so nearest neighbors. You can follow this web of similarity around to some fun stuff, even though it’s not a huge collection of music. Here’s a random starting place.
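A small sketch of the nearest-neighbor step, assuming each clip has an autotag vector and that cosine similarity is the measure (the post does not name one). The clip ids and vectors are made up, and k is 2 here rather than thirty so the result is easy to read.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two autotag vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def nearest_neighbors(vectors, k):
    """For each clip, list the ids of its k most similar other clips, best first."""
    neighbors = {}
    for clip_id, vec in vectors.items():
        scored = ((cosine(vec, other), other_id)
                  for other_id, other in vectors.items() if other_id != clip_id)
        neighbors[clip_id] = [other_id for _, other_id in heapq.nlargest(k, scored)]
    return neighbors

# Hypothetical autotag vectors for four clips.
vectors = {
    "clip_a": [0.9, 0.1, 0.8],
    "clip_b": [0.8, 0.2, 0.7],
    "clip_c": [0.1, 0.9, 0.0],
    "clip_d": [0.2, 0.8, 0.1],
}
neighbors = nearest_neighbors(vectors, k=2)
```

Following `neighbors` from clip to clip is the browsing idea: each clip's list is a set of links to the clips described in the most similar terms.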

This similarity is a semantic similarity, as Doug Turnbull likes to call it. The clips might not have an exactly similar sound, the rhythms might not match, or they might differ in an instrument or two, but for the most part you would describe them with the same sorts of words. If you want to know what kinds of words we’re talking about, take a look at the autotagging results. Have a look and let me know what you think (you can leave a link to any exceptional results you find in the comments).

Journal of New Music Research

Thursday, August 14th, 2008

The journal version of last year’s ISMIR paper is ready to be published. The main addition is an analysis of the tags we’ve collected with the game, including a comparison with tags for the same music from Last.fm. In these experiments, we compared the accuracy of classifiers trained on different tag corpora, which was a bit tricky. Since the Last.fm tags and the MajorMiner tags were not the same, we could only compare seven of them directly. For an overall comparison, we used the mean accuracy across tags, which is useful, but not terribly sophisticated. Here’s the abstract:
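The two kinds of comparison can be sketched as follows: a direct one restricted to the tags both corpora share, and an overall one that averages over each corpus's own tags. The tag names and accuracy values below are placeholders made up for illustration, not results from the paper.

```python
def mean_accuracy(per_tag_accuracy, tags):
    """Unweighted mean of per-tag classification accuracy."""
    return sum(per_tag_accuracy[tag] for tag in tags) / len(tags)

# Placeholder per-tag accuracies (NOT real results).
corpus_a = {"guitar": 0.75, "drums": 0.5, "piano": 1.0}
corpus_b = {"guitar": 0.5, "drums": 0.25, "rock": 0.75}

# Direct comparison: only tags that appear in both corpora.
shared = sorted(set(corpus_a) & set(corpus_b))
direct = {tag: (corpus_a[tag], corpus_b[tag]) for tag in shared}

# Overall comparison: mean accuracy over each corpus's own tags.
overall_a = mean_accuracy(corpus_a, list(corpus_a))
overall_b = mean_accuracy(corpus_b, list(corpus_b))

# Mean restricted to the shared tags, for a like-for-like number.
shared_a = mean_accuracy(corpus_a, shared)
shared_b = mean_accuracy(corpus_b, shared)
```

The unweighted mean treats every tag as equally important, which is part of why it is not terribly sophisticated.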

We have designed a web-based game, MajorMiner, that makes collecting descriptions of musical excerpts fun, easy, useful, and objective. Participants describe 10 second clips of songs and score points when their descriptions match those of other participants. The rules were designed to encourage players to be thorough and the clip length was chosen to make judgments objective and specific. To analyze the data, we measured the degree to which binary classifiers could be trained to spot popular tags. We also compared the performance of clip classifiers trained with MajorMiner’s tag data to those trained with social tag data from a popular website. On the top 25 tags from each source, MajorMiner’s tags were classified correctly 67.2% of the time, while the social tags were classified correctly 62.6% of the time.

MIREX Audio Tag Classification

Wednesday, August 13th, 2008

Every time I write about MajorMiner, people ask when I’m going to make the data publicly available. Well, I’m starting to do that by building a MIREX task around it. The task is officially called the Audio Tag Classification task and you can take a look at the details on its MIREX wiki page. As emails and conversations bounced around, it became not just a classification task, but also a retrieval task. Doug Turnbull formulated it well by breaking it down into three related tasks:

  1. Clip-Tag classification: determine whether each tag applies to each clip or not
  2. Clip retrieval: for each tag, rank the clips by their relevance
  3. Tag retrieval: for each clip, rank the tags by their relevance
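All three tasks can be read off a single table of clip-tag relevance scores, as in the sketch below. The scores, the clip and tag names, and the 0.5 decision threshold are assumptions for illustration; this is not the official MIREX evaluation code.

```python
# Hypothetical relevance scores: scores[clip][tag] in [0, 1].
scores = {
    "clip1": {"guitar": 0.9, "piano": 0.2, "drums": 0.6},
    "clip2": {"guitar": 0.3, "piano": 0.8, "drums": 0.4},
    "clip3": {"guitar": 0.7, "piano": 0.1, "drums": 0.5},
}

def classify_clip_tags(scores, threshold=0.5):
    """Task 1: decide for every clip-tag pair whether the tag applies."""
    return {clip: {tag: score >= threshold for tag, score in tags.items()}
            for clip, tags in scores.items()}

def retrieve_clips(scores, tag):
    """Task 2: rank the clips by their relevance to one tag."""
    return sorted(scores, key=lambda clip: scores[clip][tag], reverse=True)

def retrieve_tags(scores, clip):
    """Task 3: rank the tags by their relevance to one clip."""
    return sorted(scores[clip], key=scores[clip].get, reverse=True)

decisions = classify_clip_tags(scores)
guitar_ranking = retrieve_clips(scores, "guitar")
clip2_tags = retrieve_tags(scores, "clip2")
```

Classification needs a threshold while the two retrieval tasks only need an ordering, so a system can do well at ranking even if its scores are poorly calibrated.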

There are only a few days left until the submission deadline, but if you want to throw something together, more submissions would be great. In case you’re wondering, the main contributors to the design of this task have been Kris West, Thierry Bertin-Mahieux, Doug Turnbull, and Greg Tsoumakas. Mert Bay is running things at IMIRSEL.