MESSL code online

After getting a number of queries, I’ve realized that I should just put the MATLAB code for MESSL online so everyone can use it. Note, however, that this is for noncommercial use only. If you would like to use it commercially, please contact me. Also let me know if you find it useful. And if you use it academically, I would appreciate it if you cited this paper:

Michael I. Mandel, Ron J. Weiss, and Daniel P. W. Ellis. Model-based expectation maximization source separation and localization. IEEE Transactions on Audio, Speech, and Language Processing, 18(2):382-394, February 2010.

Update 2012-05-24: I have moved the code to GitHub to make it easier to distribute and track.

Here is the README for the code:

Model-Based EM Source Separation and Localization

Copyright 2006-2009 Michael I Mandel and Ron Weiss, all rights reserved
Last updated 2009-08-20

Basic usage to separate two sources:

% Load stereo wav files of the same length
y1 = wavread('XXX1.wav');
y2 = wavread('XXX2.wav');
lr = y1 + y2;

% tau (the grid of candidate interaural delays) must be defined before this call
[m,p] = messl(lr, tau, 2, 'vis', 1);

% Reconstruct waveforms from masks
yhat1 = reconstruct(m, lr, 1);
yhat2 = reconstruct(m, lr, 2);

Fancier usage, initialized from PHAT-histogram:

% Localize and then run MESSL
tdoa = phatLoc(lr, tau, 2, 1024, 1);
[m,p] = messl(lr, tau, 2, 'vis', 1, 'tauPosInit', tdoa);

Even fancier usage, garbage source and ILD prior (better in reverb,
but only when using dummy-head recordings):

[m,p] = messl(lr, tau, 2, 'vis', 1, 'ildPriorPrec', 3, ...
'GarbageSrc', 1, 'sr', 16000);

Can also use prob2mask to make the mask more definitive, i.e. closer
to binary, but not binary.

m2 = prob2mask(m);
yhat1 = reconstruct(m2, lr, 1);
yhat2 = reconstruct(m2, lr, 2);

Posted in research | 1 Comment

Blogging is hard

Let’s go tumblr-ing.

Posted in meta | Leave a comment

ISMIR 2010

I just got back from ISMIR 2010 in Utrecht, which was lots of fun. I had one paper there on which I was first author and it was the first talk I’ve given at a conference. Well, I had a talk at SAPA 2006, but that’s really a workshop. So it was very exciting and I think it went over pretty well.

Learning tags that vary within a song

This paper examines the relationship between human generated tags describing different parts of the same song. These tags were collected using Amazon’s Mechanical Turk service. We find that the agreement between different people’s tags decreases as the distance between the parts of a song that they heard increases. To model these tags and these relationships, we describe a conditional restricted Boltzmann machine. Using this model to fill in tags that should probably be present given a context of other tags, we train automatic tag classifiers (autotaggers) that outperform those trained on the original data.

You can read the full paper here and see my slides here.

Posted in Uncategorized | Leave a comment

Dissertation done

I’ve put the finishing touches on my dissertation. My defense was at the beginning of September, and my committee didn’t ask for (m)any changes, but there were a few things I wanted to fix up. The official “publication” date will be February 2010.

Binaural Model-Based Source Separation and Localization

When listening in noisy and reverberant environments, human listeners are able to focus on a particular sound of interest while ignoring interfering sounds. Computer listeners, however, can only perform highly constrained versions of this task. While automatic speech recognition systems and hearing aids work well in quiet conditions, source separation is necessary for them to be able to function in these challenging situations.

This dissertation introduces a system that separates more than two sound sources from reverberant, binaural mixtures based on the sources’ locations. Each source is modelled probabilistically using information about its interaural time and level differences at every frequency, with parameters learned using an expectation maximization (EM) algorithm. The system is therefore called Model-based EM Source Separation and Localization (MESSL). This EM algorithm alternates between refining its estimates of the model parameters (location) for each source and refining its estimates of the regions of the spectrogram dominated by each source. In addition to successfully separating sources, the algorithm estimates model parameters from a mixture that have direct psychoacoustic relevance and can usually only be measured for isolated sources. One of the key features enabling this separation is a novel probabilistic localization model that can be evaluated at individual time-frequency points and over arbitrarily-shaped regions of the spectrogram.

The localization performance of the systems introduced here is comparable to that of humans in both anechoic and reverberant conditions, with a 40% lower mean absolute error than four comparable algorithms. When target and masker sources are mixed at similar levels, MESSL’s separations have signal-to-distortion ratios 2.0 dB higher than four comparable separation algorithms and estimated speech quality 0.19 mean opinion score units higher. When target and masker sources are mixed anechoically at very different levels, MESSL’s performance is comparable to humans’, but in similar reverberant mixtures it only achieves 20–25% of human performance. While MESSL successfully rejects enough of the direct-path portion of the masking source in reverberant mixtures to improve energy-based signal-to-noise ratio results, it has difficulty rejecting enough reverberation to improve automatic speech recognition results significantly. This problem is shared by other comparable separation systems.

Download it as a single pdf (7.6 MB)

Download separate chapters as pdfs (some of the internal links don’t work):

Posted in research | 1 Comment

Sidebar updates

Walking on the moon

I finally cleared my photo backlog, which went back to December, and uploaded the photos to Flickr. There are more than will fit in the sidebar, so check out the rest of my concert photos and random phone photos there.

I also added a LibraryThing widget, which shows the last N books I’ve read. I haven’t been doing much extracurricular reading in the last month or two, but I haven’t told you about the ones before that, so they’re new to you.

Posted in books, meta | Leave a comment

Reading papers

I’m working on the literature review chapter of the dissertation and it’s gotten me thinking. It’s a real pain to put together a good survey. It’s hard to know what papers are out there, what they say, and what’s notable about them. I’ve been using a few tools for this, but there’s a lot of room for improvement.

I’ve been using CiteULike for a while, and it’s great for scraping IEEE and JASA abstracts. It can import BibTeX files, but those entries are harder to link to PDFs and you might end up with a lot of duplicates if you’re not careful. On Neeraj’s recommendation, I’ve been trying out Mendeley and it’s a bit cooler. It can read in a directory of PDFs and figure out to some extent what they are. This is most useful with popular papers, because they have some sort of fingerprinter that recognizes the same PDF from multiple users and matches up the metadata. That way, only one person has to correct each entry and others can benefit. I’m not sure if it’s been able to recognize when a PDF is the same as a BibTeX entry, but it might be able to. They also seem to have a very responsive feedback system using UserVoice. And of course there’s always Google Scholar to actually find these papers.

But, I think these apps could be a lot more useful. Instead of just linking to the papers that cite a paper, a lot could be gained by keeping track of the “anchor text” that does the linking. This means not only noting that paper X cites paper Y, but that paper X describes paper Y in this way, uses this information from it, cites it in this context, etc.

The first thing that this would enable would be the annotation of a paper’s bibliography with the relevant parts of its text. These are all of the outgoing links from a paper. By analyzing the paper, all of the [22]s or the (Mojo, 1987)s could be associated with the right entry in the bibliography. It would give the references some much-needed context. It would also show which references in a bibliography were actually discussed and which were just mentioned in passing. This could be an application by itself. And while we’re linking the references to the bibliography, it could put in some hyperlinks, like the hyperref package in LaTeX does, but after the fact.
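To make the outgoing-link idea concrete, here is a minimal Python sketch of how it might start. It assumes citations look like either [22] or (Mojo, 1987) and that sentences can be split naively on end punctuation; the function name `citation_contexts` and the sample text are made up for illustration, and a real tool would need far more robust parsing.

```python
import re
from collections import defaultdict

# Matches numeric markers like [22] and author-year markers like (Mojo, 1987).
CITATION = re.compile(r"\[(\d+)\]|\(([A-Z][A-Za-z]+),\s*(\d{4})\)")

def citation_contexts(text):
    """Map each citation key to the sentences that mention it."""
    contexts = defaultdict(list)
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for match in CITATION.finditer(sentence):
            number, author, year = match.groups()
            key = number if number else f"{author} {year}"
            contexts[key].append(sentence)
    return dict(contexts)

# Made-up sample text, for illustration only.
sample = (
    "Early work used binary masks [22]. "
    "A probabilistic model was proposed later (Mojo, 1987). "
    "We extend [22] to reverberant rooms."
)
for key, sentences in citation_contexts(sample).items():
    print(key, len(sentences))
```

A reference that collects several sentences was probably discussed; one that collects a single sentence was probably just mentioned in passing.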

The second thing that it would enable would be the annotation of a paper with all of the things that other papers have said about it. These would be all of the incoming links from other papers. It would give you some context on a new paper that you had just come across in addition to what the authors wrote in the abstract. If you wanted to be really fancy, each reader could have trusted sources of these incoming links. These opinions are like little mini reviews or summaries that have already been published, so there is no need to solicit readers’ opinions. Instead of the first few sentences on a “papers that cite X” page on Google Scholar, you’d get a page of summaries, reviews, and extracts.
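The incoming-link view is really just an inversion of the per-paper data. As a rough sketch, assuming each citing paper has already been reduced to (cited paper, sentence) pairs, something like the following would do it. The function name `incoming_contexts` and the paper names and sentences are all hypothetical.

```python
from collections import defaultdict

def incoming_contexts(citing_papers):
    """Invert per-paper citation contexts into a per-cited-paper index."""
    incoming = defaultdict(list)
    for citing, mentions in citing_papers.items():
        for cited, sentence in mentions:
            incoming[cited].append((citing, sentence))
    return dict(incoming)

# Hypothetical, hand-written data, for illustration only.
papers = {
    "Paper A": [("Paper X", "Paper X introduced the model we build on.")],
    "Paper B": [
        ("Paper X", "Unlike Paper X, we handle reverberation."),
        ("Paper Y", "We use the dataset from Paper Y."),
    ],
}
for citing, sentence in incoming_contexts(papers)["Paper X"]:
    print(f"{citing}: {sentence}")
```

Restricting the loop to a reader’s trusted citing papers would give the “trusted sources” variant.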

Both of these features would make it much easier to get introduced to a new field or to write a more balanced review of a familiar or semi-familiar field. I know it’s tough to match up bibliography entries with references and with papers themselves, and that there are some user interface issues to work out here, but it shouldn’t be that hard. Maybe crowdsourcing could help if necessary. Hopefully all of this would help allay that niggling fear of missing an important paper.

Posted in ideas, research | Leave a comment

Off to Montreal

Just when you thought I’d given up on this blog…big news! The grant I wrote to the Quebec government came through. I’m going to do a postdoc with Doug Eck at the Université de Montréal. It should be very cool: I’ll be building up my machine learning chops working on autotagging and other music problems. I move up there in the fall. Now all I have to do is write my dissertation and learn French…

Posted in research | 1 Comment


As I’ve said many times, I’m a big fan of Malcolm Gladwell’s essays, so when Uncle Wayne and Aunt Jane got me a gift certificate to Barnes and Noble for Hanukkah, I bought a copy of his new book, Outliers. I couldn’t put it down and it was a quick read, so I finished it in a few days. The book is broken into two halves. The first half explores the idea that successful people are successful because firstly they get lucky and secondly they work hard to take advantage of that luck. The second half explores the idea that different cultures are, in fact, different and that those differences have real effects over many generations. It’s linked to the first half in that these differences are intertwined with the lucky breaks people get.

While I enjoyed it, the book seemed a bit padded at times. There were tangential tables that took up multiple pages and the epilogue, an account of the lucky occurrences throughout Gladwell’s family tree, was a nice anecdote but didn’t really bring any more support to the thesis. A couple of the citations came from Wikipedia. It’s certainly not worth the $29 list price, or even the $20 discount price; I’d recommend waiting for the paperback edition or a secondhand copy. There were, however, a number of choice Gladwellian factoids, which I will relate.

In the first half of the book, there were some interesting anecdotes about the lucky breaks that Bill Gates and Bill Joy got on their ways to the top, but the most interesting idea was the fact that there are certain birth months and years that are better than others for success in various fields. The first example of this is in sports, where a national birthday cutoff for kids leads to a disproportionate number of the best adult athletes being born just after that cutoff. The explanation is that the kids who are the oldest for their age group are the biggest and best and they get put on traveling teams, get more practice, play more, and get better coaching, which eventually leads to a significant advantage when they grow up. In English football, the cutoff is September 1st and, at one point in the mid 90s, the Premier League had twice as many players born in the three months after the cutoff as in the three before it (Nature article). This is apparently also true in elementary school, where older 4th graders score better on math tests than younger classmates. This advantage even continues through college, where “students belonging to the relatively youngest group in their class are under-represented by about 11.6 percent.”

Gladwell also describes two cases where birth year gave people a leg up. The first was in the software industry, the titans of which were disproportionately born in 1954 and 1955. Gladwell’s argument is that these people were 20 or 21 in 1975 when the Altair 8800, the first personally-attainable computer, was released. The second was among the lawyers specializing in hostile takeovers in the 1970s, who were disproportionately born during the Great Depression in the 1930s when birth rates dropped significantly. This gave them better access to schools, colleges, and law schools, which had just been expanded to accommodate the previous generation.

The second half of the book focused on the measurable effects of differences between cultures. In an interesting, but less convincing argument, Gladwell claims that rice cultivation in paddies leads to more entrepreneurial farmers, while wheat and corn cultivation lead to stronger feudal hierarchies. Apparently rice cultivation is quite a tricky endeavor and yields are increased by leveling the ground in the paddy, maintaining the correct water level, using the right combination of rice strains, weeding thoroughly, and fertilizing properly. Rice landlords charged fixed rent, allowing rice farmers to profit from larger harvests, while wheat landlords paid fixed wages regardless of yield.

Chinese rice farmers were able to grow rice all year round, harvesting and planting new seedlings two or three times a year. French peasants, on the other hand, planted in the spring, harvested in the fall, and hibernated through the winter. Rice paddies, furthermore, are enriched by nutrients in the irrigation and can be used continuously. Wheat and corn fields, on the other hand, are exhausted by agriculture and need to lie fallow every few years to recover. Gladwell suggests that this difference in farming practices led to opposing cultural analogies for human mental growth, and to differences in national school schedules: the American school year is on average 180 days long, while the Japanese school year is 243 days long.

His slightly more convincing argument in this section was about plane crashes. Apparently, plane crashes are generally caused by the compounding of a number of small errors, a condition that is best mitigated by sharing responsibilities between the captain and the first officer. In cultures that have a great deal of respect for authority, such as Korea, the deference the first officers showed to captains tended to cause more crashes. Between 1988 and 1998, Korean Air lost 4.79 planes in accidents for every million departures. Compare that to United Airlines, which in the same period lost 0.27 planes in accidents for every million departures. By training its first officers to be more assertive when they noticed a problem, Korean Air has gotten these numbers in line with other carriers. An IBM psychologist, Geert Hofstede, surveyed employees around the globe and used their answers to assemble a set of dimensions for measuring how cultures differ from each other, now known as Hofstede’s dimensions. Korea is apparently second from the top of Hofstede’s list in deference towards authority.

Overall it was a fun read, but I think I’m a bigger fan of Gladwell’s shorter writings.

Posted in books | 1 Comment

Ground truth

The funny thing about my research, whether it’s music classification, source separation, or any other sort of machine learning task I can think of, is the difference between developing an algorithm and deploying it. It’s actually harder to develop an algorithm than it is to deploy it. To deploy an algorithm, if you’re shooting from the hip, you just need to build it and run it on the data you want to analyze. So if I want to deploy a music classifier, I extract some features, train a classifier, and classify some music. To develop the algorithm, however, you need to do everything you would need to do to deploy it, but then you also need ground truth. That is to say that you need to know what answer you’re expecting before you get it, so you can tell how well you’re doing. So, paradoxically, I need to have my music already classified in order to see how well my classifier can classify it.

It was always clear to me that if you can get new ground truth data, you can do cool new things, but it took a while for it to sink in that you really can’t develop a system to solve a problem without having the problem already solved, in some sense. Of course, the power of machine learning comes from being able to extrapolate results from a small subset of labeled data to an infinite amount of as-yet-unlabeled data. I can develop music classifiers (and pick the one that does best) using a small set of already-classified music and then use it to classify as much music as I want. The question is, is that as-yet-unlabeled data really that similar to the test set? When you have enough data to know the answer to that question, you probably have enough data to do pretty well with a basic classifier.
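Here is a toy Python sketch of the develop-versus-deploy distinction. It uses a nearest-centroid classifier, which is just a stand-in I picked for brevity and not anything from my actual systems, and the two-dimensional “features” and genre labels are made up. Measuring accuracy is the development step that needs ground truth; calling `classify` on a new point is deployment, which doesn’t.

```python
def train_centroids(features, labels):
    """Nearest-centroid 'training': average the feature vectors of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist(centroids[y]))

def accuracy(centroids, features, labels):
    """Fraction of examples whose predicted label matches the ground truth."""
    hits = sum(classify(centroids, x) == y for x, y in zip(features, labels))
    return hits / len(labels)

# Made-up 2-D features with ground-truth genre labels, for illustration only.
train_x = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
train_y = ["folk", "folk", "metal", "metal"]
test_x = [[0.15, 0.15], [0.85, 0.85], [0.6, 0.5]]
test_y = ["folk", "metal", "folk"]

model = train_centroids(train_x, train_y)
print(accuracy(model, test_x, test_y))  # development: needs test_y
print(classify(model, [0.3, 0.3]))      # deployment: no label required
```

With this data the classifier gets two of the three held-out examples right, and without `test_y` there would be no way to know that.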

As an aside, I’m always highly doubtful of claims that computers can latch onto things beyond human perception. For watermarking, sure, it’s designed so that machines can perceive it, but people can’t. But when it comes to very human-grounded ideas like similarity, I think it is impossible to try to circumvent human “subjectivity”. There really is no objective measure of whether two sounds are similar besides the consistency in subjective ratings of human listeners. I think much of the trick of developing (provably) useful algorithms is defining problems that have objective solutions and then solving them objectively.

Posted in research | 1 Comment

ISMIR 2008

ISMIR was fun this year. I was pleasantly surprised by the quality of the papers; there were many solid experiments. I added a lot of them to my to-read list. I enjoyed hanging out with the ISMIR crowd, people I only get to see a few times a year or less. While I got a lot out of the conference, I regret not expanding my social circle more and not showing off my majorminer search demo more. I did, however, get to spend some quality time with my dad, my sister, and Joanne, and I probably spent more time in West Philly than ever before.

There was an interesting panel on Commercial Music Discovery and Recommendation, which I found a little bit discouraging. The message seemed to be that academic research and corporate development are different things and shouldn’t be confused. Elias said something to the effect that given the choice between 10x more data and an algorithm that was a few percent better, he’d take the data. Brian said that companies do what they do pretty well and academics should focus on doing things that companies can’t. Anthony didn’t foresee developments in MIR affecting him very much, predicting instead that user interface was the area that could improve the online music experience the most.

For more thorough coverage of the conference, take a look at some of the other ISMIR attendees’ blogs. Google blog search pointed me to a number of posts (in pagerank order): Paul Lamere (one of many), Elias Pampalk, Michael Good, Justin Donaldson, Jeremy ?, Kris West, Luke Barrington, Matthias Mauch, and Karin Dressler (in German).

Posted in research | Leave a comment