Archive for the ‘research’ Category

ISMIR 2008

Friday, October 3rd, 2008

ISMIR was fun this year. I was pleasantly surprised by the quality of the papers; there were many solid experiments, and I added a lot of them to my to-read list. I enjoyed hanging out with the ISMIR crowd, people I only get to see a few times a year or less. While I got a lot out of the conference, I regret not expanding my social circle more and not showing off my MajorMiner search demo more. I did, however, get to spend some quality time with my dad, my sister, and Joanne, and I probably spent more time in West Philly than ever before.

There was an interesting panel on Commercial Music Discovery and Recommendation, which I found a little bit discouraging. The message seemed to be that academic research and corporate development are different things and shouldn’t be confused. Elias said something to the effect that given the choice between 10x more data and an algorithm that was a few percent better, he’d take the data. Brian said that companies do what they do pretty well and academics should focus on doing things that companies can’t. Anthony didn’t foresee developments in MIR affecting him very much, predicting instead that user interface was the area that could improve the online music experience the most.

For more thorough coverage of the conference, take a look at some of the other ISMIR attendees’ blogs. Google blog search pointed me to a number of posts (in pagerank order): Paul Lamere (one of many), Elias Pampalk, Michael Good, Justin Donaldson, Jeremy ?, Kris West, Luke Barrington, Matthias Mauch, and Karin Dressler (in German).

MajorMiner good news and bad news

Wednesday, September 10th, 2008

The bad news is that I’m having some DNS issues with majorminer.com. The good news is that you can now access the same great MajorMiner game and search at majorminer.org.

The other good news is that there’s a new and improved search page. Hopefully it’s easier to understand what’s going on; I’ve even added a FAQ. If you look hard enough, you might also notice a new feature I’ve introduced: similarity browsing. For each clip that was autotagged, I computed a similarity value between its autotag vector and all of the others, finding the thirty or so nearest neighbors. You can follow this web of similarity around to some fun stuff, even though it’s not a huge collection of music. Here’s a random starting place.

This similarity is a semantic similarity, as Doug Turnbull likes to call it. The clips might not have an exactly similar sound, the rhythms might not match, or they might differ in an instrument or two, but for the most part you would describe them with the same sorts of words. If you want to know what kinds of words we’re talking about, take a look at the autotagging results. Have a look and let me know what you think (you can leave a link to any exceptional results you find in the comments).
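The similarity browsing described above boils down to a nearest-neighbor lookup over autotag vectors. Here is a minimal sketch of that idea in Python. The choice of cosine similarity, the function names, and the toy vectors are all assumptions made for illustration; the post does not say which similarity measure the site actually uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(clip_id, autotags, k=30):
    """Return the ids of the k clips whose autotag vectors are most similar to clip_id's."""
    query = autotags[clip_id]
    scored = [
        (cosine_similarity(query, vector), other)
        for other, vector in autotags.items()
        if other != clip_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:k]]


# Hypothetical autotag vectors: one classifier score per tag (guitar, rap, piano).
autotags = {
    "clip_a": [0.9, 0.1, 0.2],
    "clip_b": [0.8, 0.2, 0.1],
    "clip_c": [0.1, 0.9, 0.0],
    "clip_d": [0.2, 0.1, 0.9],
}
neighbors = nearest_neighbors("clip_a", autotags, k=2)
```

With these made-up scores, the guitar-heavy clip_b comes out as the closest neighbor of clip_a, which is the "you would describe them with the same words" behavior in miniature.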

Journal of New Music Research

Thursday, August 14th, 2008

The journal version of last year’s ISMIR paper is ready to be published. The main addition is an analysis of the tags we’ve collected with the game, including a comparison with tags for the same music from Last.fm. In these experiments, we compared the accuracy of classifiers trained on different tag corpora, which was a bit tricky. Since the Last.fm tags and the MajorMiner tags were not the same, we could only compare seven of them directly. For an overall comparison, we used the mean accuracy across tags, which is useful, but not terribly sophisticated. Here’s the abstract:

We have designed a web-based game, MajorMiner, that makes collecting descriptions of musical excerpts fun, easy, useful, and objective. Participants describe 10 second clips of songs and score points when their descriptions match those of other participants. The rules were designed to encourage players to be thorough and the clip length was chosen to make judgments objective and specific. To analyze the data, we measured the degree to which binary classifiers could be trained to spot popular tags. We also compared the performance of clip classifiers trained with MajorMiner’s tag data to those trained with social tag data from a popular website. On the top 25 tags from each source, MajorMiner’s tags were classified correctly 67.2% of the time, while the social tags were classified correctly 62.6% of the time.

MIREX Audio Tag Classification

Wednesday, August 13th, 2008

Every time I write about MajorMiner, people ask when I’m going to make the data publicly available. Well, I’m starting to do that by building a MIREX task around it. The task is officially called the Audio Tag Classification task and you can take a look at the details on its MIREX wiki page. As emails and conversations bounced around, it became not just a classification task, but also a retrieval task. Doug Turnbull formulated it well by breaking it down into three related tasks:

  1. Clip-Tag classification: determine whether each tag applies to each clip or not
  2. Clip retrieval: for each tag, rank the clips by their relevance
  3. Tag retrieval: for each clip, rank the tags by their relevance
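All three tasks can be read off the same table of per-clip, per-tag scores, which the following minimal Python sketch tries to make concrete. The score values, the 0.5 threshold, and the function names are assumptions for illustration only; they are not part of the MIREX task definition.

```python
def classify(scores, threshold=0.5):
    """Clip-tag classification: a yes/no decision for every (clip, tag) pair."""
    return {
        clip: {tag: score >= threshold for tag, score in tags.items()}
        for clip, tags in scores.items()
    }


def rank_clips(scores, tag):
    """Clip retrieval: clips ordered from most to least relevant for one tag."""
    return sorted(scores, key=lambda clip: scores[clip][tag], reverse=True)


def rank_tags(scores, clip):
    """Tag retrieval: tags ordered from most to least relevant for one clip."""
    return sorted(scores[clip], key=lambda tag: scores[clip][tag], reverse=True)


# Hypothetical classifier outputs in [0, 1] for three clips and three tags.
scores = {
    "clip_1": {"guitar": 0.9, "rap": 0.1, "piano": 0.4},
    "clip_2": {"guitar": 0.3, "rap": 0.8, "piano": 0.2},
    "clip_3": {"guitar": 0.6, "rap": 0.2, "piano": 0.7},
}
decisions = classify(scores)
guitar_ranking = rank_clips(scores, "guitar")
clip_3_tags = rank_tags(scores, "clip_3")
```

The classification task needs a hard decision (hence the threshold), while the two retrieval tasks only need the ordering, so a system with poorly calibrated scores can still do well on them.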

There are only a few days left until the submission deadline, but if you want to throw something together, more submissions would be great. In case you’re wondering, the main contributors to the design of this task have been Kris West, Thierry Bertin-Mahieux, Doug Turnbull, and Greg Tsoumakas. Mert Bay is running things at IMIRSEL.

Interspeech 2008

Tuesday, August 12th, 2008

Ron had a paper accepted to Interspeech this year about adding speech models (source priors) to MESSL. It is entitled, “Source separation based on binaural cues and source model constraints.” As much as I’d like to go, Brisbane, Australia is a bit farther than Pittsburgh was. Here’s the abstract:

We describe a system for separating multiple sources from a two-channel recording based on interaural cues and known characteristics of the source signals. We combine a probabilistic model of the observed interaural level and phase differences with a prior model of the source statistics and derive an EM algorithm for finding the maximum likelihood parameters of the joint model. The system is able to separate more sound sources than there are observed channels. In simulated reverberant mixtures of three speakers the proposed algorithm gives a signal-to-noise ratio improvement of 2.1 dB over a baseline algorithm using only interaural cues.

ISMIR 2008

Monday, August 11th, 2008

My paper was accepted to ISMIR this year in Philadelphia. It uses the MajorMiner data we’ve collected to explore the relationship between different granularities of music metadata. That is to say, we compare the accuracy with which clip-level audio classifiers can be trained using ground truth data that is supplied at the artist, album, track, or clip level. There’s lots of music metadata supplied at one of these coarser granularities, e.g. Pandora, Last.fm, and the All Music Guide, which have described a much greater fraction of the artists out there than the tracks. This paper looks at the feasibility of using such data to train clip-level classifiers. Since the MajorMiner data is collected at the clip level, it’s easy to blur it out to tracks, albums, or artists, and also easy to evaluate the accuracy of the final clip classifiers. Here’s the abstract:

Multiple-instance learning algorithms train classifiers from lightly supervised data, i.e. labeled collections of items, rather than labeled items. We compare the multiple-instance learners mi-SVM and MILES on the task of classifying 10-second song clips. These classifiers are trained on tags at the track, album, and artist levels, or granularities, that have been derived from tags at the clip granularity, allowing us to test the effectiveness of the learners at recovering the clip labeling in the training set and predicting the clip labeling for a held-out test set. We find that mi-SVM is better than a control at the recovery task on training clips, with an average classification accuracy as high as 87% over 43 tags; on test clips, it is comparable to the control with an average classification accuracy of up to 68%. MILES performed adequately on the recovery task, but poorly on the test clips.

Broken sound card

Wednesday, May 21st, 2008

My desktop is getting a little old. I got it the summer before I went to college, which means its 8th birthday is coming up in August. It’s still plugging along, though, hosting this blog, among other things, with a whopping 384 MB of Rambus RAM. Aside from the memory, it’s starting to show its age in other ways, like the sound card going on the fritz.

It all started when I went to check out guitarati.com. I clicked on a “play” link and garbage came out the speakers. It wasn’t complete garbage; it sounded like music in the most tenuous of ways, but it wasn’t supposed to sound like that. “Too bad that website’s sound is broken,” I thought, except that everything I played after that sounded garbled in the same way. Working with sound and computers for a living, I figured I should investigate the problem. My first thought was that something was getting the endianness of the 16-bit samples wrong, like MATLAB did (does?) on Linux. Trying a variety of endianness styles in playback, however, didn’t solve the problem.

[Figures: spectrogram of the original chirp (left) and of the recorded chirp (right)]

My next thought was to record the weirdness, and the mic seemed to be working fine (although I couldn’t hear the playback to be certain). I constructed a simple chirp in the linear algebra program Octave, the spectrogram of which can be seen at the left. What I recorded coming out of my speakers was very different from that, as can be seen from the spectrogram on the right. Disregard the gradual changes in color; the important thing to notice is that instead of just one line sloping up in the low frequencies, there are suddenly lines every 1380 Hz sloping up and down from common beginnings. The multiple horizontal repetitions are just repeated playbacks.
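For anyone who wants to try the same test, a linear chirp is only a few lines of code. This is a minimal sketch in Python rather than the original Octave, and the start frequency, end frequency, duration, and sampling rate below are placeholder values, since the post does not record the ones that were actually used.

```python
import math


def linear_chirp(f_start, f_end, duration, sample_rate):
    """Samples of a sine sweep whose frequency rises linearly from f_start to f_end (Hz)."""
    n_samples = int(round(duration * sample_rate))
    sweep_rate = (f_end - f_start) / duration  # change in frequency, in Hz per second
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        # The phase is the integral of the instantaneous frequency f_start + sweep_rate * t.
        phase = 2.0 * math.pi * (f_start * t + 0.5 * sweep_rate * t * t)
        samples.append(math.sin(phase))
    return samples


# Placeholder parameters: a one-second sweep from 100 Hz to 2 kHz at 44.1 kHz.
chirp = linear_chirp(f_start=100.0, f_end=2000.0, duration=1.0, sample_rate=44100)
```

A chirp is a handy test signal here because it shows up as a single sloping line in a spectrogram, so any extra lines are immediately visible as distortion.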

This repetition in frequency indicated that the sound coming out of the speakers had a bandwidth of only 690 Hz, instead of the requested 22050 Hz. Furthermore, playing the sound back at different sampling rates changed the bandwidth of the signal coming out of the speakers. The ratio of requested bandwidth to actual bandwidth turned out to be almost exactly 32:1. This sort of replication happens when upsampling a signal (pictures from the time and frequency domains), and seems to indicate that 31 of every 32 samples were being set to 0, with the horrible distortion coming from the resulting aliasing.
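That explanation is easy to check numerically. The sketch below, in Python with a deliberately naive DFT, zeroes all but every 32nd sample of a pure tone and lists where the spectral peaks land. The signal length, the tone frequency, and the 44100 Hz sampling rate (implied by the 22050 Hz bandwidth) are assumptions for illustration; it uses a single tone rather than the chirp to keep the output readable.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of every bin of a naive O(N^2) discrete Fourier transform."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal)))
        for k in range(n)
    ]


def keep_every(signal, factor):
    """Zero out all samples except every factor-th one, leaving the length unchanged."""
    return [x if i % factor == 0 else 0.0 for i, x in enumerate(signal)]


def peak_bins(magnitudes, threshold=1.0):
    """Bins from 0 up to Nyquist whose magnitude exceeds the threshold."""
    half = len(magnitudes) // 2
    return [k for k in range(half + 1) if magnitudes[k] > threshold]


N = 256        # signal length in samples (assumed)
FACTOR = 32    # keep 1 of every 32 samples, as the post suggests
TONE_BIN = 3   # the tone sits exactly on DFT bin 3 (assumed)
SAMPLE_RATE = 44100  # assumed; implied by the 22050 Hz requested bandwidth

tone = [math.cos(2.0 * math.pi * TONE_BIN * i / N) for i in range(N)]
clean_peaks = peak_bins(dft_magnitudes(tone))
damaged_peaks = peak_bins(dft_magnitudes(keep_every(tone, FACTOR)))

image_spacing_hz = SAMPLE_RATE / FACTOR          # distance between repeated copies
actual_bandwidth_hz = SAMPLE_RATE / 2 / FACTOR   # bandwidth of each copy
```

The clean tone has a single peak, while the damaged one has a pair of peaks (one mirrored, one not) repeating every N / FACTOR = 8 bins. At 44100 Hz the copies are spaced 1378.125 Hz apart, each about 689 Hz wide, which is consistent with the roughly 1380 Hz and 690 Hz figures read off the spectrogram, and the mirrored peaks correspond to the lines sloping down as well as up.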

Even with a pretty good idea of what was happening, I still couldn’t figure out why it was happening. I tried resetting the various sound drivers, restarting the computer, even booting off a LiveCD, and nothing worked. I purchased a USB sound card (for $7 + $6 shipping), plugged it in, and could hear again. It’s a cute little device, basically just a USB plug on one end and a headphone and mic jack on the other. It was quite a relief to be able to watch my backlog of youtube videos, listen to music, and stream NPR while I was working out. Sound is pretty handy, as it turns out.

The question still remains whether what happened to my internal sound card was a hardware failure or just a bad setting. There aren’t that many reports online of the hardware on a sound card failing; it’s not like there are moving parts. If you have any insight, I’m all ears.

Recti-Linear room simulator

Sunday, May 18th, 2008

[Figure: spectrogram of simulated impulse response]

For studying auditory localization and separation, it’s very important to have realistic spatial recordings. These can come from a number of sources; what I’ve been using in the past is a collection of impulse responses recorded through a KEMAR dummy in a real classroom. Each impulse response allows a sound to be simulated for one particular listener and source position in the room. While these are very realistic, they are time-consuming to record. The particular binaural impulse responses I’ve been using were recorded by Tim Streeter in Barbara Shinn-Cunningham’s lab at BU.

The next-best way to create spatial recordings is to simulate these binaural impulse responses. There are a couple of decent packages out there for doing this. Stephen McGovern’s rir is very fast, but only creates a bare-bones impulse response. Douglas Campbell et al.’s roomsim includes lots of features to make the impulse responses realistic, but it is very slow and can only be driven through a GUI.

I’ve written a new room simulator in the same spirit, but combining the best features of both of these. It’s called rlrs, short for “recti-linear room simulator”, and you can download it here (3.6 MB). I’m releasing it under the GPLv3. Here’s the intro from the README:

This code will generate binaural impulse responses from a simulation of the acoustics of a rectilinear room using the image method. It has a number of features that improve the realism and speed of the simulation. It can generate a pair of 680 ms impulse responses sampled at 22050 Hz in 75 seconds on a 1.8 GHz Intel Xeon. It’s easy to run from within scripts to generate a large set of impulse responses programmatically.

To improve the realism, it applies anechoic head-related transfer functions to each incoming reflection, allows fractional delays, includes frequency-dependent absorption due to walls, includes frequency- and humidity-dependent absorption due to air, and varies the speed of sound with temperature. It also randomly perturbs sources in proportion to their distance to the listener to simulate imperfections in the alignment of the walls.

To improve simulation speed, it performs all calculations in the frequency domain, with the complex exponential generation code written in C; it calculates the Fourier transforms of anechoic HRTFs only as it needs them and then caches them; and it culls sources that are beyond the desired impulse response length or are significantly quieter than the direct path.
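To give a feel for two of the ingredients in that list, here is a minimal Python sketch of temperature-dependent sound speed, the fractional delays it produces, and culling by impulse response length. It is not taken from rlrs: the function names and source distances are hypothetical, and the speed-of-sound formula is the common dry-air approximation, which may differ from what rlrs uses. Only the 680 ms and 22050 Hz figures come from the README.

```python
import math


def speed_of_sound(temperature_c):
    """Approximate speed of sound in dry air (m/s) at a temperature in degrees Celsius."""
    return 331.3 * math.sqrt(1.0 + temperature_c / 273.15)


def delay_in_samples(distance_m, temperature_c, sample_rate):
    """Propagation delay along a path, in samples; generally not a whole number."""
    return distance_m / speed_of_sound(temperature_c) * sample_rate


def cull(distances_m, temperature_c, sample_rate, ir_seconds):
    """Keep only the image sources that arrive within the impulse response length."""
    max_delay = ir_seconds * sample_rate
    return [
        d for d in distances_m
        if delay_in_samples(d, temperature_c, sample_rate) <= max_delay
    ]


# Hypothetical image-source distances in metres.
distances = [2.0, 15.5, 120.0, 300.0]
kept = cull(distances, temperature_c=20.0, sample_rate=22050, ir_seconds=0.680)
delay = delay_in_samples(2.0, 20.0, 22050)
```

At 20 degrees Celsius a 680 ms response covers paths up to roughly 233 m, so the 300 m source is dropped. The 2 m path works out to about 128.5 samples, which is why a simulator that rounds delays to whole samples loses some realism.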

MajorMiner music search

Monday, April 28th, 2008

We’ve started using the data that we collected through the MajorMiner game. We’re using it in two ways: making it searchable directly, and training autotaggers with it. The human search finds all of the clips that have had a particular tag applied to them by at least two people, sorted by the number of times it’s been applied. You can type a search directly into the search box, or browse through the top few. People are pretty good at finding things in music, as it turns out: check out british, u2, tambourine, and scratch. This search also takes advantage of the newly introduced canonicalization of tags, so that funk matches funky. But there are always ambiguity issues, e.g. club as lyric vs genre.

The machine search is a little more involved. We took all of the tags that had been applied to enough (35) clips and used them to train classifiers. Actually, we only used clips from half of the artists in our collection to train the classifiers, then we ranked all of the clips from the rest of the artists by each classifier’s output. This means we can look at all of those clips sorted by how much they appeal to the rap classifier, the saxophone classifier, the house classifier, and so on. I like how the guitar classifier catches Outkast’s acoustic guitar (!), but also the Jesus and Mary Chain’s fuzzed out guitar. For those of you interested in the details, we have a couple of papers that we’ve submitted recently describing them, but the gist is that we’re using the features from last MIREX and the usual SVM classifier.
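The detail worth dwelling on is that the split is made over artists, not clips, so a classifier is never tested on an artist it has already heard. A minimal Python sketch of such a split follows; the function name, the clip list, and the fixed random seed are assumptions for illustration, not the code used for the site.

```python
import random


def split_by_artist(clips, train_fraction=0.5, seed=0):
    """Split (artist, clip_id) pairs so each artist lands entirely in train or in test."""
    artists = sorted({artist for artist, _ in clips})
    random.Random(seed).shuffle(artists)  # seeded so the split is reproducible
    n_train = int(len(artists) * train_fraction)
    train_artists = set(artists[:n_train])
    train = [clip for clip in clips if clip[0] in train_artists]
    test = [clip for clip in clips if clip[0] not in train_artists]
    return train, test


# Hypothetical (artist, clip id) pairs.
clips = [
    ("artist_a", 1), ("artist_a", 2),
    ("artist_b", 3),
    ("artist_c", 4), ("artist_c", 5),
    ("artist_d", 6),
]
train, test = split_by_artist(clips)
train_artists = {artist for artist, _ in train}
test_artists = {artist for artist, _ in test}
```

Splitting by clip instead would let clips from the same album end up on both sides, which tends to inflate test accuracy because the classifier can latch onto production quirks rather than the tag itself.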

Some thought went into the ranking of the tags on the main search page as well. Since we know the answers for some of the clips in the test set, we ranked the tags by how well their classifier was able to learn them. Actually, we used a Bayesian estimate of the classification accuracy from the beta-binomial model to do the ranking more intelligently. The basic idea is that test accuracy is measured more accurately for tags with a lot of test examples, and less accurately for tags with few test examples. The measured accuracies of the tags are then shrunk towards the overall mean accuracy in proportion to how poorly the model thinks they are estimated. So even though club has a better raw accuracy than rap, it was tested on many fewer examples, so it ends up below rap in the final ranking, i.e. the raw accuracy is more likely a random fluctuation than a meaningful result.
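The shrinkage idea is compact enough to show in a few lines of Python. This is only a sketch under stated assumptions: the counts are invented, and the prior strength (the number of pseudo-examples the prior is worth) is simply picked by hand here, whereas a full beta-binomial treatment would estimate it from the data.

```python
def shrunk_accuracy(correct, total, prior_mean, prior_strength):
    """Posterior mean accuracy under a Beta prior with the given mean and pseudo-count."""
    alpha = prior_mean * prior_strength
    beta = (1.0 - prior_mean) * prior_strength
    return (alpha + correct) / (alpha + beta + total)


# Hypothetical (correct, total) test counts per tag.
results = {
    "rap": (170, 200),
    "club": (9, 10),
    "guitar": (300, 400),
    "drums": (260, 400),
}

# Overall mean accuracy, pooled over all tags, serves as the prior mean.
overall_mean = sum(c for c, _ in results.values()) / sum(t for _, t in results.values())
PRIOR_STRENGTH = 20.0  # assumed pseudo-count, chosen by hand for this sketch

raw_ranking = sorted(
    results, key=lambda tag: results[tag][0] / results[tag][1], reverse=True
)
shrunk_ranking = sorted(
    results,
    key=lambda tag: shrunk_accuracy(*results[tag], overall_mean, PRIOR_STRENGTH),
    reverse=True,
)
```

With these made-up counts, club tops the raw ranking at 90% on only ten examples, but its estimate is pulled most of the way back toward the pooled mean, so rap (85% on two hundred examples) overtakes it in the shrunk ranking.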

So go check out some of the creative ways our players have found to describe music, and describe some music yourself!

The Puzzling Nature of Success in Cultural Markets

Saturday, April 26th, 2008

Matthew Salganik gave a talk in the EE department with this title. He got his PhD in sociology last year under Duncan Watts, studying the (un)predictability of hits, blockbusters, best-sellers, etc. You probably read about it. If not, the basic idea is that they set up a website where people would come and listen to music and examined the influence of popularity on people’s listening habits. We’re not talking millions of songs here, just 48 chosen pretty much at random from unknown bands on PureVolume. Users could listen to any song and then after listening to it had the opportunity to download it.

As users arrived, they were assigned to one of eight completely separate “worlds”. In seven of these worlds, the users could see how many times each song had been downloaded; the last world served as a control in that users couldn’t see how popular each song was. The punchline is that in the worlds where people could influence each other, popular songs were downloaded a lot, but different songs became popular in each world. In the control group, some songs were still downloaded more than others, but the difference wasn’t as striking.

[Figure: popularity vs quality]

The graph from the talk that really stuck with me was this one, taken from their Science paper. It shows the marketshare of each song in the control world versus its marketshare in each of the seven influence worlds. The marketshare in the control world is taken as an un-influenced measure of quality, while the marketshares in the influence worlds are taken as measures of popularity. What you can see is a triangular shape indicating that the “bad” songs were unpopular in all worlds, while the “good” songs were only popular in some of the worlds. Salganik said that this agreed with what people in hit-based industries told them, that it’s easy to predict what won’t be a hit, but hard to predict what will.