Netflix prize competitor: With the best algorithms, metadata becomes worthless

  • Metadata is useful when you need to interpret the results, and most people care about why something is recommended to them. The top Netflix algorithms are black boxes from the user's standpoint, which doesn't help when deciding whether to buy or rent based on a recommendation.

    Compare that with Amazon's approach, where they sacrifice some predictive power for a useful explanation (customers who bought X also bought Y).

  • This is fascinating. I was aware of one half of what the article talks about: my co-author and I broke the anonymity of the Netflix data (see http://www.cs.utexas.edu/~arvindn/ for paper/press links). Our main insight was that everyone's movie-watching behavior is different. The quote "User tastes are infinite shades of grey" in the article just about sums it up perfectly.

    What's funny is that I keep arguing with my friends who are participating in the competition that they should use more metadata. I guess I didn't realize that data mining algorithms actually capture the nuances of user tastes.

  • This reminds me of some of the spam filtering algorithms I've read about.

    You'd think that categorizing spam based on keywords (or sender IP, etc.) would be useful, but machine learning algorithms can pick up subtler language patterns and filter more effectively.

    http://portal.acm.org/citation.cfm?id=1216017&jmp=cit...

  • Metadata is in fact useful, though not the metadata that you might expect. One of the biggest wins for many teams came when they started ranking similarity based on the edit distance between titles.
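
    As a rough illustration of what ranking by title edit distance could look like, here is a minimal sketch. It assumes plain Levenshtein distance normalized by the longer title; the titles and the helper names (`title_similarity`, `most_similar`) are made up for the example, and this is not any team's actual method.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def title_similarity(a: str, b: str) -> float:
    """Map distance to [0, 1]; 1.0 means identical titles (case-insensitive)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def most_similar(query: str, candidates: list[str]) -> str:
    """Return the candidate whose title is closest to the query."""
    others = [t for t in candidates if t != query]
    return max(others, key=lambda t: title_similarity(query, t))


# Hypothetical catalogue, chosen only for illustration.
titles = ["Shrek", "Shrek 2", "Alien", "Aliens", "Wall-E"]
closest_to_alien = most_similar("Alien", titles)
closest_to_shrek = most_similar("Shrek", titles)
```

    The point is that sequels and series tend to share most of a title, so a cheap string measure ends up acting as a "same franchise" signal.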

  • This makes perfect sense; genres and other information about movies are just an approximation of user taste, while the actual tastes of the users themselves are clearly the best data to train your models on, since that's what you have to predict.

    Any good model should be able to derive the relationships between films without knowing them beforehand, solely by using users' choices. And, likely, these relationships will be more useful than any from an external database.
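
    To make "derive the relationships solely from users' choices" concrete, here is a small sketch of item-to-item similarity computed from ratings alone. It assumes mean-centered cosine similarity, which is one common choice and not necessarily what the contestants used; the ratings are invented.

```python
from math import sqrt

# Hypothetical ratings (user -> {movie: stars}), invented for illustration.
ratings = {
    "u1": {"Alien": 5, "Aliens": 5, "Wall-E": 1},
    "u2": {"Alien": 4, "Aliens": 5, "Wall-E": 2},
    "u3": {"Alien": 1, "Aliens": 2, "Wall-E": 5},
    "u4": {"Alien": 2, "Aliens": 1, "Wall-E": 4},
}


def item_vectors(ratings):
    """Mean-center each user's ratings, then pivot to movie -> {user: centered rating}."""
    vectors = {}
    for user, seen in ratings.items():
        mean = sum(seen.values()) / len(seen)
        for movie, stars in seen.items():
            vectors.setdefault(movie, {})[user] = stars - mean
    return vectors


def cosine(a, b):
    """Cosine similarity over the users who rated both movies."""
    shared = a.keys() & b.keys()
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(a[u] ** 2 for u in shared))
    norm_b = sqrt(sum(b[u] ** 2 for u in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = item_vectors(ratings)
sim_sequel = cosine(vectors["Alien"], vectors["Aliens"])  # rated alike by the same users
sim_other = cosine(vectors["Alien"], vectors["Wall-E"])   # rated in opposite directions
```

    No genre label appears anywhere, yet the two films that the same users like come out as similar and the third comes out as dissimilar.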

  • I can see that metadata about the movies becomes worthless, because there is already a wealth of data about each movie entity in the dataset. However, metadata about the users should be fruitful since each user has fairly few data points to use for prediction.

    Take, for example, two users who have each rated only Wall-E, and who both rated it a 5. Now, given Jet Li's "The One", what prediction do you give for each user? It is unlikely that two real people with this one data point on Wall-E would rate "The One" the same way, so any additional data that helps to statistically separate them can only help your case. For example, is the person male or female? What are the person's favorite genres (something Netflix collects)? Even things like whether the person signed up for 6-at-a-time or 2-at-a-time might correlate slightly.
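
    The Wall-E case can be sketched in a few lines. This assumes a hypothetical baseline predictor (movie average plus the user's average deviation) and a hypothetical per-segment offset standing in for user metadata such as favorite genre; every number here is invented.

```python
# Hypothetical per-movie average ratings.
ITEM_MEAN = {"Wall-E": 4.3, "The One": 3.1}

# Hypothetical offsets learned per (user segment, movie) from user metadata.
SEGMENT_OFFSET = {
    ("likes_action", "The One"): 0.6,
    ("likes_animation", "The One"): -0.5,
}


def predict(history, movie, segment=None):
    """Movie average, shifted by the user's average deviation and an optional segment offset."""
    user_bias = sum(stars - ITEM_MEAN[m] for m, stars in history.items()) / len(history)
    score = ITEM_MEAN[movie] + user_bias + SEGMENT_OFFSET.get((segment, movie), 0.0)
    return min(5.0, max(1.0, score))  # clamp to the 1-5 star scale


# Two users with identical rating histories.
alice = {"Wall-E": 5}
bob = {"Wall-E": 5}

# Ratings only: the model cannot tell them apart.
same_a = predict(alice, "The One")
same_b = predict(bob, "The One")

# With a metadata-derived segment, the predictions separate.
split_a = predict(alice, "The One", segment="likes_action")
split_b = predict(bob, "The One", segment="likes_animation")
```

    Any ratings-only model is a function of the rating history, so identical histories must give identical predictions; only extra information about the user can break the tie.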

  • The way I see it, these people have a set of data that you could think of as a line on an x-y plot. This line goes up, down, etc., and there does not seem to be any pattern. So they come up with a bunch of algorithms that get as near to the line as possible, i.e., they approximate the line with an algorithm. From that, they can predict what the next step of the function is going to look like.

    Metadata is like placing some dots on this line and saying "this spot is horror", "this spot is comedy". It becomes irrelevant, because you are already near enough to the line, and that dot does not help you any.

    If I were dealing with this problem, what I would do is break free of these constraints and concentrate on taking the data as an abstract blob of random, then splitting the individual data (i.e., moving data into separate 'dimensions') till I had hundreds of straight lines, and then using those for prediction. But I'm sure they must have tested this already :)

    I'm rooting for the team with the two Jewish guys and the black guy. Afterwards they could get together and make a sitcom. Or a joke.

  • I could see how it's easier to learn users' simple preferences from their voting history, but it's shortsighted to say "all metadata is useless".

    What about deriving statistical information from scripts, reviews, or online forums?

  • I would guess that an SVD/SVM feature extraction of the movie script could be of predictive value.