Monday, March 30, 2009

PODS paper

I'm a co-author of a paper that will be appearing in PODS: An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets, by Kirsch, Mitzenmacher, Pietracaprina, Pucci, Upfal, and Vandin. This version isn't quite the final official version (which is due shortly), but should be much the same. I think this is my first official "database paper", which is very exciting. (I'm grateful to Eli Upfal for inviting me and Adam to work on this project.)

The motivation, roughly, is the following. The setting is a standard data mining problem: we have transactions, where each transaction is a set of items, and we are looking for interesting correlations. Suppose we have 1,000,000 transactions on 1,000 items, where each item appears in 1/1,000 of the transactions. We observe that the pair of items (i,j) appears in 7 transactions. Is this a significant observation? If we compare to a "random model" where each item appears independently in each transaction with probability 1/1,000, then it might appear so; on average, the pair should appear in only 1 transaction. But the problem is that there are almost 500,000 pairs of items. It's not surprising that some pair should appear 7 times; in fact, we'd expect about 40 such pairs. So if we were mining for common pairs, the count alone should not necessarily flag this pair as unusual; it would also matter how many such pairs with this count we've seen.
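As a quick sanity check on these numbers, here is a back-of-the-envelope calculation in Python (mine, not from the paper; it uses the usual Poisson approximation to the binomial count of a single pair):

    from math import comb, exp, factorial

    # Back-of-the-envelope check of the numbers above (all figures illustrative).
    n_transactions = 1_000_000
    n_items = 1_000
    p_item = 1 / 1_000      # each item appears independently with this probability
    threshold = 7           # the observed count for the pair (i, j)

    # Under independence, a specific pair appears in a given transaction with
    # probability p_item**2, so its total count is roughly Poisson(1).
    lam = n_transactions * p_item**2

    # Probability that a single fixed pair is seen at least `threshold` times.
    p_tail = 1 - sum(exp(-lam) * lam**k / factorial(k) for k in range(threshold))

    n_pairs = comb(n_items, 2)   # 499,500 pairs
    print(f"expected count for one pair: {lam}")
    print(f"P(one pair appears >= {threshold} times): {p_tail:.2e}")
    print(f"expected number of pairs appearing >= {threshold} times: {n_pairs * p_tail:.1f}")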

The underlying issue is that we're doing multi-hypothesis testing. We're asking "do any pairs appear a statistically significant number of times?", not "does this specific pair appear a statistically significant number of times?" The first question is naturally harder to answer, since it should also involve the global behavior of all pairs of items, not just the count for each individual pair.

Our theoretical starting point is therefore to look at the distributions we obtain on frequencies of subsets of items under our random model. Let Q_{k,s} be the number of itemsets of size k that appear together in at least s transactions in the random model. We can show that, for sufficiently large s (and under certain other conditions), Q_{k,s} is well-approximated by a Poisson random variable, using the Chen-Stein method. (As with all things probabilistic, that sounds very intuitive if we could just assume independence everywhere, but there are dependence issues to take care of -- hence the Chen-Stein method.) We can then leverage this to design statistical tests for finding interesting itemsets, based in part on how far the corresponding values in our observed data set are from the Poisson distribution.
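To give a flavor of what such a test might look like, here is a rough sketch in Python. To be clear, this is my own simplification, not the test from the paper: I assume every item has the same frequency p, approximate each itemset's support count by a Poisson random variable, and then treat Q_{k,s} itself as Poisson with mean E[Q_{k,s}]. The paper handles general item frequencies and justifies the approximation rigorously via Chen-Stein.

    from math import comb, exp

    def poisson_sf(lam, x):
        """P(Poisson(lam) >= x), via the complement of the lower tail."""
        term, cdf = exp(-lam), 0.0
        for k in range(x):
            cdf += term
            term *= lam / (k + 1)
        return max(0.0, 1.0 - cdf)

    def q_significance(n_items, n_transactions, p, k, s, observed_q):
        """Approximate p-value for observing `observed_q` itemsets of size k
        with support >= s, under the independent-items random model."""
        # A specific k-itemset appears in one transaction with probability p**k,
        # so its support count is approximately Poisson(mu) with mu as below.
        mu = n_transactions * p**k
        tail_one = poisson_sf(mu, s)             # P(one fixed itemset has support >= s)
        lam = comb(n_items, k) * tail_one        # E[Q_{k,s}] under the random model
        return lam, poisson_sf(lam, observed_q)  # treat Q_{k,s} as ~ Poisson(lam)

    # Illustrative numbers (not from the paper): is seeing 60 pairs with
    # support >= 7 surprising, given how many the random model predicts?
    lam, p_value = q_significance(1_000, 1_000_000, 1/1_000, k=2, s=7, observed_q=60)
    print(f"E[Q_(2,7)] ~ {lam:.1f}, p-value for observing 60: {p_value:.3f}")

With the numbers from the example above, the random model predicts around 41 pairs with support at least 7, so observing, say, 60 of them would already be mildly surprising, even though no individual count of 7 is.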

The idea is to expand on the approach of "return all itemsets of size k that appear at least this threshold many times", both so that we have a more rigorous notion of what that threshold should be, and so that we can tell whether there are enough such itemsets that what we're seeing is likely to be important. This appears to effectively cut down on the fraction of spurious itemsets we return, or what the literature calls the False Discovery Rate. Hopefully, this approach will prove interesting enough for further exploration by data miners.
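One simple way to pick such a threshold in this spirit (again a sketch under the same simplifying assumptions as above, with an arbitrary cutoff of 0.05, not the paper's actual procedure): take the smallest support s at which the random model expects essentially no size-k itemsets, so that anything observed at or above s is unlikely to be spurious.

    from math import comb, exp

    def expected_q(n_items, n_transactions, p, k, s):
        """E[Q_{k,s}]: expected number of k-itemsets with support >= s under
        the independent-items model (Poisson approximation for the support)."""
        mu = n_transactions * p**k
        term, cdf = exp(-mu), 0.0
        for j in range(s):
            cdf += term
            term *= mu / (j + 1)
        return comb(n_items, k) * max(0.0, 1.0 - cdf)

    def support_threshold(n_items, n_transactions, p, k, alpha=0.05):
        """Smallest support s at which the random model expects fewer than
        alpha spurious k-itemsets, so anything at or above s stands out."""
        s = 1
        while expected_q(n_items, n_transactions, p, k, s) >= alpha:
            s += 1
        return s

    # With the illustrative numbers from the example, individual pairs only
    # become surprising on their own at a fairly high support level.
    print(support_threshold(1_000, 1_000_000, 1/1_000, k=2))   # prints 11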
