Thursday, September 24, 2009

Extending the Sketching of Sketches Result

I'm putting online a paper with Zhenming Liu and Kai-Min Chung (both graduate students at Harvard) that extends one of the results from Declaring Independence via the Sketching of Sketches by Indyk and McGregor (SODA 2008).

They consider a model where a stream of pairs (i,j) in [n]^2 arrive, giving a joint distribution (X,Y), and the goal is to determine how close the joint distribution is to the product of the marginal distributions under various metrics, which naturally corresponds to how close X and Y are to being independent. All the normal goals in the streaming setting hold (small space, small computation per new pair, one or small number of passes, etc.). We extended one of their main results to higher dimensions (a problem they had left open) : the stream is now k-dimensional vectors in [n]^k, and we wish to approximate the L_2 distance between the joint distribution and the product of the marginal distributions in a single pass. We give a randomized algorithm that is a (1 plus-minus epsilon) approximation (with probability 1-\delta) that requires space logarithmic in n and the stream size m and proportional to 3^k. The paper should be of interest to those who know and like the original Indyk/McGregor paper, which I assume is a signficant part of the streaming community.

The Indyk/McGregor proof required a clever application of the Arithmetic Mean-Geometric Mean inequality. To move the proof up into higher dimensions, we end up having to use the AM-GM inequality twice, along with a clever partitioning of the objects (edges in [n]^k) to get the right constant factor 3^k out. It also seems that this is the "right" dependence on k, at least for this algorithmic approach. (A previous result by Braverman/Ostrovsky also tackled this problem, but had a 2^O(k^2) dependence on k.)

We submitted this to SODA, where it didn't get in, but seemed to be on the borderline. The reviews were on the whole quite reasonable. My take is that in this paper, we (by which, of course, one should read "Kai-Min and Zhenming") took a nice previous SODA result, presented a non-trivial extension, and really wrote it up very clearly and nicely -- I'd like to think we really clearly explained the essence of the argument and why it works. The reviewers, I think, recognized this, but on the other hand had to deal with the issue that it was just a single result. The original paper had more results, and, at the risk (or hope) of instigating commentary, I'll suggest that SODA as a whole is more interested in quantity over writing quality ; a hodgepodge of lemmas and theorems stands a better chance of getting in than a single well-written idea. (The original Indyk/McGregor paper, I should be clear, both has a number of results and excellent ideas, and is not what I'm referring to in this statement.) I understood going in that this would probably be borderline for SODA; I had hoped it would make it in, but this is certainly not a case where I would complain about the decision or the review content by the PC. (One might argue that perhaps SODA should take more papers, and in particular seek out more papers of this type, but that's a different, higher-level argument.)

There are other conferences we could submit to, but the conferences with upcoming deadlines didn't seem particularly suitable. (In particular, they weren't in the US, and would require either extensive travel for me, or visa issues for the students.) Since the result seemed to us essentially done as it was, we're submitting to a journal. But it's available now for anyone who is interested.


Daniel Lemire said...

Sad to see that your paper will only be on the radar of the 1,213 Google-Reader subscribers to your blog, before it ends up into a journal.

Michael Mitzenmacher said...

Daniel --

You raise an interesting point. Suppose instead that I sent my paper to a "2nd tier" conference. Would more people know about from it appearing in the conference, or from other means of promoting it (such as discussing it on my blog, giving talks about it, etc.). It's not clear to me what the answer is. (I think that appearing in SODA would certainly have given the paper more visibility.)

I know there are enough conferences now that it's hard to keep up with all the papers that are out there, and much of my focus goes to "themes" -- for example, I pay attention to papers on cuckoo hashing, and interesting papers on Bloom filters -- and to individual authors whose work I find important. The conference structure provides some useful information in categorizing what I want to read, but I think much less than it did when I was a graduate student.

AC said...

Well done! I had spent some cycles thinking about this problem. Pity SODA didn't take it.

Amit Chakrabarti

aram harrow said...

You can do something similar for l_1 distance using entropy estimation, I think.
e.g. use the techniques of arXiv:0908.3961 to estimate the entropy of X,Y,Z,... as well as the joint random variable XYZ.

If the difference is epsilon then the l_1 distance is at least sqrt(2 epsilon). The upper bound on l_1 distance is only O(epsilon * log(alphabet size)), though

Michael Mitzenmacher said...

The L1 case has been studied in other papers -- the most recent work I know of is