I'm posting online the "final version" of a paper that will be appearing at KDD, On Compressing Social Networks; the collaboration came out of a visit to Yahoo last summer.
(Co-authors: Chierichetti, Kumar, Lattanzi, Panconesi, and Raghavan.)
The paper definitely tries to look at both the "theory" and "practice" of compressing social network graphs. On the theoretical side, one contribution is that we formalize arrangement problems that are variations of the standard minimum linear arrangement problem that naturally capture features of the underlying compression problem. In the linear arrangement problem, we are given a graph G, and we wish to lay the nodes via a permutation mapping pi:V->[1,n] out on the number line (in positions 1 to n, the number of nodes) in such a way as to minimize sum_{i,j in E} |pi(i) - pi(j)|.
But when we're trying to compress edges in the graph, and we're given an ordering of the nodes, the "cost" of compressing an edge is the cost of the pointer from one vertex to another, and as the number giving the gap between the positions i and j in the node ordering is |pi(i) - pi(j)|, the cost in bits for representing an edge from i to j is (essentially) log_2 |pi(i) - pi(j)|. This suggests considering the minimum logarithmic arrangement problem, where the goal is to find an ordering to minimize sum_{i,j in E} log |pi(i) - pi(j)|. (More generally, one could consider other functions besides the logarithm of the gap --which could naturally correspond to different encodings of the natural numbers in this context -- but this seems the most natural.) We show that (in multigraphs) this problem is NP-hard. I admit, having pondered over this a bit, I wish we had a bit "nicer" of a reduction. In particular, we don't have hardness of approximation results (or an approximation algorithm), which would be a step forward. We also consider a more complicated variation of the arrangement problem that corresponds naturally to compressing successive gaps instead of individual gaps.
On the practical side, we explore several algorithms for compressing social networks, and see how they do on real data. Our main innovation here is to come up with a quick and useful method for generating an ordering on the social network nodes so that nodes with lots of common neighbors are near each other in the ordering, a property which is useful for compression. For Web graphs, this has generally been done using the external information of the URLs; sorting by URL naturally gives an ordering where nodes with common neighbors are near each other. In social networks, there are possible pieces of external information we could use -- zip code, or order of entry into the social network -- but these are much less reliable. We find a "shingle ordering" using the ideas used for document similarity based on min-wise independence; basically, we pick a (first) random ordering of the nodes, and then group nodes according to their first neighbor in this random ordering. Two nodes i and j with neighbor sets N(i) and N(j) are then in the same group with probability equall to the Jacard coefficient |N(i) intersect N(j)| / |N(i) union N(j)|; orders within the groups can be handled according to external information, a second random ordering, etc. This works quite well. (The best comparable technique we know of, which also works quite well, is to order nodes by Gray-code orderings of their neighborhoods; see this paper by Boldi, Santini, and Vigna.)
I enjoyed working on this paper quite a bit, for several reasons. Micah Adler and I wrote one of the early papers on Web graph compression, utilizing the idea of finding nodes with similar link structures; it was nice to return to this theme. I think the paper ended up being a good mix of theoretical framework and experiments based on implementation. It's also just pleasant when my regular trips out to California lead to concrete results -- it gives me a good excuse to continue taking regular trips to California!
Monday, May 04, 2009
Subscribe to:
Post Comments (Atom)
3 comments:
I think "sum_{i,j in V} |pi(i) - pi(j)|" should be sum_{(i,j) in E}, and similarly for the min log arrangement problem.
Oops, fixed.
Hi Mike, what about trips to Roma? ;-)
Cheers
--A
Post a Comment