A shout-out today to my friend and colleague Pardis Sabeti (and her lab) for their Science article on the Ebola virus that appeared earlier today. Pardis and her group have been studying the genetics of viral diseases, in particular the Lassa virus, so they were there and ready to go to work when the recent Ebola outbreak began. They sequenced 99 Ebola virus genomes from 78 patients and analyzed the resulting information to gain insight into how the disease is spreading and mutating. They have released the genetic sequences to the public so that the information is available to groups who could use it to help find and test cures. This is both very important and (sadly) very serious work. As reported, 5 co-authors of the study (health care workers, a lab technician, and the director of the national Lassa fever program in Sierra Leone) died of Ebola before today's publication.

Numerous news articles appeared today discussing this work, too many for me to link to. But the SabetiLab page here contains a great deal of information about her research and various projects. Her work in biology, which makes powerful use of statistics and computation, should serve as an inspiration for future scientists and for people who want to use algorithmic and mathematical methods in ways that benefit the world.

## Thursday, August 28, 2014

## Wednesday, August 27, 2014

### Update on SOCG and ACM

I am happy to have an update on the SOCG/STOC colocation issue that arose last week. Or, better said, Jeff Erickson has an update, the short summary of which is that it looks like there has now been some very useful clarification. The concerns of the ACM apparently are limited to direct financial support, and the conferences can (formally) co-locate. I encourage you to read the note on Jeff's blog from Paul, and let me echo Jeff's statement here:

"Needless to say, this is fantastic news! I want to publicly thank Paul and the ACM leadership for quickly clearing up this unfortunate misunderstanding."

So say we all.


## Tuesday, August 26, 2014

### New Article on arxiv on Equitability and MIC

We recently put on arxiv a new draft on "Theoretical Foundations of Equitability and the Maximal Information Coefficient". This is some follow-on work to a paper that appeared in Science a couple of years ago, where we introduced the idea of equitability. Essentially, in that Science paper (link to page where you can access the paper), we wanted a statistic that would give back, for samples from a noisy functional relationship, a score corresponding to the amount of noise (or, in that case, to the R^2 of the noisy data relative to the relevant noiseless function), regardless of the relationship type. The idea was that this would be useful in data exploration settings, where we might have a large number of possible relationship pairs and in particular a number of non-trivially correlated relationships, and we'd want to score them in some fair way across the possible types of relationships (linear, parabolic, sinusoidal, etc.), so that we could choose the most promising to look at. We also wanted the statistic to do reasonable things for non-functional relationships. And, finally, we wanted a pony. (But we couldn't find a way to put that in the paper.) The maximal information coefficient (MIC), which we built on top of mutual information, was our proposed statistic.
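To give a flavor of the kind of statistic involved, here is a toy, simplified MIC-style score. This is my own illustrative sketch, not the algorithm from either paper: it maximizes normalized mutual information over equal-width grids only, whereas the real MIC optimizes over where the grid lines go as well. All names and parameters here are made up for the demo.

```python
import math
import random
from collections import Counter

def mutual_information(cells, n):
    """Mutual information (in bits) of the empirical joint distribution
    given by nonzero cell counts over a 2D grid, with n total points."""
    row, col = Counter(), Counter()
    for (i, j), c in cells.items():
        row[i] += c
        col[j] += c
    mi = 0.0
    for (i, j), c in cells.items():
        p = c / n
        mi += p * math.log2(p / ((row[i] / n) * (col[j] / n)))
    return mi

def toy_mic(xs, ys, max_bins=8):
    """Toy MIC-style score: maximize normalized mutual information over
    equal-width bx-by grids with 2 <= bx, by <= max_bins. (The real MIC
    also optimizes grid line placement; this sketch does not.)"""
    n = len(xs)
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    best = 0.0
    for bx in range(2, max_bins + 1):
        for by in range(2, max_bins + 1):
            cells = Counter()
            for x, y in zip(xs, ys):
                i = min(int((x - xmin) / (xmax - xmin) * bx), bx - 1)
                j = min(int((y - ymin) / (ymax - ymin) * by), by - 1)
                cells[i, j] += 1
            # MI <= log2(min(bx, by)), so this normalizes into [0, 1]
            best = max(best, mutual_information(cells, n) / math.log2(min(bx, by)))
    return best

random.seed(0)
xs = [random.uniform(0, 1) for _ in range(500)]
clean = [math.sin(6 * x) for x in xs]              # noiseless sinusoid
noisy = [y + random.gauss(0, 0.5) for y in clean]  # same relationship plus noise
print(round(toy_mic(xs, clean), 2), round(toy_mic(xs, noisy), 2))
```

In the spirit of equitability, the noiseless samples should score higher than the noisy ones, and the score should not care that the underlying function is a sinusoid rather than, say, a line.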

The paper has gotten some interest. One thing that we heard was that people wanted a richer theoretical framework for these ideas. So now we're finally delivering one. It took a while, because the students involved -- Yakir Reshef and David Reshef -- were off doing crazy, wacky young-people things like going to medical school, making it hard to get cycles for the project. On the other hand, the time did some good, allowing us to explore and determine the formulation we wanted. The result is, I hope, an interesting mix of ideas from statistics and computer science. We're eager for feedback, as we hope to formally submit somewhere soon.

In a couple of weeks we should have another paper out on the same topic that is more empirical. Naturally, when working through the theory, we came up with better algorithms for computing MIC, and it made sense to separate those results (and some others) into another paper.


## Saturday, August 23, 2014

### Is the ACM "Retaliating" Against SOCG?

Friday afternoon Jeff Erickson posted some "bad news" at the Making SOCG blog. Some background: very recently, the Symposium on Computational Geometry, or SoCG, decided to leave the ACM, for various reasons. There had been plans in the works for SoCG to be co-located with the Symposium on the Theory of Computing, or STOC, one of the flagship general theory conferences, in 2016. STOC is an ACM conference. Reportedly, Paul Beame, chair of SIGACT (the ACM theory special interest group), sent a note that included the following (emphasis added by Jeff):

"With SoCG leaving ACM we have been told that SIGACT and ACM conferences cannot have any formal arrangements at all with the new conference or do anything official that might support it. (This decision was made at the very top of ACM and not by the staff.) This rules out any joint sessions... It also means that SIGACT will have to end our participation in this formal coordinating group."

While I await more information and do not want to rush to judgment, if this description is true, I find it unacceptable behavior on the part of the ACM. Their job is to promote computer science. (The ACM home page begins with: "ACM, the world’s largest educational and scientific computing society, delivers resources that advance computing as a science and a profession.") Their job is not to try to monopolize and control the computer science conference market.

I encourage everyone to read Jeff's post (here's the link again). Obviously, this is an issue for Computational Geometry, and an issue for Theory. But I would say it's an issue for all ACM members, and in particular those who publish research at ACM conferences. It would be nice if, should it become necessary, members from other areas in computer science voice their displeasure with ACM's actions in this case.

## Thursday, August 21, 2014

### Hashing Summer School

Back in July I took part in the Hashing Summer School in Copenhagen. This was nominally set up by me, Rasmus Pagh, and Mikkel Thorup, though Mikkel was really the host organizer that put it all together.

The course materials are all online here. One thing that was a bit different is that it wasn't just lectures -- we really did make more of a "summer school" by putting together a lot of (optional) exercises, and leaving time for people to work through some of them in teams. I am hoping the result is a really nice resource. There are lectures with the video online, and also the slides and exercises. Students could go through whatever parts they like on their own, or people might find the material useful in preparing their own lectures when teaching graduate-level topics in hashing.


## Tuesday, August 19, 2014

### Reviewing Scales

I'm just about finished reviewing for CoNEXT (Conference on Emerging Networking Experiments and Technologies), and am starting reviewing for ITCS (Innovations in Theoretical Computer Science). One notable variation in the process is the choice of the score scale. For CoNEXT, the program chairs chose a 2-value scale: accept or reject. For ITCS, the program chair chose a 9-point scale. Scoring from 1-9 or 1-10 is not uncommon for theory conferences.

I dislike both approaches, but, in the end, believe that it makes minimal difference, so who am I to complain?

The accept-or-reject choice is a bit too stark. It hides whether you generously thought this paper should possibly get in if there's room, or whether you really are a champion for the paper. A not-too-unusual situation is that a paper gets (at least initially) a majority of accept votes -- but nobody really likes the paper, or has confronted its various flaws. (Or, of course, something similar the other way around, although I believe the first case is more common, as it feels better to accept a close call than to reject one.) Fortunately, I think the chairs have been doing an excellent job (at least on the papers I reviewed) encouraging discussion on such papers as needed to get us to the right place. (Apparently, the chairs aren't just looking at the scores, but reading the reviews!) As long as there's actual discussion, I think the problems of the 2-score solution can be mitigated.

The 9 point scale is a bit too diffuse. This is pretty clear. On the description of score semantics we were given, I see:

"1-3 : Strong rejects".

I'm not sure why we need 3 different numbers to represent a strong reject (strong reject, really strong reject, really really strong reject), but there you have it. The boundaries between "weak reject", "a borderline case" and "weak accept" (scores 4-6) also seem vague, and could easily lead to different people using different interpretations. Still, we'll see how it goes. As long as there's good discussion, I think it will all work out here as well.

I prefer the Goldilocks scale of 5 values. I further think "non-linear" scoring is more informative: something like top 5%, top 10%, top 25%, top 50%, bottom 50%, but even scores corresponding to strong accept/weak accept/neutral/weak reject/strong reject seem more useful when trying to make decisions.

Finally, as I have to say whenever I'm reviewing, HotCRP is still the best conference management software (at least for me as a reviewer).


## Monday, August 18, 2014

### Back to Work

Harvard classes start up in a few weeks, and officially, my sabbatical is over. I'm back in my office, trying to get back into a Harvard routine.

I notice that I've been very light in posting over my sabbatical. After my term as chair, I was enjoying being in the background, hidden away a bit. I'm not sure if I'll get back into blogging -- it seems already to be a technology of the past -- but I figure I'll start again and see what happens.

So some short notes. On my to-do list is to go cover to cover through Ryan O'Donnell's new book Analysis of Boolean Functions; Cambridge University Press was nice enough to send me a free copy, which they do with books from time to time. For those who have been following, he's been releasing the book in chapters online, and you already know it's good. He's made the book available online as well, but it's nice to have a copy for my bookshelf. It's a beautiful book, both in content and in how it's all put together. My one thought (so far) as I've started my way through is that, stylistically, to me, it reads like a "math book" more than a "CS book", whatever that means. That's not meant to be a complaint, just an observation.

Boaz Barak accidentally made me laugh on his Updates from ICM 2014 post, well worth reading, when he writes:

"Candes’s talk was an amazing exposition of the power and importance of algorithms. He showed how efficient algorithms can actually make the difference in treating kids with cancer!....

Hearing Candes’s talk I couldn’t help thinking that some of those advances could perhaps have been made sooner if the TCS community had closer ties to the applied math community, and realized the relevance of concepts such as property testing and tools such as the Goemans-Williamson algorithm to these kinds of questions. Such missed opportunities are unfortunate for our community and (given the applications) also to society at large, which is another reason you should always try to go to talks in other areas."

I think the larger issue is the slow but (over long periods) not really subtle shift of the TCS community away from algorithmic work and practical applications. I'm all for going to talks in other areas, but I think the issue is a larger-scale problem.

I'm working on a new class this semester, which I'm sure I'll write more about, but one thing I'd forgotten is how hard and time-consuming it is to construct a lecture. Maybe some of it is a function of me getting slower, but going through all the possible pieces, picking what you think are the right ones, making sure you've got all the details, and then (for me) writing a "script" of what you plan to go over -- it takes time. (Good thing I'll likely be on this new class for a few years.)

Plenty more to talk about -- reviewing, some new/old papers, some summer travel. So we'll see if I can get back into a blogging state of mind.


## Friday, June 20, 2014

### See You in Prague

For those of you going to SPAA this coming week, I'll see you there. I'll be giving the last two talks at the conference, to what I expect (based on the timing) will be a nearly empty room. That just means there will be no pressure.

If you want to hear more about the papers, you can go to Abstract Talk, where I talk about the papers. Here is the link for the paper Balanced Allocations and Double Hashing, and the link for the paper Parallel Peeling Algorithms. I haven't done podcasts for Abstract Talk before, so be forgiving if you go to listen. It seems like a cool idea; what do people think of it in practice?

For those who expect to be sightseeing in Prague during the final session (or who just aren't going to SPAA), here's the brief overview.

For Balanced Allocations and Double Hashing:

In the well-known balanced allocations paradigm, balls are hashed sequentially into bins, where each ball gets d random choices from the hash functions, and is then placed in the least loaded. With double hashing, we replace the d random choices with d choices of the form a, a+b, a+2b, a+3b,... a+(d-1)b, where a and b are random values (determined by hashing). That is, we build the d choices from 2 random numbers instead of using d random numbers. (The numbers are taken mod the size of the hash table, and b should be relatively prime to the hash table size... let's stop worrying about details.) We find empirically that this makes no difference, in a very strong sense; the fraction of bins with load j appears the same for every value of j for both systems, so you can't really tell them apart. We provide a theoretical explanation, based on fluid limit models, for why this happens.
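A quick simulation along these lines (my own hedged sketch, not the code from the paper; the table size, number of balls, and seed are arbitrary) shows the two schemes behaving essentially identically:

```python
import math
import random

def balanced_allocation(n, m, d, double_hash, seed):
    """Throw m balls into n bins. Each ball gets d candidate bins and goes
    to the least loaded one. With double_hash=True the candidates are
    a, a+b, ..., a+(d-1)b mod n (with b relatively prime to n); otherwise
    they are d independent uniform choices."""
    rng = random.Random(seed)
    load = [0] * n
    for _ in range(m):
        if double_hash:
            a = rng.randrange(n)
            b = rng.randrange(1, n)
            while math.gcd(b, n) != 1:  # keep b relatively prime to n
                b = rng.randrange(1, n)
            choices = [(a + k * b) % n for k in range(d)]
        else:
            choices = [rng.randrange(n) for _ in range(d)]
        best = min(choices, key=lambda c: load[c])
        load[best] += 1
    return load

n = m = 100000
for dh in (False, True):
    load = balanced_allocation(n, m, d=3, double_hash=dh, seed=1)
    fracs = [round(load.count(j) / n, 3) for j in range(5)]
    print("double hashing" if dh else "independent   ",
          "max load:", max(load), "fractions with load 0..4:", fracs)
```

The printed fractions of bins with each load should line up closely between the two runs, which is the empirical observation the paper sets out to explain.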

For Parallel Peeling Algorithms:

The analysis of several algorithms and data structures can be framed as a peeling process on a random hypergraph: vertices with degree less than *k* are removed until there are no vertices of degree less than *k* left. The remaining hypergraph is known as the *k*-core. We analyze parallel peeling processes, where in each round, all vertices of degree less than *k* are removed. It is known that, below a specific edge density threshold, the *k*-core is empty with high probability. We show that, with high probability, below this threshold, only *O(log log n)* rounds of peeling are needed to obtain the empty *k*-core for *r*-uniform hypergraphs. Interestingly, above this threshold, *Ω(log n)* rounds of peeling are required to find the non-empty *k*-core. Since most algorithms and data structures aim to peel to an empty *k*-core, this asymmetry appears fortunate; nature is on our side. We verify the theoretical results both with simulation and with a parallel implementation using graphics processing units (GPUs). Our implementation provides insights into how to structure parallel peeling algorithms for efficiency in practice.

The Parallel Peeling Algorithms paper was, to my surprise, awarded Best Paper. Maybe I'll write up more about the surprise at some point, but I'd certainly like to thank the committee for the honor, and pass the credit to where it is due, with my co-authors Jiayang Jiang and Justin Thaler.
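The parallel peeling process itself is easy to simulate. Here is an illustrative sketch of my own (not the paper's GPU implementation), using 3-uniform hypergraphs with k = 2, where the empty-core threshold is known to be roughly 0.818 edges per vertex; the specific densities and sizes below are just demo choices.

```python
import random

def parallel_peel(n, edges, k):
    """Peel a hypergraph in parallel rounds: in each round, remove every
    remaining vertex of degree < k along with its incident edges.
    Returns (number of rounds, whether the k-core is empty)."""
    deg = [0] * n
    incident = [[] for _ in range(n)]
    for idx, e in enumerate(edges):
        for v in e:
            deg[v] += 1
            incident[v].append(idx)
    alive = [True] * n
    edge_alive = [True] * len(edges)
    rounds = 0
    while True:
        removals = [v for v in range(n) if alive[v] and deg[v] < k]
        if not removals:
            break
        rounds += 1
        for v in removals:          # mark the whole round's removals first,
            alive[v] = False        # so the round really is "parallel"
        for v in removals:
            for idx in incident[v]:
                if edge_alive[idx]:
                    edge_alive[idx] = False
                    for u in edges[idx]:
                        if alive[u]:
                            deg[u] -= 1
    return rounds, not any(alive)

random.seed(2)
n = 20000
# c = 0.7 edges per vertex is below the ~0.818 threshold; c = 1.0 is above it.
for c in (0.7, 1.0):
    edges = [tuple(random.sample(range(n), 3)) for _ in range(int(c * n))]
    rounds, empty = parallel_peel(n, edges, k=2)
    print(f"c={c}: rounds={rounds}, core empty={empty}")
```

Below the threshold the core should come up empty after a handful of rounds; above it, peeling stops with a non-empty 2-core, matching the asymmetry described in the abstract.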


### NSF Thanks for the Year

It's that time of year where, as a background process, I have to do my annual reviews for the NSF. It's generally not a very exciting task, and their online forms remain, I think, unpleasantly designed. (I think they fixed a problem I had last year, so at least now they point out to you what you haven't filled out more clearly.)

That being said, this year, even more than others, I find myself gratefully filling them out. The NSF provides the money for my (and my students') research. And research is fun. In my few years playing administrator, I was trying to keep up with my research, but inevitably my available time and corresponding output started to decline. This year, I've been able to "get back into it", and it made me realize how much I enjoy it. Sure it's often frustrating. And writing it down (and dealing with conferences, reviews, etc.) can also be frustrating. But overall the creation process of research, including all the frustrating parts, is the most enjoyable part of the job*, and I'm glad that the NSF is there to support it.

Thanks NSF. I'll try to have those reports in well before the nominal deadline....

[* Yes, I like teaching and interacting with students too -- otherwise I'd look for work in one of the many great research labs -- that's the other most enjoyable part of the job.]


## Monday, June 16, 2014

### Sad News: Berthold Vöcking

I have just seen the news that Berthold Vöcking passed away. For those who didn't know him, Berthold was an outstanding researcher in algorithms. We had several shared interests and were of a similar age; I always enjoyed his work, and very much enjoyed the few times that I worked with him. His paper How Asymmetry Helps Load Balancing still amazes me, and is one of the most wonderful papers that I wish I had written. When I first saw it I simply didn't believe the main result*; I stopped my other work, wrote up a simulation, and found it matched his analysis. I then had to spend the next few days understanding it. Once I understood it, I wondered how he had seen it, and how I had missed it. It is a truly inspired paper.

Of course he has other great papers, including the well known Tight Bounds for Worst-Case Equilibria, and maybe the less well known but very interesting (particularly to me) work on Balanced Allocations: The Heavily Loaded Case. He had just won a best paper award for ICALP for work on Online Independent Sets. He was one of the editors of Algorithms Unplugged, a wonderful book project.

Berthold was regularly on my list of people to invite to workshops, because he always had very interesting work to present and was interesting to talk to. I can't believe I won't have another chance to talk with him. His passing is a loss to our community.

* For those who care, his result was that when hashing using the "power of two choices", suppose that instead of having each new item make two random choices and then placing the item in the least loaded, you split the table into two equal halves (call them left and right), have each item make a random choice from each half, and then place the item in the least loaded, except that you always break ties to the "left half". There wouldn't seem to be any difference between the two schemes, but there is; the split-and-break-ties approach works significantly better.
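A rough simulation of the two schemes makes the effect visible. This is an illustrative sketch under my own choice of parameters, not Berthold's analysis or code; averaging the maximum load over a few trials tends to show the split-and-break-ties variant coming out ahead.

```python
import random

def max_load_two_choice(n, m, rng):
    """Standard power of two choices: two uniform bins, pick the less loaded."""
    load = [0] * n
    for _ in range(m):
        a, b = rng.randrange(n), rng.randrange(n)
        load[a if load[a] <= load[b] else b] += 1
    return max(load)

def max_load_go_left(n, m, rng):
    """Voecking's variant: one choice from each half of the table,
    with ties always broken to the left half."""
    half = n // 2
    load = [0] * n
    for _ in range(m):
        a = rng.randrange(half)            # left-half choice
        b = half + rng.randrange(half)     # right-half choice
        load[a if load[a] <= load[b] else b] += 1  # <= breaks ties left
    return max(load)

rng = random.Random(3)
n = m = 200000
trials = 5
avg_two = sum(max_load_two_choice(n, m, rng) for _ in range(trials)) / trials
avg_left = sum(max_load_go_left(n, m, rng) for _ in range(trials)) / trials
print("two-choice average max load:", avg_two)
print("go-left    average max load:", avg_left)
```

Asymptotically the improvement is a constant factor in the maximum load (and it grows with the number of choices d), so at these sizes the difference shows up as roughly one unit of load, just as in the surprising result described above.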

