Thursday, November 13, 2014

The News Is Out

Harvard Computer Science was given a sizable gift from Steve Ballmer;  we're excited at the opportunity to grow this gift will provide us over the next several years.  For more news, see the Crimson article, the Boston Globe article, the New York Times blog writeup, etc. 

Monday, November 10, 2014

Veterans Day

Veterans Day is a federal holiday, but, at Harvard classes go on.  This year, the holiday is observed for staff, but we'll be teaching.

I can't easily find out when the practice of having classes on Veterans Day became standard.  I found this Crimson article from 2004, but it makes it sound like holding classes (it seems to just mention sections, not actual lectures) was not official policy.  I also am not sure when this dichotomy of the day being a staff holiday, but classes are held, started.  I'm a bit embarrassed to say I haven't noticed this scheduling issue until this year, so perhaps the changes are new.

I understand that, from a teaching and scheduling standpoint, it's a bit annoying with the holidays messing up the teaching schedule.  It's also annoying when your children are out of school for the day for a federal holiday while you have to teach.  Also, it seems a bit odd, if not somewhat unpatriotic, "cancelling" Veterans Day for the students (and faculty), although I know that most people generally do not engage in veteran-specific activities for the holiday.  (That's true of our family, and my wife is a veteran.)    

I suppose I might not really have thought too much about it, except that somehow I noticed that Columbus Day is a university holiday at Harvard, and, because I'm sure most readers do not get enough John Oliver in their diet, here what his show has to say about that.  

Happy Veterans Day.


Saturday, November 01, 2014

Weekend Links (November 1 Edition)

A variety of links on related news.

Ebola #1 (local):  The HackEbola with Data website gives detail for the upcoming Harvard Hacking event in conjunction with Statistics without Borders.  If doing some data mining for a cause is your thing, please check it out.  (No, you don't have to be associated with Harvard to participate...)

Ebola #2:  New Yorker magazine has an article out on "The Ebola Wars" that among other things, describes the work done by Harvard colleague Pardis Sabeti and her lab

Ranking again (no, not again!):  This one is tells the story of King Abdulaziz University and the US News and World Report Rankings.  (I note, as follow up from a few years ago, that I remain disturbed that nobody is offering to pay me large sums just to have me affiliated with them for the purposes of raising their citation count.)

Crimson #1:  If you're into this sort of thing, the umpteenth "Stanford vs. Harvard" article.  (Yawn.)

Crimson #2:  Anyone know of a good Dean who might be available?  

Yale:  My take on how you grow a department:  keep telling people you are going to/need to grow the department, and why.  Lather, rinse, repeat.  I hope this article is just the start. 

Tuesday, October 28, 2014

Rankings Everywhere

Like many (theoretical) computer scientists, I spent far, far too much time this weekend thinking about rankings.  Well, really, it was more that I was being far, far too amused by the various posts and comments on the novel ranking site "Ranking of CS Departments based on the Number of Papers in Theoretical Computer Science",  which led to posts at the Computational Complexity blog and in theory, the first of which shut down the comments rather quickly, and the second of which is at this time nearing the amazing 100 comment line.  (Luca's post deserves it -- it was the funniest thing I've seen all week.  Kudos.) 

Now, to fan these flames to further untold heights, I was informed that new US News and World Report Computer Science Rankings are out.  Needless to say, I disagree with them.  How exactly can Harvard be second to MIT?  Ridiculous.

(For those who are concerned about such things, the methodology is also given, though I haven't checked to see what parts of it are reproducible.  And what is that "Web Of Science" thing?  Is it as good as the citation tools in Google Scholar?)

For the ultimate in ranking tongue-in-cheek-iness, and for those who haven't seen it, I also recommend the NRC Deranker, which will (eventually) provide you the ranking you personally agree most with. 


Monday, October 27, 2014

Welcome, Attorneys!

You might wonder why I am welcoming attorneys to the blog.  Well, I happen to know that some attorneys read my blog.  I know this because recently, as an expert witness, while I was testifying in court, the attorney on the other side during my cross-examination specifically asked me about my blog.  Based on their questions, I am reasonably sure they had downloaded all the posts.  I imagine some poor associate attorney low on the totem pole had to read through it all to see if they could find anything embarrassing they could use in the context of the case. 

I don't think it was successful.  But who knows.  At one point, the attorney cross-examining me asked if one of the reasons I liked to do consulting work was because it pays well.  I answered, "I would say that's one of the reasons."  That wasn't a hard answer to give;  it's an honest answer.  But afterward I recalled that, years ago, I had talked about consulting on my blog.  I looked it up, and sure enough, here is what I wrote years ago:
I enjoy consulting, for several reasons:
  1. It fulfills a need to be more directly relevant to the real world.
  2. It gives me a chance to work with different sets of people.
  3. It allows me to learn new things and exercise skills that I don't necessarily use as a professor.
  4. It pays well.
Based on my experience as a witness, I am of very high certainty that the attorney had this post in front of him when he asked the question.  Of course, when he asked the question, he failed to note the other reasons I had given for consulting, or provide me the blog post for context.  But my assumption is that they were simply looking for a "gotcha" moment.  Perhaps the question and my response made me look bad to the jury.  (After subsequent clarifying questions on the issue from the client's attorney, I would like to believe it had no effect.)  I imagine that they were hoping that I would say that the pay wasn't important, in which case I am sure they would have readily put up the post from my blog to show the jury to try to discredit me.   

I talk very little about my legal consulting on the blog or publicly.  The main reason, obviously, is out of respect for and professional obligation to my clients.  But I find I don't blog about it even in very general terms, where confidentiality or other similar issues would not come into play.  And one good reason not to is that an attorney could use anything I write on the subject to try to impeach my credibility as a witness, even if it is out of context. 

Of course, as this example shows, anything I post online can potentially later be used against me in a case or in court.  Or in other situations.  These issues regarding our permanent digital identity, and how modern technology allows our past to easily be brought up and used against us, are not new;  I'm not suggesting I'm offering novel insight here.  Really, I'm just relaying an instance where the issue of my digital identity, in the context of this blog, hit me head on.  Certainly when I wrote that post years ago about consulting I did not think it would come up later in this way.  Mostly it makes me wonder how my children, who are busy exploring their own identities (digital and otherwise), will manage in a world where most of their history will be available online, for people to use or misuse in ways that I could not imagine.     

Friday, October 24, 2014

Harvard Junior Faculty Job Search

Harvard Computer Science is doing a junior faculty search this year.  We pretty much have been doing one every year, but it's nice to be able to make an announcement and point people to the ad.

So here is the ad, telling you what you need so send in if you're applying. 

Also as usual, we're looking in all areas, because we're always interested in great people.  However, in the interest of full disclosure, the focus this year is described as
This is a broad faculty search and we welcome outstanding applicants in all areas of computer science, including applicants whose research and interests connect to such areas as engineering, health and medicine, or the social sciences. Of particular interest are candidates with a focus on systems and data science, including operating systems, networking, software engineering, data storage and management, and computational science.
We'll look forward to your application.

Wednesday, October 22, 2014

Cuckoo Filters

An upcoming paper appearing at CoNext will be of interest to any Bloom Filter user or aficionado:

Cuckoo Filter:  Practically Better than Bloom
(Bin Fan, David Andersen, Michael Kaminsky, Michael Mitzenmacher)

Let me describe a cuckoo filter and some of what's in the paper for you.  If you want to avoid a technical discussion, all you need to know is that for reasonably large sized sets, for the same false positive rate as a corresponding Bloom filter, cuckoo filters use less space than Bloom filters, are faster on lookups (but slower on insertions/to construct), and amazingly also allow deletions of keys (which Bloom filters cannot do).  If you want to look at code, there's even a github repository for you with code for cuckoo filters. 

Note:  I'll assume you probably know what a Bloom filter is in the discussion that follows.  As a reminder, it's a hash-based data structure for set membership that can give false positives;  by allowing false positives, you can save space in your representation of the set.  I'll assume you probably know what cuckoo hashing is.  As a reminder, in cuckoo hashing, you have buckets that can contain a fixed number of keys, and each key gets some number d of buckets (obtained by getting d values from hashing the key) where it can be placed.  When you insert a new key, if all of its buckets are full, you can move keys out of that bucket to one of its other buckets to make room.  Since apparently there are drinking games that exist based on taking a drink whenever I use the word "Bloom filter" or "cuckoo hashing" in a talk, I imagine most readers of the blog know about these things.  (Or, always, Wikipedia.)

The framework used in the paper comes from the following idea.  One way of implementing Bloom filter functionality when given a static set of items is to use a perfect hash function;  that is, as hash function that maps the n elements of your set into an array of size n in a 1-1 fashion.  Then, you use another hash function to store a fingerprint of the element instead of the element itself in the array location.  To do a set membership test on an element z, you hash z once to find its location in the array, and hash z again to find its fingerprint, and check against the fingerprint in the array location.  For elements not in the set, you get a false positive rate of 2^{# of fingerprint bits}.

The problem is that the perfect hashing framework doesn't allow you to insert new items naturally.  Or do deletions (but standard Bloom filters don't allow deletions also -- you need a counting Bloom filter for that).  So a natural thing to do is to think of using some other sort of hash table, besides a perfect hash table, and see what happens.  This idea was explored way back in for instance some papers I worked on in 2006 (here and here) using "d-left" hash tables, which are based on a multiple choice hashing scheme.

If you're going to use multiple choice hashing schemes, though, you should think about using cuckoo hashing.  The ability to move keys around means you should get better space utilization;  for example, even with 2 choices, if your buckets can hold 4 items, cuckoo hashing can get you about 95% space utilization.  The problem with cuckoo hashing in this setting is that, for a Bloom filter, you want to just keep fingerprints of keys, not the keys themselves.  So, when you want to move the key, how do you figure out where to move it to -- you no longer have the key to hash?

The approach used in the paper is a trick called partial-key cuckoo hashing.  To get two buckets for our key, we do a hash to get the first bucket (idealized to be uniformly at random), but to get the second bucket, we do not hash the key, but just the fingerprint of the key;  we XOR that value with the value for the first bucket to get the second.  The upside of this approach is given a bucket and a fingerprint in the bucket we can find the other bucket corresponding to the fingerprint.  (XORing the hash of the fingerprint works if the fingerprint is in the first bucket or the second, as is easily checked.)  The downside is if our fingerprint has F bits, that means our second bucket is really only chosen from 2^F possibilities;  it is not uniform conditioned on the first.

Because of this limitation, one interesting aspect of the paper is that we show, in theory, this approach just will not work.  Specifically, there is an Omega(log n) lower bound on the size of the fingerprint that must be stored; this means the cuckoo filter is super-linear in space, unlike standard Bloom filters which are linear in space.  This lower bound is easy to derive.  Suppose that you have buckets that can hold b keys.  If 2b+1 keys have the same pair of buckets, then there is trouble;  you can't fit 2b+1 keys into 2b bucket slots.  So the question is whether there are collections of 2b+1 keys that have the same pair of buckets.  With standard cuckoo hashing, this would be a low probability event.  With partial-key cuckoo hashing, the probability depends on the size of the fingerprint, since this determines how many possible second choices of a bucket there are for each key (given their first random choice of a bucket).  If you have too few choices, you will get too many collisions, or too many keys in a pair of buckets.  Some straightforward calculations yield the Omega(log n) lower bound on the size of the fingerprint.

So now that we've proven theoretically that this won't work, how could it work in practice?  If you go back to those calculations, you find that the lower bound has some nice constant factors.  The bounds you get on the fingerprint size are actually (roughly speaking) log n/(2b).  That constant factor in the denominator, even when b=4, covers a large range of n.  In practice, fingerprints can be quite a reasonable size, and still avoid the potential bottleneck of too large a group of keys all hashing to the same small number of buckets. 

In any case, this may be the first paper I've written that contains a proof that the described data structure fails in theory but works in practice.  (The other way around I generally find much easier.) 

Big open theoretical question:  try to analyze why partial-key cuckoo hashing gets roughly the same load threshold as regular cuckoo hashing.  (Anyone want to work on that?  Let me know...)

This post is long enough;  there's a lot more in the paper, including comparisons with other schemes, bit-saving tricks, and lots of performance results.  Have a look.  

Postscript/Tangent:  In a recent review (for a different paper, with a different theme) one reviewer made the following comment/criticism:  "Despite years of research, no one has found a variant of cuckoo hashing that can outperform simpler methods in realistic situations."  To which I say, hunh?  Yes, I understand linear probing is wonderful in practice due to cache effects and block sizes and in many and perhaps even most situations can yield better performance than cuckoo hash tables.  But certainly not everywhere.  Cuckoo hash tables certainly seem at least potentially better for many hardware settings, and for at least some important applications -- such as this one -- cuckoo hashing seems to be a clear win.  (The issue here is the number of fingerprints you have to deal with on a lookup affects the false positive probability -- so even if linear probing lookups are "faster" due to cache effects and block sizes, if you try to get this sort of space utilization you get larger contiguous blocks which increase your false positives.)

My co-author David Andersen also violently disagrees with this review comment, and points to these recent papers at NSDI and Eurosys.  

Perhaps I'm missing something, but my thought was this was a sign of a bad review (both in content, and in its overall rudeness).  I felt obliged to try to correct the record (and I can only hope the reviewer sees this post).     
 

Friday, October 03, 2014

Andreessen-Horowitz Again

Andreessen Horowitz had a second Academic Roundtable, gathering together academics, VCs, and people in startups to talk for a few days about a variety of topics.  I wasn't sure why I was invited the first time (which I wrote about last year), and am even less clear on why I was invited back, but I again had a very good time. 

Before a few words about the talks, the high order point:  I'm told that most/all of the talks will be put online over the coming weeks.  So if you're interested, you can experience them yourself.  I think it's great that they're doing that this year;  if you like the talks, you should tell them, so they keep doing it.  (If they don't have comments there, you can always comment here.)  Sadly, they don't seem to be up yet, but I'll post the links when they become available. 

Highlights would include a discussion of Bitcoin -- interesting to hear what Ed Felten, well-known Princeton professor and now ex-Chief Technologist of the FTC, thinks about the Bitcoin economy.  Dawn Song of Berkeley gave a general talk on security issues of the present and future, while Dan Boneh of Stanford gave a talk on the power of program obfuscation.  Raj Rajikumar of CMU gave some history and some peeks into the future of driverless cars -- it's not just Google, you know.  Tuomas Sandholm of CMU talked about his take on the starting of startups while still being an academic (based on now-multiple experiences), and Josh Bloom of UC Berkeley (and wise.io) described the differences between writing papers about machine learning and building products using machine learning.

Of course, some heated discussion about the variety of issues that arise between academic work and transfer of technology to startups inevitably ensued.  (Maybe that's why I get invited.)  The key idea repeated by many (on both sides of the fence) in various forms was that businesses (and in particular startups) are very busy going down their road of development and product, and they may see many things out the sides of that road that are very interesting, but don't have time to explore off the road.  Academics are interested in things way off the road, often thinking of issues much further out in time-scale.  And (at least in my opinion) the role academics play is a good thing;  there (obviously) remains a lot of ways the two worlds can interact and cooperate. 

Thursday, September 18, 2014

On Academia vs. Industry.... MSR SVC Closing

I'm one of those professor types that ends up defending the joys of life as an academic versus a career in industry -- some of you who read this blog have probably seen me comment at Matt Welsh's blog or Daniel Lemire's blog/Google+ chain.  And to be clear I'm not anti-industry, I just want to make sure there's a fair discussion.

In that vein, I'm sad to hear that Microsoft Research Silicon Valley will be closing, as described in this article.  I know many people at MSRSV, and have visited there often.  It's a loss to the community to have it close.  I have sympathy for those who work there -- this must be stressful -- but this sympathy is happily diminished because I know the people there are so talented, they will quickly move on to other employment.  (When Digital Systems Research Center closed long ago, a number of the people I worked with in the lab just moved on to some small company that seemed to be just starting to really get going, called Google.  I wonder how it worked out for them.) 

After a moment of silence for the lab, I do feel it necessary to point out that this is one of the "issues" in the academia vs industry life choice.  The life cycle for companies moves rapidly.  That's not a bad thing, but it's a thing.  Disruptions like this are a non-trivial risk in industry, much less so in academia.  (Just look at the history of research labs.)  Again, I'm sure it will work out for the people at MSRSV, but any life shift like this -- even if it ends positively -- is stressful.  Without wanting to overstate the consequences, it's worth pointing out.