Tuesday, September 30, 2008

Enrollments Up

Over the last two years, we've seen a huge increase in enrollment in our introductory computer science course. Last year it was about double the size of the previous year; this year, it looks like it has gone up another 50%. Yes, the intro course has roughly tripled in two years.

Part of that might be that we have a new instructor, who is apparently young, dynamic, brings a more modern perspective, and has all those qualities that inspire students to take introductory courses. Another explanation is that Harvard last year introduced minors (called secondary fields), so now by taking 4 classes you can get a minor in computer science. A more timely explanation is that the financial meltdown is encouraging students to look at computer science and applied math this year (instead of the traditionally more popular major of economics).

The gain is starting to translate into other courses. The intro theory course (automata/complexity) is up over 50% from last year.

Needless to say we're excited by this change, although still processing what it means and how to deal with it longer term. Obviously, we're hoping this will help us make the case for more hires...

Friday, September 26, 2008

Allerton : Moves on Deletions and Insertions

To wrap up, the other paper I presented at Allerton (pdf, talk slides), joint with Adam Kirsch, continues our work analyzing the power of moving items around in multi-level hash tables. The gold standard here, at least in theory, is cuckoo hashing, which achieves very high loads but can require a large number of moves when inserting an item. That might not be practical for many applications (especially in network router hardware). In previous work, we had considered the gains obtainable from moving at most 1 item on an insertion. In this work, we add the power to move items on a deletion as well. (Lookups stay fast; we do no extra work there.)
[The discussion below is, admittedly, probably more for those at least familiar with our previous work...]

Moving an item after another item is deleted is difficult -- generally, you'd want to move some item from deeper in your multi-level hash table into the now-empty space higher up, but how can you find which item to move? Actually, we'd had ideas on how to do this back when we were working on the insertion paper, but we didn't include them then, for several reasons. First, our ideas required additional space (as explained below). Second, we couldn't analyze them, and one of the points we were aiming for in that paper was to rigorously analyze our schemes. Third, we didn't have space -- not in the hardware, but in the actual paper. Blame page limits (yes, even for the journal version).

A recent INFOCOM paper that offered one way to handle moves on deletions inspired me to have us write down our approach -- because, as we discuss in our paper, I feel our approach is clearly better than the one proposed there. Even if we can't prove what we'd like, we can still show it works well through simulation.

Our (admittedly obvious) idea is to keep a hint at each hash table cell recording, if there was a collision there, where the colliding item was placed, so you can find it later on a delete. Hints take space, but not too much. If multiple collisions occur, you have to choose which hint to keep; we found that keeping the most recent collision worked best. Moves on deletes work reasonably well, but the real gains naturally come from combining moves on insertions and deletions. This gives pretty good performance -- you can get space utilizations more in line with (though of course still not as good as) cuckoo hashing, albeit with more hash functions -- but again, using just 1 move per insert/delete. More details in the paper.
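To make the hint idea concrete, here is a minimal sketch in Python. It assumes a simplified multi-level hash table with one item per cell and one hash function per level, items placed in the first empty cell among their choices, and only the move-on-delete part; all names and policies are illustrative, not the actual construction from the paper.

    import random

    class MLHTWithHints:
        """Toy multi-level hash table with 'hints' for moves on deletions.
        One item per cell, one hash function per level; an item goes into
        the first empty cell among its d choices.  Just a sketch of the
        idea described above, not the scheme from the paper."""

        def __init__(self, d=4, size=1024):
            self.d, self.size = d, size
            self.tables = [[None] * size for _ in range(d)]
            # hints[i][b]: the most recent item that collided at cell (i, b)
            # and was pushed to a deeper level.
            self.hints = [[None] * size for _ in range(d)]
            self.seeds = [random.randrange(2 ** 32) for _ in range(d)]

        def _bucket(self, key, level):
            return hash((self.seeds[level], key)) % self.size

        def lookup(self, key):
            return any(self.tables[i][self._bucket(key, i)] == key
                       for i in range(self.d))

        def insert(self, key):
            placed = None
            for i in range(self.d):
                if self.tables[i][self._bucket(key, i)] is None:
                    self.tables[i][self._bucket(key, i)] = key
                    placed = i
                    break
            if placed is None:
                return False          # all d choices full: overflow
            # Record a hint at each cell this item collided with on the
            # way down ("most recent collision wins").
            for i in range(placed):
                self.hints[i][self._bucket(key, i)] = key
            return True

        def delete(self, key):
            for i in range(self.d):
                b = self._bucket(key, i)
                if self.tables[i][b] == key:
                    self.tables[i][b] = None
                    # Move on delete: if the hint points at an item that
                    # went deeper, pull it up into the freed cell.
                    hinted = self.hints[i][b]
                    if hinted is not None and self._remove_below(hinted, i):
                        self.tables[i][b] = hinted
                    self.hints[i][b] = None
                    return True
            return False

        def _remove_below(self, key, level):
            for j in range(level + 1, self.d):
                b = self._bucket(key, j)
                if self.tables[j][b] == key:
                    self.tables[j][b] = None
                    return True
            return False              # stale hint: item already deleted

In the paper we also combine this with moves on insertions (from the earlier work); the sketch leaves those out, and it doesn't bother cleaning up stale hints, which a real implementation would want to handle.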

Thursday, September 25, 2008

Allerton : The Day Here

I like the Allerton conference.

Part of it is that I come almost every year. So by now I know my way around, which is a nice aspect of a conference being held at the same place every year. I got the early flight in, rented my car in about five minutes at the tiny but functional Willard airport in Urbana-Champaign, and had my badge about a half hour later. After a talk and catching up with some people, I went to lunch at the Brown Bag, the crowded spot for people who don't like the conference lunch. Then back to listen to more talks, and to give my own. After the talks, more catching up with people while they set up the open bar and snacks. (JeffE, forget beer at the business meetings -- take a day, or just an afternoon, off and join us!) Then dinner with a very old friend (Steve Lumetta) and his family.

I saw interesting talks by Ramesh Johari, Balaji Prabhakar, Jean Walrand, Adam Wierman, and others. I talked with people about hashing, Bloom filters, deletion channels, sticky channels, floating codes -- all manner of things. It's fun being at a conference where the usually disparate-seeming parts of my research don't seem quite so disparate.

Sadly, a quick trip this year. I fly back early tomorrow.

Wednesday, September 24, 2008

Allerton : Floating Codes

A couple of decades or so ago, Ron Rivest and Adi Shamir wrote a fun paper on write-once memory. Think of a punch card -- you can turn a 0 into a 1, but you can't turn it back again. Optical disks were a potential application. Suppose you have a k-bit value that is going to be written t times. How many write-once bits (or "wits") will you need to store it through the t changes?
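As a concrete illustration (reconstructing the classic example from that paper from memory, so treat the details with appropriate caution): 2 bits that will be written twice can be stored in just 3 wits rather than the naive 4. The first write uses a codeword of weight at most one; if the value then changes, the second write uses the bitwise complement of the new value's first-write codeword. A few lines of Python checking that every legal transition only turns 0s into 1s:

    # Classic 2-bits-written-twice-in-3-wits construction (from memory).
    first_write = {'00': '000', '01': '100', '10': '010', '11': '001'}
    second_write = {v: ''.join('1' if b == '0' else '0' for b in c)
                    for v, c in first_write.items()}   # bitwise complements

    def can_update(old, new):
        # wits only go 0 -> 1, never back
        return all(o <= n for o, n in zip(old, new))

    # Decoding: weight <= 1 means read the first table, weight >= 2 the
    # second.  If the value doesn't change, no second write is needed;
    # otherwise every transition is legal:
    for v1, c1 in first_write.items():
        for v2, c2 in second_write.items():
            if v1 != v2:
                assert can_update(c1, c2)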

Fast-forward a few decades, and similar models are coming back into vogue because of flash memory. My understanding is that flash memory uses floating-gate cells; roughly, cells are organized into larger blocks of n cells, and each cell can hold a number of electrons from 0 to q-1. It's easy to add electrons to cells, but not to remove them; to do that, you have to reset the whole block back to 0's first, which is expensive and wears out the memory, which only survives a limited number of resets. Suppose you want to store the values of k ℓ-ary variables in a block. How many rewrites can you do before you need to reset? Note that a "code" in this setting consists of two things: a mapping from cell states (the state of the entire block) to variable states, and a transition function giving how the cell state changes when a variable changes. For more of an introduction, see this paper by Jiang, Bohossian, and Bruck.
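As a toy illustration of the model (my own example, not one from the literature): store a single bit in a block as the parity of the total charge. To flip the bit, add one electron to any cell that isn't full; only when every cell sits at level q-1 do you need a reset, so this trivially supports n(q-1) rewrites per reset.

    q, n = 4, 8                  # levels 0..q-1, n cells per block
    cells = [0] * n              # block starts reset

    def read_bit():
        return sum(cells) % 2    # decode: stored bit = total charge mod 2

    def write_bit(b):
        """Update the block so it decodes to b; return False if a reset
        (wiping the whole block) would be needed first."""
        if read_bit() == b:
            return True
        for i in range(n):
            if cells[i] < q - 1:
                cells[i] += 1    # raise one cell by one level
                return True
        return False             # every cell is full

    rewrites, bit = 0, 0
    while write_bit(bit ^ 1):
        bit ^= 1
        rewrites += 1
    print(rewrites)              # n * (q - 1) = 24 for these parameters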

In our paper (pdf, talk slides) for Allerton, we introduce (as far as I know) the question of considering floating codes in the "average case" -- previous work considered the worst-case number of rewrites before a reset. Given the limited lifetimes of flash memory, and the potential to model system behavior before production, we think analysis of average-case performance makes sense here. For our notion of "average case", we assume that the variable values follow a Markov chain, and the goal is to maximize the long-term average time between resets. Given a code, the Markov chain on the variable states induces a Markov chain on the cell states. So the question becomes how to design a code that maximizes the time between resets on this Markov chain.
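Here is one way such an average-case experiment might be set up, as a rough Monte Carlo sketch; the interface and bookkeeping are my own simplifications (the paper's definitions are more careful), and the code plugged in at the bottom is just the one-bit parity example from the previous sketch, driven by a "sticky" bit that flips with probability 0.2 at each step.

    import random

    def average_writes_between_resets(next_value, write, reset, steps=200000):
        """Rough Monte Carlo estimate of the long-run average number of
        value changes a block absorbs between resets, when successive
        variable values come from a Markov chain.  next_value(v) samples
        the chain, write(v) updates the cells (returning False if a reset
        is needed first), reset() clears the block."""
        gaps, served, v = [], 0, 0
        reset()
        for _ in range(steps):
            w = next_value(v)
            if w == v:
                continue                  # no change, nothing to write
            if not write(w):
                gaps.append(served)       # block exhausted: reset, then store w
                reset()
                write(w)
                served = 0
            served += 1
            v = w
        return sum(gaps) / max(len(gaps), 1)

    # Plug in the one-bit parity code with a sticky Markov chain on the bit.
    q, n = 4, 8
    cells = [0] * n
    def reset(): cells[:] = [0] * n
    def write(b):
        if sum(cells) % 2 == b:
            return True
        for i in range(n):
            if cells[i] < q - 1:
                cells[i] += 1
                return True
        return False
    def next_value(v):
        return v ^ (random.random() < 0.2)   # flip with probability 0.2

    print(average_writes_between_resets(next_value, write, reset))

For this trivial one-variable code the answer is always n(q-1), so the average and worst case coincide; the interesting cases, and the ones the paper is about, are codes for several variables in one block, where the two measures can differ.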

Following in the footsteps of the JBB paper, we find some interesting initial results, and leave a lot of open questions -- including the complexity of computing optimal codes for general versions of the problem. I think this is yet another potentially interesting coding area where CS ideas should be able to play a strong role. And while I'm not (yet) an expert on the practical implications of these theoretical coding models, the connection to flash memory seems promising.

Monday, September 22, 2008

MSRNEOS

I spent most of the day at the Microsoft Research New England Opening Symposium over at MIT. It was a very nice event. Part of the fun was that it was well attended -- I got to catch up with some people I hadn't seen for a while. But they also had a very nice collection of speakers.

There was a panel discussion on interdisciplinary research (a theme of the new lab), and a natural question was how such research should be encouraged. Nobody on the panel seemed to make what I thought was an obvious point: one way to encourage such research is to make sure people across the sciences are well trained in mathematics and (theoretical) computer science. Interdisciplinary research depends on finding a common language, which historically has been mathematics, but these days more and more involves algorithms, complexity, and programming.

Luckily, to make the point for me, Erik Demaine next gave a talk entitled "(Theoretical) Computer Science is Everywhere", whipping through examples in the arts, business, society, games, other sciences, and so on. It was a really inspiring talk, one that should be given to incoming freshmen in the field, or better yet, to high school students who don't know what computer science is about. The talk was taped, and hopefully I'll have a link to it soon. Afterwards, as several of us were talking about it, some minor issues were raised. I myself brought up my usual concern that perhaps more theorists should take a more direct role in promoting the practice of CS theory, including in other fields. My colleague Greg Morrisett questioned whether Erik couldn't have replaced "Theoretical Computer Science" with "Applied Math" and given pretty much the same talk. It's an interesting point, although I do think TCS's emphasis on complexity and algorithmic thinking gives it a distinct perspective.

Another talk I enjoyed (more than I thought I would from the title) was "Understanding Socio-Technical Phenomena in a Web2.0 Era" by danah boyd, who will be starting at MSR NE this January. She is an ethnographer who studies how adolescents interface with technology. So, to summarize, the talk was about why kids (at least in the US) spend all their time on MySpace and Facebook, and what it all means. It was very well presented (similar in style to Lawrence Lessig, who gives fantastic talks), and perhaps my interest stems from now having three kids; getting a perspective on how adolescents use online spaces to interact and communicate (as well as rebel and establish their own identities) was quite enlightening.

So while they've been around for most of the summer, it's nice to have an official welcome for the new neighbors!

UPDATE: Link for talks.

Saturday, September 20, 2008

This Week, Allerton

The 46th Annual Allerton Conference is this week. The program (not as easily accessible as it should be) is here. I'll be going, but sadly, it will just be a 1-day drop-in to give my talks. Travel is a bit of a luxury, with the new baby and all.

I wasn't sure if I was going to go to Allerton this year. While I enjoy the conference, I have had growing concerns that, because many talks/papers are invited and the proceedings aren't widely distributed, the papers I present there aren't noticed as much as they would be otherwise. They've alleviated that concern somewhat by making the papers available on the IEEE online system. At least I know that if anyone tries to find the paper, the official copy will be available somewhere (besides my home page). Hopefully the old Allerton proceedings will go online at some point too.

The other motivation for going to Allerton this year was that it forced me to get some things done. Between the faculty search last spring and the offspring this summer, it's been a bit difficult to complete some of my research tasks. There's nothing like promising to have a paper in on a certain date -- a non-artificial deadline -- to get things to some sort of stable state.

I'll hopefully have posts up later this week about the new papers.

Wednesday, September 17, 2008

Small Steps...

Every once in a while, I see a blog entry or comment with the lament that the CS publishing model is just wrong. The common statement is that the conference-oriented publishing approach leads to incremental, trivial papers, solving problems that are easy instead of the "real" problem.

I'd like to offer a contrasting opinion, one I've stated before, but may be new to some. (And since the criticisms above are repeated cyclically, I don't see why I can't repeat my response.)

Having worked some in the area of heuristic algorithms, I've gained a healthy respect for the fundamental approaches of repeated refinement, parallelization of efforts to explore a search space, and the occasional bit of random wandering. Greedy-style heuristics don't get to good answers by big jumps, but by a multitude of small steps. Parallelization, leveraging the power of crowds, greatly speeds up such heuristics, and frequent communication between the agents working in parallel helps move everything along. And most good heuristic algorithms require a bit of a random (or explicit!) push to keep moving through the space of possibilities, to find new parts of the search space that yield better solutions than the areas already plumbed.
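(Purely to illustrate the analogy, here is a toy sketch of that kind of search dynamic -- greedy local steps, many agents in parallel, occasional random jumps, and sharing of the best solution found. Everything in it is made up for illustration.)

    import math, random

    def parallel_local_search(score, neighbor, random_state,
                              agents=20, rounds=500, jump_prob=0.05):
        """Many agents each take small greedy steps, occasionally jump to
        a random state, and the best solution found is shared each round."""
        pop = [random_state() for _ in range(agents)]
        best = max(pop, key=score)
        for _ in range(rounds):
            for i, s in enumerate(pop):
                cand = random_state() if random.random() < jump_prob else neighbor(s)
                if score(cand) >= score(s):        # keep non-worsening steps
                    pop[i] = cand
            best = max(pop + [best], key=score)
            # "communication": the weakest agent adopts the best solution
            pop[min(range(agents), key=lambda i: score(pop[i]))] = best
        return best

    # Toy usage: maximize a bumpy one-dimensional function.
    f = lambda x: math.sin(5 * x) + math.cos(3 * x) - 0.1 * x * x
    print(parallel_local_search(f,
                                neighbor=lambda x: x + random.uniform(-0.1, 0.1),
                                random_state=lambda: random.uniform(-5.0, 5.0)))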

The CS conference publication model shares these features. Yes, there are many more small steps than big jumps. Yes, there are times when less fruitful and less interesting directions are explored. But the community as a whole moves rapidly, churning through ideas and, I think, rapidly focusing on promising ones. Also, new ideas and directions arise with, I think, remarkable frequency. Looking at any single conference, you might think over 90% of it is junk -- or, at least, not so important in the grand scheme of things. And you'd probably be right.** But that doesn't a priori mean the conference publishing system is broken, and any argument that it is broken based on such reasoning doesn't even start to convince me.

This doesn't mean that we are excused from having taste and opinions as to what constitutes good work. Indeed, without this, we'd lack an evaluation function to lead us in the right directions. And I'm not saying the contrasting "mathematics" style of publishing only fuller, complete papers is necessarily worse -- I don't think I've seen evidence one way or the other. (I might admit to having a personal preference.) But if you want to argue the point, you need to do more than just look at individual papers in individual conferences, and focus on the whole.


[** A favorite quip from grad school that I still remember was when a systems friend told me "95% of theory papers are crap!", and I simply responded, "So that means we're 4% better than systems." Good times.]

Thursday, September 11, 2008

Communications of the ACM, Theory Bias?

Anyone else noticing a pleasantly refreshing pro-theory bias at the "new" Communications of the ACM? One news article on spectral graph theory extensively quotes Fan Chung Graham, Milena Mihail, and Jon Kleinberg. Another on privacy extensively quotes Cynthia Dwork and Frank McSherry. One of the presented research papers is on Distributed Selection by Fabian Kuhn, Thomas Locher, and Roger Wattenhofer (from SPAA 2007), with a perspective given by Hagit Attiya. Finally, there's a puzzle column by Peter Winkler.

I admit, I'm no longer just automatically throwing it away. The entire magazine seems more interesting since the editorial refresh. But deserving nods to the work of the theory community will certainly help hold my attention.

Tuesday, September 09, 2008

SODA, and Parallel Sessions

The list of SODA accepted papers is up.

For both my own curiosity and probably Claire Mathieu's, does anyone have ideas or suggestions on how to schedule parallel sessions? It seems like a nasty problem, particularly since you have limited information on what the conflicts are, beyond very general categories for papers.
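Not an answer, but the obvious first thing I'd try: treat two papers that share a topic category as conflicting whenever they run in parallel, and do a simple local search over assignments to minimize the number of such conflicts. A rough sketch -- the conflict model, the parameters, and the data are all made up for illustration, and real constraints (session themes, speaker availability, slot capacities) are ignored:

    import random
    from itertools import combinations

    def schedule(papers, slots, tracks, iters=20000):
        """papers: dict mapping paper id -> set of topic categories.
        Two papers 'conflict' if they share a category and are assigned
        the same slot in different tracks (i.e., they run in parallel)."""
        ids = list(papers)
        assign = {p: (random.randrange(slots), random.randrange(tracks))
                  for p in ids}

        def cost(p):
            s, t = assign[p]
            return sum(1 for q in ids if q != p
                       and assign[q][0] == s and assign[q][1] != t
                       and papers[p] & papers[q])

        for _ in range(iters):                    # simple local search
            p = random.choice(ids)
            old, before = assign[p], cost(p)
            assign[p] = (random.randrange(slots), random.randrange(tracks))
            if cost(p) > before:
                assign[p] = old                   # revert worsening moves

        total = sum(1 for p, q in combinations(ids, 2)
                    if assign[p][0] == assign[q][0]
                    and assign[p][1] != assign[q][1]
                    and papers[p] & papers[q])
        return assign, total

    # Toy usage with made-up data.
    topics = ["approximation", "hashing", "graphs", "geometry", "online"]
    papers = {i: {random.choice(topics)} for i in range(120)}
    assignment, conflicts = schedule(papers, slots=15, tracks=3)
    print("parallel conflicts:", conflicts)

The hard part, of course, is getting a conflict measure that actually reflects what attendees want to see, which is exactly the information you don't have.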

Wednesday, September 03, 2008

Big Numbers

In my algorithms class, at some point, I teach RSA and primality testing. For the corresponding assignment, I usually have students do a "numerical exercise" which involves, naturally, computing with some big numbers. They also have a later programming assignment where they have to use numbers that may not fit into a standard-programming-language-sized integer. Since this is not a programming class, and I do not prescribe a language for them to use, I essentially tell them it's part of the homework to figure this out.

Surprisingly, it seems to give a number of people significant grief every year. I know that for C and Java it's a pain -- the natural thing to do is to download and install a big-number library, which should be easy, but often some trouble arises. (And most students, even though they've had the intro programming courses, do not seem to have learned how to do useful things like download and use a library.) There are other programming languages that handle big numbers essentially transparently -- ah, the days of programming in Lisp -- which are fine for the numerical exercise, but may not be as nice for the larger programming assignment.
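For what it's worth, Python is one of the languages where this is transparent: integers are arbitrary precision, and the built-in three-argument pow does modular exponentiation. A sketch of the sort of computation the assignment involves (toy parameters, and obviously not something students should just copy):

    import random

    def is_probable_prime(m, rounds=20):
        """Miller-Rabin primality test."""
        if m < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
            if m % p == 0:
                return m == p
        d, r = m - 1, 0
        while d % 2 == 0:
            d, r = d // 2, r + 1
        for _ in range(rounds):
            x = pow(random.randrange(2, m - 1), d, m)   # modular exponentiation
            if x in (1, m - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, m)
                if x == m - 1:
                    break
            else:
                return False                            # definitely composite
        return True

    def random_prime(bits):
        while True:
            candidate = random.getrandbits(bits) | (1 << (bits - 1)) | 1
            if is_probable_prime(candidate):
                return candidate

    # Toy RSA round-trip; 512-bit primes are far too small for real use.
    p, q = random_prime(512), random_prime(512)
    n, e = p * q, 65537
    d = pow(e, -1, (p - 1) * (q - 1))    # modular inverse (Python 3.8+);
                                         # assumes gcd(e, (p-1)(q-1)) = 1
    message = 123456789
    assert pow(pow(message, e, n), d, n) == message

Of course, handing this out would defeat the purpose; the point of the exercise is that they figure something like it out themselves.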

My only real point here is that, in those lists of skills I would expect CS majors to have, I'd have to put "being able to do computations with very large numbers" somewhere on the list. Yes, you're unlikely to need it for your standard checkbook program, but it seems to me this issue arises in plenty of places. (It has for me -- and my graduate students -- several times.) And it's not so important that they know how to do this specific task off the top of their head, but more that when they're given a task like this, they are able to find the information they need and complete it in a timely fashion. For better or worse, that's something some of them get out of my class.