Monday, June 30, 2008

Strange Transactions on Information Theory Policies

The journal where I've published the most now is the IEEE Transactions on Information Theory. (Probably because when I write any journal article in information theory, I send it there, while my other work is more scattered.) So I hope it's clear that, as a journal, I hold it in high regard. That being said, they have some strange policies that I just don't understand. (If you're not familiar with the journal and want to check what I'm saying, their submission guidelines are here.)

First and foremost, they have a distinction between "regular papers" and "correspondence". As they briefly (and, to my mind, somewhat unclearly) explain, "a correspondence item will have one point to make briefly and without adornment, whereas a regular paper will be a more well-rounded treatment of a problem area." In the old days, as far as I can tell, this distinction didn't matter much, and it roughly boiled down to length -- for me, long papers were regular papers, shorter papers were correspondence. Perhaps in the IT community there was more prestige attached to a regular paper, but I've never been aware of it and never cared -- in my mind, citation count gives a better idea of a paper's importance than its nominal type.

This strangeness didn't become important until recently when, to deal with the apparent problem that the Transactions was becoming too big and thereby using too much paper [yes, I know this shouldn't be a problem, but they still send out lots of paper copies], they limited correspondence items to 5 journal pages.

This, to me, is just strange. First, I dislike the idea of page limits in general for journals. The journal is supposed to be the "permanent record" of a scientific result. If it takes 50 pages to present the result, then that's what it takes. A journal certainly has a right (indeed, perhaps a duty) to have authors properly edit a work so that there's not a gross amount of wasted verbiage, but to set a priori page limits, in my mind, goes against the purported purpose of journal publication.

Second, I don't see how this page limit jibes with this (already strange) regular paper/correspondence distinction. Sometimes, presenting a single point briefly takes more than 5 pages. (Arguably, it often does, if one wants to do a suitable job presenting previous work, or if substantial mathematical setup must be done.)

As an example, I submitted a paper to ToIT. I thought it was clearly a correspondence, according to their definition, since it really covered a single thematic issue. Given the policy, I wasted at least a day cutting things from my version to make it a (just over) 5 page submission version. (I had tables and a picture -- it really was a hard exercise to cut that last page.)

The reviews came back, and most of the complaints were questions of the form, "Why don't you talk about Z?" -- where, in fact, my discussion of Z was exactly what I had taken out to get things down to 5 pages. (Duh.) The editor asked the editor-in-chief about making an exception -- no exceptions to the 5 page rule.

So the editor, smartly, decided that I had a 6-page regular paper. While I appreciated the solution, the whole thing just highlighted the silliness of the policy, and perhaps the strangeness in distinguishing "regular papers" and "correspondence" in the first place.

Another policy I object to is their decision not to publish already-published conference articles unless there is "substantially new" material. (This is not spelled out, but I have been told the rough figure is "30% of the article must be new", whatever that means.) Again, this goes against the purported purpose of a journal as the final record of a research result. Conference papers are, in my mind, dramatically more likely to be wrong, and at the very least have generally had little feedback in the reviewing process. A journal is supposed to provide more in terms of reviewing; if a conference paper and journal paper ended up identical, I'd have no problem with it -- and I'd congratulate the author on writing such a solid conference version (assuming I didn't question the reviewing practices of the journal).

I can understand this policy in the context of copyright. An IEEE journal can't (re-)publish something another organization has the copyright to, and an author should understand this. But when the paper was in an IEEE conference -- what's the problem?

These two policies interact in a very strange way in the context of the IEEE International Symposium on Information Theory (ISIT) -- the big IT conference. ISIT papers are -- get this -- 5 pages. One might naturally hope that ISIT papers could be turned right over into Transactions correspondences -- and get the benefit of more complete reviewing and editing for that "final version". Not, it seems, under the current policy that the journal version must have something "new" over the conference version. But I can't add anything new to my 5 page conference version without going over that arbitrary 5 page correspondence limit. Am I the only one who sees something wrong with that picture? (And yes, I have ISIT papers that I won't bother submitting to ToIT, because they're really correspondences that should/would be 6-7 pages once I add that 30% new stuff in.) If they're not going to publish things from conferences, even IEEE conferences, then having a 5 page ISIT limit and a 5 page correspondence limit is, in my mind, just strange.

Monday, June 23, 2008

Yet Another Survey... Deletion Codes

Next week I'll be at SWAT for an invited talk. When I gave them a choice of topics, they went with: A Survey of Results for Deletion Channels and Related Synchronization Channels. I've actually given this talk a few other places, but there's been some new stuff, so it's been updated a bit. (I'll put the latest version of the slides up on my talks page after the talk...)

The invitation gave me an excuse to buckle down and finish something I started over a year ago -- a survey article on results for the capacity of the deletion channel (and related problems) to go with the talk. I must be out of practice writing, because it's taken me most of the month to get this thing done. (Well, OK, there was that whole "baby incident" that took up some time, but still, I've clearly been doing far too much administration this last year, and I need to shake off the rust.) It's probably still a bit rough, so if you're interested enough to read it, please feel free to mail corrections/suggestions directly.
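(For readers who haven't seen the model, here's a minimal sketch of the channel itself -- just the object the survey is about, with parameters chosen purely for illustration. Each transmitted bit is deleted independently with some probability, and the receiver sees the surviving bits in order, with no indication of where the deletions occurred.)

```python
import random

# A minimal sketch of the binary deletion channel: each transmitted bit is
# deleted independently with probability d, and the receiver sees only the
# surviving bits, in order, with no markers for where deletions occurred.

def deletion_channel(bits, d):
    return [b for b in bits if random.random() > d]

random.seed(0)
sent = [random.randint(0, 1) for _ in range(20)]
received = deletion_channel(sent, d=0.2)
print("sent:    ", "".join(map(str, sent)))
print("received:", "".join(map(str, received)))
```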

Apparently, based on some mild teasing from a colleague and friend, I'm known as someone who writes survey articles. (The teasing was that perhaps I'm known only as someone who writes survey articles...) I don't know why I end up writing them -- I can't really say it's a fun exercise to write a survey, though it is often educational -- but somehow I feel it's valuable. In this case, there's definitely the motivation that I've done a lot of work in this area, and I want to promote both the area and my work. The survey has a couple of dozen open problems spelled out right there, so let that be motivation for some of you to read it...

Also, I'm open to suggestions on where to publish it. My last few surveys have nicely been able to go to Internet Mathematics (see here for Bloom filters and here for power laws/lognormal distributions -- although the "uncensored" version of the power law survey from my publications page is more fun) so that they've had a worthwhile permanent home. This survey won't quite fit there, and I think it's good for such things to have a visible outlet (besides, say, arxiv). Ostensibly Transactions on Information Theory will take surveys, although I haven't seen that much in practice; and this survey doesn't really fit in the Foundations and Trends model. (I like the Foundations and Trends model, but it seems to me their surveys are like "mini-textbooks" -- and cost accordingly -- while my surveys are more like "here's some background, big ideas, and pointers to papers" -- and I'd like them to be essentially free.)

Wednesday, June 18, 2008

CS Family Values

One aspect of graduate school at Berkeley I recall quite clearly was the lack of children. Graduate students having children was and is generally quite rare, and I'm not sure that CS at Berkeley was much worse in that regard than anywhere else -- I'd be interested if readers have any pointers to stats on the issue, by field and/or by school. But even then it struck me as unusual for the faculty to have kids, particularly in the theory group.

I know there's the old advice -- still apparently prevalent in some circles -- not to have children until you get tenure. That advice seems less widespread these days, though, or perhaps more young faculty are simply choosing to ignore it. (Or, perhaps, I'm just out of touch -- at Harvard, at least, the common case is for CS junior faculty to have kids.) [And apparently my impression may also reflect a sex bias; see this summary and this pointer to an NSF report, suggesting that the effect of having children on tenure chances is small for men, but much larger for women.]

Does having kids help or hurt one's work? I don't think there's a clear answer; it probably depends on the individual (and the effect, in any case, is probably overestimated). On the other hand, I think a career framework that pushes people to postpone or not have children is ultimately cutting off a substantial supply of raw talent, which can't be a good thing. If academia as a whole is moving away from that 20th (19th?) century mindset, I'm all in favor of it.

(An aside -- Chloe Elizabeth Mitzenmacher arrived last week, prompting some of these thoughts. Which, due to lack of sleep, might be even less coherent than average...)

Friday, June 13, 2008

An Improved Analysis for "Why Simple Hash Functions Work"

I recently had a paper with Salil Vadhan (in SODA 2008) explaining why simple hash functions "work" for so many stream-related problems where data is hashed to be placed in a hash table or a Bloom filter or a similar structure. Here "work" means "behave essentially the way you would predict if the hash function were perfectly random." The basic reason, we suggested, was that if the data in the stream had enough randomness, it would combine with the randomness from the choice of hash function to produce values that looked independent and uniform.
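To give a feel for the phenomenon, here's a minimal sketch of the kind of experiment one can run -- this is purely illustrative, not the model or analysis from the paper, and the parameters and the way the "semi-random" keys are generated are my own inventions: hash keys that have some entropy with a weak 2-universal hash function, and compare the maximum bucket load to what a fully random assignment gives.

```python
import random

# Illustrative sketch only (not the model or analysis from the paper):
# hash keys that have some randomness using a weak 2-universal hash
# function, and compare the maximum bucket load to a fully random assignment.

P = 2**61 - 1   # a large prime for the hash family
M = 1024        # number of buckets
N = 10000       # number of items

def make_hash():
    # h(x) = ((a*x + b) mod P) mod M, a standard 2-universal construction
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % M

def max_load(bucket_of):
    loads = [0] * M
    for _ in range(N):
        # "Data with some randomness": keys sharing a fixed high part
        # but with random low bits (a stand-in for real-world entropy).
        key = (0xABCDE << 40) | random.getrandbits(40)
        loads[bucket_of(key)] += 1
    return max(loads)

print("weak hash, semi-random keys:", max_load(make_hash()))
print("fully random assignment:    ", max_load(lambda x: random.randrange(M)))
```

With enough randomness in the keys, the two maximum loads typically come out essentially the same, which is the behavior the paper aims to explain.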

Our analysis of how much randomness the data needed seemed a little loose. Salil and his student, Kai-Min Chung, have recently improved the analysis; here's the arxiv link, and the paper will be appearing in RANDOM '08. There are some really nice and somewhat subtle arguments that I imagine will find uses in other areas. While it "only" improves the constant factor (in terms of the number of bits of randomness needed), this is certainly a case where constant factors are important.

The result further reinforces my belief that this is the right high-order explanation for why, whenever anyone actually does an experiment involving hashing, they get the right results no matter what hash function they use.

Thursday, June 12, 2008

STOC -- Choosing A Program Committee

It took longer than I expected to put together a PC for STOC, and now that it's over, I thought I'd talk about the process.

First, I downloaded the PCs for FOCS/STOC/SODA for the last three years. I wanted to get an idea of what past PCs have looked like, and I wanted the raw data. Generally, I tried to avoid asking people who had served on those committees twice in the past three years. In my mind, it's good to mix things up and get new people involved in the process.
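(As an aside, the bookkeeping here is trivial -- something like the following sketch, where the file names and the one-name-per-line format are just hypothetical stand-ins for however the lists are stored.)

```python
from collections import Counter

# Hypothetical sketch of the bookkeeping: count how many recent
# FOCS/STOC/SODA committees each person appears on, and flag anyone
# who has served twice or more in the window. File names and the
# one-name-per-line format are assumptions, not real data sources.

committee_files = ["focs06.txt", "focs07.txt", "focs08.txt",
                   "stoc06.txt", "stoc07.txt", "stoc08.txt",
                   "soda06.txt", "soda07.txt", "soda08.txt"]

counts = Counter()
for fname in committee_files:
    with open(fname) as f:
        counts.update(line.strip() for line in f if line.strip())

for name, c in counts.most_common():
    if c >= 2:
        print(f"{name}: {c} recent committees")
```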

Then I tried to wrap my head around issues of coverage. I know authors are (reasonably) annoyed when they see nobody on the PC in their BROAD AREA X, leaving them wondering if their paper will get a fair read. Despite (indeed, because of) my readily apparent algorithms bias, I wanted to make sure there were a suitable number of people in complexity/cryptography/quantum etc. Even on the algorithms side, I wanted to make sure major areas like data structures, on-line algorithms, approximation algorithms, etc. were reasonably covered as well. It's hard to cover everything, so I'm sure some people will still be disappointed, but I think that overall there is a nice balance in the committee. (We'll see if it's a suitable balance once the papers come in...)

Besides covering areas, there were other balancing issues that came to mind. The "ages" of the PC members were perhaps the next most important. I wanted some fresh faces, since serving on a PC is both good experience for younger members of our community and a signal to their departments (or bosses, or future bosses) that they are well thought of. But I also wanted some more experienced hands who could possibly head off my mistakes before I made them and provide appropriate judgment where needed.

Additional balances I tried to make: "Practical" people and "pure" theoreticians. Research lab/industry people and academics. (No, that's not the same as the practical/theoretical balance.) US and non-US participants. I did also consider the number of women on the committee. My practice here was not any sort of direct affirmative action, but simply a conscious effort to make sure I wasn't overlooking the numerous well-qualified possible female PC members when sending out invites. I look forward to the day when I don't think it's necessary for me (and others) to make this effort, but I personally don't think our field is there yet. For those who are interested in such matters, the STOC 2009 PC is 25% female. In checking back over the PC lists for the last three years, this seems to be more than average for FOCS/STOC/SODA. I admit to being somewhat surprised (and disappointed) by that, and I leave it to commenters to discuss the importance of this issue to our community.

The biggest difficulty I found in getting the PC finalized was getting senior people to sign on. Roughly speaking, the probability that someone is willing to serve on a PC seems to decline monotonically with years of experience (with perhaps a significant step-function drop post-tenure). I don't mean this as a criticism; I know that the people who declined are indeed quite busy with many other things, including larger-scale efforts to benefit and promote the theory community. Or, as one colleague put it to me when I explained the problem I was having, "Duh!" Anyhow, it took a couple of rounds of invitations to find, in particular, more senior colleagues, and I thank them especially for their willingness to volunteer, and/or for just giving in to my begging.

[Come to think of it, my own PCing has declined post-tenure, though I'm still averaging at least a couple of larger conferences per year...]

Anyhow, at some point or another, an official call for papers will go up, but I think it's OK to announce the PC (subject to last-minute issues or changes):

Susanne Albers
Andris Ambainis
Nikhil Bansal
Paul Beame
Andrej Bogdanov
Ran Canetti
David Eppstein
Dmitry Gavinsky
Leslie Goldberg
Shafi Goldwasser
Nicole Immorlica
Anna Karlin
Jonathan Katz
Jonathan Kelner
Subhash Khot
Ravi Kumar
Michael Mitzenmacher
Kamesh Munagala
Rasmus Pagh
Anup Rao
Rocco Servedio
Mikkel Thorup
Chris Umans
Lisa Zhang

Tuesday, June 10, 2008

Scientific Citations

A comment over at Lance's blog got me thinking about citations. To summarize, some anonymous commenter said that if a paper wasn't easily available online (specifically, "not available for free online (or through my acm portal subscription)"), they wouldn't read or cite it, and then another commenter pointed out that it was the author's responsibility to acknowledge relevant work, give proper credit, and avoid duplicating previous work.

I certainly agree in spirit with the second comment, but I wonder what, exactly, our responsibilities as authors are. How much hunting, exactly, am I as an author expected to do to find relevant work? This is certainly an issue I've faced in my own work. For example, I've had the case where there may be a related article in an old Russian mathematics journal -- an article pops up in my search of related keywords, and either the title or abstract seems potentially relevant, but I can't really tell without getting the article. So far, I've managed -- the blessings of always being near the Berkeley, Stanford, Harvard, or MIT libraries -- but it has sometimes been a non-trivial effort to track such an article down. In the old days, that library time was more or less expected. What expectations should there be in terms of tracking down old paper copies? What expectations should there be in terms of what an author is required to "spend" to get copies of possibly related work? I do think there is a reasonable argument to be made that if your paper isn't freely available, an author can't necessarily be expected to cite it.

And of course I wouldn't have even faced the tracking problem except that I try to be diligent in my searches for relevant work. Working across areas, I've often found I have to spend some time guessing and doing some random walks to find out what people in another area call the concept I'm thinking about just to find the relevant papers. How much searching in Google/Google Scholar/your tool of choice should be expected of us? (I'm thinking here of the really annoying reviews I sometimes get of the form, "You should have cited this paper, I'm going to suggest rejecting your paper." That's inane. Perhaps I should have cited the paper, in which case, you should suggest that I cite the paper; that's what reviews are for.)

After all this, there's still the question of what should be cited. Science rules seem much "looser" than what I've seen in literature, history, etc. I'd never think of citing Karp or Garey and Johnson if I were showing a standard NP-completeness result (unless there was a very specific reason to do so), because it's now considered common knowledge. I think in many humanities fields that would be considered improper. Perhaps standards for various fields should be codified -- if only so that people in one field can easily understand the practices in another.

Monday, June 09, 2008

Results of the Hiring Season

I've been eagerly awaiting my chance to talk about the results of this year's hiring season for Harvard Computer Science, but wanted to wait until it seemed all the i's were dotted and t's were crossed. I think we're at that stage.

Because of some losses in Computer Science (including Mike Smith and Barbara Grosz being made Deans, of the Faculty and of Radcliffe, respectively), we were really looking to hire this year. Indeed, I think in the end my most important role as chair of the search committee was to make sure that the right people knew we really wanted to make a lot of offers, and to build support for that. The search committee itself was so great that it made the rest of the job (screening, interviewing, and deciding on candidates) relatively easy. Well, OK, it was actually a lot of work. We had hundreds of folders to go through, we interviewed 14 candidates, and because the candidates were so strong, our decisions were challenging. But the committee itself ran smoothly, with everyone really putting in a lot of effort and working to come to agreement. That's one benefit of a smaller department -- we're very collegial about this stuff.

We ended up making six offers. Given the size of Harvard's computer science faculty, this is indeed a lot, and I'm proud of the fact we were able to make the case to our peers in the School of Engineering and Applied Sciences that making this many offers was entirely appropriate. (We were helped, again, by the high quality of candidates.) In the end, we will have three new faculty members:

Yiling Chen works at the economics/computer science interface.
Stephen Chong works on information security and programming languages.
Krzysztof Gajos works on user interfaces and human-computer interaction.

We're all looking forward to their arrivals.

I, personally, feel very good about the outcome, and am pleased the search was successful. I'm of course disappointed that not everyone we made offers to decided to choose us, but I don't think any school manages 100% there. (Quite frankly, I'm sure the Dean would have been somewhat concerned with the logistics if all six had come, but that would have been a problem I'd have been happy to work through and live with.)

I expect Harvard computer science will be continuing to grow in coming years, so when you know of graduates looking for jobs, make sure Harvard is on their radar.

Friday, June 06, 2008

Is Pure Theory "Wasteful"?

Daniel Lemire has a post on his blog entitled "Why Pure Theory Is Wasteful".

I encourage people from here, if they are so inclined, to comment at his blog...

Wednesday, June 04, 2008

STOC PC -- Software Suggestions?

Well, things are moving forward on the STOC 2009 PC front. I think the committee is just about finalized; more about that in a future post.

The next big question is what software to use for paper submission and reviewing. The last several conferences I've been involved with used easychair, which I view as OK from the PC member side of things. Adequate, but not great.

Of course, now I'm viewing things as a PC chair, which feels somewhat different. The first priority is that the software makes automated tasks easy, to save me (and the committee) time. The second (although related) priority is that there is some mechanism to easily get the data out -- preferably in some sort of tab-delimited text format -- so that I can process it as necessary. I'd like this both so I can process information as I see fit, and so I can deal effectively with any emergencies that arise by managing the data myself. The third priority is a reasonable amount of flexibility. Finally, I'd like it easy enough that a minimally trained monkey can use it. I don't know how easychair performs in these respects.
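(To make the second priority concrete, here's the flavor of what I'd want to be able to do with an export -- a minimal sketch, where the file name and column names are purely hypothetical, not any particular system's actual format.)

```python
import csv
from collections import defaultdict

# Hypothetical sketch: read a tab-delimited export of reviews and compute
# the average score per paper. The file name and column names here are
# assumptions about the export, not any particular system's actual format.

scores = defaultdict(list)
with open("reviews.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        scores[row["paper_id"]].append(float(row["score"]))

for paper_id, s in sorted(scores.items(),
                          key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(paper_id, round(sum(s) / len(s), 2))
```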

Obviously, I'm asking too much. But I'm eager to take suggestions, both from people who have run conferences and from reviewers who just have software preferences.