Friday, April 09, 2010

More on Authorship

My last post, on what I called "60-40" papers, where one author does non-trivially more of the work than the other(s), seems to have generated some interesting comments, worth following up on.  There seem to be multiple issues in the comments that, to me, appear essentially orthogonal:

1)  How does the community properly assign credit for 60-40 papers?  Should we use author ordering or some other mechanism to assign credit?
2)  What about advisors who do minimal to zero work but put their name on the paper?
3)  At what point has a person involved in the project done so little work that they should not be included as an author (either by withdrawing willingly, or possibly by being told "you're not an author")?  (I think of this as separate from the "advisor" issue.)

Let me start with item 1, assigning credit.  I promoted the approach used in theory (derived, apparently, from mathematics) of alphabetical order, claiming credit comes out through things like letters and who gives the talk, and is determined more clearly over the course of a career.  Many question this;  indeed, many other fields use entirely different systems.  Many fields use author order to signal the level of contribution in some way, so that being "first author" has significant meaning.  At the extreme, the journal Nature suggests in its guide to authors that the contributions of each author be fully specified in each article:

"Author Contributions: authors are required to include a statement to specify the contributions of each co-author. The statement can be up to several sentences long, describing the tasks of individual authors referred to by their initials."

Graduate students and postdocs, in particular, are more concerned with systems that clarify credit, and this is understandable.  They have short career track records, and want a job;  making sure that they get their proper credit often seems, to them, quite imperative.

I'd like to defend the alphabetical, no-explicit-credit-assigned system, and then provide a couple of stories.  (If you find that indulgent, you can skip the stories.) 

One philosophical approach is to try to start from a blank slate.  Forget about your current situation, and how your field does things.  Your starting point is that you're just starting a career in science.  What sort of system do you want to use?  I'd argue you'd want to use a system that would lead to long-lasting, productive collaborations;  that would have minimal overhead;  and that would still provide meaningful ways of calibrating people over appropriate time periods.  I think pure alphabetical does that.  It removes the need to fight over (or even discuss) who contributed exactly what, leading more easily to frequent and repeated collaboration.  To be clear, I have a strong bias:  collaborations, I think, are great for scientific production, and on the whole make research much more fun.  Alphabetical order is clearly easy.  And while it's weak on allowing someone to find out how much each individual author contributed to a specific multi-author paper, over the course of several papers, I think the calibration works, especially when augmented with additional information such as letters in job searches and promotion cases.  Further, it's not clear that other systems are really stronger in terms of assigning credit.  Authors can disagree on contributions (how does this get settled, and what does it do to future collaborations?);  in multi-author situations where order ostensibly matters, many advisors game the system, for example by putting students first regardless of their contribution, to prep them for the job market or out of professional courtesy;  and it's not clear how to value different types of contributions, like ideas vs. data collection and analysis.  My bias is that the blank slate scientist starting their career would pick the alphabetical order system.

I have at least one data point for this conclusion: myself.  (Here's where the stories start.)  In graduate school, a bunch of us students got together and wrote a paper.  This was a case where I was definitely the 60 author, and I thought it would be best if I was first author.  The other students didn't object, but since I knew it wasn't standard for theory, I asked my advisor.  (He wasn't a co-author for this paper, so his view was not biased in that regard.)  He told me it was my choice, but that I needed to recognize the following:  I would possibly get more credit for this paper, but, from then on, I would have adopted a system where, for every paper, I'd have to face the possibility of constructing the author order with my co-authors.  Did I want to have that discussion for every paper down the line?  I went with alphabetical order and have never looked back.  I always recommend alphabetical order, although when I work with people in other areas I do defer to whatever system they want to use, and tell them they can put me wherever they like in the ordering.  (It is true that, with tenure, one can care much, much less about such things.) 

On the other side, another story.  When I applied for my CAREER grant, apparently I was on the borderline, and it took quite some time to get the final word.  I asked the NSF officer for feedback -- especially in case I needed to resubmit.  (Apparently, enough money came through in the end to fund me.)  One thing he said was that a lot of my work had been co-authored with very talented people, and it wasn't clear what my contributions were.  This was a case where, obviously, there were no recommendation letters to draw from.  Still, I was offended then by the comment, and looking back I still find it ridiculous.  At that point, I'd written multiple papers with these other authors (who were not my advisor) -- clearly they thought I was contributing something worthwhile.  And why was the assumption that they were the 60 contributor, instead of me?  It's not clear that using author ordering would have helped in this case, or that such cases are at all frequent.  But it does help me understand alternative points of view on the underlying question.

Wednesday, April 07, 2010

60-40 papers

A recent paper I worked on was a 60-40 paper.  That's what I call it when one of the authors does noticeably more of the work.  Really, it could be a 70-30 paper, or some other division; or with multiple authors, it could be a 50-30-20 paper.  But I use the phrase 60-40 paper to refer to all of these situations.  In this case, I was the 40.

60-40 papers aren't at all abnormal, and I've done enough papers not to let it bother me.  When I'm the "40" author, I usually try whenever possible to do what I can to help even things out, for example in the writing/editing/revising stages;  when I'm the "60" author, I recognize that the other authors have contributed, and the paper wouldn't be what it is without them.  I've had amusing discussions with one co-author where we ended up admitting we both thought we were the "40" author for the paper we were writing.  That was a collaboration that lasted for several papers;  apparently, we both thought we were getting a good deal.  I don't think I've been in many collaborations where multiple authors thought they were the "60", but my guess is those could be problematic. 

Fan Chung has a nice page up with advice for graduate students that I think puts the 60-40 issue in perspective.   At the end, under research collaboration:

What about the division of credit?
-- In math, we use the Hardy-Littlewood rule. That is, authors are alphabetically ordered and everyone gets an equal share of credit.
--  The one who has worked the most has learned the most and is therefore in the best position to write more papers on the topic.
--  If you have any bad feeling about sharing the work or the credit, don't collaborate. In mathematics, it is quite okay to do your research independently. (Unlike other areas, you are not obliged to include the person who fund your research.) If the collaboration already has started, the Hardy-Littlewood rule says that it stays a joint work even if the contribution is not of the same proportion. You have a choice of not to collaborate the next time. (If you have many ideas, one paper doesn't matter. If you don't have many ideas, then it really doesn't matter.) You might miss the opportunity for collaboration which can enhance your research and enrich your life. Such opportunity is actually not so easy to cultivate but worth all the efforts involved.

I'd just add a bit to this.  Usually the "60" author will, actually, get more credit in various ways:  usually they're the one to give the talk on the paper, for example.  (It can also come out in letters when really needed.) And it's not so clear that a string of 60-40 collaborations with one author repeatedly being the 60 is so bad;  without the 40, the research or the paper might not ever get done!  Good collaborations are indeed enriching.  To some, particularly graduate students, this approach and attitude might seem strange, but I recommend considering Fan's suggested understanding of collaboration.

To all the co-authors out there who have been the 60 to my 40, I appreciate your putting up with me.  And to all the co-authors who have been the 40 to my 60, as long as we had a good time working on the paper, no worries, and thanks!    

Monday, April 05, 2010

Sexual Harassment Policies (Yale v. Harvard)

My brother, in what I assume is a blatant attempt to be mentioned in this blog (Hi Steve!!!), sent me a link to the following article about a new rule (or, as the article describes it, "A Sad Day") at Yale, banning professors from having sex with undergraduates in all circumstances (not just with students who, say, are in their classes).  More details at for example the Yale Alumni Magazine.

I was all ready to start looking down my nose at the competition for being slow to adopt what are in my mind obvious rules to have, but decided to check Harvard's policy first.  (Always a good idea.)  Harvard's policy, arguably, isn't even as strong as Yale's old policy.  (Harry Lewis will, I imagine, correct me if I am mistaken in my interpretations or usage of documents.)  The relevant information seems to be here.  The policy description includes the following, under the heading UNPROFESSIONAL CONDUCT IN RELATIONSHIPS BETWEEN INDIVIDUALS OF DIFFERENT UNIVERSITY STATUS:

"Officers and other members of the teaching staff should be aware that any romantic involvement with their students makes them liable for formal action against them."

This seems to suggest that faculty can't have "romantic involvement" with their students, but an old letter to the Crimson suggests that the wording is much weaker than that (the article is here, the letter is here).  Strictly speaking (according to the letter), the wording seems to suggest that faculty members involved with students face the risk of the student filing a sexual harassment/unprofessional conduct complaint;  but if the relationship is brought to light by a third party, there's no (apparent) cause for disciplinary action.  IANAL, but this seems like a possible interpretation;  I'm not sure what the current interpretation is here at Harvard.

Indeed, later on the policy states:

"Amorous relationships between members of the Faculty and students that occur outside the instructional context can also lead to difficulties."

The rest of the paragraph suggests potential problems if Faculty engage in "romantic involvement" with students whom they are not directly teaching, but seems to make clear (by my reading) that it's not forbidden in any sense.

I've certainly heard arguments in the past that such rules shouldn't exist.  I can even see that there are potentially complicated lines -- should a professor in the Faculty of Arts and Sciences not be allowed to date a Harvard Law student?  (Extra credit:  why or why not?)  But given the potential for abuse (both intentional and unintentional) of the power relationship, I'm unapologetically on the "no faculty - undergraduate romance" side.  Or, as it says in the Yale Alumni Magazine:

'An imbalance of power forms the rationale for treating Yale College students differently from their older counterparts. Undergrads, the revised handbook says, “are particularly vulnerable to the unequal institutional power inherent in the teacher-student relationship and the potential for coercion, because of their age and relative lack of maturity.” '

Duh.  Good for Yale.

NSF Review Issues

My understanding is that the turnaround time on NSF decisions should be approximately 6 months.  (See, for instance, their own diagram of the review process.)  So I admit to getting a bit edgy after month 7 has come and gone without hearing anything on a proposal I have in.  I went back and looked at my NSF proposal history, and found a small bright side:  proposals that were accepted seemed to take longer for the decision feedback to arrive.  Sadly, this rule did not seem to be universal, and my personal sample size is too small for rigorous conclusions.  Feel free to share your own anecdotal evidence.  Meanwhile, I'll try my best to forget about it until we get to month 8.

Also, this year, I have been asked (more than once) to review a single proposal "off-panel" (that is, I didn't serve on the panel that the proposal was part of).  I can't recall having been asked to do this before, and wonder if there's a policy change behind it or if it's business-as-usual and I'm only now noticing it.  I certainly don't mind -- I'm more than happy to help the NSF, and even more happy if I can do so without having to travel to DC.  On the other hand, I worry that this approach might cause the same sort of problems that can occur with subreviewers, such as consistency across reviews.  

Friday, April 02, 2010

Energy Sustainability

I spent an entertaining hour this afternoon listening to David MacKay of Cambridge (UK) give a talk about the Future of Energy here at Harvard.  I've mentioned David in this blog before.  He did a lot of early work on low density parity check codes and deletion codes, so we've run in the same circles for quite some time.  But now, besides his well-known book on information theory (Amazon link, free downloadable version), he's written a book on sustainable energy (Sustainable Energy - Without the Hot Air : Amazon Link, free downloadable version) that was the subject of his talk.  David's also recently been named Chief Scientific Advisor to the Department of Energy and Climate Change (UK). 

The talk was based on the book.  David's starting point is the question, "What would we have to do to move to a world where we weren't using fossil fuels?"  (The "we" he's talking about is usually the UK, but it applies elsewhere as well.)  He then takes a truly scientific approach.  He considers various possible renewable energy sources (wind, solar, biomass, tides), and estimates things like their energy output per unit area.   Based on these calculations, he figures out how much land would be required.  So, for instance, if you were willing to cover 1/2 of Britain with windmills, things might look OK, but that's not a very likely possibility.  He also considers the demand side of the equation, and what might feasibly be done there.
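As a flavor of this style of calculation, here's a back-of-envelope sketch in the spirit of the book (the inputs are rough, illustrative round numbers I've plugged in, not figures quoted from the talk):

```python
# Rough, illustrative inputs -- my own round numbers, not MacKay's exact ones.
consumption_kwh_per_day_per_person = 125   # total energy use, UK-style lifestyle
wind_power_w_per_m2 = 2                    # typical wind-farm power density
population = 60e6                          # UK population, roughly
uk_area_km2 = 244_000                      # total UK land area

# Convert daily energy use to average power demand per person.
avg_demand_w = consumption_kwh_per_day_per_person * 1000 / 24

# Land area needed if wind alone supplied everything.
area_m2_per_person = avg_demand_w / wind_power_w_per_m2
total_area_km2 = area_m2_per_person * population / 1e6

print(f"Wind alone would need ~{total_area_km2:,.0f} km^2, "
      f"about {100 * total_area_km2 / uk_area_km2:.0f}% of the UK")
```

With these inputs the answer comes out to well over half the country -- the same flavor of conclusion as the "cover 1/2 of Britain with windmills" example above.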

The book (and the talk) are not overtly political.  Whether you believe in global warming or not, the question of sustainable energy is important -- for national security concerns, you might not want to be dependent on getting your energy from, for example, oil-rich countries.  His book is not about the politics;  rather, he tackles these questions as a scientist, producing the numbers that are needed for intelligent, reasoned discussion and debate on the issues.  That might sound dry and, possibly, boring, but not in David's hands.  He's blessed with a fine wit and a charming style that comes out in the book and even more so when he's speaking.  (His slide showing a collage of posters from places protesting the introduction of windmills into their community, for instance, received a lot of laughs.) 

David has done a truly rare thing as a scientist, writing a book firmly about the science of an issue of current import that people are actually reading and that is raising the level of debate.  I admire his courage in taking on a challenging assignment, and hope his work helps lead to the positive changes he is looking for.    

Tuesday, March 30, 2010

Two Inspiring Posts

UPDATE:  Make that three!

Inspiring me to blog, that is....

Matt blogs about the psychology of program committees.  It's one of those things that would be funny except that it's true.  So true.  So, so true...

One of the commenters raises an interesting option.  Every PC member gets one "trump card" to decide to accept a paper unilaterally.  (I suppose one could also allow a trump card to be used to reject a paper unilaterally.  But then what happens if two people play opposing trump cards?)  What do you think?  I like the idea in principle -- if a PC member has such a strong opinion on the paper, it should be accepted -- but I think in practice it's an idea rife with complications.  Gamesmanship in the committee about when/how to use the trump card and potential abuses from conflicts of interest come to mind.  Might you be inclined to use a trump card rather than let it "go to waste", even if you didn't feel quite so passionately about a paper?  I think if everyone used it in the manner it was intended, it could be a great idea.  On the other hand, if PCs were perfect, we wouldn't have posts like Matt's (and might not need the trump idea in the first place).

Lance complains about traveling too much.  Some academics I know would laugh at the idea that hitting 50K miles gets you anywhere close to being a too-frequent traveler, but (like Lance) that's pretty much my threshold too -- it's roughly a flight across the country or to Europe each month.  Add in some other shorter travel, and it really does add up.

At this point, I avoid trips that keep me away overnight if at all possible.  I've turned down a few colloquium talks this year because of this problem.  If we can work out a travel schedule based on my taking an early flight from Boston and a late flight back the same day, I'll definitely try to make it work.  The kids/family can manage school/dinner without me for a day.  (There's also my class to consider, of course, during the semester, but let's temporarily leave that aside.)  I'll take a late flight out the night before sometimes if needed.  But if I have to miss two dinners (or two morning walks to school) it's much, much less likely I'll do the travel.  It just adds up quickly to too much time away.  (Usually the way flights work out, you're stuck leaving early the afternoon before, killing that day, or getting back mid-afternoon the next day, killing that one.  2 days is a lot of time to devote to giving a talk.  And if I'm missing two days in a row, I'm definitely missing a class during the semester...)

Conferences obviously are different -- but it's rare I go to a conference where I don't have a paper, for essentially the same reason.  I do try to arrange longer trips, where I can bring the family -- I arrange a 1-2 week west coast Bay Area tour most every summer and try to give talks at all the major places I can manage -- but it's a difficult balancing act.  I suppose I'll want to travel more again once the kids grow up and leave the house.  I hope I'll still have work worth traveling to talk about then!

A third inspiring post, by way of Muthu:  Tim Roughgarden has won the Grace Murray Hopper Award, and Bellare and Rogaway have won the Paris Kanellakis Theory and Practice Award.  Congratulations all around!

Wednesday, March 24, 2010

Reading a Research Paper

In the category of "waste nothing", here's a handout I wrote way back when (over a decade ago) that I use for my standard graduate seminar class on How to Read a Research Paper.  Others have found it useful enough to use it (with permission) rather than construct their own.  If you like it, feel free to use it, and modify it as you like.  I even have the LaTeX around;  just ping me if you want it.

There are other pages and places with more advice on the topic, so you might want to look around.  Searching on phrases like How to Read a Research Paper (and other variants, like How to Read a Technical Paper) will yield plenty of pages.  I like the "What to Read" section at the bottom of this page from Jason Eisner -- creative Web search and forward/backtracking references are, I think, often underappreciated skills.

Thursday, March 18, 2010

Google-Viacom Documents

I don't blog about my work as an expert witness, but for those who are interested in such things, since I (currently) have nothing to do with the case, I can happily point you to the court documents now out in public for the Google-Viacom case.  You can find pointers to them here.  Both sides are moving for summary judgment;  if you haven't heard of that before, Wikipedia describes it.  I haven't read them over yet, so I don't have any opinion to offer.   

Tuesday, March 16, 2010

More on SIGCOMM

The first round of SIGCOMM is pretty much done.  (As usual, many reviews are still out, though the deadline has passed.)  I had mentioned earlier that my first round papers, in general, seemed pretty terrible.  My colleagues agreed.  This year the ratings scale was 1-10 (which I like less than the standard 5-point scale from previous years), and I had five papers that didn't get a score higher than a 3.  (A score of 3 is "Reject";  scores of 1 and 2 are below reject.  A score of 5 is still only a borderline reject, for comparison.)  In some cases, I gave a 3 and was the high score.  I did (eventually) read a couple of good papers that may well be accepted.  Hopefully, I'll see better papers in round 2.

Matt Welsh, perhaps at least partially inspired by being on the SIGCOMM PC as well, has suggested an approach for dealing with the large quantity of awful submissions by charging authors to submit;  Suresh disagrees.

Interestingly, the scores (and reviews) on my papers were generally consistent across the line, with a rare exception or two.  Since there's always some overenthusiastic anonymous commenter who thinks it's important to call SIGCOMM an insider's club whenever I blog about it, I'll repeat that I'm not aware of belonging to any such club, and once again, the reviews I see from others not only make sense, they match my own opinions, which I view as independent, to a striking degree.

Monday, March 15, 2010

Kindle Textbooks

I've had a couple of people point out to me that Probability and Computing: Randomized Algorithms and Probabilistic Analysis (my book) is now available on Kindle.  I looked around and saw that several other standard texts are now available that way as well:  Algorithm Design, Algorithms, Approximation Algorithms, Computational Complexity: A Modern Approach, and Introduction to the Theory of Computation.  (Even the omnibus The Princeton Companion to Mathematics is available on Kindle.)  Strangely, several other texts apparently aren't:  Introduction to Algorithms, Third Edition, Randomized Algorithms, Algorithmic Game Theory, and Concentration of Measure for the Analysis of Randomized Algorithms.  While I can understand that older texts might not be easily moved to a Kindle format, the Concentration of Measure book is new, so I'm not sure what it is that is separating Kindle-ized books from unKindled peers.  Anyone out there have any insights?

I suppose I'll see in my next book sales statement if any Kindle copies were sold.  In general, the Kindle versions seem to go for just a few dollars less;  Algorithm Design is a big exception, with the Kindle edition going for over $25 less than the hardback counterpart.  Sometimes, it looks like you can buy new copies of a book for less than the Kindle price (usually from Amazon third-party dealers).  I don't own a Kindle (yet), and I wonder how they would be for textbooks.  The textbooks I've listed above I'm happy to have on my shelves for reference.  I suppose if I had them in a universal, pdf-like format that I could access and make use of essentially anywhere, I'd be happy with that too.  Particularly if I had capabilities like search available.  But I wouldn't want my copies of the books tied to a particular piece of hardware.  If I lose my Kindle, do I lose my books?  That's fine for disposable books -- and probably for many students many textbooks fall into that category.  (Use them for a semester, then forget about them.)  It wouldn't be fine for me for these books.

Can anyone comment on the Kindle textbook experience?  I'm interested generally, and in the particular issue of the "permanence" of books that might be references one wants to keep.  Someday soon, I'll be getting one of these (or another e-reader), and it would be useful to know whether it's currently worth moving to a system where I try to keep important texts in an electronic, rather than paper, format.        

Saturday, March 13, 2010

And More Fun News....

Right after Stuart Shieber sent me news on CS undergrad earnings, Harry Lewis sent me two additional links.  The first is a nice Boston Globe piece on the increase in undergraduate majors in the sciences at Harvard.

The second is much more amusing.  Apparently, the Harvard Robobees project has made #1 (that's right, we're number 1!) on Sean Hannity's list of the 102 worst ways the government is spending your tax dollars.  Now, I try to stay apolitical on this blog, but I have to say, I'm impressed by Sean Hannity's lack (well, actually, more like a complete absence) of acumen in understanding the nature of scientific research.  A look at the Robobees home page, I would think, would certainly suggest that there are important scientific and engineering questions underlying the long-term challenge of building a robotic bee.  Of course, maybe the Hannity camp just objects to the government spending money on science generally, I don't know.  I'll go on record as suggesting that nobody from the Hannity camp bothered to look at the Robobee home page.

Good News for Computer Science Majors

Undergraduate computer science majors, according to the National Association of Colleges and Employers, are still getting paid well.  If you don't want to be an engineer for an oil or chemical products company, we're still apparently the best way to go.  (I'd like to think careers in CS are more interesting than in these areas as well, but I really don't have the experience to say.)

Thanks to Stuart Shieber for the link. 

Wednesday, March 10, 2010

A Conflict Question

I was recently asked the following question:

Suppose you're the PC chair, and someone who has submitted a paper asks you NOT to have the paper reviewed by a specific person on the PC. Do you honor that request?
It's an interesting question -- one I hope others will comment on -- though as a default my answer would be yes.  I certainly have had run-ins of sufficient severity with various people through the years that I would not want them reviewing my papers, and would likely ask that they not, if the issue came up.  Looking at it from the other end, if I am on a PC and those people submit a paper, I make sure not to review them.  (Usually it is sufficient simply to rank them low on my list of desired papers, but I have also told PC chairs in advance I would not review certain papers if they seemed likely to head my way.)  It is not that I actually think I couldn't give a fair review;  it's that I think it's inappropriate, in such a situation, for me to give a review in the first place.  If as a PC member I have the right (actually, I would say, a responsibility) to refuse to review a paper under such circumstances, it seems fair that a submitter can ask for a specific PC member not to review a paper as well.

Context does matter, though.  In the networking conferences I have served on, this is standard -- PC members and submitters are expected to list their conflicts.  Indeed, one issue that seems to have arisen lately is that there is suspicion that some people submitting papers are abusing this right, listing people as conflicts when they are not because they are known to be "challenging" reviewers.  While I'm skeptical this sort of gamesmanship gains anything (challenging reviewers are usually calibrated appropriately at the PC meeting), it is a concern that once you open the door to such requests, you may need to make sure the privilege isn't abused.

For theory conferences, where many people seem painfully unclear on what "conflict of interest" even means, I'd grant such a request as a matter of course.  

Tuesday, March 09, 2010

Congratulations to Chuck Thacker

The news has hit the wires -- Chuck Thacker is this year's Turing Award winner.

I had the great pleasure of getting to know Chuck while I worked at DEC SRC.  He's a character, a tinkerer, and a great and curious mind.  I think recognizing his work -- the Alto -- is a great choice.

 

EC Papers Up

The list of accepted papers for EC is up.  I'm happy to say that our Swoopo paper made the list.

I was surprised in the acceptance letter to find that there were 45 acceptances out of 136 papers -- an acceptance rate of about 1/3.  (Compare with WSDM.)  This makes it one of the less "competitive" CS conferences I know of, although a little research shows this is a bit unusual -- last year the numbers were 40/160 or so, so they accepted more papers and had fewer submissions this year.  Is that a trend in the making or an accident of timing this year?  Also, while I'm an EC newbie, the list of papers looks very interesting, with plenty of top-tier names.  "Competitive" or not, I'm expecting high quality.  I'm really looking forward to it -- and not just because its location makes it remarkably convenient for those of us in the greater Boston area.      

Sunday, March 07, 2010

Carousel (NSDI Paper)

The "final version" for our NSDI paper, Carousel: Scalable Logging for Intrusion Prevention Systems, is now up.

Here's the main idea.  In an IPS (intrusion prevention system), the logger can get overwhelmed during an attack.  Bad sources (with other related info) need to be recorded for later examination, but there's only a small amount of on-chip memory to buffer bad sources, and only a small amount of bandwidth (compared to the rate data is coming into the system) from the memory to the more robust recording infrastructure.  How do you get all, or almost all, of the sources?

In our model, bad sources are hitting the system repeatedly -- the denial of service attack setting.  On the plus side, this means you don't have to log a bad source the first time it appears - the assumption is it will come back again.  On the negative side, you want to avoid sending the same source to the recorder multiple times, as it wastes your small bandwidth.  (Dups can be removed at the recording side, though.)

The baseline solution is just to grab a bad source for the buffer as soon as you have room after sending one out.  We consider a random model -- there's a set of bad sources, and the next one to appear is chosen at random from that set -- and show that this approach is bad;  by connecting it to the coupon collector's problem, we show it can be a logarithmic factor off of optimal.  Other experiments show it can be even worse than this in realistic situations.
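To see the coupon-collector effect concretely, here's a toy simulation (my own sketch, not code from the paper):  with N bad sources arriving uniformly at random and a logger that simply records whatever it sees next, seeing all N takes roughly N ln N arrivals rather than N.

```python
import math
import random

def arrivals_to_collect_all(n_sources, seed=0):
    """Count uniform random arrivals until every one of n_sources
    distinct bad sources has appeared at least once."""
    rng = random.Random(seed)
    seen = set()
    arrivals = 0
    while len(seen) < n_sources:
        seen.add(rng.randrange(n_sources))
        arrivals += 1
    return arrivals

n = 10_000
trials = [arrivals_to_collect_all(n, seed=s) for s in range(5)]
avg = sum(trials) / len(trials)
print(f"N = {n}: average {avg:.0f} arrivals; N ln N = {n * math.log(n):.0f}")
```

The average comes out close to N ln N, which is the logarithmic-factor gap from the N arrivals an ideal logger would need.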

Our solution has two parts.  We hash-and-partition the bad sources, adaptively finding a partition size so that the bad sources in each partition fit into our small memory.  That is, we hash each source, and put it in a partition according to the last k bits of the hash.  This breaks our N bad sources into groups of size (roughly) N/2^k, and if N/2^k is small enough that each partition fits into memory, then (assuming we can avoid duplicates) we no longer have a memory problem.  We just run through all the partitions, giving us our "Carousel".

To deal with duplicates, we do the obvious -- we use a Bloom filter (or equivalent structure) to avoid sending duplicates within each partition.
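A toy sketch of the two parts together (the function names and the use of SHA-256 are mine, purely illustrative; a plain set stands in for the Bloom filter):

```python
import hashlib

def partition_of(source, k):
    # Hash the source and take the last k bits as its partition index.
    h = hashlib.sha256(source.encode()).digest()
    return int.from_bytes(h, "big") & ((1 << k) - 1)

def carousel_pass(stream, k, current_partition, seen):
    """Log sources from the stream that fall in the current partition,
    using `seen` (a set here; a Bloom filter in the real system) to
    suppress duplicates within the partition."""
    logged = []
    for src in stream:
        if partition_of(src, k) == current_partition and src not in seen:
            seen.add(src)
            logged.append(src)
    return logged

# The "carousel": cycle current_partition through 0 .. 2**k - 1, clearing
# `seen` between partitions, so each partition's sources fit in memory.
```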

This hash-and-partition plus Bloom filter framework seems like a potentially useful general trick;  more details -- including both theoretical analysis and experiments -- are in the paper.

An interesting thing came up in the reviews.  We pointed out that the "straw man" approach -- do nothing -- was quite bad.  We mentioned that just using a Bloom filter -- without partitioning -- wouldn't really help, but then ignored that option.  Apparently, this was a big concern for the reviewers;  my understanding is that it almost "sunk" the paper.  They wanted to see more details, including simulations, on this option.  Luckily, it was still accepted, and in the final version, in response to the reviews, we've added some theory and simulations to prove our point.  (The point is you'd need a really, really big Bloom filter -- too big for your memory -- to track all the sources, so eventually you have to clear the Bloom filter, which causes you to lose whatever it was going to gain you in the first place.)  I'm not sure what the takeaway is there.  The right outcome happened -- they should have accepted the paper, and we could add the appropriate stuff.  But I'd have hated for what in my mind is a minor issue to have killed the paper.  Perhaps we should have said more about it, although at some point space prevents you from providing details on every possible variation you could consider (even if you have actually considered them).    

This seems like a good place to remind people about this book -- Algorithms for Next Generation Networks -- which includes a survey about these sorts of hashing applications in networks by Adam Kirsch, George Varghese, and me.  The book does seem a little pricey;  we have a slightly older version of the survey available online.   

Friday, March 05, 2010

SIGCOMM papers

A commenter asked me a while back to say what I thought of the SIGCOMM submissions this year.

I'm finally getting around to reading and reviewing. (First round reviews aren't due for at least a week!) And so far, by and large, the papers I'm getting are pretty terrible.

This generally seems to happen to me on the first round, but this year is extreme. My first several papers just don't belong at this conference. (Arguably, they don't belong at any conference...) There's some number of papers submitted at every conference that are just not serious submissions, and apparently I got more than my fair share on the first round.

This makes it harder to judge the other papers -- it's hard to calibrate when you start with a lot of junk. Many of my other papers are theoretically oriented, and I'm not too optimistic about them. There's room for theory papers at SIGCOMM, but I think the bar is, rightly, pretty high. When I read a theoretical paper for SIGCOMM, I look for one of two things. First, it could be the paper has a nice theoretical idea that's actually useful. The problem there is that it's incumbent on the paper to clearly demonstrate the utility, and most fall down in that regard. I quote the SIGCOMM call: "SIGCOMM is a highly selective conference where full papers typically report novel results firmly substantiated by experimentation, simulation, or analysis." A mathematical analysis alone generally does not count as a firm substantiation. [Such papers generally have a better chance at INFOCOM -- which I think is a good, and very important, thing! There needs to be an outlet for more theoretical networking work, and perhaps it's just better suited for a big conference. Many such papers will end up having minimal impact, but once in a while, a good idea gets built on and has a real impact.]

Second, it could be the paper really challenges our way of thinking, introducing a new framework that seems a clearly important guide for future work. Such papers are rare, but important. I seem to have a number of economics-networking papers that are very high-level, and I'm really trying to understand if any of them have that character. Again, because SIGCOMM is so selective, I think the bar is very high for such papers. I'm really looking for something that enhances our fundamental understanding of the network.

That's it for now. If you have a submission, don't let my comments make you antsy -- the meeting is still a long way away, and I'm quite sure I'm not reading your paper anyway.

Wednesday, March 03, 2010

Congratulations to David Johnson, Knuth Prize Winner

I'm pleased to hear that David Johnson has won the Knuth Prize for "his contributions to theoretical and experimental analysis of algorithms."  David has done a great many wonderful things, but I thought I'd highlight something that I imagine is underappreciated today, his book Computers and Intractability: A Guide to the Theory of NP-Completeness.  We're a bit spoiled these days, what with Wikipedia pages with lists of NP-complete problems and online compendiums with references.  But in ye olden days, when I was a grad student, if you had a reduction you needed to ponder, you went to Garey and Johnson.  The book was an inspiration, and an invaluable research resource, to many.  It's one of the most cited (the most cited?) references in computer science, and the book will rightly hold a special place in the history of computer science. 

Tuesday, March 02, 2010

Teaching Bloom Filters

As I finished my lecture today for my undergraduate algorithms and data structures course, after spending about half an hour explaining Bloom filters and what they did, I couldn't help but wonder, yet again, why this incredibly useful, simple data structure, which also didactically demonstrates some very nice concepts like one-sided error and the ability to trade off speed/memory against correctness, isn't in the standard undergraduate textbooks.  (Of course, you can always buy this book....)

If you're teaching algorithms and data structures, do your students a favor and sneak Bloom filters into a lecture. 

Monday, March 01, 2010

Stuff in press

The Economist has a special section this month devoted to The Data Deluge.  (Much of it may still be behind their paywall.)  Many computer scientists are quoted/named in the collection of articles, and it gives a nice overview of what's going on generally.  (It makes a horrible mistake, though, by saying that Google's original advance was counting the number of links to a page to score relevance;  that ignores PageRank, and Altavista...)

Bach, Chawla, and Umboh take our previous work on the hiring problem in new directions.  Or, even, new dimensions:  they also consider multidimensional variations of the problem.  It's definitely a general problem with plenty of variations to consider;  perhaps this paper will inspire further looks at the problem.