Tuesday, June 10, 2008

Scientific Citations

A comment over at Lance's blog got me thinking about citations. To summarize, an anonymous commenter said that if a paper isn't easily available online (specifically, "not available for free online (or through my acm portal subscription)"), they won't read or cite it; another commenter then pointed out that it is the author's responsibility to acknowledge relevant work, give proper credit, and avoid duplicating previous work.

I certainly agree in spirit with the second comment, but I wonder, exactly, what are our responsibilities as authors? How much hunting, exactly, am I as an author expected to do to find relevant work? This is certainly an issue I've faced in my own work. For example, I've had the case where there may be a related article in an old Russian mathematics journal -- an article pops up in my search of related keywords and either the title or abstract seems potentially relevant, but I can't really tell without getting the article. So far, I've managed -- the blessings of always being near the Berkeley, Stanford, Harvard, or MIT libraries -- but it has sometimes been a non-trivial effort to track it down. In the old days, that library time was more or less expected. What expectations should there be in terms of tracking down old paper copies? What expectations should there be in terms of what an author is required to "spend" to get copies of possibly related work? I do think there is a reasonable argument that can be made that if your paper isn't freely available, an author can't necessarily be expected to cite it.

And of course I wouldn't have even faced the tracking problem except that I try to be diligent in my searches for relevant work. Working across areas, I've often found I have to spend some time guessing and doing some random walks to find out what people in another area call the concept I'm thinking about just to find the relevant papers. How much searching in Google/Google Scholar/your tool of choice should be expected of us? (I'm thinking here of the really annoying reviews I sometimes get of the form, "You should have cited this paper, I'm going to suggest rejecting your paper." That's inane. Perhaps I should have cited the paper, in which case, you should suggest that I cite the paper; that's what reviews are for.)

After all this, there's still the question of what should be cited. Science rules seem much "looser" than what I've seen in literature, history, etc. I'd never think of citing Karp or Garey and Johnson if I was showing a standard NP-completeness result (unless there was a very specific reason to do so) because it's now considered common knowledge. I think in many humanities fields that would be considered improper. Perhaps standards for various fields should be codified -- if only so that people in one field can easily understand the practices in another.


Daniel Lemire said...

One interesting point that you forget is... what happens with work published on the Web? In the good old days, only library documents were citeable. Sometimes you gave credit to someone using the label "private communication".

However, these days, you will find many excellent (and sometimes not so excellent) papers directly on the Web. Sometimes they have been submitted to journals a few years ago. However, not all good papers published on the Web end up in journals or conference proceedings. The author may die. The author may give up on the academic system. The author may drop out of his Ph.D. program, and so on. Or the paper may not be good enough to make it into a good journal, but it may contain interesting ideas worth citing.

As time goes by, I tend to cite many more informal sources. I am sure I am not alone. I now cite an arxiv paper on average once per paper, or more.

Anonymous said...

Please put a Garey and Johnson cite in, occasionally. Some of us read theory papers only rarely and need to be reminded when we can pull that book to get treatment on a "well-known" reduction. It's another 2 lines in one column and you'll save my poor, weak brain a few steps. http:// refs are fine, too. Everyone knows they might go away, and reviewers will take that into account. Great blog!

Mihai said...

As you imply, not everybody has libraries like those of MIT or Harvard. One of the aspirations of theory (I think) is to not be elitist institution-wise. This is best served by explicitly telling people that if their paper is not accessible to a student at a poor European or Indian university, they have no right to be cited.

In other words, I find it more important for our field to reach out geographically than to spare some people the effort of putting their papers on their websites or the arXiv.

Anonymous said...

"This is best served by explicitly telling people that if their paper is not accessible to a student at a poor European or Indian university, they have no right to be cited."

This is about as ridiculous as telling people that if they don't have a big library, then they aren't qualified to do research.

In practice, I do not believe this is a huge issue. I've been to school at places with enormous libraries and worked at places without them. I'm sure it's worse in third-world countries, and of course we should be forgiving of bibliographic lapses by people without access to good libraries, but it doesn't change the fundamental facts. If someone rediscovers known results, they have an obligation to cite the original papers. Unless there's some excuse, such as a new proof or new applications, previously known results shouldn't be published, even if the author was unaware of the history of the problem until the paper was already written. If this is discovered after a paper is published, an acknowledgement of priority should be published, to keep from confusing the written record. I suppose some conferences don't publish such things, but in any case it's unethical to keep a paper on your web page with no indication that new references have come to light.

My impression is that 90% of the issues with missing citations have one of three causes:

(1) People are lazy, and don't want to put in an hour or two looking through bibliographies and databases. After all, non-experts can't distinguish between not searching and searching with no results. (The hard part for people without good libraries is usually figuring out whether a handful of references are really relevant. I admit this can be hard, but there's no excuse for not at least reaching that stage.)

(2) People are corrupt, and fear that if they discover their results are known, they'll just lose a publication. They make a conscious decision not to look for what they don't want to see.

(3) People are xenophobic, and recognize that if anyone knows their results already, it's not their friends and colleagues. Instead, it's probably someone in a different subfield, and maybe a different country. Who cares if such a person loses credit? Plus, it's easy to rationalize. You and your friends don't know the result, which means nobody important knows it. Surely, by rediscovering it and popularizing it among the in crowd in your subfield, you're really doing the world a big service. It's awfully churlish of your predecessors to claim credit for having done the same in their subfields.

"One of the aspirations of theory (I think) is to not be elitist institution-wise."

This is exactly why proper credit is important. Most neglected papers don't come from Berkeley or MIT, but rather from out of the way places.

I fully agree that papers should be made freely available online (and of course all of mine are), but trying to rewrite the citation rules to encourage this is just silly. Instead, universities and funding agencies should set the rules. Professors are being paid to do research, and one of the conditions of the job/grant should be that all research results must be made freely available.

Incidentally, this whole discussion has also left out the aspect of asking experts. Whenever you come up with a new result, you should e-mail experts about it, including people in other subfields you think might care. If the result is new and interesting, you'll have done them a service by letting them know (and you'll get more attention yourself). If the result is not new, someone may tell you. If you can't think of many people who would be interested to receive such an e-mail, or if you are too afraid of embarrassment because the result might be known, then you have no business submitting it for publication until you've resolved these issues. Otherwise, you are good to go.

rgrig said...

"This is best served by explicitly telling people that if their paper is not accessible to a student at a poor European or Indian university, they have no right to be cited."

Let's split papers into two (rough) categories:
(1) those written before 1995 and
(2) those written after 1995

If you don't do your research in the library you are penalizing papers in category (1) much more than those in category (2). Yet, somehow, you suggest that not spending much time in the library will coerce those in category (2) into a behavior that favors poor countries.

Mihai said...

The way I understood this, it was not about being given credit for rediscovering something. I doubt anybody would argue for such an idea.

This was about whether you should cite some more-or-less related work. If my paper came together without me reading some paper X (which I didn't because it was not online), and my paper is not obsolete because of X, then why am I citing X? We often cite such papers because we want to provide more context and to raise awareness about the role of X in such a context.

But if people without a big library nearby cannot find X, its role in context is smaller, and I may decide not to cite it.

Anonymous said...

I've found that, in general, computer science papers tend to be particularly bad at citing well and broadly. I think the laziness reason offered above is very important, especially coupled with the fact that, as a community, we do not seem to be enforcing standards of scholarship.

I think every author has the responsibility of contextualizing the work being presented, not just to establish originality but to make it easier for later readers to make connections. Definitely I agree that there has to be a limit to the amount of time that can be invested in this aspect of paper writing, but some of the papers appearing or landing on my desk for review make me feel that even a small amount of time has not been put in.

I don't think papers should be rejected for citing badly (modulo dishonesty, of course) but I think PC chairs and journal editors should make it clear to authors that a bad related work section cheapens the venue the paper is appearing in, and hence cannot be tolerated.


Anonymous said...

It seems reasonable to me that it is quite handy to work with papers which are freely available. Luckily, most of the young authors publish their papers online as they want to push their reputation. So far, so good.

I'm wondering about the role of publishers in this process. When submitting the final version, one needs to sign a copyright form that often does not deal with online publication by the author(s). Publishers want to do business; they want to sell their (expensive) proceedings and journals. Copies of papers can be bought, for instance from the IEEE, either directly or by paying for an IEEE Xplore subscription. When papers are published freely by the authors, there is no need to pay for papers.

I'm not aware of any author having trouble with copyright violations so far, but I'm wondering whether publishers really do not react to this. Is it really okay to publish all of the papers an author wrote?
Personally, I like the idea of the academic web being a PDF web of papers. I extensively use papers which are freely available on the web. But as an author, I'm sometimes wondering if it is really okay to publish all the work one did... (as long as publishers do not adapt to the new situation... To me, the PDF appears to require a change in business models similar to the one the music industry faced after the introduction of MP3). Maybe I'm wrong and this is not (yet?) a real problem. Thus, I'm looking forward to some interesting comments on this.


Anonymous said...

A constructive outcome of this discussion would be to advance the cause of open access archives like the one that the Harvard faculty agreed to establish. Self-publishing on the Web is a problem, for the reasons cited; an important result may become unavailable if the author loses interest in maintaining the web site or becomes unable to do so. By having the university take over this responsibility, we solve that problem, and get other benefits too, for example that the university can better stare down the journal publishers who say they won't publish without having copyright assigned to them.
If we work on getting this archiving modality to be the de facto standard, in a decade or two Michael won't have to raise the issue a second time.

Anonymous said...

Conferences are not, or should not be, in the business of giving credit (who did it first); they exist merely for the dissemination of ideas.

Anonymous said...

It drives me batty that I have to pay Springer $35 for the article that is the linchpin of Harsanyi's Nobel prize. [I refuse to pay it, despite how desperately I want to read it; I'll find it, in paper, somewhere]