Thursday, August 28, 2014

A shout-out today to my friend and colleague Pardis Sabeti (and her lab) for their Science article on the Ebola virus that appeared earlier today. Pardis and her group have been studying the genetics of viral diseases, in particular the Lassa virus, so they were there and ready to go to work when the recent Ebola outbreak began. They sequenced 99 Ebola virus genomes from 78 patients and have analyzed the resulting information to gain insight into how the disease is spreading and mutating. They have released the genetic sequences to the public so that the information is available to groups who could use it to help find and test cures. This is both very important and (sadly) very dangerous work. As reported, 5 co-authors of the study (health care workers, a lab technician, and the director of the national Lassa fever program in Sierra Leone) died of Ebola before today's publication.
Numerous news articles appeared today discussing this work, too many for me to link to. But the SabetiLab page here contains a great deal of information about her research and various projects. Her work in biology, which makes powerful use of statistics and computation, should serve as an inspiration for future scientists, and for anyone who wants to use algorithmic and mathematical methods in ways that benefit the world.
Wednesday, August 27, 2014
Update on SOCG and ACM
I am happy to have an update on the SOCG/STOC colocation issue that arose last week. Or, better said, Jeff Erickson has an update, the short summary of which is that it looks like there has now been some very useful clarification. The concerns of the ACM apparently are limited to direct financial support, and the conferences can (formally) co-locate. I encourage you to read the note on Jeff's blog from Paul, and let me echo Jeff's statement here:
"Needless to say, this is fantastic news! I want to publicly thank Paul and the ACM leadership for quickly clearing up this unfortunate misunderstanding."
So say we all.
"Needless to say, this is fantastic news! I want to publicly thank Paul and the ACM leadership for quickly clearing up this unfortunate misunderstanding."
So say we all.
Tuesday, August 26, 2014
New Article on arxiv on Equitability and MIC
We recently put on arxiv a new draft on "Theoretical Foundations of Equitability and the Maximal Information Coefficient". This is some follow-on work to a paper that appeared in Science a couple of years ago, where we introduced the idea of equitability. Essentially, in that Science paper (link to page where you can access the paper), we wanted a statistic that would give back, for samples from a noisy functional relationship, a score corresponding to the amount of noise (or, in that case, to the R^2 of the noisy data relative to the relevant noiseless function), regardless of the relationship type. The idea was that this would be useful in data exploration settings, where we might have a large number of possible relationship pairs and in particular a number of non-trivially correlated relationships, and we'd want to score them in some fair way across the possible types of relationships (linear, parabolic, sinusoidal, etc.), so that we could choose the most promising to look at. We also wanted the statistic to do reasonable things for non-functional relationships. And, finally, we wanted a pony. (But we couldn't find a way to put that in the paper.) The maximal information coefficient (MIC), which we built on top of mutual information, was our proposed statistic.
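To give a concrete (if crude) sense of the statistic, here is a minimal sketch of a MIC-style computation in Python. To be clear, this is not the algorithm from our papers: the real MIC also maximizes over where the grid lines fall, while this hypothetical simplification just uses equal-frequency bins on each axis and maximizes over the grid dimensions.

```python
import numpy as np

def normalized_mi(cx, cy, a, b):
    """Mutual information (in bits) of binned data, normalized by log2(min(a, b))."""
    joint = np.zeros((a, b))
    np.add.at(joint, (cx, cy), 1.0)          # joint histogram of bin indices
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)      # marginal distributions
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    mi = (pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum()
    return mi / np.log2(min(a, b))

def mic_sketch(x, y):
    """Crude MIC-style score: maximize normalized mutual information over
    grid sizes (a, b) with a * b <= n**0.6, using equal-frequency bins.
    (Real MIC also optimizes the placement of the grid lines.)"""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    budget = max(int(n ** 0.6), 4)
    rx = np.argsort(np.argsort(x))           # ranks give equal-frequency bins
    ry = np.argsort(np.argsort(y))
    best = 0.0
    for a in range(2, budget // 2 + 1):
        for b in range(2, budget // a + 1):
            best = max(best, normalized_mi(rx * a // n, ry * b // n, a, b))
    return best

# A noiseless functional relationship should score near the top of the scale,
# while independent noise should score much lower (though above zero, since
# we are maximizing over many grids).
x = np.random.rand(500)
print(mic_sketch(x, np.sin(6 * x)), mic_sketch(x, np.random.rand(500)))
```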
The paper has gotten some interest. One thing that we heard was that people wanted a richer theoretical framework for these ideas. So now we're finally delivering one. It took a while, because the students involved -- Yakir Reshef and David Reshef -- were off doing crazy, wacky young-people things like going to medical school, making it hard to get cycles for the project. On the other hand, the time did some good, allowing us to explore and settle on the formulation we wanted. The result is, I hope, an interesting mix of ideas from statistics and computer science. We're eager for feedback, as we hope to formally submit somewhere soon.
In a couple of weeks we should have another paper out on the same topic that is more empirical. Naturally, when working through the theory, we came up with better algorithms for computing MIC, and it made sense to separate those results (and some others) into another paper.
Saturday, August 23, 2014
Is the ACM "Retaliating" Against SOCG?
Friday afternoon Jeff Erickson posted some "bad news" at the Making SOCG blog. Some background: very recently, the Symposium on Computational Geometry, or SoCG, decided to leave the ACM, for various reasons. There had been plans in the works for SoCG to be co-located in 2016 with the Symposium on the Theory of Computing, or STOC, one of the flagship general theory conferences. STOC is an ACM conference. Reportedly, Paul Beame, chair of SIGACT (the ACM theory special interest group), sent a note that included the following (emphasis added by Jeff):
"With SoCG leaving ACM we have been told that SIGACT and ACM conferences cannot have any formal arrangements at all with the new conference or do anything official that might support it. (This decision was made at the very top of ACM and not by the staff.) This rules out any joint sessions... It also means that SIGACT will have to end our participation in this formal coordinating group."

While I await more information and do not want to rush to judgment, if this description is true, I find it unacceptable behavior on the part of the ACM. Their job is to promote computer science. (The ACM home page begins with: "ACM, the world's largest educational and scientific computing society, delivers resources that advance computing as a science and a profession.") Their job is not to try to monopolize and control the computer science conference market.
I encourage everyone to read Jeff's post (here's the link again). Obviously, this is an issue for Computational Geometry, and an issue for Theory. But I would say it's an issue for all ACM members, and in particular those that publish research at ACM conferences. It would be nice if, should it become necessary, members from other areas in computer science voice their displeasure with ACM's actions in this case.
Thursday, August 21, 2014
Hashing Summer School
Back in July I took part in the Hashing Summer School in Copenhagen. This was nominally set up by me, Rasmus Pagh, and Mikkel Thorup, though Mikkel was really the host organizer who put it all together.
The course materials are all online here. One thing that was a bit different is that it wasn't just lectures -- we really did make it more of a "summer school" by putting together a lot of (optional) exercises and leaving time for people to work through some of them in teams. I am hoping the result is a really nice resource. The lecture videos are online, along with the slides and exercises. Students could go through whatever parts they like on their own, or people might find the material useful in preparing their own lectures when teaching graduate-level topics in hashing.
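As a small taste of the kind of material such a course covers, here is the classic multiply-shift hashing scheme in Python. To be clear, this is a standard textbook construction that I'm including for illustration, not code taken from the course materials.

```python
import random

w, l = 64, 20                    # key bits and output bits (illustrative sizes)
mask = (1 << w) - 1              # emulate w-bit machine-word arithmetic
a = random.getrandbits(w) | 1    # uniformly random odd w-bit multiplier

def multiply_shift(x):
    """Hash a w-bit key x to an l-bit value: the high l bits of a*x mod 2^w."""
    return ((a * x) & mask) >> (w - l)
```

Over the random choice of the multiplier, any two distinct keys collide with probability at most 2/2^l, which is good enough for many applications and extremely fast in practice.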
Tuesday, August 19, 2014
Reviewing Scales
I'm just about finished reviewing for CoNEXT (Conference on Emerging Networking Experiments and Technologies), and am starting reviewing for ITCS (Innovations in Theoretical Computer Science). One notable variation in the process is the choice of the score scale. For CoNEXT, the program chairs chose a 2-value scale: accept or reject. For ITCS, the program chair chose a 9-point scale. Scoring from 1-9 or 1-10 is not uncommon for theory conferences.
I dislike both approaches, but, in the end, believe that it makes minimal difference, so who am I to complain?
The accept-or-reject choice is a bit too stark. It hides whether you generously thought this paper should possibly get in if there's room, or whether you really are a champion for the paper. A not-too-unusual situation is that a paper gets (at least initially) a majority of accept votes -- but nobody really likes the paper, or has confronted its various flaws. (Or, of course, something similar the other way around, although I believe the first case is more common, as it feels better to accept a close call than to reject one.) Fortunately, I think the chairs have been doing an excellent job (at least on the papers I reviewed) encouraging discussion on such papers as needed to get us to the right place. (Apparently, the chairs aren't just looking at the scores, but reading the reviews!) As long as there's actual discussion, I think the problems of the 2-score solution can be mitigated.
The 9-point scale is a bit too diffuse. This is pretty clear. In the description of score semantics we were given, I see:
"1-3 : Strong rejects".
I'm not sure why we need 3 different numbers to represent a strong reject (strong reject, really strong reject, really really strong reject), but there you have it. The boundaries between "weak reject", "a borderline case" and "weak accept" (scores 4-6) also seem vague, and could easily lead to different people using different interpretations. Still, we'll see how it goes. As long as there's good discussion, I think it will all work out here as well.
I prefer the Goldilocks scale of 5 values. I further think "non-linear" scoring is more informative: something like top 5%, top 10%, top 25%, top 50%, bottom 50%, but even scores corresponding to strong accept/weak accept/neutral/weak reject/strong reject seem more useful when trying to make decisions.
Finally, as I have to say whenever I'm reviewing, HotCRP is still the best conference management software (at least for me as a reviewer).
Monday, August 18, 2014
Back to Work
Harvard classes start up in a few weeks, and officially, my sabbatical is over. I'm back in my office, trying to get back into a Harvard routine.
I notice that I've been very light in posting over my sabbatical. After my term as chair, I was enjoying being in the background, hidden away a bit. I'm not sure if I'll get back into blogging -- it seems already to be a technology of the past -- but I figure I'll start again and see what happens.
So some short notes. On my to-do list is to go cover to cover through Ryan O'Donnell's new book Analysis of Boolean Functions; Cambridge University Press was nice enough to send me a free copy, which they do with books from time to time. For those who have been following, he's been releasing the book in chapters online, so you already know it's good. He's made the book available online as well, but it's nice to have a copy for my bookshelf. It's a beautiful book, both in content and in how it's all put together. My one thought (so far) as I've started my way through is that, stylistically, to me, it reads like a "math book" more than a "CS book", whatever that means. That's not meant to be a complaint, just an observation.
Boaz Barak accidentally made me laugh on his Updates from ICM 2014 post, well worth reading, when he writes:
"Candes’s talk was an amazing exposition of the power and importance of algorithms. He showed how efficient algorithms can actually make the difference in treating kids with cancer!....
Hearing Candes’s talk I couldn’t help thinking that some of those advances could perhaps have been made sooner if the TCS community had closer ties to the applied math community, and realized the relevance of concepts such as property testing and tools such as the Goemans-Williamson algorithm to these kind of questions. Such missed opportunities are unfortunate for our community and (given the applications) also to society at large, which is another reason you should always try to go to talks in other areas."
I think the larger issue is the slow but (over long periods) not really subtle shift of the TCS community away from algorithmic work and practical applications. I'm all for going to talks in other areas, but I think the issue is a larger-scale problem.
I'm working on a new class this semester, which I'm sure I'll write more about, but one thing I'd forgotten is how hard and time-consuming it is to construct a lecture. Maybe some of it is a function of me getting slower, but going through all the possible pieces, picking what you think are the right ones, making sure you've got all the details, and then (for me) writing a "script" of what you plan to go over -- it takes time. (Good thing I'll likely be teaching this new class for a few years.)
Plenty more to talk about -- reviewing, some new/old papers, some summer travel. So we'll see if I can get back into a blogging state of mind.