My Biased Coin

Friday, September 20, 2019

Changing Times

An old friend from college sent me an e-mail, and it got me thinking. When I was an undergraduate at Harvard some significant number of years ago, I took the graduate level algorithms course offered by Michael Rabin and the graduate level complexity course by Les Valiant. There were maybe a half dozen people in each of the classes. (They were great classes, of course. But CS at Harvard back then was really, really small.)

This semester, I'm teaching the graduate level course on randomized algorithms and probabilistic analysis. Right now, the enrollment is 74 students; well more than half are undergraduates. Somehow, that says something to me -- about how the field has grown, and in at least some regards how Harvard has changed. And about how much more prepared students are these days for these kinds of classes. (Knowledge or probability is much more prevalent.) Class sizes have been creeping up for so long that while it's noticeable year-to-year, it's much more stark and remarkable when I think back to my own time in college.

Of course, it's also on my mind because it's a pain teaching a graduate class that large. But it's a pain I can live with -- if I didn't like teaching, I wouldn't have become a professor. And it's gratifying, if not a little bit shocking, that there's this kind of interest in the subject I really love, that I've been excited by for decades.

Saturday, September 07, 2019

Off to ALGO/ESA 2019

I'm shortly hopping on a plane to head to ALGO/ESA. I'll be giving a survey-ish talk on Learning Augmented Algorithms, covering my work so far in the area as well as some of the work by others. I think it's a highly promising direction fitting in the framework of Beyond Worst Case Analysis, so I'm excited to give the talk, and hoping it's still a novel enough area to be new to most of the audience.

For those of you who are there, feel free to say hi -- I'm looking forward to talking to people.

Wednesday, September 04, 2019

Happy New Academic Year: Teaching Randomized Algorithms

It seems I haven't written on this blog for a while.

Today was the start of a new semester. I'll be teaching Randomized Algorithms and Probabilistic Analysis, using the new edition of my book with Eli Upfal as a base, and throwing in other material. (Everyone should buy the book! Here's a link.)

It's a graduate level class, but generally designed for first year graduate students, and there were a lot of undergrads "shopping" it today. (We don't do pre-registration at Harvard, and students get the first week to choose classes, known as shopping.) So many that people were standing out the doors of the room. But because we have a bit of a shortage of classes this semester, I'm guessing there's a good fraction of students just checking it out. We'll see Thursday, but for now I'll predict we'll fit in the classroom, and wait to see if I'm wrong. (If I'm wrong, that's wonderful too.)

It's been four years since I last taught the course, so this time I'm trying something new. When I've previously taught the course, I tried to make the class inviting and friendly by telling the class we'd begin without assuming the class knew probability, and so the first couple of weeks would be reviewing basics (like, say, linearity of expectations and union bounds), albeit in a CS algorithms context. This time, I let the class know I'm assuming they know (or will pick up) basic probability, and so they should read chapters 1-4 on their own, and we'll start with Chapter 5, Balls and Bins models. Over the last decade, I've seen a huge shift in probability knowledge -- Stat 110, Harvard's probability course, has become one of Harvard's biggest classes. Many students have already taking AI or ML or even data science courses where they've done some (further) probability. It feels appropriate (and safe) to assume people entering in the class know probability, or can review what they need on their own, and start the class further along.

Now finally, a request. It's actually hard for me to teach when using this book, because I don't want to just read the book to the students. That's boring. On the other hand, if I thought something was important, I most likely already put it in the book. We have to mix up the standard lecturing format a bit. So two things we'll be doing are

1) doing some "puzzle problems" at the beginning of most classes, so people can try to solve problems. (Kind of a flipped classroom approach, but not a full commitment.)
2) reading papers, related to the class topics.

So if you have any good suggestions of probability puzzle problems, or readable papers (particularly application papers) that use relatively basic probabilistic analysis in neat ways, send them over. I've got a semester to fill.

For curious people, here's one of today's starting problems, which I first learned about in graduate school. (I'm pretty sure I owe thanks to Claire Kenyon for teaching it. I'll link to the corresponding Wikipedia page on the problem maybe later.)

After lunch, Bob suggests the following game to see who pays. Alice and Bob will each choose a different sequence of three flips. (So they could choose "Heads-Tails-Heads'', or "Tails-Tails-Tails'' for example.) After they choose, a fair coin will be tossed until one of their sequences appears as a consecutive subsequence of the coin tosses. The player whose sequence appears first wins. (Note that if they choose the above sequences, and if the flips come up Heads-Tails-Tails-Tails, the player that chose Tails-Tails-Tails would win as soon as their subsequence appears; it's not three flips, then start over again.) Bob politely says that Alice can choose first, and after she chooses and tells him her sequence he'll choose a different sequence. What should Alice choose?

Wednesday, January 09, 2019

ANALCO, SOSA, SODA post

I spent the last few days at SODA-ANALCO-ALENEX-SOSA in San Diego. (Nice location choice, I'd say!) Here's some news.

This will be the last ANALCO (Analytic Algorithms and Combinatorics). Apparently submissions have been decreasing, so they've decided it will halt and the work on these topics will go into SODA and other conferences. I'm not sure how to think of it -- I think we as a community have far too many conferences/workshops generally, but I think the SODA model of having ANALCO and ALENEX (and now SOSA, I imagine) folded in cleanly into the main conference is an excellent model. I also like the ANALCO topics. But I can understand the time may have come to do something else. Thanks to everyone who worked to organize ANALCO and keep it going these many years.

It looks like SOSA (Symposium on Simplicity in Algorithms) will be taking its place in the SODA lineup. I co-chaired the symposium with Jeremy Fineman this year, the second for the symposium. I was surprised by the high quality of the submissions, and was then further surprised by the strong turnout at SODA. The room was quite full for the Tuesday afternoon sessions, and there were easily 75+ people at several of the talks. I do think there's a need for SOSA -- no other workshop/conference hits the theme of simplicity in our area, and it's a really nice fit with the rest of SODA. I'm hoping it will last, and in particular that they'll continue to have a good number of high quality submissions, but that depends on all of you. Ideally, there will be a positive feedback loop here -- now that there's a good home for this type of work (besides notes on the arxiv), people will be more inclined to write up and submit things to SOSA. For Tuesday's talks, I'll call out Josh Alman's great presentation on "An Illuminating Algorithm for the Light Bulb Problem" as my favorite for the day.

With ANALCO exiting, though, I think there's more room for additional satellite events at SODA, so hopefully some people will get creative.

If I had thought about it I should have live-blogged the business meeting. I'd say as highlights, first, Sandy Irani presented the report of the ad hoc committee to combat harassment and discrimination in the theory of computing community. (See here for the report.) There was an overwhelming vote to adopt their recommendations going forward. It's good to see progress in addressing these community concerns. Second, Shuchi Chawla will be the next PC chair, and she brought forward a plan to have SODA PC members be allowed to submit papers (with a higher bar) that was voted on favorably as well.

I suppose the last note is that Jon Kleinberg's invited talk was the conference highlight you expect a Jon Kleinberg talk to be, with interesting results and models related to fairness and implicit bias.

Thanks to SIAM and all the organizers for their hard work.

Wednesday, December 05, 2018

NeurIPS 2018 Post

Today was my first day at a NeurIPS conference. Advertising note, before my thoughts on the experience:
Tomorrow I'll be stationed at a poster 10:45 AM -- 12:45 PM @ Room 517 AB #169
A Model for Learned Bloom Filters and Optimizing by Sandwiching
and at the same time Diana Cai will be stationed at a poster I'm involved with:
Room 210 #26
A Bayesian Nonparametric View on Count-Min Sketch
Please stop by to talk to me if you're around -- Room 517 seems to be a bit off the beaten path and I'm concerned I'll be lonely with nobody to talk to. And of course stop by to see Diana too.

NeurIPS is just huge. It's like an extremely large academic conference (of 1000-2000 people) glued together to an extremely large industry conference (of more than that). Just the academic part (talks and poster sessions, 3 parallel sessions for talks) is a lot, and then there's a trade show with booths from 50+ companies there too.

I think I'd like the academic part more if I was in the area -- I'm coming in from the algorithms perspective, and there's a bit of language gap. It seems to me that conference suffers from some of the standard aspects of large conferences -- with scope and size that big, you have to look and find the things that are interesting and important to you, because a fair bit probably isn't. And while I can't tell entirely myself, I'm told by others that, given the size, there's not a lot of "junk" -- the work seems good-on-average. Also, given the conference size, it seems well organized -- staff managing the people flow, they keep things on time, plenty of room in the poster session (with appropriate food and drink stuff). I can appreciate the work that must go into to making something this big work.

Because the conference is so big, I've run into a good number of "theory people" here. As a percentage of attendees, we're probably small, but it's a good number because it's so big. I kept running into people at the poster sessions, which was nice.

The trade show part is impressive in its own way. If you didn't know AI was big, this would tell you. All the big players are there, but there are at least a dozen machine-learning focused companies I haven't heard of, and that's not including the dozen or more consulting/Wall Street/hedge fund firms that are big enough into AI that they want to have a presence here. I understand there's lots of networking and pre-interviewing and interviewing going on. On the positive side, I feel like if I wanted to leave academia, I could land a job with a variety of AI-based companies, beyond the big companies.

Hopefully tomorrow will be interesting as well.

Thursday, October 25, 2018

Rabin Postdoctoral Fellowship Advertising Post

Michael O. Rabin Postdoctoral Fellowship in Theoretical Computer Science

Deadline for full consideration: December 3, 2018. Applications can be submitted here:

https://academicpositions.harvard.edu/postings/8512

The Harvard John A. Paulson School of Engineering and Applied Sciences at Harvard University seeks applicants for the Michael O. Rabin Postdoctoral Fellowship in Theoretical Computer Science. The standard duration of the Rabin Fellowship is two years. Rabin Fellows will receive a generous salary as well as an annual allocation for research and travel expenses.

Past fellows are Mika Goos and Aviad Rubinstein and the current fellow is Alexander Golovnev.

We are looking for exceptional junior scientists in theoretical computer science, broadly construed. Rabin Fellows will be provided with the opportunity to pursue their research agenda in an intellectually vibrant environment with ample mentorship. While interaction with Harvard faculty, students, and visitors is encouraged, Rabin Fellows are free to pursue their own interests. Candidates are required to have a doctorate or terminal degree in Computer Science or a related area by the expected start date.

Required application documents include a cover letter, research statement, CV (including a list of publications), and names and contact information for three references. We recommend that papers mentioned in the CV be available online on your homepage or on electronic archives. Applicants will apply on-line at the above address. We encourage candidates to apply by December 3, 2018, but applications will be accepted until the position is filled.

Harvard is an equal opportunity employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability status, protected veteran status, or any other characteristic protected by law.

Monday, September 10, 2018

Lenore Blum interview

I've just read this, and do not have any specific comments, but I thought this interview with Lenore Blum on her recently announced resignation from CMU would be something of interest to the larger community. (Thanks to Margo Seltzer for forwarding it to me.)

Monday, July 23, 2018

Symposium on Simplicity in Algorithms, Submission Server

A reminder that the deadline for SOSA 2019 is August 16. The submission server is live and running (and we already have some submissions!). The easychair link is https://easychair.org/conferences/?conf=sosa2019 , and general information can be found at https://simplicityinalgorithms.com/ .

Friday, June 22, 2018

STOC Lunch Sign-ups, and Reminder

Just a reminder that next week is the 50th STOC in Los Angeles! You can still come and register on site, so please come -- the program is STOC-full of good stuff. Especially all you Californians -- no excuse not to be there.

An innovation introduced at last year's STOC was the junior/senior lunch meet-up. Senior people sign up to meet with junior people over lunch at a day of their choosing, and then (three) junior people (undergrad/grad/postdoc) can sign up to go to lunch with the senior person. The goal is to increase chances for junior people to get to network with more senior people, something that we hard heard was lacking previously. (Everyone buys their own lunch.) You can find the on-line sign-up sheet here. Please go on and sign up as soon as you can!

Tuesday, May 29, 2018

Sympoisum on Simplicity in Algorithms, 2019

Jeremy Fineman and I are co-chairing the 2019 Symposium on Simplicity in Algorithms, the second year of this new endeavor. It's nicely co-located with SODA, and the submission deadline is August 16.

This symposium arose of an apparent need to have a home for results that were focused on simplicity -- specifically, a place where you could introduce a simpler algorithm or a simpler analysis of an existing algorithm for a problem. I think we're still in the stage for testing that need, but it would be nice to make this a permanent addition to our conferences, and like ALENEX and ANALCO, it's a great fit as a co-located SODA event.

Please spread the word, and submit your papers -- we look forward to reading them. More call information is available here.

STOC 2018, Early Reg Deadline

Just a reminder that the early registration for STOC 2018 is June 1. Besides wanting that early registration discount price, you really want to also make your hotel reservation -- the conference hotel special rate is (only) available through June 1. I was looking this morning, and as the web page warns, hotels in LA that week are going to be expensive -- there's an award show in town that will be filling the hotels.

So please register today. (Or tomorrow. But by Thursday for sure.)

Monday, May 07, 2018

SIGACT-Related Stuff

Some SIGACT related-stuff of import:

First, there's an election going on!!! If you're an ACM/SIGACT member, you've probably gotten an e-mail about this, and it's probably gone into some folder that you probably never look at. Please check your e-mail for a subject like "ACM SIG 2018 Election". I believe the deadline is the first week or June or so. Please vote! (Bios and related materials for people running can be found here.)

Second, the Knuth Prize call is out. (It should be on the SIGACT web page shortly, but the main info is cut and pasted below.) July 1 deadline is perhaps the most important part.

Third, keep in mind the early registration/hotel deadline for STOC (which runs June 25-29 in Los Angeles) is coming up, June 1. It will be another "theory fest", so expect some great keynote speakers and invited talks, workshops and tutorials, poster sessions, and other great events. If you went to STOC last year in Montreal, you know the "theory fest" is a different creature than previous STOCs, and that you don't want to miss it. Also, it's the 50th STOC, so there will be some events specifically for that. STOC is the theory conference you really want to go to -- sign up and make sure to get the early registration fee.

Knuth info below:

Nomination Procedure. Anyone in the Theoretical Computer Science com-

munity may nominate a candidate. To do so, please send nominations to

knuth.prize.2018@gmail.com by July 1st, 2018. The nomination should

state the nominee’s name, summarize his or her contributions in one or two

pages, provide a CV for the nominee or a pointer to the nominees webpage, and

give telephone, postal, and email contact information for the nominator. Any

supporting letters from other members of the community (up to a limit of 5)

should be included in the package that the nominator sends to the Commit-

tee chair. Supporting letters should contain substantial information not in the

nomination. Others may endorse the nomination simply by adding their names

to the nomination letter. If you have nominated a candidate in past years, you

can re-nominate the candidate by sending a message to that effect to the above

address. (You may revise the nominating materials if you so desire).

Criteria for Selection. The winner will be selected by a Prize Committee

consisting of six people appointed by the SIGACT and TCMF Chairs, see below

for the composition of the committee. All nominations will be considered by

the Committee, including those submitted in previous years, but nomination is

not a requirement to receive the Prize. Note that the Knuth prize is awarded

to a single individual each year. Nominations of groups of researchers will not

be considered.

In selecting the Knuth-Prize winner, the Committee will pay particular at-

tention to a sustained record of high-impact, seminal contributions to the foun-

dations of computer science. The selection may also be based partly on ed-

ucational accomplishments and contributions such as fundamental textbooks

and high-quality students. The award is not given for service to the theoretical

computer science community, but service might be included in the citation for

a winner if appropriate. The current prize committee consists of Avrim Blum

(TTIC), Allan Borodin (U. Toronto), Alan Frieze (CMU), Shafi Goldwasser (UC

Berkeley), Noam Nisan (Hebrew U.), and Shang-Hua Teng (USC, chair).

Sunday, April 15, 2018

New Papers/Code for MIC and MINE

Several years ago, I worked on a project where the goal was to try to come up with an "equitable" version of a measure of dependence; the idea was you could take a large multi-dimensional data set, score the dependence for each pair of variables, rank the pairs by their score, and then look at the top-scoring paris to try to determine the most interesting relationship to follow up on in further work. We were motivated by the need for data exploration tools for multi-dimensional data sets.

After a large number of years, we've updated the site http://www.exploredata.net/ , with some (finally) recently published papers, and new versions of the code that are faster, more accurate, and can do additional tasks (what we call TIC as well as MIC). Our technical information subpage has links to papers, including the relatively recent papers in JMLR and the Annals of Applied Statistics. Our MINE-Application page contains links to our new version of the code, as well as links to other versions (such as minepy, a library that has APIs in python and Matlab).

The incentive for all this was, in part, one of the co-authors, Yakir Reshef, finishing up his PhD thesis. Congratulations Yakir!

Wednesday, April 11, 2018

Sublinear Algorithms Workshop

I was asked to post to announce the workshop/bootcamp on Sublinear Algorithms, June 10-13 at MIT. I plan to be there and possibly talk about some new work.

From the web page (which you should go to to register, if you plan to attend!):

Synopsis

As big data is getting bigger, there is a need for analyzing data with sublinear constraints -- that is, for algorithms which require only sublinear time, space, measurements and/or samples. The goal of this workshop is to bring together experts in various areas of Computer Science, Electrical Engineering, Statistics and Mathematics to discuss recent work and exciting new challenges. It is hoped that the multidisciplinary nature of the workshop will highlight common goals and themes, as well as to facilitate an interchange of technical ideas that may be of use more widely than previously thought. The workshop will be preceded by a one day bootcamp on June 10, with the goal of presenting the basic techniques, definitions and goals in several of the communities.

Monday, March 26, 2018

An Ad-Hoc Committee on Sexual Harassment and Related Issues

The following is from Yuval Rabani, regarding a joint initiative we are moving forward with to establish policies, procedures, and institutions to deal with harassment and related ethical issues. You may see the post on other theory-related blogs as well.

------------

Recently, many theoreticians have become aware of issues, stories, and rumors concerning sexual harassment within our community, in other CS communities, and more broadly in science.

A number of initiatives, most notably the mushrooming codes of conduct at theory conferences, are already being put into practice.

In consultation among some of the main organizations running theory venues (IEEE TCMF/FOCS, ACM SIGACT/STOC+JACM, EATCS/ICALP, SIAM/SODA+SICOMP) we’ve decided to appoint a joint committee to discuss and propose coordinated policies, procedures, and institutions to deal with harassment and related ethical issues which cut across organizational boundaries. Sandy Irani will chair the committee. Its charter is stated as follows:

"We are setting an ad-hoc committee to draft a proposal for joint ToC measures to combat discrimination, harassment, bullying, and retaliation, and all matters of ethics that might relate to that. Proposed measures may include, but are not restricted to, coordinating policies and guidelines, and setting community-wide institutions for reporting and oversight. The primary goal should be a determination to deter and root out such behavior in the theory community. The issues of false reporting and due process should be taken into account. The committee is expected to conduct the necessary research on existing practices. The committee will submit a report to the appointing organizations by September 30, 2018.”

If you wish an organization be included in the loop, please contact me. If you wish to convey to the committee ideas and thoughts, please contact Sandy or other members as they’ll be announced.

In the meantime, while we are waiting for the committee’s more thoughtful suggestions, here are a couple of simple and potentially effective steps, off the top of my head:

1. If you are harassing someone, please stop.
2. If you are not harassing anyone, please don’t start.

I will gladly contribute to a lively open discussion and react to comments, especially if they occasionally reach my awareness by relaying their existence to my email feed. (Regrettably, I don’t spend all my waking hours monitoring theory blogs.)

Thursday, March 22, 2018

Swedish Summer School

I was asked to post the following notice for the upcoming Swedish Summer School for (theoretical) computer scientists. I gave some lectures for it a couple of summers back, and really enjoyed it. Maybe the students did also.

A warning that Djuronaset is not in the city of Stockholm, but a bit over an hour away from downtown Stockholm by bus. While not in the city, it's a beautiful area for walks, and the facilities are very nice. (In particular, they have wonderful saunas in the hotel that should be used daily.)

----------------------------------

The 5th Swedish Summer School in Computer Science (http://s3cs.eecs.kth.se/) will be held August 5-11, 2018, in the beautiful Stockholm archipelago at Djuronaset (http://djuronaset.com/en/). The school runs for a full week Monday-Friday in early August when Sweden is at its loveliest, with arrival on Sunday evening and departure Saturdaymorning.

We will celebrate our 5th anniversary by going significantly out of our comfort zone and learn about quantum computation. Ronald de Wolf (https://homepages.cwi.nl/~rdewolf/) will give a series of lectures accessible to people who do not know quantum from before. The idea is that this will help all those of us who routinely skip intimidating quantum talks at workshops and conference to overcome our deepest fears and learn enough so that during the next conference we can confidently go to the quantum sessions and actually understand some of what is going on (and maybe even ask a smart question or two).

Another reason for being scared about quantum is that the crypto systems we know and love might no longer be safe. But fear not: Oded Regev (https://cims.nyu.edu/~regev/) will give lectures on lattices and cryptography, explaining, among other things, how to survive in a post-quantum world. Other exciting topics that Oded will touch upon are the Learning with errors (LWE) problem and fully homomorphic encryption.

The summer school is primarily intended for PhD students, but postdocs and bright MSc students are also warmly welcome (and also faculty, subject to availability of slots).

The application deadline is April 20, 2018. Please see http://s3cs.eecs.kth.se/ for more information including instructions for how to apply. Any questions can be directed to s3cs-2018@kth.se.

Tuesday, March 06, 2018

Optimizing Learned Bloom Filters with Sandwiching

For the small-ish subset of people out there who care about "learned Bloom filters" (the subject of my last post), I have a small-ish update. I guess the data structure has been unconsciously sitting around in the back of my head, and when I paged it back into my conscious self, I noticed there was an (in hindsight) obvious improvement.

Recall that the learned Bloom filter uses a (magic) black-box predictor obtained from learning theory that predicts whether an element is in a set or not; to avoid false negatives, it then uses a backup Bloom filter to rescue any set elements that have been incorrectly predicted as not being in the set. This leads to two sources of false positives, from the predictor and the backup Bloom filter.

It turns out it is better to get rid of some of those false positives up front. That is, you get better performance if you have a regular old Bloom filter up front, followed by the learned predictor, followed by a (much smaller) backup Bloom filter to round up stray false negatives. Because you have the predictor sandwiched between two Bloom filters, I refer to this as the "sandwiched learned Bloom filter".

The math is surprisingly easily -- and at the same time, really beautiful in terms of the result. It turns out that, if you think of distributing "Bloom filter bits" out to the up-front Bloom filter and the backup Bloom filter, the right thing to do is first put bits on the backup Bloom filter, up to a certain fixed amount, which depends on the the parameters of the predictor (false positive rate, false negative rate). After that, all the bits should go to the up-front Bloom filter.

Like most Bloom filter optimizations, the gains are worthwhile but do not seem to be another-factor-of-two; it looks like it this can cut the size another 10-25% or so, depending on the settings, and it essentially free.

The whole writeup is less than 2 pages, so rather than write and explain it all here, if you're interested, check the arxiv draft. Hopefully, more fun in this setting will be forthcoming....

Thursday, January 25, 2018

Some Notes on "Learned Bloom Filters"

About a month ago, a draft paper was put out on arxiv called The Case for Learned Indexed Structures by the people at Google Brain. The paper has received some significant attention, perhaps because of its pedigree and perhaps because it makes some rather bold claims. For example, a line from the abstract states "In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes."

There has already been some pushback on some of the claims from the draft paper (see here for a response to the paper's discussion of B-trees, and here for a discussion regarding that cuckoo hash tables seem to do just as well as "learned" hash tables). I also read the paper and had some back and forth with the authors about some thoughts and concerns I had; they have been both responsive and thoughtful, and I think the conversation was fruitful on both sides. (At least, they helped me understand their work, and I hope my suggestions for them were useful.) In what follows, I'll explain some of specific thoughts and concerns regarding their proposed learned Bloom filters. I'll try to keep the blog post self-contained, but you may want to read their paper. Also, I've tried to write my ideas down more formally in a short write-up. Here's a link to a draft; this link may age out or disappear if/when I put this on arxiv.

What follows is my description (so any issues are mine, not the authors of the Learned Index paper).

Recall that a Bloom filter approximately represents a set S in small space. One can ask queries of the form: is y an element of S. If y is an element of S, the answer will always be yes (no false negatives), if not there is some chance of false positive. That is, it trades off some amount of error to save space. A key feature is that the probability that any given input y yields a false positive does not depend on y. (Well, it does after the Bloom filter has been instantiated -- some elements give false positives, some don't. But a priori, before you build the Bloom filter using hash functions, if you don't know the hash functions, you don't know what the false positives will be.)

For the learned Bloom filter, suppose we have a very good (black-box) predictor P, which takes an input x and returns P(x), which should be interpreted as an estimate of the probability that x is an element of S. That is, we have a function that can give a very good "guess" if an input x is an element of S. Then we can use P as a "pre-filter" to improve our Bloom filter as follows. When asked if x is an element of S, compute P(x), and simply return that x is in S if P(x) is larger than some suitable threshold z. This may introduce false positives, but it may also introduce false negatives; there may be elements x of S for which P(x) < z. To deal with those, we use a backup filter, which is just a standard Bloom filter that holds the false negatives from the pre-filter using the predictor P. So now we accept that x is an element of S if P(x) is above the threshold, or x obtains a positive answer from the backup filter. The backup filter also introduces false positives.

The authors of the learned index paper suggest that machine learning allows for good predictors P that require small space, thereby improving the Bloom filter in terms of space while achieving the same false positive probability. Note that the predictor does not necessarily have to be that great; even if half the elements of S are false negatives, the backup filter would then be a little bigger than half the size of a standard Bloom filter, so if the predictor P is suitably small, it is plausible that one could get the same performance with smaller space.

I don't know if this idea is new. I don't think I've seen a probabilistic pre-filter for a Bloom filter before; usually a Bloom filter is a pre-filter to some other part of the system. It seems like an interesting idea; certainly, the potential to layer filters is useful, and putting a filter before a Bloom filter is sensible.

A major issue, though, is that the nature of the guarantees one obtains are quite different for learned Bloom filters and standard Bloom filters, which is a subtlety that I worry may not be clear to all readers of the original paper. In particular, it seems in part that confusion might arise because the term "false positive rate" appears to have slightly different meanings in the Bloom filter community and the machine learning community; in the Bloom filter setting, false positive rate generally refers to that false positive probability, which is independent of the query. In the machine learning setting, the false positive rate is an empirical quantity, which depends on the input data and the queries.

An example may be helpful; here's one I used from the writeup. (This is a "thought-experiment" example, not based on actual experiments.) Suppose we are working with numbers in the range [0,1000000), and our learned Bloom filter is supposed to hold a set S that contains a 1000 elements. Half of those elements are random numbers in the range [1000,2000], and half those elements are random over the rest of the range. A good machine learning function might learn a predictor P that says that if x is in the range [1000,2000], then P(x) is "large" -- say around 1/2 -- and otherwise P(x) is small. It could use a threshold z of around 0.4, in which case about half the elements of S are false negatives, and must be stored in the backup filter. This seems like a reasonable outcome for learning.

What is the false positive rate now? Well, it depends on the queries. If all the queries are uniform over the range [1000,2000], the false positive rate would be quite high -- every query to a non-set element would be a false positive. If the queries are uniform over the range [0,1000000), the false positive rate is quite low, since one will rarely get unlucky and ask a query in the range [1000,2000], and otherwise the only false positives will come from the backup Bloom filter. The point is that the false positive probability depends on the distribution of the queries (as well as the predictor). But there may be many applications where the queries are naturally more often for elements that are "similar to" set elements -- that is, they have high P values, like the values in [1000,2000] that are not elements of S -- which may lead to high false positive rates. And if you don't know the query stream in advance, you don't even know what the false positive rate will be.

The response to this seems to be that if you have data about the queries, you can figure out the false positive rate empirically. That is, if you have a test set of queries that you assume is a sample from a distribution over all possible queries, you can determine the false positive rate of that test set, and use it to predict the false positive rate of future queries. My writeup provides basic definitions to formalize this idea, which seems correct. However, this still assumes that "future queries" will come from the same distribution as the test set. Returning to our example, if in our test set queries are uniform over [0,1000000), but then in the future queries change so that they are uniform over a smaller interval [0,100000), that significantly changes the false positive rate of the filter.

My takeaways were the following:

The idea of adding learning into data structures is certainly an interesting idea.
But it's still important to try to set up a theoretical framework, to understand what kinds of guarantees can be obtained.
At the moment, I can see applications where learned Bloom filters might be helpful -- if your data set has significant structure that can be useful to a machine learning function, if you don't have to deal with insertions and deletions frequently, if you have access to query data, and if you have reason to believe the query stream isn't changing (in a distributional sense).
At the same time, I can see applications where I would be very concerned about the use of learned Bloom filters, such as in a security setting. Also, modern Bloom filter alternatives, such as cuckoo filters (which use less space, and allow insertions/deletions more easily) should be considered and compared when thinking about using learned Bloom filters.
There's probably more to think about here. In particular, I wonder if there are other places where additional "filter layering" may be useful.

Tuesday, January 09, 2018

Double-blind, ALENEX

I wanted to point people to a pair of blog posts (Part 1, Part 2) by Suresh Venkatasubramanian at the geomblog discussing the experience of having the ALENEX conference go double-blind for reviewing this year, from the perspective of him and his co-chair Rasmus Pagh. The high order bits are that they thought it worked fine despite the additional work, and they recommend doing it again in future years.

I am in favor of double-blind reviewing, which is standard for many other conferences in many other areas, but is somehow considered impossible for CS theory. (Suresh does an excellent job addressing why that has been historically, and why the arguments against double-blind reviewing do not really hold up.) The theory community hasn't historically taken "conflicts of interest" issues seriously, as I've written about before (I guess a rather long time ago!). Double-blind reviewing helps with CoI issues, but as Suresh discusses, also deals with the huge implicit bias issues that I think are also a problem in the theory community.

(Except that, really, calling it implicit bias is something of a misnomer -- in many cases it's really explicit bias. As Suresh notes, in providing responses to common arguments:

But how am I supposed to know if the proof is correct if I don't know who the authors are.

Most theory conferences are now comfortable with asking for full proofs. And if the authors don't provide full proofs, and I need to know the authors to determine if the result is believable, isn't that the very definition of bias?

This is a real argument commonly presented at PC meetings. I've had people say, "I haven't verified the proofs, but XXX is one of the authors, so I'm sure it's all right.")

I encourage people to go read Suresh's discussion, and I would like to thank Suresh and Rasmus for taking it upon themselves to perform this experiment.

(If you're interested, you can also read the paper by Tomkins et al discussing double-blind reviewing experiences recently at WSDM as well.)

Thursday, November 16, 2017

Harvard CS Concentrators Jump Again

During my time as Area Dean for computer science, the number of computer science majors at Harvard more than doubled. Growth has continued, and according to the Crimson, we're now clearly the second largest concentration at Harvard. (It looks like we just barely surpassed Government last year, and now we're clearly larger. Economics is still bigger.)

When you consider that Applied Math is the 4th largest major group, and that's a group that is largely composed of people who do some mix of Math/Computer Science/Economics, I think the numbers actually underestimate the growth in CS at Harvard.

It's been an exciting time since I started at Harvard in 1999. One could see what was coming. Though, sadly, the administration still seems far from having caught up -- we (like many other CS departments) really could use some more resources coming our way. Universities are a bit slow to respond (sometimes for good reasons), and the speed of these changes is, in context, just remarkable. And, I'd have to say, have added to the fun of being a CS professor. Back in 1999, CS was incorrectly perceived to be a backwater at Harvard, small and unnoticed, but where there was a lot of excitement and energy. Fast forward less than two decades, and now we're clearly at the core of Harvard and its future.