Showing posts with label open problems.

Wednesday, November 28, 2007

NSF FIND Working Meeting

I'm writing this while stuck attending an NSF FIND (Future Internet Design) working meeting (Nov 27-28). Part of the FIND program is that there are 3 meetings per year where all the FIND PIs are supposed to get together to talk about the new network that is the goal of the program. This is my first meeting; I missed the last one. It seems to me that this idea of having PI meetings tied to a grant program is an NSF experiment worth examining, and there are some discussions going on here that may be of broader interest, so I thought I'd report. (If this reads as stream of consciousness, it's because I'm generally typing as things go on.)

One positive for me is the chance to catch up with various networking people who I don't see so regularly. The socializing aspect -- just talking to people, finding out what they're working on, making sure they know Harvard is hiring, all this is good. (Jennifer Rexford and I chatted during a break, and when I sat down after the break, she had already sent me a paper to look at. Which I need to go ask her some questions about... I also finally caught up on some overdue researchy discussion with Dev Shah of MIT; yes, I have to go to DC to meet up with someone from MIT...) In this sense, it's like going to a networking conference, and obviously interacting with peers is one of the pleasures of a conference.

The invited presentations aren't doing much for me, however. They're not highly technical talks -- more high-level talks, connecting network research to network users. (We're hearing two talks right now about issues in the practice of emergency networks in disasters.) It's not that the talks are bad, but it's a bit hard to see their purpose. This is a room full of networking people -- they all have well-developed individual research agendas and know what they think the problems are. I don't think the talks are adding much new. (Let's just say I'm not the only one tapping away at the keyboard when I should be listening to the talk. So again, it's just like a conference.)

Besides invited presentations, there are break-out sessions. I'm sitting in on the "network management" breakout session, which is far and away the largest, with about 25-30 people in the room. It's actually kind of like a conference panel, only a bit more chaotic with so many people. While the discussion is high-level, and I'm not sure we're headed anyplace specific, it's actually a pretty interesting discussion that highlights the scope of the problems of network management. (So many requirements, so many heterogeneous user classes, so many ill-defined goals, so many ill-defined or at least hard to measure quantities, etc.) Interestingly, the group as a whole seems quite convinced of the importance of what I call theory -- modeling, control, algorithms. (The pro-theory comments seemed to begin with Jennifer Rexford and Anja Feldmann, two incredibly pro-theory and powerful network researchers.) This gives me a good excuse to chime in and further emphasize the importance of theory to network management, which I manage to do repeatedly.

Ellen Zegura is giving a talk about GENI and the GENI Science Council. The GENI Science Council is supposed to come up with a "research plan" for GENI, an experimental facility/infrastructure for the development of the next generation Internet. (Network people rightfully complain it's hard to do experimental research on the current Internet at a large scale, so the plan is to design a research infrastructure for these sorts of problems.) There's a notable shortage of theorists on the council, but they seem to have noticed this and added Joan Feigenbaum and Michael Kearns recently. (If you're going to design a whole new network infrastructure, it seems to me you'd want some theorists to help make sure it's done right.) They're still putting together the research agenda that goes along with the planned facility, and they're still looking for input. Essentially, it seems like they're looking for research examples and research experiments that would utilize or benefit from a GENI-scale system, so that they can explain clearly why building a GENI-scale system is a good idea. So if you have ideas for experiments that you'd like to do on a network-wide scale, you might start looking at what GENI's up to. Overall, GENI is such a huge project, and still so amorphous at this point, that people seem a bit confused. Or maybe that's just me. (I did have to get up at 4 am to catch the early plane down here.)

I talked with Ellen later, and got some clarification on GENI's status. I had thought this was well on its way to becoming a possibly multi-hundred-million dollar new NSF project, but apparently it's much less far along than I had thought. If GENI is going to happen, it apparently needs to regain some momentum. And GENI seems a key component of the FIND program; it's hard to develop next-generation network architectures if you don't have a network to try them out on.

Before dinner they have a poster session. It's well done for a poster session -- they have food and a cash bar, and food and drink always seem to be requirements for a poster session where people actually stay and look around. There's a nice mix of work; my favorite is an idea multiple groups seem to be working on that there shouldn't be just one network architecture, but multiple network architectures running over the same network substrate. My interpretation is that you have routers/other infrastructure that can run multiple architectures efficiently in parallel, so running side by side you might have our standard Internet with another network with much stronger (or weaker!) quality of service guarantees, or another network where connections are based on virtual circuits, and so on. This makes sense if you believe that a one-size-fits-all Internet is no longer the best idea.

Day 2 is mostly devoted to having the breakout sessions coalesce their ideas into short presentations. Pablo Rodriguez, known among other things for his work on Avalanche/network coding at Microsoft, gave a nice talk about the economics of P2P and ISPs; he's left Microsoft to work for Telefonica Barcelona, and has developed some insight from the ISP point of view.

Overall, I ended up with mixed feelings about this working meeting. It's useful to get people together to find out about what everyone is doing and to talk about high-level issues. Conferences already serve some of this purpose, though. Some of the value I got from this meeting derives from the fact I haven't been to a networking conference for a while (I didn't feel like traveling to Japan for SIGCOMM this year...). High-level issues don't necessarily get discussed at conferences, but it's also hard to get 50+ people together to talk about high-level issues and reach any sort of agreement that wasn't already obvious. I'm skeptical that 3 such meetings are needed each year. However, the plan seems to be for the next meeting to be a graduate-student-focused meeting, which may be worthwhile -- getting graduate students together to meet and talk like this seems potentially very interesting.

Going to the meeting will pay off, however, if it leads to any new collaborations. If any FIND people are reading this, and you want to add some theory to your work, send me an e-mail. (Of course, you could also go down the hall and talk to the theorist(s) in your locale, which I'd really strongly recommend, unless I'd be interested, in which case try me first!)

Tuesday, October 09, 2007

The simplest insertion/deletion channel

The simplest binary insertion/deletion channel that I can think of is the following: with probability p, each bit independently results in two copies of itself. This is the simplest case of a class of channels that I have dubbed sticky channels, which are like sticky keyboards: each symbol can result in a random number of copies of that symbol.
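To make the channel model concrete, here is a minimal Python sketch of this duplication channel and of a general sticky channel (the function names and structure are mine, just for illustration):

```python
import random

def duplication_channel(bits, p):
    """Simplest sticky channel: each bit is independently duplicated
    (i.e., replaced by two copies of itself) with probability p."""
    out = []
    for b in bits:
        out.append(b)
        if random.random() < p:
            out.append(b)  # the extra copy
    return out

def sticky_channel(bits, num_copies):
    """General sticky channel: num_copies() returns a random positive
    number of copies for each transmitted symbol."""
    out = []
    for b in bits:
        out.extend([b] * num_copies())
    return out

# Example: send a short string through the duplication channel.
x = [0, 0, 1, 1, 0]
print(duplication_channel(x, p=0.5))
```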

Sticky channels have the nice property that contiguous blocks of 0s (resp. 1s) at the input correspond to contiguous blocks of 0s (resp. 1s) at the output: since no symbol is ever deleted, each maximal run of identical symbols at the input maps to a maximal run of the same symbol at the output. This property makes sticky channels easier to handle than more general insertion/deletion channels.

I've just had a paper on sticky channels accepted to Transactions on Information Theory; here's a link to a preprint. The main result is that for the simplest channel above, I can numerically obtain very tight bounds on the channel capacity. But of course I'd still like to know -- is there a simple formula that gives the capacity as a function of p? And is there a simple and efficient coding scheme that nearly reaches the capacity?

Thursday, September 06, 2007

Negative Dependence

I lost sleep last night trying to prove a collection of random variables were negatively associated.

Negative dependence is one of those nice tricks that hardly ever gets used, because it usually ends up being harder to establish than it should be. While there are many flavors of negative dependence, the most natural is probably "negative association". The intuition is simple -- given a collection of random variables, if when one goes up the others should go down, then they are negatively associated. More formally, the variables are negatively associated if for any monotone non-decreasing function f of one subset of the variables and any monotone non-decreasing function g of a disjoint subset of the variables, f and g are negatively correlated, i.e., E[fg] <= E[f]E[g]. Proving this holds formally is often more difficult than one would expect.

Why should we care? Well, it turns out that if a collection of random variables are negatively associated, then even though they are dependent, you can just apply your standard Chernoff bounds to them, without a care. Chernoff bounds (and other tail probability bounds) pop up in many arguments, but they can be very difficult to deal with when the random variables are dependent. Usually, you have to switch to a martingale argument, since standard Chernoff bounds apply only to independent random variables. If you can get negative dependence, it's much cleaner.

The best introduction to negative dependence is probably this surveyish article by Dubhashi and Ranjan. The focus is on how balls and bins problems are a natural example of negative dependence -- if a ball lands in one bin, it can't be in any other! Naturally, this explains my interest.
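As a sanity check on the balls-and-bins intuition, here's a small simulation (entirely illustrative; the parameters are arbitrary). Negative association implies all pairwise covariances between bin loads are non-positive; for the multinomial loads below, the exact covariance between two bin loads is -n/m^2, where n is the number of balls and m the number of bins.

```python
import random

def bin_loads(n_balls, n_bins):
    """Throw n_balls uniformly at random into n_bins; return the loads."""
    loads = [0] * n_bins
    for _ in range(n_balls):
        loads[random.randrange(n_bins)] += 1
    return loads

def empirical_covariance(trials=100000, n_balls=10, n_bins=5):
    """Estimate Cov(load of bin 0, load of bin 1); negative association
    predicts this is <= 0 (here it is exactly -n_balls / n_bins**2)."""
    xs, ys = [], []
    for _ in range(trials):
        loads = bin_loads(n_balls, n_bins)
        xs.append(loads[0])
        ys.append(loads[1])
    mx = sum(xs) / trials
    my = sum(ys) / trials
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / trials

if __name__ == "__main__":
    print(empirical_covariance())  # should be close to -10/25 = -0.4
```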

For example, the following problem came up for me some years ago in this paper. Suppose we throw n points uniformly at random on the boundary of the unit circle. We want bounds on the number of arcs of length larger than say c/n for some constant c. If arc lengths were independent, we could just apply a Chernoff bound, easy enough. But of course they're dependent -- the sum of the arc lengths is 1! Intuitively, though, if one arc gets longer, then the others must get shorter, so there should be negative dependence. We proved what we needed to get the Chernoff bound, but it wasn't pretty. (I've since seen that a stronger version of this result is given as an exercise on negative association in the very nice-looking draft monograph by Dubhashi and Panconesi; to tell the truth, I'd like to see it worked out, as it seems to me that the conditioning is a bit subtle, but then again, geometry confuses me.)
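Here's a quick simulation of the arcs example (my own toy code, with arbitrary parameters). A given arc exceeds length c/n with probability (1 - c/n)^(n-1), which is roughly e^{-c}, so the expected number of long arcs is about n e^{-c}; what the Chernoff bound buys you is tight concentration of the count around that mean.

```python
import random

def long_arc_count(n, c, trials=10000):
    """Throw n uniform points on a circle of circumference 1 and count the
    arcs between consecutive points whose length exceeds c/n."""
    counts = []
    for _ in range(trials):
        pts = sorted(random.random() for _ in range(n))
        arcs = [b - a for a, b in zip(pts, pts[1:])]
        arcs.append(1 - pts[-1] + pts[0])  # wrap-around arc
        counts.append(sum(1 for a in arcs if a > c / n))
    return sum(counts) / trials

# Average count should be near n * e^{-c}; for n=100, c=2 that's about 13.5.
print(long_arc_count(n=100, c=2.0))
```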

In fact, in that paper we actually wanted a similar result in the following setting. Suppose one throws n points uniformly at random into a fixed region (say, for example, the unit square, or to avoid boundary issues, the 2-d unit torus). We want bounds on the number of Voronoi cells of size larger than say c/n for some constant c. If Voronoi cell sizes were independent, we could just apply a Chernoff bound, easy enough. But of course they're dependent! Intuitively, though, if one cell gets bigger, then the others must get smaller, so there should be negative dependence. Or maybe that isn't the case. Now I can't remember if I ever found an example that led me to believe it wasn't the case... Anyhow, we couldn't prove negative dependence easily, so we ended up using an uglier martingale argument that sufficed.

I was surprised at the time that I couldn't find any reference to this type of problem in the geometric literature. If negative dependence of Voronoi regions in random settings is still open (and true!), I'd be happy to ponder it with anyone who has a better sense of geometry than I do. In general, the area of negative dependence seems like a promising area for additional results.

Friday, August 31, 2007

New Results in Trace Reconstruction

I've already talked a lot in this blog about deletion channels. Trace reconstruction involves a similar problem. We start with an original binary string X = X1,X2,...,Xn. A trace consists of a string Y1,Y2,...,Ym obtained from the original string by passing it through a deletion channel, where each bit is independently deleted with probability p. The trace reconstruction problem basically asks how many independent traces you need to see to reconstruct the original string X with high probability. Unlike the coding setting, where X might be chosen from a codebook of our own design, in this setting two natural models to study are when X is uniform over binary strings (so the high probability is over the choice of X and the traces), and the worst case (where the high probability is just over the traces). Variations of the problem include operations other than deletions (including, say, insertions and errors). As an example application, a set of sensors might be monitoring a sequence of events. Each individual sensor is weak and might miss a given event, in which case the question is how many sensors are needed to reconstruct the event sequence perfectly, with high probability.
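To pin down the model, here's a minimal sketch of how traces are generated (my own illustrative code):

```python
import random

def trace(x, p):
    """Pass the bit string x through a deletion channel: each bit is
    independently deleted with probability p."""
    return [b for b in x if random.random() > p]

def traces(x, p, m):
    """Generate m independent traces of the same original string x."""
    return [trace(x, p) for _ in range(m)]

# Example: a random original string and a few traces of it.
n, p = 20, 0.2
x = [random.randint(0, 1) for _ in range(n)]
for t in traces(x, p, 3):
    print("".join(map(str, t)))
```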

Trace reconstruction has some history in the information theory community, and the first CS-style paper I saw on it was by Batu, Kannan, Khanna, and McGregor in SODA 2004. The main result of this paper dealt with random input X and considered p values that were O(1/log n). It seems to me much more natural for p to be constant, and it has remained an open problem to determine an efficient algorithm for constant p.

I mentioned this problem last time I visited Microsoft, and it seemed to resonate with some of the people there. Thomas Holenstein, Rina Panigrahy, Udi Wieder and I have a submission with several results, including an algorithm that for random X and sufficiently small constant probability p requires only a polynomial number of traces and polynomial time (with high probability).

The SODA 2004 paper uses a majority voting technique -- the bits are determined sequentially, with each trace voting on the next bit. A key idea in our new algorithm is a "smart voting" technique. We only let traces vote if there is good reason (based on the already determined bits) to think that the trace has a good prediction for the subsequent bit. That is, only well-informed traces are allowed to vote. Feel free to make your own political analogies. My intuition is that this smart voting technique is a closer analogue to the full belief propagation (or Bayesian analysis) that we want to do than just majority voting. Because of this, I hope this "smart voting" technique is a general approach that will find other applications.
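For intuition, here is a toy rendering of plain sequential majority voting, in the spirit of the SODA 2004 approach (this is my own simplification for illustration, not their algorithm and not our smart-voting variant): each trace keeps a pointer, the pointed-to bits are put to a vote, and traces agreeing with the majority advance their pointer, while a disagreeing trace is assumed to have suffered a deletion at that position.

```python
def majority_reconstruct(traces, n):
    """Toy sequential majority voting over a list of traces (lists of bits),
    attempting to reconstruct an original string of length n."""
    ptrs = [0] * len(traces)
    out = []
    for _ in range(n):
        # Collect the current bit from every trace that still has bits left.
        votes = [t[ptr] for t, ptr in zip(traces, ptrs) if ptr < len(t)]
        if not votes:
            break
        bit = 1 if 2 * sum(votes) >= len(votes) else 0
        out.append(bit)
        # Traces that agree with the majority advance; others stay put,
        # on the assumption that their corresponding bit was deleted here.
        for i, t in enumerate(traces):
            if ptrs[i] < len(t) and t[ptrs[i]] == bit:
                ptrs[i] += 1
    return out
```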

I don't yet have an analysis of a belief-propagation-based algorithm. Also, currently we can't analyze a maximum-likelihood algorithm, which finds the most likely original string X. I also don't know how to implement maximum likelihood efficiently in this setting. So there are still plenty of open questions in this area.

Wednesday, August 22, 2007

Another Deletion Code Open Problem

In a binary symmetric error channel, n bits are sent, and the channel flips each bit independently with probability p. So, for example, the message sent might be 00110011 and the received message could be 01100011 if the 2nd and 4th bits were flipped. Now suppose the same message is sent through k independent channels, and the receiver sees all of the results. (Here k should be thought of as a small constant.) The capacity of this channel can be computed; essentially, in this channel, each bit gets mapped to a number in the range [0,k], corresponding to the number of 1's in the appropriate position. (Since all errors are independent, exactly which channels flip a specific bit doesn't matter, just the number of flips.) As a specific example, when k = 2, we can think of it the following nice way -- if we see two 1's (resp. 0's) in bit position i, we think the original bit was 1 (resp. 0), and now we have an error with probability p^2. With probability 2p(1-p), we see a 1 and a 0 in the ith position -- this corresponds to an "erasure", since the bit is now equally likely to be a 1 and a 0. So we have a channel that gives errors with probability p^2 and erasures with probability 2p(1-p); we can find the capacity (and codes for) such a channel.
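For concreteness, here's a small sketch computing the capacity for the k = 2 case via the error/erasure view above (my own code; it uses the standard fact that a symmetric binary channel with erasure probability e and conditional error probability q has capacity (1-e)(1 - H(q))).

```python
from math import log2

def h(q):
    """Binary entropy function."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

def two_copy_capacity(p):
    """Capacity (bits per use) of sending one bit through two independent
    BSC(p) channels: erasure with probability 2p(1-p), and conditioned on
    no erasure, an error with probability p^2 / (p^2 + (1-p)^2)."""
    erase = 2 * p * (1 - p)
    cond_err = p * p / (p * p + (1 - p) * (1 - p))
    return (1 - erase) * (1 - h(cond_err))

print(two_copy_capacity(0.1))  # roughly 0.74 bits per channel use
```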

In a binary deletion channel, n bits are sent, and the channel deletes each bit independently with probability p. So, for example, the message sent might be 00110011 and the received message could be 010011 if the 2nd and 4th bits were deleted. Now suppose the same message is sent through k independent binary deletion channels, and the receiver sees all of the results. Can we say anything useful here? The problem is automatically more challenging, since we only have bounds and don't even know the capacity of the standard deletion channel (when k is 1). This is yet another simply stated question from the theory of coding for deletion channels in need of an idea.

Monday, August 06, 2007

Network Coding, Open Problems

I'm spending a couple of days at the DIMACS Tutorial on Algorithms for Next Generation Networks. Tracey Ho and Alex Sprintson are giving a talk on Network Coding, and since in the past I've promised a post on open questions in network coding, I'm interviewing them. I apologize that the questions might be vague; the problem with a new area is that it's still a little unclear what the right open problems are, and we all use different lingo. But here are a few (for readers who want a concrete picture of what network coding buys you, a sketch of the classic butterfly example follows the list)...
  1. For 3 source-receiver pairs and unicast, in an undirected network, with arbitrary coding, is there an advantage from coding over routing? (I think the reference paper is Li and Li, Network Coding: The Case of Multiple Unicast Sessions.)
  2. General multiple unicast in directed networks: is there an algorithm that, given a network, computes the capacity? For acyclic networks, there's an implicit characterization, but not an explicit characterization; for cyclic networks, there's not even that.
  3. What is the complexity of finding the optimal (in terms of capacity) non-linear code for various network coding problems? Specific problems include determining the truth/falsehood of the following statements:
    1. Given an instance of the general network coding problem in a directed graph G and a real number r, there is a polynomial-time algorithm which computes a solution achieving rate r if one exists, and otherwise reports that this is impossible.
    2. Given an instance of the k-pairs communication problem in an undirected graph G and a real number r, it is recursively undecidable to determine whether the network coding rate is less than r.
  4. For multicast networks, what is the minimum number of nodes that need to do encoding to make network coding work? That is, can we minimize the coding complexity in terms of the number of nodes doing coding (or some other reasonable metric)?
  5. There must be open questions in the recent work by Koetter/Kschischang on coding for errors/erasures in random network coding. For example, they seem to give general (network-oblivious) bounds. Are there ways to improve their bounds by using knowledge of the network topology?
  6. Is there an algorithm to find the minimum cost subnetwork that guarantees delivery using network coding subject to certain classes of adversarial (or non-adversarial) link failures?
  7. A good place to look for CS-style open problems is probably also the Adler/Harvey/Jain/Kleinberg/Lehman paper On the Capacity of Information Networks, which has a nice section on open problems at the end.
Maybe I'll have more after the talk...
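As promised above, for background (this isn't one of the open problems, just the textbook illustration of why coding can beat routing): in the directed butterfly network, a source multicasts two bits per time step to two receivers over unit-capacity edges, which routing alone cannot do because the single bottleneck edge can carry only one of the two bits per use. Here is a small Python sketch of one round, with the node names being the usual ones from the example:

```python
def butterfly_round(x1, x2):
    """One round of XOR coding on the classic butterfly network.
    Unit-capacity directed edges: s->a, s->b, a->c, b->c, c->d,
    a->t1, b->t2, d->t1, d->t2. Both receivers recover both bits."""
    a = x1           # node a receives x1 from the source
    b = x2           # node b receives x2 from the source
    c = a ^ b        # the bottleneck node XORs its two incoming bits
    d = c            # node d forwards the XOR to both receivers
    t1 = (a, a ^ d)  # t1 gets x1 directly and decodes x2 = x1 XOR (x1 ^ x2)
    t2 = (b ^ d, b)  # t2 gets x2 directly and decodes x1 = x2 XOR (x1 ^ x2)
    return t1, t2

# Both receivers recover (x1, x2) in every case.
for x1 in (0, 1):
    for x2 in (0, 1):
        assert butterfly_round(x1, x2) == ((x1, x2), (x1, x2))
```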

Monday, July 16, 2007

A Favorite Open Problem : Codes for a Poisson-repeat Channel

A (binary) Poisson-repeat channel works as follows: the sender sends a sequence of n bits, and the channel independently replaces each bit with a number of copies (or repeats) that has a discrete Poisson distribution with mean 1. That is, for each bit, the probability that it is replaced by k copies is e^{-1}/k!. Here k can be 0, in which case the bit is deleted. The receiver gets the resulting string after replacements. I'd like to find an efficient code for this channel that has a non-trivial constant rate. Any rate over 0.01, for example, would be just fine. Of course, I'd like the bound on the rate to be provable, rather than just experimental, and I really would like the code to be practical, not just polynomial-time encodable/decodable.
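Here is a minimal simulation of the channel, just to make the model concrete (the helper functions are mine; the Poisson sampler uses Knuth's multiplication method, which is fine for mean 1):

```python
import random
from math import exp

def poisson(mean=1.0):
    """Sample a Poisson random variable (Knuth's multiplication method)."""
    L = exp(-mean)
    k, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

def poisson_repeat_channel(bits, mean=1.0):
    """Replace each bit independently with Poisson(mean) copies of itself;
    zero copies means the bit is deleted."""
    out = []
    for b in bits:
        out.extend([b] * poisson(mean))
    return out

x = [random.randint(0, 1) for _ in range(20)]
print("sent:    ", "".join(map(str, x)))
print("received:", "".join(map(str, poisson_repeat_channel(x))))
```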

What's the motivation? It turns out that the Poisson-repeat channel is closely tied to the deletion channel, where each bit is independently deleted with probability p. A code for the Poisson-repeat channel would immediately yield a good code for the deletion channel for values of p close to 1; we showed the reduction in this paper.

Codes for insertion/deletion channels are hard; very little is known. Because a code for this specific channel would yield codes for a larger family of channels, I think it's an appropriate and intriguing target.