## Monday, May 07, 2012

### Attribute-Efficient Learning (Guest Post by Justin Thaler)

[Editor's comment:  Justin has been having quite a year;  after a paper in ITCS, he's recently had papers accepted to ISIT, ICALP, HotCloud, and COLT.  In this guest post, he tells us about the COLT result.]

Rocco Servedio, Li-Yang Tan, and I recently had a paper accepted to COLT on attribute-efficient learning, which is a clean framework introduced in 1990 by Avrim Blum that captures the challenging and important problem of learning in the presence of irrelevant information. Think of a infant trying to learn whether or not an object is safe to touch. The infant's brain may be inundated with sensory information, but there may only be a few characteristics of the object that matter (is it sharp? Is it sitting on a heat source?). Or think of a scientist trying to identify the genetic causes of a certain disease that depends on the (possibly complicated) interactions of a small number of genes: the scientist may collect a massive amount of genetic data from study participants, yet only a small amount of this information is actually relevant to the function being learned (i.e., the mapping of genes to a subject's phenotype).

Thus, the goal of an algorithm for attribute-efficient learning is to run in time polynomial in the total number of attributes, n, using a number of examples which is polynomial in the description length of the function f to be learned. The latter can be substantially smaller than n if most of the attributes are irrelevant.

In its most general form, the problem of learning in the presence of irrelevant information is captured by the "junta problem". In this setting, you make no assumption other than that the function being learned depends on only k << n of its n attributes, and your goal is to learn the function in time substantially smaller than the trivial O(n^k) obtained by exhaustively checking all k-element subsets of the attributes. The uniform-distribution variant of this problem has been called the "most important open question in uniform distribution learning" in the groundbreaking paper of Mossel, O'Donnell, and Servedio (see also Lipton's blog post on the subject here).  Greg Valiant has recently made progress on this problem here.

Our goal is simultaneously more and less ambitious than MOS. On the one hand, we want to attribute-efficiently learn under arbitrary distributions, not just the uniform distribution. On the other hand, we are willing to assume not just that our functions depend on few attributes, but that the relevant attributes interact in structured ways. Specifically, we focus on attribute-efficient learning of decision lists, a problem first posed by Blum in his 1990 paper. This natural function class is (one of the few) known to be efficiently PAC-learnable, but it seems to lie on the boundary of tractability in the more challenging attribute-efficient setting. The problem has already been studied by a number of authors, see e.g. here, here, here, and here.

What did we manage to show? We give both positive and negative results. On the positive side, we build on a construction of Klivans and Servedio to give an algorithm achieving new tradeoffs between the running time and sample complexity for learning length-k decision lists over n Boolean attributes. Like most known positive results on PAC-learning under arbitrary distributions, we achieve this by constructing "uncomplicated" polynomial threshold functions (PTFs) computing any decision list. (For the uninitiated, a polynomial threshold function is exactly what it sounds like: it is a function of the form sign(p(x)), where p is an n-variate polynomial, and the complicatedness of a PTF refers to the degree of p, and the size of its coefficients). For technical reasons we actually construct something slightly more expressive than a PTF, which we call a generalized PTF, and which is just as useful for learning purposes.

We have two results on the negative side. First, we give new lower bounds on the complexity of PTFs that compute a specific and important decision list called ODD-MAX-BIT, building upon an elegant argument due to Beigel. Our lower bound strongly suggests that brand new techniques will be needed to improve on our positive result. (Parenthetically, the ODD-MAX-BIT function has played important roles in complexity theory in the past, from oracle separations to circuit lower bounds -- you can learn about many of these prior results just by typing ODD-MAX-BIT into Google).  In brief, Beigel's lower bound is tight for relatively low-degree PTFs computing ODD-MAX-BIT, but is loose or vacuous for higher-degree PTFs. Our new lower bound significantly improves upon Beigel's at higher degrees and comes close to matching our upper bound (the main gap is that our upper bound uses generalized PTFs, while our lower bound applies only to honest-to-goodness PTFs).

Our final result is to extend Beigel's lower bound, which applied only to thresholds-of-monotone-conjunctions, to thresholds of significantly more general classes of functions.

In closing, I would like to say a little about about the ideas underlying our constructions, as there is a common intuition underlying both our positive and negative results. The positive result works by breaking the decision list into "chunks", closely approximating each chunk (in the $L_{\infty}$ norm) with a low-degree polynomial, and then putting the approximations together to get a PTF for the entire decision list. The negative result shows that this approach is intrinsic: it also breaks the decision list into chunks, and shows that you can take any PTF p for ODD-MAX-BIT and turn it into a low-degree polynomial q closely approximating each chunk. We then use a new Markov-type inequality to conclude that q has to be "complicated", and hence p has to be complicated as well.

Our Markov-type inequality seems quite natural, and similar statements have apparently been known to approximation theorists for some time now. I wanted to state the result here in the hopes that it will find some other applications in TCS.

Markov's classical inequality is a statement about the derivative of a univariate polynomial in terms of its degree. In contrast, we give a bound on the derivative of a univariate polynomial in terms of both its degree and the *size of its coefficients*. For comparison:

Markov's Theorem: Let p : [-1,1] --> [-1,1] be a real polynomial with deg(p) <= d. Then max  |p'(x)| <= d^2, where the maximum is taken over all x in the real interval [-1, 1].

Our Markov-type Inequality (what I state here is a bit of a simplification of what we actually show): Let p : [-1,1] -> [-1,1] be a real polynomial with deg(p) <= d and coefficients of absolute value at most W. If max |p(x)| >= 1/2, then max |p'(x)| = O(d*log(W)), where again the maximums are taken over all x in [-1, 1].

Notice our bound is tighter than Markov's when W is smaller than 2^d.

The way I like to think about our Markov-type inequality is as follows. Since the Chebyshev polynomials are a tight example for Markov's inequality, one can interpret Markov as saying that, in order to maximize the derivative of a polynomial P of degree d which stays bounded on the interval [-1, 1], the best approach is to "spend all the allowed degree on a Chebyshev polynomial." There is also a simple example showing that our new inequality is asymptotically tight for a wide range of W's -- the tight example involves composing a Chebyshev polynomial with a monomial.  So our new Markov-type inequality is saying that in order to maximize the derivative of a polynomial of degree d and weight W which stays bounded on the interval [-1,1], the best approach is to "spend all of the allowed weight on a Chebyshev polynomial T, and then spend any remaining allowed degree by composing T with a low-weight, high-degree polynomial (namely, a monomial)."  It actually turns out that this same intuition underlies our positive result, which allowed us to improve over the Klivans-Servedio construction when the degree of the (generalized) PTF is relatively high.