Sunday, June 23, 2013

How Should We Choose Students?

Some of my previous posts have led me to think about the following -- something I'm hoping to write a longer piece about in the near future.

In the past few weeks, at Harvard (and elsewhere), there have been reports about the "decline of the humanities".  (Whether these reports have any significant bearing on reality is not necessarily important to this post.)  But machine learning keeps getting better and better.  While we may never be able to predict the exact outcome for an individual student, statistically speaking, as universities gather more data, they will get better at predicting, for example, what a student will major in.  Potentially, with the right sort of tracking, universities may be able to predict reasonably well what jobs students will go into -- heck, they may get a statistically meaningful prediction of their future net worth.*  In particular, if we wanted to choose students according to what they were going to major in, in order to keep the humanities supporters happy, we could; we can already kind of do that now (based on, for example, what students say they want to major in), and we'll just keep getting better at it.

This will lead to all sorts of questions.  Or, perhaps better said, it will make questions that already exist to some extent more pronounced.  First, getting to the humanities concern, how should we choose our students?  Should we have quotas by future major?  We could assign departments a permanent percentage (well, an "expected percentage") of the incoming class and accept students accordingly.  From some faculty members' and administrators' point of view, perhaps this makes sense; we could guarantee a department size, and a suitable faculty/student ratio per department.  To me, it seems potentially disastrous, turning the university into a static entity -- one that perhaps would not limit any individual student in what they want to study, but would create a less flexible global atmosphere.  Again, in some sense, this question exists today; at least some people have responded to the "humanities crisis" by saying that the way students are accepted should be changed (to give preference to humanities-interested students).  But the question becomes an even more significant challenge once you assume you actually have very strong prediction methods that let you select students this way more accurately than has been the historical norm.

Of course, going beyond the picayune issue of whether we should choose students according to what they might major in, there's the larger-scale question of how we should choose students at all.  Indeed, this question lies at the heart of many an affirmative action lawsuit, with the "reverse affirmative action" side claiming that people of what I will call "white" descent are passed over in favor of less qualified "non-white" students.  (The issue is obviously more complicated than this paragraph can do justice to; for example, the issue of discrimination against Asian Americans arises.)  In such discussions, one generally hears the term "merit" -- if only schools just took the top people according to merit and ignored race completely -- but what exactly is merit?  Legislators and judges seem to want some sort of formula (usually based on grades and/or test scores -- except that, by studying their own big data, some at Google claim that "G.P.A.'s are worthless" for their hiring).  Let's suppose our machine learning tools are good enough to estimate merit quite accurately once we define the merit objective function for them.**  How should we define it?  One particularly intriguing question: is the "merit" of the class simply the sum of the merits of the collected individuals -- in which case we should ignore things like what major they want to choose -- or is the merit of the sum different from the sum of the merits?  I have some of my own not-completely-worked-out ideas, but again, this seems worth writing a longer essay about to work through the possibilities and implications.
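To make the "sum of merits" question concrete, here is a toy sketch (all scores and the diversity bonus are invented for illustration, not a proposal) contrasting a purely additive class-merit objective with one where the whole differs from the sum of its parts because the class also gets credit for the mix of intended majors:

```python
from itertools import combinations

def additive_merit(students):
    """Class merit = sum of individual merit scores."""
    return sum(s["merit"] for s in students)

def merit_with_diversity(students, bonus=1.0):
    """Class merit = sum of individual scores plus a bonus per distinct
    intended major represented -- one simple way the merit of the sum
    can differ from the sum of the merits."""
    majors = {s["major"] for s in students}
    return additive_merit(students) + bonus * len(majors)

applicants = [
    {"merit": 9.0, "major": "CS"},
    {"merit": 8.5, "major": "CS"},
    {"merit": 8.0, "major": "history"},
]

# Admit a class of two under each objective.  The additive objective
# picks the two CS applicants; the diversity-aware objective instead
# picks the CS/history pair, even though one individual score is lower.
best_additive = max(combinations(applicants, 2), key=additive_merit)
best_diverse = max(combinations(applicants, 2),
                   key=lambda c: merit_with_diversity(c, bonus=1.0))
```

Whether any bonus term like this is legitimate is exactly the policy question; the code only shows that the two definitions can select different classes.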

A further interesting question is what sort of information universities can and should gather about applicants in order to make these predictions.  College applications already ask for a lot -- grades, lists of activities, essays, letters of recommendation, test scores, sometimes interviews.  Suppose, though, that we could much more clearly predict your "merit" as a future student by parsing your Facebook account, or better yet, your e-mail from the last three years.  Should we be able to ask for that?  Perhaps we could guarantee that our algorithms return a score only and your actual e-mail is never examined by any human being.  Or perhaps, by the time our machine learning algorithms are ready for that data, privacy won't matter to anyone anyway -- especially if providing access to the data is what it takes to get into one's school of choice.

In some sense, none of these questions is inherently new.  But they become different in kind once you consider the power machine learning will give to systems that make decisions about things like who goes to what university.  While the university setting is arguably small, the themes seem quite large, and perhaps the university is the place where some of the thinking behind the larger themes needs to take place.  And take place now, before the technology arrives and is used without much thought about how it really should be used.

* Obviously, there are countless other potentially more significant uses of machine learning technology.  But I work at a university, so this is what has come to mind recently.   

** As far as I know, the merit function for Harvard is not "how much will you or your family donate to Harvard in the future".  But it could be.  Even if we avoid the potential self-interest of universities, to what extent is net worth a suitable metric of merit?  I was an undergraduate at Harvard and am now a professor there;  Bill Gates was an undergraduate (who notoriously dropped out) and donated a large amount of money for the building I now work in, and apparently has had a few other successes.  Extreme cases, to be sure, but how would the merit objective function judge these outcomes?  

5 comments:

Dan Spielman said...

There is a lot more that universities could, and should, do with their admission data. For example, they could evaluate how well different admission criteria predict performance in classes. I know that MIT did some of this. I think Yale does not.

Starry D said...

This all seems a bit fanciful. Sure, we *might* be able to predict the job a student will end up in; but given our current inability to even predict whether a student will be capable of learning how to program, I doubt it.

In any case, the questions of what role universities play, who ought to study what etc. have been debated ad infinitum elsewhere - I can't see that some current success in data mining adds anything to the debate, though it might allow policies to be implemented more effectively. I'm happy to point you to some of the relevant literature if you're interested.

In any case, allocating students to courses may not be something a university can feasibly do in a free market society even if it wanted to. If one uni refuses to enroll a student in what they want to do, presumably another uni will.

I think a more interesting question is how *students* might behave if *they* had better information on their likely prospects.

Michael Mitzenmacher said...

Dan -- just because the universities could do this, should they? Where's the line of what they should do?

Starry D -- I think you both underestimate the power of machine learning, and don't give enough credit to the idea that there's a lot to be gained from statistical rather than exact learning. We might not be able to tell what job a specific student will end up in, but I might be able to get a good estimate of the distribution (20% chance lawyer, 30% consultant, 40% wall street, 10% other) and then use this statistical knowledge to set up a class with desirable statistical properties, for whatever we believe is desirable.
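The statistical point above can be sketched in a few lines (the predicted distributions here are invented numbers, not real data): even if no individual's career is predictable, per-student predicted distributions aggregate, by linearity of expectation, into expected class-level properties an admissions office could compare against a target mix.

```python
# Predicted career distribution for each of three admitted students
# (each row sums to 1; all probabilities are made up for illustration).
predictions = [
    {"lawyer": 0.2, "consultant": 0.3, "wall street": 0.4, "other": 0.1},
    {"lawyer": 0.1, "consultant": 0.2, "wall street": 0.2, "other": 0.5},
    {"lawyer": 0.3, "consultant": 0.3, "wall street": 0.1, "other": 0.3},
]

def expected_counts(preds):
    """Expected number of students per career: sum the individual
    predicted probabilities (linearity of expectation)."""
    totals = {}
    for dist in preds:
        for career, p in dist.items():
            totals[career] = totals.get(career, 0.0) + p
    return totals

counts = expected_counts(predictions)
# `counts` now gives the expected composition of this class; one could
# adjust who is admitted until it matches whatever mix is deemed desirable.
```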

I agree that how students would behave if they had better information is among the interesting questions this raises.  For example, back to Dan's point: suppose you told a student in advance that, based on their entry profile, they were 80% likely to get a B- or lower in their intro programming class.  Should we tell students that, and potentially rob them of their chance (even if small) to become the next great computer scientist?

Yisong Yue said...

Machine learning isn't (currently) very good at extrapolation. To what extent are the things you're talking about extrapolation?

For instance, predicting who will create game-changing disruptive technology (such as the PC) smells like extrapolation to me.

Harry Lewis said...

Michael,

Just catching up here. I have a lot of thoughts on this, some of which you have heard from me. But a couple of obvious but unstated responses to your speculations. First, would your machine-learning model take into account that high school seniors are pretty good game theorists too, and would quickly figure out and adapt to whatever model was being used to predict their behavior? (We have already seen the phenomenon of insincere declarations of intended majors on application forms.) And second, if the desire were to produce certain outcomes, you'd have to model how changing the mix changes individual choices. For example, some of us who intended to major in math wound up in CS because we did not want to be the 100th-best math major in our class. Or some people wind up as archaeologists because there are people here teaching archaeology, not because of any individual intent.

Sounds like it would have to be a pretty fancy model, and at least at places like ours, we'd be better off tipping for ambition, character, open mindedness, and a history of capitalizing on opportunities. And we will get better education if the faculty have an incentive to compete for students -- by using the right sort of competitive efforts, of course.