Massive data sets — a season’s worth of baseball statistics, for example, or health data from around the world — can contain some very revealing knowledge. The problem confronting researchers, though, is finding it.
That may be a little easier with some tools developed by scientists at Harvard University and the Broad Institute.
The suite of tools, called “MINE” (Maximal Information-based Nonparametric Exploration), was revealed this week in an article published in the journal Science. The tools allow researchers to find patterns and relationships in massive data sets that would otherwise be difficult or impossible to find.
“What makes MINE unique is its ability to find a very broad range of different types of patterns in data and to do that equally well,” one of the authors of the article, David Reshef, who is in a dual degree program at Harvard and MIT, told TechNewsWorld.
Coping With ‘Noise’
Another distinctive characteristic of MINE is its ability to balance generality and equitability in its results.
A statistic has generality when it captures a wide range of associations in a large data sample without being limited to linear, exponential, periodic or other specific types of functions.
It has equitability when scores assigned to data pairs described by the statistic are similar when the “noise” associated with the pairs is similar. Noise is what number crunchers call the amount of unexplained variation in a data sample.
“The reason that’s important is that if you have a method that gives patterns that look different from each other but have the same amount of noise different scores, then you can’t compare scores across different types of patterns,” Reshef explained.
That balancing of generality and equitability by MINE distinguishes it from other tools used for similar purposes, observed another of the article’s authors, Harvard Computer Science Professor Michael Mitzenmacher.
“Other similar data-mining techniques that we know of may have one or the other, but don’t appear to have both,” he told TechNewsWorld.
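To see why generality matters, consider a statistic that is tuned to only one kind of pattern. The sketch below is not MINE's algorithm. It uses the ordinary Pearson correlation coefficient, written out by hand, as a stand-in for a linear-only measure, and the data, noise level and function names are invented for illustration. With the same amount of noise added to a straight line and to a parabola, the linear measure scores the first close to 1 and the second close to 0, even though both relationships are equally clean.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def noisy(values, noise_sd, rng):
    """Add Gaussian noise with the given standard deviation to each value."""
    return [v + rng.gauss(0.0, noise_sd) for v in values]


# Illustrative data only: 1,000 evenly spaced points on [-1, 1].
rng = random.Random(0)
xs = [-1 + 2 * i / 999 for i in range(1000)]
NOISE_SD = 0.05  # the same noise level for both patterns

line = noisy(xs, NOISE_SD, rng)                       # y = x + noise
parabola = noisy([x * x for x in xs], NOISE_SD, rng)  # y = x^2 + noise

r_line = pearson(xs, line)
r_parabola = pearson(xs, parabola)
print(f"line:     r = {r_line:.3f}")
print(f"parabola: r = {r_parabola:.3f}")
```

A statistic with both generality and equitability, as the researchers describe them, would be expected to give these two patterns similar scores.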
MINE could be a valuable tool for anyone analyzing large data sets, especially in scientific fields, such as biology, medicine and genetics.
“We’re looking at a large number of applications there because when you’re looking at genetic data or medical data, there are huge data sets with large numbers of variables, and that’s what our tool is designed to help with,” Mitzenmacher noted.
One hot area of interest in the life sciences now is the role of bacteria in human intestines, which is also one of the topics the researchers applied MINE to in their article. The article’s other authors included Reshef’s brother, Yakir, a Harvard student, and Pardis Sabeti, an assistant professor at the Center for Systems Biology at Harvard and a member of the Broad Institute.
Life scientists are interested in determining relationships between bacteria living in our guts and things like illness and obesity. “We’re starting to get a ton of data on this, and it’s an exciting example of where MINE could be useful because we don’t know what the patterns we’re looking for are going to look like,” Reshef explained.
When MINE was run on the gut bacteria data, it worked through 7,000 variables that produced 22.5 million potential relationships. Of those relationships, some 200 were “interesting.”
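The jump from thousands of variables to millions of relationships follows from counting pairs: n variables give n(n - 1)/2 distinct pairs. The sketch below assumes each "relationship" is an unordered pair of variables, which the article does not state explicitly, and the function names are illustrative. Under that assumption, 22.5 million pairs corresponds to roughly 6,700 variables, which fits the article's round figure of 7,000.

```python
import math


def pair_count(n_variables):
    """Number of distinct unordered pairs among n_variables."""
    return math.comb(n_variables, 2)


def variables_needed(n_pairs):
    """Smallest number of variables that yields at least n_pairs pairs."""
    # Start from the integer part of the solution to n(n - 1)/2 = n_pairs,
    # then step up until the pair count is large enough.
    n = (1 + math.isqrt(1 + 8 * n_pairs)) // 2
    while pair_count(n) < n_pairs:
        n += 1
    return n


print(f"{pair_count(7000):,} pairs from 7,000 variables")
print(f"{variables_needed(22_500_000):,} variables for 22.5 million pairs")
```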
Working in Clusters
The researchers later found out from microbiologists that some of those interesting relationships were linked to diet (some diets repressed certain bacteria, while others allowed those bacteria to flourish), and others depended on where the bacteria appeared in the intestinal tract.
When crunching data, MINE works relatively fast, according to Reshef. A sample producing around 63,000 relationships took about 1.5 hours on a laptop.
But MINE doesn’t have to be restricted to a single computer. It can be run on multiple computers working as a cluster. That’s what was done with the gut data: hundreds of computers were used, and a task that would have taken days was reduced to about two hours.
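The figures quoted above are roughly consistent with one another, as a back-of-the-envelope calculation shows. The sketch assumes that run time grows in proportion to the number of relationships, that every machine in the cluster is about as fast as the laptop, and that the work divides evenly. None of these assumptions comes from the researchers.

```python
import math

# Figures quoted in the article.
LAPTOP_RELATIONSHIPS = 63_000
LAPTOP_HOURS = 1.5
GUT_RELATIONSHIPS = 22_500_000
CLUSTER_HOURS = 2.0

# Assumed: throughput stays constant as the data set grows.
rate_per_hour = LAPTOP_RELATIONSHIPS / LAPTOP_HOURS

# Time for one laptop-class machine to process the gut data alone.
single_machine_hours = GUT_RELATIONSHIPS / rate_per_hour
single_machine_days = single_machine_hours / 24

# Assumed: perfect scaling, with no coordination overhead.
machines_needed = math.ceil(single_machine_hours / CLUSTER_HOURS)

print(f"one machine: {single_machine_hours:.0f} hours "
      f"({single_machine_days:.1f} days)")
print(f"machines needed to finish in {CLUSTER_HOURS:.0f} hours: {machines_needed}")
```

Under these assumptions a single laptop would need about three weeks, and finishing in two hours would take a few hundred machines, in line with the "days" and "hundreds of computers" described in the article.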
By using MINE on large data sets, analysts no longer have to eyeball reams of printouts looking for relationships. If the researchers printed on paper each potential relationship in the human gut bacteria data set, the stack of paper would be 1.4 miles high, or six times the height of the Empire State Building.
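The paper-stack comparison can be checked the same way. The sketch assumes one relationship per sheet and uses 1,250 feet for the roof height of the Empire State Building, a figure that is not given in the article.

```python
FEET_PER_MILE = 5280
MM_PER_FOOT = 304.8

# Figures quoted in the article.
STACK_MILES = 1.4
SHEETS = 22_500_000  # assumed: one potential relationship per sheet

# Assumption, not from the article: roof height without the antenna.
EMPIRE_STATE_ROOF_FEET = 1250

stack_feet = STACK_MILES * FEET_PER_MILE
sheet_thickness_mm = stack_feet * MM_PER_FOOT / SHEETS
buildings = stack_feet / EMPIRE_STATE_ROOF_FEET

print(f"stack height: {stack_feet:,.0f} feet")
print(f"implied sheet thickness: {sheet_thickness_mm:.2f} mm")
print(f"Empire State Buildings: {buildings:.1f}")
```

This works out to about 0.1 mm per sheet, a plausible thickness for office paper, and a little under six building heights.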
MINE’s application need not be restricted to scientific uses, added Mitzenmacher. “People are coming up with all sorts of other suggestions for uses,” he said. “In our paper we tried to show a range of things by also looking at baseball data, and I imagine there will be people who want to try to use the tool for financial analysis.”