Researchers at the Massachusetts Institute of Technology last week announced they had developed an algorithmic system to analyze big data that eventually might replace humans.
The system, called the “Data Science Machine,” designs the feature set and searches for patterns in big data.
The DSM’s first prototype was 96 percent as accurate as the winning submission by a human team in one competition to find predictive patterns in unfamiliar data sets, MIT said. In two others, it was 94 percent and 87 percent as accurate as the winning submissions.
“In effect, you’re replacing a person, a data scientist in this case, and they’re hard to get and spread very thinly,” said Rob Enderle, principal analyst at the Enderle Group.
“Even 87 percent is better than you’d likely get with someone untrained, and it could be close enough for a data scientist to refine the results, substantially cutting the time required for a project,” he told TechNewsWorld.
How DSM Works
Big data analysis searches for buried patterns and extrapolates from them to make predictions, but researchers first have to decide which features of a database to look at.
The DSM aims to automate the selection of the feature set by conducting what’s known as “feature engineering.”
The researchers are graduate student Max Kanter and his thesis advisor, Kalyan Veeramachaneni, a research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory. They used several techniques for feature engineering.
One is to exploit structural relationships inherent in database design by tracking correlations between data in different tables.
The DSM imports data from one table into a second, looks at their associations, and executes operations to generate feature candidates. As the number of correlations increases, it layers operations on top of each other to discover things such as the minima of averages and the average of sums.
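A rough illustration of layering one aggregation on top of another follows. It is only a sketch, not the DSM's code: the tables, column names and the `aggregate` helper are hypothetical, invented here to show how an "average of sums" could be built from related tables.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables: each order belongs to a customer,
# and each item belongs to an order.
orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "a"},
    {"order_id": 3, "customer_id": "b"},
]
items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 30.0},
    {"order_id": 2, "price": 20.0},
    {"order_id": 3, "price": 5.0},
    {"order_id": 3, "price": 15.0},
]


def aggregate(rows, key, value, func):
    """Group rows by `key` and apply `func` to each group's `value` list."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in groups.items()}


# Layer 1: one candidate feature per order (sum of its item prices).
order_total = aggregate(items, "order_id", "price", sum)

# Import the order-level feature into the orders table.
for order in orders:
    order["total"] = order_total[order["order_id"]]

# Layer 2: stack a second operation on top, giving an "average of sums".
avg_order_total = aggregate(orders, "customer_id", "total", mean)
```

Swapping `sum` and `mean` for other functions, such as `min` or `max`, would yield further candidates like the "minima of averages" mentioned above.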
The DSM also looks for categorical data, which appears to be restricted to a limited range of values, such as brand names. It generates further candidates by dividing existing features across categories.
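The article does not say how the DSM decides that a column is categorical. The sketch below assumes a simple rule (a text column with few distinct values) and then computes an existing feature separately for each category. The data, the `max_distinct` cutoff and both helper functions are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical purchase rows; "brand" takes only a few distinct values.
purchases = [
    {"customer_id": "a", "brand": "Acme", "price": 10.0},
    {"customer_id": "a", "brand": "Acme", "price": 30.0},
    {"customer_id": "a", "brand": "Zenith", "price": 20.0},
    {"customer_id": "b", "brand": "Zenith", "price": 5.0},
    {"customer_id": "b", "brand": "Zenith", "price": 15.0},
]


def looks_categorical(rows, column, max_distinct=3):
    """Assumed heuristic: a string column with few distinct values."""
    values = [row[column] for row in rows]
    all_text = all(isinstance(v, str) for v in values)
    return all_text and len(set(values)) <= max_distinct


def split_by_category(rows, key, category, value, func):
    """Compute func(value) per key, separately for each category value."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[key], row[category])].append(row[value])
    return {k: func(v) for k, v in groups.items()}


# Check the two non-key columns; only "brand" passes the heuristic.
categorical = [c for c in ("brand", "price") if looks_categorical(purchases, c)]

# Divide an existing feature (average price per customer) across brands.
avg_price_by_brand = split_by_category(
    purchases, "customer_id", "brand", "price", mean
)
```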
After a number of candidates have been produced, the DSM searches for correlations among them and winnows out those without correlations.
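The article does not specify what the correlations are measured against or which statistic is used. As one possible reading, the sketch below assumes Pearson correlation between each candidate and the value being predicted, and drops candidates below a cutoff. The feature names, the data and the `threshold` value are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def winnow(candidates, target, threshold=0.5):
    """Keep candidates whose absolute correlation with target meets threshold."""
    return {
        name: values
        for name, values in candidates.items()
        if abs(pearson(values, target)) >= threshold
    }


# Hypothetical candidate features, one value per training example.
candidates = {
    "avg_order_total": [1.0, 2.0, 3.0, 4.0],  # rises with the target
    "min_item_price": [4.0, 3.0, 2.0, 1.0],   # falls as the target rises
    "noise": [1.0, -1.0, -1.0, 1.0],          # uncorrelated with the target
    "constant": [7.0, 7.0, 7.0, 7.0],         # carries no information
}
target = [10.0, 20.0, 30.0, 40.0]

kept = winnow(candidates, target)
```

Here the uncorrelated and constant candidates are discarded, while both the positively and the negatively correlated ones survive, since either direction can help a prediction.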
It then tests its reduced set of features on sample data, recombining them in various ways to optimize the accuracy of the resulting predictions.
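How the DSM recombines features is not described here. One simple stand-in is to try every subset of the remaining features, score each on held-out sample data, and keep the best. The sketch below does that with a nearest-centroid classifier, which is chosen purely for brevity and is not claimed to be the model the researchers used. All data and names are hypothetical.

```python
from itertools import combinations


def centroid(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def accuracy(train, test, cols):
    """Fit a nearest-centroid classifier on `cols` and score it on `test`."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append([features[c] for c in cols])
    centroids = {label: centroid(rows) for label, rows in by_label.items()}
    correct = 0
    for features, label in test:
        point = [features[c] for c in cols]
        predicted = min(
            centroids,
            key=lambda lab: sum(
                (p - q) ** 2 for p, q in zip(point, centroids[lab])
            ),
        )
        correct += predicted == label
    return correct / len(test)


def best_subset(train, test, names):
    """Score every non-empty feature subset and return (accuracy, subset)."""
    scored = []
    for size in range(1, len(names) + 1):
        for cols in combinations(names, size):
            scored.append((accuracy(train, test, cols), cols))
    # Highest accuracy wins; ties go to the smaller subset.
    return max(scored, key=lambda s: (s[0], -len(s[1])))


# Hypothetical sample data: (features, label) pairs.
train = [
    ({"signal": 1.0, "noise": 9.0}, 0),
    ({"signal": 2.0, "noise": 1.0}, 0),
    ({"signal": 8.0, "noise": 8.0}, 1),
    ({"signal": 9.0, "noise": 4.0}, 1),
]
test = [
    ({"signal": 2.0, "noise": 7.0}, 0),
    ({"signal": 9.0, "noise": 3.0}, 1),
    ({"signal": 1.0, "noise": 6.0}, 0),
    ({"signal": 8.0, "noise": 5.0}, 1),
]

score, chosen = best_subset(train, test, ["signal", "noise"])
```

Exhaustive search is only practical for a handful of features, since the number of subsets doubles with each feature added. A real system would need a cheaper search strategy.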
“This is really about deep learning: the ability of server platforms to analyze data and develop intelligent algorithms,” remarked Jim McGregor, a principal analyst at Tirias Research.
The DSM research “proves the value of the research being done by companies like Google, Baidu, Alibaba and Microsoft, and indicates some of the challenges,” he told TechNewsWorld.
Developing intelligent algorithms “is a science of learning,” McGregor said. “You don’t always get the right answer the first time, but accuracy improves over time and with additional feedback and more data.”
The potential for machine learning and deep learning is unlimited and “will change our industry and society by allowing both machines and humans to be more productive,” he predicted.
Whizzing Through Problems
Teams of humans typically take months to create prediction algorithms, whereas the DSM took two to 12 hours to produce each of its entries, MIT said.
Even though the DSM didn’t do as well as humans in the competitions, its conclusions still can be valuable.
“Think about what it would take to develop a drug for a supervirus. You don’t have months, you have days before a pandemic breaks out,” McGregor pointed out. In such a case, “it’s not about finding the right answers, but finding the potential answers while eliminating many or most of the wrong ones.”
Within a decade, such a system “should be able to compete with humans, matching or exceeding their accuracy if progress continues,” Enderle said.
The risk is that, as we increasingly rely on such automated systems, we may lose the skill sets needed to do the work ourselves and be unable to see the mistakes the systems make, he cautioned.
“A critical flaw in a future system could go unnoticed as a result and lead to catastrophic consequences,” Enderle posited.
Kanter is presenting his paper at the IEEE International Conference on Data Science and Advanced Analytics in Paris this week.