IBM Research will turn over its data search technology to the open source community, the company said today. The Unstructured Information Management Architecture (UIMA) searches stored data not by matching keywords, but by analyzing the content of documents to see whether they fit the concepts and facts the user is researching.
It will be made available through SourceForge, a repository for open-source code, by the end of the year, IBM said.
Nelson Mattos, IBM distinguished engineer and vice president of strategy, WebSphere Information Integration Solutions, used the example of the word “rock,” which can mean a stone, a type of music or to move back and forth. Searching for the keyword “rock” will yield documents that use any of those meanings, but a UIMA search will be able to filter out the irrelevant ones.
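The difference Mattos describes can be illustrated with a small Python sketch. It is not UIMA code and does not use any UIMA interface; the cue-word lists, function names and sample documents are all invented for illustration, and real concept analysis is far more sophisticated than counting context words.

```python
import re

# Hypothetical cue words for each sense of "rock" (invented for this sketch).
SENSE_CUES = {
    "stone": {"granite", "geology", "mineral", "quarry"},
    "music": {"band", "guitar", "album", "concert"},
    "motion": {"cradle", "chair", "sway", "boat"},
}

# Invented sample documents.
DOCS = [
    "The quarry ships granite and other rock to builders.",
    "The rock band released a new album after the concert.",
    "Rock the cradle gently until the baby sleeps.",
    "Quarterly sales figures for the northeast region.",
]


def tokens(text):
    """Return the set of lowercase words in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_search(docs, keyword):
    """Plain keyword search: every document containing the word matches."""
    return [d for d in docs if keyword in tokens(d)]


def guess_sense(doc):
    """Pick the sense whose cue words overlap most with the document."""
    words = tokens(doc)
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def sense_search(docs, keyword, sense):
    """Keyword search followed by a filter on the guessed sense."""
    return [d for d in keyword_search(docs, keyword) if guess_sense(d) == sense]


if __name__ == "__main__":
    print(keyword_search(DOCS, "rock"))         # all three "rock" documents
    print(sense_search(DOCS, "rock", "music"))  # only the one about the band
```

The keyword search returns every document that mentions “rock,” while the second search keeps only those whose surrounding words point to the requested meaning.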
The ability to hunt through corporate data that is not stored within easily searched databases, called unstructured data, is becoming more and more important as employees communicate and conduct business through e-mail, word processing, Excel and PowerPoint.
“Employees spend about one-third of their time looking for relevant information to get their job done,” Mattos told TechNewsWorld. “Eighty-five percent of data stored in corporate repositories today is unstructured. Only 15 percent is things you can represent as rows and columns and it is that 15 percent that companies use business intelligence to analyze.”
Many Practical Uses
Gathering and analyzing the vast majority of business data can drastically change how companies relate to their clients, because, for instance, they will be able to extract and analyze call center information much more quickly, Mattos said.
The technology has applications beyond enterprises. For example, government agencies could search through all available data, and medical researchers might be able to aggregate information on patients and/or medications and spot patterns earlier.
UIMA, which took four years from concept to inception, is incorporated into IBM’s WebSphere Information Integrator Omnifind Edition, WebSphere Portal Server and Lotus Workplace. IBM also has the support of Attensity, ClearForest, Cognos, Endeca, Factiva, Kana, Inquira, iPhrase, Inxight, nStein, QL2, SAS, Schemalogic, Semagix, SPSS and Temis, making UIMA a standard framework for searching and analyzing unstructured data.
“The framework will have broad applicability once you have companies building applications on it,” Mattos said of the decision to open-source the technology. Google, Microsoft and Yahoo, the major search engine competitors, all offer a desktop search feature, but those features are driven by keywords. With UIMA open-sourced, however, any one of these companies could take the framework and build new search strategies on it.
Thanks for the interesting article. Once again IBM is giving us a great vision about the future and how unstructured information can be searched.
InfoCodex already does all this today with the help of a linguistic database and synonym and similarity search across five languages (German, French, Italian, English and Spanish). With InfoCodex you can search for a block of text in one language and it will find all the similar documents in the other languages as well. All of this is done without a single minute of training, because the linguistic database contains 2.9 million words and terms (e.g. "European Court of Justice" or "The President of the United States" are terms and are recognized as such).