Imagine a world in which you could walk into your office in the morning and see a stoplight. Green would mean your company’s image on the Web was favorable, yellow would mean neutral, and red would mean big trouble — like one of your customers had just blogged that your product obliterates hard drives on contact.
Or imagine being able to specifically query your favorite search engine for “all merger events involving Microsoft in the last six years” or “all mentions of your company plus Company X mentioned within the span of a sentence” or “all research being done on X disease — in any language — that mentions this particular enzyme or related enzymes.”
Or imagine being able to get an automated summary of what your customers are saying about you and your competition in customer service calls, e-mail logs, blogs and newsgroups — delivered to your Inbox every day.
This is the new promise of today’s text mining solutions — solutions that bridge the gap between search and business intelligence (BI). The growing volumes of information freely waiting to be tapped on the public and Deep Web, combined with the growing sophistication of natural-language processing tools that can “read” documents and extract person names, companies, places and other vital information, mean that for the first time, this promise is within our grasp.
At its core, search was designed to help locate documents based on keywords. “Where is that e-mail I sent last month about cattle futures?” “Give me documents about Google.”
Now, people are starting to stretch the boundaries of search. On a search for “Bill Gates,” it’s no longer acceptable to have to wade through a list of tens of thousands of documents that happen to mention him in some fashion. You want to know what people are saying about him on the blogs, what the latest coverage is of his Foundation, etc.
Or perhaps to discover your new competitors and challengers, you want to know what new companies are out there that are mentioned in conjunction with your own.
You want information, not just documents.
More Than Just Numbers
At the other end, business intelligence was primarily designed to deal with highly structured statistical information — sales figures, production numbers, etc. One business intelligence vendor actually said, “Facts and figures are all that is important. The rest is just noise.” Clearly, effective business intelligence is done on the basis of the “noise” — otherwise, we wouldn’t need managers; we would just have racks of computers making perfect decisions based on clean data. It’s not enough to know that sales are falling in Japan — we need to know why they are falling.
The gap between search and intelligence has resulted in a merging of sorts. Players such as Cognos are rolling out search engines in their BI solutions. Players such as Google are displaying structured information from BI systems in their user interfaces.
These are just baby steps, though — a combination of the two worlds “at the glass level” without really being able to effectively utilize all the available information in a single, unified system.
To take advantage of the world’s information, an effective text mining solution needs to bridge this gap. Such a solution needs to encompass five basic elements: content acquisition; structure; vetting and validation; storage; and visualization, exploration and access.
Before you can analyze information for data mining, you have to be able to access it. Years after the promise of portals, corporate and government information is still trapped in silos. Search engines can get you to the open Web, but you still have to visit various other Web sites and systems to access subscription content, Deep Web sources such as patent databases and SEC filings, and internal information. A federated search solution can allow you to acquire information, cull it, and use only the information that is key to your business — without having to index the world’s information yourself.
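The fan-out-and-merge idea behind federated search can be sketched in a few lines. This is a minimal illustration, not any vendor’s implementation; the two source functions are hypothetical stand-ins for connectors to the open Web and to a Deep Web source such as a patent database.

```python
# A toy federated search: send one query to several sources, then
# merge and rank the results in a single list. Source functions and
# scores here are illustrative placeholders for real connectors.

def search_web(query):
    # Stand-in for an open-Web search connector.
    return [{"source": "web", "title": f"Web hit for {query}", "score": 0.8}]

def search_patents(query):
    # Stand-in for a Deep Web source such as a patent database.
    return [{"source": "patents", "title": f"Patent filing: {query}", "score": 0.6}]

def federated_search(query, sources):
    # Query every source, then merge results and sort by relevance.
    results = []
    for source in sources:
        results.extend(source(query))
    return sorted(results, key=lambda r: r["score"], reverse=True)

hits = federated_search("Microsoft merger", [search_web, search_patents])
for hit in hits:
    print(hit["source"], "-", hit["title"])
```

The culling step the text describes would sit between the merge and the sort, filtering out results that are not key to the business.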
The ability of a system to perform accurate, large-scale, rapid detection of people, companies, dates, concepts and other key information trapped in unstructured text is vital, especially for high-demand publishing, government and ASP applications. By automatically classifying, summarizing, and discovering the “who,” “what,” “where” and “when” of each document, publishers, government organizations, and enterprises can do more than ever before — on a massive scale.
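To make the “who/what/when” idea concrete, here is a deliberately simple sketch of entity extraction. Production systems rely on statistical natural-language processing; this toy uses hand-written patterns only to show the shape of the output, and the patterns and sample sentence are invented for illustration.

```python
import re

# A toy entity extractor: pull dates and company names out of free
# text. Real systems use trained NLP models, not regular expressions.

DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)
COMPANY = re.compile(r"\b[A-Z][a-zA-Z]+\s+(?:Inc|Corp|Ltd)\.?")

def extract_entities(text):
    # Return the "who" and "when" of a document as structured fields.
    return {
        "dates": DATE.findall(text),
        "companies": COMPANY.findall(text),
    }

doc = "Acme Corp. announced on March 3, 2006 a merger with Globex Inc."
print(extract_entities(doc))
```

The point is the transformation itself: free text goes in, structured fields come out, ready for downstream storage and analysis.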
Because no automated extraction system is perfect, a vetting and validation system allows human editors and analysts to provide feedback, which improves the accuracy of downstream text analytics and data mining processes.
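The feedback loop can be as simple as a table of analyst corrections applied to machine output before it flows downstream. This is a minimal sketch under that assumption; the names in it are illustrative.

```python
# A minimal vetting step: human analysts record corrections to
# machine-extracted entities, and the corrections are applied before
# the data reaches downstream analytics. Entries are illustrative.

corrections = {"Bill Gate": "Bill Gates"}  # analyst-supplied fixes

def vet(entities):
    # Replace any entity that has a recorded correction.
    return [corrections.get(e, e) for e in entities]

print(vet(["Bill Gate", "Microsoft"]))  # prints the corrected list
```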
Information that comes from text analytics systems is usually expressed in XML, so a way to store that information for easy access is required. The kind of storage required for unstructured data is vastly different from standard database storage. XML has emerged as the document mark-up standard, with XQuery as the standard means for accessing collections of XML documents.
To handle large volumes of XML content effectively, you need a system specifically designed for content (e.g., documents), not data — one that can leverage XQuery, respond to queries in milliseconds, and scale to content bases of tens or hundreds of terabytes.
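The kind of query such a store answers can be sketched without a full XML database. The sample document below is invented, and Python’s standard-library XPath subset stands in for XQuery, which a real deployment would run against the content base.

```python
import xml.etree.ElementTree as ET

# Extracted information expressed as XML, as described above. The
# document and its schema are illustrative; a real system would hold
# millions of such records and query them with XQuery.

xml_doc = """
<documents>
  <document id="1">
    <company>Microsoft</company>
    <event type="merger" year="2004"/>
  </document>
  <document id="2">
    <company>Google</company>
    <event type="acquisition" year="2006"/>
  </document>
</documents>
"""

root = ET.fromstring(xml_doc)

# Roughly: "all documents involving a merger event" -- the shape of
# query a user would pose over the content base.
merger_companies = [
    d.find("company").text
    for d in root.findall("document")
    if d.find("event").get("type") == "merger"
]
print(merger_companies)  # ['Microsoft']
```

Answering the “all merger events involving Microsoft in the last six years” query from the introduction is the same pattern with one more predicate on the `year` attribute.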
Once information is stored, users need to be able to query and explore the information. The key is to allow users to gain insight from text-based information without making them significantly change the way they do business today.
Search interfaces with entity clustering can provide one way of querying and exploring information — seeing at a glance the concepts, companies, people, and other things associated with a search term or company of interest. Visualizations allow users to quickly sift through and locate information and patterns in hierarchical, relational, tabular, or time-based data sets; exploring, for example, co-occurrence relationships between people or between events. E-mail alerts can instantly make users aware of events or trends of interest.
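The co-occurrence analysis mentioned above reduces to counting how often pairs of extracted entities appear in the same document. Here is a minimal sketch of that counting step; the entity lists are invented sample data.

```python
from collections import Counter
from itertools import combinations

# Count entity co-occurrence across documents: the raw material for
# "companies mentioned in conjunction with your own." Each inner list
# holds the entities extracted from one document (sample data).

docs_entities = [
    ["Microsoft", "Google"],
    ["Microsoft", "Yahoo"],
    ["Microsoft", "Google", "Yahoo"],
]

pairs = Counter()
for entities in docs_entities:
    # Sort so (A, B) and (B, A) count as the same pair.
    for a, b in combinations(sorted(set(entities)), 2):
        pairs[(a, b)] += 1

print(pairs.most_common(3))
```

A visualization layer would render these counts as a relationship graph; an alerting layer would fire when a new pair crosses a threshold.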
Accurate data mining in the future will take into account not just structured information, but unstructured information as well. This new generation of solutions will need to go deeper than keyword search — it will require a deep understanding of language, to lend structure to unstructured data for use in downstream analysis and assessment. Such a system will allow users in government, OEM and enterprise settings to gain significant business benefits and a deeper understanding of their business environment, taking into account not only facts and figures from their own organization, but also global information in many languages, contained in blogs, CRM fields, e-mails, analyst reports, newsgroups, and other sources.
The world’s data is at your fingertips — where it should be.
Catherine van Zuylen is evangelist at Inxight Software in Sunnyvale, Calif.