Big Data, Tough Questions
Vast databases of information being mined for emergent patterns and used to process simulations over and over are hardly new. How is it, then, that the security world has not tapped into this pool of expertise before now to help us glean the knowledge lying dormant within our vast supplies of data? Quite simply, it's because we still don't know what questions to ask in the first place.
04/20/12 5:00 AM PT
If you work with SIEM, you've spent more than a couple of years dealing with "Big Data" -- more lines of logs than any one person (or even a reasonably sized team of people) could ever hope to keep up with reading.
Our data is plenty big enough already. So why is everyone so hyped over Big Data now?
Because behind the buzzword are a whole host of technologies that enable us to do more, in less time, with more data. But first let's talk about what has led us to realize the power in Big Data.
Once we realized that log data was an incredibly rich source of information for detecting security intrusions, we developed a taste for more, more, more logs. Overnight, we had more logs than we could ever review directly, but at least we had some assurance that, post facto, if we needed to search for something, it would be there to confirm our suspicions.
Security has always striven to be a little more proactive, however.
Log Correlation came next (and in many forms): The realization that individual log entries by themselves meant very little, but when placed into context against one another illustrated more than just system-level events. They illustrated behavioral context -- clusters of individual log lines could be translated into records of human-readable actions.
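As a rough illustration of the idea, the Python sketch below groups individual log lines by source and reports a burst of failed logins followed by a success as a single human-readable action. The field names (`ts`, `src`, `event`), the threshold and the time window are all hypothetical, chosen only for illustration; real correlation rules would need tuning to the environment.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, already-parsed log entries; field names are assumptions.
LOGS = [
    {"ts": "2012-04-20 05:00:01", "src": "10.0.0.7", "event": "login_failed"},
    {"ts": "2012-04-20 05:00:03", "src": "10.0.0.7", "event": "login_failed"},
    {"ts": "2012-04-20 05:00:05", "src": "10.0.0.7", "event": "login_failed"},
    {"ts": "2012-04-20 05:00:09", "src": "10.0.0.7", "event": "login_ok"},
    {"ts": "2012-04-20 05:01:00", "src": "10.0.0.9", "event": "login_ok"},
]

def correlate(logs, min_failures=3, window=timedelta(minutes=1)):
    """Return a readable finding for each success that follows a burst of failures."""
    by_src = defaultdict(list)
    for entry in logs:
        ts = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S")
        by_src[entry["src"]].append((ts, entry["event"]))

    findings = []
    for src, events in by_src.items():
        events.sort()  # order each source's events by time
        failures = []  # timestamps of failed logins since the last success
        for ts, event in events:
            if event == "login_failed":
                failures.append(ts)
            elif event == "login_ok":
                # Only failures close enough to the success count toward the burst.
                recent = [f for f in failures if ts - f <= window]
                if len(recent) >= min_failures:
                    findings.append(
                        f"{src}: {len(recent)} failed logins, then success "
                        f"(possible password guessing)"
                    )
                failures = []
    return findings

FINDINGS = correlate(LOGS)
```

Four raw lines from one source collapse into one statement about behavior, while the lone successful login from the other source produces nothing.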
Security is still in the early days of the science and practice of event correlation: Methods and results are rarely shared with the community, the target for what is effective keeps moving, and yet we're already talking about Big Data.
Terror and Possibility
This is, of course, the intersection of terror and possibility, as we transition from our first fumbling attempts to boil the ocean into a land populated by people who have been doing this stuff for a long time before us.
If information security has a singular conceit, it is that we are somehow pioneers in truly uncharted territory: As the field matures to the level where the problems we face begin to require the same computational solutions as the problems other fields face, this conceit must evaporate as we face the truth that other people may have (unwittingly) already solved our problems for us.
During the intro to this piece, I was reminiscing about information security's first tentative steps into data analytics, a hurdle made necessary by the ever-increasing complexity of our field.
Yet this is where we find ourselves -- at the crux of an ever-increasing ocean of data to boil and our insistence that we are alone in our problem. We find ourselves in need of others' technology. Suddenly, we don't seem so alone in the computational problems facing us.
Vast databases of information being mined for emergent patterns and used to process simulations over and over are hardly new to the world -- the finance, medical and aerospace industries have spent years in this realm. How is it, then, that the security world has not tapped into this pool of expertise before now to help us glean the knowledge lying dormant within our vast supplies of data? Quite simply, it's because we still don't know what questions to ask in the first place.
What's Out There
At this point, it's worth a short recap of the emerging Big Data technologies and why they differ from being just "large databases." Although there are many implementations of these technologies, they all derive from two core concepts: NoSQL and MapReduce.
NoSQL is a difficult beast to define even among the experts in that field. What you need to know up front as a security practitioner, however, is that NoSQL can be defined by:
- Lack of strongly structured schemas. Unlike an RDBMS, where the schema must be well-defined before data is stored and changes to that schema become increasingly infeasible once live data is present, NoSQL data stores may freely adapt the nature of the records they store over time.
- Optimization for rapid retrieval of information at the possible expense of data consistency (many do not provide full ACID guarantees). To wit, they are excellent systems with which to do analytical work but have inherent issues if treated as the authoritative repository.
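The first point can be sketched in a few lines of Python. Here a plain in-memory list of dictionaries stands in for a document-style store; this is an assumption made for illustration, not the API of any real NoSQL product, and the field names are hypothetical.

```python
# A list of dicts stands in for a schemaless document store (illustrative only).
store = []

# Early records carry only the fields we knew about at the time.
store.append({"host": "web01", "event": "http_request", "uri": "/login"})

# Later records add fields without any schema migration.
store.append({
    "host": "web02",
    "event": "http_request",
    "uri": "/login",
    "user_agent": "curl/7.21",       # new field
    "asset_owner": "payments-team",  # organization-specific context
})

def find(records, **criteria):
    """Return records whose fields match every given key/value pair.

    Records lacking a queried field simply do not match.
    """
    missing = object()  # sentinel that never equals a real value
    return [
        r for r in records
        if all(r.get(k, missing) == v for k, v in criteria.items())
    ]

login_hits = find(store, uri="/login")
owned_hits = find(store, asset_owner="payments-team")
```

Both records answer the `uri` query even though they have different shapes, and the older record is simply skipped by the `asset_owner` query rather than causing an error.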
Likewise, for the same audience, MapReduce's key features are:
- The ability to perform information retrieval and calculation over widely distributed data storage. A practical example would be that if individual devices had their log storage implemented in a MapReduce-capable manner, then a centralized log storage mechanism may no longer be required -- a single query could be performed across all logs on all devices simultaneously.
- Conversely, a centralized store may still exist but be spread out over a computing grid of commodity hardware (indeed, this was the reason for Google's creation of MapReduce).
- Generally speaking, there is comparatively little need for the end-user to optimize their query sets to take advantage of MapReduce's distributed nature.
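To make the pattern concrete, here is a minimal single-process Python sketch of the map and reduce phases. It is not Hadoop or Google's implementation; the per-device log shards, the line layout and the function names are assumptions, and a thread pool merely stands in for a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical per-device log shards: each device keeps its own lines.
SHARDS = {
    "fw01": ["DENY 10.0.0.7", "ALLOW 10.0.0.9", "DENY 10.0.0.7"],
    "fw02": ["DENY 10.0.0.7", "DENY 10.0.0.8"],
    "web01": ["ALLOW 10.0.0.9"],
}

def map_shard(lines):
    """Map phase: scan one shard and emit partial DENY counts per source."""
    counts = Counter()
    for line in lines:
        action, src = line.split()
        if action == "DENY":
            counts[src] += 1
    return counts

def merge(left, right):
    """Reduce phase: combine two partial results into one."""
    return left + right

def denied_by_source(shards):
    # Each shard is processed independently, so the work can be spread out.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_shard, shards.values()))
    return reduce(merge, partials, Counter())

TOTALS = denied_by_source(SHARDS)
```

The caller writes only the per-shard function and the merge function. Where the shards live and how many workers process them is left to the framework, which is the property the last bullet describes.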
So, we can immediately see some of the reasons these two technologies have generated excitement in, and hold promise for, the information security world:
- Increased speed on complex queries across large quantities of data is a vital force-multiplier for security analysts; the ability to query every machine that has accessed a particular URI in the last 90 days in minutes (not hours or even days) cannot be overlooked.
- The flexibility to bring additional data to supplement existing records works in lockstep with the inherent nature of security information: that it is comparatively a domain of unstructured data. Freedom from data schemas that fail to take into account the information that is vital to the organization we are trying to defend will allow us to make better correlations and ask better questions of our data.
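The URI question above is easy to state in code. The Python sketch below runs it as a linear scan over a handful of hypothetical proxy records (the hosts, URIs and field names are invented for illustration); the promise of the technologies described here is answering the same question over billions of records in minutes.

```python
from datetime import datetime, timedelta

# Hypothetical proxy log records; field names are assumptions.
PROXY_LOGS = [
    {"host": "ws-014", "uri": "http://example.test/payload", "ts": datetime(2012, 4, 1)},
    {"host": "ws-022", "uri": "http://example.test/payload", "ts": datetime(2011, 11, 5)},
    {"host": "ws-031", "uri": "http://example.test/other", "ts": datetime(2012, 4, 10)},
    {"host": "ws-014", "uri": "http://example.test/payload", "ts": datetime(2012, 4, 15)},
]

def hosts_that_accessed(logs, uri, now, days=90):
    """Return the sorted, de-duplicated hosts that requested `uri` in the last `days`."""
    cutoff = now - timedelta(days=days)
    return sorted({r["host"] for r in logs if r["uri"] == uri and r["ts"] >= cutoff})

HITS = hosts_that_accessed(
    PROXY_LOGS, "http://example.test/payload", now=datetime(2012, 4, 20)
)
```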
Between these two factors, we can see where the excitement comes from, and yet we still have to return to the same issues we struggled with before the advent of Big Data.
What Do You Want to Know?
We still aren't very good at asking the right questions of our data.
In security analytics, it's often the relations between the data (not the data itself) that are important. Just as detective work is a matter of "connecting the dots," so the true information lies in the relations between our data points (Log Correlation itself is about looking for and exposing those relations).
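One hedged way to picture this is to treat log records as edges in a graph rather than rows in a table. The Python sketch below uses hypothetical authentication records and asks a relational question: which hosts are linked, through shared accounts, to a host we already suspect?

```python
from collections import defaultdict, deque

# Hypothetical authentication records: (user, host) pairs seen in logs.
AUTH_EVENTS = [
    ("alice", "web01"),
    ("alice", "db01"),
    ("bob", "db01"),
    ("bob", "hr01"),
    ("carol", "mail01"),
]

def build_graph(events):
    """Link each user to each host they touched (an undirected bipartite graph)."""
    graph = defaultdict(set)
    for user, host in events:
        graph[("user", user)].add(("host", host))
        graph[("host", host)].add(("user", user))
    return graph

def reachable_hosts(graph, start_host):
    """Breadth-first search for hosts connected to start_host via shared users."""
    start = ("host", start_host)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return sorted(
        name for kind, name in seen if kind == "host" and name != start_host
    )

EXPOSED = reachable_hosts(build_graph(AUTH_EVENTS), "web01")
```

No single record mentions both `web01` and `hr01`; the link exists only in the relations between records, which is exactly what a keyword search over individual lines cannot surface.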
Individually, we are excellent at this. The human brain is a wonderfully powerful pattern-recognition engine, and yet we never seem to be able to translate this to the level of better machine-level pattern recognition within security data.
As IT professionals, we share a particular reticence to trust anything we didn't do hands-on ourselves; as security professionals, this trait becomes magnified. Perhaps it is because the concepts we are looking for (exposures, risks, threat surfaces) are so difficult to define that we are still stuck in the stone age of bar charts and keyword searches when it comes to data analytics.
No amount of Big Data is going to save us until we can learn to formulate better questions for that data. Perhaps it's time that we accepted that the problems we're approaching now (trying to boil an ocean of data points into digestible information) are not unique to us. Information security as a discipline may have much to learn from other technology fields. It's a tough pill to swallow when you think of how much we collectively berate the rest of IT as being the source of all our issues in the first place.
I'll cut to the chase here: bioinformatics.
Bioinformatics places emphasis on discovering the nature of interactions and relations between its data points, since this is intrinsic to how biology operates too. It won't take long before you find a plethora of advanced (and aesthetically pleasing) visualization techniques being used to present and explore data relations, like the CIRCOS system.
Bioinformatics has made great strides in distilling complex data relationships into advanced visualization techniques that make the most of human pattern recognition to draw inferences that are difficult to make programmatically.
Ask better questions, discover relationships, create hypotheses and test them against more data; rinse, repeat -- the scientific method.
Big Data will not magically enable us to discern better answers until we come up with better questions to explore the relationships between our data more thoroughly.
The field of log correlation could make great strides if we were to establish an open format for exchanging ideas for correlations in a vendor-neutral manner and collectively discuss what is effective within the field, instead of operating as we do today.
Information security is evolving into areas well explored within other fields. Our issues with discovering relations and implications from our oceans of unstructured data are at the heart of the field of complex event processing.
We're moving into territory where we are not as alone as we think; if we are going to reap the benefits that Big Data promises and not let this become another failed fad, then we have to start overcoming our isolationist attitude and start inviting experts from other disciplines to join us and teach us how to use this new toolset.