Over the past few decades, most IT shops have followed a somewhat similar trajectory: Starting from a centralized model (i.e., the mainframe days), computing resources, much like the cosmological Big Bang, have exploded outwards to become ever-more-distributed and decentralized. This makes sense given market dynamics. Computing platforms evolve quickly, so monolithic computing platforms that require heavy up-front investment are less efficient from a depreciation standpoint (i.e., from a MIPS per dollar per year point of view) than numerous, incremental investments in lower-powered devices.
So it’s natural that processing would decentralize. And in fact, there have been numerous technologies invented over the years to support exactly this paradigm.
By virtue of ever-more decentralized processing, it logically follows that storage would be (in general) decentralized as well. In fact, storage becomes a balancing act. Data is placed in such a way as to be centralized enough to be manageable, while still being distributed enough to be efficiently used by consumers of that data. That’s the paradigm of recent history. But this paradigm is changing — changing in a way that impacts how we manage IT overall from a security perspective. And that change is “big data.”
What Is ‘Big Data’?
This emerging paradigm — “big data” — is the logical outgrowth of increased use of virtualization technology, cloud computing and data center consolidation. All of these technologies have huge cost and efficiency benefits. And they all also leverage standardization, consolidation and centralization of resources to achieve economies of scale — part of what makes that cost benefit possible. And what organizations are finding as they deploy these technologies — centralizing resources like storage along the way — is that they’ve produced quite a lot of data … in some cases, even exabytes of data. To put that into perspective, the total number of words ever spoken by humans is estimated to be around 5 exabytes.
Smart folks (for example, observant engineers and scientists within the social network community) have discovered that having a lot of data in one place opens up opportunities to put that data to productive use; utility is, apparently, an emergent property of large volumes of data. So as the volume of data compounds, so do the opportunities to leverage it. It’s starting to be transformative to business, telling us quite a bit about our customers, about how they use our services, and about how our businesses run in general.
Of course, for those of us who practice security, it goes without saying that this changes the landscape. There are upsides and downsides from a security standpoint to this shift. For example, on the one hand, it’s easier to protect data when you know where it is and it’s all in the same place; on the other hand, it makes for a bigger target from a hacker perspective. Going into each and every pro and con of big data from a security standpoint would likely take all day, but suffice it to say that the practice of information security as a discipline will change as a result of the transformation going on.
Why? Because the volume of data is increasing nonlinearly. But most of us don’t have the tools or processes designed to accommodate nonlinear growth. Meaning, looking down the road, we’re seeing scenarios play out (they’re starting already) where traditional tools — in particular, security tools — no longer provide the kind of value they have historically.
So for organizations looking to plan ahead for changes coming down the pike (or to use alternate phrasing, “not get clocked in the head”), it behooves them to think through now how they can get out in front of the changes. You wouldn’t buy a stockpile of charcoal briquettes the day before you go out and buy a propane grill, would you? So thinking about how this is likely to play out — and paying attention to where the industry is going — can pay off.
What Were Your Tools and Processes Designed to Do?
So, some folks might naturally ask, “Why does it matter?” or “Who cares if there’s a lot of data; how can that possibly impact our security tools?” Stop for a moment and think about what tools you use right now to support security within your environment. Now reflect for a moment on how many of them presuppose a limited volume of data to search through or transform.
Consider, for example, how difficult it is to do a malware scan across a large network-attached storage (NAS) volume or SAN. How long do you suppose it would take if the dataset were 1,000 times larger? How about 100,000 times larger? What if it were growing at a geometric rate? Would it be feasible to scan through it all every day, like we do now?
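A quick back-of-the-envelope sketch makes the point. The throughput figure and volume sizes below are illustrative assumptions, not benchmarks; the takeaway is that a full-read scan scales linearly with data volume, so a comfortable daily scan today becomes a multi-year job at the scales discussed here.

```python
# Rough sketch: how full-volume scan time scales with dataset size.
# Throughput and baseline volume are assumptions for illustration only.

def scan_hours(volume_tb: float, throughput_mb_per_s: float = 200.0) -> float:
    """Hours to read every byte of a volume at a fixed scan throughput."""
    volume_mb = volume_tb * 1024 * 1024  # TB -> MB
    return volume_mb / throughput_mb_per_s / 3600

baseline_tb = 10  # a hypothetical NAS volume today
for factor in (1, 1_000, 100_000):
    hours = scan_hours(baseline_tb * factor)
    print(f"{baseline_tb * factor:>12,} TB -> {hours:>12,.0f} hours (~{hours / 24:,.0f} days)")
```

At the assumed throughput, the 10 TB baseline fits comfortably inside a day; scale it by 100,000 and the same scan runs for more than a century.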
What about the case where data discovery is required to support data leak prevention (DLP) or regulatory compliance? What happens, for example, when your PCI auditor wants to conduct a regular expression search for credit card numbers across the data stored in your cardholder data environment — except the CDE contains an exabyte of data? The search alone would be hard enough, let alone the manual post-scan vetting of gigabytes of false positives. These two controls just become non-viable — at least using the same methods we’ve always used in the past.
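To make the false-positive problem concrete, here is a minimal sketch of that kind of discovery pass: a regular expression flags candidate digit runs, and a Luhn checksum weeds out numbers that can't be real cards. The pattern and sample text are illustrative assumptions, not a complete PAN-detection ruleset — and note that every byte still has to be read, which is exactly what stops scaling.

```python
# Sketch of a card-number discovery scan: regex candidates + Luhn vetting.
# Pattern and sample data are illustrative, not a production ruleset.
import re

CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_pans(text: str) -> list[str]:
    """Regex candidates that also pass Luhn -- plausible card numbers."""
    hits = []
    for m in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if 13 <= len(digits) <= 16 and luhn_valid(digits):
            hits.append(digits)
    return hits

sample = "order 4111111111111111 confirmed; ref 1234567890123456"
print(find_pans(sample))  # the Visa test number passes Luhn; the ref does not
```

The Luhn check kills the cheapest false positives, but plenty of non-card sixteen-digit strings pass it too — hence the manual vetting step the paragraph above describes.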
There are a number of scenarios where data size could be a factor in the proper operation of a security control or supporting process. Consider, for example, log parsing, file monitoring, encryption/decryption of stored data, and file-based data integrity validation controls. These all operate in a manner that is a function of data volume, and they may very well need to change to continue to be viable. So just as clever folks are designing new tools (for example, databases) to make searching manageable in the large-data-volume world, so too will the tools we use in security have to change to meet this new challenge.
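One direction such a change can take is making controls incremental rather than bulk. The sketch below, under assumed details (an in-memory baseline dict; a real file-integrity monitor would persist it), re-hashes only files whose size or modification time changed since the last pass, so the work done tracks the volume of *changed* data rather than total data.

```python
# Sketch of an incremental file-integrity check: keep a baseline of
# (size, mtime) and digest per file, and re-hash only what changed.
# The in-memory baseline dict is an illustrative assumption.
import hashlib
import os

def sha256_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def incremental_check(root: str, baseline: dict) -> list[str]:
    """Return paths whose content changed; update baseline in place."""
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            key = (stat.st_size, int(stat.st_mtime))
            prev = baseline.get(path)
            if prev and prev[0] == key:
                continue  # metadata unchanged: skip the expensive hash
            digest = sha256_file(path)
            if prev is None or prev[1] != digest:
                changed.append(path)
            baseline[path] = (key, digest)
    return changed
```

The first pass over a volume is still expensive, but every subsequent pass costs only as much as the data that actually moved — which is the property bulk-scan designs lack.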
While this change won’t happen overnight, it does pay for security professionals to start thinking about this now — if only to have it in the back of their minds when evaluating new tool purchases. Knowing that a geometric increase in data might be on the horizon, rolling out a new data discovery tool based on linear search might not necessarily be the best idea — at least not without some tough questions posed to the vendor in question. Conversely, it might hurry deployments of controls like file encryption that operate on data incrementally as it’s produced. Bulk encryption of an exabyte of data might not be easy to do when it’s produced en masse — however, if that control is put in place now before the data organically grows? Well, that might be a different story.
The good news is that we have time to prepare. We have time to make adjustments in our processes and controls before the problem becomes intractable. But considering how quickly virtualization is happening in the industry, this issue could be on us sooner than one might think. So it’s a useful place to put some thought.