Needle in a Haystack: Harnessing Big Data for Security

By Dan Hubbard
Sep 14, 2013 5:00 AM PT

The combination of the polymorphic nature of malware, failure of signature-based security tools, and massive amounts of data and traffic flowing in and out of enterprise networks is making threat management using traditional approaches virtually impossible.

Until now, security has been based largely on the opinions of researchers who investigate attacks through reverse engineering, homegrown tools and general hacking. In contrast, the Big Data movement makes it possible to analyze an enormous volume of widely varied data to prevent and contain zero-day attacks without details of the exploits themselves. The four-step process outlined below illustrates how Big Data techniques lead to next-generation security intelligence.

Information Gathering

Malware is transmitted between hosts (e.g., server, desktop, laptop, tablet, phone) only after an Internet connection is established. Every Internet connection begins with the three Rs: Request, Route and Resolve. The contextual details of the three Rs reveal how malware, botnets and phishing sites relate at the Internet layer, not simply at the network or endpoint layer.

Before users can publish a tweet or update a status, their device must resolve the IP address currently linked to a particular domain name within a Domain Name System record. With extremely few exceptions, every application, whether benign or malicious, performs this step.
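The resolve step can be observed from any host. The following is a minimal sketch using Python's standard socket module; the helper name resolve is an assumption made for illustration, and the answer it returns depends on whichever resolver the local system is configured to use.

```python
import socket

def resolve(name):
    """Return the sorted, de-duplicated IP addresses the local resolver reports for name."""
    # getaddrinfo asks the system resolver, which normally issues the DNS query.
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    # An IP literal resolves to itself without any network traffic.
    print(resolve("127.0.0.1"))
```

Passing a real domain name instead of the literal returns the addresses currently associated with it, which is exactly the association that DNS traffic logs capture over time.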

Multiple networks then route this request over the Internet, but the two hosts never connect directly. Internet Service Providers connect the hosts and route data using the Border Gateway Protocol (BGP). Once the connection is established, content is transmitted.

If researchers can continuously store, process, and query data gathered from BGP routing tables, they can identify associations for nearly every Internet host and publicly routable network. If they can do the same for data gathered from DNS traffic, they can learn both current and historical Host IP Address/Host Name associations across nearly the entire Internet.

By combining these two Big Data sets, researchers can relate any host's name, address, or network to another host's name, address, or network. In other words, the data describes the current and historical topology of the entire Internet -- regardless of device, application, protocol, or port used to transmit content.
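To make the idea of combining the two data sets concrete, here is a small sketch under stated assumptions. The records, prefixes and AS numbers are invented (they use documentation-reserved ranges), and the function names origin_asn, hosts_by_asn and related_hosts are hypothetical. A production system would hold billions of rows in a distributed store rather than in Python lists.

```python
import ipaddress
from collections import defaultdict

# Hypothetical DNS observations: (host name, resolved IP address).
DNS_RECORDS = [
    ("shop.example.com", "192.0.2.10"),
    ("cdn.example.net", "192.0.2.77"),
    ("mail.example.org", "198.51.100.5"),
]

# Hypothetical table derived from BGP routing data: announced prefix -> origin AS number.
BGP_PREFIXES = {
    "192.0.2.0/24": 64500,
    "198.51.100.0/24": 64501,
}

def origin_asn(ip, prefixes=BGP_PREFIXES):
    """Return the origin ASN of the longest matching prefix, or None if unrouted."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, asn in prefixes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None

def hosts_by_asn(records=DNS_RECORDS):
    """Group host names by the network that announces their IP address."""
    grouped = defaultdict(set)
    for name, ip in records:
        grouped[origin_asn(ip)].add(name)
    return dict(grouped)

def related_hosts(name, records=DNS_RECORDS):
    """Return the other host names that share an origin network with name."""
    for names in hosts_by_asn(records).values():
        if name in names:
            return sorted(names - {name})
    return []

if __name__ == "__main__":
    print(related_hosts("shop.example.com"))
```

The join key is the IP address: DNS data ties names to addresses, and BGP data ties addresses to networks, so a query on any one of the three can reach the other two.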

Extracting Actionable Information

While storing contextual details on a massive volume of Internet connections in real-time is no easy task, processing this data in order to extract useful information about an ever-changing threat landscape might be nearly impossible. There is an art to querying these giant data sets in order to find the needles in the haystack.

First, start with known threats. It's possible to learn about these from multiple sources, such as security technology partners or security community members that publicly share discoveries on a blog or other media site.

Second, form a hypothesis. Analyze known threats to develop theories on how criminals will continue to exploit the Internet's infrastructure to get users or their infected devices to connect to malware, botnets and phishing sites. Observing patterns and statistical variances regarding the requests, routes and resolutions for malicious hosts is one of the keys to predicting the presence and behavior of malicious hosts in the future.

Spatial patterns can reveal malicious hosts, since they often share infrastructure with other malicious websites -- for example, the same publicly routable network (identified by an autonomous system number, or ASN), the same geographic location, the same domain name, the same IP address, the same name server host storing the DNS record, or other objects. Infected devices connect with these hosts more often than clean devices do.
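One simple way to express a spatial pattern is to count how many known-malicious hosts share each piece of infrastructure with a candidate. The sketch below assumes every host is described by a small dictionary; the field names (asn, ip, ns), the sample data and the function shared_infrastructure are all invented for illustration.

```python
from collections import Counter

# Hypothetical set of hosts already known to be malicious.
KNOWN_MALICIOUS = [
    {"name": "bad1.example", "asn": 64500, "ip": "192.0.2.10", "ns": "ns1.bad.example"},
    {"name": "bad2.example", "asn": 64500, "ip": "192.0.2.11", "ns": "ns1.bad.example"},
    {"name": "bad3.example", "asn": 64502, "ip": "203.0.113.9", "ns": "ns9.other.example"},
]

# Attributes compared between the candidate and each known-malicious host.
SHARED_ATTRIBUTES = ("asn", "ip", "ns")

def shared_infrastructure(candidate, known=KNOWN_MALICIOUS):
    """Count the known-malicious hosts that share each attribute with candidate."""
    counts = Counter()
    for host in known:
        for attr in SHARED_ATTRIBUTES:
            value = candidate.get(attr)
            if value is not None and value == host.get(attr):
                counts[attr] += 1
    return {attr: counts[attr] for attr in SHARED_ATTRIBUTES}

if __name__ == "__main__":
    new_host = {"name": "new.example", "asn": 64500, "ip": "192.0.2.50", "ns": "ns1.bad.example"}
    print(shared_infrastructure(new_host))
```

The counts are only features, not verdicts. Large hosting networks carry plenty of benign sites, so a shared ASN on its own is weak evidence and is best weighed together with other signals.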

Temporal patterns can be used to identify malicious hosts by showing evidence of irregular connection request volume or new domains with sudden high spikes in volume immediately after domain registration. Statistical variances, such as a domain name with abnormal entropy (gibberish), can also reveal malicious hosts.
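Both signals are straightforward to compute. The sketch below measures the Shannon entropy of a domain label in bits per character and flags a volume spike when the latest day's request count dwarfs the prior average. The function names and the spike factor of 10 are assumptions chosen for illustration, not thresholds taken from any production system.

```python
import math
from collections import Counter

def shannon_entropy(label):
    """Shannon entropy, in bits per character, of a string."""
    if not label:
        return 0.0
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def has_volume_spike(daily_requests, factor=10.0):
    """True if the latest day's volume exceeds factor times the mean of earlier days."""
    if len(daily_requests) < 2:
        return False  # no history to compare against
    *history, latest = daily_requests
    baseline = sum(history) / len(history)
    # Floor the baseline at 1 so a quiet, newly registered domain still compares sensibly.
    return latest > factor * max(baseline, 1.0)

if __name__ == "__main__":
    for label in ("google", "xq7zk2vw9p"):
        print(label, round(shannon_entropy(label), 2))
    print(has_volume_spike([2, 3, 1, 500]))
```

Machine-generated labels tend to use many distinct characters and therefore score higher than dictionary-like names, although short or legitimately random names (such as content-delivery hashes) can also score high, so entropy is one feature among several.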

Third, process the data -- repeatedly. On the Internet, threats are always changing. Processing a constant flow of new data calls for an adaptable, real-time machine-learning system. It needs classifiers that are based on a hypothesis. Alternatively, the data can be clustered based on general objects and elements, and training algorithms can draw on a positive set of known malicious hosts as well as a negative set of known benign hosts.
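As an illustration of training on a positive and a negative set, the sketch below uses a nearest-centroid classifier. This is a deliberately simple stand-in chosen so that the example needs only the standard library; it is not the system described in this article. The two features (name entropy and a count of malicious neighbours) and all of the numbers are hypothetical, and real features would need scaling before distances are compared.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

class NearestCentroid:
    """Scores a host by its distance to the malicious and benign centroids."""

    def fit(self, positives, negatives):
        # positives: feature vectors of known malicious hosts
        # negatives: feature vectors of known benign hosts
        self.malicious = centroid(positives)
        self.benign = centroid(negatives)
        return self

    def score(self, features):
        """Return a value in [0, 1]; higher means closer to the malicious set."""
        d_mal = math.dist(features, self.malicious)
        d_ben = math.dist(features, self.benign)
        if d_mal + d_ben == 0:
            return 0.5
        return d_ben / (d_mal + d_ben)

if __name__ == "__main__":
    # Hypothetical features: [name entropy, malicious neighbours on the same network]
    model = NearestCentroid().fit(
        positives=[[3.8, 5], [3.5, 4]],
        negatives=[[2.0, 0], [2.4, 1]],
    )
    print(round(model.score([3.7, 5]), 2))
```

Calling fit again whenever newly confirmed hosts arrive is the "repeatedly" part of this step: the centroids move as the threat landscape does.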

Fourth, run educated queries to reveal patterns and test hypotheses. After processing, the data becomes actionable, but there may be too much information to effectively validate hypotheses. At this stage, visualization tools can help to organize the data and bring meaning to the surface.

For instance, a researcher may query one host attribute, such as its domain name, but receive multiple scored features output by each classifier. Each score or score combination can be categorized as malicious, suspicious or benign and then fed back into the machine-learning system to improve threat predictions.
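The categorization step might look like the following sketch. It assumes each classifier returns a score between 0 and 1 and combines them with a plain average. The classifier names, the averaging rule and the two cut-off values are assumptions made for illustration.

```python
def categorize(scores, malicious_at=0.8, benign_at=0.2):
    """Map per-classifier scores in [0, 1] to one of three labels."""
    if not scores:
        raise ValueError("at least one classifier score is required")
    combined = sum(scores.values()) / len(scores)
    if combined >= malicious_at:
        return "malicious"
    if combined <= benign_at:
        return "benign"
    # Everything between the two cut-offs needs further validation.
    return "suspicious"

if __name__ == "__main__":
    print(categorize({"spatial": 0.9, "temporal": 0.95, "entropy": 0.85}))
    print(categorize({"spatial": 0.9, "temporal": 0.2}))
```

Widening the gap between the two cut-offs sends more hosts to the suspicious bucket, trading analyst workload against the risk of blocking a benign host.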

When a host is categorized as "suspicious," there is a possibility of a false positive, which could result in employee downtime for customers of Internet security vendors. Therefore, continuous training and retraining of the machine-learning system is required to positively determine whether a host is malicious or benign.

Host Validation

The process of determining whether suspicious hosts are malicious or benign can be cost- and resource-prohibitive. To validate threats across the entire Internet would require an army of analysts. The good news is that there are thousands of potential analysts in the security community, including security-savvy customers. The bad news is that security vendors typically keep their threat intelligence to themselves and guard it as core intellectual property.

A different approach is to move from unidirectional relationships with customers to multidirectional communication and communities. Crowdsourcing threat intelligence requires an extension of trust to customers, partners and other members of a security vendor's ecosystem, so the vendor must provide dedicated support to train and certify the crowdsourced researchers.

However, the upside potential is significant. Given an anointed team of researchers across the globe, the reach and visibility into real-time threats will expand, along with the ability to quickly and accurately respond, minute by minute, day by day, to evolving threats.

As for tactical requirements, the community needs access to query tools similar to those used by the vendor's own expert researchers. A simplified interface would display threat predictions with all the relevant security information, related meta-scores and data visualizations, and allow the volunteer to confirm or reject a host as malicious.
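Behind such an interface, the confirm and reject decisions have to be combined somehow. One possible rule, sketched below, waits for a minimum number of reviewers and then requires a strong majority before changing a host's status. The function name resolve_votes, the quorum of three and the 75 percent agreement level are all hypothetical.

```python
from collections import Counter

def resolve_votes(votes, quorum=3, agreement=0.75):
    """Turn a list of "confirm"/"reject" votes into a host status."""
    tally = Counter(votes)
    total = tally["confirm"] + tally["reject"]
    if total < quorum:
        return "suspicious"  # not enough reviewers yet
    if tally["confirm"] / total >= agreement:
        return "malicious"
    if tally["reject"] / total >= agreement:
        return "benign"
    return "suspicious"  # reviewers disagree; keep the host under review

if __name__ == "__main__":
    print(resolve_votes(["confirm", "confirm", "confirm"]))
    print(resolve_votes(["confirm", "reject"]))
```

Confirmed and rejected hosts can then be added to the positive and negative training sets, closing the loop between the crowd and the machine-learning system.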

Applying Threat Intelligence

Threat intelligence derived from Big Data can prevent device infections, network breaches and data loss. As advanced threats continue to proliferate at an uncontrollable rate, it becomes vital that the security industry evolve to stay one step ahead of criminals.

The marriage of Big Data analytics, science and crowdsourcing is making it possible to achieve near real-time detection and even prediction of attacks. Big Data will continue to transform Internet security, and it's up to vendors to build products that effectively harness its power.

Dan Hubbard is a noted information security researcher and chief technology officer for OpenDNS, provider of the cloud-delivered Umbrella Web security service.
