Threat Intelligence Indicators are not Signatures

I recently participated in a Black Hat webcast with Bhaskar Karambelkar, sponsored by ThreatConnect. It was related to the Black Hat 2015 session called Data-Driven Threat Intelligence: Metrics On Indicator Dissemination And Sharing, which I had the pleasure of co-presenting with my good friend Alex Pinto.

At the end of the webcast, an audience member asked about a comment I had made: that threat intelligence indicators have multiple uses, but should not be used as signatures. He was a bit baffled by the idea, and I am sure he is not alone.

So let's focus on automating the use of simple network indicators (IP addresses, domain names and URLs, mostly) that most companies obtain from public or private threat intelligence feeds. Let me show why using them directly as signatures, such as automatically generating IDS signatures and/or SIEM rules to alert or block on direct matches, is very troublesome. Organizations that do that will, in my experience, most likely be flooded with false positives.

False positives, false positives everywhere!

Let me go over a (far from complete) list of reasons why.

Affirming the Consequent Fallacy

In order to reliably generate an alert based on network traffic, you need to identify a situation in which the probability of that traffic being malicious is reasonably high. How high it needs to be depends on your organization's tolerance for false positives. In other words, you need to satisfy the requirement P(malicious | traffic) > threshold.
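To make that concrete, here is a minimal back-of-the-envelope sketch applying Bayes' theorem. All three rates below are made-up assumptions for illustration, not measurements from any real feed or network:

```python
# Back-of-the-envelope Bayes' theorem calculation with made-up numbers,
# showing how the base rate dominates P(malicious | match).

def posterior(p_match_given_malicious, p_match_given_benign, base_rate):
    """Return P(malicious | match) given the two match rates and the
    prior probability that any given traffic event is malicious."""
    p_benign = 1.0 - base_rate
    p_match = (p_match_given_malicious * base_rate +
               p_match_given_benign * p_benign)
    return p_match_given_malicious * base_rate / p_match

# Assumptions: 1 in 100,000 events is malicious, the indicator matches 90%
# of malicious traffic, and 0.1% of benign traffic happens to look the same.
print(posterior(0.9, 0.001, 1e-5))  # ~0.0089 -> under 1% of alerts are real
```

Even with an indicator that catches 90% of the malicious traffic and matches only 0.1% of the benign traffic, fewer than 1% of the resulting alerts would be true positives at that base rate.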

Threat intelligence network indicators are data points that say we observed a threat actor, malware or tool A generate traffic with characteristics M, N and O to the external locations X, Y and Z. Notice how that alone does not equal any of the following claims:

  1. Most or all of the traffic with characteristics M, N or O to destinations X, Y, Z is caused by A;

  2. A always causes traffic with characteristics M, N or O to destinations X, Y, Z.

Do you notice the mismatch? Most people will erroneously equate the claim "attackers do X" with "I can safely alert when X occurs in my environment". It's related to the affirming-the-consequent fallacy, but it's even more striking here, because the feeds are not even making claim 2 above, which would be required for the classic form of the fallacy.

To give you an example, you could find a perfectly valid indicator saying a piece of malware uses something like a public API from Google, Dropbox or anyone else just to verify whether it can connect to the Internet. Or it could use some publicly available service to identify which public IP address it is reaching the Internet from, and its geolocation, in particular in the case of targeted attacks.

This sort of indicator can still be really useful if you are doing DFIR or hunting, as it allows you to narrow down which machines on the network may be compromised, or tells you which forensic data to investigate first. But it should be obvious by now that generating an IDS or SIEM alert for every machine on your network that behaves in a similar manner would be a really bad idea.
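As an illustration of that distinction, here is a rough sketch of a hunting-oriented use: rather than alerting on every match, rank internal hosts by how many distinct indicators they contacted so an analyst knows where to look first. The log file name, its columns and the sample indicator values are all assumptions made for the example:

```python
# Rough hunting/triage sketch: rank internal hosts by how many distinct
# indicators they contacted, instead of alerting on every single match.
# File name, column names and indicator values are assumptions.
import csv
from collections import defaultdict

indicators = {"api.example-c2.com", "198.51.100.7"}  # hypothetical feed data

hits = defaultdict(set)
with open("proxy_log.csv", newline="") as f:  # assumed: src_ip,dest columns
    for row in csv.DictReader(f):
        if row["dest"] in indicators:
            hits[row["src_ip"]].add(row["dest"])

# Hosts that touched the most distinct indicators go to the top of the
# investigation queue; nothing here pages anyone at 3 AM.
for host, matched in sorted(hits.items(), key=lambda kv: len(kv[1]), reverse=True):
    print(host, sorted(matched))
```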

What should I match it against?

When you get an intelligence feed, it might contain indicators of several different kinds of malicious behavior. Paid feeds in particular will contain a mix of human-readable context in the form of a report, along with the machine-readable indicators associated with each report.

The problem is that it can be very hard to automatically determine the context in which each technical indicator applies. In the case of IP addresses, for example, the machine-readable data very rarely allows you to unambiguously determine something as simple as whether it is associated with inbound traffic, outbound traffic or both.

In case you are not familiar with the terminology, the definition of inbound and outbound I'm referring to is the one used in combine and tiq-test. Keeping your organization as the point of reference, inbound indicators would refer to traffic originating from the open Internet towards your organization's public assets: port scanning, credential brute forcing, automated or manual exploitation of Internet-facing services, etc. Outbound indicators, on the other hand, would be associated with traffic originating from inside your organization's network towards the open Internet, such as data exfiltration, C&C traffic, downloading of malware or client-side exploits.

So knowing which traffic direction each indicator applies to would be the most basic way to reduce false positives, and yet that information is often not available in an automated fashion.
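Here is a minimal sketch of what direction-aware matching could look like, assuming you have enriched each indicator with a direction field yourself (since, as noted, feeds rarely provide one) and that your log sources are tagged by the direction of traffic they observe:

```python
# Sketch of direction-aware matching. The "direction" field is precisely
# the metadata that feeds rarely provide; here we assume it was added
# during our own enrichment step.
INDICATORS = [
    {"value": "203.0.113.9",  "direction": "inbound"},   # scans our perimeter
    {"value": "198.51.100.7", "direction": "outbound"},  # C&C destination
]

# Hypothetical mapping of log sources to the traffic direction they observe.
LOG_SOURCE_DIRECTION = {
    "perimeter_ids": "inbound",
    "waf": "inbound",
    "proxy": "outbound",
    "dns": "outbound",
}

def applicable(indicator, log_source):
    # Only match an indicator against logs that record its direction.
    return LOG_SOURCE_DIRECTION.get(log_source) == indicator["direction"]

print([i["value"] for i in INDICATORS if applicable(i, "proxy")])
# -> ['198.51.100.7']: the inbound scanner IP never shows up in proxy logs.
```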

"Helpful" Feed Providers

Imagine an analyst reverses a new malware or RAT sample and identifies that it uses a particular URL to talk back to its creator. He will, of course, include an indicator for that URL in the report. However, sometimes the feed provider will go one step further, think "but what about people who want to match this in netflow, firewall or DNS logs?", and do you the favor of also generating indicators for:

  1. The hostname of the URL, so you can match this on DNS;

  2. The IP addresses that hostname resolved to at the time of the analysis.

This creates all sorts of problems and piles onto the false positives.

First, extracting the hostname is not always appropriate. Attackers might control the entire content served under that hostname (think DGA domains, or a single compromised web server that the attacker fully controls). However, it might also be a portal with completely independent sub-sites hosted under different paths, sharing no infrastructure except for a load balancer that routes requests appropriately. Or it might be something like a Google Drive or Dropbox link, or a URL shortener. So considering the entire domain compromised or malicious because of a few URLs within it is often a step too far.
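As a sketch of a mitigation, one could refuse to derive hostname indicators from URLs whose host is a known multi-tenant service. The allowlist below is a tiny hand-picked assumption; a real deployment would need a far more comprehensive list covering CDNs, hosting portals, shorteners and the like:

```python
# Only derive a hostname indicator from a URL when the host is not a known
# multi-tenant service. The allowlist is a tiny hand-picked assumption; a
# real one would need to cover CDNs, hosting portals, shorteners and more.
from urllib.parse import urlsplit

SHARED_HOSTS = {"drive.google.com", "www.dropbox.com", "bit.ly", "t.co"}

def derive_host_indicator(url):
    host = urlsplit(url).hostname
    if host is None or host.lower() in SHARED_HOSTS:
        return None  # keep only the URL indicator; the host proves nothing
    return host

print(derive_host_indicator("https://bit.ly/3abcde"))          # None
print(derive_host_indicator("http://evil-dga.example/a.php"))  # evil-dga.example
```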

It's even worse when resolving domain names to IP addresses, even for domains completely dedicated to malicious purposes. First, we know that miscreants can and will frequently switch the IP addresses a domain resolves to, so the IPs you receive will most likely be outdated by the time you get to use them. Second, it's not uncommon for malicious domains to use a service like CloudFlare or Incapsula, or a shared hosting provider, so resolving the domain gives you an IP address that is possibly shared with hundreds of benign websites. Or the domain could be temporarily parked at a benign IP address such as 8.8.8.8. Again, knowing that a domain is malicious does not necessarily mean that traffic to the IPs it resolves to is mostly or completely malicious as well.
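One partial sanity check, sketched below under the assumption that the feed entry carries both the domain and the IP it resolved to at analysis time, is to resolve the domain again before acting on the IP and treat a mismatch as a sign the derived indicator is stale:

```python
# Before acting on an IP a feed derived from a domain, resolve the domain
# again: if the IP no longer matches, the derived indicator is stale.
# The domain/IP pair below stands in for a hypothetical feed entry.
import socket

def still_resolves_to(domain, feed_ip):
    try:
        current_ips = {info[4][0] for info in socket.getaddrinfo(domain, None)}
    except socket.gaierror:
        return False  # domain no longer resolves; the IP is certainly stale
    return feed_ip in current_ips

print(still_resolves_to("example.com", "198.51.100.7"))
```

This obviously does nothing about the shared-hosting and CDN problem, but it at least catches the most common failure mode of feed-time resolution: IPs that have simply moved on.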

Feed providers, take note: this extra information would be much more helpful if it were possible to distinguish the principled indicators (the ones directly observed) from the derived ones. But alas, that distinction is often not present in the machine-readable data.
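For illustration, here is one hypothetical shape such provenance metadata could take. The field names are invented for this sketch and do not correspond to any existing feed format:

```python
# Hypothetical provenance metadata; these field names are invented and do
# not correspond to any existing feed format.
entries = [
    {"value": "http://evil.example/payload.bin", "type": "url",
     "derived_from": None},      # principled: observed directly
    {"value": "evil.example", "type": "domain",
     "derived_from": "url"},     # derived by the provider
    {"value": "203.0.113.45", "type": "ipv4",
     "derived_from": "domain"},  # derived again, twice removed
]

# A consumer could then auto-alert only on principled indicators and route
# the derived ones to hunting or enrichment pipelines instead.
auto_alert = [e["value"] for e in entries if e["derived_from"] is None]
print(auto_alert)  # -> ['http://evil.example/payload.bin']
```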

Conclusion

I hope this article helps people realize that if organizations decide to use threat intelligence indicators for detection, they need to put proper processes and/or automation in place to overcome the problems identified above. Threat intelligence indicators are really valuable allies in information security monitoring and DFIR initiatives, but they are not signatures and should be handled with appropriate care.
