Honeypot Location Diversity and Data Quality

GreyNoise shows how they go about provisioning honeypot sensors to achieve a view of global internet traffic.
honeypots
data-science
Author

Daniel Grant

Published

October 6, 2023

Honeypot Collection

Imagine you had one honeypot. You’re sitting on your IP address and collecting traffic as it comes. Perhaps Shodan.io hits you in their global scans. Perhaps someone Nmaps you. You might get lucky and get traffic targeted to your country or ASN. The problem is that you’re trying to look at a galaxy through a pinhole telescope. There is a ton of background traffic bouncing around the internet every day, eating up our mental bandwidth and actual bandwidth. To filter out that traffic, you have to have a global view of the internet.

Additionally, if you’re running your honeypot in the same location/network as your real traffic, you will get a biased view and not be able to discern what is noise and what is a bit more targeted. It will represent a weird combination of targeted-but-unsophisticated attackers, regular business users, stale resource pointers (like old DNS, dead links), macro-targeted attackers, bug bounty hunters, and pentesters.

To get as unbiased a view as possible of the internet and what is bouncing around it, a good technique is to stand up your honeypots in Cloud Service Providers, since it is much harder to discover and target the specific IP ranges that belong to one user of that provider.

Provider Scaling

Now you have a honeypot not broadcasting anything and just sitting there collecting traffic on an IP address in a provider. The next step might be to set up more honeypots to get more traffic. This is a great idea!

In practice, we’ve found that honeypots that are located in IP space near each other still provide value but have diminishing returns. For example, we host honeypots in AWS, Azure, and Digital Ocean (along with dozens of other providers). We can calculate how many IPs we see hitting each of our honeypots, and with a little math, figure out how many extra IPs we see for each additional honeypot we stand up.

Figure 1: Marginal new IPs per additional sensor in AWS

Figure 2: Marginal new IPs per additional sensor in Digital Ocean

As you can see, the number of new IPs drops off dramatically with each additional sensor and then settles into a roughly steady-state marginal utility.
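The "little math" behind those curves can be sketched roughly as follows. This is a minimal illustration, not our production pipeline: the function name `marginal_new_ips` and the toy sensor data are made up for the example, and each sensor is assumed to be represented simply as the set of source IPs it has observed.

```python
from typing import List, Set


def marginal_new_ips(sensor_ips: List[Set[str]]) -> List[int]:
    """For each sensor, in order, count the source IPs it contributes
    that no earlier sensor in the list had already seen."""
    seen: Set[str] = set()
    gains: List[int] = []
    for ips in sensor_ips:
        new = ips - seen          # IPs this sensor adds to the running total
        gains.append(len(new))
        seen |= new
    return gains


# Hypothetical toy data: three sensors in the same provider.
sensors = [
    {"198.51.100.1", "198.51.100.2", "203.0.113.5"},
    {"198.51.100.1", "203.0.113.5", "203.0.113.9"},
    {"198.51.100.2", "203.0.113.9"},
]

print(marginal_new_ips(sensors))  # [3, 1, 0]
```

Note that the result depends on the order in which sensors are added, so in practice you would probably want to average the gains over several shuffled orderings before plotting them.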

Provider Diversity

You have more traffic and more data now, congrats! However, you’re still restricted to the IP space assigned by a single provider. Sophisticated attackers could identify that and ignore the provider in their scans. All bad guys have to do is realize one time that this handful of CIDR blocks is being used for internet background noise collection. They can then freely blacklist those IPs in their Internet-wide scans and attacks (completely blinding your network) or specifically target those CIDR blocks (fooling your network into thinking that the Internet is exploding with activity that nobody else sees, and eroding customer trust in your network). Additionally, traffic that is more targeted, or that does not scan every IP on the internet, might be completely missed by your honeypots because you aren’t where the traffic is going.

We’ve been hosting honeypots in many providers and many regions around the world, and with that history of data, we can calculate how much traffic we see from individual providers and, more importantly, how much traffic we would miss if we were only in a single provider.

Provider         % of Total Unique IPv4s Seen
GCP              20.2%
Digital Ocean    18.1%
AWS              15.8%
Azure            9.6%

This shows that if we limited ourselves to a single provider, we would miss roughly 80-90% of the unique IPs seen by our distributed network.
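A per-provider share like the one above can be computed along these lines. Treat it as a sketch under stated assumptions: the provider names and IPs are placeholder data, `provider_coverage` is a hypothetical helper, and the input is assumed to be a mapping from provider to the set of unique source IPv4s seen by all sensors in that provider.

```python
from typing import Dict, Set


def provider_coverage(ips_by_provider: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of all unique source IPs (across every provider)
    that each individual provider saw on its own."""
    all_ips: Set[str] = set().union(*ips_by_provider.values())
    if not all_ips:
        return {}
    return {
        provider: len(ips) / len(all_ips)
        for provider, ips in ips_by_provider.items()
    }


# Hypothetical toy data: 5 unique IPs across three providers.
observed = {
    "provider_a": {"192.0.2.1", "192.0.2.2"},
    "provider_b": {"192.0.2.2", "192.0.2.3", "192.0.2.4"},
    "provider_c": {"192.0.2.5"},
}

coverage = provider_coverage(observed)
for provider, share in coverage.items():
    # 1 - share is what you would miss by hosting only in that provider.
    print(f"{provider}: saw {share:.0%}, would miss {1 - share:.0%}")
```

Because providers overlap in what they see, the shares do not need to add up to 100%. The quantity of interest is one minus each share: for example, a provider that sees 20.2% of unique IPs on its own leaves 79.8% unseen.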

Your data - our data = specific insights

We’ve now established a solid way to create a largely unbiased and distributed view of internet background traffic. It’s a lot, but luckily you don’t have to build this yourself, since we already did it for you! A lot of the value GreyNoise provides is letting you see if something hitting your network is “just me” or if it’s commonplace so you can better figure out the threat posed.

That check has largely been a manual or API-driven workflow that you have to integrate into. But we’re in the early stages of letting you host your own GreyNoise sensors on your own network, with more to come! Combining our globally collected data and your pinhole view of the internet, we can better figure out what you’re seeing that no one else is and what is targeting your geolocation, industry vertical, technologies, etc., and help you eliminate the noise and focus on the things that matter to your security.
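At its simplest, the "your data - our data" idea is a set difference. The sketch below only illustrates that idea: it does not call the GreyNoise API, `local_only` is a hypothetical helper, and both inputs are assumed to be plain sets of source IPs you have already collected.

```python
from typing import Set


def local_only(local_ips: Set[str], global_noise_ips: Set[str]) -> Set[str]:
    """Source IPs seen on your own sensor that do not appear in the
    globally observed background noise."""
    return local_ips - global_noise_ips


# Hypothetical toy data.
seen_on_my_network = {"198.51.100.7", "203.0.113.20", "203.0.113.99"}
seen_globally = {"198.51.100.7", "203.0.113.20", "192.0.2.44"}

print(sorted(local_only(seen_on_my_network, seen_globally)))
# ['203.0.113.99']
```

Whatever is left after the subtraction is the traffic worth a closer look, since it is hitting you but not the wider internet.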