Introduction

In the wake of recent news about Israel being targeted by an ICS/SCADA-specific cyber attack, the team at GreyNoise found itself in a unique position: we already had sensors in the region collecting honeypot data. That existing network gave us a chance to evaluate the unsolicited broadcast traffic, also known as Internet Background Noise, and to gain insight into the nature of the traffic targeting the region.

However, we faced a challenge: we didn’t have a framework in place that was focused on ICS/SCADA-specific protocols. This made it difficult to quickly evaluate the GreyNoise data and determine what was targeting nodes in that geographic area specifically. We were also limited in how rapidly we could change the infrastructure in the region to put more traditional ICS/SCADA-focused tooling in place.

To overcome this, we extended an internal research tool called Precursor. The tool was already under development to label arbitrary input data based on PCRE2 matches, using the name of each pattern’s capture group as the label. This let us filter the inputs before running an O(n^2) pairwise comparison with locality sensitive hashing, so we could find related payloads with a clustering technique and work with data in a variety of shapes.

Precursor

Precursor is a regex (PCRE2) and locality sensitive hashing (TLSH) tool for labeling and finding similarities between text, hex, or base64 encoded data. It is written in Rust for efficient and fast processing. It’s intended to be general purpose in order to support analysis of network packet data, firmware, text, etc.

Process

We used an existing GreyNoise Labs command in the greynoiselabs CLI to pull down the aggregate distinct TCP/UDP payloads and the metadata about them. This gave us information about how pervasive a payload was and what source IPs sent the payload, without overwhelming Precursor with tons of duplicate payloads. You can also obtain this information by installing greynoiselabs and running greynoiselabs payloads --protocol TCP or greynoiselabs payloads --protocol UDP.

Initially, we considered using ripgrep, a fantastic tool for searching through large volumes of data. However, we realized that the PCRE2 capture group label injection we needed was beyond the scope of ripgrep, and that adding it would limit the PCRE2 formats available in cases where you didn’t want label injection. So, we decided to build a CLI around rust-pcre2, which is maintained by BurntSushi and is also used by the PCRE2 mode of ripgrep. By combining the power of the PCRE2 engine, the JSON parsing of JAQ (a Rust port of the popular jq JSON parser), and the similarity engine of locality sensitive hashing with TLSH, we were able to slice and dice arbitrary payload data very quickly to uncover clusters of interesting payloads.

The PCRE2 pattern file we used was a rough concept for paring down protocols based on PCRE2 matches. It likely produced some false positive hits, but it was a starting point for our analysis.

Here is a small snippet of the PCRE2 pattern file so you can see the approach for labeling.

(?s)(?<bacnet>.*\x81\x0a\x00\x11\x01\x00.*)
(?s)(?<bacnet>.*\x81\x0a\x00\x11\x01\x04.*)
(?s)(?<bacnet>.*\x81\x0b\x00\x12\x01\x05.*)
(?s)(?<ethernetip>.*\x05\x64\x00\x00\x00\x05\xff\x2b\x0e\x03\x00.*)
(?s)(?<ethernetip>.*\x21\x00\x00\x00\x00\x06\x01\x04\x00\x01\x00\x00.*)
(?s)(?<fox>.*\x0afox.version=.*)
(?s)(?<fox>.*\x66\x6f\x78\x20\x61\x20\x31\x20\x2d\x31\x20\x66\x6f\x78\x20\x68\x65\x6c\x6c\x6f\x0a\x7b\x0a.*)
(?s)(?<modbus>.*\x00\x00\x00\x00\x00\x02\x01\x11.*)
(?s)(?<modbus>.*\x00\x00\x00\x00\x00\x05\x01\x2b\x0e\x01\x00.*)
(?s)(?<modbus>.*\x00\x00\x00\x00\x00\x06\x01\x03\x00\x00\x00\x01.*)
(?s)(?<modbus>.*\x00\x00\x00\x00\x00\x06\x01\x03\x00\x01\x00\x01.*)
(?s)(?<modbus>.*\x21\x00\x00\x00\x00\x06\x01\x04\x00\x01\x00\x00.*)
(?s)(?<modbus>.*\x44\x62\x00\x00\x00\x05\xff\x2b\x0e\x03\x00.*)
(?s)(?<modbus>.*\x5a\x47\x00\x00\x00\x05\x00\x2b\x0e\x01\x00.*)
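As a rough illustration of the labeling idea, the following Python sketch applies two of the patterns above to raw bytes and collects the names of the capture groups that match. It uses Python’s built-in re module as a stand-in for PCRE2 (an assumption for illustration; Precursor itself is built on rust-pcre2), so the named groups are rewritten from (?<name>...) to (?P<name>...). The function name label_payload and the sample payloads are hypothetical.

```python
import re

# Two patterns from the snippet above, rewritten for Python's `re` module,
# which spells named groups (?P<name>...) rather than PCRE2's (?<name>...).
# `re` is only a stand-in here; it is not the engine Precursor uses.
PATTERNS = [
    rb"(?s)(?P<modbus>.*\x00\x00\x00\x00\x00\x06\x01\x03\x00\x00\x00\x01.*)",
    rb"(?s)(?P<fox>.*\x0afox.version=.*)",
]
COMPILED = [re.compile(p) for p in PATTERNS]


def label_payload(payload: bytes) -> list[str]:
    """Return the sorted names of every capture group that matched the payload."""
    labels = set()
    for pattern in COMPILED:
        match = pattern.search(payload)
        if match:
            # groupdict() maps each named group to its captured bytes (or None).
            labels.update(
                name for name, value in match.groupdict().items() if value is not None
            )
    return sorted(labels)


# Hypothetical sample payloads, for illustration only.
modbus_read = bytes.fromhex("000000000006010300000001")
print(label_payload(modbus_read))            # ['modbus']
print(label_payload(b"GET / HTTP/1.1\r\n"))  # []
```

Because each pattern carries its own group name, adding a protocol only means appending another line to the pattern list.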

We’ve been able to toss together C2 detections and other mechanisms with this same approach. For example, here is a conversion of Florian Roth’s Sigma rule for suspicious Linux commands.

(?s)(?<wget>wget \S+ - http\S+ \| sh)
(?s)(?<wget>wget \S+ - http\S+ \| bash)
(?s)(?<simple_server>python -m SimpleHTTPServer)
(?s)(?<http_server>-m http.server)
(?s)(?<pty_spawn>import pty; pty.spawn\S*)
(?s)(?<socat>socat exec:\S*)
(?s)(?<socat>socat -O \/tmp\/\S*)
(?s)(?<socat>socat tcp-connect\S*)

It is worth mentioning that a number of industry-standard tools can perform protocol parsing, network packet introspection, and analysis much faster than Precursor. These tools also have very robust rule syntax and management frameworks. However, given the time constraint and the unknown unknowns, it was nice to use a generic regex/PCRE2 approach to start loosely labeling this data in a way that wasn’t constrained by built-in protocol parsers. Loose labeling also helps in cases where an exploit is not properly handled by a protocol parser because it leverages some weakness in the protocol implementation or in the parsers themselves.

How it works

Precursor uses PCRE2 patterns to match payloads and then calculates a TLSH hash for each payload. It then performs TLSH distance calculations between every pair of inputs that made it through the PCRE2 filter. This is an expensive O(n^2) operation and can consume significant amounts of memory. You can optimize it by using appropriate PCRE2 pre-filters and by choosing a TLSH algorithm better suited to the inputs.
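To make the cost of comparing every input against every other input concrete, here is a minimal Python sketch of an all-pairs comparison. The toy_distance function is a hypothetical stand-in and is not TLSH; it only plays the role of "smaller number means more similar". Treating a pair as a match when its distance is at or below the threshold is also an assumption, not Precursor’s documented behavior, and the input counts are illustrative.

```python
from itertools import combinations


def toy_distance(a: bytes, b: bytes) -> int:
    """Stand-in for a TLSH diff (NOT TLSH): differing positions plus length gap."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def similar_pairs(payloads: list[bytes], threshold: int) -> list[tuple[int, int, int]]:
    """Compare every unordered pair and keep (i, j, distance) within the threshold."""
    hits = []
    for (i, a), (j, b) in combinations(enumerate(payloads), 2):
        distance = toy_distance(a, b)
        if distance <= threshold:
            hits.append((i, j, distance))
    return hits


def pair_count(n: int) -> int:
    """Number of comparisons needed for n inputs: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical payloads: two near-identical Modbus-style requests and one outlier.
payloads = [
    b"\x00\x00\x00\x00\x00\x06\x01\x03\x00\x00\x00\x01",
    b"\x00\x00\x00\x00\x00\x06\x01\x03\x00\x01\x00\x01",
    b"GET / HTTP/1.1",
]
print(similar_pairs(payloads, threshold=2))  # [(0, 1, 1)]

# Pre-filtering pays off quickly because the pair count grows quadratically.
print(pair_count(500_000), pair_count(250))  # 124999750000 31125
```

The last line shows why the pre-filter matters: narrowing half a million inputs down to a few hundred candidates shrinks the number of distance calculations by several orders of magnitude.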

Precursor currently has a mix of different modes, but the -j option supports parsing input JSON to extract the payload, which in our case was base64 encoded. In this mode Precursor is able to extend the JSON input. With this approach we can quickly inject the PCRE2 capture group names as an array of “tags”, in addition to the calculated similarity hash and distance matches. This is important because a single payload could contain data that matches across multiple PCRE2 patterns.
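The JSON extension step can be sketched roughly as follows. This is an illustration in Python rather than Precursor’s actual implementation: Python’s re module stands in for PCRE2, the tag_record function is hypothetical, and the record carries only a payload_b64 key, whereas real records include more metadata. The sample payload is built so that it matches two patterns and therefore receives two tags.

```python
import base64
import json
import re

# Hypothetical pattern set; Python's `re` stands in for PCRE2 here, so the
# named groups use (?P<name>...) instead of (?<name>...).
PATTERNS = [
    re.compile(rb"(?s)(?P<modbus>.*\x00\x00\x00\x00\x00\x06\x01\x03\x00\x00\x00\x01.*)"),
    re.compile(rb"(?s)(?P<bacnet>.*\x81\x0a\x00\x11\x01\x04.*)"),
]


def tag_record(line: str, key: str = "payload_b64") -> dict:
    """Decode the base64 payload in one JSON record and add a 'tags' array."""
    record = json.loads(line)
    payload = base64.b64decode(record[key])
    tags = []
    for pattern in PATTERNS:
        match = pattern.search(payload)
        if match:
            for name, value in match.groupdict().items():
                if value is not None and name not in tags:
                    tags.append(name)
    # Extend the original record instead of replacing it.
    record["tags"] = tags
    return record


# One hypothetical input record whose payload matches both patterns.
payload = bytes.fromhex("000000000006010300000001") + b"\x81\x0a\x00\x11\x01\x04"
line = json.dumps({"payload_b64": base64.b64encode(payload).decode("ascii")})
print(json.dumps(tag_record(line)))
```

Keeping the original keys intact means the tagged records can still be joined back to whatever metadata accompanied the payload.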

Here is the current CLI usage just to show some of the options used with the tool.

Precursor is a regex (PCRE2) and locality sensitive hasing (TLSH) tool for labeling and finding similairites between text, hex, or base64 encoded data.

Usage: precursor [OPTIONS] [pattern]

Arguments:
  [pattern]  Specify the PCRE2 pattern to be used, it must contain a single named capture group.

Options:
  -f, --input-folder <input-folder>
          Specify the path to the input folder.
  -z, --input-blob
          Process input as single blob instead of splitting on newlines.
  -p, --pattern-file <pattern-file>
          Specify the path to the file containing PCRE2 patterns, one per line, each must contain a single named capture group.
  -t, --tlsh
          Calculate payload tlsh hash of the input payloads.
  -a, --tlsh-algorithm <tlsh-algorithm>
          Specify the TLSH algorithm to use. The algorithms specify the bucket size in bytes and the checksum length in bits. [default: 48_1] [possible values: 128_1, 128_3, 256_1, 256_3, 48_1]
  -d, --tlsh-diff
          Perform TLSH distance calculations between every line of input provided. This is an expensive O(2^n) operation and can consume significant amounts of memory. You can optimize this by using appropriate PCRE2 pre-filters and chosing a smaller TLSH algorithm.
  -y, --tlsh-sim-only
          Only output JSON for the payloads containing TLSH similarities.
  -x, --tlsh-distance <tlsh-distance>
          Specify the TLSH distance threshold for a match. [default: 100]
  -l, --tlsh-length
          This uses a TLSH algorithm that considered the payload length.
  -m, --input-mode <input-mode>
          Specify the payload mode as base64, string, or hex for stdin. [default: base64] [possible values: base64, string, hex]
  -j, --input-json-key <input-json-key>
          Specify the JQ-like pattern for parsing the input from the JSON input.
  -s, --stats
          Output statistics report.
  -h, --help
          Print help (see more with '--help')

Findings

This blog post comes after only a very preliminary investigation of this data. We wanted others to understand how we looked into the data, and to share the strategy that let us work quickly within the constraints of the data and tooling at our disposal.

Thus far we have been able to pivot off of the distinct payloads into raw GreyNoise sensor data and essentially filter for source_ips that sent these payloads to our sensors in Israel but nowhere else. The list here includes only the IPs that completed a three-way handshake, as the others may have been spoofed.
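The filtering logic described above can be sketched as a small set operation. This is a hypothetical illustration in Python: the record layout, the country codes, and the region_exclusive_ips function are assumptions and do not reflect the GreyNoise schema, and the addresses come from the reserved documentation ranges.

```python
from collections import defaultdict

# Hypothetical sensor records: (source_ip, sensor_country, completed_handshake).
# The layout and values are illustrative, not the GreyNoise schema.
records = [
    ("198.51.100.7", "IL", True),
    ("198.51.100.7", "IL", True),
    ("203.0.113.9", "IL", True),
    ("203.0.113.9", "US", True),   # also seen elsewhere, so it is excluded
    ("192.0.2.44", "IL", False),   # no completed handshake, possibly spoofed
]


def region_exclusive_ips(records, region="IL"):
    """Return IPs that completed a handshake and were seen only in `region`."""
    countries_by_ip = defaultdict(set)
    for ip, country, handshake in records:
        if handshake:
            countries_by_ip[ip].add(country)
    return sorted(
        ip for ip, countries in countries_by_ip.items() if countries == {region}
    )


print(region_exclusive_ips(records))  # ['198.51.100.7']
```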

Preliminary analysis of the payloads indicates that the majority of these IPs were performing traditional scanning and enumeration consistent with the built-in scripting available with Nmap, ZGrab2, or similar tools.

We’re continuing to monitor this as we expand the protocol coverage, and we will share any updated findings in a follow-up post. At this point, other than the IP addresses and other indicators found below, we don’t have any reliable signature mechanisms for these actors.

You can find out more about what GreyNoise has seen from these IPs with this GreyNoise query.

Additionally, we’ve output the raw data for UDP payloads as a GitHub Gist.

Here is an example of some of the statistics Precursor generates on each run, taken from a smaller subset of payloads. In this example, indicators for fourteen different ICS/SCADA protocols were matched. In the full Precursor output, the payload JSON is written to STDOUT and the statistics shown below are written to STDERR, making it easy to pass Precursor output on to other tooling that can ingest JSON.

{
  "---PRECURSOR_STATISTICS---": "",
  "Input": {
    "Count": 500000,
    "Unique": 499949,
    "AvgSize": "151",
    "MinSize": 0,
    "MaxSize": 768,
    "P95Size": 201,
    "TotalSize": "71MB"
  },
  "Match": {
    "Patterns": 93,
    "TotalMatches": 250,
    "Matches": [
      {
        "Name": "pcworx",
        "Matches": 1
      },
      {
        "Name": "moxa",
        "Matches": 1
      },
      {
        "Name": "omrontcp",
        "Matches": 2
      },
      {
        "Name": "fox",
        "Matches": 6
      },
      {
        "Name": "omronudp",
        "Matches": 1
      },
      {
        "Name": "modbus",
        "Matches": 2
      },
      {
        "Name": "cspv4",
        "Matches": 1
      },
      {
        "Name": "dnp3",
        "Matches": 8
      },
      {
        "Name": "codesys",
        "Matches": 2
      },
      {
        "Name": "melsecq",
        "Matches": 2
      },
      {
        "Name": "modicon",
        "Matches": 2
      },
      {
        "Name": "bacnet",
        "Matches": 219
      },
      {
        "Name": "proconos",
        "Matches": 1
      },
      {
        "Name": "enip",
        "Matches": 2
      }
    ],
    "HashesGenerated": 11,
    "AvgSize": "46",
    "MinSize": 7,
    "MaxSize": 768,
    "P95Size": 52,
    "TotalSize": "11KB"
  },
  "Compare": {
    "Similarities": 19,
    "AvgDistance": "113",
    "MinDistance": 0,
    "MaxDistance": 234,
    "P95Distance": 195
  },
  "Environment": {
    "Version": "0.1.0",
    "DurationSeconds": "15.17",
    "ProcessingRate": "4MB/s",
    "InputMode": "base64",
    "HashFunction": "48_1",
    "DistanceThreshold": 100,
    "DiffEnabled": true,
    "OnlyOutputSimilar": false,
    "LengthEnabled": false,
    "InputJSONKey": ".payload_b64"
  }
}

We would like to credit the following open-source projects for their contribution to the development of Precursor, and we plan to give them more formal recognition soon alongside an open-source release of the tool.

Future

We plan to release the Precursor tool in a follow-up post once we’ve had some time to add tests and improve error handling. Stay tuned for more updates on this exciting new tool and the insights it has helped us uncover.

At a very high level we have a few areas we’d like to add or improve with Precursor:

Error handling, tests, cross-platform packaging, benchmarking, and performance tuning.
Ability to support arbitrary similarity algorithms which generate a digest and support distance calculations (MRSHv2, SSDEEP, etc).
Generic similarity vector output in order to be ingested by a machine learning process.
Binary input mode for firmware/malware analysis.
Automate the conversion of protocol indicators from existing libraries into PCRE2 patterns where applicable.
Ability to support a training mode where it can automatically configure the optimal similarity algorithm and distance thresholds.