Consider the following: Each of the following lines is a base64 encoded string.

MzJjNVg0UlI0NWloTnFGYW9RcGkK
eUE3MmUwVFp3d3NqdUlRSEhnaE0K
RmFUM2ZWN0lGSU5ETUVQTFpvbjEK
eUQzaVo0ZENsMEVGSXBCV05DQ0kK

Your task, should you accept it: Identify which one contains the decoded string “FINDMEPLZ” without decoding the stream.

A common task in cybersecurity is decoding a base64 blob and checking if it’s contents contain anything “spicy”. For one-off tasks, this is simple enough. However, when building automated systems to identify base64 encoded content it can quickly incur unecessary computational overhead when following the steps:

Find a base64 encoded block of data in a larger stream
Decode the content
Search the content for a matching string

The following document aims to describe a shortcut that allows for directly searching for a base64 encoded value without needing to parse or decode it’s contents.

Base64 Primer

Base64 encoding takes a stream of bytes (8-bits) and encodes them using 64 different values. Typically encodings use [A-Za-z0-9] for the first 62 characters with the remaining 2 characters varying on implementation.

If a single byte (8-bits) is base64 encoded, each base64 character represents 6-bits of the original input. Using Least Common Mulitple (LCM) we can determine the number of bits needed for these two sliding windows of bits will align to represent all of the data as a multiple of 8-bits.

import base64
import math
from tabulate import tabulate
from IPython.display import Markdown


def lcm(a, b):
    return abs(a*b) // math.gcd(a, b)


x = lcm(8, 6)

print("lcm(8, 6) =", x)
print("Character alignment:", x / 6)

lcm(8, 6) = 24
Character alignment: 4.0

Above we can see that the length of a base64 encoded value should a multiple of 4. This behavior can be observed by padding characters (commonly =) which are used to round out input streams whose length does not match a specific alignment.

table = []
for i in range(1, 5):
    source = "A"*i
    enc = base64.b64encode(b"A"*i).decode("ascii")
    l = len(enc)
    table.append([source, enc, l])

Markdown(tabulate(
    table,
    headers=["Source", "Encoded", "Encoded Length"]
))

24-bit Padding Example
Source	Encoded	Encoded Length
A	QQ==	4
AA	QUE=	4
AAA	QUFB	4
AAAA	QUFBQQ==	8

This also demonstrates that for every 4 base64 characters, a maximum of 3-bytes of input can be wholly represented.

Patterns Emerge

Now that we’ve observed this behavior on a small scale, let’s scale this up a bit. We’ll slide a 6-byte block across a 24-byte input at first to avoid having to think about padding or alignment too much to start with.

blockInputLength = 24
windowLength = 6

patternTable = []
for i in range(6, 25):
    source = bytes("A"*(blockInputLength - i), "ascii")
    source += bytes("Z"*windowLength, "ascii")
    source += bytes("A"*(blockInputLength -
                    abs(i-blockInputLength)-windowLength), "ascii")
    enc = base64.b64encode(source).decode("ascii")
    patternTable.append([source, enc, len(enc)])

patternTable.reverse()

Markdown(tabulate(
    patternTable,
    headers=["Source", "Encoded", "Encoded Length"]
))

Block Sliding Example
Source	Encoded	Encoded Length
ZZZZZZAAAAAAAAAAAAAAAAAA	WlpaWlpaQUFBQUFBQUFBQUFBQUFBQUFB	32
AZZZZZZAAAAAAAAAAAAAAAAA	QVpaWlpaWkFBQUFBQUFBQUFBQUFBQUFB	32
AAZZZZZZAAAAAAAAAAAAAAAA	QUFaWlpaWlpBQUFBQUFBQUFBQUFBQUFB	32
AAAZZZZZZAAAAAAAAAAAAAAA	QUFBWlpaWlpaQUFBQUFBQUFBQUFBQUFB	32
AAAAZZZZZZAAAAAAAAAAAAAA	QUFBQVpaWlpaWkFBQUFBQUFBQUFBQUFB	32
AAAAAZZZZZZAAAAAAAAAAAAA	QUFBQUFaWlpaWlpBQUFBQUFBQUFBQUFB	32
AAAAAAZZZZZZAAAAAAAAAAAA	QUFBQUFBWlpaWlpaQUFBQUFBQUFBQUFB	32
AAAAAAAZZZZZZAAAAAAAAAAA	QUFBQUFBQVpaWlpaWkFBQUFBQUFBQUFB	32
AAAAAAAAZZZZZZAAAAAAAAAA	QUFBQUFBQUFaWlpaWlpBQUFBQUFBQUFB	32
AAAAAAAAAZZZZZZAAAAAAAAA	QUFBQUFBQUFBWlpaWlpaQUFBQUFBQUFB	32
AAAAAAAAAAZZZZZZAAAAAAAA	QUFBQUFBQUFBQVpaWlpaWkFBQUFBQUFB	32
AAAAAAAAAAAZZZZZZAAAAAAA	QUFBQUFBQUFBQUFaWlpaWlpBQUFBQUFB	32
AAAAAAAAAAAAZZZZZZAAAAAA	QUFBQUFBQUFBQUFBWlpaWlpaQUFBQUFB	32
AAAAAAAAAAAAAZZZZZZAAAAA	QUFBQUFBQUFBQUFBQVpaWlpaWkFBQUFB	32
AAAAAAAAAAAAAAZZZZZZAAAA	QUFBQUFBQUFBQUFBQUFaWlpaWlpBQUFB	32
AAAAAAAAAAAAAAAZZZZZZAAA	QUFBQUFBQUFBQUFBQUFBWlpaWlpaQUFB	32
AAAAAAAAAAAAAAAAZZZZZZAA	QUFBQUFBQUFBQUFBQUFBQVpaWlpaWkFB	32
AAAAAAAAAAAAAAAAAZZZZZZA	QUFBQUFBQUFBQUFBQUFBQUFaWlpaWlpB	32
AAAAAAAAAAAAAAAAAAZZZZZZ	QUFBQUFBQUFBQUFBQUFBQUFBWlpaWlpa	32

A keen eye will start to spot the pattern, but let’s employ some code to segment the above encoded values into 4 character blocks.

import re

blockInputLength = 24
windowLength = 6

samples = []
blockTable = []

for i in range(windowLength, 25):
    source = bytes("A"*(blockInputLength - i), "ascii")
    source += bytes("Z"*windowLength, "ascii")
    source += bytes("A"*(blockInputLength -
                    abs(i-blockInputLength)-windowLength), "ascii")
    enc = base64.b64encode(source).decode("ascii")
    samples.append(enc)

samples.reverse()

for sample in samples:
    #Grab blocks of 8 characters (2 chunks)
    blockTable.append(re.findall('....?', sample))

Markdown(tabulate(
    blockTable,
    headers=["0", "1", "2", "3", "4", "5", "6", "7"]
))

Block Sliding Visual Edits
0	1	2	3	4	5	6	7
Wlpa	Wlpa	QUFB	QUFB	QUFB	QUFB	QUFB	QUFB
QVpa	Wlpa	WkFB	QUFB	QUFB	QUFB	QUFB	QUFB
QUFa	Wlpa	WlpB	QUFB	QUFB	QUFB	QUFB	QUFB
QUFB	Wlpa	Wlpa	QUFB	QUFB	QUFB	QUFB	QUFB
QUFB	QVpa	Wlpa	WkFB	QUFB	QUFB	QUFB	QUFB
QUFB	QUFa	Wlpa	WlpB	QUFB	QUFB	QUFB	QUFB
QUFB	QUFB	Wlpa	Wlpa	QUFB	QUFB	QUFB	QUFB
QUFB	QUFB	QVpa	Wlpa	WkFB	QUFB	QUFB	QUFB
QUFB	QUFB	QUFa	Wlpa	WlpB	QUFB	QUFB	QUFB
QUFB	QUFB	QUFB	Wlpa	Wlpa	QUFB	QUFB	QUFB
QUFB	QUFB	QUFB	QVpa	Wlpa	WkFB	QUFB	QUFB
QUFB	QUFB	QUFB	QUFa	Wlpa	WlpB	QUFB	QUFB
QUFB	QUFB	QUFB	QUFB	Wlpa	Wlpa	QUFB	QUFB
QUFB	QUFB	QUFB	QUFB	QVpa	Wlpa	WkFB	QUFB
QUFB	QUFB	QUFB	QUFB	QUFa	Wlpa	WlpB	QUFB
QUFB	QUFB	QUFB	QUFB	QUFB	Wlpa	Wlpa	QUFB
QUFB	QUFB	QUFB	QUFB	QUFB	QVpa	Wlpa	WkFB
QUFB	QUFB	QUFB	QUFB	QUFB	QUFa	Wlpa	WlpB
QUFB	QUFB	QUFB	QUFB	QUFB	QUFB	Wlpa	Wlpa

Do you see the pattern emerging in the “Edits”?

Starting in the top-left and following the diagonal to the bottom-right. An initial conclusion may be that Wlpa is a viable value to check for the prescence of in order to determine if the decoded text would contain ZZZZZZ.

While that may hold true in the specifically crafted example above, let’s see what happens when we shift 8 Z’s across our samples.

import re

blockInputLength = 24
windowLength = 8

samples = []
blockTable = []

for i in range(windowLength, 25):
    source = bytes("A"*(blockInputLength - i), "ascii")
    source += bytes("Z"*windowLength, "ascii")
    source += bytes("A"*(blockInputLength -
                    abs(i-blockInputLength)-windowLength), "ascii")
    enc = base64.b64encode(source).decode("ascii")
    samples.append(enc)

samples.reverse()

for sample in samples:
    #Grab blocks of 8 characters (2 chunks)
    blockTable.append(re.findall('....?', sample))

Markdown(tabulate(
    blockTable,
    headers=["0", "1", "2", "3", "4", "5", "6", "7"]
))

Block Sliding Visual Unaligned
0	1	2	3	4	5	6	7
Wlpa	Wlpa	WlpB	QUFB	QUFB	QUFB	QUFB	QUFB
QVpa	Wlpa	Wlpa	QUFB	QUFB	QUFB	QUFB	QUFB
QUFa	Wlpa	Wlpa	WkFB	QUFB	QUFB	QUFB	QUFB
QUFB	Wlpa	Wlpa	WlpB	QUFB	QUFB	QUFB	QUFB
QUFB	QVpa	Wlpa	Wlpa	QUFB	QUFB	QUFB	QUFB
QUFB	QUFa	Wlpa	Wlpa	WkFB	QUFB	QUFB	QUFB
QUFB	QUFB	Wlpa	Wlpa	WlpB	QUFB	QUFB	QUFB
QUFB	QUFB	QVpa	Wlpa	Wlpa	QUFB	QUFB	QUFB
QUFB	QUFB	QUFa	Wlpa	Wlpa	WkFB	QUFB	QUFB
QUFB	QUFB	QUFB	Wlpa	Wlpa	WlpB	QUFB	QUFB
QUFB	QUFB	QUFB	QVpa	Wlpa	Wlpa	QUFB	QUFB
QUFB	QUFB	QUFB	QUFa	Wlpa	Wlpa	WkFB	QUFB
QUFB	QUFB	QUFB	QUFB	Wlpa	Wlpa	WlpB	QUFB
QUFB	QUFB	QUFB	QUFB	QVpa	Wlpa	Wlpa	QUFB
QUFB	QUFB	QUFB	QUFB	QUFa	Wlpa	Wlpa	WkFB
QUFB	QUFB	QUFB	QUFB	QUFB	Wlpa	Wlpa	WlpB
QUFB	QUFB	QUFB	QUFB	QUFB	QVpa	Wlpa	Wlpa