Base64 Tricks

A case study of searching for strings in base64, without decoding them
blueprint
Author

remy

Published

April 19, 2023

Consider the following: Each of the following lines is a base64 encoded string.

MzJjNVg0UlI0NWloTnFGYW9RcGkK
eUE3MmUwVFp3d3NqdUlRSEhnaE0K
RmFUM2ZWN0lGSU5ETUVQTFpvbjEK
eUQzaVo0ZENsMEVGSXBCV05DQ0kK

Your task, should you accept it: Identify which one contains the decoded string “FINDMEPLZ” without decoding the stream.

A common task in cybersecurity is decoding a base64 blob and checking if it’s contents contain anything “spicy”. For one-off tasks, this is simple enough. However, when building automated systems to identify base64 encoded content it can quickly incur unecessary computational overhead when following the steps:

  1. Find a base64 encoded block of data in a larger stream
  2. Decode the content
  3. Search the content for a matching string

The following document aims to describe a shortcut that allows for directly searching for a base64 encoded value without needing to parse or decode it’s contents.

Base64 Primer

Base64 encoding takes a stream of bytes (8-bits) and encodes them using 64 different values. Typically encodings use [A-Za-z0-9] for the first 62 characters with the remaining 2 characters varying on implementation.

If a single byte (8-bits) is base64 encoded, each base64 character represents 6-bits of the original input. Using Least Common Mulitple (LCM) we can determine the number of bits needed for these two sliding windows of bits will align to represent all of the data as a multiple of 8-bits.

import base64
import math
from tabulate import tabulate
from IPython.display import Markdown


def lcm(a, b):
    return abs(a*b) // math.gcd(a, b)


x = lcm(8, 6)

print("lcm(8, 6) =", x)
print("Character alignment:", x / 6)
lcm(8, 6) = 24
Character alignment: 4.0

Above we can see that the length of a base64 encoded value should a multiple of 4. This behavior can be observed by padding characters (commonly =) which are used to round out input streams whose length does not match a specific alignment.

table = []
for i in range(1, 5):
    source = "A"*i
    enc = base64.b64encode(b"A"*i).decode("ascii")
    l = len(enc)
    table.append([source, enc, l])

Markdown(tabulate(
    table,
    headers=["Source", "Encoded", "Encoded Length"]
))
24-bit Padding Example
Source Encoded Encoded Length
A QQ== 4
AA QUE= 4
AAA QUFB 4
AAAA QUFBQQ== 8

This also demonstrates that for every 4 base64 characters, a maximum of 3-bytes of input can be wholly represented.

Patterns Emerge

Now that we’ve observed this behavior on a small scale, let’s scale this up a bit. We’ll slide a 6-byte block across a 24-byte input at first to avoid having to think about padding or alignment too much to start with.

blockInputLength = 24
windowLength = 6

patternTable = []
for i in range(6, 25):
    source = bytes("A"*(blockInputLength - i), "ascii")
    source += bytes("Z"*windowLength, "ascii")
    source += bytes("A"*(blockInputLength -
                    abs(i-blockInputLength)-windowLength), "ascii")
    enc = base64.b64encode(source).decode("ascii")
    patternTable.append([source, enc, len(enc)])

patternTable.reverse()

Markdown(tabulate(
    patternTable,
    headers=["Source", "Encoded", "Encoded Length"]
))
Block Sliding Example
Source Encoded Encoded Length
ZZZZZZAAAAAAAAAAAAAAAAAA WlpaWlpaQUFBQUFBQUFBQUFBQUFBQUFB 32
AZZZZZZAAAAAAAAAAAAAAAAA QVpaWlpaWkFBQUFBQUFBQUFBQUFBQUFB 32
AAZZZZZZAAAAAAAAAAAAAAAA QUFaWlpaWlpBQUFBQUFBQUFBQUFBQUFB 32
AAAZZZZZZAAAAAAAAAAAAAAA QUFBWlpaWlpaQUFBQUFBQUFBQUFBQUFB 32
AAAAZZZZZZAAAAAAAAAAAAAA QUFBQVpaWlpaWkFBQUFBQUFBQUFBQUFB 32
AAAAAZZZZZZAAAAAAAAAAAAA QUFBQUFaWlpaWlpBQUFBQUFBQUFBQUFB 32
AAAAAAZZZZZZAAAAAAAAAAAA QUFBQUFBWlpaWlpaQUFBQUFBQUFBQUFB 32
AAAAAAAZZZZZZAAAAAAAAAAA QUFBQUFBQVpaWlpaWkFBQUFBQUFBQUFB 32
AAAAAAAAZZZZZZAAAAAAAAAA QUFBQUFBQUFaWlpaWlpBQUFBQUFBQUFB 32
AAAAAAAAAZZZZZZAAAAAAAAA QUFBQUFBQUFBWlpaWlpaQUFBQUFBQUFB 32
AAAAAAAAAAZZZZZZAAAAAAAA QUFBQUFBQUFBQVpaWlpaWkFBQUFBQUFB 32
AAAAAAAAAAAZZZZZZAAAAAAA QUFBQUFBQUFBQUFaWlpaWlpBQUFBQUFB 32
AAAAAAAAAAAAZZZZZZAAAAAA QUFBQUFBQUFBQUFBWlpaWlpaQUFBQUFB 32
AAAAAAAAAAAAAZZZZZZAAAAA QUFBQUFBQUFBQUFBQVpaWlpaWkFBQUFB 32
AAAAAAAAAAAAAAZZZZZZAAAA QUFBQUFBQUFBQUFBQUFaWlpaWlpBQUFB 32
AAAAAAAAAAAAAAAZZZZZZAAA QUFBQUFBQUFBQUFBQUFBWlpaWlpaQUFB 32
AAAAAAAAAAAAAAAAZZZZZZAA QUFBQUFBQUFBQUFBQUFBQVpaWlpaWkFB 32
AAAAAAAAAAAAAAAAAZZZZZZA QUFBQUFBQUFBQUFBQUFBQUFaWlpaWlpB 32
AAAAAAAAAAAAAAAAAAZZZZZZ QUFBQUFBQUFBQUFBQUFBQUFBWlpaWlpa 32

A keen eye will start to spot the pattern, but let’s employ some code to segment the above encoded values into 4 character blocks.

import re

blockInputLength = 24
windowLength = 6

samples = []
blockTable = []

for i in range(windowLength, 25):
    source = bytes("A"*(blockInputLength - i), "ascii")
    source += bytes("Z"*windowLength, "ascii")
    source += bytes("A"*(blockInputLength -
                    abs(i-blockInputLength)-windowLength), "ascii")
    enc = base64.b64encode(source).decode("ascii")
    samples.append(enc)

samples.reverse()

for sample in samples:
    #Grab blocks of 8 characters (2 chunks)
    blockTable.append(re.findall('....?', sample))

Markdown(tabulate(
    blockTable,
    headers=["0", "1", "2", "3", "4", "5", "6", "7"]
))
Block Sliding Visual Edits
0 1 2 3 4 5 6 7
Wlpa Wlpa QUFB QUFB QUFB QUFB QUFB QUFB
QVpa Wlpa WkFB QUFB QUFB QUFB QUFB QUFB
QUFa Wlpa WlpB QUFB QUFB QUFB QUFB QUFB
QUFB Wlpa Wlpa QUFB QUFB QUFB QUFB QUFB
QUFB QVpa Wlpa WkFB QUFB QUFB QUFB QUFB
QUFB QUFa Wlpa WlpB QUFB QUFB QUFB QUFB
QUFB QUFB Wlpa Wlpa QUFB QUFB QUFB QUFB
QUFB QUFB QVpa Wlpa WkFB QUFB QUFB QUFB
QUFB QUFB QUFa Wlpa WlpB QUFB QUFB QUFB
QUFB QUFB QUFB Wlpa Wlpa QUFB QUFB QUFB
QUFB QUFB QUFB QVpa Wlpa WkFB QUFB QUFB
QUFB QUFB QUFB QUFa Wlpa WlpB QUFB QUFB
QUFB QUFB QUFB QUFB Wlpa Wlpa QUFB QUFB
QUFB QUFB QUFB QUFB QVpa Wlpa WkFB QUFB
QUFB QUFB QUFB QUFB QUFa Wlpa WlpB QUFB
QUFB QUFB QUFB QUFB QUFB Wlpa Wlpa QUFB
QUFB QUFB QUFB QUFB QUFB QVpa Wlpa WkFB
QUFB QUFB QUFB QUFB QUFB QUFa Wlpa WlpB
QUFB QUFB QUFB QUFB QUFB QUFB Wlpa Wlpa

Do you see the pattern emerging in the “Edits”?

Starting in the top-left and following the diagonal to the bottom-right. An initial conclusion may be that Wlpa is a viable value to check for the prescence of in order to determine if the decoded text would contain ZZZZZZ.

While that may hold true in the specifically crafted example above, let’s see what happens when we shift 8 Z’s across our samples.

import re

blockInputLength = 24
windowLength = 8

samples = []
blockTable = []

for i in range(windowLength, 25):
    source = bytes("A"*(blockInputLength - i), "ascii")
    source += bytes("Z"*windowLength, "ascii")
    source += bytes("A"*(blockInputLength -
                    abs(i-blockInputLength)-windowLength), "ascii")
    enc = base64.b64encode(source).decode("ascii")
    samples.append(enc)

samples.reverse()

for sample in samples:
    #Grab blocks of 8 characters (2 chunks)
    blockTable.append(re.findall('....?', sample))

Markdown(tabulate(
    blockTable,
    headers=["0", "1", "2", "3", "4", "5", "6", "7"]
))
Block Sliding Visual Unaligned
0 1 2 3 4 5 6 7
Wlpa Wlpa WlpB QUFB QUFB QUFB QUFB QUFB
QVpa Wlpa Wlpa QUFB QUFB QUFB QUFB QUFB
QUFa Wlpa Wlpa WkFB QUFB QUFB QUFB QUFB
QUFB Wlpa Wlpa WlpB QUFB QUFB QUFB QUFB
QUFB QVpa Wlpa Wlpa QUFB QUFB QUFB QUFB
QUFB QUFa Wlpa Wlpa WkFB QUFB QUFB QUFB
QUFB QUFB Wlpa Wlpa WlpB QUFB QUFB QUFB
QUFB QUFB QVpa Wlpa Wlpa QUFB QUFB QUFB
QUFB QUFB QUFa Wlpa Wlpa WkFB QUFB QUFB
QUFB QUFB QUFB Wlpa Wlpa WlpB QUFB QUFB
QUFB QUFB QUFB QVpa Wlpa Wlpa QUFB QUFB
QUFB QUFB QUFB QUFa Wlpa Wlpa WkFB QUFB
QUFB QUFB QUFB QUFB Wlpa Wlpa WlpB QUFB
QUFB QUFB QUFB QUFB QVpa Wlpa Wlpa QUFB
QUFB QUFB QUFB QUFB QUFa Wlpa Wlpa WkFB
QUFB QUFB QUFB QUFB QUFB Wlpa Wlpa WlpB
QUFB QUFB QUFB QUFB QUFB QVpa Wlpa Wlpa