So back in December I read a blog from Akamai explaining the complexities of process injection on Linux. My first instinct reading it was, surely it’s not that hard! gdb can just call functions, why don’t we just do what gdb does??

So I thought, Duty Calls! I’m going to figure out how gdb does it and tell the author - Ori David - why they’re WRONG! There’s no more powerful motivating force for research than being sure somebody else is wrong!

I finally got around to experimenting with it last week, and I’m happy to say: they were actually correct about everything. But I learned a bunch along the way, and expanded on their work a bit, so I thought I’d share my perspective!

In this post, I’ll show you how I developed a tool to load an arbitrary shared library (.so) file into another process’s memory space. I should be very clear that this isn’t a security bypass of any kind - you have to have access to the system and permission to debug the process - it’s just an interesting way to tinker with your own software.

But why?

(Note that this section is about Windows, and the rest of this blog will be about Linux - this is just an origin story.)

One of the first things I ever learned in security (like EVER) was how to forcibly load a .dll file into a program’s memory space on Windows. Why, you ask? To cheat at videogames, of course!

I was in highschool and people I knew were doing these awesome hacks in Starcraft where they could have a custom HUD and flip bits in memory and even customize AI. I was never the type to ask for tools, but I did ask them how to do that sorta thing, and they said they injected their .dll into the process’s memory and hooked functions to go through their code before the real code, and from the process’s memory space you can call their functions to write to the screen (for example).

I thought that was the COOLEST and set out to learn how!

I picked up a book (I don’t currently remember the name, but hopefully by the time I publish this I’ll remember! - as I’m editing, I still can’t figure out the book and it’s driving me crazy - actually, see update at the bottom of this section!) and read it. They explained a bunch of different ways to do it, but the most straightforward was to use a fairly simple three-step process:

Use VirtualAllocEx() to allocate read/write memory in a foreign process
Use WriteProcessMemory() to write the full path of your .dll file to that memory
Use CreateRemoteThread() to start a new thread in the foreign process, with a starting point of LoadLibrary, and the first argument pointing to the allocated memory

That would effectively call LoadLibrary(<.dll file>) in the process’s context. When the .dll loads, its DllMain() function is called and it can do anything it wants in the context of the foreign process! This is used by antivirus and other tools.

I doubt it works anymore, but I actually wrote a tool to do this. I’ve also since published my old hacks, which haven’t worked in 20 years, but you can check out this one for some idea of what I was up to in the early 2000s.

This is all Windows, though, and I want to do the same thing on Linux!

Update: On my second edit pass, this was driving me crazy. And speaking of crazy, I have DM logs from the olden days when I used to work on this stuff, and I actually dug into ICQ logs from 2002, where I found some code I’d copied from the book:

Session Start (ICQ - 96228890:Elliott): Sat Apr 13 14:37:35 2002

[14:37:56] Ron: One problem with my program is that I only have access to windows xp computers so there’s no guarantee that it’ll even work on windows 98.. :-/

[14:38:16] Elliott: :(

Session Close (Elliott): Sat Apr 13 14:40:03 2002

Session Start (ICQ - 96228890:Elliott): Sat Apr 13 15:00:39 2002

[15:00:42] Ron:
#include "..\CmnHdr.h"     /* See Appendix A. */ 
#include <WindowsX.h> 
#include <tchar.h> 
#include <stdio.h> 
[...]

Which, thanks to Googling, was enough to find an archive of the book: Programming Applications in Microsoft Windows by Jeffrey Richter! I knew I was keeping those logs for a reason!

Okay but WHY??

Okay, enough driving down memory lane!

The point of process injection is that you can run your own custom code in the context of another process - that means you have access to the process’s virtual memory, file handles, and even binary code. From there, you can do a lot of tinkering with the process’s internal state, including redirecting calls (a la LD_PRELOAD, but more powerful! If you want to know about LD_PRELOAD techniques, I wrote about it here, among other places).

In the context of malware, you can use it to hide code in a running process (think Meterpreter).

In the context of cheating, you can modify how a game works in subtle ways (like I talked about above).

But in the context of reverse engineering, which is what I care about, you can do cool instrumentation stuff to a process to change how it works and perhaps test things. Maybe you can capture / modify decrypted traffic before it’s processed! The sky’s the limit!

Honestly, I don’t know if this is THAT useful, because you can do any of this on Linux with gdb or LD_PRELOAD, but I wanted to do it myself and now you’re going to learn about it!

Before we start

We need a couple things first - a target and a library, specifically. Let’s look at those first!

In case it matters, I’m testing all of this on Fedora 40, but it should work on any version of Linux that can run gcc / gdb / strace (ie, all of them).

Target

I wrote the world’s simplest C program to serve as a target:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  int i;
  for(i = 0; ; i++) {
    printf("%d\n", i);
    sleep(1);
  }
}

It just counts:

$ gcc -o target ./target.c
$ ./target
0
1
2

It’s beautiful! We’re going to be using this throughout.

`mysolib.so`

I need a shared library that does something visible, because we’re going to load it into a foreign process and we want to know that it worked, so I created this:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

static __attribute__((constructor)) void init_method(void) {
  pid_t parent = getpid();
  printf("Parent = %d\n", parent);

  if(fork() == 0) {
    for(;;) {
      if(getpgid(parent) < 0) {
        printf("Goodbye parent!\n");
        exit(0);
      }

      printf("Test!\n");
      system("sleep 1");
    }
  }
}

The idea is that once it starts, it forks, then loops forever printing something until the parent closes. It uses the constructor syntax, which is a Linux equivalent to the DllMain() function in a Windows .dll file.

In a real situation, doing a fork() would probably defeat the purpose, because now you’re in another process, but I just wanted to simplify.

(Also, forgive the system() call - I’ll explain that at the end! :) ).
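To see the constructor mechanism on its own, here is a minimal sketch. The names on_load and constructor_ran are made up for illustration; the only real ingredient is GCC's constructor attribute, which makes a function run when the object is loaded (before main() in an executable, or during dlopen() in a shared library):

```c
/* Hypothetical flag, only here to make the effect observable. */
static int constructor_ran = 0;

/* Runs automatically at load time: before main() for an executable,
 * or inside dlopen() for a shared library. */
__attribute__((constructor))
static void on_load(void) {
    constructor_ran = 1;
}

/* Returns 1 once the constructor has fired. */
int did_constructor_run(void) {
    return constructor_ran;
}
```

By the time any ordinary code gets to call did_constructor_run(), the flag is already set, which is the same reason mysolib.so can start doing things the moment it is loaded.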

How does `gdb` do it?

My gut reaction to the original blog was, why do you have to overcomplicate things? Just do what gdb does! Let’s find out what that is.

Playing with `gdb`

With target running, you can find the process id (pid) using pgrep:

$ pgrep target
23552

Then attach a debugger:

$ gdb -p 23552
[...]
Using host libthread_db library "/lib64/libthread_db.so.1".
__GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fff962cd310, rem=rem@entry=0x7fff962cd310)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:71
71        return -r;
(gdb)

Once gdb is attached to a process, you can literally just call a function with call (or print or a few other keywords):

(gdb) call printf("hi\n")
$1 = 3

And observe the output in the target process:

49
50
hi

We kinda just injected code. Cool, isn’t it? We can also, say, allocate memory (just like using VirtualAllocEx() on Windows):

(gdb) call malloc(128)
$2 = (void *) 0x22aa36d0

Populate it using strcpy (or other techniques) (just like using WriteProcessMemory() on Windows):

(gdb) call (void)strcpy((char*)0x22aa36d0, "/home/ron/projects/process-injection-experiments/linjector/mysolib.so")
(gdb) x/s 0x22aa36d0
0x22aa36d0:     "/home/ron/projects/process-injection-experiments/linjector/mysolib.so"

Then run dlopen, using that memory as a parameter (just like using LoadLibrary on Windows):

(gdb) call dlopen(0x22aa36d0, 0x102)
[Attaching after Thread 0x7f10279ab740 (LWP 23552) fork to child process 23706]
[New inferior 2 (process 23706)]
[...]
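The 0x102 passed to dlopen() isn't magic: it is two flags from dlfcn.h OR'd together. A tiny sketch, assuming the usual Linux values for those flags (the function name is mine):

```c
#include <dlfcn.h>

/* The flag word used as dlopen()'s second argument in the gdb session. */
int injection_dlopen_flags(void) {
    /* RTLD_NOW:    resolve every symbol immediately instead of lazily.
     * RTLD_GLOBAL: make the library's symbols visible to later loads. */
    return RTLD_NOW | RTLD_GLOBAL;
}
```

On a typical Linux system RTLD_NOW is 0x2 and RTLD_GLOBAL is 0x100, which gives the 0x102 typed at the gdb prompt.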

And observe the results (using the .so file above):

65
66
Parent = 23552
Test!
fish: Job 1, './target' terminated by signal SIGSEGV (Address boundary error)

I’m not 100% sure why it crashes, but it doesn’t really matter. The point is, if gdb can just call those functions in another process, why can’t I? Clearly the blog is wrong! (Spoiler: it’s not)

What `gdb` is doing

Many (smarter) people would probably read the gdb source (which has some hilarious comments) or documentation, but I got bored realllllly quickly. So I did things my own way - experimentation.

How does somebody like me learn what gdb is doing? Debug the debugger, of course! It’s just crazy enough to work!

Start the target again, attach a debugger again, get back to where we were before crashing it:

$ gdb -p $(pgrep target)
[...]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
__GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffdcb59ec20, rem=rem@entry=0x7ffdcb59ec20)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:71
71        return -r;
(gdb)

You can even type in the command so it’s ready to go:

(gdb) call malloc(0x1337)

Now in ANOTHER window, use strace to debug gdb (isn’t this fun??):

$ strace -p $(pgrep gdb) -o strace.out
strace: Process 24189 attached

Then run the command:

(gdb) call malloc(0x1337)
$1 = (void *) 0x2c2c56b0

And observe the strace output, which is written to strace.out (and is also enormous):

restart_syscall(<... resuming interrupted poll ...>) = 1
[...]
pwrite64(11, "\314", 1, 140728015121279) = 1
ptrace(PTRACE_GETREGS, 23965, {r15=0x403e00, r14=0x7faa850fb000, r13=0, r12=0, rbp=0x7ffdcb59ec10, rbx=0xffffffffffffff88, r11=0x202, r10=0x7ffdcb59ec20, r9=0xfffffffd, r8=0x64, rax=0xfffffffffffffdfc, rcx=0x7faa84f97457, rdx=0x7ffdcb59ec20, rsi=0, rdi=0, orig_rax=0xe6, rip=0x7faa84f97457, cs=0x33, eflags=0x202, rsp=0x7ffdcb59ec08, ss=0x2b, fs_base=0x7faa84eb0740, gs_base=0, ds=0, es=0, fs=0, gs=0}) = 0
ptrace(PTRACE_SETREGS, 23965, {r15=0x403e00, r14=0x7faa850fb000, r13=0, r12=0, rbp=0x7ffdcb59ec10, rbx=0xffffffffffffff88, r11=0x202, r10=0x7ffdcb59ec20, r9=0xfffffffd, r8=0x64, rax=0xfffffffffffffdfc, rcx=0x7faa84f97457, rdx=0x7ffdcb59ec20, rsi=0, rdi=0x1337, orig_rax=0xe6, rip=0x7faa84f97457, cs=0x33, eflags=0x202, rsp=0x7ffdcb59ec08, ss=0x2b, fs_base=0x7faa84eb0740, gs_base=0, ds=0, es=0, fs=0, gs=0}) = 0

[...]

pread64(11, "\220", 1, 140370351439267) = 1
pwrite64(11, "\314", 1, 140370351439267) = 1
pread64(11, "\220", 1, 140370351439399) = 1
pwrite64(11, "\314", 1, 140370351439399) = 1
pread64(11, "\220", 1, 140370352382777) = 1
pwrite64(11, "\314", 1, 140370352382777) = 1
pread64(11, "\220", 1, 140370353357258) = 1
pwrite64(11, "\314", 1, 140370353357258) = 1
pread64(11, "\220", 1, 140370353402231) = 1
pwrite64(11, "\314", 1, 140370353402231) = 1
pread64(11, "\220", 1, 140370353480957) = 1
pwrite64(11, "\314", 1, 140370353480957) = 1
pread64(11, "\314", 1, 140728015121279) = 1
pwrite64(11, "\314", 1, 140728015121279) = 1
ptrace(PTRACE_CONT, 23965, 0x1, 0)      = 0

[...]

rt_sigprocmask(SIG_BLOCK, [INT ALRM TERM CHLD WINCH], [], 8) = 0
pipe2([12, 13], O_CLOEXEC)              = 0
fcntl(12, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl(13, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
poll([{fd=12, events=POLLIN}], 1, 0)    = 0 (Timeout)
read(12, 0x7ffcfa320b87, 1)             = -1 EAGAIN (Resource temporarily unavailable)
write(13, "+", 1)                       = 1
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=23965, si_uid=1000, si_status=SIGSEGV, si_utime=0, si_stime=2 /* 0.02 s */} ---
read(12, "+", 1)                        = 1
read(12, 0x7ffcfa31fde7, 1)             = -1 EAGAIN (Resource temporarily unavailable)
write(13, "+", 1)                       = 1
rt_sigreturn({mask=[]})                 = 0
read(5, 0x7ffcfa321267, 1)              = -1 EAGAIN (Resource temporarily unavailable)
read(12, "+", 1)                        = 1
read(12, 0x7ffcfa3208b7, 1)             = -1 EAGAIN (Resource temporarily unavailable)

[...]

ptrace(PTRACE_PEEKUSER, 23965, 8*SS+8, [0x7faa84eb0740]) = 0
pread64(11, "@\7\353\204\252\177\0\0\340\20\353\204\252\177\0\0@\7\353\204\252\177\0\0\0\0\0\0\0\0\0\0"..., 2368, 140370351163200) = 2368
ptrace(PTRACE_GETREGS, 23965, {r15=0x403e00, r14=0x7faa850fb000, r13=0, r12=0, rbp=0x7ffdcb59eb68, rbx=0xffffffffffffff88, r11=0x202, r10=0x4, r9=0x1, r8=0, rax=0x2c2c69f0, rcx=0x2c2c69f0, rdx=0, rsi=0x1337, rdi=0x2c2c69f0, orig_rax=0xffffffffffffffff, rip=0x7ffdcb59eb7f, cs=0x33, eflags=0x10206, rsp=0x7ffdcb59eb70, ss=0x2b, fs_base=0x7faa84eb0740, gs_base=0, ds=0, es=0, fs=0, gs=0}) = 0
newfstatat(AT_FDCWD, "/proc/23965/fd/0", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x3), ...}, 0) = 0
fstat(0, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}) = 0

[...]

For the FULL output, you can strace the whole gdb execution - it’s much, much bigger, but you’ll see extra bits like where it reads the memory file:

$ strace -o strace.out gdb -p $(pgrep target)

What I actually learned from all this was:

gdb uses /proc/<pid>/mem to access memory, and pread() / pwrite() to read and edit it (though in the source, they note that there are other techniques that are less efficient for OSes that don’t have that file),
gdb uses ptrace(PTRACE_GETREGS, ...) and ptrace(PTRACE_SETREGS, ...) to change registers, and
gdb seems to use ptrace(PTRACE_CONT, ...) to call the function, which means it’s presumably just moving rip and calling new code from somewhere, but I never figured out how it actually did that.

So basically, gdb is doing exactly what the blog explained. Huh!
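One detail that makes the strace output easier to read: strace prints buffer bytes as octal escapes, so the "\314" in the pwrite64 lines is 0xcc (the one-byte x86 breakpoint) and the "\220" in the pread64 lines is 0x90 (nop). Here is a small sketch for converting them; the helper name is mine, not something from gdb or strace:

```c
#include <stdlib.h>

/* Decode a single strace-style octal escape such as "\314" into a byte.
 * Returns -1 if the text is not a backslash followed by octal digits. */
int strace_escape_to_byte(const char *text) {
    if (text[0] != '\\' || text[1] == '\0') {
        return -1;
    }

    char *end = NULL;
    unsigned long value = strtoul(text + 1, &end, 8);  /* base 8 = octal */
    if (*end != '\0' || value > 0xff) {
        return -1;
    }
    return (int)value;
}
```

Read that way, each pread64/pwrite64 pair looks like gdb saving an original byte and planting a breakpoint over it, and the only difference between the GETREGS and SETREGS lines is rdi changing from 0 to 0x1337, the argument to malloc.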

So what else can we do?

Okay, so I proved that I was wrong (or at least that the person I thought was wrong was actually right). But that’s not a satisfying ending, so let’s see if we can take this a step or two further!

I’m sure there are tons of ways to do this, and in particular I’m sure that whatever gdb is doing is better than what I’m going to do, but let’s look at one technique to load a custom .so file into a Linux process.

I published the final version on Github, but we’ll build a tool step by step until it gets too complex.

Debugging a remote process

Here’s the simplest possible debugger:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sys/ptrace.h>
#include <sys/wait.h>

int main(int argc, char *argv[]) {
  if(argc < 2) {
    fprintf(stderr, "Usage: %s <pid>\n", argv[0]);
    exit(1);
  }

  pid_t pid = atol(argv[1]);

  if(ptrace(PTRACE_ATTACH, pid, 0, 0) < 0) {
    fprintf(stderr, "Error attaching: %d\n", errno);
    exit(1);
  }

  // Wait for the attach to complete
  int status;
  waitpid(pid, &status, WSTOPPED);

  printf("Done!\n");

  ptrace(PTRACE_DETACH, pid, 0, 0);

  return 0;
}

It requires no special compile flags, either:

$ gcc -o test test.c
$

In general, ptrace(PTRACE_ATTACH, ...) and ptrace(PTRACE_SEIZE, ...) are the two ways to attach to an already-running process; from the manpage ptrace(2):

PTRACE_ATTACH

       Attach to the process specified in pid, making it a tracee of the calling process. The
       tracee is sent a SIGSTOP, but will not necessarily have stopped by the completion of this
       call; use waitpid(2) to wait for the tracee to stop. See the "Attaching and detaching"
       subsection for additional information. (addr and data are ignored.)

       Permission to perform a PTRACE_ATTACH is governed by a ptrace access mode
       PTRACE_MODE_ATTACH_REALCREDS check; see below.

PTRACE_SEIZE (since Linux 3.4)
       Attach to the process specified in pid, making it a tracee of the calling process.
       Unlike PTRACE_ATTACH, PTRACE_SEIZE does not stop the process. Group-stops are reported
       as PTRACE_EVENT_STOP and WSTOPSIG(status) returns the stop signal. Automatically attached
       children stop with PTRACE_EVENT_STOP and WSTOPSIG(status) returns SIGTRAP instead of
       having SIGSTOP signal delivered to them. execve(2) does not deliver an extra SIGTRAP.
       Only a PTRACE_SEIZEd process can accept PTRACE_INTERRUPT and PTRACE_LISTEN commands. The
       "seized" behavior just described is inherited by children that are automatically attached
       using PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, and PTRACE_O_TRACECLONE. addr must be
       zero. data contains a bit mask of ptrace options to activate immediately.

       Permission to perform a PTRACE_SEIZE is governed by a ptrace access mode
       PTRACE_MODE_ATTACH_REALCREDS check; see below.

When you attach with PTRACE_ATTACH, the process will receive the signal SIGSTOP (and your process will receive SIGCHLD). You can use waitpid() to wait for that signal. Then you can tinker with the process as much as you want before using PTRACE_DETACH to let the program continue to do its thing. You don’t have to use PTRACE_DETACH, it’ll auto-detach when your process ends, but it’s polite!
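The demo ignores the status that waitpid() hands back, but it can be decoded to confirm the process really stopped on SIGSTOP. A minimal sketch using the standard wait macros (the function names are mine):

```c
#include <signal.h>
#include <sys/wait.h>

/* Returns the signal that stopped the tracee, or -1 if the status
 * reported by waitpid() describes something else (exit, kill, ...). */
int stop_signal(int status) {
    if (!WIFSTOPPED(status)) {
        return -1;
    }
    return WSTOPSIG(status);
}

/* After PTRACE_ATTACH, the first stop should be the SIGSTOP. */
int attach_completed(int status) {
    return stop_signal(status) == SIGSTOP;
}
```

Calling attach_completed(status) right after the waitpid() would catch the case where the target died or stopped for some unrelated reason before we start poking at it.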

You can use strace to make sure your debugger is doing what you think it’s doing:

$ strace ./test $(pgrep target)
execve("./test", ["./test", "27290"], 0x7ffed2e67eb8 /* 62 vars */) = 0
[...]
ptrace(PTRACE_ATTACH, 27290)            = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=27290, si_uid=1000, si_status=SIGSTOP, si_utime=0, si_stime=1 /* 0.01 s */} ---
wait4(27290, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSTOP}], WSTOPPED, NULL) = 27290
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}) = 0
getrandom("\xc5\x6b\xc7\xbe\x43\xde\x6e\xdc", 8, GRND_NONBLOCK) = 8
brk(NULL)                               = 0x217ca000
brk(0x217eb000)                         = 0x217eb000
write(1, "Done!\n", 6Done!
)                  = 6
ptrace(PTRACE_DETACH, 27290, NULL, 0)   = 0
exit_group(0)                           = ?
+++ exited with 0 +++

You can’t attach gdb or strace to the target process while you’re doing this, because a process can only have one tracer at a time.
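If an attach fails and you suspect something else is already tracing the target, Linux exposes the current tracer as the TracerPid field in /proc/<pid>/status. This isn't part of linjector; it's a small sketch (with a made-up function name) that pulls the field out of the file's text:

```c
#include <stdio.h>
#include <string.h>

/* Extract the TracerPid value from the text of /proc/<pid>/status.
 * Returns the tracer's pid, 0 if nobody is tracing, or -1 if the
 * field is missing. */
long parse_tracer_pid(const char *status_text) {
    const char *field = strstr(status_text, "TracerPid:");
    long tracer = -1;

    if (field == NULL || sscanf(field, "TracerPid: %ld", &tracer) != 1) {
        return -1;
    }
    return tracer;
}
```

Feed it the contents of the status file; a non-zero result is the pid of whatever debugger got there first.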

Reading registers

Let’s read registers from the process. We saw PTRACE_GETREGS earlier, so I read just enough of the documentation to figure out how to use it, then gave it a shot:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>

int main(int argc, char *argv[]) {
  if(argc < 2) {
    fprintf(stderr, "Usage: %s <pid>\n", argv[0]);
    exit(1);
  }

  pid_t pid = atol(argv[1]);

  if(ptrace(PTRACE_ATTACH, pid, 0, 0) < 0) {
    fprintf(stderr, "Error attaching: %d\n", errno);
    exit(1);
  }

  // Wait for the attach to complete
  int status;
  waitpid(pid, &status, WSTOPPED);

  // Save registers
  struct user_regs_struct regs;
  ptrace(PTRACE_GETREGS, pid, NULL, &regs);
  printf("rip = %llx\n", regs.rip);

  printf("Done!\n");

  ptrace(PTRACE_DETACH, pid, 0, 0);

  return 0;
}

Then run it:

$ ./test $(pgrep target)
rip = 7f51bfa0b457
Done!

Run it again to make sure it doesn’t change (it WILL change if you rerun the application though):

$ ./test $(pgrep target)
rip = 7f51bfa0b457
Done!

We can use gdb and confirm that IS where it stops (assuming we stop during sleep, which is almost guaranteed):

$ gdb -p $(pgrep target)
[...]
(gdb) print/x $rip
$1 = 0x7f51bfa0b457

We can use an identical function call with the same structure and PTRACE_SETREGS to write to registers. Try changing rip to an executable memory address containing 0xcc!

Reading memory

Next, how do we read memory?

We can use /proc/<pid>/mem to access the process’s memory, and pread to read it at an arbitrary offset. Of course, we need to know where to read from, so we’re just going to use the program counter - rip - which points at the instruction that is about to be executed (rsp, the stack pointer, would also work great):

[...]
#include <fcntl.h>
[...]
  // Save registers
  struct user_regs_struct regs;
  ptrace(PTRACE_GETREGS, pid, NULL, &regs);
  printf("rip = %llx\n", regs.rip);

  char filename[256];
  snprintf(filename, 255, "/proc/%d/mem", pid);

  int mem = open(filename, O_RDWR|O_CLOEXEC);
  if(mem < 0) {
    fprintf(stderr, "Process doesn't appear to exist, or can't access its memory!\n");
    exit(1);
  }

  unsigned char buffer[16];
  pread(mem, buffer, 16, regs.rip);
  int i;
  for(i = 0; i < 16; i++) {
    printf("%02x ", buffer[i]);
  }

  printf("Done!\n");
[...]

And run it:

$ make test && ./test $(pgrep target)
cc     test.c   -o test
rip = 7f51bfa0b457
f7 d8 c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec Done!

And verify that gdb sees the same bytes in the same location:

$ gdb -p $(pgrep target)
[...]
(gdb) x/16xb $rip
0x7f51bfa0b457 <__GI___clock_nanosleep+39>:     0xf7    0xd8    0xc3    0x66    0x0f    0x1f    0x44    0x00
0x7f51bfa0b45f <__GI___clock_nanosleep+47>:     0x00    0x55    0x48    0x89    0xe5    0x48    0x83    0xec

Writing memory

Not only can we use the same technique to write to memory, we can even write to memory that the process considers read-only! Memory protections aren’t enforced on the /proc/<pid>/mem endpoint.

Let’s write a debug breakpoint (0xcc) to rip (the program counter), then use PTRACE_CONT to resume execution:

[...]
  unsigned char wbuffer = 0xcc;
  pwrite(mem, &wbuffer, 1, regs.rip);
  ptrace(PTRACE_CONT, pid, NULL, 0);

  printf("Done!\n");
[...]

Then run it! If it works the way you’d expect - which it does! - the target program will terminate immediately (well, after the sleep ends) with a SIGTRAP:

$ make test && ./test $(pgrep target)
cc     test.c   -o test
rip = 7f51bfa0b457
f7 d8 c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec Done!

And verify in target’s window:

866
867
868
fish: Job 1, './target' terminated by signal SIGTRAP (Trace or breakpoint trap)

You also get to see how long it’s taken me to write this much, but thankfully the timer resets here! :)

Taking control

Now we’re building some tools! We’re almost at the point where we have what’s essentially a debugger.

We can technically put the 0xcc anywhere in the code where we want to take control - that’s actually how your favorite debugger does software breakpoints - but putting it right at rip is a simple way to demonstrate.

This time, we’ll do the same thing, but instead of detaching and letting the program crash, we’ll use waitpid to continue debugging it. waitpid will wait until SIGTRAP hits, and catch it. Once we get control back, we validate that it was indeed SIGTRAP, then fix memory, restore registers, and carry on.

This is a good time to show you the full code again, with extra comments, in case you got lost in my samples above.. this is still the quick demo app, though, not linjector:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>

int main(int argc, char *argv[]) {
  if(argc < 2) {
    fprintf(stderr, "Usage: %s <pid>\n", argv[0]);
    exit(1);
  }

  pid_t pid = atol(argv[1]);

  // Attach to the process
  if(ptrace(PTRACE_ATTACH, pid, 0, 0) < 0) {
    fprintf(stderr, "Error attaching: %d\n", errno);
    exit(1);
  }

  // Wait for the attach to complete
  int status;
  waitpid(pid, &status, WSTOPPED);

  // Save registers
  struct user_regs_struct regs;
  ptrace(PTRACE_GETREGS, pid, NULL, &regs);
  printf("rip = %llx\n", regs.rip);

  // Open the memory
  char filename[256];
  snprintf(filename, 255, "/proc/%d/mem", pid);
  int mem = open(filename, O_RDWR|O_CLOEXEC);
  if(mem < 0) {
    fprintf(stderr, "Process doesn't appear to exist, or can't access its memory!\n");
    exit(1);
  }

  // Read 16 bytes @ rip
  unsigned char buffer[16];
  pread(mem, buffer, 16, regs.rip);
  int i;
  for(i = 0; i < 16; i++) {
    printf("%02x ", buffer[i]);
  }

  // Overwrite the current byte with a breakpoint
  unsigned char wbuffer = 0xcc;
  pwrite(mem, &wbuffer, 1, regs.rip);

  // Continue execution
  ptrace(PTRACE_CONT, pid, NULL, 0);

  // Wait for the process to stop
  waitpid(pid, &status, WSTOPPED);

  // Make sure it's a SIGTRAP and not like SIGSEGV or something
  if(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP) {
    printf("\nSIGTRAP!\n");
  } else {
    printf("\nSome other signal!\n");
    exit(1);
  }

  // Fix the memory using the first byte of what we read
  pwrite(mem, buffer, 1, regs.rip);

  // Put the registers back to what they were (including rip)
  ptrace(PTRACE_SETREGS, pid, NULL, &regs);
  printf("rip = %llx\n", regs.rip);

  printf("Done!\n");

  // Detach while the process is still stopped; detaching also lets it resume
  ptrace(PTRACE_DETACH, pid, 0, 0);

  return 0;
}

If you run that (don’t forget to restart target if it’s crashed!), you’ll see no special output:

$ make test && ./test $(pgrep target)
cc     test.c   -o test
rip = 7ffb9afdd457
f7 d8 c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec 
SIGTRAP!
rip = 7ffb9afdd458
Done!

But also, the target doesn’t crash:

[...]
3
4
5
[...]

At this point, we’ve basically made a debugger with a software breakpoint! How cool is that?
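Stripped of the ptrace plumbing, the breakpoint bookkeeping is just "save a byte, write 0xcc, put the byte back". Here is a sketch of that logic on a plain buffer, with names I made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define INT3 0xcc  /* x86 one-byte breakpoint instruction */

/* Plant a breakpoint at code[offset] and return the byte it replaced. */
uint8_t plant_breakpoint(uint8_t *code, size_t offset) {
    uint8_t original = code[offset];
    code[offset] = INT3;
    return original;
}

/* Undo plant_breakpoint() by putting the saved byte back. */
void remove_breakpoint(uint8_t *code, size_t offset, uint8_t original) {
    code[offset] = original;
}

/* After the trap fires, rip points one byte past the int3, so the
 * instruction that was replaced starts at rip - 1. */
uint64_t breakpoint_address(uint64_t rip_after_trap) {
    return rip_after_trap - 1;
}
```

That last helper matches the one-byte difference between the two rip values in the output above (…457 before, …458 after), and it is why the registers have to be restored along with the memory.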

Also, once again you can start calculating how quickly I write next time you see target’s output!

Run arbitrary code

At this point, we can effectively overwrite with any shellcode we want (defining “shellcode” as a self-contained blob of machine code which usually sets up arguments and calls syscall to do things). You can have msfvenom generate something that spawns a shell for example, but that’s kinda pointless because you can already run a shell on the computer.

Instead, we want to load a .so file, which means we need to call dlopen(). That’s a libc function, not a syscall, so we can’t just do standard shellcode - we need to access libc functions.

I’m sure there are other ways to do this, but I opted for:

Find the address of libc.so.6 in your process
Find the address of libc.so.6 in the target process
Find the address of dlsym in your process
Do math
Write a blob of code to rip, with a debug breakpoint at the end
Let the process continue executing until it hits the breakpoint
Fix the memory, reset the registers back to their original states

Once you have access to dlsym() (the equivalent of GetProcAddress()), you can find anything else you want!

In retrospect, I realize we could have just gotten the address of dlopen() directly, since we don’t need any other symbols, but being able to call other functions is handy so I’m just leaving it like this!

Finding libc

If you’re already running code on the system, you can parse /proc/<pid>/maps to figure out where things are loaded:

$ cat /proc/$(pgrep target)/maps | grep 'libc\.so'
7f281913f000-7f2819167000 r--p 00000000 00:21 3141132   /usr/lib64/libc.so.6
7f2819167000-7f28192d5000 r-xp 00028000 00:21 3141132   /usr/lib64/libc.so.6
7f28192d5000-7f2819323000 r--p 00196000 00:21 3141132   /usr/lib64/libc.so.6
7f2819323000-7f2819327000 r--p 001e4000 00:21 3141132   /usr/lib64/libc.so.6
7f2819327000-7f2819329000 rw-p 001e8000 00:21 3141132   /usr/lib64/libc.so.6

I briefly considered writing my own parser, but thankfully found that somebody already did it, and they write code kinda-sorta similarly to me so it fit rather nicely. It also has a permissive license, which makes it easy to use in my demo project. Perfect!

I wrote a little function that can find libc.so.6 in any process (use -1 for the current process) based on their example code:

#include "proc_maps_parser/pmparser.h"

unsigned long long find_libc(int pid) {
  procmaps_iterator maps_iter = {0};
  procmaps_error_t parser_err = PROCMAPS_SUCCESS;

  parser_err = pmparser_parse(pid, &maps_iter);
  if (parser_err) {
    fprintf(stderr, "Couldn't parse /proc/%d/maps (error=%d)\n", pid, (int)parser_err);
    exit(1);
  }

  // iterate over areas
  procmaps_struct *mem_region = NULL;
  while ((mem_region = pmparser_next(&maps_iter)) != NULL) {
    if(mem_region->offset == 0 && mem_region->map_type == PROCMAPS_MAP_FILE && !strcmp(mem_region->pathname, "/usr/lib64/libc.so.6")) {
      // Don't return the free'd variable! I know how to C
      void *result = mem_region->addr_start;
      pmparser_free(&maps_iter);

      return (unsigned long long) result;
    }
  }

  pmparser_free(&maps_iter);

  fprintf(stderr, "Couldn't locate the start of /usr/lib64/libc.so.6 in the remote process!\n");
  exit(1);
}

Then I used that to find the address of libc in both my process and the target process (I also got rid of all the write-0xcc-to-the-process code for now):

#include <fcntl.h>
[...]

  // Read 16 bytes @ rip
  unsigned char buffer[16];
  pread(mem, buffer, 16, regs.rip);
  int i;
  for(i = 0; i < 16; i++) {
    printf("%02x ", buffer[i]);
  }
  printf("\n");

  long long local_libc = find_libc(-1);
  long long remote_libc = find_libc(pid);

  printf("local libc = %llx, remote libc = %llx\n", local_libc, remote_libc);

  // Continue execution
  ptrace(PTRACE_CONT, pid, NULL, 0);

  printf("Done!\n");

And now when we compile, we need to link in the pmparser.c file as well:

$ gcc -Wall -o test test.c proc_maps_parser/pmparser.c && ./test $(pgrep target)
rip = 7f640d142457
f7 d8 c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec 
local libc = 7f3769a30000, remote libc = 7f640d05e000
Done!

You’ll note that they’re loaded in different places - that’s ASLR (address space layout randomization) at work.
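If you'd rather not pull in a parser library, a single maps line can be picked apart with sscanf. This is a rough sketch of the idea, not how pmparser does it; the struct and function names are hypothetical:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/<pid>/maps (hypothetical struct for this sketch). */
struct map_entry {
    uint64_t start;
    uint64_t end;
    uint64_t offset;
    char perms[5];
    char path[256];
};

/* Parse a single maps line. Returns 0 on success, -1 on malformed input.
 * Anonymous mappings have no pathname, so `path` may come back empty. */
int parse_maps_line(const char *line, struct map_entry *out) {
    out->path[0] = '\0';
    /* Fields: start-end perms offset dev(major:minor) inode [pathname] */
    int fields = sscanf(line,
                        "%" SCNx64 "-%" SCNx64 " %4s %" SCNx64 " %*x:%*x %*u %255[^\n]",
                        &out->start, &out->end, out->perms, &out->offset, out->path);
    return (fields >= 4) ? 0 : -1;
}

/* True for the mapping where the given file's image begins (offset 0). */
int is_base_of(const struct map_entry *entry, const char *path) {
    return entry->offset == 0 && strcmp(entry->path, path) == 0;
}
```

Looping over the file line by line and returning the start of the first entry where is_base_of() is true gives the same answer as the find_libc() function above.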

Finding `dlsym`

Now the easy part - we use dlsym to get the address of dlsym in our memory space, subtract our libc address (to get the offset), then add their libc address (to get the absolute address in the remote process):

#include <dlfcn.h>
[...]
unsigned long long find_dlsym(int remote_pid) {
  unsigned long long local_libc = find_libc(-1);
  unsigned long long remote_libc = find_libc(remote_pid);

  printf("local libc = %llx, remote libc = %llx\n", local_libc, remote_libc);

  unsigned long long local_dlsym = (unsigned long long) dlsym(RTLD_DEFAULT, "dlsym");
  unsigned long long dlsym_offset = local_dlsym - local_libc;
  unsigned long long remote_dlsym = dlsym_offset + remote_libc;

  printf("local dlsym = %llx, offset = %llx, remote dlsym = %llx\n", local_dlsym, dlsym_offset, remote_dlsym);

  return remote_dlsym;
}

And then we can compile/run it:

$ gcc -o test test.c proc_maps_parser/pmparser.c && ./test $(pgrep target)
rip = 7f2459188457
local libc = 7f1cdce13000, remote libc = 7f24590a4000
local dlsym = 7f1cdcea6d50, offset = 93d50, remote dlsym = 7f2459137d50

We can verify that address with a debugger (just make sure you don’t restart target, because it’ll change):

$ gdb -p $(pgrep target)
[...]
(gdb) x/i dlsym
   0x7f2459137d50 <___dlsym>:   endbr64

It does indeed match!
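The "do math" step boils down to one subtraction and one addition. As a worked sketch using the numbers from the run above (the function name is mine, and it assumes both processes have the same libc.so.6 file mapped):

```c
#include <stdint.h>

/* Translate an address found in our own process into the matching
 * address in the target, assuming both map the same libc file. */
uint64_t rebase_symbol(uint64_t local_symbol, uint64_t local_base, uint64_t remote_base) {
    uint64_t offset = local_symbol - local_base;  /* position inside libc */
    return remote_base + offset;
}
```

With local dlsym 0x7f1cdcea6d50, local libc 0x7f1cdce13000 and remote libc 0x7f24590a4000, the offset is 0x93d50 and the result is 0x7f2459137d50, the same address gdb reported.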

Writing a trampoline

I’m quite sure there’s a better way to do this, but I wrote what’s essentially shellcode with a couple templates to populate, and which ends with a debug breakpoint:

bits 64

; Make room on the stack, otherwise we will accidentally overwrite important stuff
; (rsp will be restored afterwards)
sub rsp, 1000

; Save these args to other non-volatile registers
mov rbx, 0x4141414141414141 ; dlsym address

call get_sofile
  db "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",0 ; .so file
get_sofile:
  pop rbp

call get_dlopen
  db "dlopen",0
get_dlopen:
  pop rsi
  mov rdi, 0

  call rbx ; dlsym(RTLD_DEFAULT, "dlopen")

mov rdi, rbp
mov rsi, 0x102 ; RTLD_NOW|RTLD_GLOBAL
call rax ; dlopen(<.so file>, RTLD_NOW|RTLD_GLOBAL)

; Indicate that we've finished
db 0xcc

Then assemble it:

$ nasm -o trampoline.bin trampoline.asm

Putting it all together

Starting here, I’m literally using code from linjector instead of the test program - it got to that complexity level!

I fill in those templates using memmem() to find the offset and then memcpy() to overwrite the address of dlsym (and the string to load) at those offsets:

  FILE *f = fopen("trampoline.bin", "rb");
  if(!f) {
    fprintf(stderr, "Couldn't read trampoline.bin!\n");
    exit(1);
  }

  // Get the size
  fseek(f, 0L, SEEK_END);
  long size = ftell(f);
  rewind(f);

  // Allocate memory + read the file
  uint8_t *trampoline = (uint8_t*) malloc(size);
  fread(trampoline, 1, size, f);
  fclose(f);

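  // Find and replace our templates (the .so path placeholder is 128 bytes)
  if(strlen(argv[2]) >= 128) {
    fprintf(stderr, "Path to the .so file is too long!\n");
    exit(1);
  }
  *((uint64_t*)(memmem(trampoline, size, "AAAAAAAA", 8))) = find_dlsym(pid);
  memcpy(memmem(trampoline, size, "BBBBBBBBBBBBBBBB", 16), argv[2], strlen(argv[2]) + 1);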
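A slightly more defensive sketch of the same find-and-replace step: it checks that both placeholders exist and that the path fits before touching anything. The function names and the 128-byte slot size are assumptions based on the template above, and it uses a small hand-rolled search instead of memmem() so it builds without _GNU_SOURCE:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PATH_SLOT_LEN 128  /* number of 'B' bytes reserved in the template */

/* Return a pointer to the first run of `count` bytes equal to `fill`, or NULL. */
static uint8_t *find_run(uint8_t *blob, size_t size, uint8_t fill, size_t count) {
    size_t run = 0;
    for (size_t i = 0; i < size; i++) {
        run = (blob[i] == fill) ? run + 1 : 0;
        if (run == count) {
            return blob + i + 1 - count;
        }
    }
    return NULL;
}

/* Fill in the dlsym address and library path. Returns 0 on success, -1 if a
 * placeholder is missing or the path does not fit in its slot. */
int patch_trampoline(uint8_t *blob, size_t size, uint64_t dlsym_addr, const char *so_path) {
    uint8_t *addr_slot = find_run(blob, size, 'A', sizeof dlsym_addr);
    uint8_t *path_slot = find_run(blob, size, 'B', PATH_SLOT_LEN);
    size_t path_len = strlen(so_path) + 1;  /* include the terminating NUL */

    if (addr_slot == NULL || path_slot == NULL || path_len > PATH_SLOT_LEN) {
        return -1;
    }

    memcpy(addr_slot, &dlsym_addr, sizeof dlsym_addr);
    memcpy(path_slot, so_path, path_len);
    return 0;
}
```

Only the path and its terminator are copied, so leftover 'B' bytes after the NUL are harmless padding.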

I wrote a quick function to swap memory (read memory and then overwrite it):

void swap_memory(int mem, uint64_t address, uint8_t *new_value, uint8_t *old_value, size_t length) {
  if(old_value) {
    printf("Reading %ld bytes from 0x%lx...\n", length, address);
    if(pread(mem, old_value, length, address) != length) {
      fprintf(stderr, "Error reading %ld bytes from 0x%lx: %d\n", length, address, errno);
      exit(1);
    }
  }

  printf("Writing %ld bytes to 0x%lx...\n", length, address);
  if(pwrite(mem, new_value, length, address) != length) {
    fprintf(stderr, "Error writing %ld bytes to 0x%lx: %d\n", length, address, errno);
    exit(1);
  }
}

So using that, I swap out the memory at rip with my shellcode and then continue execution:

  // Allocate a buffer to hold the bytes we're about to overwrite
  uint8_t *backup = (uint8_t*) malloc(size);

  // Replace the memory at RIP with the trampoline
  swap_memory(mem, regs.rip, trampoline, backup, size);

  // Continue the process
  ptrace(PTRACE_CONT, pid, NULL, 0);

When the code completes, it’ll swap everything back to the way it was:

  // Wait for the breakpoint
  int status;
  waitpid(pid, &status, WSTOPPED);
  print_signal(status);

  struct user_regs_struct regs2;
  ptrace(PTRACE_GETREGS, pid, NULL, &regs2);
  printf("rip (after) = %llx\n", regs2.rip);

  // Swap memory back
  swap_memory(mem, regs.rip, backup, NULL, size);

  // Restore the registers
  ptrace(PTRACE_SETREGS, pid, NULL, &regs);

  // Detach before exiting
  ptrace(PTRACE_DETACH, pid, 0, 0);

And now it’s loaded!

Here’s what it looks like with the .so file I created (NOTE that if target is running from a different folder, the path to mysolib.so has to be relative to the TARGET’s directory, not the injector’s, because it’s running in the context of target, including the working directory):

$ ./linjector $(pgrep target) "$PWD/mysolib.so"
Stopped (signal) (19)
rip = 7f640d142457
Reading 183 bytes from 0x7f640d142457...
Writing 183 bytes to 0x7f640d142457...
Trace/breakpoint trap (5)
rip (after) = 7f640d14250e
Writing 183 bytes to 0x7f640d142457...
Done!

And in the window for target, where you once again learn how fast I’m typing this:

636
637
638
639
640
Parent = 30734
Test!
641
642
Test!
643
Test!
644
Test!
645
Test!
646
Test!
^CGoodbye parent!

Now we have our own process and the target process running alongside each other. Friends forever!!

Funny note

I mentioned at the start that I’d explain why I’m using system("sleep 1") instead of sleep(1).

Well, because the target process also sleeps, when we do the injection we’re actually overwriting code inside the sleep() function. That means if we call sleep() again before we fix the memory (in the target process), it’ll hit the trampoline again and bad things will ensue. :)

Conclusion

So yeah, using linjector will let you load an arbitrary .so file into an arbitrary process, provided you have permissions.

I wrote this in a couple afternoons as a proof of concept, not as a production tool, so YMMV.

Have fun!