.noise

hello_friend

I’ve decided to come back to writing after some disturbing times. My life has been turned upside down, and not for the first time. But that’s okay, life won’t give out lemons for free.

Enough of that, you came here for a reason. Like myself, you’re interested in the art of writing assembler code. So today, we are diving into this topic. If you haven’t already, check out the core architecture of x86 first.

As always, here are some beats to silence the world around us while we dive into it. I hope you like old synth-wave.

https://youtu.be/LYOtZvwNCsc?si=x55xY2_JaPKkTP9D

Preparation

First, we need to install an assembler. If you’re running Linux, you can use NASM, the Netwide Assembler. On Windows, MASM, the Microsoft Macro Assembler, is not part of the base system; it ships with the C++ build tools of Visual Studio (as ml.exe and ml64.exe), so you’d usually get it by installing Visual Studio. It is also possible to install the (open source) NASM on Windows, but it lacks some Windows extensions and is more minimalistic than MASM.

Since I’m neither a big fan of Microsoft nor a big fan of AT&T syntax, I will mainly focus on Intel syntax with NASM (Linux), but this should not hold you back, since you can code on either system. Just run

sudo apt update
sudo apt install nasm
nasm -v

to install NASM, assuming you’re using the apt package manager.

On Linux, I’m just assuming that you have access to gcc/g++ and gdb. On Windows, you can use the GNU compiler or the Microsoft compiler; just know that they might produce different machine code.

Finding the right development tools

For IDEs, Visual Studio Code is a great choice IMHO and probably the last IDE I’ll ever use. If you want something smaller with less overhead, Sublime Text is the best text editor out there. On Windows, classic Visual Studio might be something for you, although I think it’s very overloaded and takes a bazillion years to install.

If you want to debug the binaries we’ll produce, there are a couple of choices available. First of all, gdb is a well-documented debugger that’s been used for years. The downside with gdb is that you’ll have to practice a bit until you are fast enough to really work on a productive level (but that goes for every debugger, right?). Gdb is good enough for all our basic needs. However, there are some more sophisticated tools out there. NSA’s Ghidra is probably one of the strongest tools for reverse engineering and debugging available, but takes some time to master. Then there is REDasm, a more lightweight suite with a sleeker UI. There’s also the option to write our own debugger; we will look into this in the future.

REDasm has a nice program flow view, which can help beginners

Before we get to coding anything, make sure your VSCode is configured correctly.

Hands-on

Let’s look at a simple example program aka hello_world first:

section .data
    hello_friend db 'hello_friend', 0

section .text
    global _start

_start:
    ; Write the string to stdout
    mov eax, 4             ; syscall number for sys_write
    mov ebx, 1             ; file descriptor 1 (stdout)
    mov ecx, hello_friend  ; pointer to the string
    mov edx, 12            ; length of the string
    int 0x80               ; call kernel

    ; Exit the program
    mov eax, 1             ; syscall number for sys_exit
    xor ebx, ebx           ; exit code 0
    int 0x80               ; call kernel

This code is simple, but it also has some issues, so we will look at a simpler approach in a minute. For now, let’s just look at some fundamental principles here. First of all, code and data are organized in so-called sections. We can find the sections .data and .text in this example. I already wrote something about ELF files and their structure. In a nutshell, our string “hello_friend” is placed in the data section, which is not executable, while the rest of the code is inside the text section and thus marked “executable” - we can actually run the code.

Everything after a [ ; ] is dropped by the assembler, since these are just comments. [ _start: ] marks a label, which can be addressed in actual code, e.g. in a loop that jumps back to its head. Labels don’t exist as instructions in the final binary; the assembler and the linker replace them with addresses (a global label like _start only survives as an entry in the symbol table). Labels are also called symbolic addresses.

What other information can we gather from this example? Let’s look at the registers that are used here. Registers are small storage cells directly on the CPU, in this case eax, ebx, ecx and edx. The e is short for extended, which means we are dealing with 32 bit wide registers here. Modern architectures use 64 bit wide registers called rax, rbx and so on, but oftentimes we still need the lower 32 bit part of a register, e.g. when working with integers (since a C int is usually 32 bit wide).
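To make the relationship between rax and eax a bit more tangible, here is a small C++ sketch that models it with plain integers. The helper names (eax_of, ax_of, al_of, write_eax, write_ax) are made up for this illustration and are not part of any library; the only hardware facts assumed are that the narrower names alias the low bits of rax, and that on x86-64 a write to a 32 bit register clears the upper 32 bits while a 16 bit write leaves them alone.

```cpp
#include <cstdint>

// Hypothetical helpers: read the narrower views of a 64 bit register value.
constexpr std::uint32_t eax_of(std::uint64_t rax) { return static_cast<std::uint32_t>(rax); } // low 32 bits
constexpr std::uint16_t ax_of(std::uint64_t rax)  { return static_cast<std::uint16_t>(rax); } // low 16 bits
constexpr std::uint8_t  al_of(std::uint64_t rax)  { return static_cast<std::uint8_t>(rax); }  // low 8 bits

// Writing a 32 bit register on x86-64 zero-extends into the full 64 bit register,
// so the old content is irrelevant.
constexpr std::uint64_t write_eax(std::uint64_t /*old_rax*/, std::uint32_t value) {
    return value;
}

// Writing the 16 bit register keeps the upper 48 bits untouched.
constexpr std::uint64_t write_ax(std::uint64_t old_rax, std::uint16_t value) {
    return (old_rax & ~std::uint64_t{0xFFFF}) | value;
}

// Compile-time checks of the model.
static_assert(eax_of(0x1122334455667788ULL) == 0x55667788u, "eax is the low half of rax");
static_assert(ax_of(0x1122334455667788ULL) == 0x7788u, "ax is the low quarter of rax");
static_assert(al_of(0x1122334455667788ULL) == 0x88u, "al is the lowest byte of rax");
static_assert(write_eax(0xFFFFFFFFFFFFFFFFULL, 4) == 4, "32 bit write clears the upper half");
static_assert(write_ax(0xFFFFFFFFFFFFFFFFULL, 4) == 0xFFFFFFFFFFFF0004ULL, "16 bit write keeps the rest");
```

Compiling this with g++ -c is enough to run the checks, since they are all evaluated at compile time.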

The [ mov ] command is one of the most common instructions we will encounter and is short for move. Note there is only one mnemonic called [ mov ], but depending on the operands (which register we move into, and from where), the machine code will differ. Also, this flavor is called Intel syntax, which I think is easier to read than its counterpart, AT&T syntax. Let’s look at a random line of code:

mov eax, 4

Mov is the mnemonic (the operation), followed by 2 operands. In Intel syntax, the first operand is the destination and the second one is the source, so the second operand is moved or applied to the first one. In this case, we move the value 4 into the CPU register eax.

Also, the code uses interrupt 0x80 to invoke sys_write and sys_exit, which are Linux system calls. However, I would like to make our code system-independent, so we need to let the compiler and its C runtime library handle which system calls are actually used. This is one of the broader issues while learning assembler programming.

C calls to assembly functions

As we just discovered, the example code is tailored towards Linux operating systems. However, this is not a good approach to learn coding and also is platform-dependent, so what can we do about this? We will use a compiler to organize our code and also refrain from using system calls. To achieve this, let’s create a small C++ program from which we can run our assembly code:

#include <cstdio>   // fprintf, stdout and getchar are declared here

extern "C" int addNumbers(int a, int b, int c);

int main()
{
    fprintf(stdout, "Test before call\n");

    //arguments transported via edi, esi, edx
    int a = 10, b = 20, c = 30;
    int sum = addNumbers(a,b,c);
    //returnValue transported via eax

    fprintf(stdout, "a: %d\nb: %d\nc: %d\n", a,b,c);
    fprintf(stdout, "sum: %d", sum);

    getchar();

    return 1337;
}

This program creates 3 local variables and then passes these to our assembler function marked as extern “C” int. Now, before we look at the assembler code, we need to talk about calling conventions. After all, the data needs to be passed to our function to be handled, and it also needs to come back to be written into our “sum” variable.

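You have probably already heard of the stack. This is where, in 32 bit days, the arguments to functions were placed. Nowadays, on 64 bit Linux (the System V AMD64 calling convention), the first 6 integer or pointer arguments are passed via the registers rdi, rsi, rdx, rcx, r8 and r9. These are caller-saved, just like rax, r10 and r11: a function may overwrite them freely, so the caller has to save whatever it still needs. The registers rbx, rbp and r12 to r15 are callee-saved and must hold the same values after a function call as before it. Since we only passed integers in our example code, it is okay to use only the lower 32 bits of these registers (hence edi instead of rdi). Note that 64 bit Windows uses a different convention, starting with rcx, rdx, r8 and r9.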

If we have more than 6 integer arguments or very large or unusual data, the remaining ones are still pushed onto the stack. Also, floating point arguments are passed via the xmm registers (xmm0 to xmm7), but other than that work just the same.
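As a memory aid, the register assignment can be written down as a tiny lookup. This is only a sketch for the System V AMD64 convention used on 64 bit Linux, and only for simple scalar arguments; the names IntArgSlot, int_arg_slot and float_arg_xmm_index are invented for this example, and structs or other unusual data follow more complicated rules that are not modeled here.

```cpp
// Hypothetical model: where does the n-th argument travel?
// The enumerator order mirrors the order in which the registers are used.
enum class IntArgSlot { rdi, rsi, rdx, rcx, r8, r9, stack };

// index is 0-based and assumed to be non-negative:
// 0 is the first integer or pointer argument.
constexpr IntArgSlot int_arg_slot(int index) {
    return index < 6 ? static_cast<IntArgSlot>(index) : IntArgSlot::stack;
}

// Floating point arguments are counted separately: xmm0 to xmm7,
// everything after that goes onto the stack (reported as -1 here).
constexpr int float_arg_xmm_index(int index) {
    return index < 8 ? index : -1;
}

// addNumbers(a, b, c): a in rdi (edi), b in rsi (esi), c in rdx (edx).
static_assert(int_arg_slot(0) == IntArgSlot::rdi, "first argument");
static_assert(int_arg_slot(1) == IntArgSlot::rsi, "second argument");
static_assert(int_arg_slot(2) == IntArgSlot::rdx, "third argument");
static_assert(int_arg_slot(6) == IntArgSlot::stack, "seventh argument spills to the stack");
static_assert(float_arg_xmm_index(0) == 0, "first float argument uses xmm0");
static_assert(float_arg_xmm_index(8) == -1, "ninth float argument spills to the stack");
```

The first three checks match the comment in the C++ program above: a, b and c arrive in edi, esi and edx.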

Let’s look at what is happening whenever we call a function in a 64 bit system:

Stack Frame	Offset	Comment
arg 7	rbp + 0x10	arguments 7-9 are pushed onto the stack
return address	rbp + 0x8	where we will go after work
base pointer	rbp	the saved base pointer of the caller; every stack frame has one base
local variable 1	rbp - 0x8	depending on number and size of variables
local variable 2	rbp - 0x10	

First, arguments 1-6 are being placed in their respective registers. If there are still more arguments, the remaining ones are pushed onto the stack by the caller, like arg 7 in the above example. Then the call instruction places the return address on the stack, pointing to the next instruction after the call from where we originated. Following that, the called function usually pushes the caller’s base pointer onto the stack and copies the stack pointer into rbp, creating a kind of “anchor” to address stack variables in the current stack frame more easily. The register rbp, which stands for base pointer, then points to the base of the current stack frame for as long as the function runs. Before we return from the current function and the frame is destroyed, the old value is restored from the stack, and rbp points to the base of the caller’s frame again.

If we want to access e.g. one of the local variables, we can use pointer arithmetic:

mov rax, [rbp - 8] ; will move the content at address basepointer - 8 into rax
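The offsets from the stack frame table can also be expressed as simple arithmetic. The following sketch assumes the usual prologue (push rbp, then mov rbp, rsp) and, like the table, gives every slot 8 bytes; real compilers pack smaller variables more tightly, so take it as a model and not as what g++ will emit. All function names are made up for this example.

```cpp
#include <cstdint>

// Offsets relative to rbp, valid after "push rbp / mov rbp, rsp".
constexpr std::int64_t saved_rbp_offset()      { return 0; } // [rbp]     holds the caller's rbp
constexpr std::int64_t return_address_offset() { return 8; } // [rbp + 8] holds the return address

// stack_arg_index: 0 for the 7th argument, 1 for the 8th, and so on.
constexpr std::int64_t stack_arg_offset(int stack_arg_index) {
    return 16 + 8 * static_cast<std::int64_t>(stack_arg_index);
}

// local_index: 1 for the first local variable, 2 for the second, and so on
// (assuming 8 byte slots, as in the table).
constexpr std::int64_t local_offset(int local_index) {
    return -8 * static_cast<std::int64_t>(local_index);
}

static_assert(saved_rbp_offset() == 0, "saved base pointer sits at rbp");
static_assert(return_address_offset() == 0x8, "return address sits at rbp + 0x8");
static_assert(stack_arg_offset(0) == 0x10, "arg 7 sits at rbp + 0x10");
static_assert(local_offset(1) == -0x8, "local variable 1 sits at rbp - 0x8");
static_assert(local_offset(2) == -0x10, "local variable 2 sits at rbp - 0x10");
```

So the mov above reads exactly the slot that local_offset(1) describes.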

Now, let’s look at the rather simple assembly function we called here:

section .text
global addNumbers

addNumbers:
; init a stack frame pointer    (prologue)

    push    rbp
    mov     rbp, rsp

; calculate sum

    xor     rax,    rax
    add     eax,    edi
    add     eax,    esi
    add     eax,    edx

; restore stack frame       (epilogue)

    pop     rbp
    ret

First, we declare the label “addNumbers” as global, which tells the assembler to export it as a symbol, so the linker can resolve the call from our C++ code to it. The function itself goes into the .text section, together with all other executable code. Then we build up a function prologue, pushing the base pointer onto the stack and grabbing the value of our stack pointer to address variables etc. from the stack via rbp.

Then we produce our payload. Return values are usually passed via the rax register, so we start by clearing that out by xoring it with itself, resulting in 0x0.

Since we only calculate integers, we can just add the lower 4 bytes of our argument registers onto the lower 4 bytes of rax. Since the operands of an add need to have the same width, we add all values onto eax. Normally, it would be wise to check the overflow flag after the operations (e.g. with jo), but we can neglect that in this simple example. If you are interested, pass values whose sum does not fit into 4 bytes and watch the result wrap around.

Finally, we need to restore our base pointer and return. The ret instruction pops the return address from the stack and jumps back to our caller, which then reads the result from eax, resulting in a sum of [ a+b+c ].
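If you want something to compare the assembly against, here is a small C++ reference model of what the three add instructions compute, including the wrap-around when the sum gets too big. The names add_numbers_model, wide_sum and sum_fits_in_int32 are hypothetical helpers for this sketch. It assumes a compiler like g++, where converting an out-of-range unsigned value back to a signed one wraps around (this is implementation-defined before C++20).

```cpp
#include <cstdint>

// Models "add eax, edi / add eax, esi / add eax, edx":
// 32 bit additions that wrap modulo 2^32 instead of failing.
constexpr std::int32_t add_numbers_model(std::int32_t a, std::int32_t b, std::int32_t c) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(a) +
                                     static_cast<std::uint32_t>(b) +
                                     static_cast<std::uint32_t>(c));
}

// The same sum in 64 bit arithmetic, where three 32 bit values cannot overflow.
constexpr std::int64_t wide_sum(std::int32_t a, std::int32_t b, std::int32_t c) {
    return static_cast<std::int64_t>(a) + b + c;
}

// True if the final sum can be represented in the 32 bit return value.
constexpr bool sum_fits_in_int32(std::int32_t a, std::int32_t b, std::int32_t c) {
    return wide_sum(a, b, c) >= INT32_MIN && wide_sum(a, b, c) <= INT32_MAX;
}

static_assert(add_numbers_model(10, 20, 30) == 60, "the example from main");
static_assert(sum_fits_in_int32(10, 20, 30), "no overflow for small values");
static_assert(add_numbers_model(INT32_MAX, 1, 0) == INT32_MIN, "the sum wraps around");
static_assert(!sum_fits_in_int32(INT32_MAX, 1, 0), "and the check catches it");
```

The last two checks show the effect described above: the program does not crash, it just returns a wrong (negative) sum.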

Let’s compile and test this:

# first, create an object file from our assembler source
nasm -f elf64 addNumbers.asm -o addNumbers.o
# next, compile our c++ source and link it together with the object file
g++ main.cpp addNumbers.o -o simpleCalc
# finally, execute the program as a process
./simpleCalc

Looking into the disassembly

The next part assumes that you have a basic understanding of debugging programs. We will use gdb and VSCode for this. If you plan on further exercising this skill, I highly recommend learning to use gdb on the command line, and when you have a solid foundation, switch to a professional tool like Ghidra. For developing, the VSCode debugger is more than sufficient.

I don’t want to go deep into the field of reverse engineering here, we will just take a quick look at the binary that g++ produced.

After a call to fwrite in the procedure linkage table (“Test before call”), we can see that the numbers 10, 20 and 30 are placed into variables on the current stack frame of the main function. Main is, after all, just a regular function that has been called already. This means that in contrast to what most coding tutorials tell you, main is not the first function called in a program. The actual entry point is called _start; it sets up the C runtime, which runs the constructors of global objects and only then calls main.

As we already found out, a base pointer is used to address local variables on the stack. Since we are dealing with 3 integers, the compiler uses 32 bit memory operands, marked with the size specifier DWORD PTR, to write the number 10 into rbp-0x10. After the numbers 20 and 30 have been placed onto the stack as well, we arrive at the point where we are about to call addNumbers. Since 64 bit systems pass arguments via rdi, rsi and rdx, the compiler now wants to fill these registers but (without optimization flags) fails miserably, using 5 instructions to copy 3 values before doing the call to addNumbers. Can you already guess how to optimize this code here?

Our function is placed in the binary just as we expected it. The compiler will not even try to optimize assembler code; the toolchain just aligned it with a nop so its memory address ends in a 0. Further, we can see that using the mnemonic “add” here 3 times results in 3 x opcode byte 01, each followed by a different byte, since we have different operands for the add operation.
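The “different byte” after each 01 is the so-called ModRM byte, which encodes the two registers. Here is a small sketch of how it is put together for the register-to-register case. It assumes the assembler picked opcode 01 (add r/m32, r32), as seen in the disassembly; there is also an alternative encoding with opcode 03 where source and destination swap places, which is not modeled here. Reg32 and modrm_reg_to_reg are names I made up for this example.

```cpp
#include <cstdint>

// Register numbers as used in x86 instruction encoding (32 bit names shown).
enum Reg32 : std::uint8_t { eax = 0, ecx = 1, edx = 2, ebx = 3, esp = 4, ebp = 5, esi = 6, edi = 7 };

// ModRM layout: bits 7-6 = mod (11b means "both operands are registers"),
// bits 5-3 = reg (the source for opcode 01), bits 2-0 = rm (the destination).
constexpr std::uint8_t modrm_reg_to_reg(Reg32 source, Reg32 destination) {
    return static_cast<std::uint8_t>(0xC0 | (source << 3) | destination);
}

// The three adds from addNumbers, each encoded as 01 followed by its ModRM byte.
static_assert(modrm_reg_to_reg(edi, eax) == 0xF8, "add eax, edi -> 01 F8");
static_assert(modrm_reg_to_reg(esi, eax) == 0xF0, "add eax, esi -> 01 F0");
static_assert(modrm_reg_to_reg(edx, eax) == 0xD0, "add eax, edx -> 01 D0");
```

You can compare these bytes with the output of objdump -d addNumbers.o to see whether your assembler made the same choice.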

Great, right? But how can we optimize what the compiler generates inside main? Unsurprisingly, the answer is to bypass the unnecessary local variables a, b and c. If we pass the values directly to our function call like this

    //arguments transported via edi, esi, edx
    int a = 10, b = 20, c = 30;
    int sum = addNumbers(10,20,30);
    //returnValue transported via eax

then the compiler will generate this output instead:

It seems like the compiler prefers to move values from memory into the argument registers via the eax, ecx and edx registers, but if we insert the values directly into our argument list, the compiler will use only 3 instructions to pass them to our function. This micro optimization is rather pointless, until we patch it into a function that gets called 10 million times, e.g. in a video game loop. Suddenly, we saved 2 out of 5 instructions on every call.

The downside of this is that the code gets increasingly harder to read the more it is optimized. People will tell you again and again that magic numbers are evil and that readability is an important code metric. I can’t really argue against this since even Robert C. Martin, author of Clean Code, makes these statements. However, there is always the option to patch the final binary without ever touching the code 乁( ⁰͡ Ĺ̯ ⁰͡ ) ㄏ

I hope you found some help and inspiration in this article, I will definitely write more again. There are many articles and tutorials from an old blog of mine that I can transcribe, and I wrote a lot of code lately, so there are definitely topics to be handled here. Anyway, thank you for reading this much text - if you like what you see, consider writing me a short comment.

Matane
