Tutorials/Assembler Tutorial
Everything that is executed on a computer is executed in machine language. If you develop software in PHP, this software will be interpreted by the PHP interpreter, which is itself available in machine language. If you write software in C, the C compiler will translate your source code into machine language, a process known as compiling. Machine language is the godfather of programming languages, and assembly language writes it down as mnemonics, where one mnemonic stands for one command in machine language. The assembler is the program that translates these mnemonics into machine language. You see this is very low-level and I like low-level topics. So here I show you how I deal with machine language and assembler. I am using 64-bit x86 Linux in the examples.
Endless loop
A "hello world" program in assembler is already advanced. So as a first lesson we will take a look at a program that does nothing but loop endlessly. Here it is:
endless.asm
global _start
_start:
    nop
    jmp _start
This assembler source code contains two commands, "nop" for "no operation" and "jmp" for "jump". The other two lines are a label (_start:) and meta-information (global _start, which makes the label _start visible to the linker; the linker uses _start as the point where the program starts).
assemble it
nasm -f elf64 endless.asm
link it
ld -s -o endless endless.o
call it
./endless
Hello world
We now create a hello world program in C. Then we compile and disassemble it. So we have the C compiler translate it into machine language and then we use a disassembler to translate it into assembler. This is the program:
cat hello.c
#include <stdio.h>

int main()
{
    int i=0x23;
    printf("hello world");
}
Now we compile it:
gcc hello.c -o hello
and see that it runs:
./hello
hello world
To disassemble it, say
objdump -M intel -d hello
And the result for the main function is:
000000000040053c <main>:
  40053c:  55                    push   rbp
  40053d:  48 89 e5              mov    rbp,rsp
  400540:  48 83 ec 20           sub    rsp,0x20
  400544:  c7 45 fc 23 00 00 00  mov    DWORD PTR [rbp-0x4],0x23
  40054b:  bf 4c 06 40 00        mov    edi,0x40064c
  400550:  b8 00 00 00 00        mov    eax,0x0
  400555:  e8 d6 fe ff ff        call   400430 <printf@plt>
  40055a:  c9                    leave
  40055b:  c3                    ret
To understand this you should know that every processor has a set of registers. eax, edi, rbp and rsp are such registers. The "push rbp" command is only one byte, 55 hexadecimal, and means that the processor takes the value of its register rbp and stores it on the stack in memory, so that it can be restored later using the pop command. The "mov" command stands for "move" and says that one register's value is copied into another register, or a constant value is copied into a register, or a value is copied into RAM. Note that "mov" translates, depending on its exact operands, to quite different bytes in machine language: in the above example b8, bf, c7 and 48 89. b8 is followed by 4 bytes of parameters, 48 89 by only one.
"sub" stands for "subtract"; here it reserves 0x20 bytes on the stack. "leave" restores rsp and rbp to the values they had when main was entered. "ret" stands for "return": it ends main and returns to the code that called it, which is the C startup code that then hands control back to the operating system. "call" is the counterpart: it calls a function that is in memory, in this case the library function printf. The actual "hello world" string is stored not in the <main> function but in a read-only data section; the command "mov edi,0x40064c" loads its address as the argument for printf. Note that the "text" section is the "code" section; it is the section that will be executed.
GCC assembler
To learn the syntax of a gcc assembler program, let's write a C program and compile it without assembling it. Here is the C program, hello.c:
#include <stdio.h>

int main()
{
    int i=0x23;
    printf("hello world");
}
Now we compile this without assembling it:
# gcc -o hello.asm -S hello.c
Now we have the program transformed to assembler and take a look at it:
# cat hello.asm
        .file   "hello.c"
        .section        .rodata
.LC0:
        .string "hello world"
        .text
.globl main
        .type   main, @function
main:
.LFB2:
        pushq   %rbp
.LCFI0:
        movq    %rsp, %rbp
.LCFI1:
        subq    $32, %rsp
.LCFI2:
        movl    $35, -4(%rbp)
        movl    $.LC0, %edi
        movl    $0, %eax
        call    printf
[...]
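Now we know the syntax of gcc assembler. Compared with nasm, the operands are in reverse order (source first, destination last), registers are prefixed with %, constants with $, and the command names carry a size suffix such as q or l. With this we can finally write a program that consists of an endless loop:
.text
.globl main
main:
start:
        nop; jmp start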