Compilation Process
Compilation is the process of converting source code into object code, and it is carried out by a compiler. The compiler checks the source code for syntactic or structural errors and, if the source code is error-free, produces the object code.
Last update: 2022-06-04
A program goes through the following steps before it is translated into an executable form:
- Preprocessing
- Compilation
- Assembly
- Linking
Source code used in this guide, in four files (mylib.h, mylib.c, header.h, and source.c, shown in that order):
// mylib.h: declaration
int min(int a, int b);
// mylib.c: implementation
int min(int a, int b) {
    return (a < b) ? a : b;
}
// header.h: include mylib.h to look for the min() function
#include "mylib.h"
// macros
#define SPEED_MAX 10 /* comment */
#define SPEED_INC 1 /* will be removed */
#define SPEED_UP(x) min((x) + SPEED_INC, SPEED_MAX)
#include <stdbool.h>
#include "header.h"
#ifndef SPEED_INIT
#define SPEED_INIT 0
#endif
int spd = SPEED_INIT;
void main() {
    while(true) {
        spd = SPEED_UP(spd);
    }
}
Preprocessing
The preprocessor does the following work:
- Expand included files
- Substitute macros
- Remove disabled code and comments
It works on one C source file at a time. It also adds special line markers that tell the compiler where each line came from, so that the compiler can produce sensible error messages.
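The effect of such line markers can be imitated in ordinary source code with the standard #line directive. The sketch below is only an illustration: the file name virtual.c and the functions reported_line() and reported_file() are hypothetical and are not part of this guide's sources.

```c
/* Pretend that the next line is line 100 of a file named "virtual.c".
 * Both the number and the file name are made up for this illustration. */
#line 100 "virtual.c"
int reported_line(void) { return __LINE__; }
const char *reported_file(void) { return __FILE__; }
```

Because reported_line() sits on the line directly after the directive, the compiler treats it as line 100 of virtual.c. The markers in the preprocessor output map each line back to its original file and line in a similar way.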
Some errors can be produced at this stage with clever use of the #if and #error directives.
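As a minimal sketch of that idea, the check below stops the build during preprocessing when SPEED_INIT is larger than SPEED_MAX. The macro names come from this guide; the check itself and the helper initial_speed() are assumptions added for illustration.

```c
#define SPEED_MAX 10

#ifndef SPEED_INIT
#define SPEED_INIT 0
#endif

/* Stop the build during preprocessing if the configuration is inconsistent. */
#if SPEED_INIT > SPEED_MAX
#error "SPEED_INIT must not be greater than SPEED_MAX"
#endif

/* Hypothetical helper, only here so the checked value can be observed. */
int initial_speed(void) {
    return SPEED_INIT;
}
```

With the default value the file compiles normally. Building it with -DSPEED_INIT=20 should instead fail with the #error message before the compilation step starts.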
Find include files:
gcc -M source.c
/usr/include/stdc-predef.h /usr/lib/gcc/x86_64-linux-gnu/7/include/stdbool.h header.h mylib.h
stdc-predef.h is included implicitly by the compiler (note that source.c does not name it) and contains predefined macro definitions that describe the global environment.
Output of Preprocessor:
- The content of mylib.h is copied into header.h; after that, the new content of header.h is copied into source.c. The content of stdbool.h is also copied into source.c.
- Macros are expanded to their final definitions. Macros defined on the command line are added to the source code in this step. If you don’t define SPEED_INIT, the #ifndef branch is taken and SPEED_INIT is defined as 0.
- All comments are removed.
gcc -E source.c -DSPEED_INIT=5
# 1 "source.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "source.c"
# 1 "/usr/lib/gcc/x86_64-linux-gnu/7/include/stdbool.h" 1 3 4
# 2 "source.c" 2
# 1 "header.h" 1
# 1 "mylib.h" 1
int min(int a, int b);
# 3 "header.h" 2
# 3 "source.c" 2
int spd = 5;
void main() {
while(
# 7 "source.c" 3 4
1
# 7 "source.c"
) {
spd = min((spd) + 1, 10);
}
}
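The expansion of SPEED_UP(spd) into min((spd) + 1, 10) can also be checked at run time. The sketch below gathers the guide's definitions into a single file and, as an assumption made for testability, replaces the endless loop with a hypothetical helper speed_after() that applies the macro a fixed number of times.

```c
/* Same definitions as in mylib.c and header.h, gathered into one file. */
static int min(int a, int b) {
    return (a < b) ? a : b;
}

#define SPEED_MAX 10
#define SPEED_INC 1
#define SPEED_UP(x) min((x) + SPEED_INC, SPEED_MAX)

#ifndef SPEED_INIT
#define SPEED_INIT 0
#endif

/* Hypothetical helper: apply SPEED_UP a fixed number of times
 * instead of looping forever as main() does in source.c. */
int speed_after(int steps) {
    int spd = SPEED_INIT;
    for (int i = 0; i < steps; i++) {
        spd = SPEED_UP(spd); /* expands to min((spd) + 1, 10) */
    }
    return spd;
}
```

Starting from the default SPEED_INIT of 0, the value grows by one per step and then stays at SPEED_MAX, which is the behaviour the expanded expression describes.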
To see the macros used in source.c, run with the -E -dU options:
SPEED_INIT defined with -DSPEED_INIT=5:
gcc -E -dU source.c -DSPEED_INIT=5
# 3 "source.c" 2
int spd = 5;
#define SPEED_INIT 5
SPEED_INIT not defined on the command line:
gcc -E -dU source.c
# 3 "source.c" 2
#undef SPEED_INIT
int spd = 0;
#define SPEED_INIT 0
Compilation
The compilation step is performed on each output of the preprocessor. The compiler parses the pure source code (now without any preprocessor directives) and converts it into assembly code.
You will see instructions that declare the object spd and the function main(), and a call to the function min(). Note that there is no implementation of the function min().
gcc -S source.c -DSPEED_INIT=5
.file "source.c"
.text
.globl spd # symbol spd
.data
.align 4
.type spd, @object # is an object
.size spd, 4 # size of int = 4
spd: # definition of spd
.long 5 # with init value is 5
.text
.globl main # symbol main
.type main, @function # is a function
main: # definition of main
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
.L2:
movl spd(%rip), %eax # load spd to register
addl $1, %eax # add 1 to spd in the register
movl $10, %esi # load 10 to register
movl %eax, %edi # load calculated value of (spd+1)
call min@PLT # call to min()
movl %eax, spd(%rip) # save value back to spd
jmp .L2 # loop back to L2 (while(true))
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0"
.section .note.GNU-stack,"",@progbits
Assembly
The assembler creates an object file containing machine code in a structured format (ELF, COFF, etc.). This object file contains the compiled code (in binary form) of the symbols defined in the input. Symbols in object files are referred to by name.
Object files can refer to symbols that are not defined. This is the case when you use a declaration, and don’t provide a definition for it.
All symbols and their definitions are listed, but they are not yet assigned to any address in memory. In other words, an object file does not say where a symbol will finally be located.
The produced object files can be put in special archives called static libraries, for easier reusing later on.
“Regular” compiler errors, such as syntax errors or type mismatches, are reported during the preceding compilation step; the assembler only receives assembly code that the compiler has already accepted.
Compilers usually save all compiled object files after this point. This is very useful because with it you can compile each source code file separately. The advantage this provides is that you don’t need to recompile everything if you only change a single file.
Try this command:
gcc -c source.c mylib.c -DSPEED_INIT=5
You can no longer read the object file in a normal text editor, because its content is binary. We have to use the objdump tool.
Symbol Table:
objdump -t source.o
You can see that the spd object is located in the .data section, the main function is in the .text section, and the function min is undefined (*UND*). None of them has been assigned a final address yet.
source.o: file format elf64-x86-64
SYMBOL TABLE:
0000000000000000 l df *ABS* 0000000000000000 source.c
0000000000000000 l d .text 0000000000000000 .text
0000000000000000 l d .data 0000000000000000 .data
0000000000000000 l d .bss 0000000000000000 .bss
0000000000000000 g O .data 0000000000000004 spd
0000000000000000 g F .text 0000000000000021 main
0000000000000000 *UND* 0000000000000000 _GLOBAL_OFFSET_TABLE_
0000000000000000 *UND* 0000000000000000 min
Disassembly:
objdump -D source.o
source.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # load spd
a: 83 c0 01 add $0x1,%eax # add 1 to spd
d: be 0a 00 00 00 mov $0xa,%esi # load 10
12: 89 c7 mov %eax,%edi # load (spd+1)
14: e8 00 00 00 00 callq 19 <main+0x19> # call to min()
19: 89 05 00 00 00 00 mov %eax,0x0(%rip) # save to spd
1f: eb e3 jmp 4 <main+0x4> # loop back
Disassembly of section .data:
0000000000000000 <spd>:
0: 05 .byte 0x5 # spd = 5
1: 00 00 add %al,(%rax)
Linking
The linker produces the final output from the object files that the previous steps produced. This output can be either an executable or a shared (or dynamic) library; despite the similar name, shared libraries have little in common with the static libraries mentioned earlier.
It links all the object files by replacing the references to undefined symbols with the correct addresses. Each of these symbols can be defined in other object files or in libraries. If they are defined in libraries other than the standard library, you need to tell the linker about them.
At this stage the most common errors are missing definitions or duplicate definitions. The former means that either the definitions don’t exist (i.e. they are not written), or that the object files or libraries where they reside were not given to the linker. The latter is obvious: the same symbol was defined in two different object files or libraries.
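Symbol resolution and its two typical errors can be modelled with a small lookup table. This is a toy sketch, not how a real linker is implemented: the struct, the function names, and the error codes are hypothetical, and the addresses are the ones that appear in the symbol table of a.out in this guide.

```c
#include <stddef.h>
#include <string.h>

/* One defined symbol: a name and the address assigned to it. */
struct definition {
    const char *name;
    unsigned long address;
};

#define RESOLVE_MISSING   (-1L) /* no definition found */
#define RESOLVE_DUPLICATE (-2L) /* more than one definition found */

/* Look up `name` among `count` definitions.
 * Returns the address, or one of the negative error codes above. */
long resolve_symbol(const struct definition *defs, size_t count, const char *name) {
    long found = RESOLVE_MISSING;
    for (size_t i = 0; i < count; i++) {
        if (strcmp(defs[i].name, name) != 0) {
            continue;
        }
        if (found != RESOLVE_MISSING) {
            return RESOLVE_DUPLICATE; /* second definition of the same name */
        }
        found = (long)defs[i].address;
    }
    return found;
}

/* Definitions with the addresses listed for a.out in this guide. */
static const struct definition guide_defs[] = {
    { "spd",  0x201010UL },
    { "min",  0x61bUL },
    { "main", 0x5faUL },
};

long resolve_in_guide(const char *name) {
    return resolve_symbol(guide_defs, sizeof guide_defs / sizeof guide_defs[0], name);
}
```

Looking up min yields its address, a name that was never defined yields the missing-definition code, and a table that lists the same name twice yields the duplicate-definition code.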
Run this command:
gcc source.o mylib.o
The output binary file can also be inspected using objdump.
Symbol Table:
objdump -t a.out
You will see that addresses are now assigned to the spd object and to the main and min functions:
a.out: file format elf64-x86-64
SYMBOL TABLE:
...
0000000000000000 l df *ABS* 0000000000000000 source.c
0000000000000000 l df *ABS* 0000000000000000 mylib.c
0000000000000000 l df *ABS* 0000000000000000 crtstuff.c
0000000000201010 g O .data 0000000000000004 spd
000000000000061b g F .text 0000000000000016 min
00000000000005fa g F .text 0000000000000021 main
...
Disassembly:
objdump -D a.out
All functions and objects now have assigned addresses, so the call to min() is completed with the address of the min() function.
00000000000005fa <main>:
5fa: 55 push %rbp
5fb: 48 89 e5 mov %rsp,%rbp
5fe: 8b 05 0c 0a 20 00 mov 0x200a0c(%rip),%eax # load spd
604: 83 c0 01 add $0x1,%eax # add 1 to spd
607: be 0a 00 00 00 mov $0xa,%esi # load 10
60c: 89 c7 mov %eax,%edi # load (spd+1)
60e: e8 08 00 00 00 callq 61b <min> # call min(): ok
613: 89 05 f7 09 20 00 mov %eax,0x2009f7(%rip) # save to spd
619: eb e3 jmp 5fe <main+0x4> # loop back
000000000000061b <min>:
61b: 55 push %rbp
61c: 48 89 e5 mov %rsp,%rbp
61f: 89 7d fc mov %edi,-0x4(%rbp) # store 1st argument (spd+1)
622: 89 75 f8 mov %esi,-0x8(%rbp) # store 2nd argument (10)
625: 8b 45 fc mov -0x4(%rbp),%eax # load 1st argument
628: 39 45 f8 cmp %eax,-0x8(%rbp) # compare 2nd argument with 1st
62b: 0f 4e 45 f8 cmovle -0x8(%rbp),%eax # if 2nd <= 1st, use 2nd
62f: 5d pop %rbp
630: c3 retq
631: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
638: 00 00 00
63b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0000000000201010 <spd>:
201010: 05 .byte 0x5
201011: 00 00 add %al,(%rax)
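The bytes that the linker filled in can be checked with a little arithmetic. The sketch below assumes the usual x86-64 encoding, in which a call with opcode e8 is 5 bytes long and displacements are relative to the address of the next instruction; the function names are hypothetical.

```c
#include <stdint.h>

/* x86-64 `call rel32` (opcode e8): 1 opcode byte + 4 displacement bytes. */
#define CALL_REL32_LENGTH 5

/* Displacement stored in the instruction: target minus the address
 * of the instruction that follows the call. */
int32_t call_displacement(uint64_t call_address, uint64_t target_address) {
    return (int32_t)(target_address - (call_address + CALL_REL32_LENGTH));
}

/* Address reached by a RIP-relative operand: address of the next
 * instruction plus the signed displacement. */
uint64_t rip_relative_target(uint64_t insn_address, unsigned insn_length,
                             int32_t displacement) {
    return insn_address + insn_length + (uint64_t)(int64_t)displacement;
}
```

For the call at 0x60e to min at 0x61b the displacement is 0x61b - 0x613 = 8, which matches the bytes e8 08 00 00 00; in source.o the same field was still 00 00 00 00. Likewise, both 6-byte mov instructions reach spd: 0x604 + 0x200a0c and 0x619 + 0x2009f7 are both 0x201010.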
Exercise
The guide above shows a case of static linking, where mylib.o is linked with source.o.
How about the dynamic linking case?
Consider the simple program below: how does the compiler link the printf() function?
#include <stdio.h>
int main() {
    printf("Hello World!\n");
    return 0;
}
Further reading
The output executable file is written in the Executable and Linkable Format (ELF), a common standard file format for executable files, object code, shared libraries, and core dumps.
By design, the ELF format is flexible, extensible, and cross-platform. For instance, it supports different endiannesses and address sizes, so it does not exclude any particular central processing unit (CPU) or instruction set architecture. This has allowed it to be adopted by many operating systems on different hardware platforms.
Use objdump to inspect an executable file, and compare the result with the ELF format.
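As a starting point for that comparison, the first 16 bytes of an ELF file (the e_ident array) already record the address size and the endianness. The sketch below checks them in a byte buffer; the function and constant names are hypothetical, and reading the bytes from a file is left out.

```c
#include <stddef.h>

enum {
    IDENT_CLASS_OFFSET = 4, /* e_ident[EI_CLASS]: 1 = 32-bit, 2 = 64-bit */
    IDENT_DATA_OFFSET  = 5, /* e_ident[EI_DATA]: 1 = little-endian, 2 = big-endian */
    IDENT_SIZE         = 16 /* length of e_ident */
};

/* Returns 1 if the buffer starts with the ELF magic bytes 0x7f 'E' 'L' 'F'. */
int is_elf(const unsigned char *buf, size_t len) {
    return len >= IDENT_SIZE &&
           buf[0] == 0x7f && buf[1] == 'E' && buf[2] == 'L' && buf[3] == 'F';
}

/* Returns 32 or 64 for a recognised ELF header, 0 otherwise. */
int elf_address_bits(const unsigned char *buf, size_t len) {
    if (!is_elf(buf, len)) {
        return 0;
    }
    switch (buf[IDENT_CLASS_OFFSET]) {
    case 1:  return 32;
    case 2:  return 64;
    default: return 0;
    }
}

/* Returns 1 if the header declares little-endian data, 0 otherwise. */
int elf_is_little_endian(const unsigned char *buf, size_t len) {
    return is_elf(buf, len) && buf[IDENT_DATA_OFFSET] == 1;
}
```

To try it, read the first 16 bytes of a.out into a buffer (for example with fread) and pass them to these functions. Since objdump reports a.out as elf64-x86-64, the expected answers are 64 bits and little-endian.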