Assembly (RISC-V)
Assembly programs are encoded as plain text files and contain four main elements:
- Comments: comments are textual notes that document the code.
- Labels: labels are "markers" that represent program locations.
- Instructions: Assembly instructions are instructions that are converted by the assembler into machine instructions.
- Directives: Assembly directives are commands used to coordinate the assembling process.
Assembly Language
Labels
Labels are "markers" that represent program locations. They can be inserted into an assembly program to "mark" a program position so that it can be referred to by assembly instructions.
Assemblers usually accept two kinds of labels: symbolic and numeric labels.
Symbolic labels are defined by an identifier followed by `:`.
They are stored as symbols in the symbol table and are often used to identify global variables and routines.
Numeric labels are defined by a single decimal digit followed by `:`.
They are used for local reference and are not included in the symbol table of executable files. They can be redefined repeatedly in the same assembly program.
References to numeric labels contain a suffix that indicates whether the reference is to a numeric label positioned before (`b` suffix) or after (`f` suffix) the reference.
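As an illustration of both kinds of labels, here is a minimal sketch of a routine that adds the integers from 1 up to the value in `a0`. The routine name `sum_to_n` is hypothetical and chosen only for this example. The numeric label `1` is defined twice, and each reference picks the nearest definition in the direction given by its suffix.

```asm
    .text
sum_to_n:               # symbolic label (hypothetical routine name)
    li   t0, 0          # t0 = running total
1:                      # numeric label: loop head
    beqz a0, 1f         # forward reference: the next "1:" below (loop exit)
    add  t0, t0, a0     # total += a0
    addi a0, a0, -1     # a0 -= 1
    j    1b             # backward reference: the nearest "1:" above (loop head)
1:                      # the same numeric label, redefined: loop exit
    mv   a0, t0         # return the total in a0
    ret
```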
Symbols
Program symbols are "names" that are associated with numerical values and the symbol table is a data structure that maps each program symbol to its value.
Labels are automatically converted into program symbols by the assembler, and each one is associated with a numerical value that represents its position in the program, which is a memory address.
It's possible to explicitly define symbols with the `.set` (or `.equ`) directive.
References & Relocations
Each reference to a label must be replaced by an address during the assembling and linking processes. Relocation is the process in which the code and data are assigned new memory addresses so that they do not conflict with addresses coming from the other linked sources.
The relocation table is a data structure that describes how the program instructions and data need to be modified to reflect the address reassignment. Each object file contains a relocation table, and the linker uses this information to adjust the code when performing the relocation process.
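The following sketch shows a reference that the assembler cannot resolve on its own. It assumes that `counter` is a label defined in some other source file; both `counter` and `get_counter` are hypothetical names. Because the final address of `counter` is unknown while this file is being assembled, the assembler leaves a relocation entry for the linker to process.

```asm
    .text
    .globl get_counter
get_counter:
    la   a0, counter    # address of counter is filled in at link time
    lw   a0, 0(a0)      # load the value stored at that address
    ret
```

Assuming a GNU toolchain, running the toolchain's `objdump -r` on the resulting object file lists the relocation entries recorded for these instructions.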
Global vs Local Symbols
Symbols are classified as local or global symbols.
Local symbols are only visible within the file that defines them, i.e., the linker does not use them to resolve undefined references in other files.
Global symbols, on the other hand, are used by the linker to resolve undefined references in other files.
By default, the assembler registers labels as local symbols. The `.globl` directive instructs the assembler to register a label as a global symbol.
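A minimal sketch of the difference, using two hypothetical variable names: `counter` is exported with `.globl`, while `scratch` keeps the default local visibility.

```asm
    .data
    .globl counter      # counter becomes a global symbol
counter:
    .word 0             # other files may refer to counter
scratch:                # local by default: only this file can refer to scratch
    .word 0
```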
Program Entry Point
Every program has an entry point: the point from which the CPU must start executing the program. The entry point is defined by an address, which is the address of the first instruction that must be executed.
Note: the `_start` label must be registered as a global symbol for the linker to recognize it as the entry point.
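A minimal sketch of a complete program with an entry point, assuming a Linux target (where system call number 93 is `exit` on RISC-V) and a linker that uses `_start` as the entry symbol:

```asm
    .text
    .globl _start       # export _start so the linker can find the entry point
_start:
    li   a0, 0          # exit status 0
    li   a7, 93         # Linux RISC-V system call number for exit (assumed target)
    ecall               # ask the operating system to terminate the program
```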
Program Sections
Executable files, object files, and assembly programs are usually organized in sections.
A section may contain data or instructions, and the contents of each section are mapped to a set of consecutive main memory addresses.
The following sections are often present in executable files generated for Linux-based systems:
- `.text`: a section dedicated to storing the program instructions.
- `.data`: a section dedicated to storing initialized global variables.
- `.bss`: a section dedicated to storing uninitialized global variables.
- `.rodata`: a section dedicated to storing constants.
When linking multiple object files, the linker groups information from sections with the same name and places them together into a single section on the executable file.
To instruct the assembler to add the assembled information into other sections, the programmer (or the compiler) may use the `.section <name>` directive.
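As a sketch of how `.section` switches the active section, the fragment below places a variable in `.data`, a constant string in `.rodata`, and code in `.text`. The names `total`, `greeting`, and `load_total` are hypothetical.

```asm
    .section .data
total:
    .word 42            # initialized global variable

    .section .rodata
greeting:
    .string "hello"     # read-only constant

    .section .text
    .globl load_total
load_total:
    la   a0, total      # address of the variable in .data
    lw   a0, 0(a0)      # a0 = 42
    ret
```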
Assembly Instructions
Assembly instructions are instructions that are converted by the assembler into machine instructions.
They are usually encoded as a string that contains a mnemonic and a sequence of parameters, known as operands.
A pseudo-instruction is an assembly instruction that does not have a corresponding machine instruction on the ISA, but can be translated automatically by the assembler into one or more alternative machine instructions to achieve the same effect.
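A few common RISC-V pseudo-instructions, with the base instruction each one stands for shown in the comments. This is a sketch that assumes the compressed extension is not in use (otherwise the assembler may pick shorter encodings). The label `done` is hypothetical.

```asm
    .text
    nop                 # addi x0, x0, 0
    mv   a0, a1         # addi a0, a1, 0
    li   a0, 10         # addi a0, x0, 10  (a small constant fits in one instruction)
    j    done           # jal  x0, done
done:
    ret                 # jalr x0, 0(ra)
```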
The operands of assembly instructions may contain:
- A register name: a register name identifies one of the ISA registers.
- An immediate value: an immediate value is a numerical constant that is directly encoded into the machine instruction as a sequence of bits.
- A symbol name: symbol names identify symbols in the symbol table and are replaced by their respective values during the assembling and linking processes. Their values are encoded into the machine instruction as a sequence of bits.
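The three kinds of operands can be sketched in a few lines. `STEP` and `value` are hypothetical names introduced for this example.

```asm
    .set STEP, 4            # hypothetical symbol with the value 4

    .data
value:                      # hypothetical label; its symbol holds an address
    .word 7

    .text
    add  a0, a1, a2         # three register names
    addi a0, a0, 16         # two register names and an immediate value
    addi a0, a0, STEP       # a symbol name, replaced by its value (4)
    la   t0, value          # a symbol name, replaced by the address of value
    lw   t1, 0(t0)          # t1 = 7
```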
Immediate Values
Immediate values are represented in assembly language by a sequence of alphanumeric characters.
- Sequences starting with the `0x` and `0b` prefixes are interpreted as hexadecimal and binary numbers, respectively.
- Octal numbers are represented by a sequence of numeric digits starting with the digit `0`.
- Sequences of numeric digits starting with digits `1` to `9` are interpreted as decimal numbers.
- Alphanumeric characters enclosed in single quotation marks are converted to numeric values using the ASCII table.
- To denote a negative integer, it suffices to add the `-` prefix.
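The notations above can be sketched with the `li` pseudo-instruction, which loads an immediate value into a register:

```asm
    .text
    li   a0, 0x1F       # hexadecimal: 31
    li   a1, 0b1010     # binary: 10
    li   a2, 017        # octal: 15
    li   a3, 42         # decimal: 42
    li   a4, 'A'        # ASCII code of the character A: 65
    li   a5, -8         # negative decimal: -8
```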
The `.<value>` Directives
The `.byte`, `.half`, `.word`, and `.dword` directives add one or more values to the active section. Their arguments may be expressed as immediate values, symbols (which are replaced by their value during the assembling and linking processes) or by arithmetic expressions that combine both.
The `.string`, `.asciz`, and `.ascii` directives add strings to the active section. The string is encoded as a sequence of bytes.
Directive | Arguments | Description |
---|---|---|
`.byte` | expr [, expr]* | Emit one or more comma-separated 8-bit values |
`.half` | expr [, expr]* | Emit one or more comma-separated 16-bit values |
`.word` | expr [, expr]* | Emit one or more comma-separated 32-bit values |
`.dword` | expr [, expr]* | Emit one or more comma-separated 64-bit values |
`.string` | string | Emit a NUL-terminated string |
`.asciz` | string | Alias for `.string` |
`.ascii` | string | Emit a string without the terminating NUL character |
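A sketch that uses each directive from the table. All label names and the `COUNT` symbol are hypothetical. The data is ordered from the largest element size to the smallest so that every value stays naturally aligned without extra directives.

```asm
    .set COUNT, 4                   # hypothetical symbol used in an expression

    .data
big:    .dword 0x1122334455667788   # one 64-bit value
primes: .word  2, 3, 5, 7           # four 32-bit values
mixed:  .word  COUNT * 2 + 1        # expression combining a symbol and immediates: 9
port:   .half  8080                 # one 16-bit value
flags:  .byte  1, 0x0F, 'x'         # three 8-bit values
msg:    .string "hi"                # bytes 'h', 'i', 0
raw:    .ascii  "hi"                # bytes 'h', 'i' (no terminator)
```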
The `.set` and `.equ` directives
The `.set name, expression` directive adds a symbol to the symbol table.
It takes a name and an expression as arguments, evaluates the expression, and stores the name and the resulting value in the symbol table.
The `.equ` directive performs the same task as the `.set` directive.
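A short sketch with two hypothetical symbols. The second one is defined by an expression that refers to the first.

```asm
    .set BUFFER_SIZE, 64                # BUFFER_SIZE = 64
    .equ MAX_INDEX, BUFFER_SIZE - 1     # MAX_INDEX = 63

    .text
    li   a0, BUFFER_SIZE                # a0 = 64
    li   a1, MAX_INDEX                  # a1 = 63
```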
The `.globl` directive
The `.globl` directive can be used to turn local symbols into global ones.
The `.skip` directive
The `.bss` section is dedicated to storing uninitialized global variables.
These variables need to be allocated in memory, but they do not need to be initialized by the loader when a program is executed. As a consequence, their initial values do not need to be stored in executable or object files.
To allocate variables in the `.bss` section, it suffices to declare a label to identify the variable and advance the `.bss` location counter by the number of bytes the variable requires, so that further variables are allocated at other addresses.
The `.skip N` directive advances the location counter by `N` bytes and can be used to allocate space for variables in the `.bss` section.
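A minimal sketch that reserves space for two hypothetical variables, a 32-bit counter and a 256-byte buffer:

```asm
    .section .bss
counter:
    .skip 4             # reserve 4 bytes for a 32-bit variable
buffer:
    .skip 256           # reserve 256 bytes; buffer starts 4 bytes after counter
```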
The `.align` directive
Some ISAs require instructions or multi-byte data to be stored at addresses that are multiples of a given number.
The proper way of ensuring the location counter is aligned is by using the `.align N` directive.
The `.align N` directive checks whether the location counter is a multiple of 2^N. If it is, the directive has no effect on the program. Otherwise, it advances the location counter to the next value that is a multiple of 2^N.
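A sketch of the directive in use, with hypothetical labels and assuming the section itself starts at an address that is a multiple of 4. After the single byte, the location counter is one past a 4-byte boundary, so `.align 2` skips three bytes before the word is emitted.

```asm
    .data
tag:
    .byte 1             # location counter is now at offset 1
    .align 2            # advance to the next multiple of 2^2 = 4
value:
    .word 0x12345678    # stored at offset 4, a 4-byte aligned address
```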