You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
@@ -12,7 +12,7 @@ Required tools for playing around with memory:
*`diff`
*`cat`
We're going to go through this: https://sploitfun.wordpress.com/2015/02/10/understanding-glibc-malloc/
We're going to go through this: https://sploitfun.wordpress.com/2015/02/10/understanding-glibc-malloc/ and http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory/
There are actually many C memory allocators. And different memory allocators will layout memory in different ways. Currently glibc's memory allocator is `ptmalloc2`. It was forked from dlmalloc. After fork, threading support was added, and was released in 2006. After being integrated, code changes were made directly to glibc's malloc source code itself. So there are many changes with glibc's malloc that is different from the original `ptmalloc2`.
CMCDragonkai
revised
this gist Apr 20, 2016.
1 changed file
with
1 addition
and
1 deletion.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
@@ -14,7 +14,7 @@ Required tools for playing around with memory:
We're going to go through this: https://sploitfun.wordpress.com/2015/02/10/understanding-glibc-malloc/
There are actually man C memory allocators. And different memory allocators will layout memory in different ways. Currently glibc's memory allocator is `ptmalloc2`. It was forked from dlmalloc. After fork, threading support was added, and was released in 2006. After being integrated, code changes were made directly to glibc's malloc source code itself. So there are many changes with glibc's malloc that is different from the original `ptmalloc2`.
There are actually many C memory allocators. And different memory allocators will layout memory in different ways. Currently glibc's memory allocator is `ptmalloc2`. It was forked from dlmalloc. After fork, threading support was added, and was released in 2006. After being integrated, code changes were made directly to glibc's malloc source code itself. So there are many changes with glibc's malloc that is different from the original `ptmalloc2`.
The malloc in glibc, internally invokes either `brk` or `mmap` syscalls to acquire memory from the OS. The `brk` syscall is generally used to increase the size of the heap, while `mmap` will be used to load shared libraries, create new regions for threads, and many other things. It actually switches to using `mmap` instead of `brk` when the amount of memory requested is larger than the `MMAP_THRESHOLD`. We view which calls are being made by using `strace`.
CMCDragonkai
revised
this gist Apr 20, 2016.
1 changed file
with
12 additions
and
1 deletion.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The first step is to go through this: https://sploitfun.wordpress.com/2015/02/10/understanding-glibc-malloc/
Required tools for playing around with memory:
*`hexdump`
*`objdump`
*`readelf`
*`xxd`
*`gcore`
*`strace`
*`diff`
*`cat`
We're going to go through this: https://sploitfun.wordpress.com/2015/02/10/understanding-glibc-malloc/
There are actually man C memory allocators. And different memory allocators will layout memory in different ways. Currently glibc's memory allocator is `ptmalloc2`. It was forked from dlmalloc. After fork, threading support was added, and was released in 2006. After being integrated, code changes were made directly to glibc's malloc source code itself. So there are many changes with glibc's malloc that is different from the original `ptmalloc2`.
CMCDragonkai
revised
this gist Apr 20, 2016.
1 changed file
with
1 addition
and
3 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
We now see our first `[heap]` region. It exactly: 135168 B - 132 KiB. Exactly the same as our stack size! Remember we allocated specifically 1000 bytes: `addr = (char *) malloc(1000);` at the beginning. So how did a 1000 Bytes become 132 Kibibytes? Well as we said before, anything smaller than `MMAP_THRESHOLD` uses `brk` syscall. It appears that the `brk`/`sbrk` is called with a padded size in order to reduce the number of syscalls made and the number of context switches. Most programs most likely require more heap than 1000 Bytes, so the system may as well pad the `brk` call to cache some heap memory, and new `brk` or `mmap` calls for heap increase will only occur once you exhaust the padded heap of 132 KiB. The padding calculation is done with:
We now see our first `[heap]` region. It exactly: 135168 B - 132 KiB. (Currently the same as our stack size!) Remember we allocated specifically 1000 bytes: `addr = (char *) malloc(1000);` at the beginning. So how did a 1000 Bytes become 132 Kibibytes? Well as we said before, anything smaller than `MMAP_THRESHOLD` uses `brk` syscall. It appears that the `brk`/`sbrk` is called with a padded size in order to reduce the number of syscalls made and the number of context switches. Most programs most likely require more heap than 1000 Bytes, so the system may as well pad the `brk` call to cache some heap memory, and new `brk` or `mmap` calls for heap increase will only occur once you exhaust the padded heap of 132 KiB. The padding calculation is done with:
```
/* Request enough space for nb + pad + overhead */
CMCDragonkai
revised
this gist Apr 20, 2016.
1 changed file
with
3 additions
and
2 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
That is roughly a massive step of 127 Tebibytes. Why such a large jump in the address space? Well this is where the malloc implementation comes in. This documentation https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt shows how the memory is structured in this way:
> ```
> Virtual memory map with 4 level page tables:
>
> 0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
> hole caused by [48:63] sign extension
> 0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm hole caused by [48:63] sign extension
What we can see is that, is that Linux's memory map reserves the first `0000000000000000 - 00007fffffffffff` as user space memory. It turns out that 47 bits is enough to reserve roughly 128 TiB. http://unix.stackexchange.com/a/64490/56970
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The first step is to go through this: https://sploitfun.wordpress.com/2015/02/10/understanding-glibc-malloc/
There are actually man C memory allocators. And different memory allocators will layout memory in different ways. Currently glibc's memory allocator is `ptmalloc2`. It was forked from dlmalloc. After fork, threading support was added, and was released in 2006. After being integrated, code changes were made directly to glibc's malloc source code itself. So there are many changes with glibc's malloc that is different from the original `ptmalloc2`.
The malloc in glibc, internally invokes either `brk` or `mmap` syscalls to acquire memory from the OS. The `brk` syscall is generally used to increase the size of the heap, while `mmap` will be used to load shared libraries, create new regions for threads, and many other things. It actually switches to using `mmap` instead of `brk` when the amount of memory requested is larger than the `MMAP_THRESHOLD`. We view which calls are being made by using `strace`.
In the old days using `dlmalloc`, when 2 threads call malloc at the same time, only one thread can enter the critical section, the freelist data structure of memory chunks is shared among all available threads. Hence memory allocation is a global locking operation.
However in ptmalloc2, when 2 threads call malloc at the same time, memory is allocated immediately, since each thread maintains a separate heap, and their own freelist chunk data structure
The act of maintain separate heap and freelists for each thread is called "per-thread arena".
In the last session, we identified that a program memory layout is generally in:
```
User Stack
|
v
Memory Mapped Region for Shared Libraries or Anything Else
^
|
Heap
Uninitialised Data (.bss)
Initialised Data (.data)
Program Text (.text)
0
```
For the purpose of understanding, most of the tools that investigate memory put the low address at the top and the high address at the bottom.
Therefore, it's easier to think of it like this:
```
0
Program Text (.text)
Initialised Data (.data)
Uninitialised Data (.bss)
Heap
|
v
Memory Mapped Region for Shared Libraries or Anything Else
^
|
User Stack
```
The problem is, we didn't have a proper grasp over what exactly is happening. And the above diagram is too simple to fully understand.
Let's write some C programs and investigate their memory structure.
> Note that neither direct compilation or assembly actually produce an executable. That is done by the linker, which takes the various object code files produced by compilation/assembly, resolves all the names they contain and produces the final executable binary.
> http://stackoverflow.com/a/845365/582917
This is our first program (compile it using `gcc -pthread memory_layout.c -o memory_layout`:
```c
#include<stdio.h>// standard io
#include<stdlib.h>// C standard library
#include<pthread.h>// threading
#include<unistd.h>// unix standard library
#include<sys/types.h>// system types for linux
// getchar basically is like "read"
// it prompts the user for input
// in this case, the input is thrown away
// which makes similar to a "pause" continuation primitive
// but a pause that is resolved through user input, which we promptly throw away!
void * thread_func (void * arg) {
printf("Before malloc in thread 1\n");
getchar();
char * addr = (char *) malloc(1000);
printf("After malloc and before free in thread 1\n");
getchar();
free(addr);
printf("After free in thread 1\n");
getchar();
}
int main () {
char * addr;
printf("Welcome to per thread arena example::%d\n", getpid());
printf("Before malloc in the main thread\n");
getchar();
addr = (char *) malloc(1000);
printf("After malloc and before free in main thread\n");
getchar();
free(addr);
printf("After free in main thread\n");
getchar();
// pointer to the thread 1
pthread_t thread_1;
// pthread_* functions return 0 upon succeeding, and other numbers upon failing
The usage of `getchar` above is to basically pause the computation waiting for user input. This allows us to step through the program, when examining its memory layout.
The usage of `pthread` is for creating POSIX threads, which are real kernel threads being scheduled on the Linux OS. The thing si, the usage of threads is interesting for examining how a process memory layout is utilised for many threads. It turns out that each thread requires its own heap and stack.
The `pthread` functions kind of weird, because they return a 0 based status code upon success. This is the success of a `pthread` operation, which does involve a side effect on the underlying operating system.
As we can see above, there are many uses of the reference error pattern, that is, instead of returning multiple values (through tuples), we use reference containers to store extra metadata or just data itself.
Now, we can run the program `./memory_layout` (try using `Ctrl + Z` to suspend the program):
```
$ ./memory_layout
Welcome to per thread arena example::1255
Before malloc in the main thread
```
At this point, the program is paused, we can now inspect the memory contents by looking at `/proc/1255/maps`. This is a kernel supplied virtual file that shows the exact memory layout of the program. It actually summarises each memory section, so it's useful for understanding how memory is layed out without necessarily being able to view a particular byte address.
```
$ cat /proc/1255/maps # you can also use `watch -d cat /proc/1255/maps` to get updates
Each row in /proc/$PID/maps describes a region of contiguous virtual memory in a process. Each row has the following fields:
* address - starting and ending address in the region's process address space
* perms - describes how pages can be accessed `rwxp` or `rwxs`, where `s` meaning private or shared pages, if a process attempts to access memory not allowed by the permissions, a segmentation fault occurs
* offset - if the region was mapped by a file using `mmap`, this is the offset in the file where the mapping begins
* dev - if the region was mapped from a file, this is major and minor device number in hex where the file lives, the major number points to a device driver, and the minor number is interpreted by the device driver, or the minor number is the specific device for a device driver, like multiple floppy drives
* inode - if the region was mapped from a file, this is the file number
* pathname - if the region was mapped from a file, this is the name of the file, there are special regions with names like [heap], [stack] and [vdso], [vdso] stands for virtual dynamic shared object, its used by system calls to switch to kernel mode
Some regions don't have any file path or special names in the pathname field, these are anonymous regions. Anonymous regions are created by mmap, but are not attached to any file, they are used for miscellaneous things like shared memory, buffers not on the heap, and the pthread library uses anonymous mapped regions as stacks for new threads.
There's no 100% guarantee that contiguous virtual memory means contiguous physical memory. To do this, you would have to use an OS with no virtual memory system. But it's a good chance that contiguous virtual memory does equal contiguous physical memory, at least there's no pointer chasing. Still at the hardware level, there's a special device for virtual to physical memory translation. So it's still very fast.
It is quite important to use the `bc` tool because we need to convert between hex and decimal quite often here. We can use it like: `bc <<< 'obase=10; ibase=16; 4010000 - 4000000'`, which essentially does a `4010000 - 4000000` subtraction using hex digits, and then convers the result to base 10 decimal.
Just a side note about the major minor numbers. You can use `ls -l /dev | grep 252` or `lsblk | grep 252` to look up devices corresponding to a `major:minor` number. Where `0d252 ~ 0xfc`.
This lists all the major and minor number allocations for Linux Device Drivers: http://www.lanana.org/docs/device-list/devices-2.6+.txt
It also shows that anything between 240 - 254 is for local/experimental usecase. Also 232 - 239 is unassigned. And 255 is reserved. We can identify that the device in question right now is a device mapper device. So it is using the range reserved for local/experimental use. Major and minor numbers only go up to 255 because it's the largest decimal natural in a single byte. A single byte is: `0b11111111` or `0xFF`. A single hex digit is a nibble. 2 hex digits is a byte.
The first thing to realise, is that the memory addresses starts from low to high, but every time you run this program, many of the regions will have different addresses. So that means for some regions, addresses aren't statically allocated. This is actually due to a security feature, by randomising the address space for certain regions, it makes it more difficult for attackers to acquire a particular piece of memory they are interested in. However there are regions that are always fixed, because you need them to be fixed so you know how to load the program. We can see that program data and executable memory is always fixed along with `vsyscall`. It is actually possible to create what people call a "PIE" (position independent executable), which actually even makes the program data and executable memory randomised as well, however this is not enabled by default, and it also prevents the program from being compiled statically, forcing it to be linked (https://sourceware.org/ml/binutils/2012-02/msg00249.html). Also "PIE" executables incur some performance problems (different kinds of problems on 32 bit vs 64 bit computers). The randomisation of the address for some of the regions is called "PIC" (position independent code), and has been enabled by default on Linux for quite some time. For more information, see: http://blog.fpmurphy.com/2008/06/position-independent-executables.html and http://eli.thegreenplace.net/2011/08/25/load-time-relocation-of-shared-libraries
It is possible to compile the above program using `gcc -fPIE -pie ./hello.c -o hello`, which produces a "PIE" executable. There's some talk on nixpkgs to make compilation to "PIE" by default for 64 bit binaries, but 32 bit binaries remain unPIEd due to serious performance problems. See: https://github.com/NixOS/nixpkgs/issues/7220
BTW: Wouldn't it be great if we had a tool that examined `/proc/$PID/maps` and gave exact human readable byte sizes instead?
So let's go into detail for each region. Remember, this is still at the beginning of the program, where no `malloc` has occurred, so there is no `[heap]` region.
```
0 - 400000 - 4194304 B - 4096 KiB ~ 4 MiB - NOT ALLOCATED
400000 - 401000 - 4096 B - 4 KiB
600000 - 601000 - 4096 B - 4 KiB
601000 - 602000 - 4096 B - 4 KiB
```
This is our initial range of memory. I added an extra component which starts at `0` address and reaches `40 00 00` address. Addresses appear to be left inclusive, right exclusive. But remember that addresses start at 0. Therefore it is legitimate to use `bc <<< 'obase=10;ibase=16 400000 - 0'` to acquire the actual amount of bytes in that range without adding or subtracing 1. In this case, the first unallocated region is roughly 4 MiB. When I say unallocated, I mean it is not represented in the the `/proc/$PID/maps`. This could mean either of 2 things, either the file doesn't show all the allocated memory, or it doesn't consider such memory to be worth showing, or there is really no memory allocated there.
We can find out whether there really is memory there, by creating a pointer to the memory address somewhere between `0` and `400000`, and try dereferencing it. This can be done by casting an integer into a pointer. I've tried it before, and it results in a segfault, this means there really is no memory allocated between `0-400000`
```c
#include <stdio.h>
int main () {
// 0x0 is hex literal that defaults to signed integer
// here we are casting it to a void pointer
// and then assigning it to a value declared to be a void pointer
// this is the correct way to create an arbitrary pointer in C
void * addr = (void *) 0x0;
// in order to print out what exists at that pointer, we must dereference the pointer
// but C doesn't know how to handle a value of void type
// which means, we recast the void pointer to a char pointer
// a char is some arbitrary byte, so hopefully it's a printable ASCII value
// actually, we don't need to hope, because we have made printf specifically print the hex representation of the char, therefore it does not need to be a printable ascii value
printf("0x%x\n", ((char *) addr)[0]); // prints 0x0
printf("0x%x\n", ((char *) addr)[1]); // prints 0x1
printf("0x%x\n", ((char *) addr)[2]); // prints 0x2
}
```
Running the above gives us a simple `segmentation fault`. Thus proving that `/proc/$PID/maps` is giving us the truth, there really is nothing between `0-400000`.
The question becomes, why is there this roughly 4 MiB gap? Why not start allocating memory from 0? Well this was just an arbitrary choice by the malloc and linker implementors. They just decided that that on 64 bit ELF executables, the entry point of a non-PIE executable should be `0x400000`, whereas for 32 bit ELF executables, the entry point is `0x08048000`. An interesting fact is that if you produce a position independent executable, the starting address instead changes to `0x0`.
> The entry address is set by the link editor, at the time when it creates the executable. The loader maps the program file at the address(es) specified by the ELF headers before transferring control to the entry address.
> The load address is arbitrary, but was standardized back with SYSV for x86. It's different for every architecture. What goes above and below is also arbitrary, and is often taken up by linked in libraries and mmap() regions.
What it basically means is that the program executable is loaded into memory before it can start doing things. The entry point of the executable can be acquired by `readelf`. But here's another question, why is the entry point given by `readelf`, not at `0x400000`. It turns out that, that entry point is consider the actual point where the OS should start executing, whereas the position between `0x400000` and entry point is used for the EHDR and PHDR, meaning ELF headers and Program Headers. We'll look into this in detail later.
```
$ readelf --file-header ./memory_layout | grep 'Entry point address'
Entry point address: 0x400720
```
Next we have the:
```
400000 - 401000 - 4096 B - 4 KiB
600000 - 601000 - 4096 B - 4 KiB
601000 - 602000 - 4096 B - 4 KiB
```
As you can see, we have 3 sections of memory, each of them 4 KiB, and allocated from `/home/vagrant/c_tests/memory_layout`.
What are these sections?
The first segment: "Text Segment".
The second segment: "Data Segment".
The third segment: "BSS Segment".
Text segment stores the binary image of the process. The data segment stores static variables initialised by the programmer, for example `static char * foo = "bar";`. The BSS segment stores uninitialised static variables, which are filled with zeros, for example `static char * username;`.
Our program is so simple right now, that each seemingly fits perfectly into 4 KiB. How can it be so perfect!?
Well, the page size of the Linux OS and many other OSes, is set to 4 KiB by default. This means the minimum addressable memory segment is 4 KiB. See: https://en.wikipedia.org/wiki/Page_%28computer_memory%29
> A page, memory page, or virtual page is a fixed-length contiguous block of virtual memory, described by a single entry in the page table. It is the smallest unit of data for memory management in a virtual memory operating system.
Running `getconf PAGESIZE` shows 4096 bytes.
Therefore, this means each segment is probably far smaller than `4096` bytes, but it gets padded up to 4096 bytes.
As we shown before, it's possible to create an arbitrary pointer, and print out the value stored that byte. We can now do this for the segments shown above.
But hey, we can do better. Rather than just hacking the individual bytes. We can recognise that this data is actually organised as structs.
What kind of structs? We can look at the `readelf` source code to uncover the relevant structs. These structs don't appear to be part of the standard C library, so we cannot just include things to get this working. But the code is simple, so we can just copy and paste. See: http://rpm5.org/docs/api/readelf_8h-source.html
Check this out:
```c
// compile with gcc -std=c99 -o elfheaders ./elfheaders.c
printf("Offset: 0x%lx\n", phdr_addr->p_offset); // 0x40 - byte offset from the beginning of the file at which the first segment is located
printf("Program Virtual Address: %p\n", (void *) phdr_addr->p_vaddr); // 0x400040 - virtual address at which the first segment is located in memory
printf("Program Physical Address: %p\n", (void *) phdr_addr->p_paddr); // 0x400040 - physical address at which the first segment is located in memory (irrelevant on Linux)
printf("Loaded file size: 0x%lx\n", phdr_addr->p_filesz); // 504 - bytes loaded from the file for the PHDR
printf("Loaded mem size: 0x%lx\n", phdr_addr->p_memsz); // 504 - bytes loaded into memory for the PHDR
We've basically just written our own little `readelf` program.
So it's starting to make sense as to what's actually located at the start of `0x400000 - 0x401000`, it's all the ELF executable headers that tell the OS how to use this program, and all the other interesting metadata. Specifically this is about what's located between the program's actual entry point (for `./elfheader`: `0x400490` and for `./memory_layout`: `0x400720`) and the actual start of memory at `0x400000`. There are more program headers to study, but this is enough for now. See: http://www.ouah.org/RevEng/x430.htm
But where does the OS actually get this data from? It has to acquire this data, before it put into memory. Well it turns out the answer is very simple. It's just the file itself.
Let's use `hexdump` to view the actual binary contents of the file, and also later use `objdump` to disassemble it to assembly to make some sense of the machine code.
Obviously the starting memory address, wouldn't be starting file address. So instead of `0x400000`, files should most likely start at `0x0`.
```
$ hexdump -C -s 0x0 ./memory_layout # the -s option is just for offset, it's actually redundant here, but will be useful later
So it turns out that one can say that `0x400000` for a non-PIE 64 bit ELF executable compiled by `gcc` on Linux is the exact the same starting point as the `0x0` for the actual executable file itself.
The file headers are indeed being loaded into memory. But can we tell if the entire file is loaded into memory or not? Let's first check the file size.
```
$ stat memory_layout | grep Size
Size: 8932 Blocks: 24 IO Block: 4096 regular file
```
Shows that the file is 8932 bytes, which is roughly 8.7 KiB.
Our memory layout showed that at most 4 KiB + 4 KiB + 4 KiB was mapped from the `memory_layout` executable file.
There's plenty of space, and certainly enough to fit the entire contents of the file.
But we can prove this by iterating over the entire memory contents, and check the relevant offsets in memory to see if they match the contents on file.
To do this, we need to investigate `/proc/$PID/mem`. However it's not a normal file that you can cat from, but you must do some interesting syscalls to get some output from it. There's no standard unix tool to read from it, instead we would need to write a C program read it. There's an example program here: http://unix.stackexchange.com/a/251769/56970
Fortunately, there's a thing called `gdb`, and we can use `gcore` to dump a process's memory contents onto disk. It require super user privileges, because we are essentially accessing a process's memory, and memory is usually meant to be isolated!
```
$ sudo gcore 1255
Program received signal SIGTTIN, Stopped (tty input).
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f849c407350 in read () from /lib/x86_64-linux-gnu/libc.so.6
Saved corefile core.1255
[1]+ Stopped ./memory_layout
```
This produces a file called `core.1255`. This file is the memory dump, so to view it, we must use `hexedit`.
```
$ hexdump -C ./core.1255 | less
```
Now that we have the entire memory contents. Lets try and compare it with the executable file itself. Before we can do that, we must turn our binary file into printable ASCII. Essentially ASCII armoring the binary programs. The `xxd` is better for this purpose because `hexdump` gives us `|` characters which can give confusing output when we use `diff`.
```
$ xxd ./core.1255 > ./memory.hex
$ xxd ./memory_layout > ./file.hex
```
Immediately we can see that 2 sizes are not the same. The `./memory.hex` ~ 1.1 MiB is far larger than the `./file.hex` ~ 37 KiB. This is because the memory dump also contains all the shared libraries and the anonymously mapped regions. But we are not expecting them to be the same, just whether the entire file itself exists in memory.
Both files can now be compared using `diff`.
```
$ diff --side-by-side ./file.hex ./memory.hex # try piping into less
What this shows us that even though there are some similiarities between the memory contents and the file contents, they are not exactly the same. In fact we see a deviation between the 2 dumps starting from 17 bytes, which is just pass the ELF magic bytes.
This shows that even though there's a mapping from the file to memory, it isn't exactly the same bytes. Either that, or somewhere in the dumping and hex translation, bytes got changed. It's hard to tell at this point in time.
Anyway moving on, we can alo use `objdump` to disassemble the executable file to see the actual assembly instructions that exist at the file. One thing to note, is that `objdump` uses the program's virtual memory address, as if were to be executed. It's not using the actual address at the file. Since we know the memory regions from `/proc/$PID/maps`, we can inspect the first `400000 - 401000` region.
```
$ objdump --disassemble-all --start-address=0x000000 --stop-address=0x401000 ./memory_layout # use less of course
Unlike `gcore` or manually dereferencing arbitrary pointers, we can see that `objdump` cannot or will not show us memory contents from `400000 - 400238`. Instead it starts showing from `400238`. This is because the stuff from `400000 - 400238` are not assembly instructions, they are just metadata, so `objdump` doesn't bother with them as it's designed to dump assembly code. Another thing to understand is that the elipsis ` ...` (not shown in the above example) (not to be confused with my own `...` meaning the output is an excerpt) mean null bytes. The `objdump` shows a byte by byte machine code with its decompiled equivalent assembly instruction. This is a disassembler, so the output assembly is not exactly what a human would write, because there can be optimisation, and lots of semantic information is thrown away. It's important to note that the hex addresses on the right represent the starting byte address, if there are multiple hex byte digits on right, that means they are joined up as a single assembly instruction. So the gap between `400251 - 400254` is represented by the 3 hex bytes at `2e 32 00`.
Let's jump to somewhere interesting, such as the actual "entry point" `0x400720` as reported by `readelf --file-header ./memory_layout`.
```
$ objdump --disassemble-all --start-address=0x000000 --stop-address=0x401000 ./memory_layout | less +/400720
Scrolling a bit up, we see that `objdump` reports this as the actual `.text` section, and at `400720`, this is the entry point of the program. What we have here, is the real first "procedure" that is executed by the CPU, the function behind the `main` function. And I think you get to play with this directly in C when you eschew the runtime library to produce a stand alone C executable. The assembly here is x86 64 bit assembly (https://en.wikipedia.org/wiki/X86_assembly_language), which I guess is meant to run on backwards compatible Intel/AMD 64 bit processors. I don't know anymore about this particular assembly, so we'll have to study it later in http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
What about our other 2 sections (we can see that there's a skip of `401000 - 600000`, this can also just be an arbitrary choice from the linker implementation):
```
600000 - 601000 - 4096 B - 4 KiB
601000 - 602000 - 4096 B - 4 KiB
```
```
$ objdump --disassemble-all --start-address=0x600000 --stop-address=0x602000 ./memory_layout | less
```
There isn't much to talk about right now. It seems that `0x600000` contains more data and assembly. But the actual addresses for `.data` and `.bss` appears to be:
```
Disassembly of section .data:
0000000000601068 <__data_start>:
...
0000000000601070 <__dso_handle>:
...
Disassembly of section .bss:
0000000000601078 <__bss_start>:
...
```
It turns out we don't have anything in `.data` and `.bss`. This is because we do not have any static variables in our `./memory_layout.c` program!
To recap, our initial understanding of memory layout was:
```
0
Program Text (.text)
Initialised Data (.data)
Uninitialised Data (.bss)
Heap
|
v
Memory Mapped Region for Shared Libraries or Anything Else
^
|
User Stack
```
Now we realise that it's actually:
```
0
Nothing here, because it was just an arbitrary choice by the linker
ELF and Program and Section Headers - 0x400000 on 64 bit
Program Text (.text) - Entry Point as Reported by readelf
Nothing Here either
Some unknown assembly and data - 0x600000
Initialised Data (.data) - 0x601068
Uninitialised Data (.bss) - 0x601078
Heap
|
v
Memory Mapped Region for Shared Libraries or Anything Else
^
|
User Stack
```
Ok let's move on. After our executable file memory, we have a huge jump from `601000 - 7f849c31b000`.
That is roughly a massive step of 127 Tebibytes. Why such a large jump in the address space? Well this is where the malloc implementation comes in. This documentation https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt shows how the memory is structured in this way:
> Virtual memory map with 4 level page tables:
>
> 0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
What we can see is that, is that Linux's memory map reserves the first `0000000000000000 - 00007fffffffffff` as user space memory. It turns out that 47 bits is enough to reserve roughly 128 TiB. http://unix.stackexchange.com/a/64490/56970
Well if we take look at the first and last of these sections of memory:
It appears that these regions are almost at the very bottom of the 128 TiB range reserved for user space memory. Considering that there's a 127 TiB gap, this basically means that our malloc uses the user space range `0000000000000000 - 00007fffffffffff` from both ends. From the low end, it grows the heap upwards (upwards in address numbers). While at the high end, it grows the stack downwards (downwards in address numbers).
At the same time, the stack is actually a fixed section of memory, so it cannot actually grow as much as the heap. On the high end, but lower than the stack, we see lots of memory regions assigned for the shared libraries and anonymous buffers which are probably used by the shared libraries.
We can also view the shared libraries that are being used by the executable. This determines which shared libraries will also be loaded into memory at startup. However remember that libraries and code can also be dynamically loaded, and cannot be seen by the linker. By the way `ldd` stands for "list dynamic dependencies".
You'll notice that if you run `ldd` multiple times, each time it prints a different address for the shared libraries. This corresponds to rerunning the program multiple times and checking that `/proc/$PID/maps` also shows different addresses for the shared libraries. This is due to the "PIE" position independent code discussed above. Basically every time you use `ldd` it does call the linker, and the linker performs address randomisation. For more information about the reasoning behind address space randomisation, see: [ASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization). Also you can check whether the kernel has enabled ASLR by running `cat /proc/sys/kernel/randomize_va_space`.
We can see that there are in fact 4 shared libraries. The `vdso` library is not loaded from the filesystem, but provided by the OS.
Also: `/lib64/ld-linux-x86-64.so.2 => /lib/x86_64-linux-gnu/ld-2.19.so`, it's a symlink.
Our initial stack size is allocated to 132 KiB. I suspect this can be changed with runtime or compiler flags.
So what is `vdso` and `vsyscall`? Both are mechanisms to allow faster syscalls, that is syscalls without context switching between user space and kernel space. The `vsyscall` has now been replaced by `vdso`, but the `vsyscall` is left there for compatibility reasons. The main differences are that:
* `vsyscall` - is always fixed to `ffffffffff600000` and maximum size of 8 MiB, even if PIC or PIE is enabled
* `vdso` - not fixed, but acts like a shared library, thus its address is subject to ASLR (address space layout randomisation)
* `vsyscall` - provides 3 syscalls: `gettimeofday` (`0xffffffffff600000`), `time` (`0xffffffffff600400`), `getcpu` (`0xffffffffff600800`), even though it given the reserved range `ffffffffff600000 - ffffffffffdfffff` of 8 MiB by Linux in 64 bit ELF executables.
* `vdso` - provides 4 syscalls: `__vdso_clock_gettime`, `__vdso_getcpu`, `__vdso_gettimeofday` and `__vdso_time`, however more syscalls can be added to `vdso` in the future.
For more info about `vdso` and `vsyscall`, see: https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html
It's worth pointing out, that now that we're past the 128 TiB reserved for user space memory, we're now looking at segments of memory provided and managed by the OS. As listed here: https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt These sections are what we're talking about.
```
ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor
ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
Out of the above sections, we currently only see the `vsyscall` region. The rest have not appeared yet.
Let's now move on with the program, and allocate our first heap. Our `/proc/$PID/maps` is now (note that the addresses have changed because I have reran the program):
We now see our first `[heap]` region. It exactly: 135168 B - 132 KiB. Exactly the same as our stack size! Remember we allocated specifically 1000 bytes: `addr = (char *) malloc(1000);` at the beginning. So how did a 1000 Bytes become 132 Kibibytes? Well as we said before, anything smaller than `MMAP_THRESHOLD` uses `brk` syscall. It appears that the `brk`/`sbrk` is called with a padded size in order to reduce the number of syscalls made and the number of context switches. Most programs most likely require more heap than 1000 Bytes, so the system may as well pad the `brk` call to cache some heap memory, and new `brk` or `mmap` calls for heap increase will only occur once you exhaust the padded heap of 132 KiB. The padding calculation is done with:
```
/* Request enough space for nb + pad + overhead */
size = nb + mp_.top_pad + MINSIZE;
```
Where `mp_.top_pad` is by default set to 128 * 1024 = 128 KiB. We still have 4 KiB difference. But remember our page size `getconf PAGESIZE` gives us 4096, meaning each page is 4 KiB. This means when allocating 1000 Bytes in our program, a full page is allocated being 4 KiB. And 4 KiB + 128 KiB is 132 KiB, which is the size of our heap. This padding is not a padding to a fixed size, but padding that is always added to the amount allocated via `brk`/`sbrk`. This means that 128 KiB by default is always added to how ever much memory you're trying allocate. However this padding only applies to `brk`/`sbrk`, and not `mmap`, remember that the past the `MMAP_THRESHOLD`, `mmap` takes over from `brk`/`sbrk`. Which means the padding will no longer be applied. However I'm not sure whether the `MMAP_THRESHOLD` is checked prior to the padding, or after the padding. It seems like it should be prior to the padding.
The padding size can be changed with a call like `mallopt(M_TOP_PAD, 1);`, which changes the `M_TOP_PAD` to be 1 Byte. Mallocing 1000 Bytes now will create only a page of 4 KiB.
For more information, see: http://stackoverflow.com/a/23951267/582917
Why is the older `brk`/`sbrk` getting replaced by newer `mmap`, when an allocation is equal or greater than the `MMAP_THRESHOLD`? Well the `brk`/`sbrk` calls only allow increasing the size of the heap contiguously. If you're just using `malloc` for small things, all of it should be able to be allocated contiguously in the heap, and when it reaches the heap end, the heap can be extended in size without any problems. But for larger allocatiosn, `mmap` is used, and this heap space does need to be contiguously joined with the `brk`/`sbrk` heap space. So it's more flexible. Memory fragmentation is reduced for small objects in this situation. Also the `mmap` call is more flexible, so that `brk`/`sbrk` can be implemented with `mmap`, whereas `mmap` cannot be implemented with `brk`/`sbrk`. One limitation of `brk`/`sbrk`, is that if the top byte in the `brk`/`sbrk` heap space is not freed, the heap size cannot be reduced.
Let's look at a simple program that allocates more than `MMAP_THRESHOLD` (it can also be overridden using `mallopt`):
```c
#include <stdlib.h>
#include <stdio.h>
int main () {
printf("Look at /proc/%d/maps\n", getpid());
// allocate 200 KiB, forcing a mmap instead of brk
There's many `mmap` calls in the above strace. How do we find the `mmap` that our program called, and not the shared libraries or the linker or other stuff? Well the closest call is this one:
It's actually 204800 B + 4096 B = 208 896 B. I'm not sure why an extra 4 KiB is added, because 200 KiB is exactly divisible by the page size of 4 KiB. It maybe another feature. One thing to note is there isn't any kind of obvious way we can pinpoint our program's syscalls over some other syscall, but we can look at the context of the call, that is the previous and subsequent lines to find the exact control flow. Using things like `getchar` can also pause the `strace`. Consider that immeidately after mmapping 204800 bytes, there's a `fstat` and another `mmap` call, before finally `getchar` is called. I have no idea where these calls are coming from, so in the future, we should look for some easy way to label the syscalls so we can find them quicker. The `strace` tells us that this memory mapped region is mapped to `0x7f3b11972000`. Looking at the process's `/proc/$PID/maps`:
As you can see, when malloc switches to using `mmap`, the regions acquired are not part of the so called `[heap]` region, which is only provided by the `brk`/`sbrk` calls. It's unlabelled! We can also see that this kind of heap is not put in the same area as the `brk`/`sbrk` heap, which we understood to start at the low end and grow upwards in address space. Instead this mmapped heap is located in the same area as the shared libraries, putting it at the high end of the reserved user space address range. However this region as shown by `/proc/$PID/maps` is actually 221184 B - 216 KiB. It is exactly 12 KiB from 208896. Another mystery! Why do we have different byte sizes, even though `mmap` in the `strace` called exactly `208896`?
Looking at another `mmap` call also shows that the corresponding region in `/proc/$PID/maps` has a difference of 12 KiB. It is possible that 12 KiB here represents some sort of memory mapping overhead, used by malloc to keep track or understand the type of memory that is available here. Or it could just be extra padding as well. So we can say here that something is consistently adding 12 KiB to whatever we're mmapping, and that there is an extra 4 KiB added to my 200 KiB `malloc`.
By the way, there's also a tool called `binwalk` that's really useful for examining firmware images which may include multiple executable files and metadata. Remember you can in fact embed a file into an executable. Which is kind of like how viruses work. I used it to inspect the initrd of NixOS and figure out how it was structured. Combine it with `dd` you can cut and slice and splice binary blobs easily!
At this point, we can continue investigating the heap from our original program, and the thread heaps as well. But I'm stopping here for now.