Memory Management

A computer's main memory is another resource shared by several programs, at least in multiprogrammed systems. As such, it has to be managed in a way that allows for efficient and secure use of that resource. The responsibility of the operating system in this area is to let several programs execute in the same memory without interfering with each other (or with the kernel itself), to give memory to the processes requesting it, and to decide where in memory they'll reside.

Objects in memory

To perform useful work, a computer has to follow a set of instructions, called a program. The way most computer architectures work forces programs to reside in main memory while they execute. In a multiprogrammed environment, several processes have to be in memory at the same time. This means that some of the objects in memory are process code and process data.

Also, the kernel is a program as well, meaning some part of memory has to be reserved for the kernel to run. These memory addresses are special in the sense that they require a lot of protection, given that the kernel executes with complete control of the computer; if an attacker were able to read from or write to kernel memory, the entire security of the system could be compromised.

In modern systems, some pieces of code, such as common libraries, are compiled as shared objects, meaning they can be used by several programs that share that part of memory between them, in a secure way. The main advantage of this approach is that the library needs to be compiled and loaded into memory only once; but it can make the deployment of programs more complex compared to a static binary.

As discussed in other sections, processes can talk to each other using shared memory, another kind of object in memory.

All of these objects have to coexist in main memory in a secure and efficient way.

The memory

The memory is presented to the cpu as an array of words (usually word = byte, so I’ll use them interchangeably):

+--------------------------------------------------------------------------+
| operating system | shared memory | shared object | process 1 | process 2 |
|                  |               |               |           |           |
|                  |               |               |           |           |
|                  |               |               |           |           |
+--------------------------------------------------------------------------+
0 1 2 3 4 5 6 7 8 9 a b c d e f ........

and is connected through the memory bus:

cpu1 ----- bus ---- memory
cpu2 ----- |

When the cpu wants the word at address 0xffff, it puts that address on the bus and the memory responds with the value stored there. Of course, it's common for the memory to deliver several words at a time. This short explanation ignores several aspects of this transaction, such as the cache memory.

Memory organization

With so many elements, there has to be a format in which we store and keep track of the objects in memory. This generates two problems that need to be solved by any scheme for managing memory: relocation and protection.

The relocation problem exists because it's necessary to deal with the situation in which a program will be executed in different memory locations. For example, in

x = y + z / 2;
return x;

the compiler will transform the variables x, y and z into memory addresses. But if those addresses are fixed, then the program can only be executed in one place. This is not desirable for portability nor for multiprogramming.

The protection problem is easier to understand. It has to be possible to prohibit one process from reading and/or writing another process's memory. Especially, the kernel should be able to protect its code from interference by other processes.

Different schemes of memory organization solve these problems in different ways, sometimes requiring hardware features.

Logical and physical addresses

The addresses generated by the cpu are usually different from the ones in physical memory. The former are called logical addresses and the latter physical addresses. The difference stems from the memory scheme providing an abstraction of the memory to the programs. So, when a program does *address, it's not really (physically) getting the value stored at that numeric address; in other words, the cpu sees one number and the bus gets another. The mapping between those two types of addresses is done in the Memory Management Unit (mmu), a component in the processor. Some types of memory schemes require support from the mmu.

This mapping is usually transparent to the application programmer, but not necessarily to the kernel programmer, who may have to configure it.

Swapping

Another important concept in memory organization is swapping. The operating system sometimes needs to take a process out of memory, storing it somewhere less scarce, like a hard drive. The difference in speed between the main memory and the secondary storage usually makes swapping an expensive operation. In the extreme, a system under a great deal of load can enter a state of thrashing, in which the kernel spends more and more time swapping portions of memory in and out of secondary storage; this situation can greatly reduce, and even collapse, the performance of the system.

Contiguous memory allocation

The idea is to store the processes in contiguous memory addresses. This is the layout of memory using this scheme:

+------------------------------------------------------------------+
| operating system | process 1 | process 2 | process 3 | process 4 |
|                  |           |           |           |           |
|                  |           |           |           |           |
|                  |           |           |           |           |
+------------------------------------------------------------------+
0 1 2 3 4 5 6 7 8 9 a b c d e f ........

There are two ways of implementing this format: fixed-size or variable-size partitions. With fixed-size partitions, the number of processes that can be in memory at the same time is limited by the number of partitions, and it introduces a lot of internal fragmentation, meaning it's likely to waste space inside each partition, reducing the total amount of usable memory. If the partitions are variable, that's less of a problem, but it requires keeping track of the empty memory addresses that could be used, and it creates external fragmentation, which is the unusable space left between the allocated partitions, for example:

+----------------------------------------------------------------------+
| operating system | process 1 | empty 50 MB | process 3 | empty 50 MB |
|                  |           |             |           |             |
|                  |           |             |           |             |
|                  |           |             |           |             |
+----------------------------------------------------------------------+
0 1 2 3 4 5 6 7 8 9 a b c d e f ........

Let's say an 80 MB program is launched: even though there's 100 MB of available memory in total, the kernel can't create the process because the free memory is fragmented into two 50 MB holes. Those holes could have been created by processes 2 and 4 finishing.

In this scheme, the protection and relocation problems can be solved by, for example, the use of a relocation register and a limit register. These two registers together define an interval of addresses that are valid for the program, protecting a memory partition from other processes. The relocation problem is solved by compiling memory addresses as offsets; the mmu then adds the value of the relocation register to each logical address. Of course, to keep this scheme secure, only the operating system can set the registers' values, giving a pair of values to each process at scheduling time. This approach requires hardware support.

rr = relocation register | lr = limit register

                   rr          rr+lr
                   ^           ^
                   |           |
+------------------------------------------------------------------+
| operating system | process 1 | process 2 | process 3 | process 4 |
|                  |           |           |           |           |
|                  |           |           |           |           |
|                  |           |           |           |           |
+------------------------------------------------------------------+
0 1 2 3 4 5 6 7 8 9 a b c d e f ........
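
To make that concrete, here's a minimal sketch in C, with invented names; the real check and addition happen inside the mmu on every access, not in software:

#include <stdint.h>
#include <stdlib.h>

/* Set by the kernel when it schedules the process; only the kernel can
   change them. The names and values are made up for the sketch. */
static uint32_t relocation_register = 0x40000; /* start of the partition */
static uint32_t limit_register      = 0x10000; /* size of the partition  */

static void trap(void) { abort(); }  /* stand-in for alerting the kernel */

/* What the mmu does, conceptually, on every access. */
static uint32_t translate(uint32_t logical)
{
    if (logical >= limit_register)
        trap();                            /* outside the partition  */
    return relocation_register + logical;  /* address put on the bus */
}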

Processes are created and eventually finish, which means the kernel has to keep allocating space for new processes. When a new request arrives, the kernel has to figure out if the memory has enough contiguous space to give to the new process; if there isn't, the process is put in a queue until it's possible to allocate memory for it, according to some algorithm. This makes it clear that a way to keep track of vacant spaces is needed, as well as an algorithm to decide which piece goes to which program.

The three most common algorithms used to allocate a contiguous memory space for a process are best fit, worst fit and first fit. Best fit allocates the smallest empty chunk of memory that can contain the process. Worst fit gives the process a space inside the biggest empty chunk. And first fit uses the first empty space that is big enough.
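
As a sketch of how the kernel could keep track of the holes and apply first fit (the free-list representation and the names are my own, not any particular kernel's):

#include <stddef.h>

/* A hole of free, contiguous memory. */
struct hole {
    size_t start;        /* first free address */
    size_t size;         /* length of the hole */
    struct hole *next;
};

/* First fit: take the first hole big enough for the request. */
size_t first_fit(struct hole *list, size_t request)
{
    for (struct hole *h = list; h != NULL; h = h->next) {
        if (h->size >= request) {
            size_t start = h->start;
            h->start += request;   /* shrink the hole from the front */
            h->size  -= request;
            return start;
        }
    }
    return (size_t)-1;             /* nothing fits: the process must wait */
}

Best fit and worst fit scan the whole list instead, keeping the smallest (or biggest) hole that can hold the request.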

Segmentation

The idea behind this method is to separate the different parts of a program into their own segments: one for the stack, one for data, one for code, another for libraries, etc. Each program can have its own group of segments. Each segment is a set of contiguous memory addresses, defined by two values, the segment base, the first address of the segment, and the segment limit, the size of the segment. The segments themselves need not be contiguous with each other.

Process' memory view                  Physical memory

+-------+                             +------------+
|code   |                             |____________|
|segment|---------------------------->|code segment|
|   +-------+                         |            |
+---|stack  |                         |____________|
    |segment|                         |            |
----|   +-------+                     |            |
|   +---|data   |                     |____________|
|       |segment|                     |data segment|
|       |       |-------------------->|            |
|       +-------+                     |____________|
|                                     |            |
|                                     |            |
|                                     |____________|
|                                     |stack       |
------------------------------------->|segment     |
                                      |____________|
                                      |            |
                                      +------------+

To implement the mapping between logical addresses (segment number + offset) and physical addresses, a table is used containing the segment base and the segment limit, indexed by the segment number (a simple array of pairs of numbers). One of these tables is needed for each process.

When a memory address is generated by the cpu, it comes as a segment number and offset pair, and the mmu translates it into a specific physical memory address. If the offset is bigger than the segment limit, a trap is generated and the operating system is alerted. This is how a process is kept from meddling with another's memory, achieving protection.
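
A minimal sketch of that translation, assuming the per-process table really is just an array of (base, limit) pairs as described above (again, the real work happens in the mmu):

#include <stdint.h>
#include <stdlib.h>

/* One entry per segment: where it starts and how big it is. */
struct segment {
    uint32_t base;    /* first physical address of the segment */
    uint32_t limit;   /* size of the segment                   */
};

/* One table per process, filled in by the kernel at launch time. */
static struct segment segment_table[4];

static void trap(void) { abort(); }   /* stand-in for the real trap */

/* Logical address = (segment number, offset within the segment). */
static uint32_t translate(unsigned segment, uint32_t offset)
{
    if (offset >= segment_table[segment].limit)
        trap();                                /* offset too big */
    return segment_table[segment].base + offset;
}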

The relocation problem is solved by the compiler defining the segments needed, and the operating system placing the segments in memory and generating the mapping table when the program is launched. So, the compiler defines segments and uses offsets within them to refer to memory addresses, which means that wherever the segments end up in physical memory, the program will still work.

Paging

In paging, the physical memory is divided into equal-size partitions, called frames. The logical address space is also divided into partitions of the same size as the frames, called pages. When a program is launched, enough pages are allocated to store it in main memory, without needing those partitions to be contiguous.

Physical address space

   frame1      frame2      frame3      frame4      frame 5
+-----------------------------------------------------------+
| process 1 | process 2 | process 1 | process 3 | process 2 |
|           |           |           |           |           |
|           |           |           |           |           |
|           |           |           |           |           |
+-----------------------------------------------------------+

The mmu, as always, is in charge of mapping logical addresses into physical ones; in a paging scheme this means it maps pages into frames. This mapping is done through page tables, one for each process:

Page Table        Logical memory of a process      Physical memory

                     +---------+                     +---------+
 page|frame          |page 0   |-----------     ---->|frame 0  |
------------         |_________|          |     |    |_________|
  0   |  1     ------|page 1   |          ---------->|frame 1  |
  1   |  3     |     |_________|                |    |_________|
  2   |  2     |     |page 2   |-------------------->|frame 2  |
  3   |  0     |     |_________|                |    |_________|
               |     |page 3   |-----------------    |frame 3  |<---
               |     |_________|                     |_________|   |
               |     |         |                     |         |   |
               |     +---------+                     +---------+   |
               |                                                   |
               -----------------------------------------------------

Every hardware architecture implements paging a bit differently, but they follow these same principles. For example, IA32 has two levels of page tables: the page directories and the page tables; the first maps to the second, which finally maps into physical frames.

From the point of view of a process (or programmer), its memory address space is contiguous. This can be achieved by the mmu using the first n bits of the logical address as the index into the page table and the rest of the logical address as the offset within the frame:

Logical address

00  01                  -------> Physical address: wpf * frame + offset
--  --                  |         (wpf = words per frame)
|   |                   |
|   -------> offset: word 1 inside frame 1  <--
|                                             |
-----------> index 0 of page table: frame 1 ---
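
In C, and assuming for simplicity 4 KiB pages, byte-sized words and a single-level table (real hardware like IA32 uses several levels), the translation looks roughly like this:

#include <stdint.h>

#define PAGE_SHIFT 12                 /* 4 KiB pages, just for the example */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Flat table: page number -> frame number. 1024 entries cover a small
   4 MiB address space; real hardware splits this into several levels. */
static uint32_t page_table[1024];

static uint32_t translate(uint32_t logical)
{
    uint32_t page   = logical >> PAGE_SHIFT;      /* top bits: table index */
    uint32_t offset = logical & (PAGE_SIZE - 1);  /* low bits: offset      */
    uint32_t frame  = page_table[page];
    return frame * PAGE_SIZE + offset;            /* wpf * frame + offset  */
}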

Since the addresses the program deals with are contiguous, the relocation problem is not a big concern. Protection in this scheme is provided in two ways. First, the hardware architecture can define certain bits in the page table entries to mean read, write and/or execute; then the mmu, at translation time, can tell whether the memory address can be executed, written or read. Second, used well, the paging mechanism hides every physical address from the processes that don't own it. In this way, processes only see their own part of physical memory.

A nice feature of paging is that it makes it clear how to implement shared memory (for IPC, or for shared libraries, for example). Let's say two programs want to talk to each other by sharing memory: one way to do it is to pick a frame to be shared and map it into both processes' page tables as a read/write page.
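
From user space on Linux, this is roughly what the mechanism looks like through the POSIX shared memory calls; underneath, the kernel maps the same frames into both processes' page tables. The name "/demo" and the sizes are just for the example (older systems may need -lrt when linking):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Both processes open the same name; "/demo" is just an example. */
    int fd = shm_open("/demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return 1;
    if (ftruncate(fd, 4096) < 0)          /* one page is enough here */
        return 1;

    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (shared == MAP_FAILED)
        return 1;

    strcpy(shared, "hello");              /* visible to the other process */
    munmap(shared, 4096);
    close(fd);
    return 0;
}

A second process that opens the same name and maps it with MAP_SHARED sees the same bytes, because its page table ends up pointing at the same frames.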

Another interesting use of paging is copy on write for process creation. The new process needs to have the same code and data as the process it was spawned from at the moment of creation; but copying every page is a waste of cpu and memory. Instead, some operating systems copy the page tables of the parent process, but mark the writable pages (e.g. data pages) as read only. When the new process tries to write, a trap is raised and the operating system intervenes, allocating a new writable page in the new process's page table and copying the old data into it. After that, the process can execute the same instruction again and the write lands in its own copy of the page. In this way, only the necessary copies are performed, not the entire data. Of course, the non-writable code pages can be shared completely.
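
Roughly, and with all names invented for the sketch, the kernel's write-fault handler under copy on write does something like this:

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* A toy page-table entry, just enough for the sketch. */
struct pte {
    void *frame;       /* frame the page points at     */
    int   writable;    /* can the process write to it? */
};

/* The entry starts out pointing at the parent's frame, marked read only. */
void handle_write_fault(struct pte *entry)
{
    void *copy = malloc(PAGE_SIZE);           /* stand-in for alloc_frame() */
    memcpy(copy, entry->frame, PAGE_SIZE);    /* duplicate the shared page  */
    entry->frame    = copy;                   /* point at the private copy  */
    entry->writable = 1;                      /* now the write can proceed  */
    /* On return from the trap, the faulting instruction is retried and the
       write lands in the process's own copy of the page. */
}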

Of course, there is a lot of subtlety involved in correctly using paging, such as what happens if I/O registers are memory mapped. Unfortunately, every architecture has its own quirks, and I only implemented paging on IA32, for a toy operating system.

Virtual memory

In the above schemes, the amount of physical memory limits the degree of multiprogramming the system is capable of. The concept of virtual memory aims to solve this issue by exploiting the fact that a process does not need to have all of its resources in main memory to function. If the system is able to temporarily store the unused parts in secondary storage, then more free memory is available. Also, while a process is blocked waiting for an I/O event, such as a disk read, its pages can be swapped out until the scheduler gives it a cpu.

In Linux, a part of the disk can be defined as a swap area, in which the operating system can store pages that are not needed right now. Another option is to configure a swap file which can perform the same function.

Demand paging

The on-demand loading of data into memory frames is called demand paging. When a process starts, few or no pages are loaded into actual memory. When the process starts executing at some logical address, a page fault is generated, and the kernel figures out whether the error was caused by trying to access a non-mapped memory region or because the page in question has not been loaded yet; in the latter case, the kernel loads it and, when the transfer from the storage hardware is done, the process resumes at the exact same instruction that caused the fault. The hardware provides certain features to make this possible, such as a bit in the page table entries to mark whether the page is valid or not; trying to access an invalid page results in a page fault.
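
Sketching that decision with a toy page-table entry (a real kernel would pull a frame from its frame allocator and block on the disk transfer; the names here are invented):

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct pte {
    void *frame;       /* frame backing the page, if any               */
    int   valid;       /* is the page mapped to a frame right now?     */
    long  disk_slot;   /* where the page lives in the swap area, or -1 */
};

/* Stand-in: a real kernel would start a disk transfer and block here. */
static void read_from_swap(long slot, void *frame)
{
    (void)slot;
    memset(frame, 0, PAGE_SIZE);
}

/* Roughly what the kernel does when a page fault arrives. */
void handle_page_fault(struct pte *entry)
{
    if (entry->disk_slot < 0)
        abort();                       /* not a mapped region: kill the process */

    entry->frame = malloc(PAGE_SIZE);  /* stand-in for grabbing a free frame    */
    read_from_swap(entry->disk_slot, entry->frame);
    entry->valid = 1;
    /* The process resumes at the exact instruction that caused the fault. */
}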

Demand paging allows a system to increase the level of multiprogramming by requiring fewer pages per program to be in memory at the same time; but this increases contention for the memory, as more processes compete for it. If a process needs too many frames (for example, it holds and reads a huge data structure in memory) it could block because not enough empty frames are available. If too many programs are running concurrently, the cpu could end up performing too many transfers between disk and memory, degrading overall performance (even if dma is used, the processes would block, reducing responsiveness). In order to have good performance under load, a virtual memory system needs a good page replacement algorithm.

Page replacement

That is: which frame should the system swap out to disk? A good intuition is that frequently used frames should be swapped out less often. Also, if a page is read only, its frame doesn't need to be written back to disk, because it can't have changed.

Since it's not possible to know exactly how the flow of a program will impact memory use, some heuristics were developed which give good enough solutions. Just like caches (main memory can be thought of as a cache for the disk), they rely on the empirically observed principle of locality (link1, link2): memory locations that were just accessed are likely to be accessed again shortly, and the surrounding memory addresses are more likely to be accessed than others.

A simple approach is the FIFO (first in, first out) algorithm, in which the page that was loaded into memory first is the one to replace. The operating system keeps a queue of pages and, when the need arises, pops the first one to be replaced. The main advantage of this idea is the simple implementation and the low overhead; but it doesn't play well with the principle of locality: a recently used page could be the oldest one. To avoid that problem, a small modification creates the second chance algorithm which, as the name implies, gives a second chance to pages that were recently used. A bit in the page table indicates whether the page was recently referenced; when the page becomes the oldest one and is chosen for replacement, instead of being replaced, the bit is cleared and the page is sent to the back of the queue. If the page is chosen and the bit is already 0, it's replaced.
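
A sketch of second chance in its circular ("clock") formulation, which behaves the same as the queue description above; the referenced bits are the ones the hardware sets (kept here in a plain array per frame, rather than in the page table, to keep the sketch short):

#define NFRAMES 64

/* One referenced bit per frame; the hardware sets it whenever the frame
   is accessed, the algorithm clears it. */
static int referenced[NFRAMES];
static int hand;                       /* points at the current oldest frame */

/* Skip (and clear) recently used frames; evict the first one whose bit
   is already 0. */
int pick_victim(void)
{
    for (;;) {
        if (!referenced[hand]) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;             /* oldest frame not recently used   */
        }
        referenced[hand] = 0;          /* second chance: clear and move on */
        hand = (hand + 1) % NFRAMES;
    }
}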

A more nuanced idea is the LRU (least recently used) algorithm, in which the page to be replaced is the one that hasn't been used for the longest time. This idea leans more heavily on the principle of locality: if a page was just used, it'll probably be used again shortly, so sending it to disk would be a bad idea. The problem with this algorithm is the implementation: keeping track of the time of every access to every page involves more complexity and hardware support.
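
To see where that cost comes from, a literal LRU needs a timestamp updated on every single memory access, something like this sketch (which is why real systems approximate it instead):

#include <stdint.h>

#define NFRAMES 64

static uint64_t last_used[NFRAMES];  /* one timestamp per frame              */
static uint64_t now;                 /* ticks on every single memory access  */

/* Would have to run on every reference, which is what makes LRU expensive. */
void touch(int frame) { last_used[frame] = ++now; }

/* Evict the frame with the oldest timestamp. */
int pick_victim(void)
{
    int victim = 0;
    for (int i = 1; i < NFRAMES; i++)
        if (last_used[i] < last_used[victim])
            victim = i;
    return victim;
}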