Common Dynarec Optimizations

by cqcumbers on 01 Feb 2021

Many emulator developers are interested in the performance advantages of dynamic recompilers, but fewer have implemented one. A good way to start is through one of the beginner tutorials available on the Getting Started page. But while the simple dynarecs described by these tutorials may be satisfactory for some systems, a lot more performance remains on the table, and between beginner content and the cutting edge there are many widely known optimizations that more experienced developers take for granted. This article will cover some of these baseline optimizations, with a focus on getting the most performance for the least code complexity.

Introduction

It is important to note before we start that there are some inherent limitations to what this guide can help you with. For example, it will not cover bottlenecks outside of the JIT, of which there are plenty in most full system emulators. As with all optimization, you should profile changes and target the parts that actually limit performance - which in emulators are frequently synchronization, peripherals, or graphics, rather than the JIT itself - before proceeding. As you optimize, the bottlenecks will naturally shift, so it probably doesn’t make sense to implement all of these JIT changes at once without investigating other components in between.

More specific to JITs is the issue that guest architectures differ in ways that help or hinder certain optimizations, and the programs people care about differ in behavior in ways that change what is worth optimizing. For instance, if the guest program you care about is itself a JIT, or frequently modifies its own code, you may want to trade off lookup speed for cache size or invalidation; while if your guest cannot modify its own code at all, you may be able to use simpler techniques that don’t handle all the cases mentioned. This means that while I can talk about what sorts of situations these optimizations can accelerate, you will have to use your judgement as to which will provide your particular use case the most bang for the buck. Knowing more about the actual characteristics of the code you are executing - what addresses it jumps to, which instructions it executes most often, etc. - will pay off far more than brute force application of a list of techniques. For context, my own experience was mostly gained through an N64 emulator, which has two MIPS III guests on x86_64 desktop hosts. I’ve also discussed JIT design with inolen and Merrymage, creators of redream and dynarmic respectively, who both have far more experience in this subject than I.

We start with the following pseudocode, representing the control flow of an unoptimized JIT translating one basic block at a time. We won’t go into the details of compilation, because for the simplest binary translators (ones that don’t call LLVM) the overwhelming majority of overhead comes not from inefficient code or slow compilation, but from the surrounding basic block loop.

// outer loop in C++
// lookup = unordered_map<uint32_t, Block>
while (time_slice > 0) {
  pc = check_interrupts(pc);
  block = lookup[pc];
  if (!block) {
    block = compile(pc);
    lookup[pc] = block;
  }
  pc = block();
}

// in every emitted block
save_host_regs(), load_guest_regs()
// translate instructions, set next_pc
save_guest_regs(), load_host_regs()
return next_pc

Indirect Branch Lookup

Our first and most important optimization is to move the most common path through this outer loop into assembly. Most of the time, we won’t need to handle an interrupt or scheduler event, so blocks only need to verify that they do not need to exit the JIT, then look up the address of the translation for the next block and jump to it. Only when we actually need to exit the JIT, or the next block has not yet been compiled, do we incur the overhead of saving and restoring registers. Doing a hash table lookup in assembly can be expensive, not to mention difficult to implement, so one way to simplify is to replace it with a flat lookup table of pointers, one for each possible guest target address. With this we may be able to replace a function return, hash table lookup, and function call to the next block, each of which may take a large number of instructions, with a load of the lookup table base address, a read of the appropriate element using the guest address, and an indirect branch to the pointer contained in that element.

Of course, this approach wastes a lot of memory compared to a hash table, but there are several ways to reduce it. In many consoles, instructions must be 4-byte aligned, so the table only needs one entry per possible instruction address. If the guest has virtual memory, you can save a great deal of space by doing the physical translation first (though a system-level emulator likely already does this, and it probably doesn’t make sense for a user-level one, given the extra complexity). If code can only be executed from ROM, only ROM addresses need to be included. Even if these are not possible, you can do things on the host side. When generating blocks, you can align the start address and not store the last bits of the pointer. You can also reduce the block cache size, forcing blocks to be within a 32 bit offset from a starting address. Also, keep in mind that unused areas of the lookup table will not slow down lookup by taking up cache space unless they reduce the density of useful areas, and may not take up additional physical RAM at all if the OS lazily allocates zeroed pages (zero page optimization). Note that these optimizations may affect the assumptions you can make during compilation, and how often you need to invalidate. For example, physical lookup means either not using the virtual pc in compiled code, since it could be different at runtime, or invalidating when the virtual address of a block changes, and the latter is only possible if a block can only have one virtual address.

// initially lookup[N_PAGES * N_INSTS] = JIT_EXIT
// C++ calls JIT_ENTER(block) instead of block()

// in every emitted block
// translate instructions, save next_pc
if ((time_slice -= CYCLES) < 0) goto JIT_EXIT
if (has_interrupt) goto JIT_EXIT
goto lookup[next_pc / INST_WIDTH]

// in assembly far code
JIT_ENTER:
save_host_regs(), load_guest_regs()
goto block
JIT_EXIT:
save_guest_regs(), load_host_regs()
return next_pc

Once we move the loop into assembly, the prologue and epilogue for entering and exiting JIT context can also be removed from the individual blocks. You may have noticed that the loop described above does not mention what we do if the next guest address has not yet been added to the hash or lookup table, and needs to be compiled. This is because explicit checks for this can be avoided by pre-filling the lookup table with the address of a common code fragment that calls the recompiler, so the single indirect branch can be used both for the common case of jumping directly to the next block, and for the uncommon case of compiling it. Another source of things that can be moved to “far code” is exceptions, including interrupts and page faults. These often require storing the current pc, reading or modifying status registers, and jumping to an exception handler, though this code is skipped over in most blocks. In general, it makes sense to find ways of moving as much infrequently used computation as possible to far code, for example by forcing the time slice to 0 when an interrupt occurs, or by lazily calculating counter registers from scheduler time instead of incrementing them at the end of blocks. This general pattern of reducing overhead on the common path, at the expense of more logic elsewhere, is the essence of most JIT optimizations.
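As a rough illustration, such a compile stub might look something like the following (COMPILE_STUB replacing JIT_EXIT as the pre-filled value is an assumption here, and the register handling depends on your calling convention):

// in assembly far code, address pre-filled into every lookup slot
COMPILE_STUB:
save_guest_regs(), load_host_regs()
block = compile(next_pc)              // call back into C++ to compile the missing block
lookup[next_pc / INST_WIDTH] = block
save_host_regs(), load_guest_regs()
goto block                            // resume without fully exiting the JIT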

The remaining methods in this section I don’t have direct experience implementing, but they are worth mentioning for those emulating systems with other characteristics. If your address space is very large, as might be the case for modern systems with GBs of RAM, it may not be possible or performant to use the large lookup table method. Instead we can find ways to perform the expensive part of hash table lookup as infrequently as possible. In the most general case, we can make use of the fact that in most reasonably loaded hash tables, the first slot the key maps to contains the value we want - that is, most accesses do not require probing. Therefore, we can put only the code for hashing and the initial lookup in assembly, and exit the JIT only if probing is required. This is not only easier to write in assembly, but also keeps infrequently used code from using up cache space. It’s also possible to augment the global hash table with a smaller, possibly block-local, hash or address table for unpredictable blocks - those that aren’t directly linked or are function returns, as described below. These smaller hash tables may involve less probing and simpler lookup, and are more easily kept in cache than the full table, but require separate invalidation.
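For a rough idea of what the emitted fast path might look like (the hash function, table layout, and names here are placeholders, not a reference implementation):

// emitted at the end of an unpredictable block
slot = hash(next_pc) & TABLE_MASK
if (table[slot].guest_pc != next_pc) goto JIT_EXIT   // probe or compile in C++
goto table[slot].host_block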

For function returns specifically, it may also make sense to use a return stack buffer to further reduce lookup cost. This means adding assembly code to push a predicted return address to the buffer on function calls, including both the guest address after the call and the host address of the matching JIT block (the latter is either known at compile time, if blocks don’t end at calls, or can be patched in later as described below). Then, when a guest function returns, it can pop the last prediction from the buffer, compare the guest return address with the predicted one, and if they match, jump to the associated host address. If they don’t match, the entire buffer is cleared and the block falls back to hash table lookup for finding the next block. If the predicted return JIT block is invalidated, the buffer should also be invalidated. In most cases though, since the return stack buffer is smaller and more specific than the full hash or address table, it is less likely to require probing or incur a cache miss on lookup. The stack buffer can sometimes also be maintained using the host stack and CALL/RET instructions, so the host branch predictor can predict returns. However, this can complicate JIT entries and exits, and means that mismatching the return address, or invalidating the returning block, can cause every future return to be mispredicted. For single threaded guests, or guests where thread switches can be detected, the tradeoff may be worthwhile. More details regarding specific implementations can be found in writeups for Dolphin and Dynarmic.
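A minimal sketch of the software buffer variant, ignoring wraparound and the host CALL/RET approach (rsb, rsb_top, and RETURN_BLOCK are assumed names for illustration):

// emitted at a guest call site
rsb[rsb_top].guest_pc = return_pc        // guest address following the call
rsb[rsb_top].host_block = RETURN_BLOCK   // host translation of that address
rsb_top = rsb_top + 1

// emitted at a guest return, where next_pc is the actual return address
rsb_top = rsb_top - 1
if (rsb[rsb_top].guest_pc == next_pc) goto rsb[rsb_top].host_block
rsb_top = 0                              // misprediction: clear the whole buffer
goto lookup[next_pc / INST_WIDTH]        // fall back to the usual lookup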

Static Branch Linking

So far, these optimizations have been built on assumptions about the emulator path, but information available for specific blocks at compile time can be used for further performance improvements. The most obvious and important case is static branches, which almost always jump to the same one or two host addresses. These are so common, in fact, that JITs can replace the ending lookup for these blocks with a direct host branch, or no branch at all (fallthrough). This reduces pressure on the host indirect branch predictor, but makes modification and invalidation more difficult than with a single global lookup table. Depending on how complicated your lookup process is - for example, if it requires virtual address translation, checking pipeline state, probing a hash table, or other complex logic - it could also reduce memory pressure and computation.

One way to implement static branch linking involves replacing the lookup part of the epilogue in blocks ending with static branches. The replacement epilogue initially consists of a few nops followed by a short stub that passes the host address of the nops and the guest address of the next block to a linker function. The linker function can then look up the host address of the next block using the passed guest address, and write a conditional branch to that host address over the nops, which should have been sized so that this write does not affect the linker call site after them. After the write, the linker should record the host address it just wrote to in a linker table, so that the write can be undone when the target is invalidated. It can then jump to the host address of the next block to continue execution as usual. If the next block has not been compiled yet, then we may have to skip the writing and exit the JIT to compile, then wait for the next time the block is called to perform any linking.

If the linked block ends in a guest conditional branch, the linker function may be called a second time from the same site, when the other arm of the branch is taken. Since there are only two arms, we can assume the linker function cannot be called a third time from the same site, so we can now write an unconditional branch to the next host address over that linker call site, assuming there is enough memory space there before the end of the block. The easiest way to check this condition is to check whether the host address passed to the linker points to nops, or if these have already been overwritten. The following pseudocode shows what the epilogue of a block ending in a guest conditional branch might look like at the 3 stages in its “life cycle” (I assume the rest of the block, including time slice and interrupt checking, is emitted before START):

// initial state of epilogue
START:
nops;
link(START, guest_pc)

// after first call, where link was called with guest_pc = PC1,
// which was compiled to host address BLOCK1
START:
if (guest_pc == PC1) goto BLOCK1
link(START, guest_pc)

// after second link, where link was called with guest_pc = PC2
// which was compiled to host address BLOCK2
START:
if (guest_pc == PC1) goto BLOCK1
goto BLOCK2

This is one basic form of static branch linking in a JIT. There is one more thing that usually needs to be implemented for this to work though, and that is invalidating these branches when their target block is invalidated. The next section will include more regarding when invalidation occurs and how to detect it, but essentially there may be times when the guest code at a certain address is overwritten and so the code that was first compiled for it is no longer valid. When this occurs, we need to remove references to it not only from the lookup table, but also from the links, by first looking up the invalidated block (or page, depending on the invalidation granularity) in the linker table. We can then go through all the host addresses that reference that block/page, and rewrite the original form of the epilogue at those addresses, including both the nops and the call to the linker function.
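To make the mechanics concrete, the host-side linker and unlinker might look roughly like this in C++ (write_cond_branch, write_linker_stub, and the exact table layouts are hypothetical stand-ins for your emitter):

// hypothetical sketch, not a reference implementation
std::vector<uint8_t*> link_table[N_PAGES];     // call sites referencing each guest page

void link(uint8_t *site, uint32_t guest_pc) {
  void *target = block_lut[guest_pc / INST_WIDTH];
  if (target == JIT_EXIT) return;              // not compiled yet; link on a later call
  write_cond_branch(site, guest_pc, target);   // overwrite the nops before the call site
  // (a second call from the same site would instead overwrite the call site itself
  // with an unconditional branch, as shown in the pseudocode above)
  link_table[guest_pc / PAGE_SIZE].push_back(site);
}

void unlink(uint32_t page) {                   // called when a guest page is invalidated
  for (uint8_t *site : link_table[page])
    write_linker_stub(site);                   // restore the nops + call to link()
  link_table[page].clear();
}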

Code Invalidation

One important thing real translators often must handle that isn’t shown in the commonly seen basic loop is code invalidation - detecting when the content of the basic block at a guest address has changed, and updating the translation. The most efficient way to do so can vary depending on the properties of the guest system. One very cheap method for systems that allow self-modifying code, but where most programs keep code and frequently written data relatively far apart, and writes to code areas don’t happen often, is to use host page faults to detect writes. On Linux this is done through mprotect. After translating a block, we write-protect the host pages containing the guest memory the translation covers. This means writes that would invalidate any block within such a page will now trigger a host page fault. In the fault handler, we can then calculate the guest address of the write from the faulting address, find the guest page that was written to, and invalidate every block within that guest page.

// on compiler fetch of new page
page = fetch_addr / PAGE_SIZE;
if (protected_pages[page]) return;
mprotect(ram_base + page * PAGE_SIZE, PAGE_SIZE, PROT_READ);
protected_pages[page] = true;

// within page fault handler (N_INSTS = PAGE_SIZE / INST_WIDTH)
if (sig != SIGBUS && sig != SIGSEGV) return;
page = (fault_addr - ram_base) / PAGE_SIZE;
if (fault_addr < ram_base || page >= N_PAGES) exit(1);
// remove blocks from LUT (unlink blocks here if needed)
memset(block_lut + page * N_INSTS, 0, N_INSTS * sizeof(void*));
mprotect(ram_base + page * PAGE_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE);
protected_pages[page] = false;

The primary downside of using host memory protection for invalidation is often the granularity. Since writes anywhere to a page invalidate all code on that page, you might find that some pages are invalidated very frequently not because code is being frequently modified, but because there is frequently modified data that just happens to be located next to it. When this becomes a problem, it may be advantageous to switch away from host memory invalidation and instead make changes to the emitted code for memory accesses, perhaps doing so only when necessary in a manner similar to fastmem.

The core of finer-grained invalidation is to keep track of precisely which addresses contain code, and check write addresses against that record. Most often this takes the form of a bitmap, where the bit corresponding to an address is set by the compiler when it fetches the opcode there, and a whole page of bits is unset when the corresponding guest page is invalidated. After a write instruction is executed, there must be code to look up the address in this bitmap, and if that bit is set, invalidate the page. While accurate, doing this check for every write across the whole system would take a lot of space and add a lot of overhead on most systems, but it can make sense when there is very limited memory, or when specific pages are known to frequently contain both data and code. In the latter case, accesses to those pages would go through a special handler, and the bitmaps would only cover that subset of memory.
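Emitted after each guest store, the check might look roughly like this (the bitmap layout and the INVALIDATE far code label are assumptions):

// on compiler fetch of an opcode at paddr
bit = paddr / INST_WIDTH
code_bitmap[bit / 8] |= 1 << (bit % 8)

// emitted after a guest write to paddr
bit = paddr / INST_WIDTH
if (code_bitmap[bit / 8] & (1 << (bit % 8))) goto INVALIDATE   // far code invalidates the page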

For code that truly is frequently overwritten though, even fine-grained bitmaps won’t help. In that case the simplest option is to emit code at the beginning of every block that computes a checksum of the addresses it would fetch, and verifies that this is the same as it was at compile time. If they are different, we must immediately exit the JIT, invalidate the block, and compile a new one. On x86, the hardware crc32 instructions may be useful for the checksum portion of this task. In many cases frequently invalidated code is also reused later, so it may be wise to compute the checksum up to the next branch, instead of to where the end of the block was at compile time. This way the checksum can be used to index into a backup content-based block lookup table, which can replace the newly invalidated block without actually recompiling.
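A sketch of such a prologue, assuming the checksum is computed over the guest bytes the block was compiled from (crc32 here stands in for whatever checksum routine you emit, such as the x86 CRC32 instructions):

// emitted at the start of a frequently invalidated block
// block_pc, block_len, and COMPILED_SUM are constants baked in at compile time
sum = crc32(guest_ram + block_pc, block_len)
if (sum != COMPILED_SUM) goto RECOMPILE   // far code invalidates this block and recompiles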

Of course, this is all extra code that must be run on every block, so it may be worth trying to reduce overhead by only running the check when the block’s page is potentially dirty, as detected by page faults or another method. With the former, the page fault handler would no longer directly modify the lookup table but instead mark the page as dirty in a page-granularity bitmap. Each block would then start with a lookup of that bitmap before conditionally branching to a handler in far code that performs the checksum and invalidation. Without block linking this may be simplified by replacing the block pointer with the checksum handler pointer directly when the page is dirtied - whether that trade-off is worth it depends on the system, though.
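Put together, the dirty-page variant might look something like this (dirty_pages and VERIFY_CHECKSUM are assumed names):

// in the page fault handler, instead of clearing the lookup table
dirty_pages[page] = true;

// emitted at the start of each block on that page
if (dirty_pages[block_pc / PAGE_SIZE]) goto VERIFY_CHECKSUM   // far code checksums, invalidates
                                                              // if needed, then clears the bit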

There are ways to further optimize invalidation, for example by distinguishing between swapping out code pages and frequently overwriting a single instruction, using hooks on guest OS memory API calls and DMAs. This information may then be used to reduce page faults, split basic blocks to reduce hashing, or any number of other things. We won’t go into detail here though, as there are many ways to optimize invalidation and I only have experience with a few.

Memory Access

While an inefficient dispatch loop causes consistent overhead regardless of the guest, memory access from generated code can also be an important issue depending on the guest system. Furthermore, memory access changes can open up avenues to finer-grained invalidation and other optimizations. Many JITs start out by handling all accesses through a set of external read/write functions, requiring every load and store to save and restore JIT context. Translating the virtual address to a physical one, determining whether the access needs to go to RAM or ROM, specially handling memory mapped I/O, and triggering possible interrupts and watchpoints is all done by code outside the JIT. However, executing the common path through this directly in emitted assembly can significantly reduce the overhead.
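For contrast, the unoptimized approach emits something like the following around every guest load (read32 is a stand-in for your C++ memory handler):

// emitted for each guest load, before any memory optimizations
save_guest_regs(), load_host_regs()
value = read32(vaddr)           // C++ handles translation, MMIO, interrupts, watchpoints
save_host_regs(), load_guest_regs()
register = value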

Inline page tables are useful for systems with virtual memory, where translating virtual into physical addresses is a common source of complexity on the fast path. Even if this isn’t strictly the case, many systems allow RAM banking or at least group memory mapped I/O in a way that can be treated similarly. Before discussing it though, it should be mentioned that at least for virtual address translation, if your guest pages are the same size as or larger than your host pages, you may be able to use the host MMU directly for the task, using mmap() to transparently turn “virtual” accesses of certain mapped pages into “physical” accesses of another region. However, the 64 KB allocation granularity on Windows (which constrains where separate mappings can be placed) frequently interferes with this strategy. If that doesn’t work, we must simulate the guest MMU in code, and the way to do this as quickly as possible is to precalculate as much as possible during the relatively infrequent address mapping changes, while simplifying the logic for the common case.
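As a rough Linux-only sketch of the host MMU approach (memfd_create, GUEST_VSPACE, and the reserved vmem_base region are assumptions; Windows would use file mappings instead):

// one-time setup: back guest RAM with a file descriptor that can be mapped multiple times
int fd = memfd_create("guest_ram", 0);
ftruncate(fd, RAM_SIZE);
// reserve the whole guest virtual address space up front
uint8_t *vmem_base = (uint8_t *)mmap(NULL, GUEST_VSPACE, PROT_NONE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

// when the guest maps vpage to ppage, alias the same backing memory at the virtual address
mmap(vmem_base + vpage * PAGE_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE,
     MAP_SHARED | MAP_FIXED, fd, ppage * PAGE_SIZE);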

When a new mapping is given to the MMU, we can subtract the starting physical address from the starting virtual address, and store this difference in a table. For unmapped pages, I/O, and other logic that may need to be executed, we can make use of the fact that most systems have much less physical memory than virtual memory, and encode this information in the otherwise unused high bits of the physical address. For example, we might prefill the table so that every unmapped virtual page translates to physical page -1, making any access within it produce a small negative physical address, and set the most significant bit of the actual physical address for mappings of MMIO pages. The emitted code for memory accesses first calculates the virtual page, looks up the difference for that page from the table, and subtracts the difference from the virtual address. It can then branch to far code if necessary, ideally using condition bits already set by the previous subtraction. The rest of the logic for determining whether a page fault, MMIO access, or other event has occurred, and handling it accordingly, is left to far code.

// initially pages[vpage] = vpage * PAGE_SIZE + PAGE_SIZE (unmapped pages map to physical page -1)
// when vpage is mapped to ppage
pbase = ppage * PAGE_SIZE;
if (needs_fallback(ppage)) pbase = pbase + 0x8000'0000;  // flag MMIO pages with the msb
pages[vpage] = vpage * PAGE_SIZE - pbase;

// when read from vaddr emitted
vpage = vaddr / PAGE_SIZE
paddr = vaddr - pages[vpage]
host_pc = address_of_label(AFTER)
if (paddr < 0) goto READ_FALLBACK
value = ram_base[paddr]
AFTER:
register = value

// in far code
READ_FALLBACK:
if (paddr >= -PAGE_SIZE) goto PAGE_FAULT       // small negative offset: unmapped page
value = mmio_read(paddr - 0x8000'0000)         // strip the MMIO flag bit
goto host_pc

While this method is much faster than calling into C++ on every access, and can be pretty close to optimal for systems that use full virtual memory on every instruction, on many systems we can simplify the fast path even further using a combination of host page fault write detection and backpatching, usually known as “fastmem”. This reduces the initial emitted code for memory accesses to a direct read or write of an address-space-sized array, where pages that require more complex code to access - such as TLB-mappable regions and MMIO - are protected by the host MMU. An access to one of those regions then generates a page fault, and the handler can use both the particular protected address accessed and the address of the faulting instruction to replace the direct access with more complex or specialized code as needed - for example, a call to an assembly stub implementing inline page table lookup if the access was to a TLB-mappable region, a call to a stub checking the address against a bitmap if the write was to a page containing code, or a checked direct call to an MMIO handler if the access was to an MMIO register. This kind of patching not only improves RAM access speed, but also makes MMIO polling faster, by making use of the fact that a given instruction address usually accesses only RAM or only MMIO.
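A very rough sketch of the handler side of this backpatching (patch_call, the stub names, and resume_at are hypothetical; a real implementation also needs to know the length and registers of the faulting host instruction):

// in the page fault handler, given the faulting host instruction and access address
paddr = fault_addr - fastmem_base;
if (is_mmio(paddr))             patch_call(fault_pc, MMIO_STUB);        // checked direct MMIO call
else if (is_tlb_mapped(paddr))  patch_call(fault_pc, PAGE_TABLE_STUB);  // inline page table lookup
else if (contains_code(paddr))  patch_call(fault_pc, CODE_CHECK_STUB);  // bitmap check, then write
resume_at(fault_pc);            // re-execute the now-patched instruction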