Memory Usage and Layout: Optimizing for Backing Store

Even though memory keeps getting cheaper and computers ship with ever larger capacities, the available memory never seems to be enough. Virtual memory (i.e., paging to disk) may help somewhat, but performance is often slowed beyond human tolerance. Our investigations showed that dynamic data structures are one of the main slow-downs: structures that logically belong together and are accessed together are frequently scattered over many virtual memory pages. Each access to such a structure then requires many pages to be resident or newly activated (a page-in from disk plus the eviction of other pages that are currently in core). Each of these pages occupies a significant amount of memory even though only a few of its bytes may actually be used: a massive waste of RAM.

We concentrated mostly on one particular data structure: hash tables, or associative arrays, a common feature of modern programming languages, especially scripting languages. For example, all arrays in GNU awk (gawk) are associative, so even when a programmer tries to access an array in a linear fashion, the accesses are in fact random. Furthermore, both the key and the value of a hash table entry are generic gawk variables, meaning that each of them is only a container holding some control information plus a pointer to the actual string buffer. Together with the deep hash chains (linked lists) used by gawk, many additional pages may be referenced, heavily increasing memory pressure and disk I/O. Although we discuss gawk here, a similar scheme is used by many other languages that support hash tables.

One mechanism would be to change the data structures to colocate the string buffer with the control information, but this may not be feasible: changing the variable layout may break existing code.
Instead, we proposed the following changes, which were also implemented: