Thursday, November 24, 2011

Optimizing string memory footprint in Java - Part 1

This is first in a series of blog posts where I will try to describe my miserable attempts at storing large amount of strings to build efficient spell correction and auto-suggest facility in Java. I learned some important lessons doing this and it will be a waste if I didn't share it. So here it goes..

In order to efficiently store large amount of strings, we first need to understand how JVM stores string objects in memory. In this post I will try to summarize memory layout of String object.

Each object in JVM has fixed, unavoidable structural overhead called object header that JVM uses for various tasks such as Garbage Collection, Identification, addressing and others that I don't understand. A 32-bit HotSpot VM uses 8 bytes header per object, 64-bit HotSpot VM with heap larger than 32g as well as IBM's J9 VM takes 16 bytes per object. As always, arrays are treated differently. Array is an object so it has a fixed header of its own as described above. However, since Java spec guarantees array bounds check on every array operation it has to store length of each array which is another 4 byte, making effective array header size 12 bytes instead of regular 8 for objects. And finally there's object padding in multiples of 8 or 16 bytes, depending on CPU word size and JVM to make memory to CPU communication efficient.

Assuming 32-bit JVM, the java.lang.String class has three bookkeeping integer fields called offset, count, hash occupying 12 bytes plus a 4 byte char array pointer and 8 byte header totaling 24 bytes fixed overhead per string. This is inefficient but slightly clever because padding is not required for string object itself making hash a free cache.

Here's the formula to calculate String's shallow and deep sizes:
  • Shallow size = HEADER + ( offset + count + hash + char array pointer) = 24 bytes
  • Retained size = Shallow size + 12 byte array header + (nchars * 2 + padding)
Where padding = ( (nchars * 2) mod word_size) is rounded off to 8/16 bytes depending on 32/64-bit JVM (Except J9) word_size.

So, on a 32-bit HotSpot VM, 6 byte word "Memory" takes :
24 + 12 + (6 * 2 + 4 padding) = 24 + 12 + 16 = 52 bytes

That is 77% JVM imposed overhead v/s 23% actual data considering it a unicode string.

One way to make string representations efficient is to amortize structural overhead by storing larger strings. To illustrate, 100 char string will be 80% data v/s 20% overhead, 200 char string will be 94.33% data v/s overhead, 500 char string will be 97.6% data v/s overhead and so on. This will be one of the key technique in reducing string's memory footprint I will write about in later blog posts. Following is a rough graph that depicts this theory.


It is easy to see that storing a lot of small strings actually waste huge amount of memory which in no less part the reason why Java is considered a memory hog.

That's all with the basics of understanding overhead involved with storing strings. In next blog post, I will write about various implementations I tried and general rant on Java Collections' memory efficiency.