Normally I avoid writing something this obvious, but I've been bitten multiple times now, so it might help future me.
Never, ever store references to java.net.URL in Java collections. The reasoning is simple: this class's 'equals' and 'hashCode' methods perform an extremely expensive, synchronous DNS lookup on every call.
It is not uncommon to see most of your threads' time being spent on monitors:
"pool-2-thread-2" prio=10 tid=0x92061400 nid=0x1744 waiting for monitor entry [0x91fad000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.net.URLStreamHandler.getHostAddress(URLStreamHandler.java:429)
- waiting to lock <0x9731b200> (a sun.net.www.protocol.http.Handler)
at java.net.URLStreamHandler.hashCode(URLStreamHandler.java:354)
at java.net.URL.hashCode(URL.java:875)
- locked <0xaac87290> (a java.net.URL)
at java.util.HashMap.getEntry(HashMap.java:361)
at java.util.HashMap.containsKey(HashMap.java:352)
at java.util.HashSet.contains(HashSet.java:201)
"pool-2-thread-1" prio=10 tid=0x9205e800 nid=0x1743 runnable [0x91ffe000]
java.lang.Thread.State: RUNNABLE
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
at java.net.InetAddress.getAllByName0(InetAddress.java:1211)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at java.net.InetAddress.getAllByName(InetAddress.java:1063)
at java.net.InetAddress.getByName(InetAddress.java:1013)
at java.net.URLStreamHandler.getHostAddress(URLStreamHandler.java:437)
- locked <0x9731b200> (a sun.net.www.protocol.http.Handler)
at java.net.URLStreamHandler.hashCode(URLStreamHandler.java:354)
at java.net.URL.hashCode(URL.java:875)
- locked <0xaac97228> (a java.net.URL)
at java.util.HashMap.getEntry(HashMap.java:361)
at java.util.HashMap.containsKey(HashMap.java:352)
at java.util.HashSet.contains(HashSet.java:201)
This stack trace depicts only hashCode, but expect similar blocking for 'equals' too. If you care even a bit about performance, just stay away from this goddamned class.
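If you need set or map semantics over URLs, a safer route, as a minimal sketch, is to key the collection on java.net.URI instead: URI's equals and hashCode compare the textual form only, with no network I/O, and you convert to a URL only at the point of actual connection. (URI.create and toURL are standard JDK methods; the class below is just illustrative.)

import java.net.URI;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class UriSetExample {
    public static void main(String[] args) throws Exception {
        // URI compares purely on its textual form -- no DNS lookup, no monitor contention.
        Set<URI> seen = new HashSet<URI>();
        URI first = URI.create("http://example.com/index.html");
        seen.add(first);

        // Cheap, string-based comparison.
        System.out.println(seen.contains(URI.create("http://example.com/index.html"))); // true

        // Convert to URL only when you actually need to connect.
        URL url = first.toURL();
        System.out.println(url);
    }
}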
Monday, December 05, 2011
Thursday, November 24, 2011
Optimizing string memory footprint in Java - Part 1
This is the first in a series of blog posts describing my miserable attempts at storing a large number of strings to build an efficient spell-correction and auto-suggest facility in Java. I learned some important lessons doing this, and it would be a waste not to share them. So here it goes..
To store a large number of strings efficiently, we first need to understand how the JVM stores String objects in memory. In this post I will try to summarize the memory layout of a String object.
Each object in the JVM has a fixed, unavoidable structural overhead called the object header, which the JVM uses for various tasks such as garbage collection, identification, addressing, and others that I don't understand. A 32-bit HotSpot VM uses an 8-byte header per object; a 64-bit HotSpot VM with a heap larger than 32g, as well as IBM's J9 VM, uses 16 bytes per object. As always, arrays are treated differently. An array is an object, so it has a fixed header of its own as described above. However, since the Java spec guarantees a bounds check on every array operation, the length of each array must also be stored, which is another 4 bytes, making the effective array header 12 bytes instead of the regular 8 for objects. And finally there is object padding, in multiples of 8 or 16 bytes depending on the CPU word size and the JVM, to make memory-to-CPU communication efficient.
Assuming a 32-bit JVM, the java.lang.String class has three bookkeeping integer fields, called offset, count, and hash, occupying 12 bytes, plus a 4-byte char array pointer and the 8-byte header, totaling 24 bytes of fixed overhead per string. This is inefficient but slightly clever: without hash the object would be 20 bytes and get padded to 24 anyway, so the hash field occupies what would otherwise be padding, making it a free cache.
Here's the formula to calculate a String's shallow and retained (deep) sizes:
- Shallow size = HEADER + (offset + count + hash + char array pointer) = 24 bytes
- Retained size = shallow size + 12-byte array header + (nchars * 2 + padding)
So, on a 32-bit HotSpot VM, the 6-character (12-byte) word "Memory" takes:
24 + 12 + (6 * 2 + 4 padding) = 24 + 12 + 16 = 52 bytes
That is 77% JVM-imposed overhead v/s 23% actual data, counting the UTF-16 character data as the data.
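As a minimal sketch, here is the arithmetic above turned into code, padding the char data to an 8-byte multiple exactly as the worked example does (the method name and padding placement are my own; actual JVM layouts vary):

// Rough retained size of a String on a 32-bit HotSpot VM, per the formula above.
static long retainedSize(int nchars) {
    long shallowSize = 24;                     // 8-byte header + offset + count + hash + char[] pointer
    long arrayHeader = 12;                     // 8-byte object header + 4-byte length field
    long charData = nchars * 2L;               // UTF-16: 2 bytes per char
    long paddedData = (charData + 7) / 8 * 8;  // pad to an 8-byte multiple, as in the example
    return shallowSize + arrayHeader + paddedData;
}

// retainedSize(6) == 52, matching the "Memory" example above; the overhead
// fraction is then 1 - (nchars * 2) / (double) retainedSize(nchars).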
One way to make string representations efficient is to amortize the structural overhead by storing larger strings. To illustrate, a 100-char string is about 80% data v/s 20% overhead, a 200-char string 94.33% data, a 500-char string 97.6% data, and so on. This will be one of the key techniques for reducing strings' memory footprint that I will write about in later posts. The following is a rough graph depicting this trend.
It is easy to see that storing a lot of small strings wastes a huge amount of memory, which is in no small part why Java is considered a memory hog.
That's all for the basics of the overhead involved in storing strings. In the next post, I will write about the various implementations I tried, along with a general rant on the memory efficiency of Java Collections.
Monday, October 24, 2011
On using "PermGen" as application level cache
I was reading the interesting article 'Assault by GC' by the Stack Exchange guy, and it felt like déjà vu after my past couple of years of development on the JVM. It struck me that we can definitely do better, so here it goes..
Automatic GC is a great step forward in software development, except when it isn't. If you have deployed an application on a JVM with a large heap (4g+), you probably know what a long GC pause really feels like [insert familiar knock-knock joke about Java]. Jokes aside, the JVM's GC advancement is unprecedented; the amount of tuning you can do with the different garbage collectors could define a niche profession.
For most applications, where GC latency isn't an issue, the default garbage collector works just fine. For applications that need to scale, GC can (and does) become a bottleneck. If you disagree, try running a JVM under memory pressure and watch your app's response times. It should be surprising, because in most data-driven applications the bottleneck is usually IO or other IO-bound resources (e.g. a DB). This situation generally happens when GC is thrashing the process, either because there are too many "tenured" objects that don't fit in the allocated heap, or because the heap is fragmented and GC wastes a lot of time compacting it. Unlike .NET, Java folks are not very lucky with platform-specific optimizations such as locking pages to prevent swapping, so it is not uncommon for a full GC to cause excessive paging, making GC itself IO bound.
It turns out that GCing a large "tenured" object space is expensive compared to sweeping short-lived young objects. A large population of tenured objects is generally a genuine requirement for long-running server processes that rely on large amounts of data, and that requirement shouldn't punish the application with long GC pauses. While not impossible, it is not really practical to make the tenured generation very large, because doing so may adversely affect young-generation collections. JVM GC optimization is a skill not in abundance, but the problem is all too common. So what can we do about it?
One way to eliminate GC on predictable "tenured" application data is simply not to store it on the JVM heap (i.e. use a direct byte buffer, etc.). I've been watching solutions like Terracotta's BigMemory, which uses a similar approach to address GC issues. However, all such solutions seem like a mix of manual memory management and hacks to circumvent GC, and they end up as half-baked reinventions of the JVM's copy-on-write "permgen".
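To make the direct-byte-buffer idea concrete, here is a toy sketch of an off-heap store (the class and method names are hypothetical; a real solution needs capacity management, eviction, and proper charset handling):

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Values live in a direct buffer outside the GC-managed heap;
// only the small index map is subject to collection.
public class OffHeapStore {
    private final ByteBuffer data = ByteBuffer.allocateDirect(64 * 1024 * 1024);
    private final Map<String, int[]> index = new HashMap<String, int[]>();

    public synchronized void put(String key, String value) {
        byte[] bytes = value.getBytes();
        int offset = data.position();
        data.put(bytes);                      // copy onto the off-heap buffer
        index.put(key, new int[] { offset, bytes.length });
    }

    public synchronized String get(String key) {
        int[] loc = index.get(key);
        if (loc == null) return null;
        byte[] bytes = new byte[loc[1]];
        ByteBuffer view = data.duplicate();   // independent view; write position untouched
        view.position(loc[0]);
        view.get(bytes);
        return new String(bytes);             // rematerialized on-heap only on access
    }
}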
Most of the Java developers I know consider "permgen" to be some kind of evil that causes all sorts of problems: Eclipse crashing, crying JSP/[insert other template library] compilers, unpredictable class unloading, and really large interned strings that stick around. "permgen" is going away from the HotSpot VM, which is kind of sad, because I think it could be a great way to achieve GC-free heap storage for application-level data (more specifically, cache). That is not possible while "permgen" serves only one specific purpose; but if applications were allowed to store their own data there, we could have standard, supported, GC-free application data without third-party solutions that achieve the goal poorly. Even better would be java.cache using "permgen" for cache storage.
One of the commenters on HN mentioned the Smalltalk VM's way of using permgen (just send an object a message to move itself to "permgen"). I like this approach because applications can control which objects are long-lived, which is sensible since they have the best knowledge of their long-lived data. The closest thing we have in the JVM is String.intern, which unfortunately caches strings forever and is not nearly as useful as something with eviction control.
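As a hypothetical sketch of what interning with eviction control could look like in application code (all names are mine; a production version would want weak references or a proper LRU cache library):

import java.util.LinkedHashMap;
import java.util.Map;

// An interner that, unlike String.intern, evicts least-recently-used entries.
class EvictingInterner {
    private static final int MAX_ENTRIES = 100000;
    private final Map<String, String> pool =
        new LinkedHashMap<String, String>(16, 0.75f, true) {   // access-order LRU
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > MAX_ENTRIES;  // evict; intern() would hold forever
            }
        };

    public synchronized String intern(String s) {
        String canonical = pool.get(s);
        if (canonical == null) {
            pool.put(s, s);
            canonical = s;
        }
        return canonical;
    }
}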
So, what do you think about this approach?
Tuesday, September 06, 2011
Thoughts on Event sourcing
I read about event sourcing a while back and couldn't stop thinking about it, so it had to explode here as a blog post.
A typical data-driven application involves CRUD operations on domain entities that are important to the business. Such applications typically capture data from users or other systems into a centralized database, for reporting or for further distribution. The architecture of such applications is generally simple, and there's a vast ecosystem of platforms, frameworks, tools, and libraries to support it.
If we represent the traditional design of such an application in terms of a state machine, we can say that the application captures, distributes, and reports on the domain in a certain state. Its primary objective is to facilitate the data manipulation (CRUD) that changes the state of the domain. For most applications this is a sound design, familiar caveats aside.
Thinking in terms of state machines suggests an interesting alternative: we can store all the state transitions that led the initial domain model to its current state. This second perspective on application design has many interesting repercussions. Here, the current state of the domain is no longer as important as in the earlier approach, because it can be recreated simply by replaying all the transitions. This second approach is called "event sourcing", where "event" is just a friendlier name for a state transition.
Not every application cares about this easy recreatability of the domain; most business applications care only about its current state. However, many applications, especially those with mandated audit trails or domains with significant historical data, can benefit from event sourcing. A familiar example is a version control system: it captures state transitions (diffs) of domain entities (source files), so you can switch to any state (version) and rebuild a desired state by successive application of diffs.
As far as business domains go, insurance seems by far a great area of application for event sourcing, given that an audit trail is a legal compliance requirement and insurance domain models tend to be really complex. Think of an insurance policy as a state and all the changes to it (endorsements) as transitions. By tracking just the transitions, one can rebuild a policy to its current state and reason about it for underwriting analysis and audit. Compare that with the traditional approach of capturing multiple near-identical records in a relational database, with complex logic to diff and compare them. Event sourcing has profound positive implications for the usability and testability of such an application: it becomes easy to visualize and rebuild the data of interest as of any point in time.
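A minimal sketch of the idea, with hypothetical Policy and event types (real endorsement events would of course carry much richer data):

import java.util.ArrayList;
import java.util.List;

// The current policy state is never stored; it is rebuilt by replaying transitions.
interface PolicyEvent {
    void applyTo(Policy policy);
}

class PremiumChanged implements PolicyEvent {
    private final double newPremium;
    PremiumChanged(double newPremium) { this.newPremium = newPremium; }
    public void applyTo(Policy policy) { policy.premium = newPremium; }
}

class Policy {
    double premium;
}

class PolicyEventLog {
    private final List<PolicyEvent> events = new ArrayList<PolicyEvent>();

    void append(PolicyEvent event) { events.add(event); }

    // Rebuild the policy as of the first n events -- i.e. as of any point in time.
    Policy replay(int n) {
        Policy policy = new Policy();
        for (PolicyEvent e : events.subList(0, n)) {
            e.applyTo(policy);
        }
        return policy;
    }
}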
Another interesting application of event sourcing, I think, is data mining. If data is stored as events, it is fairly easy to sample, plot, and build historical and predictive models from it. My limited experience in mining data has always involved custom (expensive) efforts to store historical data, usually complex development work just to extract marginally meaningful information.
It shouldn't be surprising that event sourcing can significantly influence application architecture, which may not be an easy sell, especially in a larger setting. There are many concepts related to event sourcing, notably CQRS, which can lead to wild architectures (which I'm not quite fond of yet).
I'm learning that this is neither new nor revolutionary; it has been done in the past and never caught on, for whatever reasons. Nonetheless I find it interesting, and as far as my technical curiosities go, I'm very much inclined to try it out on a pet project to see how viable these benefits are.
Monday, March 07, 2011
Why Concurrency is hard
Concurrency is hard because we haven't figured out how to make it easy. For most developers, specifically web developers, concurrency doesn't really matter; I envy that assuasive, confident feeling of sequential execution of HTTP requests. The number of cores on my machine has quadrupled in the last three years, and I still don't know a single reliable, comforting (easy) way of harnessing them. I feel a little sad about the current state of concurrency support.
Utilizing all that processing power consistently is a lot easier for well-defined, not-so-concurrent tasks such as map-reduce. I have done it a lot: processing gigabytes of data by reducing the problem to independent subsets is a programmatic triviality. On the other hand, I have always found developing a genuinely concurrent application the "right way" to be a nightmare. Concurrent applications come in two mutually exclusive flavours: slow or complex.
At this point enthusiasts will point to java.util.concurrent and move on. While j.u.concurrent is nice, and a significant improvement over explicit synchronization, it still demands that API users be concurrency wizards, and its complexity exposure is nearly on par with explicit synchronization. Here's one example blog post explaining a common gotcha with ConcurrentHashMap. The main benefit j.u.concurrent provides is finer-grained control over where to do CAS. I am a huge fan of j.u.concurrent and have been using it since before 1.5, but I still don't think it makes concurrency easy. For one more example:
synchronized (this) { aRef = newValue; return aRef; }

v/s

while (true) {
    V x = atomicRef.get();
    if (atomicRef.compareAndSet(x, newValue))
        return newValue; // calling atomicRef.get() here instead would race with other writers
}
Which one do you think is easier to grasp?
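And to make the ConcurrentHashMap gotcha mentioned above concrete, here is a small sketch of the classic check-then-act mistake (createValue is a hypothetical factory; cache is a ConcurrentMap<K, V>):

// Broken: each call is atomic, but the check-then-act sequence is not.
// Two threads can both pass the check and clobber each other's value.
if (!cache.containsKey(key)) {
    cache.put(key, createValue(key));
}

// Atomic alternative -- though the fresh value may turn out to be wasted work:
V fresh = createValue(key);
V existing = cache.putIfAbsent(key, fresh);
V value = (existing != null) ? existing : fresh;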
Many people think that Actors are the next big thing for taming the concurrency monster and the complexity introduced by these shared-memory primitives. I initially thought so too, but then I found that the Actor model isn't really the sweet spot in practice that it is touted to be. The very notion that actors can fail, and that code must handle the tricky bits of recovering from that, makes it even more complex than using locks/mutexes. I am in constant awe of people who talk so lightly about fault-tolerant/fail-safe systems without a thought for the complexity they add. I am not necessarily protesting that philosophy, but such behaviour is just not common in your average application (will your user be happy if an actor failed to process her payment and she was asked to retry?). We still live in the dark ages of transparent concurrency.
I remain as ignorant of and unsatisfied with concurrency support as I was several years ago. For me, concurrency is hard, so I am off shopping!