Monday, April 14, 2014

MySQL Insert performance and random primary key values

It's been almost a year of putting up with MySQL and its bizarre performance characteristics and lack of choice in index data structures. If it weren't for the volume and velocity of data we're storing in it, I would have proposed ditching it by now.

We use MySQL purely as a stable, high-performance key-value store, so many might not share these frustrations. I've been bitten by the poor insert performance of InnoDB tables with semi-random primary key values more than once now, and hopefully this post will help someone in a similar situation.

MySQL with InnoDB struggles with random primary key values, most likely because it only supports B-tree indices for the primary key (the clustered index), and page splits in a B-tree can be very IO intensive. I've only experienced this with numerical columns, but it may be true for other data types as well.

The only solution that has worked satisfactorily for me so far is to add an AUTO_INCREMENT primary key and index the other column as UNIQUE. This wastes storage space, but it has offered the best insert performance so far. AUTO_INCREMENT with an upsert (INSERT ... ON DUPLICATE KEY UPDATE) is probably the only reliable and performant way of using MySQL as a super fast ingestion engine.
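As a sketch, the pattern looks roughly like this (table and column names are illustrative, not from our actual schema):

```sql
-- A surrogate AUTO_INCREMENT key keeps inserts appending to the right
-- edge of the clustered index; the natural (random) key is demoted to
-- a secondary UNIQUE index.
CREATE TABLE kv_store (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  item_key   BINARY(16)      NOT NULL,  -- e.g. a hash or UUID
  item_value BLOB            NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uk_item_key (item_key)
) ENGINE=InnoDB;

-- Upsert: insert new rows, update existing ones on a key collision.
INSERT INTO kv_store (item_key, item_value)
VALUES (?, ?)
ON DUPLICATE KEY UPDATE item_value = VALUES(item_value);
```

The id column buys nothing semantically; it exists purely so InnoDB's clustered index grows sequentially instead of splitting pages at random positions.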

Monday, December 16, 2013

Interesting use-case for SynchronousQueue

While working on a very unusual system, where 1) producers can at times be significantly faster than consumers (by more than a factor of two) and 2) producers have low-latency processing overhead for real-time data, I was contemplating a data structure that is efficient, performant, and models this situation elegantly. After researching probable candidates, I came across Exchanger and SynchronousQueue in Java's java.util.concurrent class library.

If I were looking at SynchronousQueue without the above context, I would have wondered why anyone would need a queue that's not really a queue but more like a pointer swap between appropriate threads. But the use case I'm dealing with ("event bursts") is probably the perfect fit: it keeps producers from overwhelming consumers by modeling the problem as a "hand-off" rather than a typical case of buffered queuing. The central idea behind this data structure is to keep the queue idiom, without an actual queue, in a very efficient manner, with the added property that message production is rate limited by the consumer's speed of processing. Behind the scenes it uses a dual-stack or dual-queue algorithm (depending on the ordering fairness preference) to transfer a reference between threads.

SynchronousQueue is more of a thread queue than a data queue: it maintains a stack/queue of waiting threads (i.e. "consumers"), not a queue of the data itself. You can approximate the same functionality with a BlockingQueue of size 1, or with an explicit object lock and wait/notify on a datum reference, as in the example below:

//Example code; a naive approximation for illustration
//(I've only tested it on my laptop :))
public class MyNaiveSyncQueue {
    private final Object LOCK = new Object();
    private Object data; //guarded by LOCK; synchronized gives visibility, volatile not needed
    public void put(Object o) throws InterruptedException{
        synchronized (LOCK) {
            while(data != null){ //loop guards against spurious wakeups
                LOCK.wait();
            }
            data = o;
            LOCK.notifyAll(); //wake any waiting consumers
        }
    }
    public Object take() throws InterruptedException{
        synchronized (LOCK) {
            while(data == null){
                LOCK.wait();
            }
            Object o = data;
            data = null;
            LOCK.notifyAll(); //wake any waiting producers
            return o;
        }
    }
}

There are several problems with the solution above:
  • Violent locking and memory-fence overhead on every queue operation: this will scale terribly with the number of producers/consumers, especially on server-class SMP hardware.
  • Constant context switching: each successful queue operation involves syscall(s) for context switching, which may involve the kernel scheduler and everything that comes with it (cache flushes, register reloads et al.).
  • Overhead of fair processing: the JVM makes no fairness guarantee about which thread waiting on an object monitor is woken first, so enforcing FIFO wakeup order requires extra bookkeeping. This may or may not be behavior the programmer cares about.
SynchronousQueue takes care of all of these limitations: it offers a choice of trade-off in scheduler ordering fairness, and it eliminates expensive locking in favor of hardware-level CAS (whenever available). It also does a fair bit of spinning before a kernel-level timed wait kicks in, which keeps context switches from becoming the hot spots of message processing.
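A minimal hand-off between one producer and one consumer looks like this (a toy sketch, not the production setup described above):

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.atomic.AtomicReference;

public class HandOffDemo {
    static String handOff(String msg) throws InterruptedException {
        // true = fair mode: waiting threads are served in FIFO order
        SynchronousQueue<String> queue = new SynchronousQueue<>(true);
        AtomicReference<String> received = new AtomicReference<>();

        Thread consumer = new Thread(() -> {
            try {
                received.set(queue.take()); // blocks until a producer arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // put() blocks until the consumer is ready to take: production is
        // rate limited by consumption, which is the point of the hand-off.
        queue.put(msg);
        consumer.join();
        return received.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("consumed: " + handOff("event-1"));
    }
}
```

Note that put() does not return until some consumer has taken the element; there is no buffer for the message to sit in.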

So far this has been working great for the system I'm dealing with, which processes a couple of hundred messages per millisecond at peak bursts, but I realize it might not be appropriate (or even worth it) for non-realtime/non-bursty producers.


Tuesday, May 15, 2012

Achieving Decoupling: Dependency Injection v/s Event driven design

(Update: A more exhaustive sequel post is in progress on this topic so don't be disappointed by lack of details and stay tuned)

Coupling is considered a bad sign in software design, and having seen enough of it in practice, I agree with this belief. The need for decoupling has resulted in a cottage industry of IoC containers in the Java ecosystem, where it remains popular, so this is not a purely philosophical topic.

When I was learning object-oriented software design as a student, the gist I received was encapsulation and message passing between objects with state. In theory this sounded great; objects with behavior and encapsulated state seemed unbeatable (given that I only knew BASIC and C). In real life, however, I have found that it is hard to keep objects simple while keeping their dependency fulcrum in control. This may not hold in real OO programming languages like Smalltalk, but I can't speak for that.

Given that I've only worked on medium to large enterprise applications (ranging from 10 KLOC to 1 MLOC+), this might sound a bit biased: most enterprise applications are not object oriented. This is counterintuitive, given that the OO approach is supposed to excel at modelling large systems by breaking them down into more manageable pieces, but I digress.

Coming back to the main issue, how can we control coupling?

Dependency Injection
The most popular option so far is to use dependency injection to invert the dependencies. Its touted benefits are testability, flexibility, and abstracted (and hence switchable) implementations.
For example:
interface IHttp{ InputStream open(HttpUrl url); }

class FireFox implements IHttp{
 public InputStream open(HttpUrl url){ return doOpen(url).andLeakSomeMemoryAsDesigned(); }
}

class Chrome implements IHttp{
 public InputStream open(HttpUrl url){ return feedUserInfoTo(googleAI).thenOpenUrl(url).makeItFast(); }
}

class InternetExplorer implements IHttp{
 //Since no one uses it apart from downloading other browsers,
 //being efficient, user friendly or useful is just a misplaced goal.
 public InputStream open(HttpUrl url){
  return waitIndefinitelyOrForeverFor(url).attemptToLoadWhenever(url).possiblyReadOrJustGiveUp();
 }
}

class MavenHttpClient implements IHttp{
 public InputStream open(HttpUrl url){
  return downloadKnownInternetFiveThousandTimesEveryHour().completelyIgnoringWtfHttp()
   .then()
   .inflictDarkSufferingOnUserAndBeReluctantlyOpinionatedAboutHowYouScrewUpBuilds().asAlways();
 }
}

class MyHappyEnterpriseApp{
 IHttp http;
 void setBrowser(IHttp browser){ this.http = browser; } //Injected dependency
 void myBusinessLogic(){ http.open(new HttpUrl("http://blog.nirav.name")).doSomething(); }
}
So by coding to an interface and relying on a dependency injection "container", we get all the prescribed benefits. This is the de facto standard for achieving decoupling in current enterprise application architecture in the Java ecosystem.
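Stripped of the jokes, the same pattern in runnable form looks like this (a minimal sketch; the class and method names are illustrative, not from any framework):

```java
// Minimal dependency injection by hand: the business logic depends only
// on an interface; the concrete implementation is supplied from outside.
interface Http {
    String open(String url);
}

class StubBrowser implements Http {
    public String open(String url) { return "contents of " + url; }
}

public class App {
    private final Http http; // injected, never constructed here

    App(Http http) { this.http = http; } // constructor injection

    String businessLogic() {
        return http.open("http://blog.nirav.name");
    }

    public static void main(String[] args) {
        // Wiring done by the caller (or by an IoC container in practice).
        System.out.println(new App(new StubBrowser()).businessLogic());
    }
}
```

Swapping StubBrowser for any other Http implementation requires no change to App, which is the whole testability/switchability argument in one line of wiring.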

Event Driven Design
The second obvious alternative (maybe not so obvious to many) is to fall back on good old message passing, which can be synchronous or async. Isn't the OO approach all about message passing and keeping state encapsulated while reacting to meaningful messages? It turns out this doesn't jive well with statically typed languages such as Java.

Instead of lethally injecting objects with dependencies (abstract or concrete), why not just send a message that will be observed by the objects that are really interested in responding? The benefits here include the ones offered by dependency injection, but are more evolutionary.

For example, a similar implementation of the above design can be represented as events:
interface IEvent{}
interface HTTPEvents{
 interface OpenUrl extends IEvent{
   InputStream open(HttpUrl url);
 }
}
class FireFox implements HTTPEvents.OpenUrl{
 public InputStream open(HttpUrl url){ return doOpen(url).andLeakSomeMemoryAsDesigned(); }
}

class Eventing{
  static Dictionary events; //maps event types to registered implementations
  static <T> T sync(Class<? extends IEvent> type, Object... args){
    return events.lookupImpl(type).send(args);
  }
}
class MyHappyEnterpriseApp{
 void myBusinessLogic(){ doSomething(Eventing.sync(HTTPEvents.OpenUrl.class, "http://blog.nirav.name")); }
}

While simplistic, this implementation should give the general idea that dependency injection is not the only true way to achieve decoupling.
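The pseudocode above can be made concrete as a small synchronous event bus: a registry of handlers keyed by event type (a sketch of the idea; the names here are illustrative, not from any library):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// A minimal synchronous event bus: handlers register per event type,
// and senders dispatch by type without holding a reference to any
// concrete handler class.
public class Eventing {
    private final Map<Class<?>, Function<Object, Object>> handlers = new HashMap<>();

    public <E, R> void register(Class<E> eventType, Function<E, R> handler) {
        handlers.put(eventType, e -> handler.apply(eventType.cast(e)));
    }

    @SuppressWarnings("unchecked")
    public <R> R sync(Object event) {
        Function<Object, Object> handler = handlers.get(event.getClass());
        if (handler == null) {
            throw new IllegalStateException("no handler for " + event.getClass());
        }
        return (R) handler.apply(event);
    }
}
```

A caller registers a FireFox-style handler for an OpenUrl-style event once at startup, and business logic then dispatches events without ever naming the implementation class.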

Even the most complex systems are designed around the idea of message passing (the Linux kernel, the Win32 API and so on), where it successfully decouples the user-land API (e.g. POSIX) from the physical implementation (e.g. CPU interrupts, faults and traps). This is a strong hint that message passing can be applied to general purpose applications (which are comparatively less complex). It is not a novel approach to system design, but it appears that reinventing the wheel is in fashion as far as enterprise application development goes. JMS has been a solid spec geared more or less towards similar principles, but it is viewed only as a queueing mechanism for offloading async tasks.

I guess not everyone sees coupling from the same angle before starting to build yet another money-making framework to achieve the same goal. I hope that actor-style libraries like Akka, and Erlang-style message passing, get some more attention from the enterprise app community, to show how benefits beyond decoupling (performance, fault tolerance and scalability) are achieved with message passing.

Monday, March 05, 2012

On Static v/s Dynamic typing

I've been pondering trivialities again: I'm observing how most programmers have a strong vested interest in defending static v/s dynamic typing, as if the two were mutually exclusive.

There are those who think static typing is the ultimate engineering solution for building reliable software, and that dynamic typing is some kind of sick joke designed to put their laziness to the test. Then there are those who think types are for primitive minds and variable declaration is an insulting feature.

At this point I can relate to both groups, as I've held strong opinions on this subject. Having an opinion on typing is OK, I think (having no opinion on this topic basically means Google didn't work for you and you should head back).

Most programmers go through what I call a typing evolution cycle. I started with a dynamically typed language (name removed because of possible copyright violation). It was really great to stuff everything into a variant (actually, I never even bothered to declare variables!) and not worry about compilation or runtime errors; it mostly ran. Then I was introduced to C (and later C++), and I remember how I hated being forced to declare variables and having to think upfront about what I wanted the computer to do. It was alien to me that the compiler would point out my mistakes instead of doing my bidding. During that time I was a supporter of dynamic typing, because I was naive.

As time went by I got more used to static typing, with interesting IDE features such as Visual Studio's IntelliSense. Writing a correct program was much easier now, and it always worked (mostly correctly). Then came Java, where everything was an object (except primitives). Large projects and Eclipse JDT made me appreciate the value of static typing. When I write tests I don't think about type errors, because they are taken care of, and I can easily refactor code without nightmares that something might break. Scala's type inference and simpler type system boosted my faith in static typing even further. At this point I supported static typing, because I was naive.

This was my experience; maybe it is the reverse for others (starting from static typing and going back and forth).
My views on typing have changed over time. Depending on what I'm working on, I don't mind hand-waving the type system away (coupled with esoteric tests) for a quick isolated stab that brings a lot of benefit now, versus a bloat of API doing the same in a few years, while still wanting confidence that my software is immune to my bad keyboard-fu and that stupid mistakes will not make it all the way to production. Probably, I'm still naive.