Tuesday, September 06, 2011

Thoughts on Event sourcing

I read about event sourcing a while back and I couldn't stop thinking about it so it had to explode here as a blog post.

A typical data driven application involves CRUD operations on domain entities which are important to business. Such application typically capture data from user or other system in a centralized database for reporting or for further distribution. Architecture of such application is generally simple and there's a vast ecosystem of platforms, frameworks, tools and libraries to support it.

If we wanted to represent traditional design of such application in terms of a state machine, then we can say that such application capture, distribute and allow reporting of domain in a certain state. The primary objective of such application is to facilitate data manipulation (CRUD) which changes state of the domain. For most applications it is generally a sound design ignoring familiar caveats.

Thinking in terms of state machines, there's an interesting alternative: we can store all the state transitions that led initial domain model to its current state. This second perspective to application design has many interesting repercussions. In this approach, current state of domain is no longer as important as the earlier approach because it can be recreated just by repeating all the transitions. This second approach is named "Event sourcing" where event is just a little familiar name for state transition.

Not every application care about the easy recreatibility aspect of domain, most business applications care only about current state of domain. However many applications, especially the ones with mandated audit trails or domain with significant historical data, can benefit from event sourcing. A common example of such application is a version control system, version control systems capture state transitions (diffs) of domain entities (source files) so you can switch to any state (version) and rebuild it to desired state by successive applications of  state transitions (diffs).

As far as business domains goes, Insurance domain by far seems to be a great area of application for event sourcing given that audit trail is a legal compliance requirement and insurance domain models tend to be really complex. Think of an insurance policy as a state and all the changes to it (endorsements) as transitions. By just tracking the transitions, one can rebuild a policy to its current state and reason about it for underwriting analysis and audit. Compare it, instead, with our initial approach of capturing multiple states (identical records in relational database) with complex logic to diff them and comparing them. This approach has profound positive implication on usability of application as well as testability, with this approach it is easy to visualize and rebuild data of interest to any point in time.

One more interesting application of event sourcing, I think, is in data mining. If data is stored as events it is fairly easy to sample, plot and build historical and predictive models. My limited experience in mining data has always involved custom (expensive) efforts to store historical data, such custom efforts usually involve complex development efforts just to extract marginally meaningful information.

It shouldn't be surprising that event sourcing can have significant influence on application architecture which may not be an easy sell especially in a larger setting. There are many related concepts to event sourcing, specifically CQRS which lead to wild architectures (which I'm not quite fond of yet).

I'm learning this is not new or revolutionary and has been done in past but never caught up for whatever reasons, nonetheless I find it interesting. As far as my technical curiosities go I'm very much inclined to try it out with a pet project to see how far these benefits are viable.

Monday, March 07, 2011

Why Concurrency is hard

Concurrency is hard because we haven't figure out how to make it easy. For most developers, specifically web developers, concurrency doesn't really matter. I envy that assuasive confident feeling of a sequential execution of http requests. The number of cores on my machine quadrupled in last three years and I don't know a single reliable, comforting (easy) way of harnessing it as much as possible, I feel a little sad about current state of concurrency support.

Utilizing all the processing power consistently is a lot easier for well defined and not so concurrent tasks such as map-reduce. I have done it a lot, processing gigabytes of data by reducing the problem to independent subsets is programmatic triviality. On the other hand, I have always found developing a relatively concurrent application the "right way" to be a nightmare. Concurrency applications come in two mutually exclusive flavours: slow or complex.

At this point enthusiasts will point out java.util.concurrent and move on. While j.u.concurrent is nice and a significant improvement over explicit synchronization, it still mandates that API users be concurrency wizards and its complexity exposure is nearly at par with explicit synchronization. Here's one example blog post explaining common gotcha with ConcurrentHashMap. The only benefit j.u.concurrency provides is finer grained control over where to do CAS. I am a huge fan of j.u.concurrent and have been using it pre-1.5 but I still don't think it makes concurrency so easy. For one more example,

synchronized(this){ aRef = newVal;  return aRef;}

v/s

 while (true) {
            V x = atomicRef.get();
            if (atomicRef.compareAndSet(x, newValue))
                return atomicRef.get();
 }


Which one do you think is easier to grasp?

Many people think that Actors are the next big thing to tackle concurrency monster and complexities introduced by these shared memory model primitives. I too initially thought so, but then I found that Actor model isn't really the sweet spot in practice as it is touted. The very notion that Actors can fail and code must handle the tricky bits to recover from it makes it even more complex than using locks/mutexes etc. I am in constant a awe to see people talking so lightly about  fault tolerant/fail safe systems without giving thought on the amount of complexity it adds. I am not necessarily protesting that philosophy but that behaviour is just not common in yer average regular applications (will your user be happy if one actor failed to process her payment and was asked to retry?). We still live in dark ages of transparent concurrency.

I remain as ignorant and unsatisfied about concurrency support as I was several years ago. For me, concurrency is hard so I am off to shopping!

Monday, December 13, 2010

Tackling nulls the functional way

Most programmers have suffered null pointer one way or other - usually a core-dump followed by a segmentation fault on development machine or on a production box with application in smokes. NullPointerException results in a visible embarrassment of not thinking about "that something *could be* null".

Tracking Null Pointer ranges from loading core-dump in gdb and tracing dereferenced pointer to stack traces pointing to exact location in source. However, ease of tracking nulls opens up the doors to ignore them in practice and throwing null-checks just becomes as common as throwing one more div to fix IE's layout problems which is bad.

Problems with Null:
I hate having to ignore nulls as it is not always enough just to add one more null check. The reason why I am writing this blog is because I have several problems with Nulls:
  1. All and Every reference can be a null in languages like Java. This covers everything: method parameters, return values, fields etc. There's no precise way for a programmer to know that some method might return null or accept null parameters. You absolutely have to resort to actual source code or documentation to see if it can possibly return null (and you are going to need good luck with that). All of it adds extra work when you really want to be focusing on fixing the real problem.
  2. The problem with NullPointerException is that they point to causal eventuality and not usually the actual cause. So what you see in stack traces are usually the code paths where damage is not really initiated but done when we are normally interested in case where damage is initiated. 
  3. Null is actually very ambiguous. Is it the uninitialized value or absence of value or is it used to indicate an error? The paradigm of null fits well in database but not in programming model.
  4. Having Nulls in your code has major implications in code quality and complexity. For example, it is not unusual to see code branches with null checks breeding like rabbits when an API "may" return null which in turn results in extremely defensive code. This significantly taxes readability.
  5. Null makes Java's type system dumber when a method is overridden and you want to call it. Writing code like methodDoingStuff((ActualType)null, otherArgs) isn't exactly a pretty sight. This results in subtle errors when arguments are non-generic. 
In many ways Nulls are necessary evil. For those of us who care about readability and safety we can't ignore them yet we shouldn't just let it overtake safety and readability.

I have come to know several techniques to tackle nulls. First, there is Null Object pattern which is not entirely as ridiculous as the name implies but it's not practical in real life software having hundreds of class hierarchies and thousands of classes, and so, I will not talk about it. Then there are languages like Haskell and Scala with library classes that try to treat nulls in, IMO, a better way. Haskell has MayBe and Scala has Options. After using options in Scala for a while in a side project, I found that I was no longer fighting with nulls. I knew exactly when I had to make a decision that a value is really optional and I must do alternate processing.

The central idea behind Haskell's MayBe and Scala's Option is to introduce a definitive agreement on a value's eligibility to be either null or not-null enforced with the help of type system. I will talk about Scala's Option since I have worked with it, but the concept remains same. I will also introduce how to implement and use Options in Java since this is much more of a functional way of thinking about handling nulls and it doesn't (almost) take Scala's neat language features to implement it.

Treating nulls the better way:

Usual course of action when you are not sure what value to return from a method is:
Most of the times it results in the last option because you don't have to fear about breaking everything (well, mostly) and everyone passes the buck like this.

We can do better with Scala's Option classes. We can wrap any reference in to Some or None and handle it with pattern matching or "for comprehension". For example:


Some(x) represents a wrapper with x as actual value; None represents absence of value. Some and None are subclasses of Option. Option has all the interesting methods you can use. When a variable in question is null we can:  fall back to default value, evaluate and return a function's computed value, filter and so on.

Options in Java:
Implementing Options in Java is surprisingly a trivial task. However, it is not as pleasant as Scala's options. Implementation boils down to a wrapper class Option with two children: Some and None. None represents a null but with a type (None[T]) and Some represents non-null type.

To make Option interesting we make Option extend List, so we can iterate on it to mimic poor man's "for comprehension". We will also go as far as tagging both types with Enums so we can do poor man's pattern matching with a switch. You can find example implementation of Options in Java with a test case demonstrating use of Options. Here's the small snippet which covers the essense:

As you can see, Option opens up several doors to fix the null situation. You now have  choice to compute a value, use default value or do arbitrary stuff when you encounter nulls.

How are my null problem solved with Options:
  1. Using Options for optional/null-able references I have at least avoided "all things could be null" problem in my code. When an API is returning a Option, I don't have to wonder if it can return null. Intention is pretty clear.
  2. When I am forced to handle null right at the time of using an API, I have to handle it right there: do alternate processing or use default. No surprises.
  3. Option is a very clear way of saying a variable represents possibly an absent value.
  4. Option doesn't really solve this problem completely. For example, method signatures with wrapper Option type can get really long (e.g. def method1(): Map[String, Option[List[Option[String]]] = {}). However, compared to null checks, I would prefer long method signature any day. Other benefits out-weight this limitation.
  5. Clearly, Option[Integer] always means only Option[Integer] and not Option[Integer], Option[String], Option[Character], Option[Date] and so on. Compiler can infer exact method call from generic types.
As good as the concept behind optional values is, it doesn't and will not always save you from Null. You will still have to deal with existing libraries which return nulls and cause all these problems and more. However, most of the time null is problematic in your own code.

Where to use Options:
Here are the common places where I think using Options makes more sense:
  1. APIs: Make your API as specific and as readable as possible; all optional parameters and return values should be Option.
  2. Use in your domain model: You already have fair understanding on null-able columns, use Option for null-able fields in your table. It is not hard to integrate using Options if you are using an ORM with interceptable DB fetch; you can initialize fields to None if database contains null and so on.

In the interest of keeping this post relevant and on topic, I have completely avoided heavy theoretical baggage (monads et. al.) that's inevitable when theoretical functionalists (functional programmers) talk about Options. I really hope this post generates some interest in this topic. If you disagree or would like to share more on this topic, please leave a comment.

Saturday, May 01, 2010

Future of a Java programmer

As a long time Java only programmer professionally, I have been pondering about how things are changing around me as a Java programmer. Ever since I remember I had no choice but to use subset of C++ dialect (Java lang) with an extremely rich class library and ecosystem (Java platform).

In last few years there has been a drastic shift in number of languages targeting JVM. For example: dynamic (javascript, jruby, jython, groovy), functional + OO (scala) and a lisp dialect (clojure) and so many others. While I am excited about all the options I have today I don't think a single language will dominate on JVM anymore like Java did so far.

In a way this is a good thing, one tool rarely fits all needs (I couldn't curse Java enough for GUI programming). Like C, Java was never designed to be used for developing dynamic web apps, but we still tried and miserably failed with JSP/JSF and plethora of frameworks against PHP/Rails/Python in terms of productivity. One really good thing Java did was to raise a level of abstractions from platform specific details and memory management. These new languages on top of JVM raise the abstraction level even further for its area of strength.

It is not a remote future when we will see concurrent processes being programmed in clojure and presented with jruby/rails with intermediate code written in Java. Each layer of application is going to be implemented in different programming languages while interfaces being transparent for developers working in each layer. This is a big thing, it has never been envisioned before for Java Platform, the lowest coupling we have seen so far is through remoting (web services et. al.) where clients and servers are on different runtimes and languages.

What this means for a Java developer is if you are

  • A web developer: you are going to learn things which are extremely different from struts/jsf/jsps, no more artificial model1/model2 MVCs.
  • A non web-developer : you are going to write code which is far more readable and very specific to your business domain via DSL created in any of the languages mentioned above without worrying about accidental complexity Java and its frameworks imposed on you.

While I can keep classifying developers on Java platform all day long, these two are major ones whose life (and resumes) are going to change soon, they will be expected to know more than one programming languages rather than frameworks now. Contrary to the cool kids on interwebz, I don't think Java the language is going to die anytime soon not because many of the existing libraries are written in it, but because of the number of programmers on earth who know Java, tooling around it and the native JVM support for it. Java is like C in a way, you can do whatever is supported by underlying implementation.

Many of you who are like me are going to see change around them soon, I am thrilled to see how my career is going to transform as polyglot programmer are you?