Those of you who have ever tried to process big records with CloverETL have already learned that it takes some tweaking and special care to make it run smoothly and efficiently. In some cases, CloverETL could get too greedy with its memory requirements for a graph run, making it quite cumbersome to set up. With CloverETL 3.2 we have introduced improved memory management in the runtime layer that optimizes memory usage when running graphs with big records.
Let’s take a look inside to see what this is all about…
Clover’s approach to record processing is based on a pipeline – a chain of processing components connected by edges. The edges are the key point of inter-component communication: they have to ensure a fast transfer of records from one component to another. Our approach to edge data transfer has always been based on serializing records into a byte stream at the starting end of an edge and deserializing them back into record form at the other end. This ensures a basic invariant for all of our components – no record instance sharing. Each component has its own record instance populated from data on the input edge; the record is then processed by the component and serialized onto an output edge. This simple idea delivers excellent performance. (We have tried many times to find an even better approach, but have always returned to this one. Believe me – we have tried hard and many, many times.)
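To make the "no record instance sharing" invariant concrete, here is a minimal toy sketch of the idea – the class and method names are hypothetical, not the real CloverETL DataRecord/Edge API:

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: a toy record passed over an "edge" by
// serialization. The producer and consumer never share an instance.
public class EdgeSketch {

    // A toy record with a single int field.
    static class ToyRecord {
        int value;

        // Writing end of the edge: serialize the record into bytes.
        void serialize(ByteBuffer buffer) {
            buffer.putInt(value);
        }

        // Reading end of the edge: populate this instance from bytes.
        void deserialize(ByteBuffer buffer) {
            value = buffer.getInt();
        }
    }

    public static void main(String[] args) {
        ByteBuffer edgeBuffer = ByteBuffer.allocate(4); // the "edge"

        ToyRecord producerRecord = new ToyRecord();
        producerRecord.value = 42;
        producerRecord.serialize(edgeBuffer);

        edgeBuffer.flip(); // switch the buffer from writing to reading

        // The consumer populates its OWN instance from the byte stream.
        ToyRecord consumerRecord = new ToyRecord();
        consumerRecord.deserialize(edgeBuffer);

        System.out.println(consumerRecord.value);             // 42
        System.out.println(producerRecord != consumerRecord); // true
    }
}
```

Each component only ever touches bytes it deserialized into its own record, which is what lets components run independently without locking on shared record objects.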
This imposes a painful decision on the edge itself – choosing the capacity of the buffer that stores the bytes as they pass from one end to the other. Obviously, that buffer must have enough room to hold the biggest record which passes through it. Those familiar with the CloverETL engine already know where I am going – the Record.MAX_RECORD_SIZE parameter.
In versions prior to 3.2, we used a standard java.nio.ByteBuffer allocated to various multiples of MAX_RECORD_SIZE. That meant that all edges, component buffers, and just about anything with records passing through it were sized to accommodate at least MAX_RECORD_SIZE bytes worth of the “guessed” biggest record possible. Over time, we gradually raised the default from 8 KB up to 64 KB (which, in the world of XML, unstructured data, and other modern marvels, is still far from enough). Yet increasing MAX_RECORD_SIZE had quite a few negative effects on memory consumption, as any small increase was immediately multiplied by the number of components and edges in a graph that shared this static buffer size. It was also shared among all graphs and sandboxes on the Server where the default was applied, regardless of whether or not a graph processed big records.
Now we are proud to say that with release 3.2, we have brought a significant improvement to this area. No more MAX_RECORD_SIZE trade-off decisions are necessary. Memory allocation for edge and component buffers is now smart: it grows with higher demands and stays low for low demands. We have stepped up from the plain ByteBuffer to our own new container for the serialized byte form of records – the CloverBuffer. It acts as a full replacement for ByteBuffer, but what sets it apart is its ability to grow. A CloverBuffer starts small but can transparently grow up to a predefined maximum limit (the newly introduced RECORD_INITIAL_SIZE and RECORD_LIMIT_SIZE) without any programmer intervention.
So although there is still one global setting for all, it only sets boundaries that cannot be crossed. Anything between those limits is allocated automatically, ensuring the smallest memory footprint for each transformation run based on its real-time needs, not estimated ones. Graphs that combine the processing of big and small records – e.g., a main data stream combined with a logging branch – use only as much memory per edge/component as the size of the data actually passing through them.
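The grow-on-demand idea itself is simple. Below is a minimal sketch of it – NOT the actual CloverBuffer implementation, just an illustration of a buffer that starts at a small initial capacity, doubles when it runs out of room, and refuses to grow past a hard limit:

```java
import java.nio.ByteBuffer;

// Minimal sketch of a grow-on-demand buffer (hypothetical class; the
// real CloverBuffer is more involved). Capacity doubles as needed,
// bounded by a hard upper limit.
public class GrowableBuffer {
    private ByteBuffer buffer;
    private final int limit;

    public GrowableBuffer(int initialCapacity, int limit) {
        this.buffer = ByteBuffer.allocate(initialCapacity);
        this.limit = limit;
    }

    // Ensure room for 'needed' more bytes, growing transparently.
    private void ensureRemaining(int needed) {
        if (buffer.remaining() >= needed) {
            return;
        }
        int required = buffer.position() + needed;
        if (required > limit) {
            throw new IllegalStateException(
                "Record too big: " + required + " > limit " + limit);
        }
        int newCapacity = buffer.capacity();
        while (newCapacity < required) {
            newCapacity = Math.min(newCapacity * 2, limit);
        }
        ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
        buffer.flip();
        bigger.put(buffer); // copy already-written bytes
        buffer = bigger;
    }

    public void put(byte[] bytes) {
        ensureRemaining(bytes.length);
        buffer.put(bytes);
    }

    public int capacity() {
        return buffer.capacity();
    }
}
```

A buffer that only ever sees small records never triggers growth, so a transformation pays only for the record sizes it actually encounters.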
The entire CloverETL code base has been refactored to use the new CloverBuffer. We recommend that everyone adopt it too, so that your transformations run seamlessly. In any case, you don’t need to worry – we keep our code backward compatible, so even without changing your code it still works with the new release.
For completeness, here is an example of old record container allocation:
ByteBuffer recordBuffer = ByteBuffer.allocateDirect(Defaults.Record.MAX_RECORD_SIZE);
This should now be substituted with the following code:
CloverBuffer recordBuffer = CloverBuffer.allocateDirect(Defaults.Record.RECORD_INITIAL_SIZE, Defaults.Record.RECORD_LIMIT_SIZE);
The constant Record.MAX_RECORD_SIZE is now deprecated and a pair of new constants has been introduced:
Record.RECORD_INITIAL_SIZE – the initial buffer size, currently 64 KB; it will probably be decreased in an upcoming release (http://bug.javlin.eu/browse/CL-2070) to minimize initial memory allocation for regular graphs.
Record.RECORD_LIMIT_SIZE – actually a one-to-one replacement for MAX_RECORD_SIZE (which is kept backward compatible for the sake of unmodified components); it sets the maximum upper bound per CloverBuffer instance. This can be virtually anything – for convenient early detection of real buffer overruns, it is set to 32 MB by default. Lowering or increasing this upper bound affects memory consumption only where there is a real need for such a big buffer – otherwise, buffers are kept at RECORD_INITIAL_SIZE and grown gradually toward the upper limit.
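A back-of-the-envelope calculation shows why this matters. The numbers below are hypothetical: a graph with 100 edge/component buffers where only a single edge actually carries ~32 MB records.

```java
// Hypothetical footprint comparison: old global MAX_RECORD_SIZE scheme
// vs. new initial-size/limit scheme. All figures are illustrative.
public class FootprintMath {
    static final long KB = 1024;
    static final long MB = 1024 * KB;

    // Old scheme: raising the single global MAX_RECORD_SIZE to fit the
    // biggest record inflates EVERY buffer in the graph.
    static long oldFootprint(int buffers, long maxRecordSize) {
        return buffers * maxRecordSize;
    }

    // New scheme: all buffers start at the initial size; only the ones
    // that actually see big records grow toward the limit.
    static long newFootprint(int buffers, int bigBuffers,
                             long initialSize, long grownSize) {
        return (buffers - bigBuffers) * initialSize + bigBuffers * grownSize;
    }

    public static void main(String[] args) {
        long oldBytes = oldFootprint(100, 32 * MB);              // 3200 MB
        long newBytes = newFootprint(100, 1, 64 * KB, 32 * MB);  // ~38 MB
        System.out.println(oldBytes / MB + " MB vs " + newBytes / MB + " MB");
    }
}
```

In this (made-up) scenario, the old scheme reserves 3200 MB up front while the new one settles around 38 MB – the single big edge grows, and the other 99 buffers stay at their initial 64 KB.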
As you can see, CloverBuffer now makes it possible to process bigger records with a smaller memory footprint, since only the buffers for edges or components that actually handle big records grow, while the others remain small.