The first beta version of CloverETL Data Profiler was released in October, and since then we have been working on improvements for the second beta version, which was released at the end of last year. Besides fixing bugs and adding a few new features, we also worked on performance optimization of the profiling metrics. This article describes this improvement and how profiling is interconnected with the CloverETL Engine.
The CloverETL Data Profiler processes input data as a stream. All metrics read input values as they arrive from the source (a CSV file, Excel sheet, or database table), and once all input has been read, each metric returns its result, which is then stored in the results database.
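To illustrate, here is a minimal sketch of what such a streaming metric might look like. The StreamMetric interface, the MaxMetric class, and the method names are illustrative only; they are not the Profiler's actual API:

```java
// A minimal sketch of a streaming metric, assuming a simple accumulator
// interface; StreamMetric, update() and result() are illustrative names,
// not the Profiler's actual API.
interface StreamMetric<T, R> {
    void update(T value); // called once for each input value as it is read
    R result();           // called after the last value has been read
}

// Maximum of a stream of strings: needs only O(1) state, so no sorting
// or buffering of the input is required.
class MaxMetric implements StreamMetric<String, String> {
    private String max;

    @Override
    public void update(String value) {
        if (max == null || value.compareTo(max) > 0) {
            max = value;
        }
    }

    @Override
    public String result() {
        return max; // stored in the results database when reading finishes
    }
}
```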
For most metrics (e.g., minimum and maximum value) this approach works just fine. However, certain metrics cannot work like this: not only do they have higher computation-time requirements, but they also need all the input data available at once. For large data sets, keeping everything in main memory is not feasible, so external memory has to be used.
In the first beta version, we used the Profiler's internal SQL database to store all the values for these memory-consuming metrics. The data were first inserted into the SQL database, and a database query was then used to calculate the result of the metric. This allowed profiling of large data sets, larger than the amount of available memory.
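As a rough illustration of that store-then-query approach, the following sketch uses an in-memory H2 database; the table name, schema, and metric are made up for the example and do not reflect the Profiler's internal database:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StoreThenQuery {
    public static void main(String[] args) throws SQLException {
        // An in-memory H2 database stands in for the Profiler's internal SQL database.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:profiler")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE field_values (field_value VARCHAR(255))");
            }
            // Phase 1: insert every input value; this is where the overhead came from.
            try (PreparedStatement ins =
                    conn.prepareStatement("INSERT INTO field_values VALUES (?)")) {
                for (String v : new String[] {"a", "b", "a", "c"}) {
                    ins.setString(1, v);
                    ins.executeUpdate();
                }
            }
            // Phase 2: a query computes the metric, e.g. the number of distinct values.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT COUNT(DISTINCT field_value) FROM field_values")) {
                rs.next();
                System.out.println("distinct values: " + rs.getLong(1));
            }
        }
    }
}
```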
However, inserting the data into the database caused a large overhead, and computing the final result query consumed a lot of system resources. We were not happy with the performance and architecture of this solution, so we decided to redesign it and use the powerful CloverETL Engine to get the job done.
We exploit the fact that the memory-consuming metrics can still be computed on a stream of data if the incoming data are sorted. In the improved version, we use the CloverETL Engine to first sort the values using ExtSort components, and then we analyze the sorted data as a stream. With this approach, no other external facilities (such as an SQL database) are used during profiling.
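To see why sorting makes this possible, consider counting distinct values: on a sorted stream, the metric only needs to remember the previous value, so it runs in a single pass with constant memory. The class and method names below are illustrative, not part of the engine:

```java
import java.util.Iterator;
import java.util.List;

// On sorted input, equal values form contiguous runs, so counting distinct
// values needs only the previous value instead of a set of all values seen.
class DistinctCountOnSorted {
    static long distinctCount(Iterator<String> sortedValues) {
        long count = 0;
        String previous = null;
        while (sortedValues.hasNext()) {
            String current = sortedValues.next();
            if (previous == null || !current.equals(previous)) {
                count++; // a new run starts: one more distinct value
            }
            previous = current;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> sorted = List.of("a", "a", "b", "c", "c", "c");
        System.out.println(distinctCount(sorted.iterator())); // prints 3
    }
}
```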
The overall performance of CloverETL Data Profiler has improved, especially for large data sets. Even with the full set of metrics enabled, we are now able to analyze 4 GB of data with 30 fields in 30 minutes. Memory consumption has also improved significantly.
Finally, in the Profiler GUI, we have marked the metrics that require sorting, and therefore take longer to compute, with a small clock icon. These metrics are not enabled by default.
The following picture shows in detail the different phases of metrics calculation. First, we calculate the metrics that can work on unsorted streams of data. In the following phases, each field gets its own separate phase in which we run an ExtSort component and connect it to a component that calculates the metrics that expect sorted data. We use Rollup with a custom transform written in Java to calculate these metrics. Rollup can produce a variable number of output records for any number of incoming records.
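The following is a loose, self-contained sketch of the rollup pattern on sorted input; it is not the actual CloverETL RecordRollup API, and all names are made up for the example. Each run of equal values forms a group, and the transform emits one frequency record per group, so the number of output records varies with the data:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Rollup-style processing of sorted input: runs of equal values are groups,
// and each group may produce any number of output records (here, one each).
class FrequencyRollup {
    record Frequency(String value, long count) {}

    static List<Frequency> rollup(Iterator<String> sortedValues) {
        List<Frequency> out = new ArrayList<>();
        String group = null;
        long count = 0;
        while (sortedValues.hasNext()) {
            String v = sortedValues.next();
            if (group != null && !v.equals(group)) {
                out.add(new Frequency(group, count)); // finish group: emit a record
                count = 0;                            // init next group's accumulator
            }
            group = v;
            count++;                                  // update the current group
        }
        if (group != null) {
            out.add(new Frequency(group, count));     // finish the last group
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sorted = List.of("a", "a", "b", "c", "c", "c");
        rollup(sorted.iterator()).forEach(System.out::println);
    }
}
```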
Another performance improvement in the second beta version of CloverETL Data Profiler also affects the metrics that do not require the input data to be sorted. The Profiler now makes better use of available CPUs, and less CPU time is spent on context switching. This results in a speedup of up to 15% for data with a large number of fields. Also, the simpler structure of the internal CloverETL graph results in a significantly lower memory footprint.
In summary, in the second beta version of CloverETL Data Profiler, we have improved both performance and memory consumption by fully exploiting the capabilities of the CloverETL Engine.