XML is a markup language that has been around for some years now. It originated in the world of documents, where it is used in web hypertext, word processors, and other representations. Today it is very popular in many areas, including data exchange. The reasons are simple: the format is straightforward, well defined, and easily transferable across platforms. Unlike proprietary and binary formats, XML can be easily read and modified by users. It also represents structured hierarchical data, which can be very difficult to express in plain CSV format. XML is self-descriptive, which greatly improves the user’s ability to understand the data and eliminates the need for a separate data format description and parsing instructions.
XML is often used to transport data between potentially incompatible systems, which creates the task of parsing, storing, and eventually processing data in this format. CloverETL provides powerful tools to accomplish this task.
One of the components that provides XML parsing is XMLXPathReader. The user simply defines the mapping of each data element or attribute to a given CloverETL field. Behind the scenes, the component uses a DOM parser, which allows the user to include general XPath expressions in the mapping definition.
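The idea of mapping elements and attributes to fields through XPath over an in-memory tree can be sketched in a few lines of Python. This is only an illustration using the standard library, not CloverETL's API or mapping syntax; the sample document, the `MAPPING` dictionary, and the `read_records` helper are all hypothetical names chosen for the example, and `xml.etree.ElementTree` supports only a limited XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, small enough to load as a whole.
XML = """<companies>
  <company id="1"><name>Acme</name><city>Prague</city></company>
  <company id="2"><name>Globex</name><city>Brno</city></company>
</companies>"""

# Hypothetical mapping: output field name -> path relative to the record element.
# A leading "@" marks an attribute; anything else is a child element path.
MAPPING = {"company_id": "@id", "company_name": "name", "city": "city"}


def read_records(xml_text, record_xpath, mapping):
    """Parse the whole document into a tree, then map each record to a dict."""
    root = ET.fromstring(xml_text)  # the entire document is held in memory
    records = []
    for element in root.findall(record_xpath):
        record = {}
        for field, path in mapping.items():
            if path.startswith("@"):
                record[field] = element.get(path[1:])
            else:
                record[field] = element.findtext(path)
        records.append(record)
    return records


records = read_records(XML, "./company", MAPPING)
for record in records:
    print(record)
```

Because the tree is fully built before any record is produced, this approach is convenient for flexible path expressions but its memory use grows with the size of the input.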
In practice, users will often encounter very large XML files, which typically follow a standard structure: records representing a given entity (company, person, etc.) repeated many times. It is quite common for such data sources to reach tens or even hundreds of gigabytes. At that size, DOM parsing is unsuitable because the whole document cannot be held in memory. For this reason, another CloverETL XML parsing component comes in handy: XMLExtract. It handles records individually, and individual records are usually quite small, at least small enough to be processed in memory.
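The difference between tree-based and record-at-a-time parsing can be illustrated with Python's standard `iterparse`, which reports elements as they are read. This is a minimal sketch of the general streaming technique, not a description of how XMLExtract is implemented; the `stream_records` function and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def stream_records(source, record_tag):
    """Yield one dict per record element without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, element in context:
        if event == "end" and element.tag == record_tag:
            # The record is complete here; copy out its child values.
            yield {child.tag: child.text for child in element}
            # Discard records that were already processed to free memory.
            root.clear()


# Hypothetical sample; a real source would be a large file opened in binary mode.
SAMPLE = (
    b"<people>"
    b"<person><name>Ann</name><age>31</age></person>"
    b"<person><name>Bob</name><age>45</age></person>"
    b"</people>"
)

rows = list(stream_records(io.BytesIO(SAMPLE), "person"))
for row in rows:
    print(row)
```

Only one record needs to fit in memory at a time, so the same loop works whether the input holds two records or many millions.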
In XMLExtract, the user can define how each element is mapped to a CloverETL record at every level of the XML structure. XMLExtract can also include a parent key at each structure level, which allows the entire data structure to be completely reconstructed later. If the XML does not itself contain a unique key, one can easily be generated using a CloverETL sequence object.
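The parent-key idea can be sketched as follows: nested records are flattened into separate outputs, and each child carries the key of its parent so the hierarchy can be rebuilt later. In this hypothetical example the XML has no key of its own, so `itertools.count` stands in for a sequence object. The element names and the `extract_with_parent_keys` helper are assumptions made for illustration, and the document is parsed as a whole only to keep the sketch short.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical two-level structure without any unique key in the data.
XML = """<companies>
  <company><name>Acme</name>
    <employee><name>Ann</name></employee>
    <employee><name>Bob</name></employee>
  </company>
  <company><name>Globex</name>
    <employee><name>Eve</name></employee>
  </company>
</companies>"""


def extract_with_parent_keys(xml_text):
    """Return (companies, employees); each employee holds its company's key."""
    company_seq = itertools.count(1)  # stand-in for a generated sequence
    companies, employees = [], []
    for company in ET.fromstring(xml_text).iter("company"):
        company_id = next(company_seq)
        companies.append({"company_id": company_id,
                          "name": company.findtext("name")})
        for employee in company.findall("employee"):
            # The parent key links the child record back to its company.
            employees.append({"company_id": company_id,
                              "name": employee.findtext("name")})
    return companies, employees


companies, employees = extract_with_parent_keys(XML)
print(companies)
print(employees)
```

Joining the two outputs on `company_id` recovers which employee belonged to which company, which is the reconstruction the parent key makes possible.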
XML data and its basic integrity rules can be specified very well using XML Schema, which today is a standard part of well-defined data exchange. If you use XML Schema, CloverETL provides a very convenient visual drag-and-drop editor that helps the user build an XML mapping:
This screenshot represents an XML mapping which defines how XML and Clover fields are mapped. This mapping can also be displayed as text:
To give an example where these methods were essential: CloverETL successfully completed a master data consolidation and matching project for an international insurance company. The XML Schemas were very complex, containing hundreds of different XML element types. The volume of data was over a hundred gigabytes, describing tens of millions of customers as organizations and 4-5 million customers as persons. One of the many tasks assigned to CloverETL was to read and store this vast amount of XML data, a task in which it performed substantially better thanks to fast sequential processing of the XML.