AspellLookupTable is a commercial lookup table which has been around since CloverETL 2.6. Because Aspell is a free software spell checker, you might be wondering what it is used for in CloverETL. In fact, AspellLookupTable does not perform any spell checking at all, it “just” allows you to lookup data records with keys similar to the one you provide. This may be useful e.g. when looking for a street whose name is misspelled to a certain extent.
Similarity of String Keys
The similarity of two string keys is measured in terms of edit operations used to “transform” a misspelled key to the correct one. At the moment, CloverETL supports five edit operations: change of character case, character transposition, deletion of a character, insertion of a character, and finally character replacement. Each edit operation is assigned an edit cost which reflects its “seriousness”. By default, edit costs are set to 10, 90, 95, 95, and 100, respectively. As you can see, the difference in character case is not considered as serious as the character transposition. However, it is possible to tweak the edit costs as you wish.
If you add all edit costs of the minimum number of edit operations required to “transform” the misspelled key to the correct one, you get an edit distance of the two keys. In other words, you get the similarity measure of the keys. However, the edit distance itself is just a number which will vary for different keys. Therefore you need to somehow determine which data records should be included in the result set and which should be thrown away. And that’s why you also need to specify a spelling threshold. This threshold is 230 by default but may be changed to best suit your needs.
An example is worth a thousand words, so let’s assume that one of the keys is
"Hello". If you use
"helo" for lookup, the edit distance will be 105 because you had to change the case of
'h' (cost = 10) and to insert another
'l' (cost = 95). If you used
"Halllo" instead, the edit distance would be 195 (
'a' was replaced by
'e' and one
'l' had to be removed). Both these queries would return a non-empty result set as their edit distance is below the spelling threshold. On the other hand,
"Bye" as a lookup key would result in an empty result set as its edit distance is way above the threshold.
AspellLookupTable in Action
AspellLookupTable is a great candidate for address cleansing. Imagine you have a huge list of incorrectly spelled street names and you need to determine whether these streets actually exist in a certain town or city. You also have a directory of correctly spelled street names with several additional information, e.g. the town name, its region, etc. In the end, you want a list of all the streets, and corresponding towns, that do exist. Does it seem difficult? No, it’s a piece of cake with CloverETL:
First you need to set up your AspellLookupTable instance to load the street directory. Then you just load the invalid addresses and “pair” them with existing streets using the lookup table. When you have all the streets joined with whatever data you need, you may perform some further processing and then store the result in a flat file. Pretty simple, huh?