Many real-world KDD expeditions involve investigation of relationships between variables in different, heterogeneous databases. For example, one may be interested in investigating relationships between customer satisfaction and a company's in-house maintenance and sales records. Satisfaction surveys are generally conducted on a periodic basis and only involve a relatively small sample of customers. On the other hand, maintenance and sales records are collected continuously, providing massive amounts of information on all customers.
We present a dynamic programming technique for linking records in multiple heterogeneous databases using loosely defined fields that allow free-style verbatim entries.
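As a rough illustration of how dynamic programming can match free-style verbatim fields, the sketch below uses the classic Levenshtein edit distance. This is an assumed stand-in, not necessarily the technique developed in the paper. The function names (`edit_distance`, `link_records`), the normalization step, and the `max_ratio` threshold are all hypothetical choices made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed cleaning step)."""
    return " ".join(text.lower().split())


def link_records(left: dict, right: dict, max_ratio: float = 0.3) -> dict:
    """Link each key in `left` to the closest key in `right`.

    A link is kept only if the edit distance, divided by the longer
    string's length, is at most `max_ratio` (a hypothetical threshold).
    """
    links = {}
    for lkey, ltext in left.items():
        a = normalize(ltext)
        best_key, best_ratio = None, None
        for rkey, rtext in right.items():
            b = normalize(rtext)
            longest = max(len(a), len(b)) or 1  # avoid dividing by zero
            ratio = edit_distance(a, b) / longest
            if best_ratio is None or ratio < best_ratio:
                best_key, best_ratio = rkey, ratio
        if best_ratio is not None and best_ratio <= max_ratio:
            links[lkey] = best_key
    return links


# Made-up records: survey entries on the left, maintenance entries on the right.
survey = {"s1": "Acme Corp.", "s2": "Globex"}
maintenance = {"m1": "ACME  Corp", "m2": "Zenith Ltd"}
links = link_records(survey, maintenance)
```

In this toy data, `s1` links to `m1` despite the differences in case, spacing, and punctuation, while `s2` has no sufficiently close candidate and is left unlinked. The all-pairs comparison is quadratic in the number of records, so a massive database would need some form of candidate blocking first, which this sketch omits.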
After linking the databases, the objective is to discover potentially interesting relationships among the variables. One major difficulty that often arises from such data is the variable-length nature of many variables of interest. To overcome this problem, we develop an interestingness measure based on non-parametric randomization tests, which can be used for mining potentially useful relationships among variables. This measure uses distributional characteristics of historical events, hence accommodating variable-length records in a natural way. We also describe a graphical method for visualizing relationships, based on Trellis displays.
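To make the idea of a randomization-based interestingness measure concrete, here is a minimal sketch of a two-sample permutation test. It is an assumed illustration, not the measure defined in the paper: each variable-length event history is reduced to its median (a hypothetical choice of distributional summary), and the test statistic is the absolute difference in group means. The names `summarize_history` and `permutation_p_value` are invented for this example.

```python
import random
from statistics import mean, median


def summarize_history(events):
    """Reduce a variable-length event history to one number (assumed: the median)."""
    return median(events)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often a random relabeling yields a difference in
    group means at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up repair-time histories of different lengths for two customer groups.
dissatisfied = [[9, 12, 11], [14], [10, 13], [15, 12, 16, 14], [11, 11], [13, 17, 15]]
satisfied = [[2, 3], [1], [4, 5, 3], [2], [6, 5], [3, 3, 4, 2]]

a = [summarize_history(h) for h in dissatisfied]
b = [summarize_history(h) for h in satisfied]
p_value = permutation_p_value(a, b)
```

A small `p_value` suggests that the relationship between group membership and the summarized history is unlikely under random relabeling, so it could be flagged as potentially interesting. Because the test relies only on relabeling, it makes no parametric assumption about the distribution of the summaries.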
These methods are model-free, robust to the presence of outliers, and scale up to databases of arbitrary size. As an illustration, we include a successful application of the proposed methodology to a real-world data mining problem at Lucent Technologies.
Key words: dynamic-programming; interestingness-measure; record-linkage; scale-up; visualization
The statistical and computational methods we developed for this project are described in the paper "Methods for Linking and Mining Massive Heterogeneous Databases", Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998.