A fundamental type of data mining query involves finding objects that are similar or discernible in different aspects. Mathematically, this means examining the correlation of clusters obtained according to different similarity metrics. Mirage is a software tool for studying cluster correlations.

For example, a physical object can be described by its size, shape, weight, position, orientation, color, chemical composition, etc. Each of these attributes can be represented by a single number (100 kg), a label (an oval shape, which can also be numerically coded), or an array of numbers (position in rectangular coordinates (x, y, z)). With sophisticated measurements, it is not uncommon for hundreds or thousands of numbers to be needed to describe a single object. A mathematical description of such an object is a point in a very high-dimensional space, where the number of dimensions equals the number of numerical descriptors. A collection of objects in such a high-dimensional space can form clouds of very complicated geometry and topology.

Clustering, or unsupervised learning, refers to a family of algorithms for finding natural groups in data. The two major kinds of clustering algorithms are model-based and similarity-based. Model-based algorithms assume that the data are generated according to an underlying probability distribution model, such as a finite mixture of Gaussian distributions. The goal of the algorithm is to estimate the parameters of the component distributions and the mixing factors. The result can be thought of as a soft partition of the data set. Similarity-based algorithms assume that the data can be compared meaningfully by a pre-defined metric of similarity, such as the Euclidean distance. The goal of the algorithm is to construct data structures that represent groupings of the objects according to this metric. Two commonly used data structures are partitional structures, where each point belongs to one and only one cluster, and hierarchical structures, where the collection of points is divided recursively into clusters of finer and finer scale. Clustering thus aims to find and describe the data clouds in high-dimensional spaces in terms of partitional or hierarchical structures.
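
As a minimal sketch of a similarity-based algorithm that produces a partitional structure, the following uses k-means with the Euclidean distance. The choice of k-means, the function name kmeans, and the toy points are assumptions made for illustration; nothing here describes the algorithms that Mirage itself implements.

```python
import math


def kmeans(points, centers, iterations=10):
    """Partition points into len(centers) clusters by Euclidean distance."""
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Two well-separated toy groups in a 2-D subspace (illustrative data only).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
```

Each entry of labels names exactly one cluster per point, which is what makes the result a partitional structure rather than a soft or hierarchical one.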

In practice, however, the objects under study are often described by many features that are not necessarily measured on a single scale. For example, differences in position descriptors are not directly comparable to differences in weights. Thus no meaningful metric can be defined that compares all features simultaneously. In addition, objects may be described by ordinal or categorical features for which the comparison operators are more restricted. Most clustering algorithms do not apply directly to measurement spaces of mixed scales, and there is no easy way to define and interpret a global similarity measure that is a function of all such measurements.

Nevertheless, natural metrics may exist for groups of related measurements. In the subspaces spanned by those measurements, the clustering algorithms are applicable. Thus a challenge is how to use clustering results from those subspaces in a larger context of study involving all the relevant measurements. Mirage is a tool designed to meet this challenge.
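
One simple way to compare clusterings computed in two different subspaces can be sketched as follows: cross-tabulate the cluster labels, then count the fraction of object pairs on which the two partitions agree (the Rand index). This is an illustrative assumption, not a description of how Mirage works; the function names and the label lists are hypothetical.

```python
from collections import Counter


def contingency(labels_a, labels_b):
    """Count the objects in each (cluster in A, cluster in B) combination."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    return Counter(zip(labels_a, labels_b))


def rand_index(labels_a, labels_b):
    """Fraction of object pairs that both partitions treat the same way."""
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two objects to compare pairs")
    agree = 0
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b  # both together, or both apart
            total += 1
    return agree / total


# Hypothetical labels for the same six objects, clustered once on
# position features and once on weight.
by_position = [0, 0, 0, 1, 1, 1]
by_weight = [0, 0, 1, 1, 1, 1]

table = contingency(by_position, by_weight)
score = rand_index(by_position, by_weight)
```

A score near 1 suggests that the grouping found in one subspace carries over to the other, while the contingency table shows which clusters split or merge between the two.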

Usage Scenarios

Finding Patterns in Observed Multimedia Data ... In data mining problems one can consider an observation to be a sample point of an unknown, implicit function that links different measurement spaces. The measurements may show different perspectives of the same objects, such as their weights, colors, sizes, image features, or time series of some variations. Using Mirage, one can study different projections of this function and query the proximity of points in each projection, for instance, to find whether objects of similar weights have similar dynamic attributes. This can be as simple as manually selecting a neighborhood in one projection and tracking the selection in another projection. Alternatively, cluster structures can be automatically computed from certain features and validated against other features.
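
The idea of selecting a neighborhood in one projection and tracking it in another can be sketched outside the tool as follows. The record layout, the feature names, and both helper functions are hypothetical; in Mirage the same operation is performed interactively in linked views.

```python
import math

# Illustrative objects, each described in two unrelated measurement spaces.
objects = [
    {"weight": 1.0, "position": (0.0, 0.0)},
    {"weight": 1.1, "position": (0.1, 0.1)},
    {"weight": 5.0, "position": (3.0, 3.0)},
    {"weight": 0.9, "position": (4.0, 4.0)},
]


def select_neighborhood(records, feature, center, radius):
    """Indices of records whose `feature` lies within `radius` of `center`."""
    selected = []
    for i, rec in enumerate(records):
        value = rec[feature]
        # Treat scalar features as one-dimensional points.
        point = value if isinstance(value, tuple) else (value,)
        if math.dist(point, center) <= radius:
            selected.append(i)
    return selected


def track(records, indices, feature):
    """Values of another feature for the same selected records."""
    return [records[i][feature] for i in indices]


# Select objects with weight near 1.0, then look at where they sit in space.
selected = select_neighborhood(objects, "weight", center=(1.0,), radius=0.2)
tracked = track(objects, selected, "position")
```

Here the last object is close to the others in weight but far from them in position, which is the kind of discrepancy that tracking a selection across projections is meant to expose.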

Parameter Studies in Simulations ... Mirage can be used to study the effects of various parameters in large-scale simulations in computational science and engineering. There, one starts with a mathematical model (implemented as an algorithm) that maps a set of input parameters to a corresponding set of output quantities. With each specific choice of the input parameters, a point of the mapping function is computed. Mirage can be used to study the sensitivity of the output quantities with respect to the input parameters, that is, how points in a neighborhood in one space map to points in another space.
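
A small sketch of such a parameter study: sample a neighborhood of input parameters, evaluate the model at each sample, and measure how far the outputs spread. The model function below is a made-up stand-in for a real simulation, and the helper names are hypothetical. The resulting rows of inputs and outputs are the kind of table one would then explore visually.

```python
import itertools


def model(params):
    """Hypothetical simulation: maps two input parameters to two outputs."""
    a, b = params
    return (a * a + b, 3.0 * b)


def neighborhood(center, radius, steps=3):
    """Grid of input points within +/- radius of center along each axis."""
    axes = [
        [c - radius + 2.0 * radius * k / (steps - 1) for k in range(steps)]
        for c in center
    ]
    return list(itertools.product(*axes))


def output_spread(f, inputs):
    """Range (max - min) of each output quantity over the sampled inputs."""
    outputs = [f(p) for p in inputs]
    return [max(col) - min(col) for col in zip(*outputs)]


inputs = neighborhood(center=(2.0, 1.0), radius=0.1)
rows = [p + model(p) for p in inputs]  # one row per run: inputs, then outputs
spread = output_spread(model, inputs)
```

For this toy model the first output varies by about 1.0 and the second by about 0.6 over the sampled neighborhood, so the first output is the more sensitive of the two near this point.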

Main Features

Mirage is a software tool designed for interactive exploration of the correlation of multiple partitional or hierarchical cluster structures arising in different contexts. The tool shows projected images of point classes and traversals of proximity structures in one-, two-, or higher-dimensional subspaces, in linked views of tables, histograms, scatter plots, parallel coordinates, or over an image background. It also provides facilities for arbitrary plot configuration, manual or automatic classification, and intuitive graphical querying. Analysis and visualization operations are controlled by a small, interpreted command language.

Data Visualization in Basic Plots ... Mirage can be used as a simple data visualization tool. Four basic views are provided: tables, histograms, scatter plots, and feature vector plots in parallel coordinates. The graphical plots can be opened on any numerical attributes or defined vectors.

Selection and Tracking ... Each of the basic plots supports manual selection of data using a mouse. Several operations can be performed on the selected data: they can be exported to an external file, marked with a chosen color, broadcast to all open views, or displayed in isolation in one of the basic plots.

Visualization and Traversal of Cluster Structures ... The most sophisticated use of Mirage involves traversals of cluster structures and the tracking of such traversals in other views.

Cluster Computation and Offline Analysis ... Mirage supports simple procedures that compute cluster structures which can be displayed inside the tool. These procedures can be run either online or offline.

Monitoring Dynamic Data ... Mirage can also be used to monitor dynamically generated data. New data can be added to the set by the command "addfrom file". Alternatively, data can be updated by the command "replacefrom file". More advanced interfaces to other programs can be built by making an external driver that periodically, or on some signal, writes new data to a specific file and sends the command "addfrom file" to the Mirage command interpreter. Some versions of Mirage support time series views to track an individual variable or a set of comparable variables.
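
The external driver described above might be sketched as follows. Several details are assumptions, because the text does not specify them: the data file is written as whitespace-separated rows, the word "file" in the command is taken to stand for the path of the data file, and send_command is a hypothetical callable supplied by the user, since the way a command reaches the Mirage command interpreter is not described here.

```python
import time


def run_driver(next_rows, data_path, send_command, rounds, interval=1.0):
    """Periodically write new rows to data_path and request that they be added.

    next_rows    -- callable returning the newest rows (sequences of values)
    data_path    -- file that the driver overwrites on every round
    send_command -- hypothetical callable that delivers one command string
                    to the Mirage command interpreter (transport not shown)
    rounds       -- number of updates to perform before returning
    interval     -- seconds to wait between updates
    """
    for _ in range(rounds):
        rows = next_rows()
        # Assumed file layout: one object per line, values separated by spaces.
        with open(data_path, "w") as handle:
            for row in rows:
                handle.write(" ".join(str(value) for value in row) + "\n")
        # Assumed command form: "addfrom" followed by the data file path.
        send_command(f"addfrom {data_path}")
        time.sleep(interval)
```

Passing print as send_command shows the commands that would be issued without needing a running copy of the tool. A driver that reacts to a signal instead of a timer would keep the same write-then-command order.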

User's Guide

See "A Graphical Guide to the Usage of Mirage". Also, refer to the manual pages available from the "Help" Menu inside the software. These pages are also posted on the web site, and may be updated from time to time. Or, just open a data set and play with it! The interface is intended to be intuitive and easy to learn.

Tips and Warnings

To delay running into "out of memory" errors, start the Java process with a large maximum heap size by using the command line option "-Xmx[size]" (for example, "-Xmx2g" for a 2 GB heap).