Pattern Recognition Methodology
Exploratory Analysis of Point Proximity
We consider clustering as computation of a structure of proximity relationships within a data set in a feature space or its subspaces. We propose a data structure to represent such relationships, and show that, despite unavoidable arbitrariness in the clustering algorithms, constructive uses of their results can be made by studying correlations between multiple proximity structures computed from the same data. A software tool, Mirage, is constructed to facilitate such explorations.
Complexity of Classification Problems
We studied a number of measures that characterize the difficulty of a classification problem. We compared a set of real world problems to random combinations of points in this measurement space and found that real problems contain structures that are significantly different from the random sets. Distributions of problems in this space show that there exist at least two independent factors affecting a problem's difficulty. We suggest using this space to describe a classifier's domain of competence. This can guide static and dynamic selection of classifiers for specific problems as well as subproblems formed by confinement, projection, and transformation of the feature vectors.
Random Decision Forests
We propose a method to construct a decision tree based classifier that maintains highest accuracy on training data and improves on generalization accuracy as it grows in complexity, thereby solving the dilemma between overfitting and achieving maximum accuracy. The classifier consists of multiple trees constructed systematically by pseudo-randomly selecting subsets of components of the feature vector, that is, a forest of trees constructed in randomly chosen subspaces. The subspace method is found to be superior to single-tree classifiers and, in many cases, alternative forest construction methods by training set subsampling. An interesting special case is when each tree has only one level of split that partitions the subspace by a Voronoi tessellation, i.e., when each tree is a nearest-neighbor classifier.
Learning in everyday life is often accomplished by synthesizing the results derived from making many random guesses, and assessing their accuracy on given examples. A formal analysis of this process by Eugene Kleinberg in 1990 has resulted in a distinctive method for classifier design -- stochastic discrimination (SD). The method constructs an accurate classifier by combining a large number of very weak discriminators that are generated essentially at random. An important advantage is that classifiers designed this way can be made as accurate as possible on both training and test data at the same time, i.e., it does not suffer from the problem of overtraining. We looked into several algorithmic implementations of the method focusing on different conditions required by the theory. (joint work with Eugene M. Kleinberg)
A difficult problem in classification is to represent the class-conditional distributions concisely and faithfully. We propose a way of mapping such distributions and using the maps to construct a similarity metric. A classifier based on this metric can achieve low error rates and produce reliable confidence scores. For applications to arbitrary domains, a method is proposed that automatically constructs feature transformations suitable for such mappings.
Adaptive Coordination of Multiple Classifiers
Parallel use of multiple classifiers has been known to improve classification accuracy. Yet the additional demand on computational resources often prohibits its application to every input. With concerns for both accuracy and speed, we investigate strategies to coordinate and combine alternative classifiers that can adapt to certain conditions of the input. In an application to symbol recognition, we designed such strategies based on a detailed analysis of the classifiers' performances on images synthesized using a parameterized defect model. Individual accuracies, comparative strengths, correlation with defect parameters, confusion probabilities, and agreement of decisions are studied to provide support for the strategies.
Decision Combination in Multiple Classifier Systems
Motivated by the realities of a practical problem -- combining word classifiers of vastly different natures and for thousands of classes, I studied methods for combining decisions generated by multiple classifiers. I identified several core difficulties such as the need to bring the decisions to a comparable scale with sufficient resolution to keep the secondary guesses, and the requirements on the component classifiers to make the combination better than each individual. I used rank orders to represent class decision and the logistic regression model to combine contributions of different classifiers, thereby linking the problem to conventional statistical techniques for parameter estimation, test of model fitness, and redundancy identification. I also investigated methods for dynamically selecting a good set of classifiers for each test pattern. These methods were tested in applications on degraded machine-printed characters and words from large lexicons, resulting in substantial improvement in classification accuracy. With these methods, we shifted the focus of the field from concentrating on the competition among specific statistical, syntactic, or structural approaches to viewing all these as potentially contributing components in a combined system.
Modeling and Simulation of Control
Dynamics in Optical Transport Systems
The design and analysis of control strategies for sophisticated optical line systems requires an understanding of complex system dynamics that involves time-dependent interactions of many components. This need cannot be met by static simulations using steady-state models of individual components. We developed a system simulation software that is driven by models of discrete events coupled to continuous physical layer simulators. The simulator models, with detailed time synchronization, signal and noise propagation along a line system consisting of multiple independently controlled transmission elements and monitoring devices, in response to events such as channel add/drops, component failures and recoveries, and message passing. The software supports experimentation in many different modes, and provides logging and visualization of all relevant effects. It serves as a versatile platform for evaluating control algorithms, facilitates systematic exploration of design parameters, and allows assessment of system reliability related to component failures. (joint work with Lawrence Cowsar, Roland Freund, Leon Vardapetyan, and Todd Salamon)
Reading without Knowing Symbol Shapes
Conventional text recognition systems use a symbol shape classifier trained with huge image databases. Still they cannot deal with unseen fonts. We attempted to abandon shape training entirely and decode the text directly as a cryptogram. Imaging noise, character segmentation errors, and imperfect shape clustering make it much harder than a simple substitution cipher. Our method starts with clustering isolated symbol shapes, makes guesses on the largest clusters, and eventually recovers the symbol identities by sophisticated constraint satisfaction algorithms. Tested on 200 faxed business letters, the method recognized almost 80% of the words for 2/3 of the cases. (joint work with George Nagy)
Adaptive Methods for Text Recognition
This family of methods attempt to fine tune a general recognition procedure to data from a specific source. An example is to learn about the dominant font type and noise patterns on a given page, and use such knowledge to build a page-specific recognizer. Making use of a well-known linguistic observation that over 50% of words in a typical English page are contained in a very small lexicon of size less than 150, we developed a bootstrapping strategy beginning with recognizing the shapes of those most common words.
Text Categorization using Decision Forests
Text categorization is useful for indexing documents for information retrieval, filtering parts for document understanding, and summarizing contents of documents of special interests. We describe a text categorization task and an experiment using documents from the Reuters and OHSUMED collections. We applied the Decision Forest classifier and compared its accuracies to those of C4.5 and kNN classifiers, using both category dependent and category independent term selection schemes. It is found that Decision Forest outperforms both C4.5 and kNN in all cases, and that category dependent term selection yields better accuracies. Performances of all three classifiers degrade from the Reuters collection to the OHSUMED collection, but Decision Forest remains to be superior. (joint work with Hao Chen)
Image Enhancement by Clustering and Averaging
Proper display and accurate recognition of document images are often hampered by degradations caused by poor scanning or transmission conditions. We propose a method to enhance such degraded document images for better display quality and recognition accuracy. The essence of the method is in finding and averaging bitmaps of the same symbol that are scattered across a text page. Outline descriptions of the symbols are then obtained that can be rendered at arbitrary resolution. (joint work with John Hobby)
Simulation Studies on Character Recognition
Many obstacles to progress in image pattern recognition result from the fact that per-class distributions are often too irregular to be well-approximated by simple analytical functions. We believe that there are fundamental disadvantages in exclusively relying on implicit descriptions of image recognition problems. Simulation studies offer one way to circumvent these obstacles. In this study, we focus on an explicit, quantitative, stochastic model of degradations that occur in images of documents as a result of the physics of printing and imaging. We attempt to specify a recognition problem precisely using the model, and then study various issues related to the problem. (joint work with Henry Baird)
Estimation of Intrinsic Error Rate
This is an experiment in estimating the Bayes error of a difficult two--class problem. The Bayes error gives the `intrinsic difficulty' of the problem since it is the minimum error achievable by any classification method. For many realistically complex problems, deriving this analytically appears to be hopeless, so we approach the task empirically. We proceed first by expressing the problem precisely in terms of ideal prototype images and the image defect model, and then by carrying out the estimation on pseudorandomly simulated data. The study of the data reveals many interesting statistics, which allow the prediction of the worst--case time/space requirements for any given classifier performance. (joint work with Henry Baird)
Asymptotic Accuracy of Classifiers
Poor quality --- sparse or unrepresentative --- training data is widely suspected to be one cause of disappointing accuracy of isolated--character classification in OCR machines. We conjecture that, for many trainable classification techniques, it is the dominant factor affecting accuracy. In this study, we tested this conjecture by comparing the asymptotic accuracy of three classifiers on the same two--class problem. Using the defect model, the problem is represented as a stochastic source of an indefinitely long sequence of simulated images labeled with ground truth. Using this sequence, we were able to train all three classifiers to high and statistically indistinguishable asymptotic accuracies. This result suggests that the quality of training data is indeed the dominant factor affecting accuracy. (joint work with Henry Baird)
Systematic Evaluation of Classifiers
In this study we perform an experiment in which a classifier is evaluated on synthetic character images created using the defect model. The use of synthetic data permits control of the quality of the test data and quantitative analysis of the effects of image quality on OCR accuracy. For that particular classifier, we show that good performance --- accuracy above a given threshold --- occurs within contiguous regions of the space defined by the range of key parameters of the defect model. (joint work with Henry Baird)
Word-Based Recognition Methods
Poorly segmented (touching, fragmented, etc) characters often lead to major difficulties in automatic recognition. We developed alternative techniques that operate on the whole word as unit. The method is especially suitable for degraded text such as those created on FAX machines or high-speed scanners. The method can be used in combination with conventional approaches for superior results.