Pattern Recognition Methodology
Exploratory Analysis of Point Proximity
in Subspaces
We consider clustering as computation of a structure of
proximity relationships within a data set in a feature space or its
subspaces. We propose a data structure to represent such relationships,
and show that, despite unavoidable arbitrariness in the clustering
algorithms, constructive uses of their results can be made by
studying correlations between multiple proximity structures computed from the
same data. A software tool, Mirage, is constructed to facilitate such
explorations.
Complexity of Classification Problems
We studied a number of measures that characterize the difficulty of a
classification problem. We compared a set of real world problems to
random combinations of points in this measurement space and found
that real problems contain structures that are significantly different
from the random sets. Distributions of problems in this space
show that there exist at least two independent factors affecting
a problem's difficulty. We suggest using this space to describe a
classifier's domain of competence. This can guide static and dynamic
selection of classifiers for specific problems as well as subproblems
formed by confinement, projection, and transformation of the feature
vectors.
Random Decision Forests
We propose a method to construct a decision tree based classifier
that maintains highest accuracy on training data and improves on
generalization accuracy as it grows in complexity, thereby solving
the dilemma between overfitting and achieving maximum accuracy.
The classifier consists of multiple trees constructed systematically
by pseudo-randomly selecting subsets of components of the feature
vector, that is, a forest of trees constructed in randomly chosen
subspaces. The subspace method is found to be superior to
single-tree classifiers and, in many cases, alternative forest
construction methods by training set subsampling. An interesting special
case is when each tree has only one level of split that partitions the
subspace by a Voronoi tessellation, i.e., when each tree is a
nearest-neighbor classifier.
Stochastic Discrimination
Learning in everyday life is often accomplished by
synthesizing the results derived from making many
random guesses, and assessing their accuracy on given examples.
A formal analysis of this process by Eugene Kleinberg in 1990
has resulted in a distinctive method
for classifier design -- stochastic discrimination (SD).
The method constructs an accurate classifier by
combining a large number of very weak discriminators that are generated
essentially at random. An important advantage is that classifiers
designed this way can be made as accurate as possible on both training
and test data at the same time, i.e., it does not suffer from
the problem of overtraining. We looked into several algorithmic
implementations of the method focusing on different conditions
required by the theory. (joint work with Eugene M. Kleinberg)
Distribution Maps
A difficult problem in classification is to represent the
class-conditional distributions concisely and faithfully.
We propose a way of mapping such distributions and using the maps
to construct a similarity metric. A classifier based on this metric
can achieve low error rates and produce reliable confidence scores.
For applications to arbitrary domains, a method is proposed that
automatically constructs feature transformations suitable for such mappings.
Adaptive Coordination of Multiple Classifiers
Parallel use of multiple classifiers has been known to improve
classification accuracy. Yet the additional demand on computational
resources often prohibits its application to every input.
With concerns for both accuracy and speed, we investigate strategies
to coordinate and combine alternative classifiers that can adapt to
certain conditions of the input. In an application to symbol
recognition, we designed such strategies based on a detailed analysis
of the classifiers' performances on images synthesized using a
parameterized defect model. Individual accuracies, comparative
strengths, correlation with defect parameters, confusion
probabilities, and agreement of decisions are studied to provide
support for the strategies.
Decision Combination in Multiple Classifier Systems
Motivated by the realities of a practical problem -- combining word
classifiers of vastly different natures and for thousands of classes,
I studied methods for combining decisions generated by multiple
classifiers. I identified several core difficulties such as
the need to bring the decisions to a comparable scale with sufficient
resolution to keep the secondary guesses, and the requirements on the
component classifiers to make the combination better than each
individual. I used rank orders to represent class decision and
the logistic regression model to combine contributions of different
classifiers, thereby linking the problem to conventional statistical
techniques for parameter estimation, test of model fitness, and redundancy
identification. I also investigated methods for dynamically
selecting a good set of classifiers for each test pattern.
These methods were tested in applications on degraded
machine-printed characters and words from large lexicons, resulting in
substantial improvement in classification accuracy.
With these methods, we shifted the focus of the field from
concentrating on the competition among specific statistical,
syntactic, or structural approaches to viewing all these as
potentially contributing components in a combined system.
Applications
Modeling and Simulation of Control
Dynamics in Optical Transport Systems
The design and analysis of control strategies for sophisticated
optical line systems requires an understanding of complex system
dynamics that involves time-dependent interactions of many components.
This need cannot be met by static simulations using steady-state
models of individual components. We developed a system simulation
software that is driven by models of discrete events coupled to
continuous physical layer simulators. The simulator models, with
detailed time synchronization, signal and noise propagation along a
line system consisting of multiple independently controlled
transmission elements and monitoring devices, in response to events
such as channel add/drops, component failures and recoveries, and
message passing. The software supports experimentation in many
different modes, and provides logging and visualization of all
relevant effects. It serves as a versatile platform for evaluating
control algorithms, facilitates systematic exploration of design
parameters, and allows assessment of system reliability related to
component failures.
(joint work with Lawrence Cowsar, Roland Freund,
Leon Vardapetyan, and Todd Salamon)
Reading without Knowing Symbol Shapes
Conventional text recognition systems use a symbol shape classifier
trained with huge image databases. Still they cannot deal with
unseen fonts. We attempted to abandon shape training entirely
and decode the text directly as a cryptogram.
Imaging noise, character segmentation errors, and imperfect shape
clustering make it much harder than a simple substitution cipher.
Our method starts with clustering isolated symbol shapes, makes
guesses on the largest clusters, and eventually recovers the symbol
identities by sophisticated constraint satisfaction algorithms.
Tested on 200 faxed business letters, the method recognized almost
80% of the words for 2/3 of the cases.
(joint work with George Nagy)
Adaptive Methods for Text Recognition
This family of methods attempt to fine tune a general recognition procedure
to data from a specific source. An example is to learn about
the dominant font type and noise patterns on a given page,
and use such knowledge to build a page-specific recognizer.
Making use of a well-known linguistic observation that over 50% of
words in a typical English page are contained in a very small lexicon
of size less than 150, we developed a bootstrapping strategy
beginning with recognizing the shapes of those most common words.
Text Categorization using Decision Forests
Text categorization is useful for indexing documents for
information retrieval, filtering parts for document understanding,
and summarizing contents of documents of special interests.
We describe a text categorization task and an experiment using
documents from the Reuters and OHSUMED collections.
We applied the Decision Forest classifier and
compared its accuracies to those of C4.5 and kNN classifiers,
using both category dependent and category independent term selection
schemes. It is found that Decision Forest outperforms both C4.5 and kNN
in all cases, and that category dependent term selection yields better
accuracies. Performances of all three classifiers degrade from the
Reuters collection to the OHSUMED collection, but Decision Forest
remains to be superior. (joint work with Hao Chen)
Image Enhancement by Clustering and Averaging
Proper display and accurate recognition of document images
are often hampered by degradations caused by poor scanning or
transmission conditions. We propose a method to enhance such
degraded document images for better display quality and recognition
accuracy. The essence of the method is in finding and averaging
bitmaps of the same symbol that are scattered across a text page.
Outline descriptions of the symbols are then obtained that can be
rendered at arbitrary resolution. (joint work with John Hobby)
Simulation Studies on Character Recognition
Many obstacles to progress in image pattern recognition result from the
fact that per-class distributions are often too irregular to be
well-approximated by simple analytical functions.
We believe that there are fundamental disadvantages in exclusively
relying on implicit descriptions of image recognition problems.
Simulation studies offer one way to circumvent these obstacles.
In this study, we focus on an explicit, quantitative, stochastic
model of degradations that occur in images of documents
as a result of the physics of printing and imaging.
We attempt to specify a recognition problem precisely using the model, and then
study various issues related to the problem. (joint work with Henry Baird)
Estimation of Intrinsic Error Rate
This is an experiment in estimating the Bayes error of a difficult
two--class problem. The Bayes error gives the `intrinsic difficulty' of
the problem since it is the minimum error achievable by any classification
method. For many realistically complex problems,
deriving this analytically appears to be hopeless,
so we approach the task empirically.
We proceed first by expressing the problem precisely in
terms of ideal prototype images and the image defect model,
and then by carrying out the estimation on pseudorandomly
simulated data. The study of the data reveals many interesting statistics,
which allow the prediction of the worst--case time/space
requirements for any given classifier performance.
(joint work with Henry Baird)
Asymptotic Accuracy of Classifiers
Poor quality --- sparse or unrepresentative ---
training data is widely suspected to be one cause of disappointing accuracy of
isolated--character classification in OCR
machines. We conjecture that, for many trainable classification techniques,
it is the dominant factor affecting accuracy.
In this study, we tested this conjecture by comparing the
asymptotic accuracy of three classifiers on the same two--class
problem. Using the defect model, the problem is represented as a
stochastic source of an indefinitely long sequence of simulated
images labeled with ground truth.
Using this sequence, we were able to train all three classifiers to
high and statistically indistinguishable asymptotic accuracies.
This result suggests that the quality of training data is indeed
the dominant factor affecting accuracy.
(joint work with Henry Baird)
Systematic Evaluation of Classifiers
In this study we perform an experiment in which a classifier
is evaluated on synthetic character images created using the defect model.
The use of synthetic data permits control of
the quality of the test data and quantitative
analysis of the effects of image quality on OCR accuracy.
For that particular classifier, we show that good performance
--- accuracy above a given threshold ---
occurs within contiguous regions of the space
defined by the range of key parameters of the defect model.
(joint work with Henry Baird)
Word-Based Recognition Methods
Poorly segmented (touching, fragmented, etc) characters
often lead to major difficulties in automatic recognition.
We developed alternative techniques that operate on the
whole word as unit.
The method is especially suitable for degraded text
such as those created on FAX machines or high-speed scanners.
The method can be used in combination with conventional approaches
for superior results.