Statistical Fraud Detection

Our goal has been to design statistically principled fraud detection algorithms that can keep up with the stream of transactions, use all the available information about each customer, and detect fraud reliably and accurately with acceptable false alarm rates, even if the customer compromised by fraud has made only one or two transactions in the past.

The quickest way to process the data transaction-by-transaction is to build the algorithms on top of software that is designed to process transactions for billing. So, instead of starting a bill for a new customer and then updating it with each new transaction, we start a signature of predicted usage behavior for each customer, update it with each non-suspicious transaction, score each transaction for fraud using the customer's predicted behavior as the baseline, and accumulate transaction scores into account fraud scores that are updated with each new transaction. All of this happens with every transaction, so it has to be fast enough to keep up with the flow of transactions. Not only do we have to cope with tremendous variability across the customer base and high turnover among customers (so that at any point in time there is little information on most customers), but we also have to live with the limited numerical capabilities of the computers that handle transaction processing for billing.
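
The per-transaction flow just described can be summarized in a short Python sketch. It is only an illustration of the control flow under assumed interfaces: the names (process, SUSPICIOUS, ALARM, and the function arguments that stand in for the real signature code) are hypothetical, not the production system.

    # Hypothetical sketch of the per-transaction loop: a signature and an
    # account-level fraud score are kept for each customer and refreshed
    # with every transaction.

    signatures = {}       # customer id -> predictive model of legitimate use
    account_scores = {}   # customer id -> accumulated fraud score

    SUSPICIOUS = 3.0      # per-transaction score above which we do not update
    ALARM = 10.0          # account score that triggers a review

    def process(customer, transaction, new_signature, score, update):
        """new_signature, score, and update stand in for the real components:
        they initialize, score against, and update a customer signature."""
        sig = signatures.setdefault(customer, new_signature(transaction))
        s = score(sig, transaction)                 # compare to predicted behavior
        account_scores[customer] = account_scores.get(customer, 0.0) + s
        if s < SUSPICIOUS:
            update(sig, transaction)                # learn only from non-suspicious use
        return account_scores[customer] > ALARM     # True -> hand off for review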

Every step of fraud detection is a fascinating, nonstandard statistical problem. Some background on fraud, from both a conventional perspective and our statistical perspective, is described in the paper Detecting Fraud in the Real World [PDF][PostScript]. Here are a few of the major statistical issues that arise in fraud detection, and a sketch of our approach to coping with them.

  1. Massive Model Selection

      Our algorithms are based on tracking the behavior of each customer with a multivariate probability distribution that can be used to predict the customer's next legitimate call. We call our estimate of this distribution a signature because it captures (ideally, at least) all that is known about the customer's current transaction behavior. But monitoring all possible variables and all their possible interactions for all customers in complete detail would require too much space and probably too much processing; moreover, monitoring the same set of marginal and conditional distributions for all customers simplifies processing. Thus, signature design involves choosing a set of marginal and conditional distributions that is best for detecting fraud subject to constraints on space. Statistically speaking, we have to select the form of the model that represents each customer's current behavior, choosing the one that is best for distinguishing fraudulent from legitimate behavior. The criterion for "best" that we adopt is "best for many and good for most customers" rather than "best averaged over all customers". The details are described in the paper Reducing Transaction Databases, Without Lagging Behind the Data or Losing Information [PDF][PostScript]. (A small, purely illustrative sketch of what a space-constrained signature might look like appears after this list.)
       
  2. Initialization of Signatures

      To catch fraud quickly, we need to be able to assign a meaningful initial signature, or predictive distribution, to new customers. We have designed new initialization methods that are based on at most a few transactions for the customer. These methods resemble clustering or customer segmentation, but with two important differences. First, each marginal and conditional distribution in the signature is initialized separately, which gives a huge number of possible multivariate predictive distributions. This is justifiable because the model selection stage retains the important dependencies among signature components. Second, the metrics for optimizing initial distributions are based on the idea of "best for many and good for most" rather than on "best averaged over all customers". Our methods are described in more detail in the paper Reducing Transaction Databases, Without Lagging Behind the Data or Losing Information [PDF][PostScript]. (A rough, hypothetical sketch of this component-wise initialization appears after this list.)
       
  3. Incremental Updating Algorithms

      The predictive model for each customer is updated with each transaction that the customer makes. Thus, the customer eventually moves from a customer segment defined by the initial distributions to a personalized segment that reflects its individual behavior. Standard dynamic updating algorithms (like exponentially weighted moving averaging) can often be used if the variable being updated can be modeled as a random draw from its signature component; a minimal example of such an update is sketched after this list. There are some hard questions that updating raises, though, especially when updating has to be fast and space efficient.
       
      1. Chronological variables

         These are periodic timing variables, such as day-of-week or hour-of-day. They are not observed at random but "in order": all of a week's Monday transactions have to be observed before any of that week's Tuesday transactions, for example. We have used a dynamic Poisson process to derive an approximation to a posterior mean that is almost as easy to compute as a linear estimate and that predicts well on both real and simulated data (a simplified sketch appears after this list). This work is described in the paper Estimating Millions of Dynamic Patterns in Real-Time [PDF][PostScript], which appeared in JASA, March 2001.
         
      2. Quantiles

         Incremental estimation of quantiles is difficult because quantiles are not linear estimates, and a sample quantile is unstable for small samples and undefined for a sample of size one (the relevant sample size when updating occurs transaction-by-transaction). We have successfully used a variant of stochastic approximation that we call exponentially weighted stochastic approximation to track quantiles dynamically (sketched after this list). This work is described in the paper Incremental Quantile Estimation for Massive Tracking [PDF][PostScript], which appears in the Proceedings of KDD2000.
         
      3. Updating Quantized Probabilities

         With limited space, each reported probability can assume only a finite number of values (say 256 if each probability is allowed one byte). To avoid unstable estimates, however, the weight used for exponentially weighted moving averaging often has to be smaller than the difference between adjacent probability values; if so, standard exponential weighting can never move the initial probabilities, no matter what the data say. To avoid stagnant probabilities, we use a form of stochastic quantization that is unbiased (sketched after this list). This work is described in the paper Reducing Transaction Databases, Without Lagging Behind the Data or Losing Information [PDF][PostScript].
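
The sketches below are purely illustrative Python, not the production algorithms. The first illustrates the kind of space-constrained signature discussed under item 1: the same handful of marginal and conditional distributions is kept for every customer, each stored as a short probability vector. The particular components (day-of-week and call-duration bins) are invented for the example; deciding which components are worth keeping is the model selection problem described in the paper.

    # Hypothetical signature layout: every customer gets the same small set
    # of marginal and conditional components, chosen once during model
    # selection and stored compactly.

    from dataclasses import dataclass, field

    def uniform(n):
        return [1.0 / n] * n

    @dataclass
    class Signature:
        # Marginal distribution of day-of-week (7 bins).
        day_of_week: list = field(default_factory=lambda: uniform(7))
        # Marginal distribution of call duration, binned into 8 ranges.
        duration_bin: list = field(default_factory=lambda: uniform(8))
        # Conditional distribution of duration bin given time of day.
        duration_given_daypart: dict = field(default_factory=lambda: {
            part: uniform(8) for part in ("day", "evening", "night")})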
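
The component-wise initialization of item 2 might look roughly like the hypothetical sketch below: each component of a new customer's signature is seeded independently from a segment-level distribution chosen using the first transaction. The segment tables and the keying on single transaction fields are inventions for the example; the real method and the "best for many, good for most" optimization are described in the paper.

    # Hypothetical component-wise initialization: each signature component is
    # seeded from a segment-level distribution chosen using the customer's
    # first transaction, independently of the other components.

    SEGMENT_DAY_OF_WEEK = {            # invented segment-level distributions
        "business": [0.18, 0.18, 0.18, 0.18, 0.18, 0.05, 0.05],
        "residential": [0.12, 0.12, 0.12, 0.12, 0.14, 0.19, 0.19],
    }
    SEGMENT_DURATION = {
        "short_caller": [0.4, 0.3, 0.15, 0.1, 0.05],
        "long_caller": [0.1, 0.15, 0.2, 0.25, 0.3],
    }

    def initial_components(first_transaction):
        """Pick an initial distribution for each component separately, so the
        number of possible initial signatures is the product of the number of
        choices per component."""
        day_seg = "business" if first_transaction["weekday"] < 5 else "residential"
        dur_seg = "short_caller" if first_transaction["duration"] < 60 else "long_caller"
        return {
            "day_of_week": list(SEGMENT_DAY_OF_WEEK[day_seg]),
            "duration_bin": list(SEGMENT_DURATION[dur_seg]),
        }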
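
The exponentially weighted moving average mentioned under item 3 needs only one multiply-add per transaction and stores no history, which is what makes it attractive here. A minimal version, with a purely illustrative weight, is:

    def ewma_update(old, new, w=0.05):
        """Exponentially weighted moving average of a signature component.
        The weight w (0.05 here, chosen only for illustration) controls how
        quickly old behavior is forgotten."""
        return (1.0 - w) * old + w * new

    def ewma_update_pmf(pmf, observed_bin, w=0.05):
        """The same idea for a discrete component: shift a little probability
        mass toward the bin that the new transaction fell into."""
        return [(1.0 - w) * p + (w if i == observed_bin else 0.0)
                for i, p in enumerate(pmf)]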
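
For chronological variables, the sketch below shows only the naive baseline that the "in order" constraint forces on simple smoothing: wait until a week closes, then smooth each weekday's count with an exponential weight. The dynamic Poisson posterior-mean approximation developed in the JASA paper improves on this by updating as transactions arrive; it is not reproduced here.

    def update_weekly_rates(rates, counts_this_week, w=0.2):
        """Naive end-of-week smoothing of day-of-week calling rates.
        rates[d] is the current estimate of the expected number of calls on
        weekday d; counts_this_week[d] is the count observed in the week
        that just ended.  The weight w is illustrative only."""
        return [(1.0 - w) * r + w * c
                for r, c in zip(rates, counts_this_week)]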
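
For quantile components, a bare-bones stochastic-approximation step conveys the idea: nudge the current estimate down when a new value falls below it and up when it falls above it, in proportions that balance out at the target quantile. The published exponentially weighted stochastic approximation also tracks the local density to scale the step; that refinement is omitted from this sketch.

    def sa_quantile_update(q, x, p, step=0.1):
        """One stochastic-approximation step toward the p-th quantile.
        x is the new observation; if x <= q more often than a fraction p of
        the time, q drifts down, otherwise up.  The step size is illustrative."""
        return q + step * (p - (1.0 if x <= q else 0.0))

Applied transaction by transaction, this tracks a slowly drifting quantile without storing any past values.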
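
The stagnation problem with one-byte probabilities, and the randomized fix, can be illustrated with generic unbiased stochastic rounding (the paper's exact scheme may differ in detail): after an exponentially weighted update, round to a neighboring grid point at random so that the stored value is correct in expectation instead of always snapping back to its old value.

    import random

    LEVELS = 255   # one byte per probability -> grid of 256 values in [0, 1]

    def stochastic_quantize(p):
        """Round p to the 1/255 grid at random so that the expected stored
        value equals p.  Deterministic rounding with an EWMA weight smaller
        than the grid spacing would always return the old grid point, so the
        probability would never move."""
        scaled = p * LEVELS
        lower = int(scaled)                  # grid index just below p
        frac = scaled - lower                # distance above that grid point
        k = lower + (1 if random.random() < frac else 0)
        return k / LEVELS

For example, stored = stochastic_quantize(ewma_update(stored, observed, w=0.01)) keeps the one-byte representation while leaving the update unbiased.
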
For further information about this work, contact Diane Lambert, José Pinheiro or Don Sun.
       