Scalable Top-n Local Outlier Detection
Yizhou Yan (Worcester Polytechnic Institute);Lei Cao (Massachusetts Institute of Technology);Elke Rundensteiner (Worcester Polytechnic Institute)
Abstract
Local Outlier Factor (LOF) method that labels all points with their respective LOF scores to indicate their status is known to be very effective for identifying outliers in datasets with a skewed distribution. Since outliers by definition are the absolute minority in a dataset, the concept of Top-N local outlier was proposed to discover the \textit{n} points with the largest LOF scores. The detection of the Top-N local outliers is prohibitively expensive, since it requires huge number of high complexity k-nearest neighbor ($k$NN) searches. In this work, we present the first scalable Top-N local outlier detection approach called TOLF. The key innovation of \textit{TOLF} is a multi-granularity pruning strategy that quickly prunes most points from the set of potential outlier candidates without computing their exact LOF scores or even without conducting any $k$NN search for them. Our customized density-aware indexing structure not only effectively supports the pruning strategy, but also accelerates the $k$NN search. Our extensive experimental evaluation on OpenStreetMap, SDSS, and TIGER datasets demonstrates the effectiveness of TOLF $-$ up to 35 times faster than the state-of-the-art methods.