KDD Papers

Automatic Application Identification from Billions of Files

Kyle Soska (CMU); Christopher Gates (Symantec); Kevin Roundy (Symantec); Nicolas Christin (CMU)


Grouping a set of binary files into the pieces of software they belong to is highly desirable (e.g., for software profiling, malware detection, or enterprise audits). Unfortunately, it is also extremely challenging: there is no uniformity in the ways different applications rely on different files, in the ways binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we argue that, by combining information gleaned from a large number of endpoints (millions of hosts), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches," and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark and show that it processes billions of files in a matter of hours, which makes daily processing feasible. We further show that our system identifies which files belong to which application with very high precision and adequate recall.
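To make the idea of sketching and neighbor search concrete, the following is a minimal illustrative sketch, not the system described in the paper. It assumes, purely for illustration, that each file is summarized by the set of hosts on which it was observed, that the sketch is a MinHash signature, and that similarity is the estimated Jaccard index. The abstract does not say which sketch or similarity the authors use. The neighbor search here is exact and brute-force, standing in for the approximate k-nearest neighbor search mentioned above. All function names and file names are hypothetical.

```python
import hashlib

NUM_HASHES = 64  # sketch length; an arbitrary choice for this illustration


def _hash(seed, item):
    """Deterministic 64-bit hash of `item`, salted with `seed`."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(items, num_hashes=NUM_HASHES):
    """Summarize a non-empty set of strings as a fixed-length MinHash signature."""
    items = set(items)
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of matching positions, which estimates the Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def nearest_neighbors(query, sketches, k):
    """Exact brute-force k-NN over sketches (a stand-in for approximate k-NN)."""
    scored = [
        (name, estimated_jaccard(sketches[query], sketch))
        for name, sketch in sketches.items()
        if name != query
    ]
    # Highest similarity first; ties broken by name for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:k]


# Hypothetical data: the hosts on which each file was seen.
hosts_a = {f"host{i}" for i in range(100)}
hosts_b = {f"host{i}" for i in range(10, 110)}    # overlaps heavily with hosts_a
hosts_c = {f"host{i}" for i in range(500, 600)}   # disjoint from hosts_a

sketches = {
    "a.dll": minhash_sketch(hosts_a),
    "b.dll": minhash_sketch(hosts_b),
    "c.dll": minhash_sketch(hosts_c),
}
print(nearest_neighbors("a.dll", sketches, k=2))
```

Under these assumptions, files that co-occur on mostly the same hosts receive similar signatures, so comparing short fixed-length sketches replaces comparing the full host sets. A brute-force scan like this one would not scale to billions of files, which is presumably why an approximate search is needed at that scale.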