Kam1n0: MapReduce-based Assembly Clone Search for Reverse Engineering
Steven H. H. Ding, McGill University; Benjamin C. M. Fung*, McGill University; Philippe Charland, Defence Research and Development Canada
Assembly code analysis is one of the critical processes for detecting and proving software plagiarism and software patent infringements when the source code is unavailable. It is also commonly used to discover exploits and vulnerabilities in existing software. However, it is a manual and time-consuming process, even for experienced reverse engineers. An effective and efficient assembly code clone search engine can greatly reduce the effort of this process, since it can identify the cloned parts that have been previously analyzed. The assembly code clone search problem belongs to the field of software engineering. However, it strongly depends on practical nearest neighbor search techniques in data mining and databases. By closely collaborating with reverse engineers and Defence Research and Development Canada (DRDC), we study, from the perspective of data mining, the concerns and challenges that make existing assembly code clone search approaches impractical. We propose a new variant of the locality-sensitive hashing (LSH) scheme and combine it with graph matching to address these challenges. We implement an integrated assembly clone search engine called Kam1n0. It is the first clone search engine that can efficiently identify the subgraph clones of a given query assembly function from a large assembly code repository. Kam1n0 is built upon the Apache Spark computation framework and Cassandra-like key-value distributed storage. A deployed demo system is publicly available. Extensive experimental results suggest that Kam1n0 is accurate, efficient, and scalable for handling a large volume of assembly code.
Filed under: Graph Mining and Social Networks