Signature Limits: An Entire Map of Clone Features and their Discovery in Nearly Linear Time
Abstract
We address the problem of creating entire and complete maps of software code clones (copy features in data) in a corpus of binary artifacts of unknown provenance. We report on a practical methodology, which employs enhanced suffix data structures and partial orderings of clones to compute a compact representation of most interesting clones features in data. The enumeration of clone features is useful for malware triage and prioritization when human exploration, testing and verification is the most costly factor. We further show that the enhanced arrays may be used for discovery of provenance relations in data and we introduce two distinct Jaccard similarity coefficients to measure code similarity in binary artifacts. We illustrate the use of these tools on real malware data including a retro-diction experiment for measuring and enumerating evidence supporting common provenance in Stuxnet and Duqu. The results indicate the practicality and efficacy of mapping completely the clone features in data.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.