Fei Zuo (University of South Carolina), Xiaopeng Li (University of South Carolina), Patrick Young (Temple University), Lannan Luo (University of South Carolina), Qiang Zeng (University of South Carolina), Zhexin Zhang (University of South Carolina)

Binary code analysis allows analyzing binary code without having access to the corresponding source code. It is widely used for vulnerability discovery, malware dissection, attack investigation, etc. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a rich area focused on processing text of various natural languages. We notice that binary code analysis and NLP share a lot of analogical topics, such as semantics extraction, summarization, and classification. This work utilizes these ideas to address two important code similarity comparison problems. (I) Given a pair of basic blocks for different
instruction set architectures, determining whether their semantics is similar or not; and (II) given a piece of code of interest, determining if it is contained in another piece of assembly code from a different architecture. The solutions to these two problems have many applications, such as cross-architecture code plagiarism detection, malware identification, and vulnerability discovery.

Despite the evident importance of Problem I, existing solutions are either inefficient or imprecise. Inspired by Neural Machine Translation (NMT), which is a new approach that tackles text across natural languages very well, we regard instructions as words and basic blocks as sentences, and propose a novel cross-(assembly)-lingual deep learning approach to solving the first problem, attaining high efficiency and precision. Regarding Problem II, many solutions have been proposed recently to solve this issue at the function level. However, performing cross-architecture code similarity comparison beyond function pairs is a new and more challenging endeavor. Employing our technique for cross-architecture basic-block comparison, we propose an effective solution to Problem II. We implement a prototype system and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency and scalability. And the case studies utilizing the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis.

View More Papers

Cybercriminal Minds: An investigative study of cryptocurrency abuses in...

Seunghyeon Lee (KAIST, S2W LAB Inc.), Changhoon Yoon (S2W LAB Inc.), Heedo Kang (KAIST), Yeonkeun Kim (KAIST), Yongdae Kim (KAIST), Dongsu Han (KAIST), Sooel Son (KAIST), Seungwon Shin (KAIST, S2W LAB Inc.)

Read More

Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation

Victor Le Pochat (imec-DistriNet, KU Leuven), Tom Van Goethem (imec-DistriNet, KU Leuven), Samaneh Tajalizadehkhoob (Delft University of Technology), Maciej Korczyński (Grenoble Alps University), Wouter Joosen (imec-DistriNet, KU Leuven)

Read More

Statistical Privacy for Streaming Traffic

Xiaokuan Zhang (The Ohio State University), Jihun Hamm (The Ohio State University), Michael K. Reiter (University of North Carolina at Chapel Hill), Yinqian Zhang (The Ohio State University)

Read More

Master of Web Puppets: Abusing Web Browsers for Persistent...

Panagiotis Papadopoulos (FORTH-ICS, Greece), Panagiotis Ilia (FORTH-ICS), Michalis Polychronakis (Stony Brook University, USA), Evangelos P. Markatos (FORTH-ICS, Greece), Sotiris Ioannidis (FORTH-ICS, Greece), Giorgos Vasiliadis (FORTH-ICS, Greece)

Read More