Zhongjie Ba (Zhejiang University and McGill University), Tianhang Zheng (University of Toronto), Xinyu Zhang (Zhejiang University), Zhan Qin (Zhejiang University), Baochun Li (University of Toronto), Xue Liu (McGill University), Kui Ren (Zhejiang University)
Motion sensors on current smartphones have been exploited for audio eavesdropping due to their sensitivity to vibrations. However, this threat is considered low-risk because of two widely acknowledged limitations: First, unlike microphones, motion sensors can only pick up speech signals traveling through a solid medium. Thus the only feasible setup reported previously is to use a smartphone gyroscope to eavesdrop on a loudspeaker placed on the same table. The second limitation comes from a common sense that these sensors can only pick up a narrow band (85-100Hz) of speech signals due to a sampling ceiling of 200Hz. In this paper, we revisit the threat of motion sensors to speech privacy and propose AccelEve, a new side-channel attack that employs a smartphone's accelerometer to eavesdrop on the speaker in the same smartphone. Specifically, it utilizes the accelerometer measurements to recognize the speech emitted by the speaker and to reconstruct the corresponding audio signals. In contrast to previous works, our setup allows the speech signals to always produce strong responses in accelerometer measurements through the shared motherboard, which successfully addresses the first limitation and allows this kind of attacks to penetrate into real-life scenarios. Regarding the sampling rate limitation, contrary to the widely-held belief, we observe up to 500Hz sampling rates in recent smartphones, which almost covers the entire fundamental frequency band (85-255Hz) of adult speech. On top of these pivotal observations, we propose a novel deep learning based system that learns to recognize and reconstruct speech information from the spectrogram representation of acceleration signals. This system employs adaptive optimization on deep neural networks with skip connections using robust and generalizable losses to achieve robust recognition and reconstruction performance. Extensive evaluations demonstrate the effectiveness and high accuracy of our attack under various settings.