Shicheng Wang (Tsinghua University), Menghao Zhang (Beihang University & Infrawaves), Yuying Du (Information Engineering University), Ziteng Chen (Southeast University), Zhiliang Wang (Tsinghua University & Zhongguancun Laboratory), Mingwei Xu (Tsinghua University & Zhongguancun Laboratory), Renjie Xie (Tsinghua University), Jiahai Yang (Tsinghua University & Zhongguancun Laboratory)
RDMA is being widely used from private data center applications to multi-tenant clouds, which makes RDMA security gain tremendous attention. However, existing RDMA security studies mainly focus on the security of RDMA systems, and the security of the coupled traffic control mechanisms (represented by PFC and DCQCN) in RDMA networks is largely overlooked. In this paper, through extensive experiments and analysis, we demonstrate that concurrent short-duration bursts can cause drastic performance loss on flows across multiple hops via the interaction between PFC and DCQCN. And we also summarize the vulnerabilities between the performance loss and the burst peak rate, as well as the duration. Based on these vulnerabilities, we propose the LoRDMA attack, a low-rate DoS attack against RDMA traffic control mechanisms. By monitoring RTT as the feedback signal, LoRDMA can adaptively 1) coordinate the bots to different target switch ports to cover more victim flows efficiently; 2) schedule the burst parameters to cause significant performance loss efficiently. We conduct and evaluate the LoRDMA attack at both ns-3 simulations and a cloud RDMA cluster. The results show that compared to existing attacks, the LoRDMA attack achieves higher victim flow coverage and performance loss with much lower attack traffic and detectability. And the communication performance of typical distributed machine learning training applications (NCCL Tests) in the cloud RDMA cluster can be degraded from 18.23% to 56.12% under the LoRDMA attack.