Shuo Wang (CSIRO's Data61 & Cybersecurity CRC, Australia), Mahathir Almashor (CSIRO's Data61 & Cybersecurity CRC, Australia), Alsharif Abuadbba (CSIRO's Data61 & Cybersecurity CRC, Australia), Ruoxi Sun (CSIRO's Data61), Minhui Xue (CSIRO's Data61), Calvin Wang (CSIRO's Data61), Raj Gaire (CSIRO's Data61 & Cybersecurity CRC, Australia), Surya Nepal (CSIRO's Data61 & Cybersecurity CRC, Australia), Seyit Camtepe (CSIRO's…
Traditional block/allow lists remain a significant defense against malicious websites, by limiting end-users' access to domain names. However, such lists are often incomplete and reactive in nature. In this work, we first introduce an expansion graph which creates organically grown Internet domain allow-lists based on trust transitivity by crawling hyperlinks. Then, we highlight the gap of monitoring nodes with such an expansion graph, where malicious nodes are buried deep along the paths from the compromised websites, termed as "on-chain compromise". The stealthiness (evasion of detection) and large-scale issues impede the application of existing web malicious analysis methods for identifying on-chain compromises within the sparsely labeled graph. To address the unique challenges of revealing the on-chain compromises, we propose a two-step integrated scheme, DoITrust, leveraging both individual node features and topology analysis: (i) we develop a semi-supervised suspicion prediction scheme to predict the probability of a node being relevant to targets of compromise (i.e., the denied nodes), including a novel node ranking approach as an efficient global propagation scheme to incorporate the topology information, and a scalable graph learning scheme to separate the global propagation from the training of the local prediction model, and (ii) based on the suspicion prediction results, efficient pruning strategies are proposed to further remove highly suspicious nodes from the crawled graph and analyze the underlying indicator of compromise. Experimental results show that DoITrust achieves 90% accuracy using less than 1% labeled nodes for the suspicion prediction, and its learning capability outperforms existing node-based and structure-based approaches. We also demonstrate that DoITrust is portable and practical. We manually review the detected compromised nodes, finding that at least 94.55% of them have suspicious content, and investigate the underlying indicator of on-chain compromise.