GPT-like Pre-Training on Unlabeled System Logs for Malware Detection

In recent years, self-supervised language modeling techniques, such as those behind GPT-like models, have shown great success in natural language processing tasks without requiring supervision from domain experts to learn language semantics. In this talk, we explore how these techniques transfer to system logs and share our methodology for pre-training a Transformer model on unlabeled logs for malware detection.

Modern infrastructures generate vast amounts of system logs suitable for cybersecurity analytics, but only a fraction of these logs are labeled or annotated for specific events or anomalies. Our experiments demonstrate that pre-training the model on unlabeled system logs improves performance on the malware detection task compared to training on labeled data alone. Moreover, we show that the pre-trained model learns patterns similar to those a human engineer would consider relevant for detecting malware.

These findings highlight the potential of pre-training GPT-like models on system logs for cybersecurity applications, and demonstrate the benefits of self-supervised learning approaches in domains where labeled data is scarce. Overall, our work contributes to the growing body of literature on applying language modeling techniques beyond natural language processing and opens up new avenues for research in the field of cybersecurity.

The talk will be structured as follows:

  1. (5-10 min) Detection engineering in modern infrastructure. Systems such as SIEMs apply signatures to system telemetry (sysmon, auditd, kube-audit, etc.), while non-trivial analytics like anomaly detection remain too noisy and immature for production use. Meanwhile, terabytes of logs are stored and are well suited for advanced analytics (a toy signature sketch follows this outline).

  2. (15 min) Advances in Natural Language Understanding (NLU). Why do systems like ChatGPT succeed? We will cover the timeline of modern NLU methods: (1) advances in neural network architectures (1-D convolutions, RNNs, Transformers); (2) self-supervised pre-training techniques (illustrated in the second sketch after this outline).

  3. (15 min) Our experimental setup: (a) malware behavioral telemetry (from the Speakeasy emulator) and how we process it into model inputs; (b) the Transformer model architecture; (c) details of masked pre-training (see the third sketch after this outline). We then compare models pre-trained on plain, unlabeled data against conventional approaches trained on labeled data alone.

  4. (10 min) We explore what the pre-trained and later fine-tuned model learns. As an example, we take a backdoor sample and show that the Transformer's attention activations highlight an anti-debugging timing technique based on the GetTickCount API, matching the signatures that human reverse engineers write for it (the final sketch after this outline shows one way to extract such attention rankings).

  5. (5 min) Conclusions. We speculate on the potential of these techniques in cybersecurity.
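
To make the outline concrete, a few hedged sketches follow; all code is illustrative Python, not our production pipeline. First, for item 1, a toy example of the kind of signature a SIEM applies to process-creation telemetry. The field names and the rule (an Office application spawning a shell interpreter) are invented for illustration; real-world rules are typically written in formats such as Sigma.

```python
# Toy, hypothetical SIEM-style signature applied to a sysmon-like
# process-creation event. Field names and the rule itself are invented.

SUSPICIOUS_PARENTS = {"winword.exe", "excel.exe", "outlook.exe"}
SHELL_CHILDREN = {"powershell.exe", "cmd.exe", "wscript.exe"}

def basename(path: str) -> str:
    """Return the lowercased file name from a Windows-style path."""
    return path.lower().rsplit("\\", 1)[-1]

def matches_rule(event: dict) -> bool:
    """Fire when an Office process spawns a shell interpreter."""
    return (basename(event.get("ParentImage", "")) in SUSPICIOUS_PARENTS
            and basename(event.get("Image", "")) in SHELL_CHILDREN)

event = {
    "EventID": 1,  # sysmon process-creation event
    "ParentImage": r"C:\Program Files\Microsoft Office\WINWORD.EXE",
    "Image": r"C:\Windows\System32\cmd.exe",
}
print(matches_rule(event))  # True
```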
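
Second, for item 2, a minimal sketch of the self-supervised objective that powers GPT-like models: predict the next token from the preceding ones, so the data labels itself. PyTorch is assumed; the vocabulary size, model dimensions, and random data below are placeholders.

```python
import torch
import torch.nn as nn

# Minimal causal (next-token) language-modeling objective: predict token
# t+1 from tokens <= t, so no human labels are needed. All sizes and data
# here are placeholders for illustration.
vocab_size, d_model = 100, 32
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 16))   # 8 random sequences of length 16
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift targets by one position

# Additive causal mask: position i may not attend to positions > i.
seq_len = inputs.size(1)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

hidden = encoder(embed(inputs), mask=causal_mask)
logits = lm_head(hidden)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()  # self-supervised gradient step: the data is its own label
```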
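
Third, for item 3, a sketch of how emulator telemetry could be turned into token sequences and masked for pre-training. The report structure shown is a simplification invented for this example, not the exact Speakeasy output schema, and the masking recipe is a standard masked-language-model setup rather than our exact configuration.

```python
import random
import torch

# Hypothetical preprocessing: turn an emulator report (structure simplified,
# not the exact Speakeasy schema) into a sequence of API-name tokens, then
# mask a fraction of them for masked-language-model pre-training.
report = {"apis": [
    {"api_name": "kernel32.CreateFileA"},
    {"api_name": "kernel32.WriteFile"},
    {"api_name": "kernel32.GetTickCount"},
    {"api_name": "ws2_32.connect"},
]}

vocab = {"[PAD]": 0, "[MASK]": 1}
def token_id(name: str) -> int:
    """Assign each API name a stable integer id, growing the vocab on demand."""
    return vocab.setdefault(name, len(vocab))

tokens = [token_id(call["api_name"]) for call in report["apis"]]

def mask_tokens(ids, mask_id=1, p=0.15):
    """Replace ~p of positions with [MASK]; labels stay -100 elsewhere so
    the loss is computed only on masked positions (PyTorch's ignore_index)."""
    input_ids, labels = list(ids), [-100] * len(ids)
    for i in range(len(ids)):
        if random.random() < p:
            labels[i] = input_ids[i]
            input_ids[i] = mask_id
    return torch.tensor(input_ids), torch.tensor(labels)

input_ids, labels = mask_tokens(tokens)
# input_ids feed the Transformer; cross-entropy against `labels`
# (ignore_index=-100) trains the model without any malware labels.
```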
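
Finally, for item 4, a hypothetical sketch of one way to inspect what the model attends to: average the attention each token receives across heads and query positions, then rank tokens. The tensor shapes and random weights are placeholders; the analysis presented in the talk may differ in detail.

```python
import torch

# Hypothetical attention inspection: given attention weights of shape
# (heads, seq_len, seq_len) for one layer and the API-name tokens, rank
# tokens by how much attention they receive on average.
tokens = ["CreateFileA", "GetTickCount", "Sleep", "GetTickCount", "connect"]
attn = torch.rand(4, len(tokens), len(tokens)).softmax(dim=-1)  # placeholder

received = attn.mean(dim=0).mean(dim=0)  # average over heads, then queries
for score, name in sorted(zip(received.tolist(), tokens), reverse=True):
    print(f"{name:>14s}  {score:.3f}")

# In the talk's backdoor example, the paired GetTickCount calls implementing
# the timing-based anti-debugging check stand out in such a ranking.
```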

About the Speakers