GPT-like Pre-Training on Unlabeled System Logs for Malware Detection

In recent years, self-supervised language modeling techniques, such as those behind GPT-like models, have shown great success in natural language processing tasks without requiring supervision from domain experts to learn language semantics. In this talk, we explore how these techniques transfer to system logs and share our methodology for pre-training a Transformer model on unlabeled logs for malware detection.

Modern infrastructures generate vast amounts of system logs suitable for cybersecurity analytics, but only a fraction of these logs are labeled or annotated for specific events or anomalies. Our experiments demonstrate that pre-training the model on unlabeled system logs improves performance on the malware detection task compared to training on labeled data alone. Moreover, we show that the pre-trained model learns patterns similar to those a human engineer would consider relevant for detecting malware.

These findings highlight the potential of pre-training GPT-like models on system logs for cybersecurity applications, and demonstrate the benefits of self-supervised learning approaches in domains where labeled data is scarce. Overall, our work contributes to the growing body of literature on applying language modeling techniques beyond natural language processing and opens up new avenues for research in the field of cybersecurity.

The talk will be structured as follows:

  1. (5-10 min) Detection engineering in modern infrastructure. Systems such as SIEMs apply signatures to system telemetry (sysmon, auditd, kube-audit, etc.), while non-trivial analytics like anomaly detection remain too noisy and immature for production use. Meanwhile, terabytes of logs are stored and are well suited for advanced analytics (a toy signature sketch follows this outline).

  2. (15 min) Advances in Natural Language Understanding (NLU). Why do systems like ChatGPT succeed? We will cover the timeline of modern NLU methods: (1) advances in neural network architectures (1-D convolutions, RNNs, Transformers); (2) self-supervised pre-training techniques (illustrated in the second sketch after this outline).

  3. (15 min) Our experimental setup: (a) malware behavioral telemetry (from the Speakeasy emulator) and how we process it into model inputs; (b) the Transformer model architecture; (c) details of masked pre-training (see the third sketch after this outline). We then compare models pre-trained on plain, unlabeled data against conventional approaches trained on labeled data alone.

  4. (10 min) We explore what the pre-trained and later fine-tuned model learns. As an example, we take a backdoor sample and show that the Transformer's attention activations highlight an anti-debugging timing technique based on the GetTickCount API, matching the signatures that human reverse engineers write for it (the final sketch after this outline shows one way to extract such attention rankings).

  5. (5 min) Conclusions. We speculate on the potential of these techniques in cybersecurity.
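
To make the outline concrete, a few hedged sketches follow; all code is illustrative Python, not our production pipeline. First, for item 1, a toy example of the kind of signature a SIEM applies to process-creation telemetry. The field names and the rule (an Office application spawning a shell interpreter) are invented for illustration; real-world rules are typically written in formats such as Sigma.

```python
# Toy, hypothetical SIEM-style signature applied to a sysmon-like
# process-creation event. Field names and the rule itself are invented.

SUSPICIOUS_PARENTS = {"winword.exe", "excel.exe", "outlook.exe"}
SHELL_CHILDREN = {"powershell.exe", "cmd.exe", "wscript.exe"}

def basename(path: str) -> str:
    """Return the lowercased file name from a Windows-style path."""
    return path.lower().rsplit("\\", 1)[-1]

def matches_rule(event: dict) -> bool:
    """Fire when an Office process spawns a shell interpreter."""
    return (basename(event.get("ParentImage", "")) in SUSPICIOUS_PARENTS
            and basename(event.get("Image", "")) in SHELL_CHILDREN)

event = {
    "EventID": 1,  # sysmon process-creation event
    "ParentImage": r"C:\Program Files\Microsoft Office\WINWORD.EXE",
    "Image": r"C:\Windows\System32\cmd.exe",
}
print(matches_rule(event))  # True
```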
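
Second, for item 2, a minimal sketch of the self-supervised objective that powers GPT-like models: predict the next token from the preceding ones, so the data labels itself. PyTorch is assumed; the vocabulary size, model dimensions, and random data below are placeholders.

```python
import torch
import torch.nn as nn

# Minimal causal (next-token) language-modeling objective: predict token
# t+1 from tokens <= t, so no human labels are needed. All sizes and data
# here are placeholders for illustration.
vocab_size, d_model = 100, 32
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 16))   # 8 random sequences of length 16
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift targets by one position

# Additive causal mask: position i may not attend to positions > i.
seq_len = inputs.size(1)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

hidden = encoder(embed(inputs), mask=causal_mask)
logits = lm_head(hidden)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()  # self-supervised gradient step: the data is its own label
```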
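
Third, for item 3, a sketch of how emulator telemetry could be turned into token sequences and masked for pre-training. The report structure shown is a simplification invented for this example, not the exact Speakeasy output schema, and the masking recipe is a standard masked-language-model setup rather than our exact configuration.

```python
import random
import torch

# Hypothetical preprocessing: turn an emulator report (structure simplified,
# not the exact Speakeasy schema) into a sequence of API-name tokens, then
# mask a fraction of them for masked-language-model pre-training.
report = {"apis": [
    {"api_name": "kernel32.CreateFileA"},
    {"api_name": "kernel32.WriteFile"},
    {"api_name": "kernel32.GetTickCount"},
    {"api_name": "ws2_32.connect"},
]}

vocab = {"[PAD]": 0, "[MASK]": 1}
def token_id(name: str) -> int:
    """Assign each API name a stable integer id, growing the vocab on demand."""
    return vocab.setdefault(name, len(vocab))

tokens = [token_id(call["api_name"]) for call in report["apis"]]

def mask_tokens(ids, mask_id=1, p=0.15):
    """Replace ~p of positions with [MASK]; labels stay -100 elsewhere so
    the loss is computed only on masked positions (PyTorch's ignore_index)."""
    input_ids, labels = list(ids), [-100] * len(ids)
    for i in range(len(ids)):
        if random.random() < p:
            labels[i] = input_ids[i]
            input_ids[i] = mask_id
    return torch.tensor(input_ids), torch.tensor(labels)

input_ids, labels = mask_tokens(tokens)
# input_ids feed the Transformer; cross-entropy against `labels`
# (ignore_index=-100) trains the model without any malware labels.
```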
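
Finally, for item 4, a hypothetical sketch of one way to inspect what the model attends to: average the attention each token receives across heads and query positions, then rank tokens. The tensor shapes and random weights are placeholders; the analysis presented in the talk may differ in detail.

```python
import torch

# Hypothetical attention inspection: given attention weights of shape
# (heads, seq_len, seq_len) for one layer and the API-name tokens, rank
# tokens by how much attention they receive on average.
tokens = ["CreateFileA", "GetTickCount", "Sleep", "GetTickCount", "connect"]
attn = torch.rand(4, len(tokens), len(tokens)).softmax(dim=-1)  # placeholder

received = attn.mean(dim=0).mean(dim=0)  # average over heads, then queries
for score, name in sorted(zip(received.tolist(), tokens), reverse=True):
    print(f"{name:>14s}  {score:.3f}")

# In the talk's backdoor example, the paired GetTickCount calls implementing
# the timing-based anti-debugging check stand out in such a ranking.
```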

About the Speakers