Training a Neural Network for Cybersecurity

Inspiration

Working in the cybersecurity field, I build alert pipelines that categorize threats based on predefined conditions: Was a certain indicator of compromise (IoC) observed? Were any CrowdStrike policies triggered? Neural networks can help us build better threat detection models by weighting model inputs (e.g., a binary variable indicating the presence of an IoC) through a learning process rather than through fixed, hand-written rules.

The Model

I based my neural network on the “Windows Malwares” dataset available on Kaggle, which describes various characteristics of a set of executables labeled as benign or as one of six attack types. I chose the DLLs_Imported.csv file, which enumerates the dynamic-link libraries (DLLs) each executable imports, as the input to a multi-class classification model.

Notable Features

  • The first layer is an embedding layer. Embedding layers map an integer index to a learned vector (in this case, of length 3). Embeddings are typically used in NLP applications to encode semantic information about words. See the sketch after this list.
  • The model excludes a final softmax layer; PyTorch’s CrossEntropyLoss() expects raw logits and applies log-softmax internally.
  • The model uses an 80%/10%/10% train/validation/test split, and the dataset is shuffled before training.
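
To make the embedding and logits behavior concrete, here’s a minimal standalone sketch (the shapes are illustrative, not taken from the dataset):

import torch
import torch.nn as nn

# Four possible indices, each mapped to a learned vector of length 3
emb = nn.Embedding(num_embeddings=4, embedding_dim=3)
indices = torch.tensor([[0, 1, 3]])   # shape: (batch=1, features=3)
print(emb(indices).shape)             # torch.Size([1, 3, 3])

# CrossEntropyLoss consumes raw logits and applies log-softmax internally,
# so the network itself needs no final softmax layer
loss_fn = nn.CrossEntropyLoss()
logits = torch.randn(1, 7)            # one sample, seven classes
target = torch.tensor([2])            # true class index
print(loss_fn(logits, target))

The full training script: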
import pandas as pd
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import random_split

TRAIN_SPLIT = 0.8
VAL_SPLIT = 0.1
TEST_SPLIT = 0.1
EPOCHS = 30
BATCH_SIZE = 64

dlls = pd.read_csv('data/DLLs_Imported.csv')
# Shuffle the dataset so the splits are drawn in random order
dlls = dlls.sample(frac=1).reset_index(drop=True)

train, val, test = random_split(dlls, [TRAIN_SPLIT, VAL_SPLIT, TEST_SPLIT])

# Column 1 holds the class label; columns 2 onward are the DLL import features
Xtrain = dlls.iloc[train.indices, 2:]
Ytrain = dlls.iloc[train.indices, 1]

print(f'Train Set Features: {Xtrain.shape} / Train Set Labels: {Ytrain.shape}')

Xval = dlls.iloc[val.indices, 2:]
Yval = dlls.iloc[val.indices, 1]

print(f'Validation Set Features: {Xval.shape} / Validation Set Labels: {Yval.shape}')

Xtest = dlls.iloc[test.indices, 2:]
Ytest = dlls.iloc[test.indices, 1]

print(f'Test Set Features: {Xtest.shape} / Test Set Labels: {Ytest.shape}')

class NeuralNetwork(nn.Module):
  def __init__(self, in_features, out_features):
    super().__init__()
    self.model = nn.Sequential(
        # Map each integer feature value to a learned vector of length 3
        nn.Embedding(in_features, 3),
        nn.Flatten(),
        nn.Linear(3 * in_features, 64),
        nn.ReLU(),
        # Raw logits out; CrossEntropyLoss applies log-softmax internally
        nn.Linear(64, out_features)
    )

  def forward(self, x):
    return self.model(x)

# Seven output classes: benign plus the six attack types
model = NeuralNetwork(Xtrain.shape[1], 7)
optimizer = Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
  for batch_start in range(0, len(Xtrain), BATCH_SIZE):
    # Embedding layers require integer (long) indices
    Xbatch = torch.tensor(Xtrain.iloc[batch_start:batch_start + BATCH_SIZE].values, dtype=torch.long)
    Ybatch = torch.tensor(Ytrain.iloc[batch_start:batch_start + BATCH_SIZE].values, dtype=torch.long)
    logits = model(Xbatch)
    batch_loss = loss_fn(logits, Ybatch)
    optimizer.zero_grad()
    batch_loss.backward()
    optimizer.step()
  print(f'Epoch: {epoch + 1} Loss: {batch_loss.item():.4f}')
  with torch.no_grad():
    val_logits = model(torch.tensor(Xval.values, dtype=torch.long))
    # Softmax is monotonic, so argmax over the raw logits picks the same class
    class_predictions = torch.argmax(val_logits, dim=1)
    acc = 100 * (class_predictions == torch.tensor(Yval.values)).sum().item() / len(Yval)
    print(f'Validation Accuracy: {acc:.2f}%')

with torch.no_grad():
  test_logits = model(torch.tensor(Xtest.values, dtype=torch.long))
  class_predictions = torch.argmax(test_logits, dim=1)
  acc = 100 * (class_predictions == torch.tensor(Ytest.values)).sum().item() / len(Ytest)
  print(f'Test Set Accuracy: {acc:.2f}%')

Performance

After training for 30 epochs, the model achieves roughly 60% accuracy on the test set. Not a bad start, but there’s certainly room for improvement.

Observations:

  • The model achieves similar accuracy on the testing and validation sets. This suggests that the model isn’t overfitting.
  • Increasing the depth of the network does not considerably change its performance, which points to an underlying issue in the dataset rather than the architecture. Here’s a visualization of the label (attack type) class distribution:

[Figure: class distribution of the attack-type labels]

When I exclude rows in the dataset with a label of 0, the model’s validation/test accuracy increases to 65%. Both the distribution check and the filtering can be reproduced with the sketch below.
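
A minimal pandas sketch (assuming, as in the training script, that column 1 of DLLs_Imported.csv holds the label):

import pandas as pd

dlls = pd.read_csv('data/DLLs_Imported.csv')
labels = dlls.iloc[:, 1]

# Per-class counts show how skewed the label distribution is
print(labels.value_counts())

# Drop rows with label 0 to train on the remaining classes only
filtered = dlls[labels != 0]
print(f'{len(filtered)} of {len(dlls)} rows remain')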

Improvements:

  • Hyperparameter tuning. K-fold cross-validation is the standard approach for choosing hyperparameter values like the learning rate; a sketch follows this list.
  • Use more features. The Kaggle repository contains other dataset files with executable header data and specific API calls. Folding in more features should improve the model’s predictive power.
  • Use a different model architecture. Traditional machine learning models, like k-nearest neighbors (KNN), might be well suited to this task: intuitively, we’d expect executables that import similar DLLs to pose similar threats. A baseline sketch also follows.
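
Here’s what a k-fold learning-rate search could look like. This is a minimal sketch, assuming scikit-learn is available; the candidate learning rates are illustrative, and the inner training loop is elided:

import numpy as np
from sklearn.model_selection import KFold

learning_rates = [1e-2, 1e-3, 1e-4]   # illustrative search space
X = np.arange(100).reshape(100, 1)    # stand-in for the DLL features
Y = np.zeros(100)                     # stand-in for the labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for lr in learning_rates:
  fold_scores = []
  for train_idx, val_idx in kf.split(X):
    # Train a fresh model on X[train_idx] with this learning rate, then
    # score it on X[val_idx]; the training loop itself is omitted here
    fold_scores.append(0.0)  # placeholder for the fold's accuracy
  print(f'lr={lr}: mean fold accuracy {np.mean(fold_scores):.2f}')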
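
And a quick KNN baseline, again a sketch assuming scikit-learn and the same column layout (column 1 is the label, columns 2 onward are the DLL imports):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

dlls = pd.read_csv('data/DLLs_Imported.csv')
X = dlls.iloc[:, 2:]
Y = dlls.iloc[:, 1]

Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=0.2, random_state=0)

# With binary import features, neighboring points share similar DLL sets
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtr, Ytr)
print(f'KNN test accuracy: {100 * knn.score(Xte, Yte):.2f}%')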