Building a Simple ML Model to Detect Obfuscated Malware

Introduction

The world of cybersecurity is a perpetual cat-and-mouse game. As defenders build higher walls, attackers devise cleverer ways to tunnel underneath. One of the most effective tunneling techniques is obfuscation, where malware authors disguise their malicious code to evade signature-based detection systems like traditional antivirus software. Obfuscated malware may be packed, encrypted, or polymorphic (changing its own code between infections), so that on disk it looks like a harmless program.

So, how do we catch a ghost? We look for its footprints. Instead of analyzing the file on disk (static analysis), which can be misleading, we can observe its behavior in memory during execution (dynamic analysis). When malware unpacks itself to run, its true form and intentions are revealed in the system’s memory.

In this post, we’ll walk through building a simple but effective machine learning model to detect obfuscated malware by analyzing its memory-based features. Our target is not a single piece of software but the behavior of malware itself. Our goal is to train a classifier that can distinguish between malicious and benign processes based on data captured from memory, a technique that is much harder for attackers to evade. We will compare a classic ensemble model (Random Forest) with a simple deep learning model (Convolutional Neural Network - CNN) to see which performs better at this task.

Tools & Environment Setup

To follow along, you’ll need a standard data science environment. We’re not reverse-engineering a binary, but analyzing data derived from that process.

Software:

  • Python 3.8+: The core of our environment.
  • Jupyter Notebook or VS Code: For interactive development and data exploration.
  • Key Python Libraries:
    • pandas: For data manipulation and loading.
    • scikit-learn: For data preprocessing, feature selection, and our Random Forest model.
    • tensorflow or pytorch: For building the CNN model. We’ll use TensorFlow/Keras syntax for this guide.
    • matplotlib & seaborn: For data visualization.
  • Dataset: CIC-MalMem-2022: This is a fantastic, publicly available dataset from the Canadian Institute for Cybersecurity. It contains 58,596 records of memory-based features, each summarizing one memory dump, split between benign samples and various malware families. You can download it from the University of New Brunswick’s website.

Environment Setup:

First, set up a virtual environment to keep your dependencies clean.

# Create a virtual environment
python -m venv malware_env

# Activate it
# On Windows:
# .\malware_env\Scripts\activate
# On macOS/Linux:
source malware_env/bin/activate

Next, install the required libraries. Create a requirements.txt file with the following content:

pandas
numpy
scikit-learn
tensorflow
matplotlib
seaborn

Then, install them using pip:

pip install -r requirements.txt

Finally, download and extract the CIC-MalMem-2022 dataset. It comes in CSV format, ready for us to dive in.

Static Analysis: Data Exploration & Feature Engineering

In traditional reverse engineering, static analysis means examining a binary without running it. In our ML context, it means analyzing our dataset before training a model. This step is crucial for understanding the features we’re working with.

The CIC-MalMem-2022 dataset has 57 memory-related features. These include counts of DLLs, handles, process information, and specific memory access patterns.

1. Loading the Data:

Let’s start by loading the dataset into a pandas DataFrame.

import pandas as pd

# Path to the dataset file
file_path = 'path/to/your/CIC-MalMem-2022.csv'

# Load the data
df = pd.read_csv(file_path)

# Display first few rows and basic info
print(df.head())
df.info()  # prints its summary directly and returns None, so no print() is needed

# Check the distribution of the 'Category' column (Benign vs. Malware)
print(df['Category'].value_counts())

You’ll notice the target variable is Category, which we need to convert into a numerical format (e.g., 0 for Benign, 1 for Malware).
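
Here is a minimal sketch of that conversion on a toy Series standing in for the real column. The label strings below, including the family-style one, are made up for illustration, so check df['Category'].unique() on your own copy before relying on the exact spelling of "Benign".

```python
import pandas as pd

# Toy stand-in for df['Category']; these strings are illustrative, not read from the CSV.
labels = pd.Series(["Benign", "Malware", "Benign", "Ransomware-example", "Malware"])

# Assumption: anything that is not exactly "Benign" counts as malware.
y_binary = (labels != "Benign").astype(int)

print(y_binary.tolist())
```

Comparing against the benign label, rather than listing every malware label, keeps the mapping binary even if the column turns out to hold family-level names.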

2. Feature Selection:

With 57 features, we risk overfitting and long training times. We need to select the most impactful features. A hybrid approach combining Correlation-based Feature Selection (CFS) and Recursive Feature Elimination (RFE) is highly effective. CFS identifies features that are highly correlated with the target variable but uncorrelated with each other, while RFE recursively removes the least important features.
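
To make the idea concrete, here is a rough sketch of such a two-stage selection on synthetic data. It is not the exact procedure from the referenced work: the first stage is a simplified greedy correlation filter in the spirit of CFS (not the full CFS merit score), the 0.9 threshold and the feature names f0..f19 are arbitrary assumptions, and the second stage is scikit-learn's RFE wrapped around a Random Forest. On real data, run the selection on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the memory features (not the real dataset).
X_arr, y_syn = make_classification(n_samples=500, n_features=20, n_informative=6,
                                   n_redundant=4, random_state=42)
X_syn = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])

def correlation_filter(X, y, max_inter_corr=0.9):
    """Greedy CFS-style filter: rank by |corr with target|, skip near-duplicates."""
    target_corr = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    feature_corr = X.corr().abs()
    kept = []
    for name in target_corr.sort_values(ascending=False).index:
        # Keep a feature only if it is not highly correlated with one already kept.
        if all(feature_corr.loc[name, k] < max_inter_corr for k in kept):
            kept.append(name)
    return kept

candidates = correlation_filter(X_syn, y_syn)

# RFE repeatedly fits the estimator and drops the least important feature.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=42),
          n_features_to_select=min(10, len(candidates)), step=1)
rfe.fit(X_syn[candidates], y_syn)

selected = [name for name, keep in zip(candidates, rfe.support_) if keep]
print(selected)
```

The filter is cheap and removes obvious duplicates first, which leaves fewer fits for the more expensive RFE stage.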

For this guide, we’ll assume this process has already been done. The research paper behind the GitHub repo that inspired this guide found that a subset of 10 to 15 features gave the best performance. Let’s say our analysis yielded top features such as the following:

  • pslist.nprocs64bit: Number of 64-bit processes.
  • handles.nfile: Number of file handles.
  • dlllist.ndlls: Number of loaded DLLs.
  • svcscan.nservices: Number of services.
  • callbacks.ncallbacks: Number of callbacks.
  • memmap.protection: Memory protection flags.
  • …and a few others.

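Let’s select a set of ten such features and prepare our data for training. Column names may differ between copies of the CSV, so compare them against df.columns before running this.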

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

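# Assume these are our selected features.
# Verify every name against df.columns; a single typo raises a KeyError.
selected_features = [
    'pslist.nprocs', 'pslist.nppid', 'pslist.avg_threads', 'pslist.nprocs64bit',
    'dlllist.ndlls', 'handles.nhandles', 'handles.nfile', 'svcscan.nservices',
    'callbacks.ncallbacks', 'memmap.npage'
]
missing = [col for col in selected_features if col not in df.columns]
if missing:
    raise KeyError(f"Columns not found in the dataset: {missing}")

X = df[selected_features]

# Collapse the target to two classes. If Category holds family-level labels,
# everything other than 'Benign' is treated as 'Malware' (an assumption).
y = df['Category'].where(df['Category'] == 'Benign', 'Malware')

# Encode the target variable (Benign -> 0, Malware -> 1)
le = LabelEncoder()
y_encoded = le.fit_transform(y)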

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Dynamic Analysis: Model Building and Training

Now for the “dynamic” part: running our models. We’ll build and train two distinct architectures.

Model 1: Random Forest

Random Forest is a powerful ensemble learning method that builds multiple decision trees and merges them to get a more accurate and stable prediction. It’s robust and works well on tabular data like ours.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

# Train the model
print("Training Random Forest...")
rf_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test_scaled)

# Calculate accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf * 100:.2f}%")

It’s that simple! scikit-learn provides a high-level, easy-to-use API.
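
To see what “merging” the trees means in practice, the sketch below trains a small forest on synthetic data (not the malware dataset) and checks that the forest’s predicted probabilities are the average of the individual trees’ probabilities. The variable names are placeholders chosen so they do not clash with the ones used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem, used only to inspect the ensemble's behaviour.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree produces its own class probabilities...
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# ...and the forest's output is their mean across trees.
averaged = per_tree.mean(axis=0)
matches = np.allclose(averaged, forest.predict_proba(X_demo))
print(matches)
```

Averaging many trees trained on different bootstrap samples is what smooths out the variance of any single tree.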

Model 2: Convolutional Neural Network (CNN)

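While CNNs are famous for image processing, they can also be applied to tabular data. By treating the feature set of each sample as a 1D vector, a 1D CNN can learn local patterns among neighboring features and combine them into more complex ones. Keep in mind that a convolution assumes adjacent positions are related, while the column order of a table is arbitrary, so what the network learns depends on how the features happen to be ordered.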

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

# Reshape data for CNN (samples, timesteps, features)
# We treat our 10 features as 10 timesteps of a 1D signal
X_train_cnn = X_train_scaled.reshape((X_train_scaled.shape[0], X_train_scaled.shape[1], 1))
X_test_cnn = X_test_scaled.reshape((X_test_scaled.shape[0], X_test_scaled.shape[1], 1))

# Build the CNN model
cnn_model = Sequential([
    Conv1D(filters=32, kernel_size=3, activation='relu', input_shape=(X_train_cnn.shape[1], 1)),
    MaxPooling1D(pool_size=2),
    Dropout(0.3),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid') # Binary classification
])

# Compile the model
cnn_model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

# Train the model
print("Training CNN...")
history = cnn_model.fit(X_train_cnn, y_train, epochs=20, batch_size=64, validation_split=0.2, verbose=1)

# Evaluate the model
loss, accuracy_cnn = cnn_model.evaluate(X_test_cnn, y_test)
print(f"CNN Accuracy: {accuracy_cnn * 100:.2f}%")
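
It can help to check by hand the shapes flowing through the network above. The sketch below uses two small helper functions (hypothetical names, plain Python, no TensorFlow needed) and assumes Keras defaults: no padding, a convolution stride of 1, and a pooling stride equal to the pool size.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1

def maxpool1d_length(length, pool_size):
    """Output length of non-overlapping max pooling without padding."""
    return (length - pool_size) // pool_size + 1

n_features = 10   # length of each input vector
n_filters = 32    # Conv1D(filters=32, ...)

after_conv = conv1d_valid_length(n_features, kernel_size=3)  # positions after Conv1D
after_pool = maxpool1d_length(after_conv, pool_size=2)       # positions after MaxPooling1D
flattened = after_pool * n_filters                           # inputs to Dense(64)

print(after_conv, after_pool, flattened)
```

With only ten input features the signal shrinks to a handful of positions after one conv and pool stage, which is why the model stops at a single convolutional layer.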

Vulnerability & Exploitation: Model Evaluation

Here, “finding a vulnerability” means correctly identifying malware. Our “proof-of-concept exploit” is the model’s prediction. Let’s evaluate how well our models performed beyond simple accuracy. We’ll use a confusion matrix, precision, recall, and F1-score.

from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# --- Random Forest Evaluation ---
print("--- Random Forest Classification Report ---")
print(classification_report(y_test, y_pred_rf, target_names=le.classes_))

# Confusion Matrix for RF
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues', xticklabels=le.classes_, yticklabels=le.classes_)
plt.title('Random Forest Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


# --- CNN Evaluation ---
# Get CNN predictions (they are probabilities, so we threshold them at 0.5)
y_pred_cnn_prob = cnn_model.predict(X_test_cnn)
y_pred_cnn = (y_pred_cnn_prob > 0.5).astype(int).ravel()  # flatten (n, 1) to (n,)

print("\n--- CNN Classification Report ---")
print(classification_report(y_test, y_pred_cnn, target_names=le.classes_))

# Confusion Matrix for CNN
cm_cnn = confusion_matrix(y_test, y_pred_cnn)
sns.heatmap(cm_cnn, annot=True, fmt='d', cmap='Oranges', xticklabels=le.classes_, yticklabels=le.classes_)
plt.title('CNN Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

Interpreting the Results:

  • Accuracy: The overall percentage of correct predictions. The reference material suggests the CNN (around 96%) will likely outperform the Random Forest (around 93%), but the exact figures depend on the feature set and the train/test split, so compare them with your own run.
  • Precision (Malware): Of all the processes we flagged as malware, how many actually were? High precision means few false alarms among the samples we flag.
  • Recall (Malware): Of all the actual malware in the dataset, how many did we catch? High recall means a low false negative rate, which is critical for a security tool.
  • F1-Score: The harmonic mean of precision and recall. A great single metric for overall performance.
  • Confusion Matrix: A visual breakdown of correct vs. incorrect predictions for each class. We want to minimize the numbers in the top-right (False Positives) and bottom-left (False Negatives).
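
As a worked example of these definitions, the sketch below derives each metric from a confusion matrix. The counts are hypothetical, chosen only to make the arithmetic easy to follow, and the layout assumes scikit-learn’s convention of rows for actual classes and columns for predicted classes in the order [Benign, Malware].

```python
import numpy as np

# Hypothetical counts: rows = actual, columns = predicted, order [Benign, Malware].
cm = np.array([[950,  50],
               [ 30, 970]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)            # share of flagged samples that are malware
recall = tp / (tp + fn)               # share of malware that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()
false_positive_rate = fp / (fp + tn)  # share of benign samples that were flagged

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} "
      f"accuracy={accuracy:.3f} fpr={false_positive_rate:.3f}")
```

Note that precision and the false positive rate use different denominators: precision divides by everything flagged, while the false positive rate divides by everything that is actually benign.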

The results will likely show that the CNN is slightly better at learning the intricate relationships between memory features, leading to higher accuracy and a better F1-score for detecting malware.

Mitigation & Conclusion

The “mitigation” for malware is its detection and prevention. A model like the one we’ve built can be a powerful component of a modern security solution.

How could this be deployed?

  1. Endpoint Detection and Response (EDR): The model could be integrated into an EDR agent. The agent would periodically sample memory features from running processes and feed them to the model for a real-time verdict.
  2. Sandbox Analysis: In a controlled sandbox environment, suspicious files can be executed. Memory feature data is collected during execution and analyzed by the model to produce a detailed report on whether the file is malicious.
  3. Cloud-Based Threat Intelligence: Feature data from endpoints could be sent to a cloud service running the model, allowing for continuous updates and large-scale threat hunting without taxing the client machine.
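
Whichever option is chosen, the scoring side looks roughly the same: load a saved model, apply the same preprocessing used in training, and return a verdict. The sketch below is one possible shape for that, not a production design. It trains on synthetic data, and the file name, the score_sample helper, and the 0.5 threshold are all illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Train on synthetic data as a stand-in for the selected memory features.
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=1)
pipeline = make_pipeline(StandardScaler(),
                         RandomForestClassifier(n_estimators=50, random_state=1))
pipeline.fit(X_syn, y_syn)

# Persist scaler and model together so scoring applies identical preprocessing.
model_path = os.path.join(tempfile.mkdtemp(), "malmem_pipeline.joblib")
joblib.dump(pipeline, model_path)

def score_sample(features, path=model_path, threshold=0.5):
    """Hypothetical helper: return (malware_probability, verdict) for one feature vector."""
    model = joblib.load(path)
    sample = np.asarray(features, dtype=float).reshape(1, -1)
    prob = float(model.predict_proba(sample)[0, 1])
    return prob, ("malware" if prob >= threshold else "benign")

prob, verdict = score_sample(X_syn[0])
print(f"{prob:.2f} -> {verdict}")
```

Bundling the scaler with the classifier in one pipeline avoids a common deployment bug, where the model receives unscaled features. A real agent would load the model once at startup rather than on every call.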

Conclusion:

Signature-based detection is no longer enough to combat the ever-evolving threat of obfuscated malware. By shifting our focus from what a file looks like to how it behaves in memory, we can build far more resilient detection systems. In this guide, we demonstrated that both traditional machine learning models like Random Forest and deep learning models like CNNs can achieve high accuracy in this task.

Our comparison suggests that even a simple 1D-CNN can capture the complex patterns in memory features and can outperform robust ensemble methods. This project serves as a starting point. Further improvements could involve exploring more complex architectures (like LSTMs for time-series memory data), incorporating a wider range of features, and training on an even larger and more diverse dataset of malware samples. The battle against malware is ongoing, but with data-driven techniques, we have a powerful arsenal at our disposal.