Neural network to detect Botnet network traffic
Goal
In this post, I will summarise a project I built for my master's degree in Cybersecurity at UNED.
Our goal is to be able to detect Botnet traffic.
Using Keras to detect Botnet traffic
Keras is a perfect tool for Machine Learning experts and other developers alike. It can be as complicated as you want to make it, or as simple as you need it to be.
Choose data
One of the most important decisions when attempting a project like this from the ground up is to choose (or create) the dataset you are going to work with wisely.
The quality of this decision will be strongly correlated with the quality of the final Neural Network model.
In this project, I have chosen to use the ISOT HTTP Botnet dataset, created by the University of Victoria in 2017. This dataset consists of a wide variety of Botnet traffic, along with samples of benign traffic.
The dataset in its totality consists of approximately eleven million captured network packets.
[Figure: the dataset's network topology]
Selecting properties
Once we have decided on the dataset we are going to work with, we can begin selecting the properties to include in our study.
Our main goal is to detect Botnet network packets traveling through our network, so we are interested in the following properties (a possible feature encoding is sketched after the list):
ip.src (Categorical)
ip.dst (Categorical)
_ws.col.Protocol (Categorical)
_ws.col.Info (Categorical)
frame.len (Continuous)
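As a sketch of how these mixed types can be fed to a neural network, the categorical properties can be one-hot encoded while frame.len stays numeric. This is only one possible encoding, not necessarily the exact preprocessing used in the project, and the file name assumes the labelled CSV we produce in the following sections:

import pandas as pd

# Load the labelled CSV produced by the pipeline below (assumed name).
df = pd.read_csv('sorted_merged_traffic.csv')

# One-hot encode the categorical properties...
categorical = ['ip.src', 'ip.dst', '_ws.col.Protocol', '_ws.col.Info']
features = pd.get_dummies(df[categorical])

# ...and keep frame.len as a plain numeric column.
features['frame.len'] = pd.to_numeric(df['frame.len'], errors='coerce').fillna(0)

X = features.values
y = df['Botnet'].values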
Data conversion to .csv
To extract these fields from the pcap files, we are going to use a command-line utility called tshark. In our case, as we want to merge all of the captured data, we have developed a script that concatenates the converted .csv files together.
#!/bin/bash

# Keep the CSV header only once: the first capture writes it,
# subsequent captures are appended without it.
first=true

# Convert and concatenate all the malign captures
for file in ./botnet_data/*.pcap; do
    echo "Processing $file"
    if $first; then
        first=false
        tshark -r "$file" -T fields -e frame.time_epoch -e ip.src -e ip.dst -e _ws.col.Protocol -e frame.len -e _ws.col.Info -E separator=, -E header=y > network_malign_traffic.csv
    else
        tshark -r "$file" -T fields -e frame.time_epoch -e ip.src -e ip.dst -e _ws.col.Protocol -e frame.len -e _ws.col.Info -E separator=, >> network_malign_traffic.csv
    fi
done
head -n 1 network_malign_traffic.csv > merged_network_malign_traffic.csv
tail -n +2 network_malign_traffic.csv >> merged_network_malign_traffic.csv

first=true

# Convert and concatenate all the benign captures
for file in ./application_data/*.pcap; do
    echo "Processing $file"
    if $first; then
        first=false
        tshark -r "$file" -T fields -e frame.time_epoch -e ip.src -e ip.dst -e _ws.col.Protocol -e frame.len -e _ws.col.Info -E separator=, -E header=y > network_benign_traffic.csv
    else
        tshark -r "$file" -T fields -e frame.time_epoch -e ip.src -e ip.dst -e _ws.col.Protocol -e frame.len -e _ws.col.Info -E separator=, >> network_benign_traffic.csv
    fi
done
head -n 1 network_benign_traffic.csv > merged_network_benign_traffic.csv
tail -n +2 network_benign_traffic.csv >> merged_network_benign_traffic.csv
It is important that we keep the Botnet and benign captures separate until we have added the Botnet class to each row. This class is the value the neural network will compare its output against in order to adjust its weights.
Sanitize data
Following the above conversion, we are now going to clean the dataset.
In our case, we have instances of malformed rows: an excess of ',' separators, missing values, and invalid data in some properties.
To clean the dataset, we will use Python.
import sys
import re
import pandas as pd
import fileinput


def delete_extra_commas(file_name):
    # Drop any separators beyond the expected six columns (five commas).
    for line in fileinput.input(file_name, inplace=True):
        character_count = 0
        for character in line:
            if character == ',':
                character_count = character_count + 1
            if character == ',' and character_count > 5:
                continue
            else:
                print('{}'.format(character), end='')


def delete_invalid_entries(file_name):
    df = pd.read_csv(file_name, header=0,
                     dtype={'frame.time_epoch': float,
                            'ip.src': str,
                            'ip.dst': str,
                            '_ws.col.Protocol': str,
                            'frame.len': str,
                            '_ws.col.Info': str,
                            'Botnet': str})
    # A frame.len that looks like an IP address signals a shifted row.
    ip_addr = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
    bad_rows = []
    i = 0
    while i < len(df):
        if re.match(ip_addr, str(df['frame.len'][i])):
            bad_rows.append(i)
        i = i + 1
    print("Deleting bad rows: ", bad_rows)
    df.drop(df.index[bad_rows], inplace=True)
    df.to_csv(file_name, index=False)


def remove_null_values(file_name):
    # Replace empty fields (two consecutive commas) with 'UNKNOWN'.
    for line in fileinput.input(file_name, inplace=True):
        previous_character_is_comma = False
        for character in line:
            if character == ',' and previous_character_is_comma:
                print('{}'.format('UNKNOWN'), end='')
                previous_character_is_comma = False
            print('{}'.format(character), end='')
            if character == ',':
                previous_character_is_comma = True


if len(sys.argv) == 1:
    print("Usage: sanitize.py {file}")
    exit()

for file_name in sys.argv[1:]:
    print("Processing ", file_name)
    delete_extra_commas(file_name)
    remove_null_values(file_name)
    delete_invalid_entries(file_name)
This generates a valid, sane dataset. The last step needed before it can be used with our model is to append the Botnet class to every row.
Add Botnet class and order
#!/bin/bash

# Add a new column indicating whether each row is botnet traffic or not
echo "Adding y column"
awk -F"," 'BEGIN { OFS = "," } {$7="0"; print}' merged_network_benign_traffic.csv > botnet_identified_network_benign_traffic.csv
awk -F"," 'BEGIN { OFS = "," } {$7="1"; print}' merged_network_malign_traffic.csv > botnet_identified_network_malign_traffic.csv

# Replace the 0/1 at the end of the header line with a "Botnet" label
sed -i '1s/0$/Botnet/' botnet_identified_network_benign_traffic.csv
sed -i '1s/1$/Botnet/' botnet_identified_network_malign_traffic.csv

# Merge both files into one
echo "Merging files into merged_traffic.csv"
head -n 1 botnet_identified_network_malign_traffic.csv > merged_traffic.csv
tail -n +2 botnet_identified_network_benign_traffic.csv >> merged_traffic.csv
tail -n +2 botnet_identified_network_malign_traffic.csv >> merged_traffic.csv

# Sort numerically on frame.time_epoch (the first field)
echo "Sorting entries into sorted_merged_traffic.csv"
head -n 1 merged_traffic.csv > sorted_merged_traffic.csv
tail -n +2 merged_traffic.csv | sort -t',' -k1,1 -n >> sorted_merged_traffic.csv
Split data
Now that our dataset is prepared for analysis, we have to split it in two: we will train our model with one part, and test it with the other.
The goal of this approach is to test the generated model with data we have not used to train it. This mitigates the risk of overfitting.
In our case, we have split the dataset into 70% train and 30% test.
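A minimal sketch of this split, assuming scikit-learn and the X/y arrays built from the encoded features (the original project may have split the data differently, for example chronologically):

from sklearn.model_selection import train_test_split

# Hold out 30% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)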
Train model
During the training phase, we iterated through several models, experimenting with hidden layers, the number of neurons per layer, and activation functions.
from keras.models import Sequential
from keras.layers import Dense

# Define the model: one dense neuron over the inputs, then a sigmoid output
model = Sequential()
model.add(Dense(1, input_dim=X_train.shape[1], activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))

# Compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the keras model on the dataset.
# Note that batch_size is set to the number of input columns
# (X_train.shape[1]), i.e. mini-batch gradient descent with a fixed batch size.
model.fit(X_train, y_train, epochs=15, batch_size=X_train.shape[1], verbose=2)
Due to the nature of the selected data, the conclusion we have come to is that the dataset is (close to) linearly separable. This explains why we do not need a hidden layer in our model.
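A single sigmoid unit over the input features is the neural-network form of logistic regression, so one way to sanity-check the linear-separability conclusion (not part of the original project) is to compare against scikit-learn's LogisticRegression:

from sklearn.linear_model import LogisticRegression

# A purely linear classifier; accuracy comparable to the Keras model
# supports the conclusion that the classes are close to linearly separable.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print('Logistic regression accuracy:', clf.score(X_test, y_test))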
Save model
This is a key step in the model generation: if we don't save the model, we have to recreate it every time we want to use it.
model.save('modeloSecuencial.h5')
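Later, the saved model can be loaded back without retraining; a minimal sketch:

from keras.models import load_model

# Reload the trained model from disk instead of rebuilding and retraining it.
model = load_model('modeloSecuencial.h5')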
Model results
Even with this extremely simple model, we have managed to achieve a high Botnet detection rate.
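The evaluation code itself is not shown above; a minimal sketch, assuming the X_test and y_test arrays from the earlier split:

# Evaluate on the held-out 30% test split.
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Accuracy: %.2f' % (accuracy * 100))

The full training and evaluation run produced the following output: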
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 1)                 678
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 2
=================================================================
Total params: 680
Trainable params: 680
Non-trainable params: 0
_________________________________________________________________
Compiling the Keras model
Fitting the Keras model
Epoch 1/15
- 1s - loss: 0.6967 - accuracy: 0.0812
Epoch 2/15
- 1s - loss: 0.6957 - accuracy: 0.4601
Epoch 3/15
- 1s - loss: 0.6946 - accuracy: 0.4885
Epoch 4/15
- 1s - loss: 0.6934 - accuracy: 0.5074
Epoch 5/15
- 1s - loss: 0.6923 - accuracy: 0.5819
Epoch 6/15
- 1s - loss: 0.6911 - accuracy: 0.6698
Epoch 7/15
- 1s - loss: 0.6900 - accuracy: 0.8539
Epoch 8/15
- 1s - loss: 0.6889 - accuracy: 0.9269
Epoch 9/15
- 1s - loss: 0.6878 - accuracy: 0.9499
Epoch 10/15
- 1s - loss: 0.6867 - accuracy: 0.9540
... (epochs 11 to 14 omitted for brevity)
Epoch 15/15
- 1s - loss: 0.6815 - accuracy: 0.9540
evaluating the Keras model
Accuracy: 94.32
Model saved to disk
Further work
Once we have generated a model we are happy with, the next step is to implement the infrastructure needed to support it.
Once we have our static model, we can use it to dynamically analyse network traffic and perform a given action whenever Botnet traffic is detected. Such a system is called an Intrusion Prevention System (IPS).
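As an illustrative sketch (not part of the original project), such a system could load the saved model and trigger a prevention action whenever the predicted Botnet probability crosses a threshold; inspect and block_traffic are hypothetical helpers:

from keras.models import load_model

model = load_model('modeloSecuencial.h5')

def block_traffic(features):
    # Placeholder for the prevention action (firewall rule, alert, ...).
    print('Botnet traffic detected, taking action')

def inspect(packet_features):
    # packet_features must be a numeric array encoded exactly like the
    # training rows (same one-hot columns, same order).
    probability = float(model.predict(packet_features.reshape(1, -1))[0][0])
    if probability > 0.5:
        block_traffic(packet_features)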
Conclusion
I realised that the most difficult part of this project is, by far, the data acquisition and validation phase.
Another key step is investigating the properties and benefits of every model hyperparameter, be it activation functions, optimizers, or the number of hidden layers and neurons.