Detecting XSS Attacks with a CNN (PyTorch)

Preface

I came across an XSS dataset on Kaggle a while back and decided to implement a detector in PyTorch. The code is adapted from a Keras implementation on Kaggle:

https://www.kaggle.com/syedsaqlainhussain/cross-site-scripting-attack-detection-using-cnn

The XSS Dataset

Dataset: https://www.kaggle.com/syedsaqlainhussain/cross-site-scripting-xss-dataset-for-deep-learning

The data is in CSV form with three columns: the first is an index, the second the raw code, and the third the label. There are 13,686 rows in total, with no predefined train/test split, so we will need to split the data ourselves later.
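A quick way to get a feel for the three-column layout with pandas (shown here on a tiny in-memory sample rather than the real file, so the snippet runs standalone; the column names `Sentence` and `Label` match those used later in this post):

```python
import io
import pandas as pd

# a tiny stand-in with the same layout as XSS_dataset.csv:
# index, Sentence (the raw code), Label (1 = XSS, 0 = benign)
csv_sample = io.StringIO(
    ',Sentence,Label\n'
    '0,"<script>alert(1)</script>",1\n'
    '1,"<p>hello world</p>",0\n'
)
df = pd.read_csv(csv_sample, index_col=0)
print(df.shape)  # (2, 2)
print(sorted(df['Label'].unique()))  # [0, 1]
```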

Approach

First we encode the data into vector form. For every row in the training and test sets we then have two pieces of data: the encoding and the label. The model is trained on the encodings, its outputs are compared against the labels to compute the loss, and finally the model is evaluated on the test set.

Importing Modules

Here cv2 is an image-processing library and sklearn is a Python-based machine-learning toolkit.

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
import torch.optim as optim
import cv2
from sklearn.model_selection import train_test_split

Parameter Definitions

batch_size = 50
epochs = 100

Loading the Dataset

After downloading the XSS dataset, place it in the same directory as the code. The pandas module handles reading and processing the CSV file.

df = pd.read_csv("XSS_dataset.csv", encoding="utf-8-sig")
sentences = df['Sentence'].values

Defining the Encoding Function

Characters with large code points are remapped to smaller values, which makes later normalization easier. Each sample is stored in a vector of length 10,000, which is then reshaped into a two-dimensional 100×100 array.

def convert_to_ascii(sentence):
    sentence_ascii = []

    for i in sentence:
        if ord(i) < 8222:
            # remap a few "smart punctuation" code points to small values
            if ord(i) == 8217:      # ’
                sentence_ascii.append(134)
            elif ord(i) == 8221:    # ”
                sentence_ascii.append(129)
            elif ord(i) == 8220:    # “
                sentence_ascii.append(130)
            elif ord(i) == 8216:    # ‘
                sentence_ascii.append(131)
            elif ord(i) == 8211:    # –
                sentence_ascii.append(133)
            elif ord(i) <= 128:     # keep plain ASCII as-is
                sentence_ascii.append(ord(i))
            # everything else is dropped

    zer = np.zeros(10000)  # initialize a vector of length 10000
    # truncate samples longer than 10000 characters to avoid overflowing zer
    for i in range(min(len(sentence_ascii), 10000)):
        zer[i] = sentence_ascii[i]
    zer.shape = (100, 100)  # reshape the 1-D vector to 2-D
    return zer
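The core pad-and-reshape step can be seen in isolation on a single payload (a minimal sketch using only numpy; the sample string is made up for illustration):

```python
import numpy as np

# encode a sample payload: keep ASCII code points, pad to 10000, reshape to 100x100
s = "<script>alert(1)</script>"
codes = [ord(c) for c in s if ord(c) <= 128]
vec = np.zeros(10000)
vec[:len(codes)] = codes
img = vec.reshape(100, 100)
print(img.shape)  # (100, 100)
```

Every character lands in row-major order, so `img[0, 0]` holds the code point of `<` and the unused tail of the 10,000-slot vector stays zero.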

Encoding Conversion

First we allocate an array with one 100×100 two-dimensional entry per sample in the dataset. Each row of the CSV is then run through the encoding function, and the values of the 2-D array are converted to floats. The resulting data holds the encoded form of the whole dataset.

arr = np.zeros((len(sentences), 100, 100))
for i in range(len(sentences)):
    image = convert_to_ascii(sentences[i])

    x = np.asarray(image, dtype='float')  # convert the 2-D values to float
    image = cv2.resize(x, dsize=(100, 100), interpolation=cv2.INTER_CUBIC)
    image /= 128

    arr[i] = image

# Reshape data for input to CNN
data = arr.reshape(arr.shape[0], 1, 100, 100)

Getting the Labels

y = df['Label'].values

Splitting the Dataset

train_test_split splits the data randomly; test_size is the fraction of the samples used for testing (20% here), and random_state is a random seed. DataLoader then sets the training batch size and the shuffle behavior. Note that since the values in data and y are ndarrays, they must first be converted to tensors. Also note that shuffle must remain False here: the features and labels sit in separate DataLoaders, so shuffling either one independently would break the feature/label pairing.

trainX, testX, trainY, testY = train_test_split(data, y, test_size=0.2, random_state=42)
trainX = torch.from_numpy(trainX)
trainX = DataLoader(trainX, batch_size=batch_size, shuffle=False)
testX = torch.from_numpy(testX)
testX = DataLoader(testX, batch_size=batch_size, shuffle=False)
trainY = torch.from_numpy(trainY)
trainY = DataLoader(trainY, batch_size=batch_size, shuffle=False)
testY = torch.from_numpy(testY)
testY = DataLoader(testY, batch_size=batch_size, shuffle=False)
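A common alternative to the two separate DataLoaders above is TensorDataset, which keeps each sample paired with its label even with shuffling enabled. A minimal sketch with synthetic stand-in data (the 200-sample count is made up; only the shapes match the real dataset):

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# synthetic stand-ins shaped like the encoded dataset
data = np.random.rand(200, 1, 100, 100).astype(np.float32)
y = np.random.randint(0, 2, size=200)

# TensorDataset zips features and labels, so shuffle=True stays consistent
dataset = TensorDataset(torch.from_numpy(data), torch.from_numpy(y))
loader = DataLoader(dataset, batch_size=50, shuffle=True)

xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([50, 1, 100, 100]) torch.Size([50])
```

The training loop then iterates `for data, target in loader:` directly instead of zipping two loaders.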

Model Definition

class CNN_XSS_Net(nn.Module):
    def __init__(self):
        super(CNN_XSS_Net, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.3),
            nn.Conv2d(64, 128, 3),
            nn.ReLU(),
            nn.Conv2d(128, 256, 3),
            nn.MaxPool2d(2),
            nn.Dropout(0.3),
            nn.ReLU(),
        )
        self.fc1 = nn.Linear(123904, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 2)

    def forward(self, x):
        x = x.float()  # batches arrive as float64; cast to match the float32 weights
        cnn_res = self.cnn(x)
        # print(cnn_res.shape)  # batch_size x 256 x 22 x 22
        cnn_res = cnn_res.view(cnn_res.size(0), -1)
        # print(cnn_res.shape)  # batch_size x 123904
        f1 = self.fc1(cnn_res)
        f2 = self.fc2(f1)
        f3 = self.fc3(f2)
        f4 = self.fc4(f3)

        return f4
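The in-feature size of fc1 (123904) can be verified by tracing the spatial dimensions through the convolution stack: a 3×3 kernel with no padding shrinks each spatial dimension by 2, and MaxPool2d(2) halves it (floor division).

```python
# trace one spatial dimension of the 100x100 input through the layers
size = 100
size -= 2       # Conv2d(1, 64, 3)    -> 98
size //= 2      # MaxPool2d(2)        -> 49
size -= 2       # Conv2d(64, 128, 3)  -> 47
size -= 2       # Conv2d(128, 256, 3) -> 45
size //= 2      # MaxPool2d(2)        -> 22
flat = 256 * size * size  # 256 channels x 22 x 22
print(flat)  # 123904
```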

Instantiating the Model

model = CNN_XSS_Net()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

Defining the Training Function

def train(model, trainX, trainY, optimizer, epoch):
    model.train()
    i = 0
    for data, target in zip(trainX, trainY):
        i += 1
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if i % 50 == 0:
            print("Train Epoch : {} \t Loss : {:.6f}".format(epoch, loss.item()))

Defining the Test Function

def test_model(model, testX, testY):
    model.eval()
    correct = 0.0
    test_loss = 0.0
    with torch.no_grad():
        for data, target in zip(testX, testY):
            output = model(data)
            test_loss += F.cross_entropy(output, target).item()
            pred = output.max(1, keepdim=True)[1]  # index of the larger logit
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(testX.dataset)
    print("Test ---- Average loss : {:.4f}, Accuracy : {:.3f}\n".format(
        test_loss, 100.0 * correct / len(testX.dataset)))

Model Training

for epoch in range(epochs):
    train(model, trainX, trainY, optimizer, epoch)
    test_model(model, testX, testY)

Results

[training log output omitted]