Experiment Pseudocode

Submitted by 本小妞迷上赌 on 2020-02-19 13:29:48

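This article presents the pseudocode for the experimental part of the paper. The experiment is written in Python and uses the deep learning framework Keras. It is divided into the following parts: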

1 First training (first.py)

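Functionality:
The script reads the input data file, preprocesses it, splits it into a training set and a test set, and writes both sets to local files. It fits a CountVectorizer on the email text of the whole dataset to build a dictionary, which is also saved. The dictionary converts the training-set email text into bag-of-words features, and these features are used to train a model, which is then saved to a local file. Finally, the saved test set file is loaded to evaluate the model, at which point the script finishes.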

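Input:

  • Data file for the first training (trec06.csv)

Output:

  • Dictionary file
  • Training set and test set of the first training
  • Parent model produced by the first training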
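
The data-handling steps described above (label mapping, reproducible split, dictionary fitting, bag-of-words conversion) can be sketched in Python as follows. This is a minimal illustration, not the author's first.py: it assumes pandas and scikit-learn are installed, uses a tiny made-up table in place of trec06.csv, and picks test_size=0.25 and random_state=42 arbitrarily because the article does not state the ratio or the seed. Writing the CSV and dictionary files is left out to keep the sketch short.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Tiny in-memory stand-in for trec06.csv. The column names "label" and
# "message" follow the pseudocode; the rows themselves are made up.
data = pd.DataFrame({
    "label": ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "spam"],
    "message": [
        "win a free prize now",
        "meeting moved to friday",
        "free money click here",
        "lunch at noon tomorrow",
        "please review the report",
        "claim your free prize",
        "see you at the meeting",
        "click now to win money",
    ],
})

# Map the text labels to integers: spam -> 1, ham -> 0.
data["label"] = data["label"].map({"spam": 1, "ham": 0})

# Fixed ratio and fixed seed make the split reproducible.
# test_size and random_state are assumed values, not taken from the article.
messages_train, messages_test, y_train, y_test = train_test_split(
    data["message"].values,
    data["label"].values,
    test_size=0.25,
    random_state=42,
)

# Fit the vocabulary ("dictionary") on all messages, as the description states.
vectorizer = CountVectorizer()
vectorizer.fit(data["message"].values)

# Transform each split into a sparse matrix of word counts.
x_train = vectorizer.transform(messages_train)
x_test = vectorizer.transform(messages_test)

print(x_train.shape, x_test.shape)
```

Fitting the vocabulary on the whole dataset mirrors the description, so every word in the test set has a column. A stricter evaluation setup would fit on the training split only.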

Pseudocode:

#============================== load data ===============================
firstTrainingData <- read the file based on the path of the first training file
firstTrainingData['label'] <- replace label "spam" with 1 and label "ham" with 0 in firstTrainingData

#======================= split data =====================================
messages <- the values of column "message" in firstTrainingData
y <- the values of column "label" in firstTrainingData
messages_train,messages_test,y_train,y_test <- split messages and y at a fixed ratio with a fixed random seed

#============= save train set and test set to csv file ==================
trainData <- merge y_train and messages_train columns
testData <- merge y_test and messages_test columns
save trainData and testData each to a local csv file

#====================== CountVectorizer ================================
dictionary <- fit a CountVectorizer on messages to build the dictionary (vocabulary)
save dictionary to a local file

#======================== convert ===============================
x_train <- use the dictionary to transform messages_train into bag-of-words features
x_test <- use the dictionary to transform messages_test into bag-of-words features

#===================== SGD training with keras ===================================
input_dim <- the number of features of x_train 
sgd <- create an SGD optimizer from keras with lr (learning rate) as 0.2

model <- call Sequential from keras to create an empty sequential model
add a Dense layer to the model with input dimension as input_dim, output dimension as 10, and the activation function as relu
add a Dense layer to the model with input dimension as 10, output dimension as 1, and the activation function as sigmoid
compile the model with sgd as the optimizer

train the model on x_train and y_train with epochs as 5 and batch_size as 10

#========================= save model ===============================
save the model to a local file, including its structure and weights

#========================= evaluate ===============================
x_test_file,y_test_file <- load the saved test set file and transform its messages with the dictionary (see getDataAfterDect in loadtestdata.py)
loss, accuracy <- evaluate the model on x_test_file and y_test_file
print loss and accuracy
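
As a rough Python counterpart to the model-building, training and evaluation steps in the pseudocode, the sketch below builds the same two-layer network. Several things here are assumptions rather than details from the article: it imports Keras through TensorFlow 2, it trains on random synthetic count features instead of real bag-of-words data, and it uses binary_crossentropy as the loss because the pseudocode does not name one. Current Keras releases spell the lr argument as learning_rate.

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in for bag-of-words features: small integer word counts.
# Shapes and values are made up for illustration.
rng = np.random.default_rng(0)
x_train = rng.integers(0, 3, size=(40, 20)).astype("float32")
y_train = rng.integers(0, 2, size=40).astype("float32")
x_test = rng.integers(0, 3, size=(10, 20)).astype("float32")
y_test = rng.integers(0, 2, size=10).astype("float32")

# Number of features per sample (the vocabulary size in the real pipeline).
input_dim = x_train.shape[1]

# input_dim -> Dense(10, relu) -> Dense(1, sigmoid), as in the pseudocode.
model = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# SGD with learning rate 0.2. The loss is an assumption: binary
# cross-entropy is a common choice for a single sigmoid output.
sgd = keras.optimizers.SGD(learning_rate=0.2)
model.compile(optimizer=sgd, loss="binary_crossentropy", metrics=["accuracy"])

# Train with the epoch count and batch size given in the pseudocode.
model.fit(x_train, y_train, epochs=5, batch_size=10, verbose=0)

# evaluate returns the loss followed by each compiled metric.
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(loss, accuracy)
```

With real data, the sparse matrix returned by CountVectorizer would typically be converted with .toarray() before being passed to fit, and model.save(path) would store the structure and weights in one file.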

2 Loading data transformed by the dictionary (loadtestdata.py)

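Functionality:
Given a file path, the script reads the file, loads the dictionary, transforms the data with it, and returns the transformed result.

Input:

  • File path parameter

Output:

  • Result after transformation by the dictionary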

Pseudocode:

#======== preprocessing: load the second test_data so the model can be evaluated ========
function getDataAfterDect(data_path):
    data <- read the file based on data_path      
    messages <- the values of column "message" in data
    label <- the values of column "label" in data
    dictionary <- load dictionary based on local file path
        
    features <- use the dictionary to transform messages into bag-of-words features 
    return features, label 
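
A possible Python rendering of getDataAfterDect is sketched below. It rests on assumptions the article does not confirm: the dictionary is taken to be a fitted CountVectorizer stored with pickle, the data file is taken to be a CSV with "label" and "message" columns, and the dictionary path is passed as a second parameter (dictionary_path, a hypothetical addition) instead of being hard-coded. The function name get_data_after_dict is likewise only an illustrative rename.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def get_data_after_dict(data_path, dictionary_path):
    """Read a CSV with "label" and "message" columns and vectorize the messages."""
    data = pd.read_csv(data_path)
    messages = data["message"].values
    labels = data["label"].values
    # Load the fitted vectorizer that an earlier run saved (assumed pickle format).
    with open(dictionary_path, "rb") as f:
        dictionary = pickle.load(f)
    # Reuse the saved vocabulary; words outside it are ignored by transform.
    features = dictionary.transform(messages)
    return features, labels


# Usage with throwaway files standing in for the saved dictionary and test set.
with tempfile.TemporaryDirectory() as tmp:
    dictionary_path = os.path.join(tmp, "dictionary.pkl")
    data_path = os.path.join(tmp, "test.csv")

    vectorizer = CountVectorizer().fit(["free prize now", "meeting at noon"])
    with open(dictionary_path, "wb") as f:
        pickle.dump(vectorizer, f)

    pd.DataFrame(
        {"label": [1, 0], "message": ["free prize", "noon meeting"]}
    ).to_csv(data_path, index=False)

    features, labels = get_data_after_dict(data_path, dictionary_path)

print(features.toarray())
print(labels)
```

Loading the saved dictionary instead of fitting a new one keeps the feature columns aligned with the ones the model was trained on.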