------------------邮件数据预处理------------------

一：邮件数据读取

with open('emailSample1.txt','r') as fp:
    content = fp.read()　　#一次读取了全部数据
    print(content)

二：预处理操作

（一）预处理内容

预处理主要包括以下9个部分：

1. 将大小写统一成小写字母；
2. 移除所有HTML标签，只保留内容。
3. 将所有的网址替换为字符串 “httpaddr”.
4. 将所有的邮箱地址替换为 “emailaddr”
5. 将所有dollar符号($)替换为“dollar”.
6. 将所有数字替换为“number”
7. 将所有单词还原为词源，词干提取
8. 移除所有非文字类型
9. 去除空字符串‘’

（二）预处理实现读取邮件

import re
import nltk.stem as ns

def preprocessing(email):
    #1. 将大小写统一成小写字母；
    email = email.lower()

    #2. 移除所有HTML标签，只保留内容
    email = re.sub("<[^<>]>"," ",email) #找到<>标签进行替换，注意：我们匹配的<>标签中内部不能含有<>---<<>>---最小匹配，其他方法也可以实现

    #3. 将所有的网址替换为字符串 “httpaddr”.
    email = re.sub("(http|https)://[^\s]*","httpaddr",email)    #匹配网址，直到遇到空格

    #4. 将所有的邮箱地址替换为 “emailaddr”
    email = re.sub('[^\s]+@[^\s]+', 'emailaddr', email) #匹配邮箱 @前后为空为截止

    #5. 将所有dollar符号($)替换为“dollar”
    email = re.sub('[\$]+', 'dollar', email)

    #6. 将所有数字替换为“number”
    email = re.sub('[0-9]+', 'number', email)

    #7. 将所有单词还原为词源，词干提取 https://www.jb51.net/article/63732.htm
    tokens = re.split('[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email)   #使用以上字符进行切分内容为单词
    tokenlist = []

    s = ns.SnowballStemmer('english')   #一个词干提取器对象

    for token in tokens:
        #8. 移除所有非文字类型
        token = re.sub("[^a-zA-Z0-9]",'',token)

        #9. 去除空字符串‘’
        if not len(token):
            continue
        # print("---: ",token)
        stemmed = s.stem(token) #获取词根   costs变为cost expecting变为expect
        # print("++++++: ",stemmed)

        tokenlist.append(stemmed)

    return tokenlist

with open('emailSample1.txt','r') as fp:
    content = fp.read()

email = preprocessing(content)

（三）将Email转化为词向量

将提取的单词转换成特征向量：

def email2VocabIndices(email,vocab):
    """
    提取我们邮件email中的单词，在vocab单词表中出现的索引
    """
    token = preprocessing(email)    #获取预处理后的邮件单词列表

    #获取邮件单词在垃圾邮件单词表的索引位置
    index_list = [i for i in range(len(vocab)) if vocab[i] in token]

    return index_list

def email2FeatureVector(email):
    """
    将email转为词向量
    """
    #读取提供给我们的单词（被认为是垃圾邮件的单词）
    df = pd.read_table("vocab.txt",names=['words']) #读取数据
    vocab = df.values   #array类型
    # print(vocab)
    vector = np.zeros(vocab.shape[0])

    #查看我们我们的邮件中是否存在这些单词，如果存在则返回索引（是在上面vocab中的索引）
    index_list = email2VocabIndices(email,vocab)
    print(index_list)

    #将单词索引转向量
    for i in index_list:
        vector[i] = 1

    return vector

with open('emailSample1.txt','r') as fp:
    content = fp.read()

vector = email2FeatureVector(content)
print(vector)

[70, 85, 88, 161, 180, 237, 369, 374, 430, 478, 529, 530, 591, 687, 789, 793, 798, 809, 882, 915, 944, 960, 991, 1001, 1061, 1076, 1119, 1161, 1170, 1181, 1236, 1363, 1439, 1476, 1509, 1546, 1662, 1698, 1757, 1821, 1830, 1892, 1894, 1895]
[0. 0. 0. ... 0. 0. 0.]

print(vector.shape)

------------------垃圾邮件过滤参数问题（线性核函数）------------------

注意：data/spamTrain.mat是对邮件进行预处理后（自然语言处理）获得的向量。

一：数据读取

data1 = sio.loadmat("spamTrain.mat")
X,y = data1['X'],data1['y'].flatten()

data2 = sio.loadmat("spamTest.mat")
Xtest,ytest = data2['Xtest'],data2['ytest'].flatten()

print(X.shape,y.shape)

print(X)　　#每一行代表一个邮件样本，每个样本有1899个特征，特征为1表示在跟垃圾邮件有关的语义库中找到相关单词

print(y)　　# 每一行代表一个邮件样本，等于1表示为垃圾邮件

这里用于判断垃圾邮件的词汇，还是我们预处理中的vocab.txt的单词。

二：获取最佳准确率和参数C（线性核只有C）

def get_best_params(X,y,Xval,yval):
    # C 和 σ 的候选值
    Cvalues = [3, 0.03, 100, 0.01, 0.1, 0.3, 1, 10, 30]  # 9

    best_score = 0  #用于存放最佳准确率
    best_params = 0  #用于存放参数C

    for c in Cvalues:
        clf = svm.SVC(c,kernel='linear')
        clf.fit(X,y)    # 用训练集数据拟合模型
        score = clf.score(Xval,yval)    # 用验证集数据进行评分
        if score > best_score:
            best_score = score
            best_params = c

    return best_score,best_params

best_score,best_params = get_best_params(X,y,Xtest,ytest)
print(best_score,best_params)

clf = svm.SVC(best_params,kernel="linear")
clf.fit(X,y)
score_train = clf.score(X,y)
score_test = clf.score(Xtest,ytest)
print(score_train,score_test)

三:根据我们训练的结果,判断最有可能被判为垃圾邮件的单词

best_params = 0.3

clf = svm.SVC(best_params,kernel="linear")
clf.fit(X,y)    #进行训练

# 读取提供给我们的单词（被认为是垃圾邮件的单词）
df = pd.read_table("vocab.txt", names=['words'])  # 读取数据
vocab = df.values  # array类型

#先获取训练结果中，各个特征的重要性 ---- coef_:各特征的系数（重要性）https://www.cnblogs.com/xxtalhr/p/11166848.html
print(clf.coef_)
indices = np.argsort(clf.coef_).flatten()[::-1] #因为argsort是小到大排序,提取其对应的index(索引)，我们设置步长为-1，倒序获取特征系数
print(indices)  #打印索引

for i in range(15): #我们获取前15个最有可能判为垃圾邮件的单词
    print("{}---{:0.6f}".format(vocab[indices[i]],clf.coef_.flatten()[indices[i]]))

------------------垃圾邮件判断------------------

一:邮件数据

(一)正常邮件

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com

emailSample1.txt

Folks,
 
my first time posting - have a bit of Unix experience, but am new to Linux.

 
Just got a new PC at home - Dell box with Windows XP. Added a second hard disk
for Linux. Partitioned the disk and have installed Suse 7.2 from CD, which went
fine except it didn't pick up my monitor.
 
I have a Dell branded E151FPp 15" LCD flat panel monitor and a nVidia GeForce4
Ti4200 video card, both of which are probably too new to feature in Suse's default
set. I downloaded a driver from the nVidia website and installed it using RPM.
Then I ran Sax2 (as was recommended in some postings I found on the net), but
it still doesn't feature my video card in the available list. What next?
 
Another problem. I have a Dell branded keyboard and if I hit Caps-Lock twice,
the whole machine crashes (in Linux, not Windows) - even the on/off switch is
inactive, leaving me to reach for the power cable instead.
 
If anyone can help me in any way with these probs., I'd be really grateful -
I've searched the 'net but have run out of ideas.
 
Or should I be going for a different version of Linux such as RedHat? Opinions
welcome.
 
Thanks a lot,
Peter

-- 
Irish Linux Users' Group: ilug@linux.ie
http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.
List maintainer: listmaster@linux.ie

emailSample2.txt

(二)垃圾邮件

Do You Want To Make $1000 Or More Per Week?

 

If you are a motivated and qualified individual - I 
will personally demonstrate to you a system that will 
make you $1,000 per week or more! This is NOT mlm.

 

Call our 24 hour pre-recorded number to get the 
details.  

 

000-456-789

 

I need people who want to make serious money.  Make 
the call and get the facts. 

Invest 2 minutes in yourself now!

 

000-456-789

 

Looking forward to your call and I will introduce you 
to people like yourself who
are currently making $10,000 plus per week!

 

000-456-789



3484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72

spamSample1.txt

Best Buy Viagra Generic Online

Viagra 100mg x 60 Pills $125, Free Pills & Reorder Discount, Top Selling 100% Quality & Satisfaction guaranteed!

We accept VISA, Master & E-Check Payments, 90000+ Satisfied Customers!
http://medphysitcstech.ru

spamSample2.txt

二:获取各个邮件的词向量特征(预处理获得)

with open('emailSample1.txt','r') as fp:
    content = fp.read()

with open('emailSample2.txt','r') as fp:
    content2 = fp.read()

with open('spamSample1.txt','r') as fp:
    content3 = fp.read()

with open('spamSample2.txt','r') as fp:
    content4 = fp.read()

vector = email2FeatureVector(content)
vector2 = email2FeatureVector(content2)
vector3 = email2FeatureVector(content3)
vector4 = email2FeatureVector(content4)

三:提供训练的参数,进行预测邮件是否为垃圾邮件

res = clf.predict(np.array([vector]))
res2 = clf.predict(np.array([vector2]))
res3 = clf.predict(np.array([vector3]))
res4 = clf.predict(np.array([vector4]))
print(res,res2,res3,res4)