Data Mining: Text Mining and Drawing a Word Cloud

Text mining is the practice of turning textual information into data that can be analyzed.

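1. Building the "corpus"

A corpus is the collection of all the documents we want to analyze.

Here the contents of existing text files are loaded into a new corpus.

Approach:

Place the text files, grouped by category, under one root directory, then walk the root directory and read every file in every subdirectory.

Store what was read in a DataFrame, so that each row holds a file path and the content of that file.

Key calls:

Traversal: os.walk(fileDir) walks every directory under fileDir; looping over its results (a for loop) gives the path of each file.

File reading: codecs.open(filepath, mode, encoding) takes the file path, the open mode (for example 'r' or 'w') and the file encoding, and returns a file object from which the text is read.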

# Build the corpus
import os
import codecs
filepaths = []  # empty list for the file paths
filecontents = []  # empty list for the file contents
for root, dirs, files in os.walk(r'.\SogouC.mini\Sample'):
    for name in files:
        # Join the current directory and the file name to get the full path of each file
        filepath = os.path.join(root, name)
        filepaths.append(filepath)  # collect the paths from all subdirectories in one list
        # Open the file read-only ('r') with UTF-8 encoding
        f = codecs.open(filepath, 'r', 'utf-8')
        filecontent = f.read()  # read the whole file into the string filecontent
        f.close()  # close the file
        filecontents.append(filecontent)  # collect the contents from all subdirectories in one list

import pandas as pd
# Combine the collected paths and contents into the corpus DataFrame
corpos = pd.DataFrame({
        'filePath': filepaths,
        'fileContent': filecontents})
corpos.to_csv(r'.\corpos.csv', sep=',', encoding='utf_8_sig', index=False)
# encoding='utf_8_sig' keeps the Chinese text from being garbled when the saved file is opened
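To see what the traversal produces without needing the SogouC sample data, the following is a minimal sketch that runs the same os.walk logic on a throwaway directory. The helper name read_corpus, the folder names and the file contents are all made up for illustration; they are not part of the original code.

```python
import os
import tempfile


def read_corpus(root_dir, encoding="utf-8"):
    """Walk root_dir and return parallel lists of file paths and file contents."""
    paths, contents = [], []
    for root, _dirs, files in os.walk(root_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # The with statement closes the file even if reading fails.
            with open(path, "r", encoding=encoding) as f:
                contents.append(f.read())
            paths.append(path)
    return paths, contents


# Hypothetical mini corpus: two category folders holding one file each.
with tempfile.TemporaryDirectory() as tmp:
    for category, text in [("sports", "football match"), ("tech", "new phone")]:
        os.makedirs(os.path.join(tmp, category))
        with open(os.path.join(tmp, category, "doc.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    demo_paths, demo_contents = read_corpus(tmp)

print(len(demo_paths), sorted(demo_contents))
```

The two returned lists line up index by index, which is the shape the DataFrame constructor in the corpus code expects.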

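2. Chinese word segmentation

The jieba package is the usual choice for Chinese word segmentation; it is simple and convenient to use and fairly accurate.

Some of the jieba functions:

jieba.cut('str') segments the string str into words
jieba.add_word() adds a custom word to the dictionary
jieba.load_userdict() loads the words in a local file and adds them to the dictionary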

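Segmentation code:

import jieba
# Empty lists for the words and for the path of the file each word came from
segments = []
filepath_2 = []
# Iterate over every row of the corpus
for index, row in corpos.iterrows():
    filePath = row['filePath']  # file path
    fileContent = row['fileContent']  # text content
    segs = jieba.cut(fileContent)  # segment the text content
    # Append each word and its file path to segments and filepath_2
    for seg in segs:
        segments.append(seg)
        filepath_2.append(filePath)
# Combine the two lists into a DataFrame (word, path)
segmeng_DF = pd.DataFrame({
        'segment': segments,
        'filePath': filepath_2})

The result is a DataFrame with one row per word and the path of the file it came from.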

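3. Word frequency statistics

Once the segmentation results are available, count how many times each word occurs to build a word frequency table.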

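##### Word frequency statistics
# Count the occurrences of each word from the segmentation results, then sort by count in descending order
# The column name 频数 means "frequency"
segcount = segmeng_DF.groupby(by='segment').size().reset_index(
        name='频数').sort_values(by='频数', ascending=False)

help(pd.DataFrame.sort_values)

DataFrame no longer has a sort method, and in current pandas sort_index no longer accepts the by argument, so sorting by a column is done with sort_values. Passing a renaming dict such as {'频数': np.size} to agg on a single grouped column is also rejected by current pandas, which is why the count is taken with size() and named through reset_index.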
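As a quick cross-check of the counting step, the same tally can be made with collections.Counter from the standard library. This is only an illustrative sketch on a hand-written token list (segments_demo is a made-up name), not a replacement for the pandas code.

```python
from collections import Counter

# Hypothetical segmentation output: a flat list of tokens.
segments_demo = ["数据", "挖掘", "数据", "文本", "数据", "挖掘"]

# Counter tallies each distinct token; most_common() sorts by count, highest first.
counts = Counter(segments_demo)
ranked = counts.most_common()

print(ranked)  # [('数据', 3), ('挖掘', 2), ('文本', 1)]
```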

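After counting, stop words (modal particles and other words that carry no real meaning) need to be removed.

Note: the stop words are loaded from a file here. read_csv handled paths containing Chinese characters poorly in the author's setup, so prefer a path made of English characters.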

stopwords = pd.read_csv(r'D:\python_study\StopwordsCN.txt',encoding='utf-8',index_col=False)

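stopwords holds the words that should be removed from the segmentation results.

There are two ways to remove the stop words:

Remove them from the frequency table after counting, using the isin method and "~" for negation
Add a filter condition while segmenting the corpus, so stop words are never collected

First approach: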

fsegcount = segcount[~segcount.segment.isin(stopwords.stopword)]
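The isin and "~" combination can be seen on a tiny example. The sketch below assumes pandas is installed and uses made-up frames (segcount_demo, stopwords_demo) with the same column names as above.

```python
import pandas as pd

# Hypothetical frequency table and stop word list.
segcount_demo = pd.DataFrame({
    "segment": ["的", "数据", "了", "挖掘"],
    "频数": [10, 5, 4, 3],
})
stopwords_demo = pd.DataFrame({"stopword": ["的", "了"]})

# isin marks the rows whose word is a stop word; "~" flips the boolean mask.
is_stopword = segcount_demo["segment"].isin(stopwords_demo["stopword"])
filtered = segcount_demo[~is_stopword]

print(filtered)
```

Only the rows for 数据 and 挖掘 remain, with their original counts and index labels.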

Second approach:

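##### Filter out stop words during segmentation
import jieba
# Empty lists for the words and for the path of the file each word came from
segments = []
filepath_2 = []
# A set makes the membership test below fast
stopword_set = set(stopwords.stopword.values)
# Iterate over every row of the corpus
for index, row in corpos.iterrows():
    filePath = row['filePath']  # file path
    fileContent = row['fileContent']  # text content
    segs = jieba.cut(fileContent)  # segment the text content
    # Keep a word only if it is not a stop word and is not pure whitespace
    for seg in segs:
        if (seg not in stopword_set) and (len(seg.strip()) > 0):
            segments.append(seg)
            filepath_2.append(filePath)
# Combine the two lists into a DataFrame (word, path)
segmeng_DF = pd.DataFrame({
        'segment': segments,
        'filePath': filepath_2})
# Count each word and sort by count in descending order (频数 means "frequency")
segcount_1 = segmeng_DF.groupby(by='segment').size().reset_index(
        name='频数').sort_values(by='频数', ascending=False)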

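The if condition checks whether each segmented word is among the stopwords values (and whether it is empty after stripping whitespace), so stop words are dropped before counting and the result is the final word frequency table.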
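The filter condition itself can be tried in isolation with plain Python. In this sketch the stop word set and the token list are invented for illustration; the condition is the same two-part test used in the loop above.

```python
# Hypothetical stop words and raw segmentation output (including whitespace tokens).
stopword_set = {"的", "了"}
raw_segs = ["数据", "的", " ", "挖掘", "了", "\n", "文本"]

# Keep a token only if it is not a stop word and is not empty after stripping whitespace.
kept = [seg for seg in raw_segs
        if seg not in stopword_set and len(seg.strip()) > 0]

print(kept)  # ['数据', '挖掘', '文本']
```

The second half of the condition matters because segmentation also emits spaces and line breaks as tokens, which would otherwise show up in the frequency table.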

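4. Drawing the word cloud

First install the wordcloud package. Here it is installed from a downloaded whl file:

https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud

In cmd, change to the download directory and run pip install wordcloud-1.5.0-cp36-cp36m-win_amd64.whl to install it (the exact file name depends on the Python version and platform).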

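from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Pass two parameters: the path of a font file and the background color
wordcloud = WordCloud(font_path=r'D:\python_study\python数据挖掘\数据挖掘学习代码\课件\2.4\simhei.ttf',
                      background_color='gray')
# fit_words expects a dict mapping each word to its count, so convert the frequency DataFrame to a dict
# Set the word column as the index first, then call to_dict
words = segcount_1.set_index('segment').to_dict()
type(segcount_1.set_index('segment'))  # a DataFrame with a single column

wordcloud.fit_words(words['频数'])  # lay out the words according to their counts
plt.imshow(wordcloud)
plt.axis('off')  # hide the axes around the image
plt.show()  # display the figure before closing it
plt.close()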

The result looks similar to the figure on the right.
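The set_index and to_dict conversion used before fit_words produces a nested dict: the outer key is the column name and the inner dict maps each word to its count. The sketch below shows this on a made-up two-row frame (segcount_small), assuming pandas is installed.

```python
import pandas as pd

# Hypothetical frequency table with the same column names as segcount_1.
segcount_small = pd.DataFrame({
    "segment": ["数据", "挖掘"],
    "频数": [5, 3],
})

# The word column becomes the index, so to_dict() yields {column: {word: count}}.
words_demo = segcount_small.set_index("segment").to_dict()
freqs = words_demo["频数"]

print(words_demo)
```

This is why the plotting code passes words['频数'] rather than words itself: only the inner dict has the word-to-count shape that fit_words takes.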

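5. Beautifying the word cloud

Replace the plain background of the word cloud with an image related to the topic.

Packages needed:

from scipy.misc import imread (imread has been removed from recent SciPy releases; loading the image with Pillow via np.array(Image.open(imgFilePath)) is a common substitute)
from wordcloud import WordCloud, ImageColorGenerator

Key calls:

Read the background image: bimg = imread(imgFilePath)

Get the colors of the image: bimgColors = ImageColorGenerator(bimg)

Recolor the word cloud: wordcloud.recolor(color_func=bimgColors)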

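# Word cloud beautification
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
# Read the background image that will shape the word cloud.
# scipy.misc.imread is no longer available in recent SciPy releases, so Pillow loads the image here
bimg = np.array(Image.open(r'D:\python_study\python数据挖掘\数据挖掘学习代码\课件\2.5\贾宝玉2.png'))
# The upper body of Jia Baoyu serves as the word cloud shape (no special reason; it was simply the image at hand)
wordcloud = WordCloud(
    background_color="white",
    mask=bimg, font_path=r'D:\python_study\python数据挖掘\数据挖掘学习代码\课件\2.4\simhei.ttf'
)

wordcloud = wordcloud.fit_words(words['频数'])
# Set the parameters of the output figure
plt.figure(
    num=None,
    figsize=(8, 6), dpi=80,
    facecolor='w', edgecolor='k')
# Get the colors of the image
bimgColors = ImageColorGenerator(bimg)
# Hide the axes
plt.axis("off")
# Recolor the word cloud with the image colors
plt.imshow(wordcloud.recolor(color_func=bimgColors))

plt.show()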

The final result is shown in the figure on the right.

Source: oschina

Link: https://my.oschina.net/u/4257455/blog/3829838
