0x00 Preface
Result demo
Install the jieba library:
pip install jieba
jieba has three segmentation modes:
1. Precise mode: the lcut function cuts the text into exactly one list of tokens.
2. Full mode: scans out every word jieba can form, overlaps included (cut_all=True).
3. Search-engine mode: on top of precise mode, long words are split again, which suits search-engine indexing (see the sketch below).
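A minimal sketch of all three modes. The sample sentence and expected outputs come from jieba's own documentation, not from the original post:

import jieba

s = "我来到北京清华大学"
print(jieba.lcut(s))                # precise mode: ['我', '来到', '北京', '清华大学']
print(jieba.lcut(s, cut_all=True))  # full mode: every word jieba can form, with overlaps
print(jieba.lcut_for_search(s))     # search-engine mode: long words are split again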
Word frequency:
A mapping of <word>: <number of occurrences> key-value pairs, built as shown below.
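That pairing is exactly what a Python dict gives you. A minimal sketch of the counting idiom used later, on a tiny made-up token list:

counts = {}
for w in ["曹操", "孔明", "曹操"]:   # hypothetical token list
    counts[w] = counts.get(w, 0) + 1  # get() supplies 0 the first time a word is seen
print(counts)                         # {'曹操': 2, '孔明': 1}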
IPO description (input, process, output):
Input: read the text of 三国演义 (Romance of the Three Kingdoms) from a file.
Process: segment the text with jieba and use a dict to count how often each word appears.
Output: the 10 most frequent words in the text.
The code, in four steps (a compact skeleton follows the list):
Step 1: read the file
Step 2: segment the text
Step 3: count the words
Step 4: sort and print
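Before the full script, here is a skeleton that maps one line to each step. It uses collections.Counter instead of a plain dict; that substitution is mine, as a slightly more idiomatic equivalent:

import jieba
from collections import Counter

content = open('三国演义.txt', encoding='utf-8').read()  # step 1: read the file
words = jieba.lcut(content)                              # step 2: segment
counts = Counter(w for w in words if len(w) > 1)         # step 3: count, skipping 1-char tokens
for word, count in counts.most_common(10):               # step 4: sort, print the top 10
    print("{0:<10}{1:>5}".format(word, count))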
That covers the introduction. Now on to the hands-on part!
0x02 Hands-on
The complete code is as follows:
import jieba

content = open('三国演义.txt', 'r', encoding='utf-8').read()
words = jieba.lcut(content)  # segment the full text
# words to drop from the final ranking (generic/noise terms)
excludes = {"将军", "却说", "二人", "后主", "上马", "不知", "天子", "大叫", "众将", "不可",
            "主公", "蜀兵", "只见", "如何", "商议", "都督", "一人", "汉中", "不敢", "人马",
            "陛下", "魏兵", "天下", "今日", "左右", "东吴", "于是", "荆州", "不能", "如此",
            "大喜", "引兵", "次日", "军士", "军马"}
counts = {}

for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    elif word == '孔明' or word == '孔明曰':
        real_word = '孔明'
    elif word == '关公' or word == '云长':
        real_word = '关羽'
    elif word == '孟德' or word == '丞相':
        real_word = '曹操'
    elif word == '玄德' or word == '玄德曰':
        real_word = '刘备'
    else:
        real_word = word
    counts[real_word] = counts.get(real_word, 0) + 1  # count the merged name, not the raw token

for word in excludes:
    counts.pop(word, None)  # remove noise words; pop() won't raise if a word never appeared

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
0x03 Notes
① While running the script I hit a decoding error when reading the file; the fix was to open it with encoding='utf-8'.
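A sketch of the symptom and the fix; the GBK default is my assumption about a Chinese-locale Windows setup:

# Without an explicit encoding, open() uses the platform default (often GBK on
# Chinese-locale Windows), which raises UnicodeDecodeError on a UTF-8 file:
#     content = open('三国演义.txt', 'r').read()
# Passing the encoding explicitly avoids the problem:
content = open('三国演义.txt', 'r', encoding='utf-8').read()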
After running it, a lot of irrelevant words still show up in the results. The only remedy is to keep cleaning: move the noise words into the excludes set and rerun. Even after clearing out most of them, some remain. Since this was just a practice exercise, I limited the output to the top 10 entries. If you want the ranking to contain nothing but proper names, the excludes set has to keep growing, so I'll stop the game here.
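If you do keep going, one way to make the name merging easier to extend than the elif chain is an alias table; this refactoring is my suggestion, not the original author's code:

ALIASES = {
    '孔明曰': '孔明',
    '关公': '关羽', '云长': '关羽',
    '孟德': '曹操', '丞相': '曹操',
    '玄德': '刘备', '玄德曰': '刘备',
}

for word in words:
    if len(word) == 1:
        continue
    real_word = ALIASES.get(word, word)  # map any alias to its canonical name
    counts[real_word] = counts.get(real_word, 0) + 1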
0x04 Closing
The end.
Original post: https://www.cnblogs.com/A9kl/p/9311246.html