Python Day 17: Text Word Frequency Counting

牧云@^-^@, posted on 2019-12-29 02:19:25

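Text Word Frequency Counting
I. Overview
1. Requirement: given an article, which words appear in it, and which appear most often?
2. First, note that English and Chinese texts are counted differently: English words are separated by spaces, while Chinese text must be segmented into words first.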
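A small sketch of why the two cases differ. The sample strings are made up for illustration; the point is only that `str.split()` finds word boundaries in English but not in Chinese, which is why a segmenter such as jieba is used later in this post.

```python
# English: whitespace separates words, so split() already yields tokens.
english = "to be or not to be"
english_tokens = english.split()
print(english_tokens)

# Chinese: there are no spaces between words, so split() returns the
# whole sentence as a single token. A word segmenter is needed instead.
chinese = "孔明曰主公不可"
chinese_tokens = chinese.split()
print(chinese_tokens)
```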
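II. "HAMLET"
1. Noise handling: extract the words and strip out everything else, such as punctuation.
2. Extract the words: split() cuts the text on whitespace and returns a list.
3. Count each word and its frequency with a dictionary.
4. Sort by count (the sort key is the number of occurrences) with the list sort method.
5. Print the result.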

Hamlet

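def gettext():
    # Read the whole file and lowercase it so "The" and "the" count as one word
    with open("hamlet.txt", 'r') as f:
        text = f.read()
    text = text.lower()
    # Replace punctuation with a space so that words joined by punctuation
    # (for example "him--and") are not glued together
    for ch in '"#$%^&*()_+-,./<>=@{}[]~\'':
        text = text.replace(ch, ' ')
    return text

hamlettext = gettext()
words = hamlettext.split()
counts = {}
for word in words:
    # get() returns 0 the first time a word is seen
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
# Sort the (word, count) pairs by count, highest first
items.sort(key=lambda x: x[1], reverse=True)
# Print the top 20, or fewer if the text has fewer distinct words
for i in range(min(20, len(items))):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))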
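The same count-and-sort pipeline can also be sketched with the standard library's `collections.Counter`. This is an alternative to the dictionary approach above, not the author's version; `top_words` and the sample sentence are hypothetical names and data chosen for illustration, and `string.punctuation` covers a slightly different character set than the loop above.

```python
from collections import Counter
import string

def top_words(text, n=20):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Map every ASCII punctuation character to a space in one pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    # most_common sorts by count, highest first.
    return Counter(words).most_common(n)

sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

`Counter` replaces the manual `counts.get(word, 0) + 1` bookkeeping, and `most_common` replaces the explicit `list(...)` plus `sort` step.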

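III. Counting character appearances in Romance of the Three Kingdoms (《三国演义》)
1. First version

Romance of the Three Kingdoms

First, get the words; second, count how many times each word appears in the text; third, print the most frequent words (the code below prints the top 10).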

import jieba

with open('三国演义.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
# jieba.lcut segments the Chinese text and returns a list of words
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single-character tokens (punctuation, particles), which are not names
    if len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, times = items[i]
    print('{0:<10}{1:>5}'.format(word, times))

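Problems found:
孔明 and 孔明曰 ("Kongming" and "Kongming said") should count as the same person.
Words such as 荆州 (Jingzhou, a place) are not personal names.
Improvements:
Remove the non-name words from the results.
When building the dictionary of counts, treat 孔明 and 孔明曰 as one word.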

2. Second version
import jieba

with open('D:/pythonfiles/三国演义.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
# Frequent words that are not personal names
excludes = {'将军','却说','荆州','二人','不可','不能','如此','如何','军士','商议','左右','军马','次日','引兵','大喜','天下','东吴','于是','今日','不敢'}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # Merge the different names of one character into a single key
    elif word == '诸葛亮' or word == '孔明曰':
        reword = '孔明'
    elif word == '玄德' or word == '玄德曰':
        reword = '刘备'
    elif word == '关公' or word == '云长':
        reword = '关羽'
    elif word == '孟德' or word == '丞相':
        reword = '曹操'
    else:
        reword = word
    counts[reword] = counts.get(reword, 0) + 1
for word in excludes:
    # pop with a default avoids a KeyError if an excluded word never occurred
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(20):
    word, count = items[i]
    print('{:<10}{:>5}'.format(word, count))

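The same problems still show up in the output: some entries are not names, and some characters are still split across several aliases. Keep applying the two improvements above (extend the excludes set and the alias mapping) until the list is clean.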
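As the alias list grows, the elif chain gets long. One possible way to keep it manageable is a dictionary that maps each alias to a canonical name. This is only a sketch: `count_names`, the `aliases` table, and the short token list are hypothetical and chosen for illustration; in the real script the tokens would come from `jieba.lcut(txt)`.

```python
def count_names(words, aliases, excludes):
    """Count tokens, merging aliases and skipping excluded words."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue  # single characters are rarely names
        name = aliases.get(word, word)  # fall back to the token itself
        if name in excludes:
            continue
        counts[name] = counts.get(name, 0) + 1
    # Return (name, count) pairs sorted by count, highest first.
    return sorted(counts.items(), key=lambda x: x[1], reverse=True)

# Hypothetical sample data standing in for the output of jieba.lcut(txt).
aliases = {"诸葛亮": "孔明", "孔明曰": "孔明", "玄德": "刘备", "玄德曰": "刘备"}
excludes = {"荆州", "将军"}
words = ["孔明", "孔明曰", "荆州", "玄德", "曰", "诸葛亮", "将军", "玄德曰"]
result = count_names(words, aliases, excludes)
for name, count in result:
    print("{:<10}{:>5}".format(name, count))
```

With this layout, each further round of cleanup only adds an entry to `aliases` or `excludes`; the counting loop itself does not change.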
