Key points of this chapter:
Exploring a corpus with NLTK
Loading the corpus
with open("./text.txt") as f:
text = f.read()
print(type(text))
print(text[:200])
<class 'str'>
[ Moby Dick by Herman Melville 1851 ] ETYMOLOGY . ( Supplied by a Late Consumptive Usher to a Grammar School ) The pale Usher -- threadbare in coat , heart , body , and brain ; I see him now . He was
- This is a local corpus file. If you need it, it can be downloaded here: https://pan.baidu.com/s/1EPWqGO5bpWs80MPr49s1KQ password: df6b
The NLTK library
First, the imports:
import nltk
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
0. Converting the local corpus to a Text object
text = text.split(' ')
text = nltk.text.Text(text)
type(text)
nltk.text.Text
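Note that `split(' ')` only works well here because this copy of Moby Dick is already pre-tokenized, with spaces around punctuation; ordinary raw text would normally go through `nltk.word_tokenize()` instead. A minimal, self-contained sketch of the same conversion on a toy string (the sentence is made up for illustration):

```python
import nltk

# A toy pre-tokenized string; for raw text, prefer nltk.word_tokenize().
raw = "the whale was a monstrous whale"
tokens = raw.split(' ')
text = nltk.text.Text(tokens)

print(type(text))           # <class 'nltk.text.Text'>
print(text.count('whale'))  # 2
```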
Below are some commonly used methods of the Text class:
1. Searching the text
Concordance search: concordance()
The concordance() function produces a concordance view: every occurrence of a given word, shown together with some surrounding context. Let's look at the word monstrous in Moby Dick:
text.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
Similar-word search: similar()
The .similar() method finds words that appear in contexts similar to those of the search word, i.e. words the text uses in a similar way:
text.similar("monstrous")
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
Common-context search: common_contexts()
The common_contexts() function lets us examine the contexts shared by two or more words, such as monstrous and very. The words must be passed as a list: enclosed in square brackets inside the parentheses, separated by commas:
text.common_contexts(["monstrous","very"])
No common contexts were found
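No shared context is found in this local corpus, but on a toy example where both words do occur in the same slot, common_contexts() prints that slot (the sentence below is made up for illustration):

```python
import nltk

tokens = "the very big whale and the monstrous big shark".split()
text = nltk.text.Text(tokens)

# Both 'very' and 'monstrous' occur between 'the' and 'big',
# so the shared context is printed as the_big.
text.common_contexts(["very", "monstrous"])
```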
Visualizing word positions: dispersion_plot()
The dispersion_plot() function visualizes where words occur in the text. A word's position is measured as the number of words preceding it from the start of the text, and this positional information is shown as a dispersion plot: each vertical stripe marks one occurrence of a word, and each row represents the entire text:
text.dispersion_plot(['the',"monstrous", "whale", "Pictures",
"Scenes", "size"])
2. Counting vocabulary
Length: len()
len(text)
260819
Deduplication: set()
print(len(set(text)))
19317
Sorting: sorted(set(text))
sorted(set(text))[-10:]
['zag',
'zay',
'zeal',
'zephyr',
'zig',
'zodiac',
'zone',
'zoned',
'zones',
'zoology']
Counting a single word: count()
text.count('the')
13721
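The measurements above can be combined into a simple lexical-diversity score (distinct words divided by total words). A small illustration on a made-up token list:

```python
# Lexical diversity: ratio of distinct tokens to total tokens.
tokens = ['the', 'whale', 'and', 'the', 'white', 'whale']

total = len(tokens)           # 6
distinct = len(set(tokens))   # 4
diversity = distinct / total  # 0.666...

print(total, distinct, round(diversity, 3))
```

Applied to the Moby Dick numbers above, 19317 / 260819 gives roughly 0.074, i.e. each distinct word is used about 13.5 times on average.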
3. Frequency distributions
FreqDist()
How can we automatically identify the words that best reflect a text's topic and style? Imagine you had to find the 50 most frequent words of a book: how would you do it? NLTK has this built in. Let's use FreqDist to find the most common words in Moby Dick:
from nltk import FreqDist  # importing via nltk.book also works, but loads all the book corpora; if that fails, install the corpora (see the end of this post)
fdist1 = FreqDist(text)
fdist1.most_common(10)
[(',', 18713),
('the', 13721),
('.', 6862),
('of', 6536),
('and', 6024),
('a', 4569),
('to', 4542),
(';', 4072),
('in', 3916),
('that', 2982)]
- (left: the word; right: the number of times it occurs in the text)
Cumulative frequency plot
Do any of the words in the previous example help us grasp the topic or style of the text? Only one word, whale, is slightly informative! It occurs more than 900 times. The rest tell us nothing about the text; they are just English "plumbing". What proportion of the text do these words take up? We can generate a cumulative frequency plot of them with:
fdist1.plot(50, cumulative=True)
A more meaningful cut
The most frequent words found above carry little real meaning. Instead, consider words that are longer than 7 characters and occur more than 7 times:
fdist2 = FreqDist(text)
sorted(w for w in set(text) if len(w)>7 and fdist2[w]>7)[:10]
['American',
'Atlantic',
'Bulkington',
'Canallers',
'Christian',
'Commodore',
'Consider',
'Fedallah',
'Greenland',
'Guernsey']
- Now we can glean some real information about the text
Functions defined in the FreqDist class
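A few of the more useful FreqDist methods, demonstrated on a toy token list (the tokens are made up for illustration):

```python
from nltk import FreqDist

tokens = ['whale', 'the', 'whale', 'sea', 'the', 'whale']
fdist = FreqDist(tokens)

print(fdist['whale'])        # 3, count of a single word
print(fdist.N())             # 6, total number of samples
print(fdist.max())           # 'whale', the most frequent sample
print(fdist.freq('the'))     # 0.333..., relative frequency
print(fdist.hapaxes())       # ['sea'], words occurring only once
print(fdist.most_common(2))  # [('whale', 3), ('the', 2)]
```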
4. Collocations and bigrams
A collocation is a sequence of words that frequently occur together: red wine is a collocation, while the wine is not. The collocations() function finds them:
text.collocations()
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
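Under the hood, collocations are found by scoring bigrams (pairs of adjacent words). A minimal sketch of doing this directly with nltk.collocations, on a made-up token list:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ['sperm', 'whale', 'and', 'white', 'whale', 'and',
          'sperm', 'whale', 'again', 'sperm', 'whale']

finder = BigramCollocationFinder.from_words(tokens)
# Rank bigrams by raw frequency; PMI and other measures are also available.
best = finder.nbest(BigramAssocMeasures().raw_freq, 1)
print(best)  # [('sperm', 'whale')]
```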
Installing NLTK and its corpora
1. Install the nltk library
$ pip install nltk
2. Install the NLTK corpora
Automatic installation:
If you can reach the download servers, simply run:
import nltk
nltk.download()
Offline installation:
If the download servers are unreachable, the corpora can be installed offline instead.
They can be downloaded directly here: https://pan.baidu.com/s/1Vxc0RT8Vae3A5v1k1FjhTQ password: 1drd
After downloading, place the data under /user/<username>/nltk_data and it will work normally.
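If you unpack the data in a non-default location, you can also add that directory to NLTK's data search path (the path below is a placeholder):

```python
import nltk

# Tell NLTK about a custom nltk_data directory (placeholder path).
nltk.data.path.append("/path/to/nltk_data")

print("/path/to/nltk_data" in nltk.data.path)  # True
```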
Now import everything:
from nltk.book import *
Source: CSDN
Author: 肃之为冠
Link: https://blog.csdn.net/StarrySky3/article/details/104592852