nltk

How to iterate through all nodes of a tree?

*爱你&永不变心* submitted on 2020-02-07 11:37:16
Question: I want to simplify my parse trees' nodes, i.e. given a node label I get rid of the first hyphen and everything that comes after that hyphen. For example, if a node is NP-TMP-FG I want to make it NP, and if it is SBAR-SBJ I want to make it SBAR, and so on. This is an example of one parse tree that I have:

( (S (S-TPC-2 (NP-SBJ (NP (DT The) (NN asbestos) (NN fiber) ) (, ,) (NP (NN crocidolite) ) (, ,) ) (VP (VBZ is) (ADJP-PRD (RB unusually) (JJ resilient) ) (SBAR-TMP (IN once) (S (NP-SBJ (PRP it) ) (VP
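The excerpt is cut off by the aggregator, but the label-stripping it describes can be sketched with nltk.Tree. This is a minimal illustration rather than the thread's accepted answer; the shortened tree string is adapted from the visible example.

from nltk import Tree

def simplify_labels(tree):
    # Rename every subtree, keeping only the part of the label
    # before the first hyphen (NP-SBJ -> NP, SBAR-TMP -> SBAR).
    for subtree in tree.subtrees():
        subtree.set_label(subtree.label().split('-')[0])
    return tree

t = Tree.fromstring('(S (NP-SBJ (DT The) (NN fiber)) (VP (VBZ is) (ADJP-PRD (JJ resilient))))')
simplify_labels(t)
print(t)  # (S (NP (DT The) (NN fiber)) (VP (VBZ is) (ADJP (JJ resilient))))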

stem function error: stem required one positional argument

假如想象 submitted on 2020-02-04 01:55:47
Question: Here the stem function shows an error saying that stem requires one positional argument when called in the loop below. Why?

from nltk.stem import PorterStemmer as ps
text = 'my name is pythonly and looking for a pythonian group to be formed by me iteratively'
words = word_tokenize(text)
for word in words:
    print(ps.stem(word))

Answer 1: You need to instantiate a PorterStemmer object:

from nltk.stem import PorterStemmer as ps
from nltk.tokenize import word_tokenize
stemmer = ps()
text = 'my name is pythonly and looking
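The answer is truncated above; a complete, runnable version of the corrected loop presumably looks like the sketch below. The fix itself (calling stem on an instance, not on the class) comes from the visible part of the answer; the rest is filled in for illustration.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()  # stem() is an instance method, so create an instance first
text = 'my name is pythonly and looking for a pythonian group to be formed by me iteratively'
for word in word_tokenize(text):
    print(stemmer.stem(word))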

NLTK: How to create a corpus from csv file

烈酒焚心 submitted on 2020-02-02 15:09:28
Question: I have a csv file like

col1        col2    col3
some text   someID  some value
some text   someID  some value

In each row, col1 corresponds to the text of an entire document. I would like to create a corpus from this csv. My aim is to use sklearn's TfidfVectorizer to compute document similarity and do keyword extraction. So consider

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(<my corpus here>)

so that I can then use

str = 'here is some text from a new document'
response
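The question breaks off here, but the general pattern of feeding a csv column to TfidfVectorizer can be sketched as follows. The file name, the use of pandas, and the cosine-similarity step are illustrative assumptions, not the thread's answer.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv('documents.csv')          # hypothetical file; col1 holds each document's text
corpus = df['col1'].astype(str).tolist()   # TfidfVectorizer accepts any iterable of strings

tfidf = TfidfVectorizer(stop_words='english')
tfs = tfidf.fit_transform(corpus)

query = 'here is some text from a new document'
response = tfidf.transform([query])        # project the new document into the same space
print(cosine_similarity(response, tfs))    # similarity of the new document to every row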

This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical

狂风中的少年 submitted on 2020-02-02 14:47:25
Question: I am trying to install TensorFlow on Ubuntu and I am getting this message:

(base) k@k-1005:~/Documents/ClassificationTexte/src$ python tester.py
Using TensorFlow backend.
RUN: 1
1.1. Training the classifier...
LABELS: {'negative', 'neutral', 'positive'}
2019-12-10 11:58:13.428875: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To
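For readers hitting the same output: the cpu_feature_guard line is an informational (level I) log, not an error, and the script keeps running. If the message is unwanted, one common approach is to raise TensorFlow's C++ log threshold before importing it; the snippet below is a generic sketch, not part of the asker's tester.py.

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'  # 0 = all, 1 = hide INFO, 2 = also hide WARNING, 3 = errors only

import tensorflow as tf  # the MKL-DNN / CPU-instruction INFO line is no longer printed
print(tf.__version__)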

Sentence tokenization for texts that contains quotes

半世苍凉 submitted on 2020-02-01 03:59:05
Question:

Code:

from nltk.tokenize import sent_tokenize
pprint(sent_tokenize(unidecode(text)))

Output:

['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.', 'Finally they pushed you out of the cold emergency room.', 'I failed to protect you.', '"Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.',]

Input:

After Du died
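The input text is truncated, but the visible problem is that sent_tokenize splits inside the quoted passage, leaving fragments with unbalanced quotation marks. One rough workaround (a sketch, not the thread's accepted answer) is to re-merge consecutive fragments until the double quotes balance out:

from nltk.tokenize import sent_tokenize

def merge_quoted(sentences):
    # Keep joining fragments while the running count of double quotes is odd,
    # i.e. while we are still inside an open quotation.
    merged, buffer = [], ''
    for s in sentences:
        buffer = (buffer + ' ' + s).strip() if buffer else s
        if buffer.count('"') % 2 == 0:  # quotes balanced -> sentence is complete
            merged.append(buffer)
            buffer = ''
    if buffer:
        merged.append(buffer)
    return merged

text = 'He said: "It rained. We stayed home." Then he left.'
print(merge_quoted(sent_tokenize(text)))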

How to Normalize similarity measures from Wordnet

拟墨画扇 submitted on 2020-01-31 05:29:05
Question: I am trying to calculate semantic similarity between two words. I am using WordNet-based similarity measures, i.e. the Resnik measure (RES), Lin measure (LIN), Jiang and Conrath measure (JNC), and Banerjee and Pederson measure (BNP). To do that, I am using nltk and WordNet 3.0. Next, I want to combine the similarity values obtained from the different measures. To do that I need to normalize the similarity values, as some measures give values between 0 and 1 while others give values greater than 1. So, my
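For context, this is how the bounded and unbounded measures look side by side in nltk; the min-max rescaling at the end is only one possible normalization, not the answer the asker ultimately chose, and the sample scores are placeholders.

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')   # information content used by RES, LIN and JNC
dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')

res = dog.res_similarity(cat, brown_ic)    # unbounded, depends on the IC corpus
lin = dog.lin_similarity(cat, brown_ic)    # already in [0, 1]
jcn = dog.jcn_similarity(cat, brown_ic)    # unbounded

# A simple min-max rescale over the scores of all word pairs puts a measure on [0, 1].
scores = [res, 3.0, 7.5]                   # e.g. RES scores for several pairs (placeholder values)
lo, hi = min(scores), max(scores)
print([(s - lo) / (hi - lo) for s in scores])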

Stanford segmenter nltk Could not find SLF4J in your classpath

假装没事ソ submitted on 2020-01-26 03:14:41
Question: I've set up an nltk and Stanford environment, and the nltk and Stanford jars have been downloaded. The program using nltk was fine, but I ran into trouble with the Stanford segmenter: running a simple program through the Stanford segmenter, I get the error "Could not find SLF4J in your classpath", although I had exported all the jars, including slf4j-api.jar. Details as follows:

Python 3.5
NLTK 3.2.2
Stanford jars 3.7
OS: CentOS
environment variables:
export JAVA_HOME=/usr/java/jdk1.8.0_60
export NLTK_DATA=/opt/nltk_data
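The question is cut off before the actual CLASSPATH export, but the error generally means the Java process launched by nltk cannot locate slf4j-api.jar. One workaround people report is to point the CLASSPATH environment variable at the jar from inside Python before constructing the segmenter; the paths below are placeholders, and whether CLASSPATH (or a separate SLF4J variable, or a path_to_slf4j argument) is consulted depends on the NLTK version, so treat the details as assumptions to verify.

import os

segmenter_dir = '/opt/stanford-segmenter'  # hypothetical install location
os.environ['CLASSPATH'] = os.pathsep.join([
    os.path.join(segmenter_dir, 'stanford-segmenter.jar'),
    os.path.join(segmenter_dir, 'slf4j-api.jar'),  # the jar the SLF4J check is looking for
])

# The segmenter is then constructed as usual from nltk's Stanford wrapper.
from nltk.tokenize.stanford_segmenter import StanfordSegmenter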

Successfully running the topicrank code

做~自己de王妃 submitted on 2020-01-25 09:29:35
TopicRank is a model for extracting keywords from text. I needed to understand the paper for a course and wanted to run its code; getting the topicrank code to run took quite a while, so I am recording the process here.

First, the link to the code: https://github.com/smirnov-am/pytopicrank

Next, the environment setup (the requirements are as follows): the configuration file lists the required package versions; Python 3 is enough, and I used 3.6.9. Because installing one of the packages had failed in my previous environment, I decided to create a fresh environment, which is where anaconda's convenience really shows.

The main problems that came up were:

1. When installing the langdetect package, errors kept appearing. My eventual fix was to remove some of anaconda's mirror source addresses and then download from a single mirror instead, which worked. During installation I also found that problems can occur on Windows; installing on a Linux server later went without issues.

   (a) Remove all mirror sources and switch back to the default source:
       conda config --remove-key channels
   (b) Then, when downloading the package, add a single Tsinghua mirror:
       conda install -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/ langdetect==1.0.7

2. After all the libraries were installed
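Once the environment installs cleanly, running the model comes down to the pytopicrank package itself. I have not verified its exact interface here, so both the class name and the method call below are assumptions and should be checked against the README at https://github.com/smirnov-am/pytopicrank before use.

from pytopicrank import TopicRank  # assumed import path

text = 'Some English document whose keyphrases we want to extract.'
tr = TopicRank(text)               # assumed constructor: takes the raw text
print(tr.get_top_n(n=3))           # assumed method returning the top-n keyphrases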

12.03 assignment

笑着哭i submitted on 2020-01-25 09:19:54
Key points:

Understand the naive Bayes algorithm
Understand the modeling process of a machine learning algorithm
Understand the common text-processing pipeline
Understand model evaluation methods

Spam classification

Data preparation:
Read the mail data with csv and split it into the mail category and the mail content.
Preprocess the mail content: remove words shorter than 3 characters, remove words with no semantic content, etc.
Try using the nltk library: pip install nltk, then nltk.download(); if that fails, fall back to plain word-frequency counting.
Split the data into training and test sets.

from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords          # fixed typo: was "nltk.corpue"
stops = stopwords.words('english')         # stopwords is a corpus reader, so call .words()
stops
tokens = [token for token in tokens if token not in stops]   # fixed: "for token in tokens"
' '.join(tokens)
text

#pip install nltk
#nltk.download()
from sklearn.model_selection import train_test_split
import nltk
from nltk.stem import WordNetLemmatizer
#lemmatizer=WordNetLemmatizer()
#lemmatizer.lemmatize('leaves')
# spam classification
text='''Yes i
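The post breaks off in the middle of the sample text, but for the end-to-end shape of the assignment (tokenize, remove stop words, build word-frequency features, split, train naive Bayes, evaluate) a compact sketch follows. The tiny inline dataset stands in for the course's csv of mails and is entirely made up.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

# Placeholder rows standing in for the csv of (category, mail content).
mails = [
    ('spam', 'WIN a FREE prize now, click this link to claim your reward'),
    ('ham',  'are we still meeting for lunch tomorrow at noon'),
    ('spam', 'cheap loans approved instantly, reply to this offer today'),
    ('ham',  'please send me the lecture notes from last week'),
]
labels = [category for category, _ in mails]
texts = [content for _, content in mails]

stops = set(stopwords.words('english'))

def preprocess(text):
    # Lowercase, tokenize, drop stop words and words shorter than 3 characters.
    tokens = word_tokenize(text.lower())
    return ' '.join(t for t in tokens if t not in stops and len(t) >= 3)

X = CountVectorizer().fit_transform(preprocess(t) for t in texts)   # word-frequency features
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=0)

model = MultinomialNB().fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))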