nltk

can NLTK/pyNLTK work “per language” (i.e. non-English), and how?

流过昼夜 submitted on 2020-01-10 14:12:14
Question: How can I tell NLTK to treat the text in a particular language? Once in a while I write a specialized NLP routine to do POS tagging, tokenizing, etc. on a non-English (but still Indo-European) text domain. This question seems to address only different corpora, not the change in code/settings: POS tagging in German. Alternatively, are there any specialized Hebrew/Spanish/Polish NLP modules for Python?

Answer 1: I'm not sure what you're referring to as the changes in code/settings. NLTK mostly
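Much of NLTK's per-language support amounts to passing a language name or choosing a language-specific resource. A minimal sketch, assuming the punkt tokenizer models are installed (the German sample sentence is illustrative):

    from nltk.tokenize import word_tokenize
    from nltk.stem.snowball import SnowballStemmer

    # Punkt ships pretrained models for many European languages, so the
    # tokenizer can be switched by name.
    text = "Dies ist ein kurzer deutscher Beispielsatz."
    tokens = word_tokenize(text, language="german")

    # Snowball covers German, Spanish and others (though not Polish or Hebrew).
    stemmer = SnowballStemmer("german")
    print(tokens)
    print([stemmer.stem(t) for t in tokens])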

Python text processing: NLTK and pandas

梦想的初衷 submitted on 2020-01-10 08:27:10
Question: I'm looking for an effective way to construct a term-document matrix in Python that can be used together with extra data. I have some text data with a few other attributes. I would like to run some analyses on the text, and I would like to be able to correlate features extracted from the text (such as individual word tokens or LDA topics) with the other attributes. My plan was to load the data as a pandas data frame, where each response represents a document. Unfortunately, I ran into an issue:
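One common approach is to build the matrix with scikit-learn and wrap it back into pandas; a sketch under that assumption (the toy data frame and column names are made up, and get_feature_names_out needs scikit-learn >= 1.0):

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical data: one row per response, plus an extra attribute.
    df = pd.DataFrame({
        "text": ["the cat sat", "the dog ran", "a cat and a dog"],
        "group": ["A", "B", "A"],
    })

    # Term-document matrix, wrapped back into pandas so term counts can
    # sit alongside the other columns for correlation analyses.
    vec = CountVectorizer()
    tdm = pd.DataFrame(
        vec.fit_transform(df["text"]).toarray(),
        columns=vec.get_feature_names_out(),
        index=df.index,
    )
    combined = pd.concat([df, tdm], axis=1)
    print(combined)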

Slow nltk_data downloads for NLTK on Windows 10 with Python 3

穿精又带淫゛_ submitted on 2020-01-09 15:21:04
Slow nltk_data downloads for NLTK on Windows 10 with Python 3. NLTK is an efficient Python-based platform for working with human-language data. It provides easy-to-use interfaces to more than 50 corpora and lexical resources (such as WordNet), together with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. On Windows, however, the download and install often stalls or aborts because nltk_data is so large. An offline installation method:

Step 1: Download nltk_data from GitHub at https://github.com/nltk/nltk_data (Python 3 is supported). Take the packages directory and extract every archive inside its subfolders.

Step 2: In a Python session, locate the data path:

    import nltk
    nltk.data.find(".")

Step 3: Move the folders from step 1 into the path reported in step 2 (mine is C:\Users\Username\AppData\Roaming\nltk_data).

Step 4: Test from the interpreter:

    from nltk.book import *

If output like the following appears, the installation works:

    *** Introductory Examples for the NLTK Book ***
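If the data has to live somewhere other than the default location, NLTK also honors extra search paths; a small sketch (the D:\ directory is illustrative):

    import nltk

    # nltk.data.path is searched in order; prepending a custom directory
    # (or setting the NLTK_DATA environment variable) points NLTK at it.
    nltk.data.path.insert(0, r"D:\nltk_data")
    nltk.data.find("corpora/wordnet")  # raises LookupError if still missing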

Python NLTK Lemmatization of the word 'further' with wordnet

旧时模样 submitted on 2020-01-09 05:33:50
Question: I'm working on a lemmatizer using Python, NLTK and the WordNetLemmatizer. Here is a random text that outputs what I was expecting:

    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet
    lem = WordNetLemmatizer()
    lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective
    # Output: 'bad'
    lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb
    # Output: 'worse'

Well, everything here is fine. The behaviour is the
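The word from the title can be probed the same way; a short sketch (whether the ADV form maps to 'far' depends on WordNet's adverb exception list, so treat the outputs as something to inspect rather than a guarantee):

    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet

    lem = WordNetLemmatizer()
    # 'further' as an adjective vs. as an adverb; the adverb lookup goes
    # through WordNet's exception list (adv.exc).
    print(lem.lemmatize('further', pos=wordnet.ADJ))
    print(lem.lemmatize('further', pos=wordnet.ADV))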

Changing k-means clustering distance metric to Canberra distance or any other distance metric in Python

青春壹個敷衍的年華 submitted on 2020-01-07 08:07:08
Question: How do I change the distance metric of k-means clustering to Canberra distance or any other distance metric? From my understanding, sklearn only supports Euclidean distance, and NLTK doesn't seem to support Canberra distance, but I may be wrong. Thank you!

Answer 1:

    from scipy.spatial import distance
    from nltk.cluster.kmeans import KMeansClusterer
    obj = KMeansClusterer(num_clusters, distance=distance.canberra)

Source: https://stackoverflow.com/questions/59554641/changing-k-mean-clustering-distance-metric
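A slightly fuller sketch of the same answer (the toy vectors and the repeats/avoid_empty_clusters choices are mine, not the answerer's):

    import numpy
    from scipy.spatial import distance
    from nltk.cluster.kmeans import KMeansClusterer

    # Toy 2-D vectors; any callable taking two vectors can serve as the metric.
    vectors = [numpy.array(v) for v in ([1, 1], [1, 2], [8, 8], [9, 8])]
    clusterer = KMeansClusterer(2, distance=distance.canberra, repeats=5,
                                avoid_empty_clusters=True)
    assigned = clusterer.cluster(vectors, assign_clusters=True)
    print(assigned)  # one cluster index per vector, e.g. [0, 0, 1, 1]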

Splitting a string into words with multiple word-boundary delimiters

北城以北 submitted on 2020-01-06 23:34:31
I think what I want to do is a fairly common task, but I can't find any reference on the web. I have text with punctuation, and I want a list of the words.

    "Hey, you - what are you doing here!?"

should become

    ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

But Python's str.split() only takes a single separator, so after splitting on whitespace all the words still carry their punctuation. Any ideas?

#1: I'm getting reacquainted with Python and needed the same thing. The findall solution is probably better, but I came up with this:

    tokens = [x.strip() for x in data.split(',')]

#2: A case where a regular expression is reasonable:

    import re
    DATA = "Hey, you - what are you doing here!?"
    print(re.findall(r"[\w']+", DATA))
    # Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

#3: re.split()

    re.split(pattern, string[, maxsplit=0])

Split string by the occurrences of pattern.
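For completeness, a sketch of the re.split() route on the same sentence, splitting on runs of non-word characters:

    import re

    data = "Hey, you - what are you doing here!?"
    # Split on any run of non-word characters, then drop the empty string
    # left behind by the trailing punctuation.
    words = [w for w in re.split(r"\W+", data) if w]
    print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']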

Python code flow does not work as expected?

北城以北 submitted on 2020-01-06 18:10:58
Question: I am trying to process various texts with regular expressions and Python's NLTK (see http://www.nltk.org/book). I am trying to create a random text generator and I am having a slight problem. Firstly, here is my code flow:

1. Enter a sentence as input; this is called the trigger string and is assigned to a variable.
2. Get the longest word in the trigger string.
3. Search the whole Project Gutenberg database for sentences that contain this word, regardless of case.
4. Return the longest sentence that has the word.

I
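A minimal sketch of steps 1-4 against NLTK's Gutenberg corpus (this is not the asker's code, just one way the flow could look; it assumes the gutenberg corpus is downloaded and scans every sentence, so it is slow):

    from nltk.corpus import gutenberg

    trigger = input("Enter a sentence: ")
    longest_word = max(trigger.split(), key=len)

    # gutenberg.sents() yields every corpus sentence as a list of tokens;
    # keep the ones containing the word, case-insensitively.
    matches = [s for s in gutenberg.sents()
               if longest_word.lower() in (w.lower() for w in s)]
    if matches:
        print(" ".join(max(matches, key=len)))  # longest match, by token count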

different nltk results in django and at command line

故事扮演 submitted on 2020-01-06 17:10:51
Question: I have a Django 1.8 view that looks like this:

    def sourcedoc_parse(request, sourcedoc_id):
        sourcedoc = Sourcedoc.objects.get(pk=sourcedoc_id)
        nltk.data.path.append('/root/nltk_data')
        new_words = []
        english_vocab = set(w.lower() for w in nltk.corpus.gutenberg.words())  # <--- the line where the error occurs
        results = {}
        template = 'sourcedoc_parse.html'
        params = {'sourcedoc': sourcedoc, 'results': results, 'new_words': new_words, 'BASE_URL': BASE_URL}
        return render_to_response(template, params,
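When NLTK behaves differently under Django than in a shell, the data search path is the usual suspect, since the server process often runs as a different user; a diagnostic sketch to run in both environments:

    import nltk

    # nltk.data.path is searched in order; a Django worker's default path
    # (and thus the corpora it can see) can differ from your shell's.
    print(nltk.data.path)
    nltk.data.find("corpora/gutenberg")  # raises LookupError if missing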

Unicode Tagging in Python NLTK

≡放荡痞女 submitted on 2020-01-06 13:55:04
Question: I am working on a Python NLTK tagging program. My input file is Hindi text containing several lines. After tokenizing the text and running pos_tag, every token comes out tagged NN only, yet with an English sentence as input it tags properly. Version: Python 3.4.1 with NLTK 3.0. Kindly help! Here is what I tried:

    word_to_be_tagged = u"ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात."
    from nltk.corpus import indian
    train_data = indian.tagged_sents('hindi.pos')[
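The stock pos_tag model is trained on English, which is why Hindi tokens all fall back to a single tag; training on the Indian corpus, as the question starts to do, is the usual remedy. A completed sketch of that idea (the DefaultTagger backoff is my addition, not the asker's):

    from nltk.corpus import indian
    from nltk.tag import DefaultTagger, UnigramTagger

    # Train a unigram tagger on the Hindi section of the Indian corpus,
    # backing off to NN for words never seen in training.
    train_data = indian.tagged_sents('hindi.pos')
    tagger = UnigramTagger(train_data, backoff=DefaultTagger('NN'))

    word_to_be_tagged = u"ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात."
    print(tagger.tag(word_to_be_tagged.split()))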