nltk

can NLTK/pyNLTK work “per language” (i.e. non-English), and how?

流过昼夜 submitted on 2020-01-10 14:12:14
Question: How can I tell NLTK to treat the text in a particular language? Once in a while I write a specialized NLP routine to do POS tagging, tokenizing, etc. on a non-English (but still Indo-European) text domain. This question seems to address only different corpora, not the change in code/settings: POS tagging in German. Alternatively, are there any specialized Hebrew/Spanish/Polish NLP modules for Python?

Answer 1: I'm not sure what you're referring to as the changes in code/settings. NLTK mostly
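Much of NLTK's per-language support amounts to passing a language name or choosing a language-specific resource. A minimal sketch, assuming the punkt tokenizer models are installed (the German sample sentence is illustrative):

    from nltk.tokenize import word_tokenize
    from nltk.stem.snowball import SnowballStemmer

    # Punkt ships pretrained models for many European languages, so the
    # tokenizer can be switched by name.
    text = "Dies ist ein kurzer deutscher Beispielsatz."
    tokens = word_tokenize(text, language="german")

    # Snowball covers German, Spanish and others (though not Polish or Hebrew).
    stemmer = SnowballStemmer("german")
    print(tokens)
    print([stemmer.stem(t) for t in tokens])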

Python text processing: NLTK and pandas

梦想的初衷 submitted on 2020-01-10 08:27:10
Question: I'm looking for an effective way to construct a term-document matrix in Python that can be used together with extra data. I have some text data with a few other attributes. I would like to run some analyses on the text, and I would like to be able to correlate features extracted from the text (such as individual word tokens or LDA topics) with the other attributes. My plan was to load the data as a pandas data frame, where each response represents a document. Unfortunately, I ran into an issue:
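One common approach is to build the matrix with scikit-learn and wrap it back into pandas; a sketch under that assumption (the toy data frame and column names are made up, and get_feature_names_out needs scikit-learn >= 1.0):

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical data: one row per response, plus an extra attribute.
    df = pd.DataFrame({
        "text": ["the cat sat", "the dog ran", "a cat and a dog"],
        "group": ["A", "B", "A"],
    })

    # Term-document matrix, wrapped back into pandas so term counts can
    # sit alongside the other columns for correlation analyses.
    vec = CountVectorizer()
    tdm = pd.DataFrame(
        vec.fit_transform(df["text"]).toarray(),
        columns=vec.get_feature_names_out(),
        index=df.index,
    )
    combined = pd.concat([df, tdm], axis=1)
    print(combined)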

Slow nltk_data downloads for NLTK on Windows 10 with Python 3

穿精又带淫゛_ submitted on 2020-01-09 15:21:04
Slow nltk_data downloads for NLTK on Windows 10 with Python 3. NLTK is an efficient Python-based platform for working with human-language data. It provides easy-to-use interfaces to more than 50 corpora and lexical resources (such as WordNet), together with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. On Windows, however, the download and install often stalls or aborts because nltk_data is so large. An offline installation method:

Step 1: Download nltk_data from GitHub at https://github.com/nltk/nltk_data (Python 3 is supported). Take the packages directory and extract every archive inside its subfolders.

Step 2: In a Python session, locate the data path:

    import nltk
    nltk.data.find(".")

Step 3: Move the folders from step 1 into the path reported in step 2 (mine is C:\Users\Username\AppData\Roaming\nltk_data).

Step 4: Test from the interpreter:

    from nltk.book import *

If output like the following appears, the installation works:

    *** Introductory Examples for the NLTK Book ***
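If the data has to live somewhere other than the default location, NLTK also honors extra search paths; a small sketch (the D:\ directory is illustrative):

    import nltk

    # nltk.data.path is searched in order; prepending a custom directory
    # (or setting the NLTK_DATA environment variable) points NLTK at it.
    nltk.data.path.insert(0, r"D:\nltk_data")
    nltk.data.find("corpora/wordnet")  # raises LookupError if still missing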

Python NLTK Lemmatization of the word 'further' with wordnet

旧时模样 submitted on 2020-01-09 05:33:50
Question: I'm working on a lemmatizer using Python, NLTK and the WordNetLemmatizer. Here is a random text that outputs what I was expecting:

    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet
    lem = WordNetLemmatizer()
    lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective
    # Output: 'bad'
    lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb
    # Output: 'worse'

Well, everything here is fine. The behaviour is the
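The word from the title can be probed the same way; a short sketch (whether the ADV form maps to 'far' depends on WordNet's adverb exception list, so treat the outputs as something to inspect rather than a guarantee):

    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet

    lem = WordNetLemmatizer()
    # 'further' as an adjective vs. as an adverb; the adverb lookup goes
    # through WordNet's exception list (adv.exc).
    print(lem.lemmatize('further', pos=wordnet.ADJ))
    print(lem.lemmatize('further', pos=wordnet.ADV))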

Changing k-means clustering distance metric to Canberra distance or any other distance metric in Python

青春壹個敷衍的年華 submitted on 2020-01-07 08:07:08
Question: How do I change the distance metric of k-means clustering to Canberra distance or any other distance metric? From my understanding, sklearn only supports Euclidean distance, and NLTK doesn't seem to support Canberra distance, but I may be wrong. Thank you!

Answer 1:

    from scipy.spatial import distance
    from nltk.cluster.kmeans import KMeansClusterer
    obj = KMeansClusterer(num_clusters, distance=distance.canberra)

Source: https://stackoverflow.com/questions/59554641/changing-k-mean-clustering-distance-metric
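A slightly fuller sketch of the same answer (the toy vectors and the repeats/avoid_empty_clusters choices are mine, not the answerer's):

    import numpy
    from scipy.spatial import distance
    from nltk.cluster.kmeans import KMeansClusterer

    # Toy 2-D vectors; any callable taking two vectors can serve as the metric.
    vectors = [numpy.array(v) for v in ([1, 1], [1, 2], [8, 8], [9, 8])]
    clusterer = KMeansClusterer(2, distance=distance.canberra, repeats=5,
                                avoid_empty_clusters=True)
    assigned = clusterer.cluster(vectors, assign_clusters=True)
    print(assigned)  # one cluster index per vector, e.g. [0, 0, 1, 1]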

Splitting a string into words with multiple word-boundary delimiters

北城以北 submitted on 2020-01-06 23:34:31
I think what I want to do is a fairly common task, but I can't find any reference on the web. I have text with punctuation, and I want a list of the words.

    "Hey, you - what are you doing here!?"

should become

    ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

But Python's str.split() only takes a single separator, so after splitting on whitespace all the words still carry their punctuation. Any ideas?

#1: I'm getting reacquainted with Python and needed the same thing. The findall solution is probably better, but I came up with this:

    tokens = [x.strip() for x in data.split(',')]

#2: A case where a regular expression is reasonable:

    import re
    DATA = "Hey, you - what are you doing here!?"
    print(re.findall(r"[\w']+", DATA))
    # Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

#3: re.split()

    re.split(pattern, string[, maxsplit=0])

Split string by the occurrences of pattern.
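For completeness, a sketch of the re.split() route on the same sentence, splitting on runs of non-word characters:

    import re

    data = "Hey, you - what are you doing here!?"
    # Split on any run of non-word characters, then drop the empty string
    # left behind by the trailing punctuation.
    words = [w for w in re.split(r"\W+", data) if w]
    print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']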

Python code flow does not work as expected?

北城以北 submitted on 2020-01-06 18:10:58
Question: I am trying to process various texts with regular expressions and Python's NLTK (see http://www.nltk.org/book). I am trying to create a random text generator and I am having a slight problem. Firstly, here is my code flow:

1. Enter a sentence as input; this is called the trigger string and is assigned to a variable.
2. Get the longest word in the trigger string.
3. Search the whole Project Gutenberg database for sentences that contain this word, regardless of case.
4. Return the longest sentence that has the word.

I
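A minimal sketch of steps 1-4 against NLTK's Gutenberg corpus (this is not the asker's code, just one way the flow could look; it assumes the gutenberg corpus is downloaded and scans every sentence, so it is slow):

    from nltk.corpus import gutenberg

    trigger = input("Enter a sentence: ")
    longest_word = max(trigger.split(), key=len)

    # gutenberg.sents() yields every corpus sentence as a list of tokens;
    # keep the ones containing the word, case-insensitively.
    matches = [s for s in gutenberg.sents()
               if longest_word.lower() in (w.lower() for w in s)]
    if matches:
        print(" ".join(max(matches, key=len)))  # longest match, by token count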

different nltk results in django and at command line

故事扮演 submitted on 2020-01-06 17:10:51
Question: I have a Django 1.8 view that looks like this:

    def sourcedoc_parse(request, sourcedoc_id):
        sourcedoc = Sourcedoc.objects.get(pk=sourcedoc_id)
        nltk.data.path.append('/root/nltk_data')
        new_words = []
        english_vocab = set(w.lower() for w in nltk.corpus.gutenberg.words())  # <--- the line where the error occurs
        results = {}
        template = 'sourcedoc_parse.html'
        params = {'sourcedoc': sourcedoc, 'results': results, 'new_words': new_words, 'BASE_URL': BASE_URL}
        return render_to_response(template, params,
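When NLTK behaves differently under Django than in a shell, the data search path is the usual suspect, since the server process often runs as a different user; a diagnostic sketch to run in both environments:

    import nltk

    # nltk.data.path is searched in order; a Django worker's default path
    # (and thus the corpora it can see) can differ from your shell's.
    print(nltk.data.path)
    nltk.data.find("corpora/gutenberg")  # raises LookupError if missing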

Unicode Tagging in Python NLTK

≡放荡痞女 submitted on 2020-01-06 13:55:04
Question: I am working on a Python NLTK tagging program. My input file is Hindi text containing several lines. After tokenizing the text and running pos_tag, every token comes out tagged NN only, yet with an English sentence as input it tags properly. Version: Python 3.4.1 with NLTK 3.0. Kindly help! Here is what I tried:

    word_to_be_tagged = u"ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात."
    from nltk.corpus import indian
    train_data = indian.tagged_sents('hindi.pos')[
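The stock pos_tag model is trained on English, which is why Hindi tokens all fall back to a single tag; training on the Indian corpus, as the question starts to do, is the usual remedy. A completed sketch of that idea (the DefaultTagger backoff is my addition, not the asker's):

    from nltk.corpus import indian
    from nltk.tag import DefaultTagger, UnigramTagger

    # Train a unigram tagger on the Hindi section of the Indian corpus,
    # backing off to NN for words never seen in training.
    train_data = indian.tagged_sents('hindi.pos')
    tagger = UnigramTagger(train_data, backoff=DefaultTagger('NN'))

    word_to_be_tagged = u"ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात."
    print(tagger.tag(word_to_be_tagged.split()))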