nlp

Fast shell command to remove stop words in a text file

一世执手 submitted on 2019-12-13 15:03:27
Question: I have a 2 GB text file and am trying to remove frequently occurring English stop words from it. My stopwords.txt contains entries like this: a an the for and I. What is a fast way to do this with a shell command such as tr, sed, or awk?

Answer 1: Here's a method using the command line and perl. Save the text below as replacesw.sh:

    #! /bin/bash
    MYREGEX=\\b\(`perl -pe 's/\n/|/g' $1`\)\\b
    perl -pe "s/$MYREGEX//g" $2

Then if you have saved your file above as stopwords.txt, and have a second
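For comparison, the same idea (one alternation regex wrapped in word boundaries, applied line by line) can be sketched in Python. This is only an illustration, not the answer's method: the function names and the file paths passed to `filter_file` are hypothetical, and matching is case-sensitive, as in the perl version.

```python
import re


def build_stopword_pattern(stopwords):
    """Compile one regex that matches any stop word as a whole word."""
    # Drop blank entries and try longer words first.
    words = sorted({w.strip() for w in stopwords if w.strip()}, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")


def remove_stopwords(line, pattern):
    """Delete every stop-word match from a single line of text."""
    return pattern.sub("", line)


def filter_file(stopwords_path, in_path, out_path):
    """Stream a large file line by line so it never has to fit in memory."""
    with open(stopwords_path, encoding="utf-8") as f:
        pattern = build_stopword_pattern(f)
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(remove_stopwords(line, pattern))
```

Like the perl substitution, this leaves the surrounding whitespace in place, so removed words show up as runs of spaces in the output.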

What does “document” mean in a NLP context?

僤鯓⒐⒋嵵緔 submitted on 2019-12-13 14:23:03
Question: As I was reading about tf–idf on Wikipedia, I was confused by what the word "document" means. Does it mean a paragraph? "The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of
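The quoted definition can be made concrete with a small sketch. Here each "document" is assumed to be just a list of tokens; which unit of text that list stands for (a paragraph, an article, a whole file) is a choice made by the application, not something the formula fixes. The function name and the toy corpus are hypothetical.

```python
import math


def inverse_document_frequency(term, documents):
    """log(total number of documents / number of documents containing term)."""
    n_containing = sum(1 for doc in documents if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / n_containing)


# Toy corpus: three "documents", each already split into tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["a", "bird", "sang"],
]

common = inverse_document_frequency("the", corpus)   # log(3 / 2)
rare = inverse_document_frequency("bird", corpus)    # log(3 / 1)
```

A term that appears in most documents gets a lower value than one that appears in only a few, which is the "common or rare" distinction the quote describes.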

How to detect homophones

♀尐吖头ヾ submitted on 2019-12-13 12:42:38
Question: I am fairly new to speech processing and am wondering how homophones are detected. I am searching for an API that gives the similarity between two words based on how they are pronounced. For example, "to" and "two" are highly similar in how they sound compared with, say, "to" and "from".

Answer 1: You might want to try calculating the edit distance not on the original strings but on pronunciations, such as those available in the CMU Pronouncing Dictionary at http://www.speech.cs.cmu
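A minimal sketch of that suggestion follows: Levenshtein distance computed over phoneme sequences rather than letters. The tiny `PRONUNCIATIONS` table is hand-written for illustration in the style of CMU dictionary entries (stress markers omitted) and is an assumption; in practice the entries would be loaded from the dictionary itself.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here, lists of phonemes)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical sample entries; a real lookup would come from the dictionary file.
PRONUNCIATIONS = {
    "to": ["T", "UW"],
    "two": ["T", "UW"],
    "from": ["F", "R", "AH", "M"],
}


def pronunciation_distance(word_a, word_b):
    """Distance between two words measured on their phoneme sequences."""
    return edit_distance(PRONUNCIATIONS[word_a], PRONUNCIATIONS[word_b])
```

With these sample entries, "to" and "two" are at distance 0, while "to" and "from" share no phonemes and are four edits apart.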

What's needed for NLP?

若如初见. submitted on 2019-12-13 12:22:06
Question: Assuming that I know nothing about anything and that I'm starting programming TODAY, what would I need to learn in order to start working with Natural Language Processing? I've been struggling with some string parsing methods, but so far it is just annoying me and making me write ugly code. I'm looking for some fresh ideas on how to create something like the Remember The Milk API to parse user input, in order to provide an input form for fast data entry that are not based

Noun countability

元气小坏坏 submitted on 2019-12-13 12:07:13
Question: Are there any resources for determining the countability of nouns? Either some way to work it out, or a dictionary that records whether a noun is likely to be countable or uncountable? I'm not interested in whether the noun can be countable, but in whether it is likely to be countable. For instance, "rice" can become "rices", which means it can be countable, but in most cases it won't be.

Answer 1: This is a tough one. Many English words can be both (beer, time, glass, language, etc.) depending on the context

How to compute the average sentence length (in words) of a text file containing 100 sentences using Python [closed]

ぃ、小莉子 submitted on 2019-12-13 10:09:58
Question: It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. Closed 6 years ago.

I have a text file that contains 100 sentences. I want to write a Python script that computes the average sentence length (in words) of that file. Thanks.

Answer 1: The naive way: sents = text
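One naive approach can be sketched as below. It is an assumption about what "naive" might look like, not a reconstruction of the truncated answer: sentences are split on runs of `.`, `!`, or `?`, and words are split on whitespace, so abbreviations such as "e.g." will be miscounted.

```python
import re


def average_sentence_length(text):
    """Mean number of whitespace-separated words per sentence."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


sample = "I like tea. You like coffee very much!"
average = average_sentence_length(sample)  # sentences of 3 and 5 words
```

For a file, the text would first be read in full, for example with `open(path, encoding="utf-8").read()`, where `path` is whatever file holds the sentences.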

Gensim example: TypeError between str and int

时间秒杀一切 submitted on 2019-12-13 09:37:54
Question: When running the code below (Python 3.6, latest Gensim library, in Jupyter):

    for model in models:
        print(str(model))
        pprint(model.docvecs.most_similar(positive=["Machine learning"], topn=20))

[1]: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb

Answer 1:

    string = "machine learning".split()
    doc_vector = model.infer_vector(string)
    out = model.docvecs.most_similar([doc_vector])

I'm not 100% sure since I'm using a more recent release, but I think that the

Stanford CoreNLP sentiment training set

陌路散爱 submitted on 2019-12-13 08:33:17
Question: I am new to the area of NLP and sentiment analysis in particular. My goal is to train the Stanford CoreNLP sentiment model. I am aware that the sentences provided as training data should be in the following format. (3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2
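To make the bracketed format easier to read, here is a small sketch that parses a complete tree of this shape, where every node is written as `(label child ...)` or `(label word)`. It is an illustrative reader with hypothetical function names, not part of CoreNLP, and it expects a well-formed tree, so the truncated example above would not parse.

```python
def parse_tree(text):
    """Parse '(label child ...)' into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = int(tokens[pos])  # numeric sentiment label of this node
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())  # nested phrase
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def leaves(tree):
    """Collect the words of a parsed tree in sentence order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


tree = parse_tree("(3 (2 The) (2 Rock))")
```

Here `tree` is the nested tuple `(3, [(2, ["The"]), (2, ["Rock"])])`, so each phrase carries its own label alongside the words it covers.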

Identify an English word as a thing or product?

对着背影说爱祢 submitted on 2019-12-13 07:42:41
Question: Write a program with the following objective: be able to identify whether a word/phrase represents a thing/product. For example:
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <- be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. as a

How can I add custom annotations to default ANNIE gazetteer?

久未见 submitted on 2019-12-13 06:42:37
Question: I'm using the GATE SDK and would like to modify the default ANNIE Gazetteer to include a simple annotation based on a new list definition I have created. I've added my list definition to GATE-HOME\plugins\ANNIE\resources\gazetteer. I've added an entry in the lists.def file to point to my new list file, e.g. *open_source_software:opensouce*. I've created an annotation schema and added it to GATE-HOME\plugins\ANNIE\resources\schema. When I load ANNIE and run the application it does not