NLP

Converting natural language to a math equation

我只是一个虾纸丫 Submitted 2019-12-21 19:24:51

Question: I've got a home automation system working in Java, and I want to add simple math capabilities such as addition, subtraction, multiplication, division, roots, and powers. In its current state, the system can convert a phrase into tags, as shown in the following examples:

Example 1
Phrase: "what is one hundred twenty two to the power of seven"
Tagged: {QUESTION/math} {NUMBER/122} {MATH/pwr} {NUMBER/7}

Example 2
Phrase: "twenty seven plus pi 3 squared"
Tagged: {NUMBER/27} {MATH/add} {NUMBER/3.14159
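A tag stream like the one above can be evaluated with a small dispatch table. The following is a minimal sketch in Python rather than the asker's Java; the tag names (NUMBER, MATH, add, pwr, ...) come from the examples in the question, but the left-to-right evaluation strategy and the `evaluate` helper are assumptions for illustration.

```python
# Evaluate a tagged token stream such as
# {NUMBER/122} {MATH/pwr} {NUMBER/7} with a left-to-right fold.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
    "pwr": lambda a, b: a ** b,
}

def evaluate(tokens):
    """tokens: list of (tag, value) pairs, e.g. [("NUMBER", "122"), ("MATH", "pwr"), ("NUMBER", "7")]."""
    result = None    # running value
    pending = None   # operator waiting for its right operand
    for tag, value in tokens:
        if tag == "NUMBER":
            num = float(value)
            result = num if result is None else OPS[pending](result, num)
        elif tag == "MATH":
            pending = value
    return result

print(evaluate([("NUMBER", "27"), ("MATH", "add"), ("NUMBER", "3")]))  # 30.0
```

Left-to-right evaluation ignores operator precedence, which matches how the tagged phrases read aloud; a precedence-aware parser would need an explicit grammar.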

NLTK word_tokenize on French text is not working properly

…衆ロ難τιáo~ Submitted 2019-12-21 17:38:43

Question: I'm trying to use NLTK word_tokenize on a text in French:

txt = ["Le télétravail n'aura pas d'effet sur ma vie"]
print(word_tokenize(txt, language='french'))

It should print: ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie', '.'] But I get: ['Le', 'télétravail', "n'aura", 'pas', "d'effet", 'sur', 'ma', 'vie', '.'] Does anyone know why it isn't splitting tokens properly for French, and how to overcome this (and other potential issues) when doing NLP in French?

Answer 1:
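One workaround for the behavior described (this is an assumption for illustration, not part of NLTK's tokenizer) is a small post-tokenization pass that splits elided French clitics like "n'aura" into "n'" + "aura":

```python
import re

# Split French elided clitics after tokenization:
# "n'aura" -> "n'", "aura"; "d'effet" -> "d'", "effet".
# The clitic list below is illustrative, not exhaustive, and will
# over-split apostrophes inside genuine single tokens.
CLITIC = re.compile(r"^([cdjlmnst]'|qu'|jusqu'|lorsqu')(.+)$", re.IGNORECASE)

def split_clitics(tokens):
    out = []
    for tok in tokens:
        m = CLITIC.match(tok)
        if m:
            out.extend([m.group(1), m.group(2)])
        else:
            out.append(tok)
    return out

tokens = ['Le', 'télétravail', "n'aura", 'pas', "d'effet", 'sur', 'ma', 'vie', '.']
print(split_clitics(tokens))
```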

Bigram to a vector

血红的双手。 Submitted 2019-12-21 12:28:10

Question: I want to construct word embeddings for documents using the word2vec tool. I know how to find the vector embedding of a single word (unigram). Now I want to find a vector for a bigram. Is it possible to do this with word2vec? If yes, how?

Answer 1: The following snippet will get you the vector representation of a bigram. Note that the bigram you want to convert to a vector needs to have an underscore instead of a space between the words, e.g. bigram2vec(unigrams, "this report") is wrong; it
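The underscore convention mentioned in the answer reduces the lookup to a key rewrite once phrase detection (e.g. gensim's Phrases) has put bigrams into the model's vocabulary. A toy sketch, with a plain dict standing in for a trained model and made-up vectors (the `bigram2vec` helper is named after the answer's example, but its body here is an assumption):

```python
# A dict stands in for a trained word2vec model whose vocabulary
# already contains underscore-joined bigrams, as produced by phrase
# detection such as gensim's Phrases. Vectors are made up.
model = {
    "this": [0.1, 0.2],
    "report": [0.3, 0.4],
    "this_report": [0.25, 0.31],
}

def bigram2vec(model, bigram):
    """Look up a bigram like 'this report' under its underscore-joined key."""
    key = bigram.replace(" ", "_")
    return model.get(key)  # None if the bigram was never learned

print(bigram2vec(model, "this report"))  # [0.25, 0.31]
```

If the bigram never co-occurred often enough to be joined during phrase detection, it simply is not in the vocabulary, so the lookup returns nothing rather than a vector.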

Embedding 3D data in Pytorch

时光总嘲笑我的痴心妄想 Submitted 2019-12-21 12:18:50

Question: I want to implement character-level embedding. This is the usual word embedding:

Word Embedding
Input: [ ['who', 'is', 'this'] ] -> [ [3, 8, 2] ]  # (batch_size, sentence_len)
-> Embedding(Input)  # (batch_size, seq_len, embedding_dim)

This is what I want to do:

Character Embedding
Input: [ [ ['w', 'h', 'o', 0], ['i', 's', 0, 0], ['t', 'h', 'i', 's'] ] ]
-> [ [ [2, 3, 9, 0], [11, 4, 0, 0], [21, 10, 8, 9] ] ]  # (batch_size, sentence_len, word_len)
-> Embedding(Input)  # (batch_size, sentence
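PyTorch's nn.Embedding accepts an index tensor of any rank and appends an embedding_dim axis, so the 3D case above works as-is. The shape behavior can be sketched framework-free with NumPy fancy indexing (index values and shapes from the question; the vocabulary size and embedding width are assumptions):

```python
import numpy as np

# Shape sketch of character-level embedding: an index array of shape
# (batch_size, sentence_len, word_len) gathers rows from an embedding
# table, giving (batch_size, sentence_len, word_len, embedding_dim).
# nn.Embedding applied to a LongTensor of the same shape behaves alike.
vocab_size, embedding_dim = 30, 5          # assumed sizes
table = np.random.randn(vocab_size, embedding_dim)

chars = np.array([[[2, 3, 9, 0],
                   [11, 4, 0, 0],
                   [21, 10, 8, 9]]])       # (1, 3, 4), from the question

embedded = table[chars]                    # fancy indexing = row gather
print(embedded.shape)                      # (1, 3, 4, 5)
```

Downstream, the per-word character vectors are typically pooled (max/CNN/LSTM over the word_len axis) to get one vector per word.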

NLTK named entity recognition in dutch

a 夏天 Submitted 2019-12-21 10:15:17

Question: I am trying to extract named entities from Dutch text. I used nltk-trainer to train a tagger and a chunker on the CoNLL-2002 Dutch corpus. However, the chunker's parse method is not detecting any named entities. Here is my code:

str = 'Christiane heeft een lam.'
tagger = nltk.data.load('taggers/dutch.pickle')
chunker = nltk.data.load('chunkers/dutch.pickle')
str_tags = tagger.tag(nltk.word_tokenize(str))
print str_tags
str_chunks = chunker.parse(str_tags)
print str_chunks

And the output

Natural Language Processing Toolkit for .NET [closed]

本秂侑毒 Submitted 2019-12-21 07:59:12

Question: (Closed as off-topic for Stack Overflow 6 years ago; not accepting answers.) Can you give me some toolkits and libraries for natural language processing in .NET? Are there tools like UIMA for .NET?

Answer 1: There is SharpNLP ....

Source: https://stackoverflow.com/questions/6136436/natural-language-processing-toolkit-for-net

Semantic analysis of text

匆匆过客 Submitted 2019-12-21 06:39:21

Question: Which tools would you recommend for semantic analysis of text? Here is my problem: I have a corpus of words (keywords, tags). I need to process sentences input by users and find whether they are semantically close to words in my corpus. Any suggestions (books or actual toolkits/APIs) are very welcome.

Answer 1: Some useful links to begin with:
http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html
http://kmandcomputing.blogspot.com/2008/06
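Whatever toolkit is chosen, the usual reduction of "is this sentence semantically close to my keywords?" is: embed the input words and each corpus keyword as vectors, then threshold a cosine similarity. A framework-free sketch with hand-made 3-dimensional toy vectors (a real system would use word2vec/GloVe vectors or a sentence encoder; every number here is invented):

```python
import math

# Toy corpus: keyword -> hand-made 3-dim vector (stand-ins for real
# word embeddings).
corpus = {
    "music": [0.9, 0.1, 0.0],
    "sports": [0.0, 0.8, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def close_keywords(vec, threshold=0.7):
    """Corpus keywords whose vectors are within the cosine threshold."""
    return [k for k, v in corpus.items() if cosine(vec, v) >= threshold]

print(close_keywords([0.85, 0.15, 0.0]))  # ['music']
```

The threshold is the main knob: too low and everything matches, too high and only near-synonyms do; it is usually tuned on a few labeled examples.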

Stanford NER Features

a 夏天 Submitted 2019-12-21 06:17:15

Question: I am currently trying to use the Stanford NER system, and I am trying to see which features can be extracted by setting flags in a properties file. The features documented at http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERFeatureFactory.html do not seem comprehensive. For example, the feature flags related to distributional similarity and clustering are not included (e.g. useDistSim, etc.). Is there a more complete list of all the features and corresponding
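For concreteness, this is roughly what a training properties file with the distributional-similarity flags looks like. useDistSim and distSimLexicon are real NERFeatureFactory flags, as are the common feature flags below, but the file paths are placeholders and the exact set of flags worth enabling depends on the task:

```properties
# Sketch of a Stanford NER training properties file (paths are placeholders).
trainFile = train.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1

# Distributional-similarity cluster features (the flags the question asks about)
useDistSim = true
distSimLexicon = /path/to/distsim/clusters.txt

# Common orthographic/context features
useClassFeature = true
useWord = true
useNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
```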

text classification with SciKit-learn and a large dataset

£可爱£侵袭症+ Submitted 2019-12-21 05:56:14

Question: First of all, I started with Python yesterday. I'm trying to do text classification with scikit-learn and a large dataset (250,000 tweets). For the algorithm, every tweet will be represented as a 4000 x 1 vector, so the input is 250,000 rows and 4000 columns. When I try to construct this in Python, I run out of memory after 8500 tweets (when working with a list and appending to it), and when I preallocate the memory I just get the error MemoryError (np.zeros(4000,2500000)). Is SciKit not
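The standard fix for this memory wall is a sparse matrix: a dense 250,000 x 4,000 float64 array needs about 8 GB, while a sparse one stores only the nonzero entries, and scikit-learn estimators accept scipy.sparse input directly (CountVectorizer/TfidfVectorizer produce it natively). A sketch with the shape from the question and a few dummy nonzeros:

```python
import numpy as np
from scipy import sparse

# Build a 250,000 x 4,000 sparse matrix from (value, (row, col))
# triplets. Only the nonzeros are stored, so memory scales with the
# number of filled cells, not with rows * columns.
n_tweets, n_features = 250_000, 4_000
rows = np.array([0, 1, 2])          # dummy nonzeros: tweet index
cols = np.array([10, 20, 30])       # feature index
vals = np.array([1.0, 3.0, 2.0])

X = sparse.csr_matrix((vals, (rows, cols)), shape=(n_tweets, n_features))
print(X.shape, X.nnz)               # (250000, 4000) 3
```

In practice one would skip the manual construction entirely and let a scikit-learn vectorizer emit the sparse matrix from the raw tweets.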