nlp

Splitting a Chinese document into sentences [closed]

Submitted by 99封情书 on 2020-01-01 11:50:32
Question: I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreProcessor; it worked quite well for English but not for Chinese. Can you please point me to any good sentence splitters for Chinese, preferably in Java or Python? Answer 1: Using some regex tricks in Python (cf. a modified regex of
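The quoted answer is cut off above, but a minimal sketch of the regex approach it alludes to might look like the following (assuming the usual full-width Chinese terminators 。！？ plus their ASCII counterparts; extend the character class for your own corpus):

import re

def split_chinese_sentences(text):
    # re.split with a capture group interleaves text chunks and terminators;
    # glue each terminator back onto the chunk that precedes it.
    pieces = re.split(r'([。！？?!]+)', text)
    sentences = [(pieces[i] + pieces[i + 1]).strip()
                 for i in range(0, len(pieces) - 1, 2)]
    tail = pieces[-1].strip()
    if tail:
        sentences.append(tail)
    return [s for s in sentences if s]

print(split_chinese_sentences("你好。今天天气很好！你觉得呢？"))
# ['你好。', '今天天气很好！', '你觉得呢？']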

Kneser-Ney smoothing of trigrams using Python NLTK

Submitted by 百般思念 on 2020-01-01 09:18:29
Question: I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the documentation as a whole is rather sparse. What I'm trying to do is this: I parse a text into a list of trigram tuples, create a FreqDist from that list, and then use that FreqDist to calculate a KN-smoothed distribution. I'm pretty sure, though, that the result is totally wrong: when I sum up the individual probabilities I get something way beyond 1. Take this code example:
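The excerpt stops before the code, but a minimal sketch of the pipeline described, using NLTK's KneserNeyProbDist (an assumption about which class the poster meant), would be:

from nltk.probability import FreqDist, KneserNeyProbDist
from nltk.util import trigrams

tokens = "this is an example sentence and this is another example sentence".split()
tri_freq = FreqDist(trigrams(tokens))
kn = KneserNeyProbDist(tri_freq)

# Each prob() is (roughly) the probability of the third word given the first
# two, so summing over *all* observed trigrams can exceed 1; to sanity-check,
# sum only over trigrams that share the same two-word context.
for tg in tri_freq:
    print(tg, kn.prob(tg))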

Strategy for parsing natural language descriptions into structured data

Submitted by 穿精又带淫゛_ on 2020-01-01 08:19:10
Question: I have a set of requirements and I'm looking for the best Java-based strategy / algorithm / software to use. Basically, I want to take a set of recipe ingredients entered by real people in natural English and parse the metadata out into a structured format (see the requirements below for what I'm trying to do). I've looked around here and elsewhere, but have found nothing that gives high-level advice on what direction to follow. So, I'll put it to the smart people :-): What's the best /
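The question asks for Java, but as a toy illustration of one possible direction (a pattern-based first pass before reaching for a full NLP pipeline), here is a sketch in Python; the unit list and the "quantity unit name, note" grammar are assumptions for illustration, not a complete solution:

import re

UNITS = r"(?:cups?|tablespoons?|tbsp|teaspoons?|tsp|grams?|g|ounces?|oz|pounds?|lbs?)"
PATTERN = re.compile(
    rf"^\s*(?P<qty>\d+(?:[./]\d+)?)?\s*(?P<unit>{UNITS})?\s*(?P<name>[^,]+?)(?:,\s*(?P<note>.+))?$",
    re.IGNORECASE,
)

def parse_ingredient(line):
    # Returns a dict of quantity / unit / ingredient / trailing note, or the raw line.
    m = PATTERN.match(line)
    return m.groupdict() if m else {"raw": line}

print(parse_ingredient("2 cups all-purpose flour, sifted"))
# {'qty': '2', 'unit': 'cups', 'name': 'all-purpose flour', 'note': 'sifted'}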

How do I classify a word of a text into things like names, numbers, money, dates, etc.?

Submitted by 人走茶凉 on 2020-01-01 07:30:52
Question: I asked some questions about text mining a week ago; I was a bit confused then, but now I know what I want to do. The situation: I have a lot of downloaded pages with HTML content. Some of them might be text from a blog, for example. They are not structured and come from different sites. What I want to do: I will split all the words on whitespace, and I want to classify each one, or a group of them, into some pre-defined items like names, numbers, phone, email, url, date, money,
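The snippet cuts off mid-list, but a minimal, purely illustrative sketch of a rule-based first pass (before reaching for a trained NER model such as Stanford NER or spaCy) could look like this; every pattern below is an assumption for illustration:

import re

PATTERNS = [
    ("email",  re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")),
    ("url",    re.compile(r"^https?://\S+$")),
    ("money",  re.compile(r"^[$€£]\d+(?:[.,]\d+)?$")),
    ("date",   re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("phone",  re.compile(r"^\+?\d[\d ()-]{6,}\d$")),
    ("number", re.compile(r"^\d+(?:[.,]\d+)?$")),
]

def classify(token):
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    # Capitalised tokens are only *guessed* to be names; everything else is a plain word.
    return "name" if token[:1].isupper() else "word"

for tok in "Contact John at john@example.com or 555-0100 on 12/05/2020".split():
    print(tok, "->", classify(tok))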

True definition of an English word?

Submitted by 我是研究僧i on 2020-01-01 06:34:21
Question: What would be the best definition of an English word? What cases of an English word are there beyond just \w+ ? Some may include \w+-\w+ or \w+'\w+ ; some may exclude cases like \b[0-9]+\b . But I haven't seen any general consensus on those cases. Do we have a formal definition of such? Can any of you clarify? (Edit: broadened the question so it doesn't depend on regexps only.) Answer 1: I really don't think a regex is going to help you here; the problem with English (or any language, for that matter
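The answer above argues against relying on a regex, but for comparison here is one possible working definition written as a pattern (an illustration, not a consensus): letters optionally joined by internal hyphens or apostrophes, with digits and abbreviations deliberately left as policy decisions:

import re

# Covers "mother-in-law", "don't", "O'Brien"; splits "U.S." and drops "2nd".
WORD = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")

text = "Don't over-think it; the U.S. isn't a one-size-fits-all case."
print(WORD.findall(text))
# ["Don't", 'over-think', 'it', 'the', 'U', 'S', "isn't", 'a', 'one-size-fits-all', 'case']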

Horizontal Markovization

Submitted by 一笑奈何 on 2020-01-01 05:40:18
Question: I have to implement horizontal Markovization (an NLP concept) and I'm having a little trouble understanding what the trees will look like. I've been reading the Klein and Manning paper, but they don't explain what trees with horizontal Markovization of order 2 or order 3 look like. Could someone shed some light on the algorithm and on what the trees are supposed to look like? I'm relatively new to NLP. Answer 1: So, let's say you have a bunch of flat rules like: NP -> NNP NNP NNP NNP or VP -> V Det
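The answer is cut off, but the binarized shape it describes can be inspected directly with NLTK's tree transforms; this is a sketch, assuming a flat NP like the one in the answer's example rule:

from nltk import Tree

t = Tree.fromstring("(NP (NNP Natural) (NNP Language) (NNP Processing) (NNP Course))")

# Binarize with horizontal Markovization of order 2: each new intermediate
# node (labels such as "NP|<NNP-NNP>") remembers at most the last two sisters,
# so long flat rules that share local context collapse onto the same symbols.
t.chomsky_normal_form(horzMarkov=2)
t.pretty_print()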

How to extract character ngrams from sentences? - Python

Submitted by 百般思念 on 2020-01-01 05:39:10
Question: The following word2ngrams function extracts character 3-grams from a word: >>> x = 'foobar' >>> n = 3 >>> [x[i:i+n] for i in range(len(x)-n+1)] ['foo', 'oob', 'oba', 'bar'] This post shows character ngram extraction for a single word: Quick implementation of character n-grams using python. But what if I have sentences and I want to extract the character ngrams, is there a faster method than iteratively calling word2ngrams()? What would be the regex version of achieving the same
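One way to avoid calling the word-level helper in a loop is a single comprehension over the whole sentence; a small sketch (the within_words flag is just an illustrative name):

def sent2ngrams(sentence, n=3, within_words=True):
    # within_words=True reproduces per-word character n-grams in one pass;
    # False slides the window across the raw string, spaces included.
    if within_words:
        return [w[i:i + n] for w in sentence.split() for i in range(len(w) - n + 1)]
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

print(sent2ngrams("foo bar"))                       # ['foo', 'bar']
print(sent2ngrams("foo bar", within_words=False))   # ['foo', 'oo ', 'o b', ' ba', 'bar']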

FreqDist in NLTK not sorting output

Submitted by *爱你&永不变心* on 2020-01-01 04:40:08
Question: I'm new to Python and I'm trying to teach myself language processing. NLTK in Python has a function called FreqDist that gives the frequency of words in a text, but for some reason it's not working properly. This is what the tutorial has me write: fdist1 = FreqDist(text1) vocabulary1 = fdist1.keys() vocabulary1[:50] So basically it's supposed to give me a list of the 50 most frequent words in the text. When I run the code, though, the result is the 50 least frequent words in order of least
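The usual explanation is that in NLTK 3.x FreqDist is a collections.Counter subclass, so .keys() is not sorted by frequency the way the (older) tutorial assumes; most_common() gives the frequency-ordered list. A minimal sketch, using a toy token list in place of the book's text1:

from nltk.probability import FreqDist

tokens = "the cat sat on the mat and the cat slept".split()
fdist1 = FreqDist(tokens)

print(fdist1.most_common(2))          # [('the', 3), ('cat', 2)]
vocabulary1 = [word for word, count in fdist1.most_common(50)]
print(vocabulary1)                    # up to 50 words, most frequent first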

Encoding for Multilingual .py Files

Submitted by 本小妞迷上赌 on 2020-01-01 04:31:07
Question: I am writing a .py file that contains strings from multiple character sets, including English, Spanish, and Russian. For example, I have something like: string_en = "The quick brown fox jumped over the lazy dog." string_es = "El veloz murciélago hindú comía feliz cardillo y kiwi." string_ru = "В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!" I am having trouble figuring out how to encode my file to avoid generating syntax errors like the one below when it is run: SyntaxError: Non
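A common fix (for Python 2, which assumes ASCII source unless told otherwise) is a PEP 263 coding declaration at the top of the file, with the file actually saved as UTF-8 in the editor; Python 3 already defaults to UTF-8 source. A minimal sketch:

# -*- coding: utf-8 -*-
# The declaration above must be on the first or second line of the .py file.
# Under Python 2 you would additionally want u"..." literals for real Unicode strings.

string_en = "The quick brown fox jumped over the lazy dog."
string_es = "El veloz murciélago hindú comía feliz cardillo y kiwi."
string_ru = "В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!"

print(string_es)
print(string_ru)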

How to use vector representations of words (as obtained from Word2Vec, etc.) as features for a classifier?

Submitted by 我们两清 on 2020-01-01 04:11:45
Question: I am familiar with using BOW features for text classification, wherein we first find the size of the vocabulary for the corpus, which becomes the size of our feature vector. For each sentence/document, and for all its constituent words, we then put 0/1 depending on the absence/presence of that word in that sentence/document. However, now that I am trying to use a vector representation of each word, is creating a global vocabulary essential? Answer 1: Suppose the size of the vectors is N (usually
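The answer is cut off, but one common way to get a fixed-length feature vector without a global vocabulary is to average the word vectors of each document; a minimal sketch with toy embeddings (any word-to-vector mapping, e.g. a loaded Word2Vec model, would do):

import numpy as np

def document_vector(tokens, word_vectors, dim):
    # Average the vectors of the words we actually have embeddings for;
    # out-of-vocabulary words are simply skipped.
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy 3-dimensional embeddings, purely illustrative.
word_vectors = {"good": np.array([0.9, 0.1, 0.0]),
                "movie": np.array([0.2, 0.8, 0.3])}
features = document_vector("a good movie".split(), word_vectors, dim=3)
print(features)   # one fixed-length feature vector per document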