nlp

Training and evaluating a spaCy model by sentences or paragraphs

柔情痞子 submitted on 2019-12-11 06:15:38
Question: Observation: Paragraph: "I love apple. I eat one banana a day". Sentences: "I love apple.", "I eat one banana a day". There are two sentences in this paragraph, "I love apple" and "I eat one banana a day". If I put the whole paragraph into spaCy, it recognizes only one entity, for example apple, but if I feed the sentences of the paragraph in one by one, spaCy can recognize two entities, apple and banana. (This is just an example to illustrate the point; the actual recognition result could be different.)
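A minimal way to reproduce this observation is to run the same pipeline once over the whole paragraph and once per sentence; the sketch below assumes the standard en_core_web_sm model is installed and uses the toy paragraph from the question.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this pretrained pipeline is installed
paragraph = "I love apple. I eat one banana a day."

# Entities found when the whole paragraph is processed at once
doc = nlp(paragraph)
print("paragraph:", [(ent.text, ent.label_) for ent in doc.ents])

# Entities found when each sentence is re-processed on its own
for sent in doc.sents:
    sent_doc = nlp(sent.text)
    print("sentence :", [(ent.text, ent.label_) for ent in sent_doc.ents])
```

Comparing the two printouts shows which entities are only picked up when the sentences are fed in separately.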

TfidfVectorizer process shows an error

白昼怎懂夜的黑 submitted on 2019-12-11 06:15:00
Question: I am working on non-English corpus analysis but am facing several problems. One of them involves tfidf_vectorizer. After importing the relevant libraries, I ran the following code to get results: contents = [open("D:\test.txt", encoding='utf8').read()] #define vectorizer parameters tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2, stop_words=stopwords, use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(3,3)) %time tfidf_matrix = tfidf_vectorizer.fit_transform
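The excerpt cuts off before the actual error, so the following is only a self-contained sketch of the same call; the stopwords list and the tokenize_and_stem tokenizer are stand-ins for the question's own definitions, and two toy documents replace the single file. One common failure mode with a one-document corpus is that max_df=0.8 prunes every term (each term appears in 100% of the documents), leaving an empty vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for the question's stopword list and custom tokenizer
stopwords = ["the", "a", "an"]
def tokenize_and_stem(text):
    return text.lower().split()   # naive tokenizer; the original also stems

# At least two documents, so the min_df/max_df proportions can be satisfied
contents = ["I love apple and banana", "I eat one banana a day"]

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2,
                                   stop_words=stopwords, use_idf=True,
                                   tokenizer=tokenize_and_stem, ngram_range=(3, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(contents)
print(tfidf_matrix.shape)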

Gensim Phrases usage to filter n-grams

久未见 submitted on 2019-12-11 06:07:25
Question: I am using Gensim Phrases to identify important n-grams in my text as follows: bigram = Phrases(documents, min_count=5) trigram = Phrases(bigram[documents], min_count=5) for sent in documents: bigrams_ = bigram[sent] trigrams_ = trigram[bigram[sent]] However, this also detects uninteresting n-grams such as "special issue", "important matter", "high risk", etc. I am particularly interested in detecting concepts in the text such as "machine learning", "human computer interaction", etc. Is there a way to
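The question is cut off, but one typical approach (a sketch, not from the original post) is to use a stricter scoring function together with a post-filter on the detected phrases. The generic-word set and toy documents below are assumptions for illustration, and the snippet assumes the gensim 4.x API.

```python
from gensim.models.phrases import Phrases, Phraser

documents = [["machine", "learning", "is", "fun"],
             ["human", "computer", "interaction", "is", "hard"],
             ["this", "special", "issue", "covers", "machine", "learning"],
             ["another", "special", "issue", "on", "human", "computer", "interaction"]]

# npmi scoring bounds phrase scores to [-1, 1], so a high threshold keeps
# only strongly associated word pairs.
bigram = Phrases(documents, min_count=2, threshold=0.8, scoring="npmi")
bigram_phraser = Phraser(bigram)

# Post-filter: drop detected phrases that contain generic words.
generic = {"special", "important", "issue", "matter", "high", "risk"}
for sent in documents:
    phrases = [tok for tok in bigram_phraser[sent]
               if "_" in tok and not set(tok.split("_")) & generic]
    print(phrases)
```

An alternative is to keep only phrases whose parts appear in a whitelist of domain terms instead of filtering against a blacklist of generic ones.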

spaCy rule matcher: extract a value from a matched sentence

点点圈 submitted on 2019-12-11 05:37:02
Question: I have a custom rule matcher in spaCy, and I am able to match some sentences in a document. I would now like to extract some numbers from the matched sentences. However, the matched sentences do not always have the same shape and form. What is the best way to do this? # case 1: texts = ["the surface is 31 sq", "the surface is sq 31" ,"the surface is square meters 31" ,"the surface is 31 square meters" ,"the surface is about 31,2 square" ,"the surface is 31 kilograms"] pattern = [ {
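The original pattern is truncated above, so the pattern here is only a hypothetical stand-in; the sketch (using spaCy v3's Matcher API) matches "surface ... <number>" and then pulls the number-like tokens out of each matched span.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Hypothetical pattern: "surface", any tokens in between, then a number-like token
pattern = [{"LOWER": "surface"}, {"OP": "*"}, {"LIKE_NUM": True}]
matcher.add("SURFACE_VALUE", [pattern])

texts = ["the surface is 31 sq", "the surface is 31 square meters",
         "the surface is about 31,2 square"]
for text in texts:
    doc = nlp(text)
    for _, start, end in matcher(doc):
        span = doc[start:end]
        numbers = [tok.text for tok in span if tok.like_num]
        print(text, "->", numbers)
```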

Looking for a way to optimize this algorithm for parsing a very large string

岁酱吖の submitted on 2019-12-11 05:26:24
Question: The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Each tuple can then be assigned a probability based on a calculation. I am using this as part of a Monte Carlo / genetic algorithm to train the program to recognize a language based on syntax alone (just the character transitions). I am wondering if there is a faster way of doing this. It takes about 400 ms to look up the probability of
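The class itself is not shown in the excerpt, but the usual way to make repeated probability lookups cheap is to count every consecutive 4-character substring once and store the normalised counts in a hash map, so each later lookup is a single dictionary access. A small Python sketch of that idea (the toy text and names are placeholders):

```python
from collections import Counter

def char_ngram_probs(text, n=4):
    # Count all consecutive n-character substrings, then normalise to probabilities
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

novel = "this is a tiny stand-in for an entire novel of text"
probs = char_ngram_probs(novel)
print(probs.get("this", 0.0))   # probability of the 4-gram "this"
```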

How to parse texts separated by line breaks

雨燕双飞 submitted on 2019-12-11 05:26:09
Question: How can I parse tokens separated by line breaks, such as the ones below: Wolff PERSON is O in O Argentina LOCATION The O US LOCATION Envoy O noted O into full sentences like this, using Python? Wolff is in Argentina The US Envoy noted Answer 1: You can use itertools.groupby for this: >>> from StringIO import StringIO >>> from itertools import groupby >>> s = '''Wolff PERSON is O in O Argentina LOCATION The O US LOCATION Envoy O noted O''' >>> c = StringIO(s) >>> for k, g in groupby(c, key=str.isspace)
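The answer is cut off above; a complete Python 3 variant of the same groupby idea (splitting on blank lines and keeping only the first column of each line) might look like this, as a sketch rather than the answerer's full code:

```python
from itertools import groupby

s = """Wolff PERSON
is O
in O
Argentina LOCATION

The O
US LOCATION
Envoy O
noted O"""

sentences = []
# Group lines by whether they are blank; each non-blank run is one sentence.
for is_blank, group in groupby(s.splitlines(), key=lambda line: not line.strip()):
    if not is_blank:
        sentences.append(" ".join(line.split()[0] for line in group))

print(sentences)   # ['Wolff is in Argentina', 'The US Envoy noted']
```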

How to create a seq2seq without specifying a fixed decoder length?

情到浓时终转凉″ submitted on 2019-12-11 05:09:43
Question: Based on the model presented in this answer: def create_seq2seq(features_num, latent_dim, decoder_length): ## encoder_inputs = Input(shape=(None, features_num)) encoded = LSTM(latent_dim, return_state=False, return_sequences=True)(encoder_inputs) encoded = LSTM(latent_dim, return_state=False, return_sequences=True)(encoded) encoded = LSTM(latent_dim, return_state=False, return_sequences=True)(encoded) encoded = LSTM(latent_dim, return_state=True)(encoded) encoder = Model(input=encoder_inputs,
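The question is truncated, but the usual way to avoid a fixed decoder length is to train with teacher forcing and then decode step by step at inference time, feeding each prediction back in until a stop condition or a hard cap is reached. Below is a minimal, untrained tf.keras sketch of that inference pattern; the layer sizes, the "start" frame, and the stop condition are placeholders, not taken from the linked answer.

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

features_num, latent_dim = 5, 16

# Encoder: consume the source sequence and keep only the final LSTM states.
enc_in = Input(shape=(None, features_num))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_in)
encoder = Model(enc_in, [state_h, state_c])

# Decoder: one step at a time, taking the previous output and states as input.
dec_in = Input(shape=(1, features_num))
dec_h, dec_c = Input(shape=(latent_dim,)), Input(shape=(latent_dim,))
dec_out, out_h, out_c = LSTM(latent_dim, return_state=True)(dec_in, initial_state=[dec_h, dec_c])
decoder = Model([dec_in, dec_h, dec_c], [Dense(features_num)(dec_out), out_h, out_c])

# Inference loop: no fixed decoder length, only an upper bound and a stop test.
source = np.random.rand(1, 7, features_num).astype("float32")
h, c = encoder.predict(source, verbose=0)
step = np.zeros((1, 1, features_num), dtype="float32")   # placeholder "start" frame
outputs = []
for _ in range(50):                                       # hard upper bound on length
    y, h, c = decoder.predict([step, h, c], verbose=0)
    outputs.append(y)
    if np.abs(y).max() < 1e-3:                            # toy stop condition
        break
    step = y.reshape(1, 1, features_num)
```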

Document similarity in spaCy vs Word2Vec

荒凉一梦 submitted on 2019-12-11 05:05:23
Question: I have a niche corpus of ~12k docs, and I want to detect near-duplicate documents with similar meanings across it - think articles about the same event covered by different news organisations. I have tried gensim's Word2Vec, which gives me terrible similarity scores (<0.3) even when the test document is within the corpus, and I have tried spaCy, which gives me >5k documents with similarity > 0.9. I tested spaCy's most similar documents, and they were mostly useless. This is the relevant code. tfidf =
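The code in the excerpt stops at "tfidf =", but a TF-IDF plus cosine-similarity baseline is a common, cheap option for this kind of near-duplicate detection, and its scores are usually easier to threshold than averaged word-vector similarities. A small sketch with made-up example documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents standing in for the ~12k-article corpus
docs = ["Fire destroys a warehouse in Leeds",
        "Warehouse fire in Leeds under control",
        "New species of beetle discovered in Peru"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
sims = cosine_similarity(tfidf)
print(sims.round(2))   # high value only for the two warehouse-fire articles
```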

What will be the CNF form of this probabilistic grammar?

删除回忆录丶 submitted on 2019-12-11 04:52:38
Question: If the PCFG is: NP -> ADJ N [0.6] NP -> N [0.4] N -> cat [0.2] N -> dog [0.8] What will the CNF form be? Will it be the following: NP -> ADJ NP [0.6] NP -> cat [0.08] NP -> dog [0.32] or something else? Answer 1: NP -> ADJ NP [0.6] NP -> cat [0.08] NP -> dog [0.32] Your answer is correct, because applying the original rules and the converted CNF rules must yield the same probability for the same result. Source: https://stackoverflow.com/questions/39769119/what-will-be-cnf-form-of-this
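For intuition (not part of the original answer): the unit production NP -> N [0.4] is eliminated by folding it into the lexical rules, so P(NP -> cat) = 0.4 × 0.2 = 0.08 and P(NP -> dog) = 0.4 × 0.8 = 0.32, which is exactly where the probabilities above come from; the probability mass of the NP rules still sums to 0.6 + 0.08 + 0.32 = 1.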

Note-taking program with NLTK and WordNet doesn't work; the error message says it's because of WordNet

*爱你&永不变心* submitted on 2019-12-11 04:47:24
Question: I am trying to make a program in Python that will take notes on a passage that I input. It will pick out the first and last sentences of the paragraph and the sentences with dates and numbers. It would then replace some words with synonyms and get rid of useless adjectives. I know the generic stuff with Python, but I am new to NLTK and WordNet. I've started a prototype program that replaces words in a sentence with random synonyms, however I keep getting an error that says
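The error message itself is missing from the excerpt, so the snippet below is only a guess at the shape of such a prototype; note that WordNet-related LookupErrors in NLTK are most often caused by the corpus not having been downloaded yet, which is why the sketch downloads it explicitly.

```python
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)   # WordNet data must be present, or lookups fail

def random_synonym(word):
    # Collect all WordNet lemma names for the word, excluding the word itself
    lemmas = {l.name().replace("_", " ")
              for syn in wordnet.synsets(word)
              for l in syn.lemmas()
              if l.name().lower() != word.lower()}
    return random.choice(sorted(lemmas)) if lemmas else word

sentence = "The quick brown fox jumps over the lazy dog"
print(" ".join(random_synonym(w) for w in sentence.split()))
```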