nlp

Sequence Labeling Problems in Natural Language Processing

≯℡__Kan透↙ posted on 2020-01-13 20:38:42
  Sequence labeling is one of the most common problem types in natural language processing. Before deep learning took off, it was typically solved with HMMs, maximum entropy models, and CRFs; CRFs in particular were the mainstream approach. With the rise of deep learning, RNNs have achieved excellent results on sequence labeling, and end-to-end training has made these problems even simpler to set up.

  Sequence labeling covers NLP tasks such as word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, and semantic role labeling. As long as we define a suitable label set, any of these tasks can be carried out as sequence labeling.

  Sequence labeling is so common in NLP because the vast majority of NLP problems can be converted into it: many tasks look quite different on the surface, but once converted they all face the same underlying problem. "Sequence labeling" means that, given a one-dimensional linear input sequence X = (x1, x2, ..., xn), we assign each element xi a label yi from the label set, producing the output sequence Y = (y1, y2, ..., yn). In essence, it is the problem of classifying each element of a linear sequence according to its context.

  In general, for NLP tasks the linear sequence is the input text, and each Chinese character can be treated as one element of the sequence. Different tasks attach different meanings to their label sets, but the question is always the same: how to assign each character a suitable label based on its context. Whether the task is word segmentation, part-of-speech tagging, or named entity recognition, the principle is the same.

  Sequence labeling for Chinese word segmentation: let us use the Chinese word segmentation task to illustrate the process. Suppose the input sentence is
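To make this concrete, here is a minimal sketch (my own illustration, not part of the original post; the BMES label set and the example words are assumptions) of how word segmentation becomes per-character labeling, and how predicted labels are decoded back into words:

# Minimal sketch: Chinese word segmentation as per-character BMES sequence labeling.
# B = begin of a multi-character word, M = middle, E = end, S = single-character word.

def words_to_bmes(words):
    """Turn a segmented sentence into (characters, BMES labels)."""
    chars, tags = [], []
    for w in words:
        if len(w) == 1:
            chars.append(w)
            tags.append("S")
        else:
            chars.extend(w)
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return chars, tags

def bmes_to_words(chars, tags):
    """Decode predicted BMES labels back into words."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):   # a word ends here
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words

if __name__ == "__main__":
    chars, tags = words_to_bmes(["我们", "是", "程序员"])   # example words are assumptions
    print(list(zip(chars, tags)))
    print(bmes_to_words(chars, tags))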

Understanding input and labels in word2vec (TensorFlow)

萝らか妹 posted on 2020-01-13 19:07:39
Question: I am trying to properly understand the batch_input and batch_labels from the TensorFlow "Vector Representations of Words" tutorial. For instance, my data 1 1 1 1 1 1 1 1 5 251 371 371 1685 ... ... starts with skip_window = 2 # How many words to consider left and right. num_skips = 1 # How many times to reuse an input to generate a label. Then the generated input array is: batch_input = 1 1 1 1 1 1 5 251 371 .... This makes sense: it starts after the first 2 (= window size) entries and then continues. The
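The following is a simplified sketch of how such a batch could be generated; the function name mirrors generate_batch from the tutorial, but this is my own approximation rather than the tutorial's actual code:

import random

def generate_batch(data, batch_size, num_skips, skip_window):
    """For each center word, emit `num_skips` (input, label) pairs where the
    label is a word sampled from the surrounding window of size skip_window."""
    assert batch_size % num_skips == 0
    batch, labels = [], []
    center = skip_window   # first usable center, hence the batch "starts after 2"
    while len(batch) < batch_size:
        window = list(range(center - skip_window, center + skip_window + 1))
        window.remove(center)
        for ctx in random.sample(window, num_skips):
            batch.append(data[center])   # input word id
            labels.append(data[ctx])     # context word id used as the label
        center += 1
    return batch, labels

if __name__ == "__main__":
    data = [1, 1, 1, 1, 1, 1, 1, 1, 5, 251, 371, 371, 1685]
    b, l = generate_batch(data, batch_size=8, num_skips=1, skip_window=2)
    print(b)   # 1 1 1 1 1 1 5 251, matching the batch_input shown above
    print(l)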

Apache OpenNLP: java.io.FileInputStream cannot be cast to opennlp.tools.util.InputStreamFactory

狂风中的少年 posted on 2020-01-13 13:09:11
Question: I am trying to build a custom NER using Apache OpenNLP 1.7. From the documentation available here, I have developed the following code: import java.io.BufferedOutputStream; import java.io.FileInputStream; import java.io.FileOutputStream; import java.io.IOException; import java.nio.charset.Charset; import opennlp.tools.namefind.NameFinderME; import opennlp.tools.namefind.NameSample; import opennlp.tools.namefind.NameSampleDataStream; import opennlp.tools.namefind.TokenNameFinderFactory; import

Named entity recognition with Java

穿精又带淫゛_ posted on 2020-01-13 10:22:29
Question: I would like to use named entity recognition (NER) to find adequate tags for texts in a database. Instead of using tools like NLTK or Lingpipe I want to build my own tool. So my questions are: Which algorithm should I use? How hard is it to build this tool? Answer 1: I did this some time ago when I studied Markov chains. Anyway, the answers are: Which algorithm should I use? Stanford NLP, for example, uses Conditional Random Fields (CRF). If you are not trying to do this effectively, you are like dude
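As a rough illustration of what a CRF-based NER tagger looks like in practice (this sketch uses Python with sklearn-crfsuite and toy data, both my own choices and not from the thread):

import sklearn_crfsuite

def word_features(sentence, i):
    """Hand-crafted features for the i-th word, including its neighbours."""
    w = sentence[i]
    return {
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),
        "word.isdigit": w.isdigit(),
        "prev.lower": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }

# Toy training data: one sentence with word-level entity labels (BIO scheme).
sentences = [["John", "lives", "in", "Berlin"]]
labels = [["B-PER", "O", "O", "B-LOC"]]

X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))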

Neural Network Stanford parser word2vector format error during training

情到浓时终转凉″ posted on 2020-01-13 05:32:30
Question: I am trying to train a model with the Stanford neural network dependency parser for English. It does not accept a standard word2vec file with 100 dimensions and generates an error message. I am using the word embeddings provided on this web page: https://drive.google.com/file/d/0B8nESzOdPhLsdWF2S1Ayb1RkTXc/view?usp=sharing I have downloaded the data as a text file on my PC. I am using the parameter -embeddingSize 100, but the parser generates an error message: Embedding File /../.../sskip
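Since the error points at the embedding file, one thing worth checking is that every line really contains one token followed by exactly 100 values; below is a small sketch of such a check (the file name is a placeholder, and a plain-text "word v1 ... v100" format is assumed):

EXPECTED_DIM = 100

def check_embedding_file(path, expected_dim=EXPECTED_DIM):
    """Report lines whose number of values differs from expected_dim
    (e.g. a word2vec-style "<vocab_size> <dim>" header line)."""
    bad_lines = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            parts = line.rstrip("\n").split(" ")
            if len(parts) != expected_dim + 1:   # word + expected_dim floats
                bad_lines.append((lineno, len(parts) - 1))
    return bad_lines

if __name__ == "__main__":
    for lineno, dim in check_embedding_file("embeddings.txt"):   # placeholder path
        print("line %d: found %d values, expected %d" % (lineno, dim, EXPECTED_DIM))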

Sentence Segmentation using Spacy

拥有回忆 posted on 2020-01-13 05:17:06
Question: I am new to spaCy and NLP. I am facing the issue below while doing sentence segmentation with spaCy. The text I am trying to split into sentences contains numbered lists (with a space between the numbering and the actual text), like below. import spacy nlp = spacy.load('en_core_web_sm') text = "This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!" text_sentences = nlp(text) for sentence in text_sentences.sents: print(sentence.text) Output (1.,2.,3. are
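One common workaround, sketched below under the assumption of spaCy 2.x (where nlp.add_pipe accepts a plain function), is to force a sentence boundary after every newline token before the parser assigns its own boundaries:

import spacy

def newline_boundaries(doc):
    """Mark the token after each newline as the start of a new sentence."""
    for token in doc[:-1]:
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(newline_boundaries, before='parser')   # must run before the parser

text = "This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!"
for sentence in nlp(text).sents:
    print(repr(sentence.text))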

Spacy - Tokenize quoted string

僤鯓⒐⒋嵵緔 posted on 2020-01-12 20:53:40
Question: I am using spacy 2.0 with a quoted string as input. Example string: "The quoted text 'AA XX' should be tokenized", and I expect to extract [The, quoted, text, 'AA XX', should, be, tokenized]. However, I get some strange results while experimenting: noun chunks and ents lose one of the quotes. import spacy nlp = spacy.load('en') s = "The quoted text 'AA XX' should be tokenized" doc = nlp(s) print([t for t in doc]) print([t for t in doc.noun_chunks]) print([t for t in doc.ents]) Result [The,
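One possible workaround, assuming spaCy 2.1 or later (the question mentions 2.0, and doc.retokenize only exists from 2.1, so this may require an upgrade), is to merge every single-quoted span back into one token after parsing:

import re
import spacy

nlp = spacy.load('en_core_web_sm')   # model name assumed installed; the question uses 'en'
s = "The quoted text 'AA XX' should be tokenized"
doc = nlp(s)

# Find single-quoted substrings in the raw text and merge each into one token.
with doc.retokenize() as retokenizer:
    for match in re.finditer(r"'[^']*'", doc.text):
        span = doc.char_span(match.start(), match.end())
        if span is not None:   # None when the match does not align with token boundaries
            retokenizer.merge(span)

print([t.text for t in doc])   # expected to include "'AA XX'" as a single token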

Conduit: Multiple Stream Consumers

戏子无情 posted on 2020-01-12 17:25:30
Question: I am writing a program which counts the frequencies of n-grams in a corpus. I already have a function that consumes a stream of tokens and produces n-grams of a single order: ngram :: Monad m => Int -> Conduit t m [t] trigrams = ngram 3 countFreq :: (Ord t, Monad m) => Consumer [t] m (Map [t] Int) At the moment I can only connect one stream consumer to a stream source: tokens --- trigrams --- countFreq How do I connect multiple stream consumers to the same stream source? I would like to have