nlp

How to handle <UKN> tokens in text generation

北城余情 submitted on 2019-12-14 03:28:12

Question: In my text generation dataset, I have converted all infrequent words into the <ukn> token (unknown word), as suggested by most of the text-generation literature. However, when training an RNN to take in part of a sentence as input and predict the rest, I am not sure how I should stop the network from generating <ukn> tokens. When the network encounters an unknown (infrequent) word in the training set, what should its output be?

Example sentence: I went to the mall and bought a <ukn> and some
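A common answer is to keep <ukn> as a training target (so the loss stays well defined) but forbid it at generation time by masking its logit before sampling. A minimal stdlib-only sketch of that masking step — the function name and the plain-list logits are illustrative stand-ins for your RNN's output layer:

```python
import math
import random

def sample_without_unk(logits, vocab, unk_token="<ukn>", temperature=1.0):
    """Sample the next token while masking out the unknown token.

    `logits` is a list of raw scores aligned with `vocab`; setting the
    <ukn> logit to -inf gives it exactly zero probability after softmax,
    so it can never be emitted.
    """
    masked = [
        -math.inf if tok == unk_token else logit / temperature
        for tok, logit in zip(vocab, logits)
    ]
    peak = max(l for l in masked if l != -math.inf)  # for numerical stability
    exps = [math.exp(l - peak) if l != -math.inf else 0.0 for l in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(vocab, weights=probs, k=1)[0]
```

During training, when the input contains an infrequent word, you still feed and predict <ukn> as usual; the mask only applies when decoding.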

Difference between padding integers and strings in Keras

自古美人都是妖i submitted on 2019-12-14 02:36:33

Question: I'm trying to pad text for a seq2seq model.

    from keras_preprocessing.sequence import pad_sequences
    x = [["Hello, I'm Bhaskar", "This is Keras"], ["This is an", "experiment"]]
    pad_sequences(sequences=x, maxlen=5, dtype='object', padding='pre', value="<PAD>")

I encounter the following error:

    ValueError: `dtype` object is not compatible with `value`'s type: <class 'str'>
    You should set `dtype=object` for variable length strings.

However, when I try the same with integers it works fine. x=[[1, 2, 3],
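The error message itself suggests passing the Python type `object` (unquoted) rather than the string `'object'`, which is worth trying first. And since pre/post padding of ragged string lists is simple, it can also be done without the library at all — a dependency-free sketch (function name is mine; keras-style 'pre' truncation, i.e. keeping the tail, is assumed):

```python
def pad_sequences_py(sequences, maxlen, padding="pre", value="<PAD>"):
    """Pad (or truncate) each sequence of strings to exactly `maxlen`.

    Mirrors the shape of the pad_sequences call in the question, but
    works on plain Python lists, so no dtype juggling is needed.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]            # 'pre' truncation: keep the tail
        pad = [value] * (maxlen - len(seq))  # filler to reach maxlen
        padded.append(pad + seq if padding == "pre" else seq + pad)
    return padded
```

Note also that in the question each inner list holds whole phrases ("Hello, I'm Bhaskar"), not tokens; for a seq2seq model you normally tokenize first and pad the token lists.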

Doc2vec : TaggedLineDocument()

て烟熏妆下的殇ゞ submitted on 2019-12-14 02:09:10

Question: So, I'm trying to learn and understand Doc2Vec. I'm following this tutorial. My input is a list of documents, i.e. a list of lists of words. This is what my code looks like:

    input = [["word1","word2",..."wordn"],["word1","word2",..."wordn"],...]
    documents = TaggedLineDocument(input)
    model = doc2vec.Doc2Vec(documents, size=50, window=10, min_count=2, workers=2)

But I am getting some Unicode error (tried googling this error, but no good):

    TypeError('don\'t know how to handle uri %s' % repr(uri)
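The TypeError comes from `TaggedLineDocument`, which expects a file path (or file object) and reads one document per line — handing it an in-memory list makes it try to open the list as a URI. For data already in memory, the usual route is to wrap each token list in a `TaggedDocument` instead. `TaggedDocument` is essentially a `(words, tags)` pair, so the conversion can be shown with a dependency-free stand-in (swap in `from gensim.models.doc2vec import TaggedDocument` for real use):

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which is a (words, tags) pair.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

def tag_corpus(list_of_token_lists):
    """Turn a list of token lists into TaggedDocument objects,
    giving each document a single integer tag, as Doc2Vec expects."""
    return [TaggedDocument(words=tokens, tags=[i])
            for i, tokens in enumerate(list_of_token_lists)]
```

With gensim itself, the question's code becomes `documents = tag_corpus(input)` followed by the same `Doc2Vec(documents, ...)` call.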

How to compute letter frequency similarity?

喜你入骨 submitted on 2019-12-14 00:21:52

Question: Given this data (relative letter frequencies from both languages):

    spanish => 'e' => 13.72, 'a' => 11.72, 'o' => 8.44, 's' => 7.20, 'n' => 6.83,
    english => 'e' => 12.60, 't' => 9.37, 'a' => 8.34, 'o' => 7.70, 'n' => 6.80,

and then computing the letter frequency for the string "this is a test" gives me:

    "t"=>21.43, "s"=>14.29, "i"=>7.14, "r"=>7.14, "y"=>7.14, "'"=>7.14, "h"=>7.14, "e"=>7.14, "l"=>7.14

So, what would be a good approach for matching the given string's letter frequency with a
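One standard approach: treat each frequency table as a vector over the letter alphabet and compare vectors with cosine similarity, then pick the language with the highest score. A stdlib-only sketch (function name is mine):

```python
import math

def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two letter-frequency dicts.

    Letters missing from one dict count as 0, so the dicts do not
    need to share the same keys.
    """
    letters = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(l, 0.0) * freq_b.get(l, 0.0) for l in letters)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b)
```

Usage: compute the string's frequency dict, call `cosine_similarity` against each language profile, and take the argmax. Chi-squared distance against the expected frequencies is a common alternative when the sample text is short.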

Python fuzzy search and replace

半腔热情 submitted on 2019-12-13 19:35:36

Question: I need to perform a fuzzy search for a sub-string within a string and replace that part. For example:

    str_a = "Alabama"
    str_b = "REPLACED"
    orig_str = "Flabama is a state located in the southeastern region of the United States."
    print(fuzzy_replace(str_a, str_b, orig_str))  # fuzzy_replace code should be implemented
    # Output: REPLACED is a state located in the southeastern region of the United States.

The search itself is simple with the fuzzywuzzy module, but it gives me only the ratio of difference between
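Since fuzzywuzzy only returns a score, locating *what* to replace means scoring candidate spans yourself. A stdlib-only sketch using `difflib.SequenceMatcher` (the same ratio fuzzywuzzy builds on), limited to single-word candidates; the 0.75 threshold is illustrative and should be tuned:

```python
from difflib import SequenceMatcher

def fuzzy_replace(str_a, str_b, orig_str, threshold=0.75):
    """Replace the single word of orig_str most similar to str_a
    with str_b, if its similarity ratio clears `threshold`."""
    words = orig_str.split()
    best_i, best_ratio = None, threshold
    for i, word in enumerate(words):
        ratio = SequenceMatcher(None, str_a.lower(), word.lower()).ratio()
        if ratio > best_ratio:
            best_i, best_ratio = i, ratio
    if best_i is not None:
        words[best_i] = str_b
    return " ".join(words)
```

For multi-word patterns you would slide a window of len(str_a.split()) words instead of scoring one word at a time; the scoring logic stays the same.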

Associating free text statements with pre-defined attributes

懵懂的女人 submitted on 2019-12-13 19:07:39

Question: I have a list of several dozen product attributes that people are concerned with, such as:

    Financing
    Manufacturing quality
    Durability
    Sales experience

and several million free-text statements from customers about the product, e.g. "The financing was easy but the housing is flimsy." I would like to score each free-text statement in terms of how strongly it relates to each attribute, and whether the association is positive or negative. In the given example, there would be a strong
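Before reaching for embeddings or a trained aspect-based sentiment model, a keyword-lexicon baseline is worth having: map each attribute to trigger words, then let sentiment cue words near a trigger set the sign. Every word list below is an illustrative stand-in for a real lexicon:

```python
def score_statement(text, attribute_keywords,
                    pos_cues=("easy", "great", "good"),
                    neg_cues=("flimsy", "bad", "poor")):
    """Crude signed attribute scorer.

    `attribute_keywords` maps an attribute name to a set of trigger
    words; sentiment cues within 2 words of a trigger add +1 or -1.
    """
    words = text.lower().replace(",", " ").replace(".", " ").split()
    scores = {}
    for attr, keywords in attribute_keywords.items():
        score = 0
        for i, w in enumerate(words):
            if w in keywords:
                window = words[max(0, i - 2): i + 3]  # 2 words either side
                score += sum(c in pos_cues for c in window)
                score -= sum(c in neg_cues for c in window)
        scores[attr] = score
    return scores
```

At millions of statements this baseline also gives you cheap weak labels for training a proper classifier later; the window size and cue lists are the obvious knobs.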

Does the NLTK sentence tokenizer assume correct punctuation and spacing?

拈花ヽ惹草 submitted on 2019-12-13 17:56:35

Question: I'm trying to split sentences using NLTK and I've noticed that it treats sentences without whitespace in between as one sentence. For instance:

    text = 'Today is Monday.I went shopping.'
    sentences = sent_tokenize(text)
    # 1) Today is Monday.I went shopping.

    text = 'Today is Monday. I went shopping.'
    sentences = sent_tokenize(text)
    # 1) Today is Monday.
    # 2) I went shopping.

Is there a way to properly split mispunctuated/mis-spaced sentences?

Answer 1: While sentence segmentation is not very complicated
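NLTK's `sent_tokenize` (the Punkt model) does rely on whitespace after sentence-final punctuation, so a common workaround is a pre-processing regex that restores the missing space before handing the text to the tokenizer. A sketch, with the obvious caveat:

```python
import re

def split_missing_space(text):
    """Insert a space after sentence-final punctuation that is glued
    to the next sentence's capital letter.

    Heuristic only: it will also break apart abbreviations and
    initials like 'U.S.A' or 'J.Smith', so it suits informal text
    more than names-heavy prose.
    """
    return re.sub(r'([.!?])([A-Z])', r'\1 \2', text)
```

Running `sent_tokenize(split_missing_space('Today is Monday.I went shopping.'))` then yields the two expected sentences.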

CoreNLP Stanford Dependency Format

丶灬走出姿态 submitted on 2019-12-13 16:43:26

Question: "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas"

From the above sentence, I am looking to obtain the following typed dependencies:

    nsubjpass(submitted, Bills)
    auxpass(submitted, were)
    agent(submitted, Brownback)
    nn(Brownback, Senator)
    appos(Brownback, Republican)
    prep_of(Republican, Kansas)
    prep_on(Bills, ports)
    conj_and(ports, immigration)
    prep_on(Bills, immigration)

This should be possible as per Table 1, Figure 1 in the documentation for Stanford
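The labels shown (prep_of, conj_and, nn) belong to the older Stanford Dependencies representation in its collapsed, CC-processed variant; recent CoreNLP versions emit Universal Dependencies by default (nmod:of, conj:and, compound), which is why the output may not match Table 1. If I remember the pipeline options correctly, asking the parse annotator for the original representation looks roughly like the properties sketch below — verify the exact flag names against your CoreNLP version's documentation:

```properties
# Hypothetical pipeline config: request the old Stanford Dependencies
annotators = tokenize, ssplit, pos, lemma, parse
# Switch the converter back from Universal to original Stanford Dependencies
parse.originalDependencies = true
# The collapsed, CC-processed output then carries prep_of / conj_and labels
```

The collapsed-CCprocessed variant is what propagates `prep_on(Bills, immigration)` across the conjunction, so make sure you read that annotation rather than the basic dependencies.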

Extract business titles and time periods from string

戏子无情 submitted on 2019-12-13 16:27:08

Question: I am extracting information about certain companies from Reuters using Python. I have been able to get the officer/executive names, biographies, and compensation from this page. Now, I want to extract previous position titles and companies from the biography section, which looks something like this:

    Mr. Donald T. Grimes is Senior Vice President, Chief Financial Officer and Treasurer of Wolverine World Wide, Inc., since May 2008. From 2007 to 2008, he was the Executive Vice President and Chief
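Because these biographies follow a fairly formulaic template ("is/was TITLE of COMPANY", "From YYYY to YYYY", "since Month YYYY"), plain regexes get you a long way before any NLP is needed. The three patterns below are illustrative starting points, not a complete grammar of Reuters biographies:

```python
import re

def extract_roles(bio):
    """Pull titles and time periods from a Reuters-style biography.

    Returns the title phrases found between 'is/was [the]' and 'of',
    closed year ranges, and open-ended 'since' dates.
    """
    # "... is/was [the] <TITLE> of <Company> ..."
    titles = re.findall(r'\b(?:is|was)\s+(?:the\s+)?(.+?)\s+of\s+', bio)
    # Closed ranges: "From 2007 to 2008"
    ranges = re.findall(r'\bFrom\s+(\d{4})\s+to\s+(\d{4})', bio)
    # Open-ended periods: "since May 2008"
    since = re.findall(r'\bsince\s+((?:[A-Z][a-z]+\s+)?\d{4})', bio)
    return {"titles": titles, "ranges": ranges, "since": since}
```

For company names themselves (which don't always follow "of"), a named-entity recognizer such as spaCy's ORG tagger pairs well with these period patterns.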

AssertionError when installing pyrouge

試著忘記壹切 submitted on 2019-12-13 15:14:07

Question: pyrouge: 0.1.3; ROUGE: downloaded from [here][1], since http://www.berouge.com/Pages/default.aspx is not accessible. I have installed XML::DOM and set the ROUGE path, and I've also tried the suggestions from "Errors installing Pyrouge". However, there are still several assertion errors. Any suggestion on how to work it out?

    ======================================================================
    FAIL: test_config_file (pyrouge.tests.Rouge155_test.PyrougeTest)
    ----------------------------------------------