nlp

How to handle <UKN> tokens in text generation

北城余情 submitted on 2019-12-14 03:28:12

Question: In my text generation dataset, I have converted all infrequent words into the <ukn> token (unknown word), as suggested by most of the text-generation literature. However, when training an RNN to take in part of a sentence as input and predict the rest, I am not sure how I should stop the network from generating <ukn> tokens. When the network encounters an unknown (infrequent) word in the training set, what should its output be?

Example sentence: I went to the mall and bought a <ukn> and some
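A common answer is to keep <ukn> as a training target (so the loss stays well defined) but forbid it at generation time by masking its logit before sampling. A minimal stdlib-only sketch of that masking step — the function name and the plain-list logits are illustrative stand-ins for your RNN's output layer:

```python
import math
import random

def sample_without_unk(logits, vocab, unk_token="<ukn>", temperature=1.0):
    """Sample the next token while masking out the unknown token.

    `logits` is a list of raw scores aligned with `vocab`; setting the
    <ukn> logit to -inf gives it exactly zero probability after softmax,
    so it can never be emitted.
    """
    masked = [
        -math.inf if tok == unk_token else logit / temperature
        for tok, logit in zip(vocab, logits)
    ]
    peak = max(l for l in masked if l != -math.inf)  # for numerical stability
    exps = [math.exp(l - peak) if l != -math.inf else 0.0 for l in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(vocab, weights=probs, k=1)[0]
```

During training, when the input contains an infrequent word, you still feed and predict <ukn> as usual; the mask only applies when decoding.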

Difference between padding integers and strings in Keras

自古美人都是妖i submitted on 2019-12-14 02:36:33

Question: I'm trying to pad text for a seq2seq model.

    from keras_preprocessing.sequence import pad_sequences
    x = [["Hello, I'm Bhaskar", "This is Keras"], ["This is an", "experiment"]]
    pad_sequences(sequences=x, maxlen=5, dtype='object', padding='pre', value="<PAD>")

I encounter the following error:

    ValueError: `dtype` object is not compatible with `value`'s type: <class 'str'>
    You should set `dtype=object` for variable length strings.

However, when I try the same with integers it works fine. x=[[1, 2, 3],
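The error message itself suggests passing the Python type `object` (unquoted) rather than the string `'object'`, which is worth trying first. And since pre/post padding of ragged string lists is simple, it can also be done without the library at all — a dependency-free sketch (function name is mine; keras-style 'pre' truncation, i.e. keeping the tail, is assumed):

```python
def pad_sequences_py(sequences, maxlen, padding="pre", value="<PAD>"):
    """Pad (or truncate) each sequence of strings to exactly `maxlen`.

    Mirrors the shape of the pad_sequences call in the question, but
    works on plain Python lists, so no dtype juggling is needed.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]            # 'pre' truncation: keep the tail
        pad = [value] * (maxlen - len(seq))  # filler to reach maxlen
        padded.append(pad + seq if padding == "pre" else seq + pad)
    return padded
```

Note also that in the question each inner list holds whole phrases ("Hello, I'm Bhaskar"), not tokens; for a seq2seq model you normally tokenize first and pad the token lists.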

Doc2vec : TaggedLineDocument()

て烟熏妆下的殇ゞ submitted on 2019-12-14 02:09:10

Question: So, I'm trying to learn and understand Doc2Vec. I'm following this tutorial. My input is a list of documents, i.e. a list of lists of words. This is what my code looks like:

    input = [["word1","word2",..."wordn"],["word1","word2",..."wordn"],...]
    documents = TaggedLineDocument(input)
    model = doc2vec.Doc2Vec(documents, size=50, window=10, min_count=2, workers=2)

But I am getting some Unicode error (tried googling this error, but no good):

    TypeError('don\'t know how to handle uri %s' % repr(uri)
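The TypeError comes from `TaggedLineDocument`, which expects a file path (or file object) and reads one document per line — handing it an in-memory list makes it try to open the list as a URI. For data already in memory, the usual route is to wrap each token list in a `TaggedDocument` instead. `TaggedDocument` is essentially a `(words, tags)` pair, so the conversion can be shown with a dependency-free stand-in (swap in `from gensim.models.doc2vec import TaggedDocument` for real use):

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which is a (words, tags) pair.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

def tag_corpus(list_of_token_lists):
    """Turn a list of token lists into TaggedDocument objects,
    giving each document a single integer tag, as Doc2Vec expects."""
    return [TaggedDocument(words=tokens, tags=[i])
            for i, tokens in enumerate(list_of_token_lists)]
```

With gensim itself, the question's code becomes `documents = tag_corpus(input)` followed by the same `Doc2Vec(documents, ...)` call.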

How to compute letter frequency similarity?

喜你入骨 submitted on 2019-12-14 00:21:52

Question: Given this data (relative letter frequencies from both languages):

    spanish => 'e' => 13.72, 'a' => 11.72, 'o' => 8.44, 's' => 7.20, 'n' => 6.83,
    english => 'e' => 12.60, 't' => 9.37, 'a' => 8.34, 'o' => 7.70, 'n' => 6.80,

and then computing the letter frequency for the string "this is a test" gives me:

    "t"=>21.43, "s"=>14.29, "i"=>7.14, "r"=>7.14, "y"=>7.14, "'"=>7.14, "h"=>7.14, "e"=>7.14, "l"=>7.14

So, what would be a good approach for matching the given string's letter frequency with a
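One standard approach: treat each frequency table as a vector over the letter alphabet and compare vectors with cosine similarity, then pick the language with the highest score. A stdlib-only sketch (function name is mine):

```python
import math

def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two letter-frequency dicts.

    Letters missing from one dict count as 0, so the dicts do not
    need to share the same keys.
    """
    letters = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(l, 0.0) * freq_b.get(l, 0.0) for l in letters)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b)
```

Usage: compute the string's frequency dict, call `cosine_similarity` against each language profile, and take the argmax. Chi-squared distance against the expected frequencies is a common alternative when the sample text is short.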

Python fuzzy search and replace

半腔热情 submitted on 2019-12-13 19:35:36

Question: I need to perform a fuzzy search for a sub-string within a string and replace that part. For example:

    str_a = "Alabama"
    str_b = "REPLACED"
    orig_str = "Flabama is a state located in the southeastern region of the United States."
    print(fuzzy_replace(str_a, str_b, orig_str))  # fuzzy_replace code should be implemented
    # Output: REPLACED is a state located in the southeastern region of the United States.

The search itself is simple with the fuzzywuzzy module, but it gives me only the ratio of difference between
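Since fuzzywuzzy only returns a score, locating *what* to replace means scoring candidate spans yourself. A stdlib-only sketch using `difflib.SequenceMatcher` (the same ratio fuzzywuzzy builds on), limited to single-word candidates; the 0.75 threshold is illustrative and should be tuned:

```python
from difflib import SequenceMatcher

def fuzzy_replace(str_a, str_b, orig_str, threshold=0.75):
    """Replace the single word of orig_str most similar to str_a
    with str_b, if its similarity ratio clears `threshold`."""
    words = orig_str.split()
    best_i, best_ratio = None, threshold
    for i, word in enumerate(words):
        ratio = SequenceMatcher(None, str_a.lower(), word.lower()).ratio()
        if ratio > best_ratio:
            best_i, best_ratio = i, ratio
    if best_i is not None:
        words[best_i] = str_b
    return " ".join(words)
```

For multi-word patterns you would slide a window of len(str_a.split()) words instead of scoring one word at a time; the scoring logic stays the same.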

Associating free text statements with pre-defined attributes

懵懂的女人 submitted on 2019-12-13 19:07:39

Question: I have a list of several dozen product attributes that people are concerned with, such as:

    Financing
    Manufacturing quality
    Durability
    Sales experience

and several million free-text statements from customers about the product, e.g. "The financing was easy but the housing is flimsy." I would like to score each free-text statement in terms of how strongly it relates to each attribute, and whether the association is positive or negative. In the given example, there would be a strong
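Before reaching for embeddings or a trained aspect-based sentiment model, a keyword-lexicon baseline is worth having: map each attribute to trigger words, then let sentiment cue words near a trigger set the sign. Every word list below is an illustrative stand-in for a real lexicon:

```python
def score_statement(text, attribute_keywords,
                    pos_cues=("easy", "great", "good"),
                    neg_cues=("flimsy", "bad", "poor")):
    """Crude signed attribute scorer.

    `attribute_keywords` maps an attribute name to a set of trigger
    words; sentiment cues within 2 words of a trigger add +1 or -1.
    """
    words = text.lower().replace(",", " ").replace(".", " ").split()
    scores = {}
    for attr, keywords in attribute_keywords.items():
        score = 0
        for i, w in enumerate(words):
            if w in keywords:
                window = words[max(0, i - 2): i + 3]  # 2 words either side
                score += sum(c in pos_cues for c in window)
                score -= sum(c in neg_cues for c in window)
        scores[attr] = score
    return scores
```

At millions of statements this baseline also gives you cheap weak labels for training a proper classifier later; the window size and cue lists are the obvious knobs.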

Does the NLTK sentence tokenizer assume correct punctuation and spacing?

拈花ヽ惹草 submitted on 2019-12-13 17:56:35

Question: I'm trying to split sentences using NLTK and I've noticed that it treats sentences without whitespace in between as one sentence. For instance:

    text = 'Today is Monday.I went shopping.'
    sentences = sent_tokenize(text)
    # 1) Today is Monday.I went shopping.

    text = 'Today is Monday. I went shopping.'
    sentences = sent_tokenize(text)
    # 1) Today is Monday.
    # 2) I went shopping.

Is there a way to properly split mispunctuated/mis-spaced sentences?

Answer 1: While sentence segmentation is not very complicated
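NLTK's `sent_tokenize` (the Punkt model) does rely on whitespace after sentence-final punctuation, so a common workaround is a pre-processing regex that restores the missing space before handing the text to the tokenizer. A sketch, with the obvious caveat:

```python
import re

def split_missing_space(text):
    """Insert a space after sentence-final punctuation that is glued
    to the next sentence's capital letter.

    Heuristic only: it will also break apart abbreviations and
    initials like 'U.S.A' or 'J.Smith', so it suits informal text
    more than names-heavy prose.
    """
    return re.sub(r'([.!?])([A-Z])', r'\1 \2', text)
```

Running `sent_tokenize(split_missing_space('Today is Monday.I went shopping.'))` then yields the two expected sentences.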

CoreNLP Stanford Dependency Format

丶灬走出姿态 submitted on 2019-12-13 16:43:26

Question: "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas"

From the above sentence, I am looking to obtain the following typed dependencies:

    nsubjpass(submitted, Bills)
    auxpass(submitted, were)
    agent(submitted, Brownback)
    nn(Brownback, Senator)
    appos(Brownback, Republican)
    prep_of(Republican, Kansas)
    prep_on(Bills, ports)
    conj_and(ports, immigration)
    prep_on(Bills, immigration)

This should be possible as per Table 1, Figure 1 in the documentation for Stanford
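The labels shown (prep_of, conj_and, nn) belong to the older Stanford Dependencies representation in its collapsed, CC-processed variant; recent CoreNLP versions emit Universal Dependencies by default (nmod:of, conj:and, compound), which is why the output may not match Table 1. If I remember the pipeline options correctly, asking the parse annotator for the original representation looks roughly like the properties sketch below — verify the exact flag names against your CoreNLP version's documentation:

```properties
# Hypothetical pipeline config: request the old Stanford Dependencies
annotators = tokenize, ssplit, pos, lemma, parse
# Switch the converter back from Universal to original Stanford Dependencies
parse.originalDependencies = true
# The collapsed, CC-processed output then carries prep_of / conj_and labels
```

The collapsed-CCprocessed variant is what propagates `prep_on(Bills, immigration)` across the conjunction, so make sure you read that annotation rather than the basic dependencies.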

Extract business titles and time periods from string

戏子无情 submitted on 2019-12-13 16:27:08

Question: I am extracting information about certain companies from Reuters using Python. I have been able to get the officer/executive names, biographies, and compensation from this page. Now, I want to extract previous position titles and companies from the biography section, which looks something like this:

    Mr. Donald T. Grimes is Senior Vice President, Chief Financial Officer and Treasurer of Wolverine World Wide, Inc., since May 2008. From 2007 to 2008, he was the Executive Vice President and Chief
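Because these biographies follow a fairly formulaic template ("is/was TITLE of COMPANY", "From YYYY to YYYY", "since Month YYYY"), plain regexes get you a long way before any NLP is needed. The three patterns below are illustrative starting points, not a complete grammar of Reuters biographies:

```python
import re

def extract_roles(bio):
    """Pull titles and time periods from a Reuters-style biography.

    Returns the title phrases found between 'is/was [the]' and 'of',
    closed year ranges, and open-ended 'since' dates.
    """
    # "... is/was [the] <TITLE> of <Company> ..."
    titles = re.findall(r'\b(?:is|was)\s+(?:the\s+)?(.+?)\s+of\s+', bio)
    # Closed ranges: "From 2007 to 2008"
    ranges = re.findall(r'\bFrom\s+(\d{4})\s+to\s+(\d{4})', bio)
    # Open-ended periods: "since May 2008"
    since = re.findall(r'\bsince\s+((?:[A-Z][a-z]+\s+)?\d{4})', bio)
    return {"titles": titles, "ranges": ranges, "since": since}
```

For company names themselves (which don't always follow "of"), a named-entity recognizer such as spaCy's ORG tagger pairs well with these period patterns.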

AssertionError when installing pyrouge

試著忘記壹切 submitted on 2019-12-13 15:14:07

Question: pyrouge: 0.1.3; ROUGE: downloaded from [here][1], since http://www.berouge.com/Pages/default.aspx is not accessible. I have installed XML::DOM and set the ROUGE path, and I've also tried the suggestions from "Errors installing Pyrouge". However, there are still several assertion errors. Any suggestion on how to work it out?

    ======================================================================
    FAIL: test_config_file (pyrouge.tests.Rouge155_test.PyrougeTest)
    ----------------------------------------------