nlp

Python: Quote strings in multiple CSVs and merge files together

自作多情 submitted on 2019-12-13 04:44:53
Question: I have a directory of roughly 600 CSV files that contain Twitter data with multiple fields of various types (ints, floats, and strings). I have a script that can merge the files together, but the string fields can themselves contain commas and are not quoted, causing those fields to split apart and force text onto new lines. Is it possible to quote the strings in each file and then merge them into a single file? Below is the script I use to merge the files and some sample data. Merger script: %

Text data replacement using dictionary

ぐ巨炮叔叔 submitted on 2019-12-13 04:10:43
Question: I have a DataFrame with the structure below:

ID  text
0   Language processing in python th is great
1   Relace the string

and a dictionary named custom_fix: {'Relace': 'Replace', 'th': 'three'}. I tried the code below, but the replacements also fire inside other words. Current output:

ID  text
0   Language processing in pythirdon three is great
1   Replace threee string

Code:

def multiple_replace(dict, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))
    # For each match,
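The stray replacements happen because the compiled pattern matches keys anywhere in the string, including inside words like "python" and "the". A minimal sketch of a fix, adding \b word-boundary anchors around the alternation (`multiple_replace_words` is an illustrative name):

```python
import re

def multiple_replace_words(mapping, text):
    """Replace dictionary keys in `text`, but only as whole words."""
    # \b anchors make each key match only as a complete word, so 'th'
    # no longer fires inside 'python' or 'the'.
    pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, mapping)))
    return pattern.sub(lambda m: mapping[m.group(1)], text)
```

With the custom_fix dictionary above, "python th is great" becomes "python three is great" while "python" itself is left alone. Note this relies on the keys starting and ending with word characters; keys with leading or trailing punctuation would need different anchors.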

spaCy: errors attempting to load serialized Doc

…衆ロ難τιáo~ submitted on 2019-12-13 03:44:13
Question: I am trying to serialize/deserialize spaCy documents (setup is Windows 7, Anaconda) and am getting errors. I haven't been able to find any explanation. Here is a snippet of code and the error it generates:

import spacy

nlp = spacy.load('en')
text = 'This is a test.'
doc = nlp(text)
fout = 'test.spacy'  # <-- according to the API for Doc.to_disk(), this needs to be a directory (but for me, spaCy writes a file)
doc.to_disk(fout)
doc.from_disk(fout)

Traceback (most recent call last):
  File "

Why does Mallet text classification output the same value 1.0 for all test files?

旧巷老猫 submitted on 2019-12-13 03:38:23
Question: I am learning the Mallet text classification command lines. The output values for estimating the different classes are all the same, 1.0, and I do not know where I am incorrect. Can you help? Mallet version: E:\Mallet\mallet-2.0.8RC3

// there is a txt file about cat breeds (catmaterial.txt) in the cat dir
// command 1
C:\Users\toshiba>mallet import-dir --input E:\Mallet\testmaterial\cat --output E:\Mallet\testmaterial\cat.mallet --remove-stopwords
// command 1 output
Labels = E:\Mallet\testmaterial\cat /

Unnest grab keywords/nextwords/beforewords function

折月煮酒 submitted on 2019-12-13 03:33:29
Question: Background: I have the following code to create a df:

import pandas as pd

word_list = ['crayons', 'cars', 'camels']
l = ['there are many different crayons in the bright blue box and crayons of all different colors',
     'i like a lot of sports cars because they go really fast',
     'the middle east has many camels to ride and have fun',
     'all camels are fun']
df = pd.DataFrame(l, columns=['Text'])

The df looks like this:

Text
0 there are many different crayons in the bright blue box and crayons of all
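The question is cut off before the desired output, but the keywords/nextwords/beforewords idea in the title can be sketched in plain Python. A hedged sketch, assuming the goal is to capture the n words on either side of each keyword hit (`grab_context` and `window` are illustrative names, not from the original post):

```python
def grab_context(text, keywords, window=2):
    """For each keyword occurrence in `text`, return a tuple of
    (words before, keyword, words after), up to `window` words each side."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        if w in keywords:
            before = words[max(0, i - window):i]   # clamp at start of text
            after = words[i + 1:i + 1 + window]    # slicing clamps at the end
            hits.append((before, w, after))
    return hits
```

Applied to the last row above, grab_context('all camels are fun', ['camels']) yields [(['all'], 'camels', ['are', 'fun'])]; the same function could be mapped over df['Text'] with df['Text'].apply(...).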

Better way to use SpaCy to parse sentences?

依然范特西╮ submitted on 2019-12-13 03:31:24
Question: I'm using spaCy to find sentences that contain 'is' or 'was' with a pronoun as their subject, and to return the object of the sentence. My code works, but I feel like there must be a much better way to do this.

import spacy

nlp = spacy.load('en_core_web_sm')
ex_phrase = nlp("He was a genius. I really liked working with him. He is a dog owner. She is very kind to animals.")

# create an empty list to hold any instance of this particular construction
list_of_responses = []

# split into sentences

How to calculate the similarity measure of text document?

早过忘川 submitted on 2019-12-13 03:07:39
Question: I have a CSV file that looks like:

idx messages
112 I have a car and it is blue
114 I have a bike and it is red
115 I don't have any car
117 I don't have any bike

I would like code that reads the file and computes the similarity between the messages. I have looked into many posts regarding this, but they were either hard for me to understand or not exactly what I want. Based on some posts and webpages saying "a simple and effective one is Cosine similarity" or "Universal
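The cosine-similarity route the post mentions can be sketched without any ML libraries, using raw bag-of-words counts. A minimal sketch (`cosine_similarity` is an illustrative name; a real pipeline would typically add TF-IDF weighting or embeddings on top of this):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two texts, using bag-of-words counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # dot product over the words of one vector (missing words count as 0)
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

On rows 112 and 114 above, six of the eight words are shared (only car/blue vs bike/red differ), which gives a similarity of 6/8 = 0.75.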

Unknown symbol in nltk pos tagging for Arabic

孤人 submitted on 2019-12-13 02:59:48
Question: I have used NLTK to tokenize some Arabic text; however, I ended up with results like (u'an arabic character/word', '``') or (u'an arabic character/word', ':'). The `` and : tags are not explained in the documentation, so I would like to find out what they are.

from nltk.tokenize.punkt import PunktWordTokenizer

z = "أنا تسلق شجرة"
tkn = PunktWordTokenizer()
sent = tkn.tokenize(z)
tokens = nltk.pos_tag(sent)
print tokens

Answer 1: The default NLTK POS tagger is trained on English texts and is

How to implement a network using BERT as a paragraph encoder for long-text classification in Keras?

ぃ、小莉子 submitted on 2019-12-13 02:59:20
Question: I am doing a long-text classification task where a document has more than 10,000 words. I am planning to use BERT as a paragraph encoder, then feed the paragraph embeddings into a BiLSTM step by step. The network is as below:

Input: (batch_size, max_paragraph_len, max_tokens_per_para, embedding_size)
BERT layer: (max_paragraph_len, paragraph_embedding_size)
LSTM layer: ???
output layer: (batch_size, classification_size)

How can I implement this with Keras? I am using keras's load_trained_model_from

Python pandas extracting hyphenated words from cells with phrases

橙三吉。 submitted on 2019-12-13 02:57:50
Question: I have a dataframe which contains phrases, and I want to extract only the compound words separated by a hyphen and place them in another dataframe.

df = pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green', 'Kim Jong-il was here', 'President Barack Obama', 'methyl-butane', 'Derp da-derp derp', 'Pok-e-mon']})

So far here is what I have:

import pandas as pd
df = pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green', 'Kim Jong-il was here', 'President Barack Obama', 'methyl-butane', 'Derp
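The asker's attempt is cut off, but the extraction itself can be sketched with a regex and pandas string methods. A minimal sketch: the pattern matches runs of word characters joined by one or more hyphens, which also catches multi-hyphen compounds like Pok-e-mon (the column name 'Compound' is an illustrative choice, not from the original post):

```python
import pandas as pd

df = pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green', 'Kim Jong-il was here',
                               'President Barack Obama', 'methyl-butane',
                               'Derp da-derp derp', 'Pok-e-mon']})

# One or more word runs joined by hyphens, e.g. Yellow-Green or Pok-e-mon.
pattern = r'\b\w+(?:-\w+)+\b'

# findall gives a list per row; explode flattens, dropna removes rows
# with no hyphenated words (empty lists explode to NaN).
compounds = df['Phrases'].str.findall(pattern).explode().dropna()
result = pd.DataFrame({'Compound': compounds.tolist()})
```

For the sample data this yields Yellow-Green, Jong-il, methyl-butane, da-derp, and Pok-e-mon; the row for 'President Barack Obama' drops out because it contains no hyphenated compound. Series.explode requires pandas 0.25 or later.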