nlp

Python NLP Text Tokenization based on custom regex

Submitted by 时光毁灭记忆、已成空白 on 2020-05-09 16:02:28
Question: I am processing a large amount of text for custom Named Entity Recognition (NER) with spaCy. For text pre-processing I am using nltk for tokenization, etc. I am able to process one of my custom entities, which is based on simple strings. But the other custom entity is a combination of a number and certain text (20 BBLs, for example). The word_tokenize method from nltk.tokenize tokenizes 20 and 'BBLs' as two separate tokens. What I want is to treat them (the number and the 'BBLs'
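One way to keep a number and its unit together is a regex tokenizer whose pattern tries the multi-word alternative first. A minimal sketch with the standard-library re module (the same pattern can be handed to nltk's RegexpTokenizer); the BBLs unit and the sample sentence are assumptions for illustration:

```python
import re

# Order matters: the "number + unit" alternative must come before the
# generic word alternative, or the number is matched on its own first.
TOKEN_PATTERN = re.compile(r"\d+\s*BBLs|\w+|[^\w\s]")

def tokenize(text):
    """Tokenize, keeping quantities like '20 BBLs' as one token."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Transfer 20 BBLs of crude oil."))
# → ['Transfer', '20 BBLs', 'of', 'crude', 'oil', '.']
```

Any other number-plus-unit entity can be added as a further leading alternative in the same pattern.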

How to properly encode UTF-8 txt files for R topic model

Submitted by 二次信任 on 2020-04-30 09:27:18
Question: Similar issues have been discussed on this forum (e.g. here and here), but I have not found one that solves my problem, so I apologize for a seemingly similar question. I have a set of .txt files with UTF-8 encoding (see the screenshot). I am trying to run a topic model in R using the tm package. However, despite using encoding = "UTF-8" when creating the corpus, I get obvious encoding problems. For instance, I get <U+FB01>scal instead of fiscal, in<U+FB02>uenc instead of influence,
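The <U+FB01> and <U+FB02> artifacts are the Unicode ligatures fi and fl, which term-matching code treats as different characters from the plain letters. Independent of the R/tm encoding settings, the underlying fix is Unicode compatibility normalization (NFKC), which decomposes such ligatures; a sketch in Python's standard unicodedata module (the same transform is available in R, e.g. via stringi's NFKC normalization):

```python
import unicodedata

def fix_ligatures(text):
    """Decompose compatibility characters such as the fi/fl ligatures."""
    return unicodedata.normalize("NFKC", text)

garbled = "\ufb01scal policy has a broad in\ufb02uence"
print(fix_ligatures(garbled))
# → fiscal policy has a broad influence
```

Running this over the raw files before building the corpus removes the ligature problem at the source.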

Document Similarity - Multiple documents ended with same similarity score

Submitted by 旧巷老猫 on 2020-04-30 06:32:06
Question: I have been working on a business problem where I need to find the similarity of a new document to existing ones. I have used various approaches:
1. Bag of words + cosine similarity
2. TF-IDF + cosine similarity
3. Word2Vec + cosine similarity
None of them worked as expected. But finally I found an approach that works better: Word2Vec + soft cosine similarity. The new challenge is that I ended up with multiple documents with the same similarity score. Most of them are relevant but a few of them
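Soft cosine similarity generalizes plain cosine with a term-term similarity matrix S (e.g. built from Word2Vec word similarities), so related-but-different terms still contribute. A minimal NumPy sketch of the formula itself; the toy document vectors and S are assumptions:

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine: (a·S·b) / sqrt((a·S·a)(b·S·b))."""
    return (a @ S @ b) / np.sqrt((a @ S @ a) * (b @ S @ b))

# Two "documents" over a 2-term vocabulary; the off-diagonal 0.5 says
# the two terms are semantically related (e.g. a Word2Vec similarity).
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
S = np.array([[1.0, 0.5],
              [0.5, 1.0]])

print(soft_cosine(a, b, S))          # 0.5 — nonzero despite no shared terms
print(soft_cosine(a, b, np.eye(2)))  # 0.0 — with S = I it is plain cosine
```

When several documents tie on this score, a secondary sort key (e.g. plain cosine, or shared-term count) is one simple way to break ties deterministically.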

Tweet classification into multiple categories on (Unsupervised data/tweets)

Submitted by 不想你离开。 on 2020-04-18 13:00:26
Question: I want to classify tweets into predefined categories (such as sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or an SVM, as described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf. But I cannot figure out a way with unlabeled data. One possibility could be using Expectation-Maximization to generate clusters and then labeling those clusters. But as said earlier, I have a predefined set of classes, so clustering
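Without labels, one common workaround is to seed each predefined category with a few keywords and assign each tweet to the category with the strongest seed overlap, rather than clustering first. A minimal sketch; the seed word lists are assumptions and would need curating per category:

```python
# Hypothetical seed words per predefined category.
CATEGORY_SEEDS = {
    "sports": {"match", "goal", "team", "score", "league"},
    "health": {"doctor", "diet", "fitness", "disease", "hospital"},
}

def classify(tweet):
    """Assign the category whose seed words overlap the tweet the most."""
    words = set(tweet.lower().split())
    scores = {cat: len(words & seeds) for cat, seeds in CATEGORY_SEEDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None = no category matched

print(classify("our team scored a late goal in the league match"))
# → sports
```

In practice the seed sets can be expanded automatically, e.g. with each seed word's Word2Vec nearest neighbors, and the resulting weak labels can bootstrap a proper classifier.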

Problem in coding a Welcome Message along with options in RASA

Submitted by 血红的双手。 on 2020-04-18 05:49:44
Question: I read this answer on how to code a welcome message in RASA and accordingly wrote a custom action, but it does not display the message as soon as the session starts; instead, it replies only after the user has sent a message. Below is my code for printing just the welcome message, which I put in my actions.py file. Please help me fix this problem. The image below is an example of how I want my bot to start: it would open with a general message and then give options, which

nltk: comparing Chinese document similarity, a complete example

Submitted by 爷，独闯天下 on 2020-04-18 04:35:51
nltk can also handle Chinese text, with the following changes: use a Chinese word segmenter (I chose jieba), handle Chinese characters as unicode strings, declare the Python source encoding as gbk, and use a corpus that supports Chinese. The code below requires jieba:

#!/usr/bin/env python
#-*-coding=gbk-*-
"""
Raw data used to build the model
"""
# A slimmed-down version of courses; the real data format is
# course name\tcourse summary\tcourse details, with html and other noise already removed
courses = [
    u'Writing II: Rhetorical Composing',
    u'Genetics and Society: A Course for Educators',
    u'General Game Playing',
    u'Genes and the Human Condition (From Behavior to Biotechnology)',
    u'A Brief History of Humankind',
    u'New Models of Business in Society',
    u'Analyse Numérique pour Ingénieurs',
    u'Evolution: A Course for Educators',
    u'Coding the Matrix: Linear

How to count the frequency of words existing in a text using nltk

Submitted by 混江龙づ霸主 on 2020-04-17 20:48:07
Question: I have a Python script that reads text and applies preprocessing functions for analysis. The problem is that I want to count the frequency of words, but the script crashes with the error below.

File "F:\AIenv\textAnalysis\setup.py", line 208, in tag_and_save
    file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
TypeError: tuple indices must be integers or slices, not str

I am trying to count the frequency and then write it to a text file. def get_freq
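That TypeError suggests freq_tagged_data is being indexed with a plain string while its keys (or the object itself) are (word, tag) tuples; counting and indexing with the full tuple avoids it. A sketch with collections.Counter (nltk's FreqDist is keyed the same way when built over tagged tokens); the sample tagged tokens are assumptions:

```python
from collections import Counter

# Tagged tokens as produced by a POS tagger: (word, tag) pairs.
tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("the", "DT")]

# Keys are the full (word, tag) tuples, so index with the tuple,
# not with the word alone.
freq = Counter(tagged)

lines = [f"{word}/{tag} (frequency={freq[(word, tag)]})"
         for word, tag in freq]
print("\n".join(lines))
# → the/DT (frequency=2)
#   cat/NN (frequency=1)
#   sat/VBD (frequency=1)
```

The same lines can then be written to the output file exactly as in the snippet above, with freq_tagged_data[(word, tag)] in place of freq_tagged_data[word].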

Filling torch tensor with zeros after certain index

Submitted by 二次信任 on 2020-04-14 08:24:12
Question: Given a 3d tensor of shape batch x sentence length x embedding dim, say a = torch.rand((10, 1000, 96)), and an array (or tensor) of actual lengths for each sentence, lengths = torch.randint(1000, (10,)), which outputs tensor([ 370., 502., 652., 859., 545., 964., 566., 576., 1000., 803.]): how do I fill tensor 'a' with zeros after a certain index along dimension 1 (sentence length) according to tensor 'lengths'? I want something like a[:, lengths:, :] = 0. One way of doing it (slow if the batch size is big enough)
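The slice a[:, lengths:, :] cannot work because the cut-off differs per batch element; the usual vectorized approach builds a boolean mask by broadcasting a position index against the lengths. A sketch under the shapes given above:

```python
import torch

a = torch.rand(10, 1000, 96)             # batch x sentence length x emb dim
lengths = torch.randint(1, 1001, (10,))  # valid length per sentence

# positions (1, seq_len) < lengths (batch, 1) -> (batch, seq_len) bool mask:
# True where the position is inside the sentence, False in the padding.
mask = torch.arange(a.size(1))[None, :] < lengths[:, None]

# Broadcast over the embedding dimension and zero out the padding.
a = a * mask.unsqueeze(-1)
```

This is a single vectorized operation over the whole batch, so it avoids the slow per-sentence loop hinted at in the question.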

NLP(natural language processing) How to detect question with any method?

Submitted by 醉酒当歌 on 2020-04-12 07:35:31
Question: I am looking for a machine learning method that detects questions. For example:

User: Please tell me your name ?
AI: (AI finds the user wants to know its name) My name is [AI's name].

My dataset is as follows, in [label], [question] format:

1, What's your name?
1, Tell me your name.
...

But the problem is that the input may also include things that are not questions. For example:

User: Hello, my name is [User name]
AI: (this is not a question) (throw another process) (->) Nice to meet you.

The number of question categories
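Before reaching for a trained classifier, a rule-based baseline often covers much of this: a '?' terminator, an interrogative opener, or a request verb. A sketch with an assumed keyword list; a classifier trained on the labeled data above could replace or augment it later:

```python
# Hypothetical set of words that open a question or a request.
INTERROGATIVES = {"what", "who", "whom", "where", "when", "why", "how",
                  "is", "are", "do", "does", "did", "can", "could",
                  "would", "will", "tell", "please"}

def is_question(text):
    """Heuristic question detector: '?' ending or interrogative opener."""
    text = text.strip().lower()
    if text.endswith("?"):
        return True
    words = text.split()
    first = words[0].strip(",.!") if words else ""
    return first in INTERROGATIVES

print(is_question("Please tell me your name ?"))  # True
print(is_question("Tell me your name."))          # True
print(is_question("Hello, my name is Alice"))     # False
```

Inputs the heuristic flags as questions can then be routed to the per-category question classifier, and the rest to the non-question process mentioned above.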