nlp

How to preserve the number of records in word2vec?

梦想与她 submitted on 2020-01-07 03:48:06
Question: I have 45,000 text records in my dataframe. I want to convert those 45,000 records into word vectors so that I can train a classifier on them. I am not tokenizing the sentences; I just split each entry into a list of words. After training a word2vec model with 300 features, the resulting model covered only 26,000 entries. How can I preserve all of my 45,000 records? The classifier model needs all 45,000 records so that they can match the 45,000 output labels. Answer 1: If you are…
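The answer is cut off above, but a common fix for this mismatch is to map every record (not every word) to a fixed-length vector, e.g. by averaging the word vectors of its tokens, with a zero-vector fallback for records whose words are all out of vocabulary. A minimal sketch, using a made-up toy embedding dict in place of a trained word2vec model:

```python
# Sketch: turn each record into one fixed-length vector by averaging its
# word vectors, so 45,000 records always yield 45,000 rows.
# toy_vectors is hypothetical stand-in data, not from the question.

toy_vectors = {
    "good": [0.9, 0.1, 0.0, 0.2],
    "bad":  [-0.8, 0.0, 0.1, 0.3],
    "film": [0.1, 0.7, 0.2, 0.0],
}
DIM = 4

def record_to_vector(record):
    """Average the vectors of in-vocabulary words; zero vector if none."""
    words = [w for w in record.split() if w in toy_vectors]
    if not words:
        return [0.0] * DIM          # an all-OOV record still gets a row
    cols = zip(*(toy_vectors[w] for w in words))
    return [sum(c) / len(words) for c in cols]

records = ["good film", "bad film", "unknown words only"]
matrix = [record_to_vector(r) for r in records]
assert len(matrix) == len(records)   # one row per record, always
```

Because the fallback guarantees one output row per input record, the feature matrix lines up with the label vector regardless of vocabulary coverage.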

Computing cosine similarities on a large corpus in R using quanteda

断了今生、忘了曾经 submitted on 2020-01-07 03:04:47
Question: I am trying to work with a very large corpus of about 85,000 tweets that I'm comparing to dialog from television commercials. However, due to the size of my corpus, I am unable to compute the cosine similarity measure without getting the "Error: cannot allocate vector of size n" message (26 GB in my case). I am already running 64-bit R on a server with lots of memory. I've also tried using the AWS instance with the most memory (244 GB), but to no avail (same error). Is there a…
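The excerpt is truncated, but the usual way around this allocation error is to avoid materializing the full n×n similarity matrix: process one chunk of query rows at a time and keep only what you need (for example, each tweet's best-matching commercial). A language-agnostic sketch of that chunked idea in Python with plain lists (the vectors below are hypothetical document-term rows, not the asker's data):

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_matches(queries, targets, chunk_size=1000):
    """For each query, return (index, similarity) of its best target,
    processing queries in chunks so the full matrix never exists in memory."""
    results = []
    for start in range(0, len(queries), chunk_size):
        for q in queries[start:start + chunk_size]:
            sims = [cosine(q, t) for t in targets]
            j = max(range(len(sims)), key=sims.__getitem__)
            results.append((j, sims[j]))
    return results

tweets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy dtm rows
ads = [[2.0, 0.0], [1.0, 1.0]]
matches = best_matches(tweets, ads, chunk_size=2)
```

Memory then scales with one chunk of similarities rather than 85,000², at the cost of a loop; in R, the same chunking pattern applies to quanteda's document-feature matrix.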

Spring Boot Service works locally but not remotely

人走茶凉 submitted on 2020-01-06 19:53:18
Question: I'm trying to create a very simple RESTful web service with Spring Boot that performs NLP on the content passed as a parameter. You can find it on my GitHub. For some reason, I can't deploy it to the Tomcat container on my home server as a WAR (see here), so I decided at least to try to set it up as a runnable JAR. If I run it on my development machine by invoking: java -jar -Xss32M -Xmx8G -XX:+UseG1GC -XX:+UseStringDeduplication ClearWS-0.1.0.jar it works like a charm. If I point my…

深度学习_05

筅森魡賤 submitted on 2020-01-06 15:35:45
Natural Language Processing

Common NLP tasks:
- Automatic summarization - seq2seq
- Coreference resolution - e.g. "Xiao Ming got out of school, and his mom went to pick him up"; the "him" is Xiao Ming
- Machine translation - statistical machine translation (SMT) models
- Part-of-speech tagging - heat (v.), water (n.)
- Word segmentation (Chinese, Japanese, etc.) - 大水沟/很/难/过
- Topic identification
- Text classification

NLP methods:
- Describe the distribution via parameters
- Word encodings should preserve word similarity: semantic closeness, similarity of spatial distribution, substructure of the vector space
- Representing a word in a computer: WordNet as a dictionary; discrete representations
- Discrete representation: One-Hot encoding
- Discrete representation: Bag of Words
- Term weighting: TF-IDF, the importance of a word in a document, log(1 + N/n)
- Binary weighting; short-text similarity
- Language models
- Problems with discrete representations
- Distributed representations
- Co-occurrence matrices, used for topic models

Source: https://www.cnblogs.com/jly1/p/12153293.html
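The TF-IDF weighting the notes mention can be sketched concretely. The log(1 + N/n) form is the variant cited above (N = total documents, n = documents containing the term); the three-document toy corpus is made up for illustration:

```python
import math

# Hypothetical toy corpus of tokenized documents
docs = [
    ["deep", "learning", "nlp"],
    ["nlp", "tasks"],
    ["deep", "networks"],
]
N = len(docs)

def tf_idf(term, doc):
    """TF-IDF with the log(1 + N/n) IDF variant from the notes above."""
    tf = doc.count(term) / len(doc)            # term frequency in this doc
    n = sum(1 for d in docs if term in d)      # documents containing the term
    idf = math.log(1 + N / n) if n else 0.0
    return tf * idf

# "nlp" appears in 2 of 3 docs, so it is weighted lower per occurrence
# than "networks", which appears in only 1 of 3.
common = tf_idf("nlp", docs[0])
rare = tf_idf("networks", docs[2])
```

As expected, the rarer term gets the larger weight, which is the point of the scheme: words that appear everywhere carry little information about any one document.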

Botpress native NLU languages supported

狂风中的少年 submitted on 2020-01-06 05:51:05
Question: Is there a list of languages supported by Botpress native NLU? Is Czech supported by default, or only by using a third party such as Watson? Answer 1: I assume you were using early Botpress 11. In those versions, Botpress used fastText to train word vectors on the data you provided, with a basic whitespace tokenizer. A couple of pre-trained languages were offered, but sadly not Czech. In the newer version of Botpress (12.x), the NLU structure seems to have drastically changed and more…

Chatbot that will answer from the given Information/Documents

家住魔仙堡 submitted on 2020-01-06 05:35:07
Question: I want to make a chatbot that will answer questions based on a given set of documents. For example, if I have hundreds of documents and want some information from them but don't know which page or line it is on, I have to spend time and effort searching. I want a chatbot that will learn from those documents and give answers drawn from them. Is there any available service that can fulfill my needs? And if I want to build a model myself, what tools/libraries do I…

prolog function to infer new facts from data

断了今生、忘了曾经 submitted on 2020-01-06 02:50:07
Question: I have a dataset containing "facts" in a form recognizable to Prolog, i.e.: 'be'('mr jiang', 'representative of china'). 'support'('the establishment of the sar', 'mr jiang'). 'be more than'('# distinguished guests', 'the principal representatives'). 'end with'('the playing of the british national anthem', 'hong kong'). 'follow at'('the stroke of midnight', 'this'). 'take part in'('the ceremony', 'both countries'). 'start at about'('# pm', 'the ceremony'). 'end about'('# am', 'the ceremony'). I want…

Sorting FreqDist in NLTK with get vs get()

被刻印的时光 ゝ submitted on 2020-01-06 02:35:11
Question: I am playing around with NLTK and the FreqDist class: import nltk from nltk.corpus import gutenberg print(gutenberg.fileids()) from nltk import FreqDist fd = FreqDist() for word in gutenberg.words('austen-persuasion.txt'): fd[word] += 1 newfd = sorted(fd, key=fd.get, reverse=True)[:10] I have a question regarding the sort portion. When I run the code like this, it properly sorts the FreqDist object. However, when I run it with get() instead of get, I encounter…
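The distinction the question runs into can be reproduced with a plain dict (FreqDist behaves the same way, since it is a Counter subclass and inherits dict's get). The key= parameter of sorted expects a callable, so passing the bound method fd.get works; writing fd.get() instead calls the method immediately, before sorted runs, and fails because get requires a key argument. A self-contained sketch with toy counts:

```python
# Toy stand-in for a FreqDist: word -> frequency
counts = {"the": 10, "of": 7, "rain": 2}

# key=counts.get passes the method itself; sorted calls it once per key
top = sorted(counts, key=counts.get, reverse=True)
print(top)  # ['the', 'of', 'rain']

# counts.get() is evaluated immediately, before sorted ever runs,
# and raises because dict.get needs at least one argument
try:
    sorted(counts, key=counts.get(), reverse=True)
except TypeError as err:
    print("TypeError:", err)
```

So the error has nothing to do with NLTK itself; it is the general Python rule that a name refers to a function while parentheses invoke it.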

Force Stanford CoreNLP Parser to Prioritize 'S' Label at Root Level

旧时模样 submitted on 2020-01-06 01:32:23
Question: Greetings, NLP experts. I am using the Stanford CoreNLP software package to produce constituency parses, with the most recent version (3.9.2) of the English language models JAR, downloaded from the CoreNLP download page. I access the parser via the Python interface in the NLTK module nltk.parse.corenlp. Here is a snippet from the top of my main module: import nltk from nltk.tree import ParentedTree from nltk.parse.corenlp import CoreNLPParser parser = CoreNLPParser(url='http://localhost…
