
Python Arabic NLP

有些话、适合烂在心里 submitted on 2019-12-18 10:14:05
Question: I'm in the process of assessing the capabilities of NLTK for processing Arabic text in a research project to analyze and extract sentiment. The questions are as follows: Is NLTK capable of handling and analyzing Arabic text? Is Python capable of manipulating/tokenizing Arabic text? Will I be able to parse and store Arabic text using Python? If Python and NLTK aren't the tools for this job, what tools would you recommend (if any exist)? Thank you. EDIT Based on research: NLTK is only
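On the "can Python manipulate Arabic text" part: Python 3 strings are Unicode, so Arabic text can be stored, sliced, and matched directly. Below is a minimal sketch of a regex tokenizer over the Arabic Unicode block; this is a hypothetical illustration, not NLTK's tokenizer (NLTK's word_tokenize produces similar whitespace/punctuation splits, but its stemming and tagging models are English-centric).

```python
import re

# Python 3 str is Unicode, so Arabic text needs no special handling.
text = "اللغة العربية جميلة"

# Tokenize on runs of Arabic letters (hypothetical minimal tokenizer
# using the Arabic Unicode block U+0600-U+06FF).
tokens = re.findall(r"[\u0600-\u06FF]+", text)
print(tokens)
```

Storing or parsing such tokens afterwards is ordinary Python string handling; the harder part (Arabic stemming, POS tagging) needs Arabic-specific models.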

What are the major differences and benefits of Porter and Lancaster Stemming algorithms? [closed]

非 Y 不嫁゛ submitted on 2019-12-18 10:04:29
Question: Closed. This question needs to be more focused and is not currently accepting answers; it was closed 3 years ago. I'm working on document classification tasks in Java. Both algorithms came highly recommended; what are the benefits and disadvantages of each, and which is more commonly used in the literature for natural language processing tasks? Answer 1: At the very basics of it, the major
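The difference is easy to see by running both stemmers side by side. A quick sketch using NLTK's implementations (this assumes the nltk package is installed; Lancaster is the more aggressive of the two and often truncates words harder than Porter):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Lancaster tends to cut more aggressively than Porter.
for word in ["running", "maximum", "crying"]:
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
```

In practice Porter (or its Snowball successor) is the conservative default seen most often in the literature; Lancaster's aggressiveness can conflate unrelated words, which may hurt document classification.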

word2vec: negative sampling (in layman term)?

﹥>﹥吖頭↗ submitted on 2019-12-18 09:54:36
Question: I'm reading the paper below and I have some trouble understanding the concept of negative sampling. http://arxiv.org/pdf/1402.3722v1.pdf Can anyone help, please? Answer 1: The idea of word2vec is to maximize the similarity (dot product) between the vectors of words that appear close together (in the context of each other) in text, and to minimize the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have v_c * v_w -------------
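The point of negative sampling is that instead of normalizing over the whole vocabulary, each (context, word) pair is scored against only a handful of sampled "negative" words. A pure-Python sketch of the per-pair loss -log σ(v_c·v_w) - Σ log σ(-v_c·v_n), with made-up 2-d toy vectors (an illustration of the objective, not the paper's full training loop):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ns_loss(v_c, v_w, negatives):
    """Negative-sampling loss for one (context, word) pair:
    push v_c.v_w up, push v_c.v_n down for each sampled negative."""
    loss = -math.log(sigmoid(dot(v_c, v_w)))
    for v_n in negatives:
        loss += -math.log(sigmoid(-dot(v_c, v_n)))
    return loss

# Toy check: a well-aligned positive pair yields a lower loss
# than a misaligned one.
v_c = [1.0, 0.5]
good = ns_loss(v_c, [1.0, 0.5], [[-1.0, 0.0]])
bad = ns_loss(v_c, [-1.0, -0.5], [[1.0, 0.5]])
```

Gradient descent on this loss moves co-occurring vectors together and the sampled negatives apart, which is exactly the "maximize/minimize similarity" intuition above.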

Alter text in pandas column based on names

南笙酒味 submitted on 2019-12-18 09:51:22
Question: Background. I have the following sample df:

import pandas as pd
df = pd.DataFrame({'Text': ['Jon J Mmith is Here from **BLOCK** until **BLOCK**',
                            'No P_Name Found here',
                            'Jane Ann Doe is Also here until **BLOCK** ',
                            '**BLOCK** was **BLOCK** Tom Tcker is Not here but **BLOCK** '],
                   'P_ID': [1, 2, 3, 4],
                   'P_Name': ['Mmith, Jon J', 'Hder, Mary', 'Doe, Jane Ann', 'Tcker, Tom'],
                   'N_ID': ['A1', 'A2', 'A3', 'A4']})
# rearrange columns
df = df[['Text', 'N_ID', 'P_ID', 'P_Name']]
df

Text N_ID P_ID P_Name
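One common way to tackle this kind of task (a sketch, assuming the goal is to mask each row's own P_Name wherever it appears in that row's Text): reorder "Last, First M" into "First M Last", then replace it per row with apply. The placeholder **NAME** and the Masked column are my assumptions, not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    'Text': ['Jon J Mmith is Here', 'No P_Name Found here'],
    'P_Name': ['Mmith, Jon J', 'Hder, Mary'],
})

def natural_order(name):
    # 'Mmith, Jon J' -> 'Jon J Mmith'
    last, first = [p.strip() for p in name.split(',', 1)]
    return f"{first} {last}"

# Replace the row's own reordered name with a placeholder
# (assumption: names should be masked as **NAME**).
df['Masked'] = df.apply(
    lambda r: r['Text'].replace(natural_order(r['P_Name']), '**NAME**'),
    axis=1,
)
```

Rows whose Text does not contain the reordered name (like the second row here) pass through unchanged.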

match POS tag and sequence of words

大兔子大兔子 submitted on 2019-12-18 09:38:18
Question: I have the following two strings with their POS tags: Sent1: "something like how writer pro or phraseology works would be really cool." [('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')] Sent2: "more options like the syntax editor would be nice" [('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'), ('syntax
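One simple way to match a pattern like "the word 'like' followed by a run of nouns" over such (word, tag) lists is a small scan over the tags. A minimal sketch (it assumes, as a guess at the intent, that the target is the noun run after 'like'/IN):

```python
sent1 = [('something', 'NN'), ('like', 'IN'), ('how', 'WRB'),
         ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'),
         ('phraseology', 'NN'), ('works', 'NNS')]

def nouns_after_like(tagged):
    """Collect the first contiguous run of nouns after 'like'/IN."""
    out = []
    seen_like = False
    for word, tag in tagged:
        if (word, tag) == ('like', 'IN'):
            seen_like = True
            continue
        if seen_like:
            if tag.startswith('NN'):
                out.append(word)
            elif out:  # the run of nouns has ended
                break
    return out
```

For richer patterns, NLTK's RegexpParser lets you express the same idea as a chunk grammar over tags instead of hand-written loops.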

Parse sentence Stanford Parser by passing String not an array of strings

梦想的初衷 submitted on 2019-12-18 09:29:21
Question: Is it possible to parse a sentence with the Stanford Parser by passing a string rather than an array of strings? This is the example given in their short tutorial (see docs):

import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {
    public static void main(String[] args) {
        LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG

stemDocument in tm package not working on past-tense words

别来无恙 submitted on 2019-12-18 09:13:45
Question: I have a file 'check_text.txt' that contains "said say says make made". I'd like to perform stemming on it to get "say say say make make". I tried to use stemDocument in the tm package, as follows, but only got "said say say make made". Is there a way to stem past-tense words? Is it necessary to do so in real-world natural language processing? Thanks!

filename = 'check_text.txt'
con <- file(filename, "rb")
text_data <- readLines(con, skipNul = TRUE)
close(con)
text_VS <-
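The underlying issue is that stemmers are suffix strippers: "says" loses its -s, but irregular forms like "said" and "made" share no suffix with their base forms, so no stemmer can recover them. Mapping them to base forms is lemmatization, which relies on a lookup of irregular forms. A toy Python sketch with a hand-made two-entry table (purely illustrative; real lemmatizers such as NLTK's WordNetLemmatizer or spaCy cover far more forms):

```python
# Tiny irregular-verb lookup; a suffix-stripping stemmer cannot derive these.
IRREGULAR = {'said': 'say', 'made': 'make'}

def lemmatize(tokens):
    out = []
    for tok in tokens:
        if tok in IRREGULAR:
            out.append(IRREGULAR[tok])          # table lookup for irregulars
        elif tok.endswith('s') and len(tok) > 3:
            out.append(tok[:-1])                # crude 3rd-person -s strip
        else:
            out.append(tok)
    return out

print(lemmatize("said say says make made".split()))
```

Whether the extra effort pays off is task-dependent: for topic-level tasks stemming is usually enough, while tasks sensitive to verb identity benefit from lemmatization.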

Weka ignoring unlabeled data

谁说胖子不能爱 submitted on 2019-12-18 08:55:45
Question: I am working on an NLP classification project using the Naive Bayes classifier in Weka. I intend to use semi-supervised machine learning, hence working with unlabeled data. When I test the model obtained from my labeled training data on an independent set of unlabeled test data, Weka ignores all the unlabeled instances. Can anybody please guide me on how to solve this? Someone has already asked this question here before, but no appropriate solution was provided. Here is a sample test file:
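For context, an unlabeled test set in Weka still needs the class attribute declared, with '?' marking the missing label. A hypothetical minimal ARFF file of that shape (attribute names and values are made up for illustration):

```
@relation test_unlabeled

@attribute text string
@attribute class {pos, neg}

@data
'this product is great', ?
'terrible experience', ?
```

The catch is that Weka's Evaluation skips instances with a missing class, because there is no ground truth to score against; to obtain predictions for unlabeled data, apply the trained classifier directly to each instance (e.g. via classifyInstance or distributionForInstance) instead of running an evaluation.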

OpenNLP: foreign names do not get recognized

霸气de小男生 submitted on 2019-12-18 08:27:31
Question: I just started using OpenNLP to recognize names. I am using the model (en-ner-person.bin) that comes with OpenNLP. I noticed that while it recognizes US, UK, and European names, it fails to recognize Indian or Japanese names. My questions are: (1) are there models already available that I can use to recognize foreign names? (2) If not, I believe I will need to train new models; in that case, is there a corpus available that I can use? Answer 1: You can make your own model with your data
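For reference, OpenNLP's name finder trains on plain text with one whitespace-tokenized sentence per line and names wrapped in span tags. A small fragment of that format (the names and sentences here are invented examples):

```
<START:person> Rajesh Kumar <END> flew to Chennai on Monday .
<START:person> Haruki Tanaka <END> joined the meeting from Tokyo .
```

A few thousand such annotated sentences containing the name distributions you care about (Indian, Japanese, etc.) are typically needed before a trained model generalizes usefully.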

Identifying and Classifying Chinese Sentence Types in NLP

依然范特西╮ submitted on 2019-12-18 08:21:15
Contents
I. Main categories of Chinese sentence types
  1. Declarative sentences (statement)
  2. Special sentences (special)
  3. Interrogative sentences (question)
II. A brief analysis of Chinese sentence types
III. Combining syntactic parsing with regular expressions to label sentence types
IV. Sentence-type survey and rule summary
V. Implementing sentypes, a Chinese sentence-type classification tool

I. Main categories of Chinese sentence types

1. Declarative sentences (statement)
- Subject-initial (subject_front), e.g. 大家对这件事都很热心 ("Everyone is very enthusiastic about this matter")
- Topic-initial (theme_front), e.g. 红绿灯，真好玩 ("Traffic lights, such fun")
- Complex (complex), e.g. 他们飞的好高好远，穿过白云，越过海洋 ("They fly high and far, through the clouds and over the ocean")

2. Special sentences (special)
- 把-construction (ba_struct), e.g. 阳光把冷冷的冬天赶走了 ("The sunshine drove the cold winter away")
- 被-construction (bei_struct), e.g. 衣服被雨淋湿了 ("The clothes were soaked by the rain")
- Existential (exist), e.g. 门口有两头狮子 ("There are two lions at the gate")
- Exclamatory (sigh), e.g. 真谢谢你！ ("Thank you so much!")
- Imperative (Imperative), e.g. 小心！ ("Careful!")
- 连-construction (lian_struct), e.g. 我不但眼睛不舒服，好像连耳朵也有点疼 ("Not only are my eyes uncomfortable, even my ears seem to hurt a little")
- 是-construction (shi_struct), e.g. 我的爸爸是老师 ("My father is a teacher")
- Comparative (compare), e.g. 我的力气比你大 ("I am stronger than you")

3. Interrogative sentences (question)
- Wh-questions (question_words), e.g. 你什么时候回来 ("When will you come back?")
- Yes-no questions (whether), e.g. 你今天会准时下课吗 ("Will your class end on time today?")
- Alternative questions (choice), e.g. 他是坐火车来的，还是坐汽车来的 ("Did he come by train or by car?")
- A-not-A questions (pos_and
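The regex half of the parsing-plus-regex approach described above can be sketched with a few surface patterns. This is a hypothetical minimal rule set covering only a handful of the categories listed (the actual sentypes tool combines syntactic parsing with many more rules):

```python
import re

# Hypothetical minimal rules, checked in order; the first match wins.
RULES = [
    # question particles at the end, or wh-words / 还是 anywhere
    ('question', re.compile(r'(吗|呢)\s*[?？]?$|什么|哪|谁|怎么|还是')),
    ('ba_struct', re.compile(r'把')),    # 把-construction
    ('bei_struct', re.compile(r'被')),   # 被-construction
    ('sigh', re.compile(r'[!！]$')),     # exclamatory
]

def classify(sentence):
    for label, pattern in RULES:
        if pattern.search(sentence):
            return label
    return 'statement'  # fall-through default
```

Surface patterns alone misfire on sentences where 把/被 are not grammatical markers, which is exactly why the article pairs them with syntactic parsing.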