
Python Arabic NLP

有些话、适合烂在心里 submitted on 2019-12-18 10:14:05
Question: I'm in the process of assessing the capabilities of NLTK for processing Arabic text in a research project to analyze and extract sentiment. The questions are as follows: Is NLTK capable of handling and analyzing Arabic text? Is Python capable of manipulating/tokenizing Arabic text? Will I be able to parse and store Arabic text using Python? If Python and NLTK aren't the tools for this job, what tools would you recommend (if any exist)? Thank you. EDIT Based on research: NLTK is only
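On the "can Python manipulate Arabic text" part: Python 3 strings are Unicode, so Arabic text can be stored, sliced, and matched directly. Below is a minimal sketch of a regex tokenizer over the Arabic Unicode block; this is a hypothetical illustration, not NLTK's tokenizer (NLTK's word_tokenize produces similar whitespace/punctuation splits, but its stemming and tagging models are English-centric).

```python
import re

# Python 3 str is Unicode, so Arabic text needs no special handling.
text = "اللغة العربية جميلة"

# Tokenize on runs of Arabic letters (hypothetical minimal tokenizer
# using the Arabic Unicode block U+0600-U+06FF).
tokens = re.findall(r"[\u0600-\u06FF]+", text)
print(tokens)
```

Storing or parsing such tokens afterwards is ordinary Python string handling; the harder part (Arabic stemming, POS tagging) needs Arabic-specific models.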

What are the major differences and benefits of Porter and Lancaster Stemming algorithms? [closed]

非 Y 不嫁゛ submitted on 2019-12-18 10:04:29
Question: Closed. This question needs to be more focused and is not currently accepting answers; it was closed 3 years ago. I'm working on document classification tasks in Java. Both algorithms came highly recommended; what are the benefits and disadvantages of each, and which is more commonly used in the literature for natural language processing tasks? Answer 1: At the very basics of it, the major
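The difference is easy to see by running both stemmers side by side. A quick sketch using NLTK's implementations (this assumes the nltk package is installed; Lancaster is the more aggressive of the two and often truncates words harder than Porter):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Lancaster tends to cut more aggressively than Porter.
for word in ["running", "maximum", "crying"]:
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
```

In practice Porter (or its Snowball successor) is the conservative default seen most often in the literature; Lancaster's aggressiveness can conflate unrelated words, which may hurt document classification.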

word2vec: negative sampling (in layman term)?

﹥>﹥吖頭↗ submitted on 2019-12-18 09:54:36
Question: I'm reading the paper below and I have some trouble understanding the concept of negative sampling. http://arxiv.org/pdf/1402.3722v1.pdf Can anyone help, please? Answer 1: The idea of word2vec is to maximize the similarity (dot product) between the vectors of words that appear close together (in the context of each other) in text, and to minimize the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have v_c * v_w -------------
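The point of negative sampling is that instead of normalizing over the whole vocabulary, each (context, word) pair is scored against only a handful of sampled "negative" words. A pure-Python sketch of the per-pair loss -log σ(v_c·v_w) - Σ log σ(-v_c·v_n), with made-up 2-d toy vectors (an illustration of the objective, not the paper's full training loop):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ns_loss(v_c, v_w, negatives):
    """Negative-sampling loss for one (context, word) pair:
    push v_c.v_w up, push v_c.v_n down for each sampled negative."""
    loss = -math.log(sigmoid(dot(v_c, v_w)))
    for v_n in negatives:
        loss += -math.log(sigmoid(-dot(v_c, v_n)))
    return loss

# Toy check: a well-aligned positive pair yields a lower loss
# than a misaligned one.
v_c = [1.0, 0.5]
good = ns_loss(v_c, [1.0, 0.5], [[-1.0, 0.0]])
bad = ns_loss(v_c, [-1.0, -0.5], [[1.0, 0.5]])
```

Gradient descent on this loss moves co-occurring vectors together and the sampled negatives apart, which is exactly the "maximize/minimize similarity" intuition above.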

Alter text in pandas column based on names

南笙酒味 submitted on 2019-12-18 09:51:22
Question: Background. I have the following sample df:

import pandas as pd
df = pd.DataFrame({'Text': ['Jon J Mmith is Here from **BLOCK** until **BLOCK**',
                            'No P_Name Found here',
                            'Jane Ann Doe is Also here until **BLOCK** ',
                            '**BLOCK** was **BLOCK** Tom Tcker is Not here but **BLOCK** '],
                   'P_ID': [1, 2, 3, 4],
                   'P_Name': ['Mmith, Jon J', 'Hder, Mary', 'Doe, Jane Ann', 'Tcker, Tom'],
                   'N_ID': ['A1', 'A2', 'A3', 'A4']})
# rearrange columns
df = df[['Text', 'N_ID', 'P_ID', 'P_Name']]
df

Text N_ID P_ID P_Name
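One common way to tackle this kind of task (a sketch, assuming the goal is to mask each row's own P_Name wherever it appears in that row's Text): reorder "Last, First M" into "First M Last", then replace it per row with apply. The placeholder **NAME** and the Masked column are my assumptions, not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    'Text': ['Jon J Mmith is Here', 'No P_Name Found here'],
    'P_Name': ['Mmith, Jon J', 'Hder, Mary'],
})

def natural_order(name):
    # 'Mmith, Jon J' -> 'Jon J Mmith'
    last, first = [p.strip() for p in name.split(',', 1)]
    return f"{first} {last}"

# Replace the row's own reordered name with a placeholder
# (assumption: names should be masked as **NAME**).
df['Masked'] = df.apply(
    lambda r: r['Text'].replace(natural_order(r['P_Name']), '**NAME**'),
    axis=1,
)
```

Rows whose Text does not contain the reordered name (like the second row here) pass through unchanged.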

match POS tag and sequence of words

大兔子大兔子 submitted on 2019-12-18 09:38:18
Question: I have the following two strings with their POS tags: Sent1: "something like how writer pro or phraseology works would be really cool." [('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')] Sent2: "more options like the syntax editor would be nice" [('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'), ('syntax
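One simple way to match a pattern like "the word 'like' followed by a run of nouns" over such (word, tag) lists is a small scan over the tags. A minimal sketch (it assumes, as a guess at the intent, that the target is the noun run after 'like'/IN):

```python
sent1 = [('something', 'NN'), ('like', 'IN'), ('how', 'WRB'),
         ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'),
         ('phraseology', 'NN'), ('works', 'NNS')]

def nouns_after_like(tagged):
    """Collect the first contiguous run of nouns after 'like'/IN."""
    out = []
    seen_like = False
    for word, tag in tagged:
        if (word, tag) == ('like', 'IN'):
            seen_like = True
            continue
        if seen_like:
            if tag.startswith('NN'):
                out.append(word)
            elif out:  # the run of nouns has ended
                break
    return out
```

For richer patterns, NLTK's RegexpParser lets you express the same idea as a chunk grammar over tags instead of hand-written loops.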

Parse sentence Stanford Parser by passing String not an array of strings

梦想的初衷 submitted on 2019-12-18 09:29:21
Question: Is it possible to parse a sentence with the Stanford Parser by passing a string rather than an array of strings? This is the example given in their short tutorial (see docs):

import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {
    public static void main(String[] args) {
        LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG

stemDocument in tm package not working on past-tense words

别来无恙 submitted on 2019-12-18 09:13:45
Question: I have a file 'check_text.txt' that contains "said say says make made". I'd like to perform stemming on it to get "say say say make make". I tried to use stemDocument in the tm package, as follows, but only got "said say say make made". Is there a way to stem past-tense words? Is it necessary to do so in real-world natural language processing? Thanks!

filename = 'check_text.txt'
con <- file(filename, "rb")
text_data <- readLines(con, skipNul = TRUE)
close(con)
text_VS <-
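The underlying issue is that stemmers are suffix strippers: "says" loses its -s, but irregular forms like "said" and "made" share no suffix with their base forms, so no stemmer can recover them. Mapping them to base forms is lemmatization, which relies on a lookup of irregular forms. A toy Python sketch with a hand-made two-entry table (purely illustrative; real lemmatizers such as NLTK's WordNetLemmatizer or spaCy cover far more forms):

```python
# Tiny irregular-verb lookup; a suffix-stripping stemmer cannot derive these.
IRREGULAR = {'said': 'say', 'made': 'make'}

def lemmatize(tokens):
    out = []
    for tok in tokens:
        if tok in IRREGULAR:
            out.append(IRREGULAR[tok])          # table lookup for irregulars
        elif tok.endswith('s') and len(tok) > 3:
            out.append(tok[:-1])                # crude 3rd-person -s strip
        else:
            out.append(tok)
    return out

print(lemmatize("said say says make made".split()))
```

Whether the extra effort pays off is task-dependent: for topic-level tasks stemming is usually enough, while tasks sensitive to verb identity benefit from lemmatization.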

Weka ignoring unlabeled data

谁说胖子不能爱 submitted on 2019-12-18 08:55:45
Question: I am working on an NLP classification project using the Naive Bayes classifier in Weka. I intend to use semi-supervised machine learning, hence working with unlabeled data. When I test the model obtained from my labeled training data on an independent set of unlabeled test data, Weka ignores all the unlabeled instances. Can anybody please guide me on how to solve this? Someone has already asked this question here before, but no appropriate solution was provided. Here is a sample test file:
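For context, an unlabeled test set in Weka still needs the class attribute declared, with '?' marking the missing label. A hypothetical minimal ARFF file of that shape (attribute names and values are made up for illustration):

```
@relation test_unlabeled

@attribute text string
@attribute class {pos, neg}

@data
'this product is great', ?
'terrible experience', ?
```

The catch is that Weka's Evaluation skips instances with a missing class, because there is no ground truth to score against; to obtain predictions for unlabeled data, apply the trained classifier directly to each instance (e.g. via classifyInstance or distributionForInstance) instead of running an evaluation.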

OpenNLP: foreign names do not get recognized

霸气de小男生 submitted on 2019-12-18 08:27:31
Question: I just started using OpenNLP to recognize names. I am using the model (en-ner-person.bin) that comes with OpenNLP. I noticed that while it recognizes US, UK, and European names, it fails to recognize Indian or Japanese names. My questions are: (1) are there models already available that I can use to recognize foreign names? (2) If not, I believe I will need to train new models; in that case, is there a corpus available that I can use? Answer 1: You can make your own model with your data
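For reference, OpenNLP's name finder trains on plain text with one whitespace-tokenized sentence per line and names wrapped in span tags. A small fragment of that format (the names and sentences here are invented examples):

```
<START:person> Rajesh Kumar <END> flew to Chennai on Monday .
<START:person> Haruki Tanaka <END> joined the meeting from Tokyo .
```

A few thousand such annotated sentences containing the name distributions you care about (Indian, Japanese, etc.) are typically needed before a trained model generalizes usefully.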

Identifying and Classifying Chinese Sentence Types in NLP

依然范特西╮ submitted on 2019-12-18 08:21:15
Contents
I. Main categories of Chinese sentence types
  1. Declarative sentences (statement)
  2. Special sentences (special)
  3. Interrogative sentences (question)
II. A brief analysis of Chinese sentence types
III. Combining syntactic parsing with regular expressions to label sentence types
IV. Sentence-type survey and rule summary
V. Implementing sentypes, a Chinese sentence-type classification tool

I. Main categories of Chinese sentence types

1. Declarative sentences (statement)
- Subject-initial (subject_front), e.g. 大家对这件事都很热心 ("Everyone is very enthusiastic about this matter")
- Topic-initial (theme_front), e.g. 红绿灯，真好玩 ("Traffic lights, such fun")
- Complex (complex), e.g. 他们飞的好高好远，穿过白云，越过海洋 ("They fly high and far, through the clouds and over the ocean")

2. Special sentences (special)
- 把-construction (ba_struct), e.g. 阳光把冷冷的冬天赶走了 ("The sunshine drove the cold winter away")
- 被-construction (bei_struct), e.g. 衣服被雨淋湿了 ("The clothes were soaked by the rain")
- Existential (exist), e.g. 门口有两头狮子 ("There are two lions at the gate")
- Exclamatory (sigh), e.g. 真谢谢你！ ("Thank you so much!")
- Imperative (Imperative), e.g. 小心！ ("Careful!")
- 连-construction (lian_struct), e.g. 我不但眼睛不舒服，好像连耳朵也有点疼 ("Not only are my eyes uncomfortable, even my ears seem to hurt a little")
- 是-construction (shi_struct), e.g. 我的爸爸是老师 ("My father is a teacher")
- Comparative (compare), e.g. 我的力气比你大 ("I am stronger than you")

3. Interrogative sentences (question)
- Wh-questions (question_words), e.g. 你什么时候回来 ("When will you come back?")
- Yes-no questions (whether), e.g. 你今天会准时下课吗 ("Will your class end on time today?")
- Alternative questions (choice), e.g. 他是坐火车来的，还是坐汽车来的 ("Did he come by train or by car?")
- A-not-A questions (pos_and
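The regex half of the parsing-plus-regex approach described above can be sketched with a few surface patterns. This is a hypothetical minimal rule set covering only a handful of the categories listed (the actual sentypes tool combines syntactic parsing with many more rules):

```python
import re

# Hypothetical minimal rules, checked in order; the first match wins.
RULES = [
    # question particles at the end, or wh-words / 还是 anywhere
    ('question', re.compile(r'(吗|呢)\s*[?？]?$|什么|哪|谁|怎么|还是')),
    ('ba_struct', re.compile(r'把')),    # 把-construction
    ('bei_struct', re.compile(r'被')),   # 被-construction
    ('sigh', re.compile(r'[!！]$')),     # exclamatory
]

def classify(sentence):
    for label, pattern in RULES:
        if pattern.search(sentence):
            return label
    return 'statement'  # fall-through default
```

Surface patterns alone misfire on sentences where 把/被 are not grammatical markers, which is exactly why the article pairs them with syntactic parsing.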