pos-tagger

Match POS tag and sequence of words

Submitted by 大兔子大兔子 on 2019-12-18 09:38:18
Question: I have the following two strings with their POS tags: Sent1: "something like how writer pro or phraseology works would be really cool." [('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')] Sent2: "more options like the syntax editor would be nice" [('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'), ('syntax
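The question is truncated, but one common way to match such sentences is to compare their tag sequences rather than the words themselves. A minimal stdlib-only sketch (the tags for the rest of Sent2 are guessed here for illustration, since the original listing is cut off): collect every tag n-gram shared by both sentences, so a pattern like "MD VB" surfaces as a match.

```python
# Sketch: find tag n-grams common to two POS-tagged sentences.

def tag_ngrams(tagged, n):
    """Set of n-grams over the tag sequence of a (word, tag) list."""
    tags = [tag for _, tag in tagged]
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}

def shared_tag_patterns(sent_a, sent_b, n=2):
    """Tag n-grams that occur in both tagged sentences."""
    return tag_ngrams(sent_a, n) & tag_ngrams(sent_b, n)

sent1 = [('something', 'NN'), ('like', 'IN'), ('how', 'WRB'),
         ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'),
         ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'),
         ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]
# Tags after 'syntax' are hypothetical -- the question cuts off there.
sent2 = [('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'),
         ('the', 'DT'), ('syntax', 'NN'), ('editor', 'NN'),
         ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ')]

print(shared_tag_patterns(sent1, sent2))  # {('NN', 'NN'), ('MD', 'VB')}
```

Raising `n` tightens the match; with `n=2` both sentences share a noun-noun compound and the "would be" modal-verb pattern.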

Train spaCy's existing POS tagger with my own training examples

Submitted by 烂漫一生 on 2019-12-14 03:44:25
Question: I am trying to train the existing POS tagger on my own lexicon, not starting from scratch (I do not want to create an "empty model"). spaCy's documentation says "Load the model you want to start with", and the next step is "Add the tag map to the tagger using the add_label method". However, when I try to load the small English model and add the tag map, it throws this error: ValueError: [T003] Resizing pre-trained Tagger models is not currently supported. I was wondering how it can be

How do I use non-integer string labels with SVM from scikit-learn? Python

Submitted by 风流意气都作罢 on 2019-12-14 03:42:16
Question: Scikit-learn has fairly user-friendly Python modules for machine learning. I am trying to train an SVM tagger for natural language processing (NLP), where my labels and input data are words and annotations, e.g. part-of-speech tagging. Rather than using double/integer data as input tuples like [[1,2], [2,0]], my tuples look like [['word','NOUN'], ['young', 'adjective']]. Can anyone give an example of how I can use the SVM with string tuples? The tutorial/documentation given here are for
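The core of the answer is that an SVM only accepts numeric input, so string features and labels must be mapped to numbers first. scikit-learn ships tools for exactly this (`DictVectorizer`, `LabelEncoder`), but the mapping is simple enough to sketch by hand with the stdlib; the sample data here is assumed, not from the question:

```python
# Sketch: map string features and string labels to integer ids,
# producing the numeric X, y that an SVM (e.g. sklearn.svm.SVC) expects.

def build_index(values):
    """Assign each distinct string a stable integer id."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return index

samples = [('word', 'NOUN'), ('young', 'ADJ'), ('run', 'VERB')]
words = build_index(w for w, _ in samples)    # feature vocabulary
labels = build_index(t for _, t in samples)   # label vocabulary

X = [[words[w]] for w, _ in samples]   # numeric feature vectors
y = [labels[t] for _, t in samples]    # numeric class labels
print(X, y)   # [[0], [1], [2]] [0, 1, 2]
```

One caveat worth stating: a single integer id is a poor SVM feature, since it imposes a spurious ordering on words. In practice one-hot encoding (what `DictVectorizer` produces) is the usual choice; the sketch above only shows the string-to-number step the question is stuck on.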

Unknown symbol in nltk pos tagging for Arabic

Submitted by 孤人 on 2019-12-13 02:59:48
Question: I have used NLTK to tokenize some Arabic text. However, I ended up with results like (u'an arabic character/word', '``') or (u'an arabic character/word', ':'), and the `` and : tags are not listed in the documentation, so I would like to find out what they are. import nltk from nltk.tokenize.punkt import PunktWordTokenizer z = "أنا تسلق شجرة" tkn = PunktWordTokenizer() sent = tkn.tokenize(z) tokens = nltk.pos_tag(sent) print tokens Answer 1: The default NLTK POS tagger is trained on English text and is
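The mystery tags are standard Penn Treebank punctuation tags, which the English-trained default tagger assigns even to tokens it cannot interpret. A short reference map of the Penn Treebank punctuation conventions (this map is written out here for illustration; NLTK itself documents these via `nltk.help.upenn_tagset`):

```python
# Penn Treebank punctuation tags, which look like symbols rather than
# the familiar alphabetic tags (NN, VB, ...).

PUNCT_TAGS = {
    '``': 'opening quotation mark',
    "''": 'closing quotation mark',
    ':': 'colon, semicolon, or ellipsis',
    ',': 'comma',
    '.': 'sentence-final punctuation',
}

def explain_tag(tag):
    return PUNCT_TAGS.get(tag, 'not a punctuation tag')

print(explain_tag('``'))   # opening quotation mark
print(explain_tag(':'))    # colon, semicolon, or ellipsis
```

So the output is not an error: the tagger is simply falling back on punctuation-style tags for tokens outside its English training vocabulary, which is why an Arabic-specific tagger is the real fix.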

extracting sentences from pos-tagged corpus with certain word, tag combos

Submitted by 二次信任 on 2019-12-12 20:17:48
Question: I'm playing with the Brown corpus, specifically the tagged sentences in "news". I've found that "to" is the word with the most ambiguous tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I'm trying to write code that will print one sentence from the corpus for each tag associated with "to". The sentences do not need to be "cleaned" of the tags; each just needs to contain "to" with one of the associated POS tags. brown_sents = nltk.corpus.brown.tagged_sents(categories="news") for sent in brown
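The code in the question is cut off, but the logic it reaches for fits in a few lines: walk the tagged sentences once and keep the first sentence seen for each tag that "to" carries. A sketch with toy data standing in for `nltk.corpus.brown.tagged_sents(categories="news")` (the toy sentences and tags are invented for illustration):

```python
# Sketch: one example sentence per POS tag of a target word.

toy_sents = [
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN')],
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')],
    [('Back', 'RB'), ('to', 'IN'), ('work', 'NN')],   # IN already seen
]

def one_sentence_per_tag(tagged_sents, word='to'):
    found = {}
    for sent in tagged_sents:
        for w, tag in sent:
            if w.lower() == word and tag not in found:
                found[tag] = sent   # first sentence exhibiting this tag
    return found

examples = one_sentence_per_tag(toy_sents)
for tag, sent in examples.items():
    print(tag, '->', ' '.join(w for w, _ in sent))
```

Run over the real Brown "news" sentences, `found` would end up with one entry for each of the six tags the question lists, and the loop can stop early once all six are collected.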

How to build POS-tagged corpus with NLTK?

Submitted by 穿精又带淫゛_ on 2019-12-12 14:23:52
Question: I am trying to build a POS-tagged corpus from external .txt files for chunking and for entity and relation extraction. So far I have found a cumbersome multistep solution: Read the files into a plain-text corpus: from nltk.corpus.reader import PlaintextCorpusReader my_corp = PlaintextCorpusReader(".", r".*\.txt") Tag the corpus with the built-in Penn POS tagger: my_tagged_corp = nltk.batch_pos_tag(my_corp.sents()) (By the way, at this point Python threw an error: NameError: name 'batch' is not defined) Write
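Two notes on the steps above. First, `nltk.batch_pos_tag` was removed from newer NLTK releases (hence the NameError); `nltk.pos_tag_sents` is the current name for tagging a list of sentences. Second, the "write it back out" step needs no NLTK at all: the conventional on-disk format for a tagged corpus is one sentence per line of `word/TAG` tokens, which NLTK's `TaggedCorpusReader` reads back. A stdlib-only sketch of that serialization round trip (the sample sentences are assumptions for illustration):

```python
# Sketch: serialize tagged sentences to the word/TAG text format and
# parse them back, so the result can live as a plain .txt corpus.

tagged_sents = [
    [('A', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('It', 'PRP'), ('runs', 'VBZ')],
]

def to_tagged_text(sents):
    """One sentence per line, tokens as word/TAG."""
    return '\n'.join(' '.join(f'{w}/{t}' for w, t in sent)
                     for sent in sents)

def from_tagged_text(text):
    """Inverse of to_tagged_text; rsplit tolerates '/' inside words."""
    return [[tuple(tok.rsplit('/', 1)) for tok in line.split()]
            for line in text.splitlines() if line.strip()]

serialized = to_tagged_text(tagged_sents)
print(serialized.splitlines()[0])          # A/DT dog/NN barks/VBZ
assert from_tagged_text(serialized) == tagged_sents
```

Writing `serialized` to a .txt file gives a corpus that downstream chunking and extraction tools can re-read without repeating the tagging step.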

Stanford Core NLP how to get the probability & margin of error

Submitted by 夙愿已清 on 2019-12-12 08:36:38
Question: When using the parser, or for that matter any of the annotators in Core NLP, is there a way to access the probability or the margin of error? To put my question into context, I am trying to understand whether there is a way to programmatically detect a case of ambiguity. For instance, in the sentence below the verb "desire" is detected as a noun. I would like to know what kind of measure I can access or calculate from the Core NLP API to tell me there could be an ambiguity. (NP (NP (NNP
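Whatever per-tag scores a given CoreNLP annotator exposes, the generic ambiguity signal the question is after is the margin between the two most probable analyses: a small gap means the model nearly chose the other reading. A stdlib sketch over a hypothetical tag distribution (the probabilities below are invented; they stand in for whatever scores the API provides):

```python
# Sketch: flag likely ambiguity from a tag probability distribution.

def ambiguity_margin(tag_probs):
    """Gap between the best and second-best probability.
    A small margin suggests the model found the token ambiguous."""
    ranked = sorted(tag_probs.values(), reverse=True)
    return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]

# Hypothetical scores for a token like "desire": noun barely beats verb.
probs = {'NN': 0.48, 'VB': 0.44, 'JJ': 0.08}
margin = ambiguity_margin(probs)
print(round(margin, 2))   # 0.04 -> well under a 0.1 threshold
```

Thresholding this margin (the 0.1 cutoff here is arbitrary) gives a programmatic "possibly ambiguous" flag for tokens like the noun/verb reading of "desire".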

Remove tags of POS tagger

Submitted by 不羁岁月 on 2019-12-12 04:08:57
Question: Is it possible to remove the tags from the sentences? One could accomplish this by scanning through the file, finding the tags, and removing them, but since there are many tags (some models have 30+, some around 48-50; they basically follow the Penn Treebank POS tags), is there a faster, cleaner way to remove them? I checked the API, but there was no method for removing tags. Answer 1: There's nothing special built in for this, but since the output includes both
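The answer's point is that no tag inventory needs to be enumerated at all: tagged output carries the word and the tag side by side, so untagging is just dropping the tag half, regardless of how many tags the model uses. A sketch covering the two common output shapes, (word, tag) pairs and `word/TAG` strings:

```python
# Sketch: recover the plain sentence from tagged output without
# knowing the tagset.

def untag_pairs(tagged):
    """[('The', 'DT'), ...] -> 'The ...'"""
    return ' '.join(word for word, _ in tagged)

def untag_slashed(text):
    """'The/DT dog/NN' -> 'The dog'; rsplit keeps '/' inside words."""
    return ' '.join(tok.rsplit('/', 1)[0] for tok in text.split())

print(untag_pairs([('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]))
print(untag_slashed('The/DT dog/NN barks/VBZ'))
# both print: The dog barks
```

Because the split keys on the separator rather than on any particular tag, the same two lines work whether the model emits 30 tags or 50.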

Running Stanford POS tagger in NLTK leads to “not a valid Win32 application” on Windows

Submitted by 允我心安 on 2019-12-11 21:49:08
Question: I am trying to use the Stanford POS tagger in NLTK with the following code: import nltk from nltk.tag.stanford import POSTagger st = POSTagger('E:\Assistant\models\english-bidirectional-distsim.tagger', 'E:\Assistant\stanford-postagger.jar') st.tag('What is the airspeed of an unladen swallow?'.split()) and here is the output: Traceback (most recent call last): File "E:\J2EE\eclipse\WSNLP\nlp\src\tagger.py", line 5, in <module> st.tag('What is the airspeed of an unladen swallow?'.split()) File "C:
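Two separate pitfalls are worth checking here (stated as likely diagnoses, not confirmed from the truncated traceback). First, "not a valid Win32 application" on Windows typically means a 32-bit/64-bit mismatch between the Python process and the java.exe it launches for the Stanford jar. Second, Windows paths written in ordinary Python string literals can be silently corrupted by backslash escapes; raw strings avoid that. A small demonstration of the second point:

```python
import platform

# '\t' in a plain literal becomes a TAB character; the raw string
# keeps every backslash verbatim.
plain = 'C:\temp\tagger'
raw = r'C:\temp\tagger'

print(len(plain), len(raw))   # 12 14 -- two characters silently lost
print('\t' in plain)          # True: a tab crept into the "path"

# For the bitness check: this must match the installed JVM's bitness.
print(platform.architecture()[0])   # '32bit' or '64bit'
```

The paths in the question happen to dodge the escape problem, but using `r'E:\...'` literals rules it out entirely, leaving the Python/JVM bitness mismatch as the thing to fix.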

How to choose the suffix length for smoothing in part-of-speech tagging

Submitted by 删除回忆录丶 on 2019-12-11 18:52:47
Question: I am writing a part-of-speech tagger and I handle unknown words via their suffix. The main issue is how to decide the suffix length: should it be fixed in advance (as in the Weischedel approach), or should I take the last few letters of each word (as in the Samuelsson approach)? Which approach would be better? Answer 1: Quick googling suggests that the Weischedel approach is sufficient for English, which has only rudimentary morphological inflection. The Samuelsson approach seems to
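To make the trade-off concrete, here is a minimal stdlib sketch of suffix-based tag guessing with a fixed maximum suffix length (a Weischedel-style pre-decided cap), backing off from the longest suffix to shorter ones at lookup time. The training pairs are toy data invented for illustration:

```python
# Sketch: guess tags for unknown words from suffix statistics.

from collections import Counter, defaultdict

def train_suffix_model(tagged_words, max_len=3):
    """Count tag frequencies for every word suffix up to max_len chars."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, max_len + 1):
            if len(word) > n:
                counts[word[-n:]][tag] += 1
    return counts

def guess_tag(model, word, max_len=3, default='NN'):
    """Back off from the longest known suffix to shorter ones."""
    for n in range(max_len, 0, -1):
        suffix = word[-n:]
        if suffix in model:
            return model[suffix].most_common(1)[0][0]
    return default

train = [('running', 'VBG'), ('eating', 'VBG'), ('quickly', 'RB'),
         ('happily', 'RB'), ('dog', 'NN')]
model = train_suffix_model(train)
print(guess_tag(model, 'jumping'))   # VBG, via suffix 'ing'
print(guess_tag(model, 'slowly'))    # RB, via back-off to suffix 'ly'
```

The fixed `max_len=3` is where the two approaches diverge: Weischedel-style taggers pre-decide this cap, while Samuelsson-style taggers let the data determine how much of the word ending to trust, which matters more for morphologically richer languages than English.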