POS tagging in German

予麋鹿 2020-12-12 21:28

I am using NLTK to extract nouns from a text-string starting with the following command:

tagged_text = nltk.pos_tag(nltk.Text(nltk.word_tokenize(some_string)))


        
5 Answers
  • 2020-12-12 22:05

    Natural language software does its magic by leveraging corpora and the statistics they provide. You'll need to tell nltk about some German corpus to help it tokenize German correctly. I believe the EUROPARL corpus might help get you going.

    See nltk.corpus.europarl_raw and this answer for example configuration.

    Also, consider tagging this question with "nlp".

  • 2020-12-12 22:10

    Part-of-speech (POS) tagging is very specific to a particular [natural] language. NLTK includes many different taggers, which use distinct techniques to infer the tag of a given token in a given sentence. Most (but not all) of these taggers use a statistical model as the main or sole device to "do the trick". Such taggers require "training data" from which to build the statistical representation of the language, and that training data comes in the form of corpora.

    The NLTK distribution itself includes many of these corpora, as well as a set of "corpus readers" which provide an API to read different types of corpora. I don't know the current state of affairs in NLTK proper, or whether it includes any German corpus. You can however locate some free corpora, convert them to a format that satisfies the proper NLTK corpus reader, and then use that to train a POS tagger for the German language.

    You can even create your own corpus, but that is a hell of a painstaking job; if you work in a university, you'd better find ways of bribing and otherwise coercing students to do that for you ;-)
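
    The statistical idea can be illustrated without any corpus infrastructure. A toy sketch in pure Python, using an invented two-sentence mini-corpus (STTS-style tags) in place of real training data, that tags each word with the tag it carried most often in training:

```python
from collections import Counter, defaultdict

# Invented miniature "training corpus" of (word, tag) pairs, standing in
# for a real annotated German corpus.
train = [[('Die', 'ART'), ('Katze', 'NN'), ('schläft', 'VVFIN')],
         [('Der', 'ART'), ('Hund', 'NN'), ('bellt', 'VVFIN')]]

# Count how often each word carries each tag.
counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1

def tag_word(word):
    # Pick the most frequent tag seen for this word; None if never seen.
    return counts[word].most_common(1)[0][0] if word in counts else None

print([(w, tag_word(w)) for w in ['Die', 'Katze', 'miaut']])
# [('Die', 'ART'), ('Katze', 'NN'), ('miaut', None)]
```

    Real taggers add smarter fallbacks for unseen words (affix models, context), but the corpus-driven core is the same.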

  • 2020-12-12 22:16

    The Pattern library includes a function for parsing German sentences, and the result includes the part-of-speech tags. The following is adapted from their documentation:

    from pattern.de import parse, split
    s = parse('Die Katze liegt auf der Matte.')
    s = split(s)
    print(s.sentences[0])

    >>> Sentence('Die/DT/B-NP/O Katze/NN/I-NP/O liegt/VB/B-VP/O'
                 'auf/IN/B-PP/B-PNP der/DT/B-NP/I-PNP Matte/NN/I-NP/I-PNP ././O/O')
    

    If you prefer the STTS tag set, you can set the optional parameter tagset="STTS".
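
    Each token in Pattern's tagged string packs word, POS, chunk and role into one slash-separated field; splitting off the first two fields recovers plain (word, tag) pairs. A small sketch over the output string shown above:

```python
# Tagged string as produced by the Pattern example above.
tagged = ('Die/DT/B-NP/O Katze/NN/I-NP/O liegt/VB/B-VP/O '
          'auf/IN/B-PP/B-PNP der/DT/B-NP/I-PNP Matte/NN/I-NP/I-PNP ././O/O')

# Keep only word and POS; the final '././O/O' token also splits cleanly.
pairs = [tuple(tok.split('/')[:2]) for tok in tagged.split()]
print(pairs[:2])  # [('Die', 'DT'), ('Katze', 'NN')]
```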

    Update: Another option is spacy, there is a quick example in this blog article:

    import spacy
    
    nlp = spacy.load('de')
    doc = nlp(u'Ich bin ein Berliner.')
    
    # show universal pos tags
    print(' '.join('{word}/{tag}'.format(word=t.orth_, tag=t.pos_) for t in doc))
    # output: Ich/PRON bin/AUX ein/DET Berliner/NOUN ./PUNCT
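
    Since the original goal was extracting nouns, the universal tags make the filter simple. A sketch over the token/tag pairs from the example output, hard-coded here so it runs without the spaCy model installed:

```python
# (word, universal POS) pairs as printed by the spaCy example above.
tagged = [('Ich', 'PRON'), ('bin', 'AUX'), ('ein', 'DET'),
          ('Berliner', 'NOUN'), ('.', 'PUNCT')]

# Keep common nouns; add 'PROPN' to the set if proper nouns are wanted too.
nouns = [word for word, tag in tagged if tag in ('NOUN', 'PROPN')]
print(nouns)  # ['Berliner']
```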
    
  • Possibly you can use the Stanford POS tagger. Below is a recipe I wrote. I've also compiled Python recipes for German NLP; you can access them at http://htmlpreview.github.io/?https://github.com/alvations/DLTK/blob/master/docs/index.html

    #-*- coding: utf8 -*-
    
    import os, glob, codecs
    
    def installStanfordTag():
        if not os.path.exists('stanford-postagger-full-2013-06-20'):
            os.system('wget http://nlp.stanford.edu/software/stanford-postagger-full-2013-06-20.zip')
            os.system('unzip stanford-postagger-full-2013-06-20.zip')
        return
    
    def tag(infile):
        cmd = "./stanford-postagger.sh "+models[m]+" "+infile
        tagout = os.popen(cmd).readlines()
        return [i.strip() for i in tagout]
    
    def taglinebyline(sents):
        tagged = []
        for ss in sents:
            # Write each sentence to a temp file, then tag that file.
            with codecs.open('stanfordtemp.txt', 'w', 'utf8') as fout:
                fout.write(ss + '\n')
            tagged.append(tag('stanfordtemp.txt')[0])
        return tagged
    
    installStanfordTag()
    stagdir = './stanford-postagger-full-2013-06-20/'
    models = {'fast':'models/german-fast.tagger',
              'dewac':'models/german-dewac.tagger',
              'hgc':'models/german-hgc.tagger'}
    os.chdir(stagdir)
    print(os.getcwd())
    
    
    m = 'fast' # It's best to use the fast german tagger if your data is small.
    
    sentences = ['Ich bin schwanger .','Ich bin wieder schwanger .','Ich verstehe nur Bahnhof .']
    
    tagged_sents = taglinebyline(sentences) # Call the stanford tagger
    
    for sent in tagged_sents:
        print(sent)
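
    The tagger's default output joins word and tag with an underscore (e.g. `Ich_PPER`); a small helper, assuming that format, to turn each returned line back into (word, tag) pairs:

```python
def parse_tagged_line(line, sep='_'):
    # Split on whitespace, then split each token once from the right so
    # words that themselves contain the separator stay intact.
    return [tuple(tok.rsplit(sep, 1)) for tok in line.split()]

pairs = parse_tagged_line('Ich_PPER bin_VAFIN schwanger_ADJD ._$.')
print(pairs)  # [('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD'), ('.', '$.')]
```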
    
  • 2020-12-12 22:26

    I have written a blog post about how to convert the German annotated TIGER corpus so it can be used with NLTK. Have a look at it here.
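
    Once the corpus is converted, NLTK's `ConllCorpusReader` can read a simple word/tag column format and feed a tagger. A sketch with an invented two-sentence sample file standing in for the converted TIGER data (file name and contents are made up for illustration):

```python
import os
import tempfile

from nltk.corpus.reader import ConllCorpusReader
from nltk.tag import UnigramTagger

# Invented miniature sample in "word<TAB>tag" CoNLL layout; blank lines
# separate sentences. A converted TIGER export would look similar.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, 'tiger_sample.conll'), 'w', encoding='utf-8') as f:
    f.write('Das\tART\nHaus\tNN\n\nDer\tART\nHund\tNN\nbellt\tVVFIN\n')

# Read the columns back as (word, pos) pairs and train a unigram tagger.
corpus = ConllCorpusReader(tmpdir, ['tiger_sample.conll'], ['words', 'pos'])
tagger = UnigramTagger(corpus.tagged_sents())
print(tagger.tag(['Der', 'Hund', 'schläft']))
# [('Der', 'ART'), ('Hund', 'NN'), ('schläft', None)]
```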
