Python re.compile and split with ÆØÅ characters

Question

I am very new to Python. I have a file with a list of words. They contain Danish letters (ÆØÅ), but re.compile does not understand these characters: the split breaks the words at each Æ, Ø or Å. The text is downloaded from Twitter and Facebook and does not always contain only letters.

text = "Rød grød med fløde.... !! :)"
pattern_split = re.compile(r"\W+")
words = pattern_split.split(text.lower())
words = ['r', 'd', 'gr', 'd', 'med', 'fl', 'de']

The right result should be

    words = ['rød', 'grød', 'med', 'fløde']

How do I get the right result?

Full code

#!/usr/bin/python 
# -*- coding: utf-8 -*-

import math, re, sys, os
reload(sys)
sys.setdefaultencoding('utf-8')

# AFINN-111 is as of June 2011 the most recent version of AFINN
#filenameAFINN = 'AFINN/AFINN-111.txt'

# Get location of file
__location__ = os.path.realpath(
    os.path.join(os.getcwd(), os.path.dirname(__file__)))


filenameAFINN = __location__ + '/AFINN/AFINN-111DK.txt'
afinn = dict(map(lambda (w, s): (w, int(s)), [ 
            ws.strip().split('\t') for ws in open(filenameAFINN) ]))

# Word splitter pattern
pattern_split = re.compile(r"\W+")
#pattern_split = re.compile('[ .,:();!?]+')

def sentiment(text):
    print(text)
    words = pattern_split.split(text.lower().strip())
    print(words)
    sentiments = map(lambda word: afinn.get(word, 0), words)
    if sentiments:
        sentiment = float(sum(sentiments))/math.sqrt(len(sentiments))

    else:
        sentiment = 0
    return sentiment


# Print result
text = "ånd ånd med fløde... :)asd "
id = 999
split = "###"
print("%6.2f%s%s%s%s" % (sentiment(text), split, id, split, text))

Answer 1:

Reworking your script to use best practices:

import csv
import math
import os
import re

LOCATION = os.path.dirname(os.path.abspath(__file__))
# No leading slash here: an absolute component would make os.path.join discard LOCATION
afinn_filename = os.path.join(LOCATION, 'AFINN', 'AFINN-111DK.txt')

pattern_split = re.compile(r"\W+")

with open(afinn_filename, encoding='utf8', newline='') as infile:
    reader = csv.reader(infile, delimiter='\t')
    afinn = {key: int(score) for key, score in reader}


def sentiment(text):
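    # split() leaves empty strings when the text starts or ends with a separator; drop them
    words = [word for word in pattern_split.split(text.lower()) if word]
    if not words:
        return 0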
    sentiments = [afinn.get(word, 0) for word in words]
    return sum(sentiments) / math.sqrt(len(sentiments))


# Print result
text = "ånd ånd med fløde... :)asd "
id = 999
split = "###"
print('{sentiment:6.2f}{split}{id}{split}{text}'.format(
    sentiment=sentiment(text), id=id, split=split, text=text))

Running this with Python 3 means that text is a Unicode string (str) and that the regular expression is compiled with re.UNICODE semantics by default, so \W no longer matches Æ, Ø or Å.
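To see the effect of that flag in isolation, here is a minimal Python 3 sketch using the sentence from the question. The helper name tokenize is made up for this illustration; re.ASCII is used only to reproduce the broken split, since it restricts \w to ASCII letters, digits and the underscore.

```python
import re

text = "Rød grød med fløde.... !! :)"

# Python 3 default for str patterns: Unicode-aware, so ø is a word character.
unicode_split = re.compile(r"\W+")
# re.ASCII limits \w to [a-zA-Z0-9_], which reproduces the split from the question.
ascii_split = re.compile(r"\W+", re.ASCII)


def tokenize(pattern, s):
    """Hypothetical helper: lowercase, split, and drop empty edge strings."""
    return [word for word in pattern.split(s.lower()) if word]


words_unicode = tokenize(unicode_split, text)
words_ascii = tokenize(ascii_split, text)

print(words_unicode)  # ['rød', 'grød', 'med', 'fløde']
print(words_ascii)    # ['r', 'd', 'gr', 'd', 'med', 'fl', 'de']
```

The filter on empty strings is there because the text ends in punctuation, so split() would otherwise return a trailing ''.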

In Python 2, you'd use:

text = u"ånd ånd med fløde... :)asd "

(note the leading u prefix on the string) and

pattern_split = re.compile(ur"\W+", re.UNICODE)

Your AFINN file would still be read as CSV, but with each key decoded from UTF-8 after the fact:

with open(afinn_filename, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t')
    afinn = {key.decode('utf8'): int(score) for key, score in reader}
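Why the decoding step matters can be sketched in Python 3 by working with bytes explicitly, which is roughly what a plain Python 2 str holds. This is an illustration under that assumption, not part of the script above; the sample phrase is taken from the question.

```python
import re

# UTF-8 encoded bytes, comparable to an undecoded Python 2 str
raw = "rød grød".encode("utf8")

# A bytes pattern only treats ASCII letters, digits and "_" as word
# characters, so the two bytes that encode ø act as a separator.
byte_words = re.compile(rb"\W+").split(raw)

# Decoding first gives a str, which a str pattern handles as Unicode.
str_words = re.compile(r"\W+").split(raw.decode("utf8"))

print(byte_words)  # [b'r', b'd', b'gr', b'd']
print(str_words)   # ['rød', 'grød']
```

Dictionary keys behave the same way: an undecoded key is a different object from the Unicode word produced by the splitter, so lookups with afinn.get would miss.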

Answer 2:

I'd like to point to my afinn Python package, which should work with international character sets, including the Danish one, and with (some versions of) Python 2 and 3. It includes an English and a Danish word list. It will probably solve your problem.

Here with Python 2.7 or Python 3.4:

>>> from afinn import Afinn
>>> afinn = Afinn(language='da', emoticons=True)
>>> afinn.score(u"ånd ånd med fløde... :)asd ")
4.0
>>> afinn.score('Hvis ikke det er det mest afskyelige flueknepperi...')
-6.0

You can get the library here:

https://github.com/fnielsen/afinn

or from the Python Package Index with pip install afinn

Source: https://stackoverflow.com/questions/16549161/python-re-compile-and-split-with-%c3%86%c3%98%c3%85-charcters

Tags

python

regex

split

python-2.x