Question
I am very new to Python. I have a file with a list of words. They contain Danish letters (ÆØÅ), but re.compile does not understand these characters: the function splits the words at each Æ, Ø, or Å. The text is downloaded from Twitter and Facebook and does not always contain only letters.
text = "Rød grød med fløde.... !! :)"
pattern_split = re.compile(r"\W+")
words = pattern_split.split(text.lower())
words = ['r', 'd', 'gr', 'd', 'med', 'fl', 'de']
The right result should be
words = ['rød', 'grød', 'med', 'fløde']
How do I get the right result?
Full code
#!/usr/bin/python
# -*- coding: utf-8 -*-
import math, re, sys, os
reload(sys)
sys.setdefaultencoding('utf-8')
# AFINN-111 is as of June 2011 the most recent version of AFINN
#filenameAFINN = 'AFINN/AFINN-111.txt'
# Get location of file
__location__ = os.path.realpath(
    os.path.join(os.getcwd(), os.path.dirname(__file__)))
filenameAFINN = __location__ + '/AFINN/AFINN-111DK.txt'
afinn = dict(map(lambda (w, s): (w, int(s)), [
    ws.strip().split('\t') for ws in open(filenameAFINN) ]))
# Word splitter pattern
pattern_split = re.compile(r"\W+")
#pattern_split = re.compile('[ .,:();!?]+')
def sentiment(text):
    print(text)
    words = pattern_split.split(text.lower().strip())
    print(words)
    sentiments = map(lambda word: afinn.get(word, 0), words)
    if sentiments:
        sentiment = float(sum(sentiments))/math.sqrt(len(sentiments))
    else:
        sentiment = 0
    return sentiment
# Print result
text = "ånd ånd med fløde... :)asd "
id = 999
split = "###"
print("%6.2f%s%s%s%s" % (sentiment(text), split, id, split, text))
Answer 1:
Reworking your script to use best practices:
import csv
import math
import os
import re
LOCATION = os.path.dirname(os.path.abspath(__file__))
# No leading slash: an absolute second argument would make os.path.join discard LOCATION
afinn_filename = os.path.join(LOCATION, 'AFINN', 'AFINN-111DK.txt')
pattern_split = re.compile(r"\W+")
with open(afinn_filename, encoding='utf8', newline='') as infile:
    reader = csv.reader(infile, delimiter='\t')
    afinn = {key: int(score) for key, score in reader}
def sentiment(text):
    words = pattern_split.split(text.lower().strip())
    if not words:
        return 0
    sentiments = [afinn.get(word, 0) for word in words]
    return sum(sentiments) / math.sqrt(len(sentiments))
# Print result
text = "ånd ånd med fløde... :)asd "
id = 999
split = "###"
print('{sentiment:6.2f}{split}{id}{split}{text}'.format(
    sentiment=sentiment(text), id=id, split=split, text=text))
Running this with Python 3 means that text is a Unicode string and that the regular expression is interpreted with the re.UNICODE flag set, so \W no longer treats æ, ø, and å as separators.
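As a minimal sketch of that difference (Python 3 only): the default pattern keeps the Danish letters inside words, while forcing re.ASCII reproduces the splitting seen in the question. The helper name tokenize is made up for this illustration, and dropping empty strings is an extra step that the script above does not perform.

```python
import re

# Default in Python 3: str patterns use Unicode rules, so ø, æ, å count as word characters.
pattern_split = re.compile(r"\W+")
# re.ASCII restricts \w to [a-zA-Z0-9_], which mimics the behaviour from the question.
pattern_split_ascii = re.compile(r"\W+", re.ASCII)


def tokenize(text, pattern=pattern_split):
    """Lowercase the text and split it on runs of non-word characters.

    Hypothetical helper for illustration; empty strings left at the
    edges by split() are dropped.
    """
    return [word for word in pattern.split(text.lower()) if word]


text = "Rød grød med fløde.... !! :)"
unicode_words = tokenize(text)
ascii_words = tokenize(text, pattern_split_ascii)
print(unicode_words)
print(ascii_words)
```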
In Python 2, you'd use:
text = u"ånd ånd med fløde... :)asd "
(note the leading u prefix on the string) and
pattern_split = re.compile(ur"\W+", re.UNICODE)
Your AFINN file would still be read as CSV, but the key would be decoded from UTF-8 after the fact:
with open(afinn_filename, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t')
    afinn = {key.decode('utf8'): int(score) for key, score in reader}
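For a quick check without the word-list file, the scoring step can be sketched in Python 3 with an in-memory dictionary. The scores in toy_scores are invented for illustration and are not taken from AFINN-111DK.txt. This variant also drops the empty strings that split() can leave at the edges, which is a deliberate deviation: the code above counts them in len(sentiments).

```python
import math
import re

# Hypothetical scores for illustration only; NOT values from AFINN-111DK.txt.
toy_scores = {"fløde": 2, "ånd": 1}

pattern_split = re.compile(r"\W+")


def sentiment(text, scores):
    """Return the summed word scores divided by sqrt(number of words)."""
    # Drop empty strings so that punctuation-only input yields no words.
    words = [word for word in pattern_split.split(text.lower()) if word]
    if not words:
        return 0.0
    total = sum(scores.get(word, 0) for word in words)
    return total / math.sqrt(len(words))


score = sentiment("ånd ånd med fløde... :)asd ", toy_scores)
print('{:6.2f}'.format(score))
```

Whether empty tokens should count toward the normalisation is a judgment call; filtering them makes the result independent of leading or trailing punctuation.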
Answer 2:
I'd like to point to my afinn Python package, which should work with international character sets, including Danish, and with (some versions of) Python 2 and 3. It includes an English and a Danish word list. It will probably solve your problem.
Here with Python 2.7 or Python 3.4:
>>> from afinn import Afinn
>>> afinn = Afinn(language='da', emoticons=True)
>>> afinn.score(u"ånd ånd med fløde... :)asd ")
4.0
>>> afinn.score('Hvis ikke det er det mest afskyelige flueknepperi...')
-6.0
You can get the library here:
https://github.com/fnielsen/afinn
or from the Python Package Index with pip install afinn
Source: https://stackoverflow.com/questions/16549161/python-re-compile-and-split-with-%c3%86%c3%98%c3%85-charcters