Python re.compile and split with ÆØÅ characters

Submitted by 五迷三道 on 2020-01-04 05:21:10

Question


I am very new to Python. I have a file with a list of words. They contain Danish letters (ÆØÅ), but re.compile does not understand these characters: the split breaks each word at every Æ, Ø or Å. The text is downloaded from Twitter and Facebook and does not always contain only letters.

text = "Rød grød med fløde.... !! :)"
pattern_split = re.compile(r"\W+")
words = pattern_split.split(text.lower())
# words is now ['r', 'd', 'gr', 'd', 'med', 'fl', 'de']

The correct result should be:

    words = ['rød', 'grød', 'med', 'fløde']

How do I get the right result?

Full code

#!/usr/bin/python 
# -*- coding: utf-8 -*-

import math, re, sys, os
reload(sys)
sys.setdefaultencoding('utf-8')

# AFINN-111 is as of June 2011 the most recent version of AFINN
#filenameAFINN = 'AFINN/AFINN-111.txt'

# Get location of file
__location__ = os.path.realpath(
    os.path.join(os.getcwd(), os.path.dirname(__file__)))


filenameAFINN = __location__ + '/AFINN/AFINN-111DK.txt'
afinn = dict(map(lambda (w, s): (w, int(s)), [ 
            ws.strip().split('\t') for ws in open(filenameAFINN) ]))

# Word splitter pattern
pattern_split = re.compile(r"\W+")
#pattern_split = re.compile('[ .,:();!?]+')

def sentiment(text):
    print(text)
    words = pattern_split.split(text.lower().strip())
    print(words)
    sentiments = map(lambda word: afinn.get(word, 0), words)
    if sentiments:
        sentiment = float(sum(sentiments))/math.sqrt(len(sentiments))

    else:
        sentiment = 0
    return sentiment


# Print result
text = "ånd ånd med fløde... :)asd "
id = 999
split = "###"
print("%6.2f%s%s%s%s" % (sentiment(text), split, id, split, text))

Answer 1:


Reworking your script to use best practices:

import csv
import math
import os
import re

LOCATION = os.path.dirname(os.path.abspath(__file__))
afinn_filename = os.path.join(LOCATION, 'AFINN', 'AFINN-111DK.txt')

pattern_split = re.compile(r"\W+")

with open(afinn_filename, encoding='utf8', newline='') as infile:
    reader = csv.reader(infile, delimiter='\t')
    afinn = {key: int(score) for key, score in reader}


def sentiment(text):
    words = pattern_split.split(text.lower().strip())
    if not words:
        return 0
    sentiments = [afinn.get(word, 0) for word in words]
    return sum(sentiments) / math.sqrt(len(sentiments))


# Print result
text = "ånd ånd med fløde... :)asd "
id = 999
split = "###"
print('{sentiment:6.2f}{split}{id}{split}{text}'.format(
    sentiment=sentiment(text), id=id, split=split, text=text))

Running this with Python 3 means that text is a Unicode string (str) and that the regular expression is interpreted with the re.UNICODE flag set, so \W no longer treats æ, ø and å as non-word characters.
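
As a quick check of that default, here is a minimal Python 3 sketch using the sentence from the question. The names unicode_split and ascii_split are only illustrative. The re.ASCII flag is used to reproduce the broken split for comparison, and filtering out empty tokens is an addition that the original script does not make.

```python
import re

text = "Rød grød med fløde.... !! :)"

# Python 3: str patterns are Unicode-aware by default, so \W does not match ø.
unicode_split = re.compile(r"\W+")
# re.ASCII restricts \w to [a-zA-Z0-9_], which reproduces the split from the question.
ascii_split = re.compile(r"\W+", re.ASCII)

# split() leaves an empty string when the text ends in non-word characters,
# so empty tokens are filtered out here.
unicode_words = [w for w in unicode_split.split(text.lower()) if w]
ascii_words = [w for w in ascii_split.split(text.lower()) if w]

print(unicode_words)  # ['rød', 'grød', 'med', 'fløde']
print(ascii_words)    # ['r', 'd', 'gr', 'd', 'med', 'fl', 'de']
```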

In Python 2, you'd use:

text = u"ånd ånd med fløde... :)asd "

(note the leading u prefix on the string) and

pattern_split = re.compile(ur"\W+", re.UNICODE)

Your AFINN file would still be read as CSV, but you would decode each key from UTF-8 after the fact, with:

with open(afinn_filename, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t')
    afinn = {key.decode('utf8'): int(score) for key, score in reader}
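
For comparison, the Python 3 path needs no manual decoding step, because the csv module already yields str values. The sketch below is self-contained and swaps the real file for an in-memory stand-in. The sample_lexicon entries and their scores are invented for illustration and are not taken from AFINN-111DK.txt. Dropping empty tokens is also an assumption added here.

```python
import csv
import io
import math
import re

# Hypothetical stand-in for the tab-separated word list; these entries and
# scores are invented for illustration and are not taken from AFINN-111DK.txt.
sample_lexicon = "fløde\t2\nånd\t1\n"

# In Python 3 the csv module yields str values, so no .decode() is needed.
reader = csv.reader(io.StringIO(sample_lexicon, newline=""), delimiter="\t")
afinn = {key: int(score) for key, score in reader}

pattern_split = re.compile(r"\W+")


def sentiment(text):
    # Drop empty tokens produced by leading or trailing punctuation.
    words = [w for w in pattern_split.split(text.lower()) if w]
    if not words:
        return 0.0
    scores = [afinn.get(word, 0) for word in words]
    # Sum of word scores, normalised by the square root of the token count.
    return sum(scores) / math.sqrt(len(scores))


score = sentiment("ånd ånd med fløde... :)asd ")
print(score)
```

With these made-up scores the five tokens sum to 4, so the result is 4 / sqrt(5). A string made up only of punctuation returns 0.0 because no tokens remain after filtering.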



Answer 2:


I would like to point to my afinn Python package, which should work with international character sets, including the Danish one, and with (some versions of) Python 2 and 3. It includes both an English and a Danish word list. It will probably solve your problem.

Here it is with Python 2.7 or Python 3.4:

>>> from afinn import Afinn
>>> afinn = Afinn(language='da', emoticons=True)
>>> afinn.score(u"ånd ånd med fløde... :)asd ")
4.0
>>> afinn.score('Hvis ikke det er det mest afskyelige flueknepperi...')
-6.0

You can get the library here:

https://github.com/fnielsen/afinn

or from the Python Package Index with pip install afinn



Source: https://stackoverflow.com/questions/16549161/python-re-compile-and-split-with-%c3%86%c3%98%c3%85-charcters
