Removing punctuation using spaCy; AttributeError

Submitted by 。_饼干妹妹 on 2021-02-19 03:00:24

Question


Currently I'm using the following code to lemmatize and calculate TF-IDF values for some text data using spaCy:

lemma = []

for doc in nlp.pipe(df['col'].astype('unicode').values, batch_size=9844,
                        n_threads=3):
    if doc.is_parsed:
        lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct | n.lemma_ != "-PRON-"])
    else:
        lemma.append(None)

df['lemma_col'] = lemma

vect = sklearn.feature_extraction.text.TfidfVectorizer()
lemmas = df['lemma_col'].apply(lambda x: ' '.join(x))
features = vect.fit_transform(lemmas)

feature_names = vect.get_feature_names()
dense = features.todense()
denselist = dense.tolist()

df = pd.DataFrame(denselist, columns=feature_names)
lemmas = pd.concat([lemmas, df])
df= pd.concat([df, lemmas])

I need to strip out proper nouns, punctuation, and stop words but am having some trouble doing that within my current code. I've read some documentation and other resources, but am now running into an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-21-e924639f7822> in <module>()
      7     if doc.is_parsed:
      8         tokens.append([n.text for n in doc])
----> 9         lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct or n.lemma_ != "-PRON-"])
     10         pos.append([n.pos_ for n in doc])
     11     else:

<ipython-input-21-e924639f7822> in <listcomp>(.0)
      7     if doc.is_parsed:
      8         tokens.append([n.text for n in doc])
----> 9         lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct or n.lemma_ != "-PRON-"])
     10         pos.append([n.pos_ for n in doc])
     11     else:

AttributeError: 'str' object has no attribute 'is_punct'

Is there an easier way to strip this stuff out of the text, without having to drastically change my approach?

Full code available here.


Answer 1:


From what I can see, your main problem here is actually quite simple: n.lemma_ returns a string, not a Token object. So it doesn't have an is_punct attribute. I think what you were looking for here is n.is_punct (whether the token is punctuation).
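As a minimal sketch, the corrected filter can be pulled out into its own function. The spaCy calls are shown in comments only, since they assume a loaded pipeline (e.g. en_core_web_sm); the function itself just reads standard Token attributes (is_punct, is_stop, pos_, lemma_):

```python
def keep_token(token):
    """Return True for tokens worth keeping: not punctuation, not a stop
    word, not a proper noun, and not the "-PRON-" lemma placeholder that
    spaCy v2 uses for pronouns. Note the checks are on the Token itself
    (token.is_punct), not on the lemma string (token.lemma_.is_punct)."""
    return (not token.is_punct
            and not token.is_stop
            and token.pos_ != "PROPN"
            and token.lemma_ != "-PRON-")

# With a loaded spaCy pipeline, this replaces the broken comprehension:
#   lemma.append([n.lemma_ for n in doc if keep_token(n)])
```

This also makes the intent explicit: each condition reads as "drop this kind of token", instead of a hard-to-parse chain of `not`/`|` operators.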

If you want to do this more elegantly, check out spaCy's new custom processing pipeline components (requires v2.0+). This lets you wrap your logic in a function which is run automatically when you call nlp() on your text. You could even take this one step further, and add a custom attribute to your Doc – for example, doc._.my_stripped_doc or doc._.pd_columns or something. The advantage here is that you can keep using spaCy's performant, built-in data structures like the Doc (and its views Token and Span) as the "single source of truth" of your application. This way, no information is lost and you'll always keep a reference to the original document – which is also very useful for debugging.
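A rough sketch of such a component, assuming spaCy v2's function-based add_pipe API and a hypothetical doc._.stripped extension attribute (the registration calls are commented out so the snippet stands on its own):

```python
def strip_component(doc):
    """Custom pipeline component: attach the filtered lemmas to the Doc
    as a custom attribute, instead of discarding the Doc and keeping
    only a plain list of strings."""
    doc._.stripped = [t.lemma_ for t in doc
                      if not t.is_punct and not t.is_stop
                      and t.pos_ != "PROPN"]
    return doc

# Registration with spaCy v2 (requires a loaded pipeline):
#   from spacy.tokens import Doc
#   Doc.set_extension("stripped", default=None)
#   nlp.add_pipe(strip_component, last=True)
#
#   doc = nlp("Some text here.")
#   doc._.stripped   # filtered lemmas; the original Doc is untouched
```

The component runs automatically on every nlp() call, and the original Doc remains available alongside the stripped version, which is exactly the "single source of truth" benefit described above.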



Source: https://stackoverflow.com/questions/47144311/removing-punctuation-using-spacy-attribueerror
