spaCy sentence segmentation failing on quotes

Submitted by 末鹿安然 on 2019-12-11 04:38:27

Question


I am parsing some news data with spaCy and noticing that sentence segmentation consistently fails where a quote begins. Has anyone else run into this and solved it?

Here is a reproducible example; note sentence 4 in the output below. spaCy fails to split at the start of the quote, and the same thing happens in the other news articles I'm working with.

Thanks a lot.

Example:

Raw data:

u'body': u'\n LONDON Nov 4 Britons hurt by lower incomes and rising food prices after the financial crisis have cut back on fruit and vegetables and turned instead to fatty, sugary, processed food, an academic study showed on Monday.Britain has seen food prices rise much more sharply than most other developed economies between 2005 and 2012, while wage growth has been low and unemployment has risen.The net effect has been that Britons are spending 8.5 percent less in real terms on food purchased at home than before the recession - with the trend even greater for pensioners and families with young children.The research is likely to be politically sensitive at a time when Britain\'s Conservative-led government is under pressure from the opposition Labour Party, over declining standards of living and sharply rising demand at food banks which hand out free food to the poorest Britons. People have economised by buying less food, measured in number of calories, but also on its quality, picking products that are less nutritious and higher in saturated fat and sugar."Various measures of nutritional quality declined over this period, with bigger decreases for pensioner households and households with young children," said the Institute for Fiscal Studies, an economics research body.OBESITY Families with children were prone to switching to more sugary food, while pensioners favoured food high in saturated fat, the study showed. Both groups often have lower incomes.While the economy is starting to show signs of growth after suffering the biggest hit to economic growth since records began during the 2008-09 recession, households\' disposable incomes are no higher than a decade ago. However, the IFS said a lower-quality diet was not an inevitable consequence of having less money, and that some households had been able to eat as healthily as before while spending less. 
More research was needed to see why this was not the case for other households, the researchers added.The study looked at data on more than 15,000 households\' shopping habits collected by market research company Kantar Worldpanel between 2005 and 2012.The figures do not include meals purchased or provided away from home, for example in restaurants or at schools, which in England provide free lunches for poorer pupils.The study was released alongside a piece of longer-term research from the IFS, which showed the English now consume 15-30 percent fewer calories than in 1980, despite higher obesity rates probably due to less physical activity.This contrasts with the United States, where calorie consumption has risen as well as obesity. The IFS said it was were researching further into trends in Britons\' physical activity over the period.',

Code to split:

from __future__ import unicode_literals
import spacy
nlp = spacy.load('en')
doc1 = nlp(article_to_json['body'].decode('utf-8'), parse=True)

for number, sent in enumerate(doc1.sents):
    print number, sent, "\n"

Output:

0 LONDON Nov 4 Britons hurt by lower incomes and rising food prices after the financial crisis have cut back on fruit and vegetables and turned instead to fatty, sugary, processed food, an academic study showed on Monday.

1 Britain has seen food prices rise much more sharply than most other developed economies between 2005 and 2012, while wage growth has been low and unemployment has risen.

2 The net effect has been that Britons are spending 8.5 percent less in real terms on food purchased at home than before the recession - with the trend even greater for pensioners and families with young children.

3 The research is likely to be politically sensitive at a time when Britain's Conservative-led government is under pressure from the opposition Labour Party, over declining standards of living and sharply rising demand at food banks which hand out free food to the poorest Britons.

4 People have economised by buying less food, measured in number of calories, but also on its quality, picking products that are less nutritious and higher in saturated fat and sugar."Various measures of nutritional quality declined over this period, with bigger decreases for pensioner households and households with young children," said the Institute for Fiscal Studies, an economics research body.

5 OBESITY Families with children were prone to switching to more sugary food, while pensioners favoured food high in saturated fat, the study showed.

6 Both groups often have lower incomes.

7 While the economy is starting to show signs of growth after suffering the biggest hit to economic growth since records began during the 2008-09 recession, households' disposable incomes are no higher than a decade ago.

8 However, the IFS said a lower-quality diet was not an inevitable consequence of having less money, and that some households had been able to eat as healthily as before while spending less.

9 More research was needed to see why this was not the case for other households, the researchers added.

10 The study looked at data on more than 15,000 households' shopping habits collected by market research company Kantar Worldpanel between 2005 and 2012.The figures do not include meals purchased or provided away from home, for example in restaurants or at schools, which in England provide free lunches for poorer pupils.

11 The study was released alongside a piece of longer-term research from the IFS, which showed the English now consume 15-30 percent fewer calories than in 1980, despite higher obesity rates probably due to less physical activity.

12 This contrasts with the United States, where calorie consumption has risen as well as obesity.

13 The IFS said it was were researching further into trends in Britons' physical activity over the period.


Answer 1:


I googled the original news article to figure out why your data looks the way it does (whitespace missing between sentences, which I wouldn't expect in a formal news article). The underlying problem appears to be that no whitespace is inserted between HTML paragraphs. If you can fix how the article is extracted from the original HTML (insert whitespace whenever you run into <p> or </p>), you won't have this problem with spaCy or other tools.
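To illustrate the extraction fix, here is a minimal sketch that uses only Python's standard-library HTMLParser. The class name ParagraphTextExtractor, the helper html_to_text, and the set of tags treated as block boundaries are all assumptions made for this example, not part of the asker's pipeline. A real scraper would likely need to handle more tags.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collect text content, emitting a space at block boundaries."""

    # Tags treated as paragraph-like boundaries (an assumption; extend as needed).
    BLOCK_TAGS = {"p", "br", "div"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of `html` with whitespace between paragraphs."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join("".join(parser.parts).split())


sample = (
    "<p>Both groups often have lower incomes.</p>"
    "<p>While the economy is starting to show signs of growth.</p>"
)
print(html_to_text(sample))
```

With the paragraph boundaries turned into spaces, the two sentences in the sample no longer run together, so a sentence segmenter sees the whitespace it expects.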

The models available in standard tools will often be trained on news data, and it's reasonable to expect them to work well for data like this, but they expect whitespace between sentences. Unless you retrain the models with data that includes missing whitespace between sentences (or preprocess your data as suggested in a comment), you're going to have these kinds of problems.
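If re-extracting from the HTML is not an option, the preprocessing route could look roughly like the sketch below. It is a heuristic, not a spaCy feature: the function name restore_sentence_spaces and the regular expression are assumptions for illustration. It inserts a space after ".", "!" or "?" when the punctuation follows a lowercase letter or digit and is immediately followed by a capital letter or a double quote. It assumes a quote right after a period opens a new sentence, so it will mis-handle a closing quote that is glued to the next sentence.

```python
import re

# Zero-width match: after sentence punctuation that follows a lowercase letter
# or digit, and directly before a capital letter or a double quote.
# Requiring a lowercase letter or digit before the punctuation leaves
# abbreviations such as "U.S." and decimals such as "8.5" untouched.
_MISSING_SPACE = re.compile(r'(?<=[a-z0-9][.!?])(?=["A-Z])')


def restore_sentence_spaces(text):
    """Insert a space where two sentences were joined without whitespace."""
    return _MISSING_SPACE.sub(" ", text)


print(restore_sentence_spaces('fat and sugar."Various measures declined'))
print(restore_sentence_spaces("between 2005 and 2012.The figures"))
```

The cleaned string can then be passed to nlp() in place of the raw body text.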



Source: https://stackoverflow.com/questions/44853107/spacy-sentence-segmentation-failing-on-quotes
