Question
Does anyone have a download link for a text summarization dataset such as DUC 2007 or TREC? Please help.
Answer 1:
You can use http://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports for an extraction-based text summarization approach. It contains catchphrases, which can act as the selected sentences for training. However, the catchphrases may not be entirely appropriate for this purpose.
Answer 2:
You can access the DUC datasets after completing the required organization and individual agreements. See http://www-nlpir.nist.gov/projects/duc/data.html for more information.
Answer 3:
You can write a sitemap crawler in Scrapy for:
- buzzfeed
- huffingtonpost
- deadspin
- gizmodo
That may give you around 1.45 million articles with their abstracts.
You can also check the harvardnlp sent-summary dataset and the CNN/Daily Mail dataset, which provide news article stories.
Warning: because these come from different sources, their writing styles may differ.
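As a rough starting point for the sitemap-crawling idea, here is a minimal sketch of the first step: pulling the page URLs out of a sitemap document. It uses only the Python standard library rather than Scrapy, so it is a stand-in for the crawler, not the crawler itself. The function name `extract_locs` and the example.com URLs are made up for illustration; real sites may structure their sitemaps differently, and you should check each site's robots.txt and terms before crawling.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap content, used only to illustrate the format.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/article-1</loc></url>
  <url><loc>https://example.com/news/article-2</loc></url>
</urlset>
"""


def extract_locs(xml_text):
    """Return (kind, locs) for a sitemap document.

    kind is "index" when the document lists other sitemaps
    (a <sitemapindex>), and "urlset" when it lists pages directly.
    """
    root = ET.fromstring(xml_text.strip())
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    # Collect every <loc> in the sitemap namespace, skipping empty ones.
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return kind, locs


kind, locs = extract_locs(SAMPLE_SITEMAP)
print(kind, len(locs))
```

When `kind` is "index", each returned location is itself a sitemap that would need to be fetched and parsed the same way; when it is "urlset", the locations are the article pages to download.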
Answer 4:
You could try the "BBC News Summary" dataset from Kaggle.
Inside you will find two folders: one with the original articles and one with their summaries. There are 5 news categories: business, entertainment, politics, sport, and tech, with around 500 article-summary pairs for each.
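To make the folder layout concrete, here is a minimal sketch that pairs each article with its summary. It assumes the two top-level folders are named "News Articles" and "Summaries", that each contains one subfolder per category, and that an article and its summary share the same file name. These names are assumptions and are exposed as parameters; adjust them to whatever your download actually contains. The function name `load_pairs` is also just illustrative.

```python
from pathlib import Path


def load_pairs(root, articles_dir="News Articles", summaries_dir="Summaries"):
    """Collect article/summary pairs from a two-folder dataset layout.

    Assumed layout (not verified against the actual download):
        root/<articles_dir>/<category>/<name>.txt
        root/<summaries_dir>/<category>/<name>.txt
    """
    root = Path(root)
    pairs = []
    for article_path in sorted((root / articles_dir).glob("*/*.txt")):
        category = article_path.parent.name
        summary_path = root / summaries_dir / category / article_path.name
        if not summary_path.exists():
            # Skip articles that have no matching summary file.
            continue
        pairs.append({
            "category": category,
            # errors="replace" tolerates files that are not valid UTF-8.
            "article": article_path.read_text(encoding="utf-8", errors="replace"),
            "summary": summary_path.read_text(encoding="utf-8", errors="replace"),
        })
    return pairs
```

Calling `load_pairs("path/to/dataset")` returns a list of dictionaries that can be shuffled and split into training and evaluation sets, or grouped by the `category` key.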
Source: https://stackoverflow.com/questions/14959104/dataset-link-for-text-summarization