Python: Newspaper Module - Any way to pool getting articles straight from URLs?

Backend · Unresolved · 4 answers · 1076 views

广开言路 2021-01-01 04:06

I'm using the Newspaper module for Python found here.

In the tutorials, it describes how you can pool the building of different newspapers so that it generates them at

4 Answers
  •  不思量自难忘°
    2021-01-01 04:28

    I know this question is really old, but it's one of the first links that shows up when you google how to multithread Newspaper. While Kyle's answer is very helpful, it is not complete and I think it has some typos...

    import newspaper
    
    urls = [
    'http://www.baltimorenews.net/index.php/sid/234363921',
    'http://www.baltimorenews.net/index.php/sid/234323971',
    'http://www.atlantanews.net/index.php/sid/234323891',
    'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',  
    ]
    
    class SingleSource(newspaper.Source):
        def __init__(self, articleURL):
            super(SingleSource, self).__init__("http://localhost")
            self.articles = [newspaper.Article(url=articleURL)]
    
    sources = [SingleSource(articleURL=u) for u in urls]
    
    newspaper.news_pool.set(sources)
    newspaper.news_pool.join()
    

    I changed StubSource to SingleSource and one of the urls to articleURL. Of course this just downloads the webpages; you still need to parse them to be able to get the text.

    multi = []
    for s in sources:
        try:
            s.articles[0].parse()
            multi.append(s.articles[0].text)
        except Exception:
            # skip articles that failed to download or parse
            pass
    

    In my sample of 100 URLs, this took half the time compared to working through each URL in sequence. (Edit: after increasing the sample size to 2000, the reduction is about a quarter.)
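    The speed-up comes from overlapping network waits across threads. A minimal sketch of why that roughly halves wall-clock time for I/O-bound work — not using Newspaper at all, just a hypothetical `fake_download` that simulates a network round-trip with `time.sleep`:

    ```python
    import time
    from multiprocessing.dummy import Pool as ThreadPool

    def fake_download(url):
        # simulate a network round-trip of ~50 ms
        time.sleep(0.05)
        return "text from " + url

    urls = ["http://example.com/%d" % i for i in range(8)]

    # sequential baseline: waits add up one after another
    t0 = time.perf_counter()
    sequential = [fake_download(u) for u in urls]
    t_seq = time.perf_counter() - t0

    # pooled version: 4 threads overlap the waits
    t0 = time.perf_counter()
    pool = ThreadPool(4)
    pooled = pool.map(fake_download, urls)
    pool.close()
    pool.join()
    t_pool = time.perf_counter() - t0

    print(t_pool < t_seq)  # the pooled run finishes well ahead of the sequential one
    ```

    The same results come back in the same order either way; only the elapsed time changes.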

    (Edit: got the whole thing working with multithreading!) I used this very good explanation for my implementation. With a sample size of 100 URLs, using 4 threads takes a comparable time to the code above, but increasing the thread count to 10 gives a further reduction of about half. A larger sample size needs more threads to give a comparable difference.

    import newspaper
    from newspaper import Article
    from multiprocessing.dummy import Pool as ThreadPool
    
    def getTxt(url):
        article = Article(url)
        try:
            article.download()
            article.parse()
            return article.text
        except Exception:
            # return an empty string for articles that fail to download or parse
            return ""
    
    pool = ThreadPool(10)
    
    # open the urls in their own threads
    # and return the results
    results = pool.map(getTxt, urls)
    
    # close the pool and wait for the work to finish 
    pool.close() 
    pool.join()
    
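    On Python 3 the same fan-out can also be written with `concurrent.futures.ThreadPoolExecutor` from the standard library, which manages pool shutdown for you. A sketch using a stand-in worker (`get_len` is a hypothetical placeholder here; the real `getTxt` above slots in the same way):

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def get_len(url):
        # stand-in for getTxt: any function of a URL works here
        return len(url)

    urls = ["http://example.com/a", "http://example.com/bb"]

    # executor.map preserves input order, just like pool.map
    with ThreadPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(get_len, urls))

    print(results)  # [20, 21]
    ```

    The `with` block closes and joins the pool automatically, so there is no separate `close()`/`join()` step.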
