Python urllib2.urlopen() is slow, need a better way to read several urls

谎友^ 2020-11-28 04:48

As the title suggests, I'm working on a site written in Python and it makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.

9 Answers
  •  感情败类
    2020-11-28 05:17

    First, try one of the multithreading or multiprocessing packages. Three popular choices are multiprocessing, concurrent.futures, and threading. They let you open several URLs at the same time, which can speed things up considerably.
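
    Of the three packages, concurrent.futures is the only one not shown in the examples further down, so here is a minimal sketch of it. The names read_url and fetch_all, the 10 second timeout, and the default of 8 workers are illustrative assumptions, not part of the original code.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def read_url(url, timeout=10):
    """Download one URL and return the raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=read_url, max_workers=8):
    """Run fetch over urls in a thread pool and return the results as a list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # even when a later download finishes first.
        return list(pool.map(fetch, urls))
```

    Calling fetch_all(list_of_urls) returns one bytes object per URL, which can then be handed to BeautifulSoup. Passing a different fetch function makes it easy to swap in another HTTP client.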

    More importantly, once you are using multiple threads and try to open hundreds of URLs at the same time, you will find that urllib.request.urlopen is very slow, and opening and reading the content becomes the most time-consuming part. So if you want to make it even faster, try the requests package: requests.get(url).content is faster than urllib.request.urlopen(url).read().
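
    Since the difference depends on the network and on the site, it may be worth measuring it for your own URLs. The following is only a rough sketch of how that could be done; time_fetcher and fetch_with_urllib are hypothetical helper names, and the requests variant is left as a comment so the block runs with the standard library alone.

```python
import time
from urllib.request import urlopen


def time_fetcher(fetch, urls):
    """Call fetch(url) for every URL and return the total elapsed seconds."""
    started = time.perf_counter()
    for url in urls:
        fetch(url)
    return time.perf_counter() - started


def fetch_with_urllib(url):
    """Fetch one URL with the standard library."""
    with urlopen(url, timeout=10) as response:
        return response.read()


# With requests installed, a comparable fetcher would be:
#     def fetch_with_requests(url):
#         return requests.get(url, timeout=10).content
# The two can then be compared on the same list of URLs:
#     time_fetcher(fetch_with_urllib, urls)
#     time_fetcher(fetch_with_requests, urls)
```

    Run each variant a few times and on the same URLs, because a single measurement is dominated by network noise.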

    So here are two examples of fast multi-URL fetching and parsing, both faster than the approaches in the other answers. The first uses the classic threading package and starts one thread per ticker, which means hundreds of threads at the same time. (One minor shortcoming is that the results do not keep the original order of the ticker list.)

    import time
    import threading
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup


    # Read the ticker symbols from the first sheet of the workbook.
    ticker = pd.ExcelFile('short_tickerlist.xlsx')
    ticker_df = ticker.parse(str(ticker.sheet_names[0]))
    ticker_list = list(ticker_df['Ticker'])

    start = time.time()

    result = []  # filled by the worker threads, in completion order

    def fetch(ticker):
        url = 'http://finance.yahoo.com/quote/' + ticker
        print('Visit ' + url)
        text = requests.get(url).content
        soup = BeautifulSoup(text, 'lxml')
        result.append([ticker, soup])  # list.append is thread-safe in CPython
        print(url + ' fetched at ' + str(time.time() - start))


    if __name__ == '__main__':
        # One thread per ticker.
        threads = [threading.Thread(target=fetch, args=[t]) for t in ticker_list]

        for i, thread in enumerate(threads):
            print('Start_' + str(i))
            thread.start()

        # Wait for every thread; without the joins the elapsed time
        # would be printed before the downloads have finished.
        for i, thread in enumerate(threads):
            print('Join_' + str(i))
            thread.join()

        print("Elapsed Time: %ss" % (time.time() - start))
    
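    If the original order matters, one possible workaround with plain threading is to give each thread its own slot in a pre-sized list. This is a sketch, not part of the example above; fetch_in_order is a hypothetical helper, and fetch stands for any function that takes one item and returns its result.

```python
import threading


def fetch_in_order(items, fetch):
    """Run fetch(item) in one thread per item; return results in input order."""
    results = [None] * len(items)

    def worker(index, item):
        # Each thread writes only to its own slot, so no lock is needed.
        results[index] = fetch(item)

    threads = [
        threading.Thread(target=worker, args=(i, item))
        for i, item in enumerate(items)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every thread has finished
    return results
```

    A slot stays None if its fetch call raises, so a failed download is easy to spot afterwards.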

    The second example uses the multiprocessing package and is a little more straightforward, since you only need to set the pool size and map the function over the inputs. The results come back in the same order as the inputs, and the speed is similar to the first example but much faster than the other methods.

    from multiprocessing import Pool
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import os
    import time

    os.chdir('file_path')  # placeholder: folder that contains short_tickerlist.xlsx

    start = time.time()

    def fetch_url(x):
        print('Getting Data')
        myurl = "http://finance.yahoo.com/q/cp?s=%s" % x
        html = requests.get(myurl).content
        soup = BeautifulSoup(html, 'lxml')
        # Return the page as a string; the result is pickled on its way
        # back to the parent process.
        return [x, str(soup)]

    tickDF = pd.read_excel('short_tickerlist.xlsx')
    li = tickDF['Ticker'].tolist()

    if __name__ == '__main__':
        # Five worker processes; map returns results in the same order as li.
        # chunksize=30 hands each worker 30 tickers at a time.
        with Pool(5) as p:
            output = p.map(fetch_url, li, chunksize=30)
        print("Time is %ss" % (time.time() - start))
    
