Python urllib2.urlopen() is slow, need a better way to read several urls


As the title suggests, I'm working on a site written in Python that makes several calls to the urllib2 module to read websites, which I then parse with BeautifulSoup.

9 Answers
  • 2020-11-28 04:54

    Edit: Please take a look at Wai's post for a better version of this code. Note that there is nothing wrong with this code and it will work properly, despite the comments below.

    The speed of reading web pages is probably bounded by your Internet connection, not Python.

    You could use threads to load them all at once.

    import thread, time, urllib

    websites = {}

    def read_url(url):
        # urllib.urlopen (not urllib.open) fetches the page; store the body keyed by URL
        websites[url] = urllib.urlopen(url).read()

    for url in urls_to_load:
        thread.start_new_thread(read_url, (url,))

    # Poll until every URL has been fetched
    while len(websites) < len(urls_to_load):
        time.sleep(0.1)

    # Now websites contains the contents of all the web pages in urls_to_load
    
  • 2020-11-28 04:54

    How about using pycurl?

    You can install it with apt-get:

    $ sudo apt-get install python-pycurl
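
    A minimal single-URL sketch with pycurl (the URL is just a placeholder; for many URLs you would create one handle per URL and drive them together with pycurl.CurlMulti):

    import pycurl
    from io import BytesIO

    def fetch(url):
        buf = BytesIO()
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, buf.write)  # collect the response body in memory
        c.perform()
        c.close()
        return buf.getvalue()

    print(fetch('http://www.example.com/'))  # placeholder URL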
  • 2020-11-28 05:03

    It is maybe not perfect, but when I need the data from a site, I just do this:

    import socket

    def geturldata(url):
        # Expects "host/path" with no "http://" prefix, e.g. "www.example.com/index.html"
        server = url.split("/")[0]
        args = url.replace(server, "", 1)
        if not args:
            args = "/"
        returndata = str()
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((server, 80))  # plain HTTP on port 80

        # args already starts with "/", so the request line is "GET /path HTTP/1.0"
        s.send("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (args, server))  # simple HTTP request
        while 1:
            data = s.recv(1024)  # read the response in 1 KB chunks
            if not data:
                break
            returndata = returndata + data
        s.close()
        # Headers and body are separated by a blank line (\r\n\r\n); return just the body
        return returndata.split("\r\n\r\n", 1)[1]
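
    For example (placeholder URL):

    print(geturldata("www.example.com/index.html"))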
    
  • 2020-11-28 05:06

    As a general rule, a given construct in any language is not slow until it is measured.

    In Python, not only do timings often run counter to intuition, but the tools for measuring execution time are exceptionally good.
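
    For example, a quick sketch with the standard timeit module (placeholder URL) tells you how long a single fetch actually takes, so you can see whether the network or your Python code is the bottleneck:

    import timeit

    setup = "import urllib2"  # on Python 3, use "import urllib.request" instead
    stmt = "urllib2.urlopen('http://www.example.com/').read()"  # placeholder URL

    # Time three fetches; the network round-trip usually dominates, not Python
    print(timeit.timeit(stmt, setup=setup, number=3))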

  • 2020-11-28 05:15

    Scrapy might be useful for you. If you don't need all of its functionality, you might just use twisted's twisted.web.client.getPage instead. Asynchronous IO in one thread is going to be far more performant and easier to debug than anything that uses multiple threads and blocking IO.
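
    A minimal sketch with getPage, assuming a small list of placeholder URLs (in recent Twisted releases getPage is deprecated in favour of twisted.web.client.Agent or treq, so treat this as a sketch of the idea rather than the current API):

    from twisted.internet import reactor, defer
    from twisted.web.client import getPage

    urls = ['http://www.example.com/', 'http://www.example.org/']  # placeholder URLs

    def done(results):
        # DeferredList hands back (success, body) pairs in the same order as urls
        for success, body in results:
            print(len(body) if success else body)
        reactor.stop()

    # Fire off every request at once in a single thread, then wait for all of them
    deferreds = [getPage(url.encode('ascii')) for url in urls]
    defer.DeferredList(deferreds, consumeErrors=True).addCallback(done)
    reactor.run()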

  • 2020-11-28 05:17

    First, you should try the multithreading/multiprocessing packages. Currently, the three popular ones are multiprocessing, concurrent.futures, and threading (a short concurrent.futures sketch is included at the end of this answer). These packages let you open multiple URLs at the same time, which can increase the speed considerably.

    More importantly, once you use multithreading and try to open hundreds of URLs at the same time, you will find that urllib.request.urlopen is very slow, and opening and reading the content becomes the most time-consuming part. So if you want to make it even faster, you should try the requests package: requests.get(url).content is faster than urllib.request.urlopen(url).read().

    So here I list two examples of fast multi-URL fetching and parsing, both faster than the other answers. The first example uses the classical threading package and spawns hundreds of threads at the same time. (One minor shortcoming is that it does not preserve the original order of the tickers.)

    import time
    import threading
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    
    ticker = pd.ExcelFile('short_tickerlist.xlsx')
    ticker_df = ticker.parse(str(ticker.sheet_names[0]))
    ticker_list = list(ticker_df['Ticker'])
    
    start = time.time()
    
    result = []
    def fetch(ticker):
        url = ('http://finance.yahoo.com/quote/' + ticker)
        print('Visit ' + url)
        text = requests.get(url).content
        soup = BeautifulSoup(text,'lxml')
        result.append([ticker,soup])
        print(url +' fetching...... ' + str(time.time()-start))
    
    
    
    if __name__ == '__main__':
        process = [None] * len(ticker_list)
        for i in range(len(ticker_list)):
            process[i] = threading.Thread(target=fetch, args=[ticker_list[i]])
    
        for i in range(len(ticker_list)):    
            print('Start_' + str(i))
            process[i].start()
    
    
    
        # Wait for every thread to finish so the elapsed time is meaningful
        for i in range(len(ticker_list)):
            print('Join_' + str(i))
            process[i].join()

        print("Elapsed Time: %ss" % (time.time() - start))
    

    The second example uses the multiprocessing package, and it is a little more straightforward, since you just need to set the size of the pool and map the function over the list. The order does not change after fetching the content, and the speed is similar to the first example but much faster than the other methods.

    from multiprocessing import Pool
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import os
    import time
    
    os.chdir('file_path')
    
    start = time.time()
    
    def fetch_url(x):
        print('Getting Data')
        myurl = ("http://finance.yahoo.com/q/cp?s=%s" % x)
        html = requests.get(myurl).content
        soup = BeautifulSoup(html,'lxml')
        out = str(soup)
        listOut = [x, out]
        return listOut
    
    tickDF = pd.read_excel('short_tickerlist.xlsx')
    li = tickDF['Ticker'].tolist()    
    
    if __name__ == '__main__':
        p = Pool(5)
        output = p.map(fetch_url, li, chunksize=30)
        print("Time is %ss" %(time.time()-start))
    