Urllib2 & BeautifulSoup: Nice couple but too slow - urllib3 & threads?

醉酒成梦 2021-01-31 06:38

I was looking for a way to optimize my code when I heard some good things about threads and urllib3. Apparently, people disagree about which solution is best.

The pro

3 Answers
  •  天命终不由人
    2021-01-31 07:27

    I don't think urllib or BeautifulSoup is slow. I ran your code on my local machine with a modified version (I removed the Excel stuff). It took around 100 ms to open the connection, download the content, parse it, and print it to the console for one country.

    About 10 ms of that is the total time BeautifulSoup spent parsing the content and printing it to the console per country. That is fast enough.
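
    If you want to check the fetch/parse split yourself, here is a minimal timing sketch. The URL is a placeholder (your original script isn't shown here), and it uses Python 3's urllib.request, the successor of urllib2:

    ```python
    # Minimal sketch: time the network fetch separately from the parse.
    # The URL is hypothetical; substitute one of your own country pages.
    import time
    import urllib.request

    from bs4 import BeautifulSoup

    url = "https://example.com/country/france"  # placeholder

    t0 = time.perf_counter()
    html = urllib.request.urlopen(url).read()   # DNS, handshake, download
    t1 = time.perf_counter()
    soup = BeautifulSoup(html, "html.parser")   # parsing only
    t2 = time.perf_counter()

    print(f"fetch: {(t1 - t0) * 1000:.0f} ms, parse: {(t2 - t1) * 1000:.0f} ms")
    ```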

    Nor do I believe that using Scrapy or threading will solve the problem, because the real problem is the expectation that it should be fast.

    Welcome to the world of HTTP. Sometimes it will be slow, sometimes it will be very fast. A few reasons a connection can be slow:

    • the server handling your request (it may even return a 404 sometimes),
    • DNS resolution,
    • the HTTP handshake,
    • your ISP's connection stability,
    • your bandwidth,
    • your packet loss rate,

    etc.

    Don't forget, you are making 121 consecutive HTTP requests to a server, and you don't know what kind of servers they are running. They might also ban your IP address because of the repeated calls.
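
    Since the question asks about threads: the work is I/O-bound, so a small thread pool can overlap the waiting, and keeping the worker count low reduces the risk of a ban. A minimal sketch with concurrent.futures; the URL pattern and country list are placeholders:

    ```python
    # Hypothetical sketch: fetch pages with a small, polite thread pool.
    from concurrent.futures import ThreadPoolExecutor

    import requests

    countries = ["france", "germany", "spain"]  # stand-in for the 121 countries
    urls = [f"https://example.com/country/{c}" for c in countries]

    def fetch(url):
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return url, resp.text

    # max_workers=5 keeps the load on the server modest.
    with ThreadPoolExecutor(max_workers=5) as pool:
        for url, html in pool.map(fetch, urls):
            print(url, len(html))
    ```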

    Take a look at the Requests library and read its documentation. If you're doing this to learn more Python, don't jump straight into Scrapy.
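
    A minimal sketch of the Requests-plus-BeautifulSoup combination: a Session reuses the underlying connection (keep-alive), so the DNS lookup and handshake from the list above are paid once rather than 121 times. The URL pattern and extraction are placeholders:

    ```python
    # Hypothetical sketch: one Session reuses the TCP connection across requests.
    import requests
    from bs4 import BeautifulSoup

    countries = ["france", "germany", "spain"]  # stand-in for the 121 countries

    with requests.Session() as session:
        for country in countries:
            resp = session.get(f"https://example.com/country/{country}", timeout=10)
            resp.raise_for_status()
            soup = BeautifulSoup(resp.text, "html.parser")
            # Real extraction depends on the page; the title is just a demo.
            print(country, soup.title.string if soup.title else "no title")
    ```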
