Multithreading to Scrape Yahoo Finance

Posted by 纵然是瞬间 on 2021-02-08 11:49:44

Question


I'm running a program to pull some info from Yahoo! Finance. It runs fine as a for loop, but it takes a long time (about 10 minutes for 7,000 inputs) because it has to process each requests.get(url) call individually (or am I mistaken about the main bottleneck?).

Anyway, I came across multithreading as a potential solution. This is what I have tried:

import requests
import pprint
import threading

with open('MFTop30MinusAFew.txt', 'r') as ins: #input file for tickers
    for line in ins:
        ticker_array = ins.read().splitlines()

ticker = ticker_array
url_array = []
url_data = []
data_array =[]

for i in ticker:
    url = 'https://query2.finance.yahoo.com/v10/finance/quoteSummary/'+i+'?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents&corsDomain=finance.yahoo.com'
    url_array.append(url) #loading each complete url at one time 

def fetch_data(url):
    urlHandler = requests.get(url)
    data = urlHandler.json()
    data_array.append(data)

pprint.pprint(data_array)

threads = [threading.Thread(target=fetch_data, args=(url,)) for url in url_array]

for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

fetch_data(url_array)

The error I get is InvalidSchema: No connection adapters were found for '['https://query2.finance.... [url continues].

PS. I've also read that using a multithreaded approach to scrape websites is bad and can get you blocked. Would Yahoo! Finance mind if I pull data for a couple thousand tickers at once? Nothing happened when I did them sequentially.


Answer 1:


If you look carefully at the error, you will notice that it shows not one URL but all the URLs you appended, enclosed in brackets. The last line of your code calls fetch_data with the full array as its argument, which doesn't make sense because requests.get expects a single URL. If you remove this last line, the code runs just fine and your threads are called as expected.
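
Beyond the stray final call, two other details in the question's code are worth a look: pprint.pprint(data_array) runs before any thread has started, so it prints an empty list, and combining `for line in ins:` with `ins.read()` can drop the first ticker. The following is a minimal sketch of one way to address all three, not the asker's or answerer's exact code. It swaps the hand-made threading.Thread list for a bounded concurrent.futures.ThreadPoolExecutor. The names read_tickers, fetch_all and main, the max_workers value and the timeout are assumptions made for illustration. The URL template is copied from the question, and its crumb value may no longer be valid.

```python
import pprint
from concurrent.futures import ThreadPoolExecutor

import requests

# URL template copied from the question; the endpoint and the crumb value
# are taken as-is and may no longer be accepted by the server.
URL_TEMPLATE = (
    "https://query2.finance.yahoo.com/v10/finance/quoteSummary/{ticker}"
    "?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US"
    "&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents"
    "&corsDomain=finance.yahoo.com"
)


def read_tickers(path):
    """Return the non-empty lines of the ticker file, one ticker per line."""
    with open(path, "r") as ins:
        return [line.strip() for line in ins if line.strip()]


def fetch_data(url):
    """Download a single URL and return its decoded JSON body."""
    response = requests.get(url, timeout=10)  # timeout is an assumed value
    return response.json()


def fetch_all(urls, fetch=fetch_data, max_workers=8):
    """Call fetch once per URL from a bounded pool of worker threads.

    Results are returned in the same order as urls. max_workers=8 is an
    arbitrary illustrative limit, not a value taken from the question.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def main():
    tickers = read_tickers("MFTop30MinusAFew.txt")
    url_array = [URL_TEMPLATE.format(ticker=t) for t in tickers]
    data_array = fetch_all(url_array)
    # Print only after every request has finished.
    pprint.pprint(data_array)
```

Calling main() reads the ticker file, fetches every URL and prints the collected results. Because fetch_all returns the results instead of appending to a shared list, the worker threads do not modify any shared state, and capping max_workers keeps the number of simultaneous requests small instead of starting one thread per ticker.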



Source: https://stackoverflow.com/questions/39358982/multithreading-to-scrape-yahoo-finance
