Question
I'll start by saying I'm not endorsing scraping of sites whose terms of service don't allow it; this is purely academic research into hypothetically gathering financial data from various websites.
If one wanted to look at this link:
https://finviz.com/screener.ashx?v=141&f=geo_usa,ind_stocksonly,sh_avgvol_o100,sh_price_o1&o=ticker
...which is stored in a URLs.csv file, and wanted to scrape columns 2-5 (i.e., Ticker, Perf Week, Perf Month, Perf Quarter) and export that to a CSV file, what might the code look like?
Working from another user's answer to a past question of mine, so far I have something that looks like this:
from bs4 import BeautifulSoup
import requests
import csv, random, time

# Open 'URLs.csv' to read the list of URLs
with open('URLs.csv', newline='') as f_urls, open('Results.csv', 'w', newline='') as f_output:
    csv_urls = csv.reader(f_urls)
    csv_output = csv.writer(f_output, delimiter=',')

    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'

    csv_output.writerow(['Ticker', 'Perf Week', 'Perf Month', 'Perf Quarter'])

    # Loop over each URL/row in the .csv
    for line in csv_urls:
        # Fetch the URL and look for items
        page = requests.get(line[0])
        soup = BeautifulSoup(page.text, 'html.parser')
        symbol = soup.findAll('a', {'class': 'screener-link-primary'})
        perfdata = soup.findAll('a', {'class': 'screener-link'})
        lines = list(zip(perfdata, symbol))

        # pair up every two entries
        for perfdata1, symbol1 in zip(lines[1::2], lines[::2]):
            # extract string items
            a1, a2, a3, _ = (x.text for x in symbol1 + perfdata1)
            # reorder and write row
            row = a1, a2, a3
            print(row)
            csv_output.writerow(row)
...I get the following output:
('1', 'A', '7.52%')
('-0.94%', 'AABA', '5.56%')
('10.92%', 'AAL', '-0.58%')
('4.33%', 'AAOI', '2.32%')
('2.96%', 'AAP', '1.80')
('2.83M', 'AAT', '0.43')
('70.38', 'AAXN', '0.69%')
...
So it's skipping some rows and not returning the data in the right order. I would like to see in my final output:
('A', '7.52%', '-0.94%', '5.56%')
('AA', '0.74%', '0.42%', '-20.83%')
('AABA', '7.08%', '0.50%', '7.65%')
('AAC', '31.18%', '-10.95%', '-65.14%')
...
I know the last section of the code is what's incorrect, but I'm looking for some guidance. Thanks!
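For what it's worth, the misalignment can be reproduced offline. The HTML below is a hypothetical stand-in for the screener table (not finviz's real markup): each row has one `screener-link-primary` ticker cell but several `screener-link` cells, so zipping the two flat lists and pairing every two entries drifts out of sync with the rows.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: 1 ticker cell but 3 perf cells per row.
html = """
<table>
  <tr><td><a class="screener-link-primary">A</a></td>
      <td><a class="screener-link">7.52%</a></td>
      <td><a class="screener-link">-0.94%</a></td>
      <td><a class="screener-link">5.56%</a></td></tr>
  <tr><td><a class="screener-link-primary">AA</a></td>
      <td><a class="screener-link">0.74%</a></td>
      <td><a class="screener-link">0.42%</a></td>
      <td><a class="screener-link">-20.83%</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
symbols = soup.find_all('a', {'class': 'screener-link-primary'})
perfdata = soup.find_all('a', {'class': 'screener-link'})

# 2 tickers vs 6 perf cells: pairing every two perf entries with one
# ticker leaves a leftover cell that bleeds into the next row's output.
print(len(symbols), len(perfdata))  # prints: 2 6
```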
Answer 1:
The problem is that you're only extracting the Ticker column plus an arbitrary cell (`.screener-link`); extract whole table rows instead.
for line in csv_urls:
    # Fetch the URL and look for items
    page = requests.get(line[0])
    soup = BeautifulSoup(page.text, 'html.parser')
    rows = soup.select('table[bgcolor="#d3d3d3"] tr')

    for row in rows[1:]:
        # extract string items from columns 2-5
        a1, a2, a3, a4 = (x.text for x in row.find_all('td')[1:5])
        row = a1, a2, a3, a4
        print(row)
        # write row
        csv_output.writerow(row)
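The row-based extraction can be sanity-checked offline on a small fragment shaped like the screener table (the `bgcolor` attribute and column layout here are assumptions mirroring the selector above, not a copy of finviz's markup):

```python
from bs4 import BeautifulSoup

# Stand-in for the screener table: header row, then one row per ticker.
html = """
<table bgcolor="#d3d3d3">
  <tr><td>No.</td><td>Ticker</td><td>Perf Week</td><td>Perf Month</td><td>Perf Quarter</td></tr>
  <tr><td>1</td><td>A</td><td>7.52%</td><td>-0.94%</td><td>5.56%</td></tr>
  <tr><td>2</td><td>AA</td><td>0.74%</td><td>0.42%</td><td>-20.83%</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
rows = soup.select('table[bgcolor="#d3d3d3"] tr')

extracted = []
for row in rows[1:]:  # skip the header row
    # columns 2-5: Ticker, Perf Week, Perf Month, Perf Quarter
    ticker, wk, mo, qt = (td.text for td in row.find_all('td')[1:5])
    extracted.append((ticker, wk, mo, qt))

print(extracted)
# [('A', '7.52%', '-0.94%', '5.56%'), ('AA', '0.74%', '0.42%', '-20.83%')]
```

Because every cell comes from the same `<tr>`, the values can never drift out of alignment the way the two flat `findAll` lists did.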
Output:
('A', '7.52%', '-0.94%', '5.56%')
('AA', '0.74%', '0.42%', '-20.83%')
('AABA', '7.08%', '0.50%', '7.65%')
('AAC', '31.18%', '-10.95%', '-65.14%')
('AAL', '-0.75%', '-6.74%', '0.60%')
('AAN', '5.68%', '6.51%', '-6.55%')
('AAOI', '5.47%', '-17.10%', '-23.12%')
('AAON', '0.62%', '1.10%', '8.58%')
('AAP', '0.38%', '-3.85%', '-2.30%')
('AAPL', '2.72%', '-9.69%', '-29.61%')
('AAT', '3.26%', '-2.39%', '10.74%')
('AAWW', '15.87%', '1.55%', '-9.62%')
('AAXN', '7.48%', '11.85%', '-14.24%')
('AB', '1.32%', '6.67%', '-2.73%')
('ABBV', '-0.85%', '0.16%', '-5.12%')
('ABC', '3.15%', '-7.18%', '-15.72%')
('ABCB', '5.23%', '-3.31%', '-22.35%')
('ABEO', '1.71%', '-10.41%', '-28.81%')
('ABG', '1.71%', '8.95%', '12.70%')
('ABM', '7.09%', '26.92%', '5.90%')
Answer 2:
This is just my preference, but for reading and writing CSV I like using Pandas.
I'm also assuming each link in your list has the same table structure. If that's not the case, I might need to see a few of the links to make this more robust. For the one link you provided, this produces the desired output.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv, random, time

# Read in the csv
csv_df = pd.read_csv('URLs.csv')

# Create a list from the column with the urls. Change the column name to whatever you have it named in the csv file
csv_urls = list(csv_df['NAME OF COLUMN WITH URLS'])

########### delete this line below. This is for me to test ####################
csv_urls = ['https://finviz.com/screener.ashx?v=141&f=geo_usa,ind_stocksonly,sh_avgvol_o100,sh_price_o1&o=ticker']
###############################################################################

headers = requests.utils.default_headers()
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'

result = pd.DataFrame()
for url in csv_urls:
    tables = pd.read_html(url)
    for dataframe in tables:
        # I'm assuming the tables are the same across links; otherwise this won't work for all of them.
        # The table you're interested in is the one with 16 columns
        if len(dataframe.columns) == 16:
            table = dataframe
        else:
            continue

        # Make the first row the column headers, then keep columns 2-5
        table.columns = table.iloc[0, :]
        table = table.iloc[1:, 1:5]
        result = result.append(table)

result.to_csv('Results.csv', index=False)
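The header-promotion and column-slicing steps can be checked offline too. This sketch builds a hypothetical 16-column table in memory (so the `len(dataframe.columns) == 16` filter above would select it) and applies the same two `iloc` operations; the filler column names are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical 16-column table: the first 5 columns mimic the screener's
# layout, the remaining 11 are filler so the column count matches.
header = ['No.', 'Ticker', 'Perf Week', 'Perf Month', 'Perf Quarter'] + [f'Col{i}' for i in range(11)]
row1 = ['1', 'A', '7.52%', '-0.94%', '5.56%'] + ['x'] * 11
html = '<table><tr>{}</tr><tr>{}</tr></table>'.format(
    ''.join(f'<td>{c}</td>' for c in header),
    ''.join(f'<td>{c}</td>' for c in row1),
)

# With only <td> cells, read_html treats every row as data,
# so the header text arrives as the first data row.
table = pd.read_html(io.StringIO(html))[0]
table.columns = table.iloc[0, :]   # promote the first row to column headers
table = table.iloc[1:, 1:5]        # keep Ticker through Perf Quarter

print(table.values.tolist())
# [['A', '7.52%', '-0.94%', '5.56%']]
```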
Output:
print (result)
0 Ticker Perf Week Perf Month Perf Quart
1 A 7.52% -0.94% 5.56%
2 AA 0.74% 0.42% -20.83%
3 AABA 7.08% 0.50% 7.65%
4 AAC 31.18% -10.95% -65.14%
5 AAL -0.75% -6.74% 0.60%
6 AAN 5.68% 6.51% -6.55%
7 AAOI 5.47% -17.10% -23.12%
8 AAON 0.62% 1.10% 8.58%
9 AAP 0.38% -3.85% -2.30%
10 AAPL 2.72% -9.69% -29.61%
11 AAT 3.26% -2.39% 10.74%
12 AAWW 15.87% 1.55% -9.62%
13 AAXN 7.48% 11.85% -14.24%
14 AB 1.32% 6.67% -2.73%
15 ABBV -0.85% 0.16% -5.12%
16 ABC 3.15% -7.18% -15.72%
17 ABCB 5.23% -3.31% -22.35%
18 ABEO 1.71% -10.41% -28.81%
19 ABG 1.71% 8.95% 12.70%
20 ABM 7.09% 26.92% 5.90%
Source: https://stackoverflow.com/questions/54165551/scrape-finviz-page-for-specific-values-in-table