Question
I am trying to add information scraped from a website into columns. I have a dataset that looks like:
COL1 COL2 COL3
... ... bbc.co.uk
and I would like to have a dataset which includes new columns:
COL1  COL2  COL3       Website Address  Last Analysis  Blacklist Status  IP Address  Server Location  City  Region
...   ...   bbc.co.uk
These new columns come from this website: https://www.urlvoid.com/scan/bbc.co.uk. I would need to fill each column with its related information.
For example:
COL1  COL2  COL3       Website Address  Last Analysis  Blacklist Status  Domain Registration        IP Address     Server Location     City     Region
...   ...   bbc.co.uk  Bbc.co.uk        9 days ago     0/35              1996-08-01 | 24 years ago  151.101.64.81  (US) United States  Unknown  Unknown
Unfortunately I am having some issues creating the new columns and filling them with the information scraped from the website. I might have more websites to check, not only bbc.co.uk. Please see the code I used below. I am sure there is a better (and less convoluted) approach to do this. I would be really grateful if you could help me figure it out. Thanks
EDIT:
As shown in the example above, to the existing dataset (which already includes the three columns col1, col2 and col3) I should also add the fields that come from the scraping (Website Address, Last Analysis, Blacklist Status, ...). For each url I should then have the information related to it (e.g. bbc.co.uk in the example).
COL1  COL2  COL3               Website Address    Last Analysis  Blacklist Status  Domain Registration        IP Address     Server Location     City     Region
...   ...   bbc.co.uk          Bbc.co.uk          9 days ago     0/35              1996-08-01 | 24 years ago  151.101.64.81  (US) United States  Unknown  Unknown
...   ...   stackoverflow.com  Stackoverflow.com  7 days ago     0/35              2003-12-26 | 17 years ago  ...            ...                 ...      ...
...   ...   ...
(the formatting is not great, but it should be enough to give you an idea of the expected output).
Updated code:
import requests
from bs4 import BeautifulSoup

urls = ['bbc.co.uk', 'stackoverflow.com', ...]
for x in urls:
    print(x)
    r = requests.get('https://www.urlvoid.com/scan/' + x)
    soup = BeautifulSoup(r.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    dat = tab[0].select('tr')
    for d in dat:
        row = d.select('td')
        original_dataset[row[0].text] = row[1].text
Unfortunately I am doing something wrong, as the information from the first url checked (i.e. bbc.co.uk) gets copied over all the rows under the new columns.
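To illustrate the symptom on a toy frame (a minimal sketch with made-up data; assigning a single value to a DataFrame column broadcasts it to every row, which is why every row ends up with the same scraped values):

import pandas as pd

# toy frame standing in for original_dataset (made-up data)
demo = pd.DataFrame({'COL3': ['bbc.co.uk', 'stackoverflow.com']})
# assigning a single string creates the new column but fills every row with it
demo['Website Address'] = 'Bbc.co.uk'
print(demo)  # both rows now show 'Bbc.co.uk' under 'Website Address'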
Answer 1:
Let me know if this is what you are looking for:
import pandas as pd

cols = ['Col1', 'Col2']
rows = ['something', 'something else']
my_df = pd.DataFrame(rows, index=cols).transpose()
my_df
Picking up your existing code from this line:
dat = tab[0].select('tr')
add:
for d in dat:
    row = d.select('td')
    my_df[row[0].text] = row[1].text

my_df
Output (sorry about the formatting):
Col1 Col2 Website Address Last Analysis Blacklist Status Domain Registration Domain Information IP Address Reverse DNS ASN Server Location Latitude\Longitude City Region
0 something something else Bbc.com 11 days ago | Rescan 0/35 1989-07-15 | 31 years ago WHOIS Lookup | DNS Records | Ping 151.101.192.81 Find Websites | IPVoid | ... Unknown AS54113 FASTLY (US) United States 37.751 / -97.822 Google Map Unknown Unknown
Edit:
To do it with multiple urls, try something like this:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

urls = ['bbc.com', 'stackoverflow.com']
ares = []
for u in urls:
    url = 'https://www.urlvoid.com/scan/' + u
    r = requests.get(url)
    ares.append(r)

rows = []
cols = []
for ar in ares:
    soup = bs(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    dat = tab[0].select('tr')
    line = []
    for d in dat:
        row = d.select('td')
        line.append(row[1].text)
        new_header = row[0].text
        if new_header not in cols:
            cols.append(new_header)
    rows.append(line)

my_df = pd.DataFrame(rows, columns=cols)
my_df
Output:
Website Address Last Analysis Blacklist Status Domain Registration Domain Information IP Address Reverse DNS ASN Server Location Latitude\Longitude City Region
0 Bbc.com 12 days ago | Rescan 0/35 1989-07-15 | 31 years ago WHOIS Lookup | DNS Records | Ping 151.101.192.81 Find Websites | IPVoid | ... Unknown AS54113 FASTLY (US) United States 37.751 / -97.822 Google Map Unknown Unknown
1 Stackoverflow.com 5 minutes ago | Rescan 0/35 2003-12-26 | 17 years ago WHOIS Lookup | DNS Records | Ping 151.101.1.69 Find Websites | IPVoid | Whois Unknown AS54113 FASTLY (US) United States 37.751 / -97.822 Google Map Unknown Unknown
Note that this doesn't include your existing columns (since I don't know what they are), so you'll have to append them to the dataframe separately.
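If your existing dataset has one row per scanned url, in the same order as urls, one way to bolt those columns back on is a column-wise concat (a sketch, assuming original_dataset is the frame from the question and its row order matches urls):

# sketch: original_dataset is assumed to have one row per url, same order as `urls`
combined = pd.concat(
    [original_dataset.reset_index(drop=True), my_df.reset_index(drop=True)],
    axis=1,
)
combined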
Answer 2:
You can fetch the data in a much simpler way using pandas' read_html method. Here is my shot:
import pandas as pd
df = pd.read_html("https://www.urlvoid.com/scan/bbc.co.uk/")[0]
df_transpose = df.T
Now you have the required transposed data. You can drop any unwanted columns if you like. After that, all you have to do is concatenate it with your existing dataset. Assuming you can load your dataset as a pandas dataframe, you can simply use the concat function (axis=1 concatenates as columns):
pd.concat([df_transpose, existing_dataset], axis=1)
See the pandas docs on merging/concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
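If you have several urls to scan, the same read_html idea can go in a loop. A rough sketch (assuming each scan page exposes the field/value summary as its first table, and reusing the urls list from the question):

import pandas as pd

urls = ['bbc.co.uk', 'stackoverflow.com']
frames = []
for u in urls:
    # first table on the scan page holds the field/value pairs
    raw = pd.read_html("https://www.urlvoid.com/scan/" + u + "/")[0]
    t = raw.T                    # transpose: field names on row 0, values on row 1
    t.columns = t.iloc[0]        # promote the field names to column headers
    frames.append(t.iloc[[1]])   # keep only the row of values
scraped = pd.concat(frames, ignore_index=True)
# then: pd.concat([existing_dataset.reset_index(drop=True), scraped], axis=1)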
Source: https://stackoverflow.com/questions/61037401/creating-new-columns-by-scraping-information