Question
I am trying to add information scraped from a website into columns. I have a dataset that looks like:
COL1 COL2 COL3
... ... bbc.co.uk
and I would like to have a dataset which includes new columns:
COL1  COL2  COL3       Website Address  Last Analysis  Blacklist Status  IP Address  Server Location  City  Region
...   ...   bbc.co.uk
These new columns come from this website: https://www.urlvoid.com/scan/bbc.co.uk. I would need to fill each column with its related information.
For example:
COL1  COL2  COL3       Website Address  Last Analysis  Blacklist Status  Domain Registration        IP Address     Server Location     City     Region
...   ...   bbc.co.uk  Bbc.co.uk        9 days ago     0/35              1996-08-01 | 24 years ago  151.101.64.81  (US) United States  Unknown  Unknown
Unfortunately I am having some issues creating the new columns and filling them with the information scraped from the website. I might have more websites to check, not only bbc.co.uk. Please see the code I used below. I am sure there is a better (and less convoluted) approach to do this. I would be really grateful if you could help me figure it out. Thanks
EDIT:
As shown in the example above, to the existing dataset (which already includes the three columns col1, col2 and col3) I should also add the fields that come from the scraping (Website Address, Last Analysis, Blacklist Status, ...). For each url I should then have the information related to it (e.g. bbc.co.uk in the example).
COL1  COL2  COL3               Website Address    Last Analysis  Blacklist Status  Domain Registration        IP Address     Server Location     City     Region
...   ...   bbc.co.uk          Bbc.co.uk          9 days ago     0/35              1996-08-01 | 24 years ago  151.101.64.81  (US) United States  Unknown  Unknown
...   ...   stackoverflow.com  Stackoverflow.com  7 days ago     0/35              2003-12-26 | 17 years ago  ...            ...                 ...      ...
...   ...   ...
(the formatting is not great, but it should be enough to give you an idea of the expected output).
Updated code:
import requests
from bs4 import BeautifulSoup

urls = ['bbc.co.uk', 'stackoverflow.com', ...]
for x in urls:
    print(x)
    r = requests.get('https://www.urlvoid.com/scan/' + x)
    soup = BeautifulSoup(r.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    dat = tab[0].select('tr')
    for d in dat:
        row = d.select('td')
        original_dataset[row[0].text] = row[1].text
Unfortunately I am doing something wrong, as the information from the first url checked (i.e. bbc.co.uk) gets copied over all the rows under the new columns.
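To illustrate the symptom on a toy frame (a minimal sketch with made-up data; assigning a single value to a DataFrame column broadcasts it to every row, which is why every row ends up with the same scraped values):

import pandas as pd

# toy frame standing in for original_dataset (made-up data)
demo = pd.DataFrame({'COL3': ['bbc.co.uk', 'stackoverflow.com']})
# assigning a single string creates the new column but fills every row with it
demo['Website Address'] = 'Bbc.co.uk'
print(demo)  # both rows now show 'Bbc.co.uk' under 'Website Address'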
Answer 1:
Let me know if this is what you are looking for:
import pandas as pd

cols = ['Col1', 'Col2']
rows = ['something', 'something else']
my_df = pd.DataFrame(rows, index=cols).transpose()
my_df
Picking up your existing code from this line:
dat = tab[0].select('tr')
add:
for d in dat:
    row = d.select('td')
    my_df[row[0].text] = row[1].text

my_df
Output (sorry about the formatting):
Col1 Col2 Website Address Last Analysis Blacklist Status Domain Registration Domain Information IP Address Reverse DNS ASN Server Location Latitude\Longitude City Region
0 something something else Bbc.com 11 days ago | Rescan 0/35 1989-07-15 | 31 years ago WHOIS Lookup | DNS Records | Ping 151.101.192.81 Find Websites | IPVoid | ... Unknown AS54113 FASTLY (US) United States 37.751 / -97.822 Google Map Unknown Unknown
Edit:
To do it with multiple urls, try something like this:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

urls = ['bbc.com', 'stackoverflow.com']
ares = []
for u in urls:
    url = 'https://www.urlvoid.com/scan/' + u
    r = requests.get(url)
    ares.append(r)

rows = []
cols = []
for ar in ares:
    soup = bs(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    dat = tab[0].select('tr')
    line = []
    for d in dat:
        row = d.select('td')
        line.append(row[1].text)
        new_header = row[0].text
        if new_header not in cols:
            cols.append(new_header)
    rows.append(line)

my_df = pd.DataFrame(rows, columns=cols)
my_df
Output:
Website Address Last Analysis Blacklist Status Domain Registration Domain Information IP Address Reverse DNS ASN Server Location Latitude\Longitude City Region
0 Bbc.com 12 days ago | Rescan 0/35 1989-07-15 | 31 years ago WHOIS Lookup | DNS Records | Ping 151.101.192.81 Find Websites | IPVoid | ... Unknown AS54113 FASTLY (US) United States 37.751 / -97.822 Google Map Unknown Unknown
1 Stackoverflow.com 5 minutes ago | Rescan 0/35 2003-12-26 | 17 years ago WHOIS Lookup | DNS Records | Ping 151.101.1.69 Find Websites | IPVoid | Whois Unknown AS54113 FASTLY (US) United States 37.751 / -97.822 Google Map Unknown Unknown
Note that this doesn't include your existing columns (since I don't know what they are), so you'll have to append them to the dataframe separately.
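If your existing dataset has one row per scanned url, in the same order as urls, one way to bolt those columns back on is a column-wise concat (a sketch, assuming original_dataset is the frame from the question and its row order matches urls):

# sketch: original_dataset is assumed to have one row per url, same order as `urls`
combined = pd.concat(
    [original_dataset.reset_index(drop=True), my_df.reset_index(drop=True)],
    axis=1,
)
combined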
Answer 2:
You can fetch the data in a much simpler way using pandas' read_html method. Here is my shot:
import pandas as pd
df = pd.read_html("https://www.urlvoid.com/scan/bbc.co.uk/")[0]
df_transpose = df.T
Now you have the required transposed data. You can drop any unwanted columns if you like. After that, all you have to do is concatenate it with your existing dataset. Assuming you can load your dataset as a pandas dataframe, you can simply use the concat function (axis=1 concatenates as columns):
pd.concat([df_transpose, existing_dataset], axis=1)
See the pandas docs on merging/concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
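If you have several urls to scan, the same read_html idea can go in a loop. A rough sketch (assuming each scan page exposes the field/value summary as its first table, and reusing the urls list from the question):

import pandas as pd

urls = ['bbc.co.uk', 'stackoverflow.com']
frames = []
for u in urls:
    # first table on the scan page holds the field/value pairs
    raw = pd.read_html("https://www.urlvoid.com/scan/" + u + "/")[0]
    t = raw.T                    # transpose: field names on row 0, values on row 1
    t.columns = t.iloc[0]        # promote the field names to column headers
    frames.append(t.iloc[[1]])   # keep only the row of values
scraped = pd.concat(frames, ignore_index=True)
# then: pd.concat([existing_dataset.reset_index(drop=True), scraped], axis=1)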
Source: https://stackoverflow.com/questions/61037401/creating-new-columns-by-scraping-information