How to preserve links when scraping a table with Beautiful Soup and pandas


Question


Scraping a web page to get a table, using Beautiful Soup and pandas. One of the columns contains URLs, and when I pass the HTML to pandas the hrefs are lost.

Is there any way of preserving the URL link, just for that column?

Example data (edited to better suit the real case):

  <html>
    <body>
      <table>
        <tr>
          <td>customer</td>
          <td>country</td>
          <td>area</td>
          <td>website link</td>
        </tr>
        <tr>
          <td>IBM</td>
          <td>USA</td>
          <td>EMEA</td>
          <td><a href="http://www.ibm.com">IBM site</a></td>
        </tr>
        <tr>
          <td>CISCO</td>
          <td>USA</td>
          <td>EMEA</td>
          <td><a href="http://www.cisco.com">cisco site</a></td>
        </tr>
        <tr>
          <td>unknown company</td>
          <td>USA</td>
          <td>EMEA</td>
          <td></td>
        </tr>
      </table>
    </body>
  </html>

My Python code:

    from bs4 import BeautifulSoup
    import pandas as pd

    # url is actually the path to a local HTML file
    with open(url, "r") as file:
        soup = BeautifulSoup(file, 'lxml')

    # pick the right table; with the example above it is the first one
    parsed_table = soup.find_all('table')[0]

    df = pd.read_html(str(parsed_table), encoding='utf-8')[0]
    df

Output (exported to CSV):

customer;country;area;website link
IBM;USA;EMEA;IBM site
CISCO;USA;EMEA;cisco site
unknown company;USA;EMEA;

The df output is fine, but the links are lost. I need to preserve the links, or at least the bare URLs.

Any hint?


Answer 1:


Just check whether the tag has an href, this way:

 import bs4 as bs
 import numpy as np
 import pandas as pd

 with open(url, "r") as f:
     sp = bs.BeautifulSoup(f, 'lxml')
     tb = sp.find_all('table')[0]  # index of the target table; [56] on the asker's original page
     df = pd.read_html(str(tb), encoding='utf-8', header=0)[0]
     # one entry per <a> tag found anywhere in the table
     df['href'] = [np.where(tag.has_attr('href'), tag.get('href'), "no link")
                   for tag in tb.find_all('a')]
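
One caveat: the comprehension yields one value per <a> tag in the table, so a row without a link (like the "unknown company" row in the example) leaves the list shorter than the frame and the assignment fails. A minimal sketch that walks the data rows instead, assuming each row carries at most one link:

 # one href (or "no link") per data row, so the lengths always match
 links = []
 for row in tb.find_all('tr')[1:]:  # skip the header row
     anchor = row.find('a')
     links.append(anchor['href'] if anchor else "no link")
 df['href'] = links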



Answer 2:


pd.read_html assumes the data you are interested in is in the text, not the tag attributes. However, it isn't hard to scrape the table yourself:

import bs4 as bs
import pandas as pd

with open(url, "r") as f:
    soup = bs.BeautifulSoup(f, 'lxml')
    parsed_table = soup.find_all('table')[0]  # the first (and only) table in the example
    # take the href when a cell contains a link, its stripped text otherwise
    data = [[td.a['href'] if td.find('a') else
             ''.join(td.stripped_strings)
             for td in row.find_all('td')]
            for row in parsed_table.find_all('tr')]
    # first row is the header, the rest are data
    df = pd.DataFrame(data[1:], columns=data[0])
    print(df)

yields

          customer country  area          website link
0              IBM     USA  EMEA    http://www.ibm.com
1            CISCO     USA  EMEA  http://www.cisco.com
2  unknown company     USA  EMEA                      
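
A side note for newer pandas versions: since pandas 1.5, read_html has an extract_links argument that keeps the hrefs without scraping the table by hand. A minimal sketch against the example table (the link-in-last-column layout is an assumption from the example; every body cell comes back as a (text, href) tuple, with href None where there is no link):

import pandas as pd

# pandas >= 1.5 only: body cells become (text, href) tuples
df = pd.read_html(str(parsed_table), extract_links="body")[0]

urls = df[df.columns[-1]].str[1]        # assumed: the link sits in the last column
df = df.apply(lambda col: col.str[0])   # keep the display text everywhere
df['url'] = urls                        # ...and the raw URL in its own column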



Answer 3:


Here is another way to do it if you have more than one link to grab from an HTML table. Instead of a list comprehension I would rather go with separate for loops, so the code is more readable for those who are new to Python, and it is easier to adjust the code or handle errors if they emerge. I hope it will help someone.

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html, "lxml")
table = soup.find('table')
# column names from the <thead> (see the fallback below for tables without one)
thead = table.find('thead')
column_names = [th.text.strip() for th in thead.find_all('th')]

data = []
for row in table.find_all('tr'):
    row_data = []
    for td in row.find_all('td'):
        td_check = td.find('a')
        if td_check is not None:
            # the cell contains a link: keep the URL
            link = td.a['href']
            row_data.append(link)
        else:
            # plain cell: keep its text, or None when empty
            not_link = ''.join(td.stripped_strings)
            if not_link == '':
                not_link = None
            row_data.append(not_link)
    data.append(row_data)

df = pd.DataFrame(data[1:], columns=column_names)
df_dict = df.to_dict('records')

for row in df_dict:
    print(row)
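
Note that the question's example table has no <thead> and uses plain <td> cells for its header row, so thead.find_all('th') would find nothing there. A small fallback for that case, assuming the first row of the table holds the column names:

# fallback when the table has no <thead>: read the header from the first row
first_row = table.find('tr')
column_names = [cell.get_text(strip=True) for cell in first_row.find_all(['th', 'td'])]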


Source: https://stackoverflow.com/questions/42285417/how-to-preserve-links-when-scraping-a-table-with-beautiful-soup-and-pandas
