How to preserve links when scraping a table with Beautiful Soup and pandas


Question


Scraping a web page to get a table, using Beautiful Soup and pandas. One of the columns contains URLs, and when I pass the HTML to pandas the hrefs are lost.

Is there any way of preserving the URL link, just for that column?

Example data (edited to better suit the real case):

  <html>
    <body>
      <table>
        <tr>
          <td>customer</td>
          <td>country</td>
          <td>area</td>
          <td>website link</td>
        </tr>
        <tr>
          <td>IBM</td>
          <td>USA</td>
          <td>EMEA</td>
          <td><a href="http://www.ibm.com">IBM site</a></td>
        </tr>
        <tr>
          <td>CISCO</td>
          <td>USA</td>
          <td>EMEA</td>
          <td><a href="http://www.cisco.com">cisco site</a></td>
        </tr>
        <tr>
          <td>unknown company</td>
          <td>USA</td>
          <td>EMEA</td>
          <td></td>
        </tr>
      </table>
    </body>
  </html>

My Python code:

    from bs4 import BeautifulSoup
    import pandas as pd

    # url is actually the path to a local HTML file
    with open(url, "r") as file:
        soup = BeautifulSoup(file, 'lxml')

    # pick the right table; with the example above it is the first one
    parsed_table = soup.find_all('table')[0]

    df = pd.read_html(str(parsed_table), encoding='utf-8')[0]
    df

Output (exported to CSV):

customer;country;area;website link
IBM;USA;EMEA;IBM site
CISCO;USA;EMEA;cisco site
unknown company;USA;EMEA;

The df output is fine, but the links are lost. I need to preserve the links, or at least the bare URLs.

Any hint?


Answer 1:


Just check whether the tag has an href, this way:

 import bs4 as bs
 import numpy as np
 import pandas as pd

 with open(url, "r") as f:
     sp = bs.BeautifulSoup(f, 'lxml')
     tb = sp.find_all('table')[0]  # index of the target table; [56] on the asker's original page
     df = pd.read_html(str(tb), encoding='utf-8', header=0)[0]
     # one entry per <a> tag found anywhere in the table
     df['href'] = [np.where(tag.has_attr('href'), tag.get('href'), "no link")
                   for tag in tb.find_all('a')]
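
One caveat: the comprehension yields one value per <a> tag in the table, so a row without a link (like the "unknown company" row in the example) leaves the list shorter than the frame and the assignment fails. A minimal sketch that walks the data rows instead, assuming each row carries at most one link:

 # one href (or "no link") per data row, so the lengths always match
 links = []
 for row in tb.find_all('tr')[1:]:  # skip the header row
     anchor = row.find('a')
     links.append(anchor['href'] if anchor else "no link")
 df['href'] = links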



Answer 2:


pd.read_html assumes the data you are interested in is in the text, not the tag attributes. However, it isn't hard to scrape the table yourself:

import bs4 as bs
import pandas as pd

with open(url, "r") as f:
    soup = bs.BeautifulSoup(f, 'lxml')
    parsed_table = soup.find_all('table')[0]  # the first (and only) table in the example
    # take the href when a cell contains a link, its stripped text otherwise
    data = [[td.a['href'] if td.find('a') else
             ''.join(td.stripped_strings)
             for td in row.find_all('td')]
            for row in parsed_table.find_all('tr')]
    # first row is the header, the rest are data
    df = pd.DataFrame(data[1:], columns=data[0])
    print(df)

yields

          customer country  area          website link
0              IBM     USA  EMEA    http://www.ibm.com
1            CISCO     USA  EMEA  http://www.cisco.com
2  unknown company     USA  EMEA                      
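
A side note for newer pandas versions: since pandas 1.5, read_html has an extract_links argument that keeps the hrefs without scraping the table by hand. A minimal sketch against the example table (the link-in-last-column layout is an assumption from the example; every body cell comes back as a (text, href) tuple, with href None where there is no link):

import pandas as pd

# pandas >= 1.5 only: body cells become (text, href) tuples
df = pd.read_html(str(parsed_table), extract_links="body")[0]

urls = df[df.columns[-1]].str[1]        # assumed: the link sits in the last column
df = df.apply(lambda col: col.str[0])   # keep the display text everywhere
df['url'] = urls                        # ...and the raw URL in its own column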



Answer 3:


Here is another way to do it if you have more than one link to grab from an HTML table. Instead of a list comprehension I would rather go with separate for loops, so the code is more readable for those who are new to Python, and it is easier to adjust the code or handle errors if they emerge. I hope it will help someone.

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html, "lxml")
table = soup.find('table')
# column names from the <thead> (see the fallback below for tables without one)
thead = table.find('thead')
column_names = [th.text.strip() for th in thead.find_all('th')]

data = []
for row in table.find_all('tr'):
    row_data = []
    for td in row.find_all('td'):
        td_check = td.find('a')
        if td_check is not None:
            # the cell contains a link: keep the URL
            link = td.a['href']
            row_data.append(link)
        else:
            # plain cell: keep its text, or None when empty
            not_link = ''.join(td.stripped_strings)
            if not_link == '':
                not_link = None
            row_data.append(not_link)
    data.append(row_data)

df = pd.DataFrame(data[1:], columns=column_names)
df_dict = df.to_dict('records')

for row in df_dict:
    print(row)
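
Note that the question's example table has no <thead> and uses plain <td> cells for its header row, so thead.find_all('th') would find nothing there. A small fallback for that case, assuming the first row of the table holds the column names:

# fallback when the table has no <thead>: read the header from the first row
first_row = table.find('tr')
column_names = [cell.get_text(strip=True) for cell in first_row.find_all(['th', 'td'])]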


Source: https://stackoverflow.com/questions/42285417/how-to-preserve-links-when-scraping-a-table-with-beautiful-soup-and-pandas
