soup.findAll is not working for table

穿精又带淫゛_ submitted on 2019-12-24 02:02:30

Question


I am trying to parse this site https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017

using the following code

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import ssl
context = ssl._create_unverified_context()
dibbsurl = 'https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017'
uClient = uReq(dibbsurl, context=context)
dibbshtml = uClient.read()
uClient.close()

#html parser
dibbssoup = soup(dibbshtml, "html.parser")

#grabs each rfq
containers = dibbssoup.findAll("tr",{"Class":"Bgwhite"})

I want to grab the National Stock Numbers, the Nomenclature and QTY from the table for research purposes.

containers = dibbssoup.findAll("tr",{"Class":"Bgwhite"})

I was trying to grab each row of the table, but containers does not seem to be grabbing it. When I type len(containers) it shows 0. Why is the table not being grabbed, and how can I fix it?

Update: this is the sample HTML from the site.

<tr class="BgWhite">
    <td headers="th0" valign="top">
        1
    </td>
    <td headers="th1" style="width: 125px;" valign="top">
        <a href="https://www.dibbs.bsm.dla.mil/RFQ/RFQNsn.aspx?value=8465015550093&amp;category=issue&amp;Scope=" title="go to NSN view">8465-01-555-0093</a>
    </td>
    <td headers="th2" valign="top">
        SNAP LINK, RAPPELLER
    </td>
    <td headers="th3" valign="top">
        None
    </td>
    <td headers="th4" style="width: 150px;" valign="top">
        <a href="https://dibbs2.bsm.dla.mil/Downloads/RFQ/8/SPE1C117T2608.PDF" title="RFQ document" target="DIBBSDocuments">SPE1C1-17-T-2608</a><br>&nbsp;&nbsp;<span style="font-size: 9px; color: #505050;">» <a href="https://www.dibbs.bsm.dla.mil/rfq/rfqrec.aspx?sn=SPE1C117T2608" title="Package View" class="SubMenuLink">Package View</a></span><a href="https://www.dibbs.bsm.dla.mil/RFQ/RFQQHlp.aspx?ht=fi"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconFastPace.gif" alt="Fast Award Candidate.  Micro-purchase quotes may be awarded prior to the solicitation return date.  See Master Solicitation for Additional Info" width="14" height="11" hspace="0" border="0" align="middle"></a><br><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconEproc.gif" width="36" height="16" hspace="1" border="0" alt="DLA E-Procurement" style="border-width:0px;  vertical-align: bottom;">
    </td>
    <td headers="th5" valign="top">
        <span style="color:#000099">Open</span><br><a href="https://www.dibbs.bsm.dla.mil/RA/Quote/QuoteFrm.aspx?sn=SPE1C117T2608"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/buttons/btnQ.gif" width="18" height="18" border="0" alt="Click to submit Quote" hspace="1" align="bottom"></a><a href="https://www.dibbs.bsm.dla.mil/RA/Quote/QuoteFrm.aspx?sn=SPE1C117T2608"><span style="font-size: 9px;">uote</span></a>&nbsp;&nbsp;<img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconSpace1010.gif" alt=" " width="18" height="16" hspace="0" border="0">
    </td>
    <td headers="th6" valign="top">
        0070631319<br>QTY: 400
    </td>
    <td headers="th7" valign="top">
        09-07-2017
    </td>
    <td headers="th8" valign="top">
        09-18-2017
    </td>
</tr>

Answer 1:


I analyzed the site you want to scrape and found that it shows a warning page, similar to Terms and Conditions, that you need to agree to before you can view the content. To "agree", you have to submit a form. The solution therefore takes three fetches: get the warning page, submit its form, then request the page that contains the table.

I used requests and html5lib in this example because they are easy to use. You can install them using pip (pip install requests html5lib beautifulsoup4).

The last part parses the table, similar to what you did.
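One detail in that last step differs from the question's code, and it matters independently of the warning page: the HTML parsers lowercase attribute names, and class values are compared case-sensitively, so a filter on "Class": "Bgwhite" matches nothing even when the table is present. A minimal sketch on a trimmed copy of the sample row from the question (the sample_html string and the variable names are illustrative, not part of the original code):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample row posted in the question.
sample_html = """
<table>
<tr class="BgWhite">
    <td headers="th0" valign="top">1</td>
    <td headers="th2" valign="top">SNAP LINK, RAPPELLER</td>
</tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Filter as written in the question: wrong attribute name and wrong value case.
as_asked = sample_soup.find_all("tr", {"Class": "Bgwhite"})
# Correct attribute name, but the value still differs in case.
wrong_value = sample_soup.find_all("tr", {"class": "Bgwhite"})
# Attribute name and value exactly as they appear in the HTML.
corrected = sample_soup.find_all("tr", {"class": "BgWhite"})

print(len(as_asked), len(wrong_value), len(corrected))
```

Only the last filter finds the row. The full flow below uses that same filter.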

import requests
from bs4 import BeautifulSoup
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

request_headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
}

req = requests.Session()
warning_url = 'https://www.dibbs.bsm.dla.mil/dodwarning.aspx'

# get initial warning page
get_warning_page = req.get(warning_url, headers=request_headers, verify=False)
warning_soup = BeautifulSoup(get_warning_page.content, 'html5lib')

# collect the warning form's inputs to submit later (the T&C of the site that you need to agree to before proceeding)
payload = {}
for inp in warning_soup.find('form').find_all('input'):
    # inputs without a name attribute are not part of a form submission
    if inp.get('name'):
        payload[inp.get('name')] = inp.get('value')

# submit the warning form (means you already agreed on the T&C)
submit_warning_form = req.post(warning_url, headers=request_headers, data=payload, verify=False)

# lastly, navigate to the main page that contains the table
main_page = req.post('https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017', headers=request_headers, verify=False)

# parsing of table
dibbssoup = BeautifulSoup(main_page.content, 'html5lib')
#grabs each rfq
containers = dibbssoup.find_all("tr", {"class": "BgWhite"})

print(containers)
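To get from the matched rows to the National Stock Number, nomenclature and quantity asked about in the question, one possible approach is to pick cells by their headers attribute, as seen in the sample HTML. The sketch below runs on a shortened copy of that sample row; the parse_rfq_row helper and the assumption that th1, th2 and th6 always hold these three fields are mine, based only on that single sample row, so check them against the live page.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the sample row from the question (only the cells used here).
sample_html = """
<table>
<tr class="BgWhite">
    <td headers="th0" valign="top">
        1
    </td>
    <td headers="th1" style="width: 125px;" valign="top">
        <a href="https://www.dibbs.bsm.dla.mil/RFQ/RFQNsn.aspx?value=8465015550093&amp;category=issue&amp;Scope=" title="go to NSN view">8465-01-555-0093</a>
    </td>
    <td headers="th2" valign="top">
        SNAP LINK, RAPPELLER
    </td>
    <td headers="th6" valign="top">
        0070631319<br>QTY: 400
    </td>
</tr>
</table>
"""


def parse_rfq_row(row):
    """Hypothetical helper: extract NSN, nomenclature and quantity from one row."""
    nsn = row.find("td", attrs={"headers": "th1"}).get_text(strip=True)
    nomenclature = row.find("td", attrs={"headers": "th2"}).get_text(strip=True)
    # The quantity shares a cell with another number, separated by <br>.
    qty_text = row.find("td", attrs={"headers": "th6"}).get_text(" ", strip=True)
    match = re.search(r"QTY:\s*(\d+)", qty_text)
    qty = int(match.group(1)) if match else None
    return {"nsn": nsn, "nomenclature": nomenclature, "qty": qty}


sample_soup = BeautifulSoup(sample_html, "html.parser")
rows = [parse_rfq_row(tr) for tr in sample_soup.find_all("tr", {"class": "BgWhite"})]
print(rows)
```

With the real page, the same helper could be applied to each element of containers. A row that lacks one of these cells would raise an AttributeError in this sketch, so a guard may be needed there.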

If you have any questions or encounter errors, just let me know. If this solved your issue, please mark it as the answer. Thanks!



Source: https://stackoverflow.com/questions/46096671/soup-findall-is-not-working-for-table
