Displaying contents of web scrape

Question

The code below prints all the fields to the screen. Is there a way to get the fields alongside each other, as they would appear in a database or a spreadsheet? In the page source, the fields track, date, datetime, grade, distance and prizes are found in the div with class resultsBlockHeader, and Fin (finishing position), Greyhound, Trap, SP, timeSec and Time Distance are found in the div with class resultsBlock. I am trying to display them like this, all on one line: track,date,datetime,grade,distance,prizes,fin,greyhound,trap,sp,timeSec,timeDistance. Any help appreciated.

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen

from bs4 import BeautifulSoup

html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754")
bsObj = BeautifulSoup(html, 'lxml')

# (tag, class) pairs, in the order they are printed
fields = [
    ("div", "track"),
    ("div", "date"),
    ("div", "datetime"),
    ("div", "grade"),
    ("div", "distance"),
    ("div", "prizes"),
    ("li", "first essential fin"),
    ("li", "essential greyhound"),
    ("li", "trap"),
    ("li", "sp"),
    ("li", "timeSec"),
    ("li", "timeDistance"),
    ("li", "essential trainer"),
    ("li", "first essential comment"),
    ("div", "resultsBlockFooter"),
    ("li", "first essential"),
]

# Print every match for one field, then move on to the next field
for tag, css_class in fields:
    for name in bsObj.find_all(tag, {"class": css_class}):
        print(name.get_text())

Answer 1:

First of all, make sure you are not violating the website's Terms of Use - stay on the legal side.

The markup is not very easy to scrape, but what I would do is to iterate over the race headers and for every header, get the desired information about the race. Then, get the sibling results block and extract the rows. Sample code to get you started - extracts the track and the greyhound:

from pprint import pprint
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen

from bs4 import BeautifulSoup


html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754")
soup = BeautifulSoup(html, 'lxml')

rows = []
# Each race starts with a header block that holds the race-level fields.
for header in soup.find_all("div", class_="resultsBlockHeader"):
    track = header.find("div", class_="track").get_text(strip=True)

    # The results for that race are in the sibling block that follows the header.
    results = header.find_next_sibling("div", class_="resultsBlock").find_all("ul", class_="line1")
    for result in results:
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)

        # One flat record per runner: race fields plus runner fields.
        rows.append({
            "track": track,
            "greyhound": greyhound
        })

pprint(rows)
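
Once rows is a list of flat dictionaries like the one built above, getting the fields alongside each other is mostly a matter of writing one record per line. A minimal sketch using the standard csv module follows. The sample values and the rows_to_csv helper are made up for illustration and are not part of the original code.

```python
import csv
import io

# Hypothetical sample in the same shape as the `rows` list built above.
rows = [
    {"track": "Example Track", "greyhound": "Example Dog A"},
    {"track": "Example Track", "greyhound": "Example Dog B"},
]


def rows_to_csv(rows, fieldnames):
    """Return the rows as CSV text: a header line, then one record per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["track", "greyhound"])
print(csv_text)
```

Adding more keys to each dictionary (date, grade, trap and so on) and listing them in fieldnames should extend the same line with more columns.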

Note that every row you see in the tables is actually represented by 3 lines in the markup:

<ul class="contents line1">
   ...
</ul>
<ul class="contents line2">
   ...
</ul>
<ul class="contents line3">
   ...
</ul>

The greyhound value is inside the first ul (the one with the line1 class). To read fields from the other two, get them with result.find_next_sibling("ul", class_="line2") and result.find_next_sibling("ul", class_="line3"). Note the trailing underscore in class_: class is a reserved word in Python.
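
The sibling lookup can be tried without hitting the site by parsing a small inline snippet. This is only a sketch: the line1/line2/line3 nesting follows the markup shown above, but which field sits on which line (trainer on line2, comment on line3) is an assumption, and the sample values and the text_of helper are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the line1/line2/line3 nesting comes from the
# answer; which field lives on which line is assumed for illustration.
SAMPLE_HTML = """
<div class="resultsBlock">
  <ul class="contents line1">
    <li class="essential greyhound">Example Dog A</li>
    <li class="trap">1</li>
  </ul>
  <ul class="contents line2">
    <li class="essential trainer">Example Trainer A</li>
  </ul>
  <ul class="contents line3">
    <li class="first essential comment">Example comment A</li>
  </ul>
  <ul class="contents line1">
    <li class="essential greyhound">Example Dog B</li>
    <li class="trap">2</li>
  </ul>
  <ul class="contents line2">
    <li class="essential trainer">Example Trainer B</li>
  </ul>
  <ul class="contents line3">
    <li class="first essential comment">Example comment B</li>
  </ul>
</div>
"""


def text_of(parent, tag, css_class):
    """Return the stripped text of the first match, or "" if it is absent."""
    node = parent.find(tag, class_=css_class) if parent is not None else None
    return node.get_text(strip=True) if node is not None else ""


# html.parser ships with Python, so the sketch does not need lxml.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for line1 in soup.find_all("ul", class_="line1"):
    # The nearest following line2/line3 siblings belong to the same table row.
    line2 = line1.find_next_sibling("ul", class_="line2")
    line3 = line1.find_next_sibling("ul", class_="line3")
    rows.append({
        "greyhound": text_of(line1, "li", "greyhound"),
        "trap": text_of(line1, "li", "trap"),
        "trainer": text_of(line2, "li", "trainer"),
        "comment": text_of(line3, "li", "comment"),
    })

# Print each record on a single comma-separated line.
for row in rows:
    print(",".join(row.values()))
```

Returning an empty string for a missing element keeps one odd row from stopping the whole scrape. Whether that is preferable to raising an error depends on how the data will be used.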

Source: https://stackoverflow.com/questions/35229446/displaying-contents-of-web-scrape

Tags

python

html

beautifulsoup