HTML Parsing using bs4

Question

I am parsing an HTML page and am having a hard time figuring out how to pull a certain 'p' tag that has no class or id. I am trying to reach the 'p' tag that contains the lat and long. Here is my current code:

from urllib.request import urlopen as uReq  # opens the URL (Python 3 location of urlopen)
from bs4 import BeautifulSoup as soup  # parses the HTML

my_url = 'http://www.fortwiki.com/Battery_Adair'
print(my_url)
uClient = uReq(my_url)  # opens the URL and stores the response in uClient

page_html = uClient.read()  # reads the page contents
uClient.close()  # closes the connection

page_soup = soup(page_html, "html.parser")  # parses the HTML
containers = page_soup.find_all("table")
for container in containers:
    title = container.tr.p.b.text.strip()
    history = container.tr.p.text.strip()
    lat_long = container.tr.table
    print(title)
    print(history)
    print(lat_long)

Link to website: http://www.fortwiki.com/Battery_Adair

Answer 1:

The <p> tag you're looking for is very common in the document, and it doesn't have any unique attributes, so we can't select it directly.

One possible solution is to select the tag by index, as in bloopiebloopie's answer.
However, that only works if you know the exact position of the tag, and it breaks as soon as the page layout changes.

Another possible solution is to find a neighbouring tag that has distinguishing attributes or text and select our tag relative to it.
In this case we can find the preceding tag with the text "Maps & Images" and use find_next to select the tag that follows it.

import requests
from bs4 import BeautifulSoup

url = 'http://www.fortwiki.com/Battery_Adair'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

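b = soup.find('b', string='Maps & Images')  # 'string' is the newer name for the 'text' argument
lat_long = b.find_next().text if b else None  # None if the page has no such label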

This method should find the coordinates on any www.fortwiki.com page that has a map, as long as the coordinates directly follow the "Maps & Images" label.
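
The relative lookup can be tried without network access on a small inline snippet. The sketch below is only an illustration: `SAMPLE_HTML` is a simplified, hypothetical stand-in for the page markup, and the helper name `find_coordinates` is not from the original answer. It also passes "p" to find_next, an assumption that narrows the match to the next paragraph rather than the next tag of any kind.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the layout described above; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><td>
    <p><b>Battery Adair</b> - some history text.</p>
    <p><b>Maps &amp; Images</b></p>
    <p>Lat: 24.5477038 Long: -81.8104541</p>
  </td></tr>
</table>
"""


def find_coordinates(html):
    """Return the text of the first <p> after the 'Maps & Images' label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # The parser decodes &amp; so the label is matched as plain text.
    label = soup.find("b", string="Maps & Images")
    if label is None:
        return None
    # find_next walks forward in document order from the label.
    paragraph = label.find_next("p")
    return paragraph.get_text(strip=True) if paragraph else None


coords = find_coordinates(SAMPLE_HTML)
missing = find_coordinates("<p>No map here</p>")
print(coords)
print(missing)
```

Returning None when the label is absent lets the caller skip pages without a map instead of failing with an AttributeError.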

Answer 2:

You can use the re module to match part of the text inside a tag.

import re
import requests
from bs4 import BeautifulSoup

url = 'http://www.fortwiki.com/Battery_Adair'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

lat_long = soup.find('p', text=re.compile(r'Lat:\s\d+\.\d+\sLong:')).text  # raw string keeps \s and \d intact
print(lat_long)
# Lat: 24.5477038 Long: -81.8104541
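
Once the paragraph text is in hand, the same kind of pattern can also split it into numbers. This is a minimal sketch, assuming the text always follows the "Lat: <number> Long: <number>" layout shown in the output above; the names `COORD_RE` and `parse_lat_long` are made up for the example.

```python
import re

# Assumes the "Lat: <number> Long: <number>" layout; an optional minus sign is allowed.
COORD_RE = re.compile(r"Lat:\s*(-?\d+(?:\.\d+)?)\s+Long:\s*(-?\d+(?:\.\d+)?)")


def parse_lat_long(text):
    """Return (lat, long) as floats, or None if the text does not match."""
    match = COORD_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


result = parse_lat_long("Lat: 24.5477038 Long: -81.8104541")
no_match = parse_lat_long("no coordinates here")
print(result)
```

Allowing a leading minus sign on both numbers is a deliberate choice here, since either coordinate can be negative depending on the location.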

Answer 3:

I am not exactly sure what you want, but this works for me. There are probably neater ways of doing it; I am new to Python.

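import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.fortwiki.com/Battery_Adair").content, "html.parser")
# The index 8 is specific to this page's layout and will break if the layout changes
x = soup.find("div", id="mw-content-text").find("table").find_all("p")[8]
x = x.get_text()
x = x.split("Long:")  # split into the latitude part and the longitude part
lat = x[0].split(" ")[1]
long = x[1].strip()
print("LAT = " + lat)
print("LNG = " + long)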

Source: https://stackoverflow.com/questions/49618479/html-parsing-using-bs4

Tags

python

beautifulsoup