Beautiful Soup has extra </body> before actual end

左心房为你撑大大i submitted on 2020-04-30 07:31:51

Question


I am trying to scrape poems from PoetryFoundation.org. In one of my test cases, the html I pull for a specific poem includes an extra </body> before the actual end of the poem. When I look at the page source for the poem online, there is no </body> in the middle of the poem (as expected). Here is an example with the url of that specific case so that others can try to replicate the problem:

from bs4 import BeautifulSoup
from urllib.request import urlopen

poem_page = urlopen("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956")
poem_soup = BeautifulSoup(poem_page.read(), "html5lib")
print(poem_soup)

I'm running Python 3.5.1. I've tried this with the built-in html.parser as well as with html5lib and lxml.

In the printout, if you search for 'in the poem' you'll find the snippet of html below. It makes no sense: it ends the entire html document midway through the poem with </body></html> and then continues with the rest of the document:

in the poem</div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></body></html>. But when we met,<br/><div style="text-indent: -1em; padding-left: 1em;"><br/>

I've looked at the source code online and this is what it should be:

in the poem</em>. But when we met,<br></div><div style="text-indent: -1em; padding-left: 1em;">

I have no idea why the scraped version closes the entire html document partway through the page.
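To confirm the symptom without reading the whole printout, a small check can report whatever follows the first </body> in the serialized output. This is only a minimal sketch using plain string operations; the function name and the `context` parameter are hypothetical and not part of Beautiful Soup.

```python
def text_after_first_body_close(markup: str, context: int = 40) -> str:
    """Return up to `context` characters found after the first </body></html>.

    An empty string means nothing but whitespace follows the closing tags,
    which is what a well-formed serialized document should give.
    """
    marker = "</body>"
    pos = markup.lower().find(marker)
    if pos == -1:
        return ""  # no </body> at all, so nothing to report
    rest = markup[pos + len(marker):].lstrip()
    # Skip the </html> that normally follows </body> directly.
    if rest.lower().startswith("</html>"):
        rest = rest[len("</html>"):].lstrip()
    return rest[:context]


# Shortened version of the broken snippet quoted above.
broken = "in the poem</div></div></body></html>. But when we met,<br/>"
clean = "<html><body><p>ok</p></body></html>\n"

leftover_broken = text_after_first_body_close(broken)
leftover_clean = text_after_first_body_close(clean)
print(repr(leftover_broken))
print(repr(leftover_clean))
```

Calling it as text_after_first_body_close(str(poem_soup)) on the soup from the example above should return a non-empty string whenever the document was closed early.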


Answer 1:


When I fetch the poem from your url and parse it with html.parser, I get the same problem as you: the html is truncated right after 'in the poem'.

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it with Python's built-in html.parser
poem_page = requests.get("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956")
poem_soup = BeautifulSoup(poem_page.text, "html.parser")
# The poem sits inside <div class="poem">
poem_div = poem_soup.find('div', class_='poem')
print(poem_div)

OUTPUT:

<div class="poem" data-view="ContentView">
<div style="text-indent: -1em; padding-left: 1em;">It seems a certain fear underlies everything. <br/></div><div style="text-indent: -1em; padding-left: 1em;">If I were to tell you something profound<br/></div><div style="text-indent: -1em; padding-left: 1em;"> it would be useless, as every single thing I know<br/></div><div style="text-indent: -1em; padding-left: 1em;"> is not timeless. I am particularly risk-averse.<br/></div><div style="text-indent: -1em; padding-left: 1em;"><br/></div><div style="text-indent: -1em; padding-left: 1em;">I choose someone else over me every time, <br/></div><div style="text-indent: -1em; padding-left: 1em;">as I'm sure they'll finish the task at hand, <br/></div><div style="text-indent: -1em; padding-left: 1em;">which is to say that whatever is in front of us<br/></div><div style="text-indent: -1em; padding-left: 1em;"> will get done if I'm not in charge of it.<br/></div><div style="text-indent: -1em; padding-left: 1em;"><br/></div><div style="text-indent: -1em; padding-left: 1em;">There is a limit to the number of times <br/></div><div style="text-indent: -1em; padding-left: 1em;">I can practice every single kind of mortification <br/></div><div style="text-indent: -1em; padding-left: 1em;">(of the flesh?). I can turn toward you and say <em>yes, <br/></em></div><div style="text-indent: -1em; padding-left: 1em;">it was you in the poem</div></div>

But after changing the parser to lxml, the whole poem comes through.

import requests
from bs4 import BeautifulSoup

# Same request, parsed with lxml instead of html.parser
poem_page = requests.get("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956")
poem_soup = BeautifulSoup(poem_page.text, "lxml")
poem_div = poem_soup.find('div', class_='poem')
# print(poem_div)
# Each line of the poem is a nested <div>; print its first child node
for s in poem_div.find_all('div'):
    print(list(s.children)[0])

OUTPUT:

It seems a certain fear underlies everything. 
If I were to tell you something profound
 it would be useless, as every single thing I know
 is not timeless. I am particularly risk-averse.
<br/>
I choose someone else over me every time, 
as I'm sure they'll finish the task at hand, 
which is to say that whatever is in front of us
 will get done if I'm not in charge of it.
<br/>
There is a limit to the number of times 
I can practice every single kind of mortification 
(of the flesh?). I can turn toward you and say 
it was you in the poem. But when we met,
<br/>
you were actually wearing a shirt, and the poem 
wasn't about you or your indecipherable tattoo. 
The poem is always about me, but that one time 
I was in love with the memory of my twenties
<br/>
so I was, for a moment, in love with you 
because you remind me of an approaching
 subway brushing hair off my face with 
its hot breath. Darkness. And then light,
<br/>
the exact goldness of dawn fingering
 that brick wall out my bedroom window 
on Smith Street mornings when I'd wake
 next to godknowswho but always someone
<br/>
who wasn't a mistake, because what kind 
of mistakes are that twitchy and joyful 
even if they're woven with a particular 
thread of regret: the guy who used
<br/>
my toothbrush without asking,
I walked to the end of a pier with him,
would have walked off anywhere with him
until one day we both landed in California
<br/>
when I was still young, and going West
meant taking a laptop and some clothes
in a hatchback and learning about produce.
I can turn toward you, whoever you are,
<br/>
and say you are my lover simply because
I say you are, and that is, I realize,
a tautology, but this is my poem. I claim
nothing other than what I write, and even that,
<br/>
I'd leave by the wayside, since the only thing
to pack would be the candlesticks, and 
even those are burned through, thoroughly
replaceable. Who am I kidding? I don't
<br/>
own anything worth packing into anything.
We are cardboard boxes, you and I, stacked
nowhere near each other and humming
different tunes. It is too late to be writing this.
<br/>
I am writing this to tell you something less
than neutral, which is to say I'm sorry.
It was never you. It was always you:
your unutterable name, this growl in my throat.
<br/>
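
If installing lxml is not an option, another possibility is to skip Beautiful Soup for this one step and collect the lines with the standard library's html.parser.HTMLParser directly. This is a different technique from the one used above, shown only as a sketch: it assumes each line of the poem is a <div> nested directly inside <div class="poem">, as in the output shown earlier. The class name PoemLineCollector and the sample markup are made up for illustration. Because the collector only tracks <div> nesting, a stray </em> or similar end tag cannot cut the poem short.

```python
from html.parser import HTMLParser


class PoemLineCollector(HTMLParser):
    """Collect the text of each line <div> inside <div class="poem">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.div_depth = 0   # 0 = outside the poem, 1 = poem container, 2+ = inside a line
        self.lines = []
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return  # <em>, <br> and other inline tags do not change line boundaries
        if self.div_depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if "poem" in classes:
                self.div_depth = 1
        else:
            self.div_depth += 1
            if self.div_depth == 2:
                self._current = []  # a new line of the poem starts here

    def handle_endtag(self, tag):
        if tag != "div" or self.div_depth == 0:
            return  # unmatched end tags such as </em> are ignored
        if self.div_depth == 2:
            self.lines.append("".join(self._current).strip())
        self.div_depth -= 1

    def handle_data(self, data):
        if self.div_depth >= 2:
            self._current.append(data)


# Made-up sample that mimics the structure around the problematic line.
sample = (
    '<html><body><div class="poem">'
    '<div>I can turn toward you and say <em>yes, <br></em></div>'
    '<div>it was you in the poem</em>. But when we met,<br></div>'
    '<div><br></div>'
    '</div><p>footer</p></body></html>'
)

collector = PoemLineCollector()
collector.feed(sample)
collector.close()
lines = collector.lines
for line in lines:
    print(line)
```

For the real page, poem_page.text from the examples above could be passed to feed() in place of the sample string. Whether the live markup still matches this structure is an assumption to verify.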


Source: https://stackoverflow.com/questions/39564359/beautiful-soup-has-extra-body-before-actual-end
