BeautifulSoup returning different html than view source

我只是一个虾纸丫 posted on 2020-01-15 08:50:14

Question


I'm brand new to using BeautifulSoup, so forgive me if my question is stupid. However, I've been googling and trying suggestions from every Stack Overflow thread I could find since 6am, to no avail.

My problem is that I have a .csv file of gene names. Some of them are in Ensembl format, which means I MUST use the Ensembl database to look up the info I need. For the rest I can use the NCBI database.
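
A minimal sketch of how the gene names from such a file could be split between the two databases. It assumes that Ensembl gene IDs follow the pattern "ENS", an optional species prefix, "G", then 11 digits (as in ENSGALG00000016955), and that the gene names sit in the first CSV column. The function names pick_database and route_genes are hypothetical, not part of the original code.

```python
import csv
import io
import re

# Assumption: Ensembl gene IDs look like ENS + optional species prefix + G + 11 digits.
ENSEMBL_GENE_ID = re.compile(r"ENS[A-Z]*G\d{11}")

def pick_database(gene_name):
    """Return 'ensembl' for Ensembl-style gene IDs, otherwise 'ncbi'."""
    if ENSEMBL_GENE_ID.fullmatch(gene_name.strip()):
        return "ensembl"
    return "ncbi"

def route_genes(csv_text):
    """Map each gene name in the first CSV column to the database to query."""
    reader = csv.reader(io.StringIO(csv_text))
    return {
        row[0].strip(): pick_database(row[0])
        for row in reader
        if row and row[0].strip()  # skip empty lines
    }

# Hypothetical sample data standing in for the real .csv file
sample = "ENSGALG00000016955\nBRCA2\nENSA\n"
print(route_genes(sample))
```

Matching the full pattern rather than just the "ENS" prefix keeps ordinary gene symbols that happen to start with those letters, such as ENSA, on the NCBI side.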

Now, my code is just fine. I know this because every query sent to NCBI returns the info I need, and I'm able to extract it all with BeautifulSoup and write it to a CSV. HOWEVER, either urlopen or BeautifulSoup is not working the way I've been led to understand it works.

When I put the following URL into my address bar, the correct webpage loads: http://uswest.ensembl.org/Gallus_gallus/Gene/Summary?db=core;g=ENSGALG00000016955;r=1:165302186-165480795;t=ENSGALT00000027404.

I can then view source and check out the HTML. Yet when I have:

html = urlopen("http://uswest.ensembl.org/Gallus_gallus/Gene/Summary?db=core;g=ENSGALG00000016955;r=1:165302186-165480795;t=ENSGALT00000027404")
soup = BeautifulSoup(html, 'lxml')

The HTML it outputs is not at all what I get when I load the same URL in my browser and view source. I know that for pages with JavaScript, inspect element and view source will differ, but urlopen should ALWAYS return the same HTML as view source.

I need to extract the string after "Description". Visiting the link in my browser, I can view source and see the tags I need to find with BeautifulSoup; however, unless urlopen works properly and returns the correct HTML, there is nothing I can do. My RA job depends on getting this done by tonight.

Any suggestions?


Answer 1:


Parts of the page, for instance the "Summary", are loaded by the JavaScript that is referenced in the script tag. However, the text you are looking for is embedded in the HTML itself. Locating the text after the "Description" label works with this code:

import requests
from bs4 import BeautifulSoup

url = "http://uswest.ensembl.org/Gallus_gallus/Gene/Summary?db=core;g=ENSGALG00000016955;r=1:165302186-165480795;t=ENSGALT00000027404"
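r = requests.get(url, timeout=5)
r.raise_for_status()  # fail early on an HTTP error
# Name the parser explicitly so the result does not depend on what is installed
html = BeautifulSoup(r.text, "html.parser")
# find() returns the first <div class="rhs">, which holds the description text
description = html.find("div", {'class': "rhs"})
print(description.text)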
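
If the first "rhs" div ever turns out not to be the description, a more defensive variant could look up the label first and then take the value next to it. The sketch below runs offline on a made-up snippet: the "lhs" class and the whole SAMPLE_HTML layout are assumptions for illustration (only the "rhs" class comes from the code above), and extract_field is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the "rhs" class is taken from the answer above;
# the "lhs" label class and the row layout are assumptions.
SAMPLE_HTML = """
<div class="row"><div class="lhs">Description</div>
<div class="rhs"><p>example gene description</p></div></div>
<div class="row"><div class="lhs">Location</div>
<div class="rhs"><p>Chromosome 1</p></div></div>
"""

def extract_field(html, label):
    """Return the text of the value div that follows the given label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for lhs in soup.find_all("div", class_="lhs"):
        if lhs.get_text(strip=True) == label:
            # The value is assumed to be the next sibling div with class "rhs"
            rhs = lhs.find_next_sibling("div", class_="rhs")
            return rhs.get_text(" ", strip=True) if rhs else None
    return None  # label not present in the page

print(extract_field(SAMPLE_HTML, "Description"))
```

Returning None instead of raising makes it easy to log the genes whose pages did not contain the field and carry on with the rest of the CSV.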


Source: https://stackoverflow.com/questions/26763461/beautifulsoup-returning-different-html-than-view-source
