beautifulsoup

Scraping 'N' pages with Beautifulsoup and Requests (How to obtain the true page number)

梦想与她 posted on 2020-01-14 05:57:09
Question: I want to get all the titles() on the website http://www.shyan.gov.cn/zwhd/web/webindex.action. My code currently scrapes only one page, but the site has multiple pages that I would like to scrape. For example, when I click the link to "page 2" at the URL above, the URL does NOT change. I looked at the page source and saw JavaScript code that advances to the next page, like this: javascript:gotopage(2) or javascript:void(0). My code is here

“illegal multibyte sequence” error from BeautifulSoup when Python 3

淺唱寂寞╮ posted on 2020-01-14 05:48:06
Question: I have an .html file saved to local disk, and I am using BeautifulSoup (bs4) to parse it. It worked fine until the machine was recently switched to Python 3. I tested the same .html file on another machine with Python 2, where it works and returns the page contents.
soup = BeautifulSoup(open('page.html'), "lxml")
The machine with Python 3 fails with:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x92 in position 298670: illegal multibyte sequence
I searched around and tried the following, but neither worked: (be it 'r',
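For context on the error above: in Python 3, open() in text mode decodes the file with the platform's default encoding (GBK on a Chinese-locale Windows machine), whereas Python 2's open() handed raw bytes to BeautifulSoup. The following is a minimal sketch of two common workarounds, not the asker's eventual fix. It assumes the file is actually UTF-8 (the question does not say which encoding the page uses), and it uses the built-in "html.parser" instead of "lxml" to avoid an extra dependency. The helper names load_soup, load_soup_from_bytes, and write_sample are hypothetical.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def load_soup(path, encoding="utf-8"):
    """Parse a local HTML file, decoding it with an explicit encoding.

    The default of UTF-8 is an assumption; pass the file's real encoding.
    """
    with open(path, encoding=encoding) as fh:
        return BeautifulSoup(fh, "html.parser")


def load_soup_from_bytes(path):
    """Parse a local HTML file opened in binary mode.

    BeautifulSoup receives bytes and works out the encoding itself,
    for example from a <meta charset> declaration in the document.
    """
    with open(path, "rb") as fh:
        return BeautifulSoup(fh, "html.parser")


def write_sample():
    """Write a small UTF-8 HTML file to a temp path (demo data only)."""
    html = (
        '<html><head><meta charset="utf-8"></head>'
        "<body><p>caf\u00e9 \u6807\u9898</p></body></html>"
    )
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(html)
    return path
```

Either helper avoids the locale-dependent default. If the real encoding is unknown, the binary-mode variant is usually the easier one to try first.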

How to extract text within HTML lists using beautifulsoup python

不打扰是莪最后的温柔 posted on 2020-01-14 05:06:35
Question: I'm trying to write a Python program that extracts the text inside list items in HTML. I would like to extract information such as the book being a hardcover and its number of pages. Does anybody know the command for this operation?
<h2>Product Details</h2>
<div class="content">
<ul>
<li><b>Hardcover:</b> 156 pages</li>
<li><b>Publisher:</b> Insight Editions; Har/Pstr edition (June 18, 2013)</li>
<li><b>Language:</b> English</li>
<li><b>ISBN-10:</b> 1608871827</li>
<li><b>ISBN-13:</b> 978-1608871827</li>
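One possible approach to the question above, as a sketch rather than an accepted answer: walk the li elements, use the b tag as the label, and treat the remaining text as the value. The HTML in the question is cut off, so the closing </ul> and </div> tags below are an assumption, and product_details is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Snippet from the question; the closing </ul></div> tags are assumed,
# because the original excerpt is truncated.
HTML = """
<h2>Product Details</h2>
<div class="content">
<ul>
<li><b>Hardcover:</b> 156 pages</li>
<li><b>Publisher:</b> Insight Editions; Har/Pstr edition (June 18, 2013)</li>
<li><b>Language:</b> English</li>
<li><b>ISBN-10:</b> 1608871827</li>
<li><b>ISBN-13:</b> 978-1608871827</li>
</ul>
</div>
"""


def product_details(html):
    """Return a {label: value} dict built from the <li><b>label</b> value</li> items."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    content = soup.find("div", class_="content")
    if content is None:
        return details
    for li in content.find_all("li"):
        label = li.find("b")
        if label is None:
            continue  # skip items that do not follow the label/value pattern
        key = label.get_text(strip=True).rstrip(":")
        label.extract()  # remove the <b> tag so only the value text remains
        details[key] = li.get_text(strip=True)
    return details


details = product_details(HTML)
print(details["Hardcover"])
print(details["ISBN-13"])
```

Returning a dict keeps each value addressable by its label, so the page count can be read with details["Hardcover"] without relying on the order of the list items.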

How to extract text within HTML lists using beautifulsoup python

对着背影说爱祢 posted on 2020-01-14 05:05:05
Question: I'm trying to write a Python program that extracts the text inside list items in HTML. I would like to extract information such as the book being a hardcover and its number of pages. Does anybody know the command for this operation?
<h2>Product Details</h2>
<div class="content">
<ul>
<li><b>Hardcover:</b> 156 pages</li>
<li><b>Publisher:</b> Insight Editions; Har/Pstr edition (June 18, 2013)</li>
<li><b>Language:</b> English</li>
<li><b>ISBN-10:</b> 1608871827</li>
<li><b>ISBN-13:</b> 978-1608871827</li>

Can't extract the text and find all by BeautifulSoup

末鹿安然 posted on 2020-01-14 04:37:18
Question: I want to extract all the available items in the équipements section, but I can only get the first four items, followed by '+ plus'.
import urllib2
from bs4 import BeautifulSoup
import re
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
url = 'https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A'
req = urllib2.Request(url=url, headers=headers)
html = urllib2.urlopen(req)
bsobj = BeautifulSoup(html.read(), 'lxml')
b

Trip Advisor Scraping 'moreLink'

不问归期 posted on 2020-01-14 04:06:05
Question: I've been building a web scraper in BS4 and have gotten stuck. I am using Trip Advisor as a test for other data I will be going after, but I am not able to isolate the tag of the 'entire' reviews. Here is an example: https://www.tripadvisor.com/Restaurant_Review-g56010-d470148-Reviews-Chez_Nous-Humble_Texas.html Notice that in the first review there is an icon below "the wine list is...". I can easily isolate the partial reviews, but I have not been able to figure out a way to get BS4 to pull

Using BeautifulSoup where authentication is required

别说谁变了你拦得住时间么 posted on 2020-01-14 03:52:07
Question: I am scraping LAN data using BeautifulSoup4 and Python requests for a company project. Since the site has a login interface, I am not authorized to access the data. The login interface is a pop-up that doesn't allow me to access the page source or inspect the page elements without logging in. The error I get is: Access Error: Unauthorized Access to this document requires a User ID This is a screenshot of the pop-up box (the blackened part is sensitive information). It has not information
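A browser pop-up login of this kind is often HTTP Basic (or Digest) authentication, although the question does not confirm which scheme the server uses. Assuming Basic authentication, the sketch below shows how credentials can be attached with requests before handing the page to BeautifulSoup. The URL, user name, and password in the usage line are placeholders, and both helper names are hypothetical.

```python
import requests
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup


def build_authenticated_request(url, user, password):
    """Prepare a GET request carrying an HTTP Basic Authorization header.

    Assumption: the server uses Basic auth. For Digest auth,
    requests.auth.HTTPDigestAuth would be used instead.
    """
    request = requests.Request("GET", url, auth=HTTPBasicAuth(user, password))
    return request.prepare()


def fetch_soup(url, user, password, timeout=10):
    """Send the authenticated request and parse the response body."""
    with requests.Session() as session:
        prepared = build_authenticated_request(url, user, password)
        response = session.send(prepared, timeout=timeout)
        response.raise_for_status()  # raises on 401 if the credentials are rejected
        return BeautifulSoup(response.text, "html.parser")


# Placeholder values for illustration only; nothing is sent over the network here.
prepared = build_authenticated_request("http://192.168.0.10/status.html", "user", "secret")
print(prepared.headers["Authorization"].split()[0])
```

If the server still answers 401 with valid credentials, its WWW-Authenticate response header names the scheme it expects.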

How to scrape aspx pages with python

元气小坏坏 posted on 2020-01-14 03:35:07
Question: I am trying to scrape a site, https://www.searchiqs.com/nybro/ (you have to click "Log In as Guest" to get to the search form). If I search for a Party 1 term such as "Andrew", the results are paginated. The request type is POST, so the URL does not change, and the sessions time out very quickly: if I wait ten minutes and refresh the search URL page, it gives me a timeout error. I got started with scraping recently, so I have mostly been doing GET requests where I
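ASP.NET WebForms pages usually carry hidden inputs such as __VIEWSTATE and __EVENTVALIDATION that have to be sent back with every POST. The sketch below shows one way to collect them with BeautifulSoup and merge in the search terms. The sample form, its action, and the txtParty1 field name are hypothetical and not taken from the site in the question. The real field names have to be read from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical form for illustration; the real page's fields will differ.
SAMPLE_FORM = """
<form method="post" action="./SearchResults.aspx" id="form1">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4" />
  <input type="hidden" name="__EVENTVALIDATION" value="abc123" />
  <input type="text" name="txtParty1" value="" />
</form>
"""


def hidden_form_fields(html):
    """Collect name/value pairs of all hidden <input> elements."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for tag in soup.find_all("input", attrs={"type": "hidden"}):
        name = tag.get("name")
        if name:
            fields[name] = tag.get("value", "")
    return fields


def build_search_payload(html, extra_fields):
    """Merge the page's hidden state with the fields the user would fill in."""
    payload = hidden_form_fields(html)
    payload.update(extra_fields)
    return payload


payload = build_search_payload(SAMPLE_FORM, {"txtParty1": "Andrew"})
print(sorted(payload))
```

In a real run, the payload would typically be posted with a single requests.Session so that the session cookies persist. The hidden fields would be extracted again from each response before requesting the next results page, since their values usually change per page.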

BeautifulSoup: Print div's based on content of preceding tag

我的梦境 posted on 2020-01-13 19:22:46
Question: I would like to select the contents of elements based on the preceding tag:
<h4>Models & Products</h4>
<div class="profile-area">...</div>
<h4>Production Capacity (year)</h4>
<div class="profile-area">...</div>
How can I get the "profile-area" values based on the content of the preceding tag? Here is my code:
import requests
from bs4 import BeautifulSoup
import csv
import re
html_doc = """ <html> <body> <div class="col-md-6"> <iframe class="factory_detail_google_map" frameborder="0" src=
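A minimal sketch of one way to pair each heading with the div that follows it, using find_next_sibling. The div contents in the question are elided ("..."), so the text inside the sample divs below is made up for illustration, and profile_areas_by_heading is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Structure from the question; the div contents are invented placeholders.
HTML = """
<h4>Models &amp; Products</h4>
<div class="profile-area">Model A, Model B</div>
<h4>Production Capacity (year)</h4>
<div class="profile-area">10,000 units</div>
"""


def profile_areas_by_heading(html):
    """Map each <h4> heading text to the text of the next sibling profile-area div."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for h4 in soup.find_all("h4"):
        # First later sibling that is a <div class="profile-area">.
        area = h4.find_next_sibling("div", class_="profile-area")
        if area is not None:
            result[h4.get_text(strip=True)] = area.get_text(strip=True)
    return result


areas = profile_areas_by_heading(HTML)
print(areas["Production Capacity (year)"])
```

One caveat: find_next_sibling skips over non-matching siblings, so a heading with no div of its own would pick up the div belonging to the next heading. If that can happen on the real page, check that no other h4 sits between the heading and the matched div.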

BeautifulSoup: Find table by style

心不动则不痛 posted on 2020-01-13 11:36:06
Question: Is it possible to find a specific table by its unique style? Say, given the following HTML:
<table border="1" style="background-color:White;font-size:10pt;border-collapse:collapse;">
How can I use BS to find that table? Thanks

Answer 1: Try this:
from bs4 import BeautifulSoup
bs = BeautifulSoup(htmlcontent)
bs.find_all('table', attrs={'border': '1', 'style': 'background-color:White;font-size:10pt;border-collapse:collapse;'})
Check this link for more details.

Source: https://stackoverflow.com/questions
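The answer's attrs lookup matches the style string exactly, so any change in whitespace or declaration order on the page would break it. As a hedged variation, the sketch below shows the exact match next to a looser regular-expression match on a single declaration. The second table and the cell contents are invented so there is something to distinguish, and "html.parser" is passed explicitly, which the original answer does not do.

```python
import re

from bs4 import BeautifulSoup

# First table is from the question; the cells and the second table are made up.
HTML = (
    '<table border="1" style="background-color:White;font-size:10pt;'
    'border-collapse:collapse;"><tr><td>target</td></tr></table>'
    '<table style="color:red"><tr><td>other</td></tr></table>'
)

soup = BeautifulSoup(HTML, "html.parser")

# Exact match on both attributes, as in the answer.
exact = soup.find_all(
    "table",
    attrs={
        "border": "1",
        "style": "background-color:White;font-size:10pt;border-collapse:collapse;",
    },
)

# Looser match: any table whose style attribute contains this one declaration.
partial = soup.find_all("table", style=re.compile(r"border-collapse:\s*collapse"))

print(len(exact), len(partial))
```

The regular-expression form is checked with a search over the attribute value, so it tolerates extra declarations before or after the one being matched.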