beautifulsoup | 易学教程

Scraping 'N' pages with Beautifulsoup and Requests (How to obtain the true page number)

Question: The website http://www.shyan.gov.cn/zwhd/web/webindex.action has several pages, and I want to get all the titles() from it. My code currently scrapes only one page, but I would like to scrape the others as well. When I click the link to "page 2", the URL does NOT change. I looked at the page source and saw JavaScript code that advances to the next page, like this: javascript:gotopage(2) or javascript:void(0). My code is here

“illegal multibyte sequence” error from BeautifulSoup when Python 3

Question: An .html file is saved to local disk, and I am using BeautifulSoup (bs4) to parse it. It all worked fine until I recently switched to Python 3. I tested the same .html file on another machine with Python 2, where it works and returns the page contents. soup = BeautifulSoup(open('page.html'), "lxml") The machine with Python 3 fails with: UnicodeDecodeError: 'gbk' codec can't decode byte 0x92 in position 298670: illegal multibyte sequence I searched around and tried the following, but neither worked: (be it 'r',
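One likely cause, offered as a sketch rather than a confirmed diagnosis: in Python 3, open() without an encoding argument decodes the file with the platform default (apparently GBK on that machine), whereas Python 2 handed raw bytes to BeautifulSoup. The example below writes a small sample file and parses it with an explicit encoding. The sample markup, the file name sample.html, the helper names and the choice of cp1252 (in which byte 0x92 is a right single quotation mark) are all assumptions; the question does not state the real file's encoding.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample: the curly apostrophe becomes byte 0x92 in cp1252.
SAMPLE = "<html><body><p>It\u2019s a test</p></body></html>"


def parse_with_encoding(path, encoding):
    """Parse a local HTML file, decoding it with an explicit encoding."""
    with open(path, encoding=encoding) as handle:
        return BeautifulSoup(handle, "html.parser")


def parse_from_bytes(path, encoding):
    """Alternative: read raw bytes and tell BeautifulSoup which encoding to use."""
    with open(path, "rb") as handle:
        return BeautifulSoup(handle.read(), "html.parser", from_encoding=encoding)


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.html")
    with open(sample_path, "wb") as out:
        out.write(SAMPLE.encode("cp1252"))

    # Both routes avoid relying on the platform default encoding.
    text_a = parse_with_encoding(sample_path, "cp1252").p.get_text()
    text_b = parse_from_bytes(sample_path, "cp1252").p.get_text()

print(ascii(text_a))
print(ascii(text_b))
```

If the real encoding is unknown, opening the file in binary mode and letting BeautifulSoup guess (omitting from_encoding) is another option to try.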

How to extract text within HTML lists using beautifulsoup python

Question: I'm trying to write a Python program that extracts the text inside an HTML list. I would like to extract information such as the book being a hardcover and its number of pages. Does anybody know how to do this? <h2>Product Details</h2> <div class="content"> <ul> <li><b>Hardcover:</b> 156 pages</li> <li><b>Publisher:</b> Insight Editions; Har/Pstr edition (June 18, 2013)</li> <li><b>Language:</b> English</li> <li><b>ISBN-10:</b> 1608871827</li> <li><b>ISBN-13:</b> 978-1608871827</li>
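A minimal sketch of one way to read such a list, assuming each item has the shape `<li><b>Label:</b> value</li>` as in the excerpt. The closing `</ul></div>` tags in the sample and the helper name product_details are assumptions added to make the snippet self-contained.

```python
from bs4 import BeautifulSoup

# Markup taken from the question; the closing tags are assumed.
SAMPLE = """
<h2>Product Details</h2>
<div class="content">
<ul>
<li><b>Hardcover:</b> 156 pages</li>
<li><b>Publisher:</b> Insight Editions; Har/Pstr edition (June 18, 2013)</li>
<li><b>Language:</b> English</li>
<li><b>ISBN-10:</b> 1608871827</li>
<li><b>ISBN-13:</b> 978-1608871827</li>
</ul>
</div>
"""


def product_details(markup):
    """Return a dict mapping each bold label to the text that follows it."""
    soup = BeautifulSoup(markup, "html.parser")
    details = {}
    container = soup.find("div", class_="content")
    if container is None:
        return details
    for item in container.find_all("li"):
        label = item.find("b")
        if label is None:
            continue
        key = label.get_text(strip=True).rstrip(":")
        label.extract()  # remove the label so only the value text remains
        details[key] = item.get_text(strip=True)
    return details


details = product_details(SAMPLE)
print(details["Hardcover"])
print(details["ISBN-10"])
```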

Can't extract the text and find all by BeautifulSoup

Question: I want to extract all the available items under "équipements", but I can only get the first four items, followed by '+ plus'. import urllib2 from bs4 import BeautifulSoup import re import requests headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'} url = 'https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A' req = urllib2.Request(url = url, headers = headers) html = urllib2.urlopen(req) bsobj = BeautifulSoup(html.read(),'lxml') b

Trip Advisor Scraping 'moreLink'

Question: I've been building a web scraper in BS4 and have gotten stuck. I am using Trip Advisor as a test for other data I will be going after, but I am not able to isolate the tag that holds the 'entire' reviews. Here is an example: https://www.tripadvisor.com/Restaurant_Review-g56010-d470148-Reviews-Chez_Nous-Humble_Texas.html Notice that in the first review there is an icon below "the wine list is...". I can easily isolate the partial reviews, but have not been able to figure out a way to get BS4 to pull

Using BeautifulSoup where authentication is required

Question: I am scraping LAN data using BeautifulSoup4 and Python requests for a company project. Because the site has a login interface, I am not authorized to access the data. The login interface is a pop-up that doesn't let me view the page source or inspect the page elements without logging in. The error I get is this: Access Error: Unauthorized Access to this document requires a User ID This is a screenshot of the pop-up box (the blackened part is sensitive information). It has not information

How to scrape aspx pages with python

Question: I am trying to scrape a site, https://www.searchiqs.com/nybro/ (you have to click "Log In as Guest" to get to the search form). If I search for a Party 1 term such as "Andrew", the results are paginated. The request type is POST, so the URL does not change, and the sessions time out very quickly. So quickly that if I wait ten minutes and refresh the search URL page, it gives me a timeout error. I got started with scraping recently, so I have mostly been doing GET requests where I
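ASP.NET WebForms pages generally keep their state in hidden form inputs such as `__VIEWSTATE`, which have to be sent back with each POST. The sketch below shows only that first step, collecting the hidden inputs from a form into a payload dictionary. The sample markup, the form action SearchResults.aspx, the field name party1 and the helper name hidden_fields are all hypothetical; the real site's field names are not given in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical form markup; real ASP.NET pages carry much longer values.
SAMPLE_FORM = """
<form method="post" action="SearchResults.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="party1" value="" />
</form>
"""


def hidden_fields(markup):
    """Collect name/value pairs of all hidden inputs in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    fields = {}
    for tag in soup.find_all("input", attrs={"type": "hidden"}):
        name = tag.get("name")
        if name:
            fields[name] = tag.get("value", "")
    return fields


# Start from the server-issued state, then add the search term.
payload = hidden_fields(SAMPLE_FORM)
payload["party1"] = "Andrew"  # hypothetical field name
print(sorted(payload))
```

In a real run, the page would presumably be fetched and the payload posted through a single requests.Session so that cookies persist between the two requests. The hidden fields would be read again from each response before requesting the next page.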

BeautifulSoup: Print div's based on content of preceding tag

Question: I would like to select the contents of elements based on the preceding tag: <h4>Models & Products</h4> <div class="profile-area">...</div> <h4>Production Capacity (year)</h4> <div class="profile-area">...</div> How can I get the "profile-area" values based on the content of the preceding tag? Here is my code: import requests from bs4 import BeautifulSoup import csv import re html_doc = """ <html> <body> <div class="col-md-6"> <iframe class="factory_detail_google_map" frameborder="0" src=
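One way to pair each heading with the block that follows it is find_next_sibling. This is a sketch under assumptions: the placeholder values "Sedans" and "120,000" stand in for the elided "..." content, and the helper name sections_by_heading is made up for illustration.

```python
from bs4 import BeautifulSoup

# Structure from the question; the div contents are placeholder values.
SAMPLE = """
<h4>Models &amp; Products</h4>
<div class="profile-area">Sedans</div>
<h4>Production Capacity (year)</h4>
<div class="profile-area">120,000</div>
"""


def sections_by_heading(markup):
    """Map each <h4> text to the text of the next 'profile-area' sibling div."""
    soup = BeautifulSoup(markup, "html.parser")
    sections = {}
    for heading in soup.find_all("h4"):
        # Finds the first later sibling that matches, skipping whitespace nodes.
        block = heading.find_next_sibling("div", class_="profile-area")
        if block is not None:
            sections[heading.get_text(strip=True)] = block.get_text(strip=True)
    return sections


sections = sections_by_heading(SAMPLE)
print(sections.get("Production Capacity (year)"))
```

Note that find_next_sibling returns the first matching later sibling, not necessarily the immediately adjacent one, so a heading with no block of its own would pick up the next section's div.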

BeautifulSoup: Find table by style

Question: Is it possible to find a specific table by its unique style? Say, given the following HTML: <table border="1" style="background-color:White;font-size:10pt;border-collapse:collapse;"> How can I use BS to find that table? Thanks

Answer 1: Try this:

    from bs4 import BeautifulSoup
    bs = BeautifulSoup(htmlcontent)
    bs.find_all('table', attrs={'border': '1', 'style': 'background-color:White;font-size:10pt;border-collapse:collapse;'})

Source: https://stackoverflow.com/questions
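Matching the whole style string exactly breaks as soon as the spacing or the order of declarations differs. As a hedged alternative, not part of the original answer, a function can be passed as the attribute filter so that only the declarations of interest need to be present. The sample markup and the helper names style_declarations and find_table_by_style are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second has the target style.
SAMPLE = """
<table border="0" style="font-size:8pt;"><tr><td>other</td></tr></table>
<table border="1" style="background-color:White;font-size:10pt;border-collapse:collapse;">
<tr><td>target</td></tr></table>
"""


def style_declarations(style):
    """Split a style attribute into a set of normalized 'prop:value' strings."""
    if not style:
        return set()
    return {part.replace(" ", "").lower() for part in style.split(";") if part.strip()}


def find_table_by_style(markup, wanted):
    """Return the first table whose style contains every wanted declaration."""
    soup = BeautifulSoup(markup, "html.parser")
    wanted_set = style_declarations(wanted)
    # The filter function receives the attribute value, or None if it is absent.
    return soup.find("table", style=lambda s: wanted_set <= style_declarations(s))


table = find_table_by_style(SAMPLE, "border-collapse: collapse; font-size:10pt")
print(table.td.get_text())
```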