beautifulsoup

Select all div siblings by using BeautifulSoup

纵饮孤独 submitted on 2019-12-23 01:50:08
Question: I have an HTML file with a structure like the following:

    <div> </div>
    <div> </div>
    <div>
        <div> </div>
        <div> </div>
        <div> </div>
    </div>
    <div>
        <div>
            <div> </div>
        </div>
    </div>

I would like to select all the sibling divs without selecting the nested divs in the third and fourth blocks. If I use find_all() I get all the divs.

Answer 1: You can find the direct children of the parent element: soup.select('body > div') gets all div elements under the top-level body tag. You could also find the first div, then grab all …
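A minimal sketch of both approaches from the answer, assuming the sibling divs sit directly under the body tag:

    from bs4 import BeautifulSoup

    html = """<body>
    <div> </div>
    <div> </div>
    <div><div> </div><div> </div><div> </div></div>
    <div><div><div> </div></div></div>
    </body>"""
    soup = BeautifulSoup(html, 'html.parser')

    # CSS child combinator: only divs that are direct children of body
    top_level = soup.select('body > div')

    # Equivalent: take the first top-level div, then its div siblings
    first = soup.body.find('div', recursive=False)
    siblings = [first] + first.find_next_siblings('div')

    print(len(top_level), len(siblings))  # both give the 4 top-level divs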

How can I use BeautifulSoup or Slimit on a site to output the email address from a JavaScript variable

人盡茶涼 submitted on 2019-12-23 01:19:06
Question: I have this example website: http://www.example.com/whatever.asp?profile=1. For each profile number I have a different email address in this JavaScript code:

    <script LANGUAGE="JavaScript">
    function something() {
        var ptr;
        ptr = "";
        ptr += "<table><td class=france></td></table>";
        ptr += "<table><td class=france><a href=mailto:exa";
        ptr += "mple@email.com>email</a></td></table>";
        document.all.something.innerHTML = ptr;
    }
    </script>

I want to parse or regex the email address. The position of the emails …
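Since the address is split across two ptr += statements, a plain regex over the raw HTML can miss it; one sketch is to re-join the concatenated string literals first and then match the mailto target (the URL is the question's placeholder, and the patterns are assumptions, not the asker's code):

    import re
    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.example.com/whatever.asp?profile=1'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

    for script in soup.find_all('script'):
        js = script.string or ''
        # Re-join the pieces that the JavaScript concatenates into ptr
        joined = ''.join(re.findall(r'ptr \+= "([^"]*)"', js))
        match = re.search(r'mailto:([\w.+-]+@[\w.-]+)', joined)
        if match:
            print(match.group(1))  # example@email.com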

Is there a better approach to use BeautifulSoup in my python web crawler codes?

陌路散爱 submitted on 2019-12-23 00:42:12
Question: I'm trying to crawl information from the URLs in a page and save it in a text file. I received great help in the question "How to get the right source code with Python from the URLs using my web crawler?", and I have tried to use what I learned about BeautifulSoup to finish my code based on that question. But when I look at my code, although it satisfies my need, it looks pretty messed up. Can anyone help me optimize it a little, especially the BeautifulSoup part? Such as the …
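The asker's code is truncated here, but the usual tidy shape for this kind of crawler is one pass to collect the links, then one loop to fetch and save each page; a sketch under that assumption (the start URL is a placeholder):

    import requests
    from bs4 import BeautifulSoup

    start_url = 'https://example.com/index.html'  # placeholder
    soup = BeautifulSoup(requests.get(start_url).text, 'html.parser')

    # Collect every absolute link on the start page
    links = [a['href'] for a in soup.find_all('a', href=True)
             if a['href'].startswith('http')]

    with open('output.txt', 'w', encoding='utf-8') as f:
        for link in links:
            page = BeautifulSoup(requests.get(link).text, 'html.parser')
            f.write(page.get_text(separator='\n', strip=True) + '\n')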

How to recursively crawl subpages with Scrapy

徘徊边缘 submitted on 2019-12-22 18:36:27
Question: So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow the sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

    Category 1 name
        Subcategory 1 name
            data from this subcategory's page
        Subcategory n name
            data from this page
    Category n name
        Subcategory 1 name …
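In Scrapy, the standard way to carry the category name down through nested requests is cb_kwargs (or request.meta); a sketch with a placeholder URL and selectors, since the real site isn't shown:

    import scrapy

    class CategorySpider(scrapy.Spider):
        name = 'categories'
        start_urls = ['https://example.com/categories']  # placeholder

        def parse(self, response):
            for cat in response.css('a.category'):
                yield response.follow(
                    cat, self.parse_category,
                    cb_kwargs={'category': cat.css('::text').get()})

        def parse_category(self, response, category):
            for sub in response.css('a.subcategory'):
                yield response.follow(
                    sub, self.parse_subcategory,
                    cb_kwargs={'category': category,
                               'subcategory': sub.css('::text').get()})

        def parse_subcategory(self, response, category, subcategory):
            yield {'category': category,
                   'subcategory': subcategory,
                   'data': ' '.join(response.css('p::text').getall())}

Running it with scrapy crawl categories -o output.json produces a flat list of items that can then be regrouped into the nested layout above.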

Scraping: add data stored as a picture to a CSV file in Python 3.5

让人想犯罪 __ submitted on 2019-12-22 17:46:52
Question: For this project, I am scraping data from a database and attempting to export this data to a spreadsheet for further analysis. (Previously posted here -- thanks for the help over there reworking my code!) I previously thought that finding the winning candidate in the table could be simplified by always selecting the first name that appears in the table, as I thought the "winners" always appeared first. However, this is not the case. Whether or not a candidate was elected is stored in the …
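If, as the title suggests, the elected marker is an image rather than text, one option is to test each table row for the <img> tag itself; a sketch assuming a hypothetical marker in the image's src attribute (inspect the real page for the actual attribute and URL):

    import csv
    import requests
    from bs4 import BeautifulSoup

    url = 'https://example.com/results'  # placeholder
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

    with open('results.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['candidate', 'elected'])
        for row in soup.select('table tr'):
            cells = row.find_all('td')
            if not cells:
                continue  # skip header rows
            name = cells[0].get_text(strip=True)
            # Cell text is empty for an image, so look for the tag instead
            elected = bool(row.find('img', src=lambda s: s and 'elected' in s))
            writer.writerow([name, elected])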

Using Python Requests Module with Dropdown Options

假装没事ソ submitted on 2019-12-22 16:42:14
Question: I am trying to scrape information from this webpage: https://www.tmea.org/programs/all-state/history. I want to select several options from the first dropdown menu and use Beautiful Soup to pull the information I need. First I tried using Beautiful Soup to extract the different options:

    import requests
    from bs4 import BeautifulSoup

    page = requests.get('https://www.tmea.org/programs/all-state/history')
    soup = BeautifulSoup(page.text, 'html.parser')
    body = soup.find(id='organization')
    options …
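The dropdown's choices live in the <option> children of the <select>; picking one in a browser normally submits a form, which requests can imitate with a POST. A sketch: the 'organization' id comes from the question, but the form's action and field name are assumptions to verify in the browser's network tab:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.tmea.org/programs/all-state/history'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

    select = soup.find('select', id='organization')
    values = [o['value'] for o in select.find_all('option') if o.get('value')]
    print(values)

    # Imitate choosing the first option; 'organization' as the POST field
    # name is an assumption
    resp = requests.post(url, data={'organization': values[0]})
    results = BeautifulSoup(resp.text, 'html.parser')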

How to use BeautifulSoup to parse Google search results in Python

人走茶凉 submitted on 2019-12-22 16:34:07
Question: I am trying to parse the first page of Google search results, specifically the title and the small summary that is provided. Here is what I have so far:

    from urllib.request import urlretrieve
    import urllib.parse
    from urllib.parse import urlencode, urlparse, parse_qs
    import webbrowser
    from bs4 import BeautifulSoup
    import requests

    address = 'https://google.com/#q='  # Default Google search address start
    file = open("OCR.txt", "rt")  # Open text document that contains the question
    word = file …
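One thing worth noting: the #q= fragment is never sent to the server, so the address above would fetch Google's empty homepage; the /search endpoint with a real query parameter is what returns results. A sketch (the h3 selector is an assumption -- Google's markup changes often and it may block scripted clients):

    import requests
    from bs4 import BeautifulSoup

    query = 'python beautifulsoup'
    resp = requests.get('https://www.google.com/search',
                        params={'q': query},
                        headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(resp.text, 'html.parser')

    for h3 in soup.find_all('h3'):
        print(h3.get_text())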

Downloading images with BeautifulSoup

青春壹個敷衍的年華 submitted on 2019-12-22 13:57:38
Question: I am using BeautifulSoup to extract pictures, which works well for normal pages. Now I want to extract the picture of the Chromebook from a webpage like this: https://twitter.com/banprada/statuses/829102430017187841. The page apparently contains a link to another page with the image. Here is my code for downloading an image from the mentioned link, but I am only getting the image of the person who posted the link.

    import urllib.request
    import os
    from bs4 import BeautifulSoup

    URL = "http:/ …
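The attached photo (as opposed to the poster's avatar) is often advertised in the page's Open Graph metadata; a sketch assuming the static HTML of the tweet page still carries that tag:

    import urllib.request
    from bs4 import BeautifulSoup

    url = 'https://twitter.com/banprada/statuses/829102430017187841'
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urllib.request.urlopen(req).read(), 'html.parser')

    # og:image points at the tweet's attached media, not the avatar
    meta = soup.find('meta', property='og:image')
    if meta:
        urllib.request.urlretrieve(meta['content'], 'chromebook.jpg')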

Python web scraping - Loop through all categories and subcategories

丶灬走出姿态 submitted on 2019-12-22 12:08:14
Question: I am trying to retrieve all categories and subcategories within a retail website. I am able to use BeautifulSoup to pull every single product in a category once I am in it. However, I am struggling with the loop over the categories. I'm using this as a test website: https://www.uniqlo.com/us/en/women. How do I loop through each category as well as the subcategories on the left side of the website? The problem is that you would have to click on the category before the website displays all the …
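If the sidebar links are present in the initial HTML, the loop is just "collect the category links, then fetch each one"; a sketch with an assumed selector (inspect the sidebar for the real markup; if the subcategories only appear after a JavaScript-driven click, requests alone won't see them and a browser automation tool would be needed):

    import requests
    from bs4 import BeautifulSoup

    base = 'https://www.uniqlo.com'
    soup = BeautifulSoup(requests.get(base + '/us/en/women').text, 'html.parser')

    # 'nav a[href]' is a guess at the sidebar's markup
    for link in soup.select('nav a[href]'):
        name = link.get_text(strip=True)
        href = link['href']
        if href.startswith('/'):
            href = base + href
        sub_soup = BeautifulSoup(requests.get(href).text, 'html.parser')
        print(name, href)  # parse sub_soup for subcategories/products here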