bs4

BeautifulSoup output to .txt file

Submitted by 大憨熊 on 2019-12-20 03:10:09

Question: I am trying to export my data as a .txt file.

from bs4 import BeautifulSoup
import requests
import os

os.getcwd()   # '/home/folder'
os.mkdir("Probeersel6")
os.chdir("Probeersel6")
os.getcwd()   # '/home/Desktop/folder'
os.mkdir("img")   # now inside `folder`
url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("article", {"class": "article"})
with open(""%s".txt", "wb" %(url)) as
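The open() call at the end is where this breaks: the % formatting is applied to the mode string "wb" and the quotes are unbalanced. A minimal working sketch, using an inline stand-in for the fetched page and a hypothetical filename slug (a raw URL contains slashes and is not a valid file name):

```python
from bs4 import BeautifulSoup

# Inline stand-in for r.content; the real page would come from requests.
html = '<article class="article"><p>Steeds meer nekklachten.</p></article>'
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("article", {"class": "article"})

# Apply % to the filename template, not to the mode, and use a slug
# rather than the raw URL.
filename = "%s.txt" % "nekklachten-artikel"  # hypothetical slug
with open(filename, "w", encoding="utf-8") as f:
    for article in data:
        f.write(article.get_text(strip=True) + "\n")
```

Opening in text mode with an explicit encoding also removes the need for "wb" and manual byte handling.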

How to find all comments with Beautiful Soup

Submitted by 别来无恙 on 2019-12-17 06:45:56

Question: This question was asked four years ago, but the answer is now out of date for BS4. I want to delete all comments in my HTML file using Beautiful Soup. Since BS4 makes each comment a special type of navigable string, I thought this code would work:

for comments in soup.find_all('comment'):
    comments.decompose()

That didn't work... How do I find all comments using BS4?

Answer 1: You can pass a function to find_all() to help it check whether the string is a Comment. For example I have below
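find_all('comment') searches for a <comment> tag, which doesn't exist; comments are a subclass of NavigableString, so you search strings and test their type. A short sketch:

```python
from bs4 import BeautifulSoup, Comment

html = "<p>text<!-- a comment --></p><div><!-- another --></div>"
soup = BeautifulSoup(html, "html.parser")

# Comments are strings, not tags: search strings and check their type.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()  # decompose() is for tags; extract() removes a string
```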

How to get Python bs4 to work properly on XML?

Submitted by 。_饼干妹妹 on 2019-12-13 19:45:30

Question: I'm trying to use Python and BeautifulSoup 4 (bs4) to convert Inkscape SVGs into an XML-like format for some proprietary software. I can't seem to get bs4 to correctly parse a minimal example. I need the parser to respect self-closing tags, handle Unicode, and not add HTML boilerplate. I thought specifying the 'lxml' parser with selfClosingTags would do it, but no; check it out.

#!/usr/bin/python
from __future__ import print_function
from bs4 import BeautifulSoup

print('\nbs4 mangled XML:')
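The 'lxml' parser is lxml's HTML mode; for XML you pass "xml" instead, which selects lxml's XML parser (so lxml must be installed). Self-closing tags then survive and no <html>/<body> wrapper is added. A minimal sketch with a made-up SVG fragment:

```python
from bs4 import BeautifulSoup

svg = '<svg xmlns="http://www.w3.org/2000/svg"><path d="M0 0"/></svg>'

# "xml" selects lxml's XML parser (pip install lxml); unlike the HTML
# parsers it keeps self-closing tags and adds no html/body wrapper.
soup = BeautifulSoup(svg, "xml")
out = str(soup)
```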

soup.select('.r a') in 'https://www.google.com/#q=vigilante+mic' gives empty list in python BeautifulSoup

Submitted by ≡放荡痞女 on 2019-12-13 08:42:10

Question: I am using BeautifulSoup to extract all links from a Google search results page. Here's the snippet of the code:

import requests, bs4

res = requests.get('https://www.google.com/#q=vigilante+mic')
soup = bs4.BeautifulSoup(res.text)
linkElem = soup.select('.r a')

Now soup.select('.r a') is returning an empty list. Thank you.

Answer 1: That's because of the URL you are using: https://www.google.com/#q=vigilante+mic is a JavaScript version of the search. If you curl it you will see there are no answers in
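The empty list is mostly a URL problem: everything after # is a fragment, which the browser keeps client-side and requests never sends, so the server returns a page with no results in it. A small offline check (note that scraping Google's server-rendered pages is fragile, and its markup, including the .r class, changes regularly):

```python
from urllib.parse import urlsplit

url = "https://www.google.com/#q=vigilante+mic"
parts = urlsplit(url)

# The query lives in the fragment, which never reaches the server:
print(parts.query)     # ''
print(parts.fragment)  # 'q=vigilante+mic'

# The server-rendered endpoint carries the query in the query string:
search_url = "https://www.google.com/search?q=vigilante+mic"
```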

Extracting information from a table except its header using bs4

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-13 07:23:35

Question: I am trying to extract information from a table using bs4 and Python. When I use the following code to read the header of the table:

tr_header = table.findAll("tr")[0]
tds_in_header = [td.get_text() for td in tr_header.findAll("td")]
header_items = [data.encode('utf-8') for data in tds_in_header]
len_table_header = len(header_items)

it works, but with the following code, where I try to extract information from the first row to the end of the table: tr_all=table
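Once the header works, the remaining rows are just a slice: take find_all("tr")[1:] and iterate row by row. A sketch with a stand-in table:

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><td>Code</td><td>Display</td></tr>
<tr><td>min</td><td>Minute</td></tr>
<tr><td>happy</td><td>Hour</td></tr>
</table>"""
table = BeautifulSoup(html, "html.parser").table

# Slice past the header row, then read each remaining row cell by cell.
rows = table.find_all("tr")[1:]
body = [[td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows]
```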

Beautiful Soup captures null values in a table

Submitted by 我只是一个虾纸丫 on 2019-12-13 05:53:18

Question: For the following piece of HTML code, I used BeautifulSoup to capture the table information:

<table>
  <tr> <td><b>Code</b></td> <td><b>Display</b></td> </tr>
  <tr> <td>min</td> <td>Minute</td><td/> </tr>
  <tr> <td>happy </td> <td>Hour</td><td/> </tr>
  <tr> <td>daily </td> <td>Day</td><td/> </tr>

This is my code:

comments = [td.get_text() for td in table.findAll("td")]
Comments = [data.encode('utf-8') for data in comments]

As you see, this table has two headers, "Code" and "Display", and some values in
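The empty <td/> cells come back from get_text() as empty strings; filtering them out (and stripping the stray whitespace around values like "happy ") gives a clean list. A sketch on a shortened version of the table above:

```python
from bs4 import BeautifulSoup

html = ("<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr>"
        "<tr><td>min</td><td>Minute</td><td/></tr>"
        "<tr><td>happy </td><td>Hour</td><td/></tr></table>")
table = BeautifulSoup(html, "html.parser").table

raw = [td.get_text() for td in table.find_all("td")]
# The self-closing <td/> cells yield '': drop them and trim spaces.
cells = [text.strip() for text in raw if text.strip()]
```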

Cannot find table using Python BeautifulSoup

Submitted by 巧了我就是萌 on 2019-12-13 02:55:44

Question: I am trying to scrape the data from the table id=AWS on the following NOAA site, https://www.weather.gov/afc/alaskaObs, but when I try to find the table using .find, my result comes up as None. I am able to return the parent div, but can't seem to access the table. Below is my code.

from bs4 import BeautifulSoup
from urllib2 import urlopen

# Get soup set up
html = urlopen('https://www.weather.gov/afc/alaskaObs').read()
soup = BeautifulSoup(html, 'lxml').find("div", {"id":"obDataDiv"}).find
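When find() returns None even though the parent div is there, the first thing to check is whether the table exists in the raw HTML at all: observation tables on pages like this are often filled in by JavaScript, which urlopen never executes, so print the downloaded html and look for id="AWS". When the table really is in the markup, the chained lookup works; a sketch with a stand-in page:

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; on the live site the AWS table may be
# injected by JavaScript, in which case it is absent from urlopen's HTML.
html = ('<div id="obDataDiv">'
        '<table id="AWS"><tr><td>PAED</td><td>-2C</td></tr></table>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

table = soup.find("div", {"id": "obDataDiv"}).find("table", {"id": "AWS"})
first_row = [td.get_text() for td in table.find_all("td")]
```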

Extracting data properly with bs4?

Submitted by 試著忘記壹切 on 2019-12-12 05:29:54

Question: Here is my first question on this site, as I have tried many ways to get what I want but didn't succeed. I am trying to extract two types of data from a French website similar to Craigslist. My need is simple and I do manage to get the information, but I still have tags and other characters in my extract. I also have an encoding issue, even when using .encode('utf-8').

# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import csv

csvfile = open("test.csv", 'w+')

Extract specific columns from a given webpage

Submitted by 风流意气都作罢 on 2019-12-12 03:54:18

Question: I am trying to read a web page using Python and save the data in CSV format, to be imported as a pandas DataFrame. I have the following code, which extracts the links from all the pages; instead, I want to read certain column fields.

for i in range(10):
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    import urllib2
    from bs4 import BeautifulSoup
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        for anchor in soup.find_all('div', {'class':'col-xs-8'})[:9]:
            print i, anchor
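The per-page work can be wrapped in a small helper that keeps only the text of the wanted column divs; on Python 3, requests or urllib.request replaces urllib2. A sketch checked against a stand-in page, since the workshop URLs and the col-xs-8 class are taken from the question and may have changed:

```python
from bs4 import BeautifulSoup

def column_texts(html, css_class, limit=9):
    """Return the stripped text of the first `limit` divs of a class."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True)
            for div in soup.find_all("div", {"class": css_class})[:limit]]

# Offline stand-in for urllib2.urlopen(url).read() in the question:
sample = ('<div class="col-xs-8">Pune</div>'
          '<div class="col-xs-8">2016-07-30</div>')
fields = column_texts(sample, "col-xs-8")
```

Each page's fields list can then be appended as one CSV row and read back with pandas.read_csv.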