screen-scraping

BeautifulSoup - scraping a forum page

丶灬走出姿态 submitted on 2021-02-17 09:04:47
Question: I'm trying to scrape a forum discussion and export it as a CSV file, with columns such as "thread title", "user", and "post", where the latter is the actual forum post from each individual. I'm a complete beginner with Python and BeautifulSoup, so I'm having a really hard time with this! My current problem is that all the text is split into one character per row in the CSV file. Is there anyone out there who can help me out? It would be fantastic if someone could give me a hand! Here's the code I
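One character per CSV cell is the classic symptom of passing a bare string to `csv.writer.writerow()`, which iterates the string. A minimal sketch of the fix, using hypothetical scraped values in place of real BeautifulSoup output:

```python
import csv
import io

# writerow() iterates its argument: a bare string becomes one character
# per cell. Each row must be a list or tuple of field values instead.
rows = [
    # hypothetical scraped values -- the real ones would come from BeautifulSoup
    ("Thread title", "some_user", "First post text"),
    ("Thread title", "other_user", "A reply"),
]

buf = io.StringIO()  # stand-in for open("out.csv", "w", newline="")
writer = csv.writer(buf)
writer.writerow(["thread", "user", "post"])  # header row
for row in rows:
    writer.writerow(row)                     # a sequence per row, not a string

print(buf.getvalue())
```

With a real file, pass `newline=""` to `open()` so the csv module controls line endings itself.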

Scrapy - Correct way to change User Agent in Request

旧时模样 submitted on 2021-02-16 14:13:22
Question: I have created a custom middleware in Scrapy by overriding RetryMiddleware, which changes both the proxy and the User-Agent before retrying. It looks like this: class CustomRetryMiddleware(RetryMiddleware): def _retry(self, request, reason, spider): retries = request.meta.get('retry_times', 0) + 1 if retries <= self.max_retry_times: Proxy_UA_Middleware.switch_proxy() Proxy_UA_Middleware.switch_ua() logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s", {'request': request,
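The rotation logic can be sketched standalone, without the Scrapy wiring (pool contents and names below are hypothetical). In a real middleware, Scrapy reads the proxy from `request.meta['proxy']`, while the User-Agent must be set on `request.headers` of the retried request rather than on a global:

```python
import random

# Standalone sketch of per-retry identity rotation. In Scrapy, these values
# would be applied to the retry request inside _retry() before returning it.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]

def switch_identity(request_meta):
    """Pick a fresh proxy and User-Agent for a retried request."""
    request_meta["proxy"] = random.choice(PROXIES)        # read by HttpProxyMiddleware
    request_meta["user_agent"] = random.choice(USER_AGENTS)  # copy into headers['User-Agent']
    return request_meta

meta = switch_identity({"retry_times": 1})
print(meta["proxy"], meta["user_agent"])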

I want to print a proper table out of data scraped using Scrapy

匆匆过客 submitted on 2021-02-11 17:20:38
Question: I have written all the code to scrape the table from [http://www.rarityguide.com/cbgames_view.php?FirstRecord=21][1], but I am getting output like # the output that I get {'EXG': (['17.00', '10.00', '90.00', '9.00', '13.00', '17.00', '16.00', '43.00', '125.00', '16.00', '11.00', '150.00', '17.00', '24.00', '15.00', '24.00', '21.00', '36.00', '270.00', '280.00'],), 'G': ['8.00', '5.00', '38.00', '2.00', '6.00', '7.00', '6.00', '20.00', '40.00', '7.00', '5.00', '70.00', '6.00', '12.00', '7.00',
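Note that the 'EXG' value in the output is a one-element tuple wrapping the list (typically from a stray trailing comma when the value was assigned). Unwrapping it and zipping the columns gives a printable table; a short hypothetical subset of the scraped data stands in below:

```python
# 'EXG' is accidentally a 1-tuple containing the list; 'G' is a plain list.
data = {
    "EXG": (["17.00", "10.00", "90.00"],),   # note the accidental tuple
    "G": ["8.00", "5.00", "38.00"],
}

# Normalise: unwrap any single-element tuples so every value is a list.
columns = {k: (v[0] if isinstance(v, tuple) else v) for k, v in data.items()}

header = "{:>8} {:>8}".format(*columns.keys())
print(header)
for exg, g in zip(columns["EXG"], columns["G"]):
    print("{:>8} {:>8}".format(exg, g))
```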

python requests not getting full page

巧了我就是萌 submitted on 2021-02-11 16:52:08
Question: """THIS IS MY CODE """ import requests from bs4 import BeautifulSoup import random from selenium import webdriver url ="http://www.yopmail.com/en/?smith" request = requests.get(url) soup = BeautifulSoup(request.text, 'html5lib') print(soup) """IT RETURNING THIS OUTPUT """ <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"><head> <meta content="text/html; charset=utf-8" http-equiv=
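`requests` only returns the static HTML: it neither executes JavaScript nor fetches iframes, and on a webmail page like this the inbox content typically lives in a separate iframe document, so the outer page looks "empty". An offline sketch with the standard library, using a hypothetical stand-in for the HTML that `requests` actually received:

```python
from html.parser import HTMLParser

# Hypothetical static HTML as requests would see it: the content is behind
# an iframe whose document would have to be fetched separately (or the page
# rendered with a real browser driver such as Selenium).
html = '<html><body><iframe id="ifmail" src="/inbox?login=smith"></iframe></body></html>'

class IframeFinder(HTMLParser):
    """Collect the src attribute of every iframe in a document."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.sources.append(dict(attrs).get("src"))

finder = IframeFinder()
finder.feed(html)
print(finder.sources)  # the URLs that would need separate requests
```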

Failing to create the data frame and populate its data into the CSV file properly

杀马特。学长 韩版系。学妹 submitted on 2021-02-11 13:56:10
Question: I'm looking to scrape this link for just two simple pieces of information, but I don't know why I get this result; it doesn't give me all the data I'm searching for: particulier_allinfo particulier_tel 0 ABEL KEVIN10 RUE VIRGILE67200 Strasbourg This is the code, thanks for your help: import bs4 as bs import urllib import urllib.request import requests from bs4 import BeautifulSoup import pandas from pandas import DataFrame import csv with open('test_bs_118000.csv', mode='w') as csv_file:
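The single concatenated row suggests the scraped fields were joined before being written. Building one dict per listing and writing them with `csv.DictWriter` keeps each field in its own column; the record values below are hypothetical placeholders, not real scraped data:

```python
import csv
import io

# One dict per scraped listing; field names match the question's two columns,
# the values are hypothetical placeholders.
records = [
    {"particulier_allinfo": "ABEL KEVIN, 10 RUE VIRGILE, 67200 Strasbourg",
     "particulier_tel": "(hypothetical number)"},
]

buf = io.StringIO()  # stand-in for open('test_bs_118000.csv', 'w', newline='')
writer = csv.DictWriter(buf, fieldnames=["particulier_allinfo", "particulier_tel"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

The same list of dicts also drops straight into `pandas.DataFrame(records)` if a data frame is still wanted.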

Scraping Data from a Tableau Map

廉价感情. submitted on 2021-02-09 08:46:27
Question: I am trying to pull the locations and names of Naloxone distribution centers in Illinois for a research project on the opioid crisis. This Tableau-generated dashboard from the Department of Public Health is accessible here: https://idph.illinois.gov/OpioidDataDashboard/ I've tried everything I could find. First, changing the URL to "download" the data using Tableau's interface; that only let me download a PDF map, not the actual dataset behind it. Second, I modified the Python script I've seen
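Scripts that pull data from Tableau dashboards usually replay the `bootstrapSession` request and then parse the response, which arrives as length-prefixed JSON chunks of the form `<count>;<json>`. A minimal offline parser for that shape, using a tiny hypothetical payload (the real length prefix counts bytes; characters are assumed equivalent here for the ASCII sample):

```python
import json

# Hypothetical stand-in for a Tableau bootstrap response: two chunks,
# each prefixed by its length and a semicolon.
raw = '20;{"a": 1, "b": "two"}13;{"c": [3, 4]}'

def parse_chunks(payload):
    """Split a length-prefixed chunk stream into parsed JSON objects."""
    chunks = []
    i = 0
    while i < len(payload):
        semi = payload.index(";", i)          # end of the length prefix
        length = int(payload[i:semi])
        start = semi + 1
        chunks.append(json.loads(payload[start:start + length]))
        i = start + length                    # jump to the next prefix
    return chunks

parts = parse_chunks(raw)
print(parts)
```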

LoadError: cannot load such file -- capybara (standalone code)

独自空忆成欢 submitted on 2021-02-08 15:32:53
Question: I'm working on building a simple post miner using Ruby and the following tutorial (http://ngauthier.com/2014/06/scraping-the-web-with-ruby.html). Here is the code I currently have: #!/usr/bin/ruby require 'capybara' require 'capybara/poltergeist' include Capybara::DSL Capybara.default_driver = :poltergeist visit "http://dilloncarter.com" all(".posts .post ").each do |post| title = post.find("h1 a").text url = post.find("h1 a")["href"] date = post.find("a")["datetime"] summary = post.find("p

What is the most efficient way to capture the screen in Python using modules such as PIL or cv2? It takes up a lot of RAM

99封情书 submitted on 2021-02-08 06:56:23
Question: What is the most efficient way to capture the screen in Python using modules such as PIL or cv2? It takes up a lot of RAM. I wanted to teach an AI to play the dino game in Chrome through screen scraping and NEAT, but it is way too slow... I have tried: import numpy as np from PIL import ImageGrab import cv2 import time last_time = time.time() while True: printscreen_pil = ImageGrab.grab(bbox= (0, 40, 800, 640)) printscreen_numpy = np.array(printscreen_pil.getdata(), dtype = 'uint8').reshape(
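The slow path in the quoted code is `np.array(img.getdata()).reshape(...)`, which copies every pixel through a Python-level sequence; converting the raw buffer in one step avoids that (and a dedicated grabber such as the `mss` library is commonly faster than `ImageGrab` for repeated captures). The sketch below uses a synthetic BGRA buffer in place of a real screen grab so it runs headless:

```python
import time

import numpy as np

# Synthetic 600x800 4-channel buffer standing in for ImageGrab/mss pixel data.
h, w = 600, 800
raw = bytes(h * w * 4)

# Fast: reinterpret the buffer directly, no per-pixel Python objects.
t0 = time.perf_counter()
fast = np.frombuffer(raw, dtype=np.uint8).reshape(h, w, 4)
t_fast = time.perf_counter() - t0

# Slow: materialise every pixel as a Python int first (what getdata() implies).
t0 = time.perf_counter()
slow = np.array(list(raw), dtype=np.uint8).reshape(h, w, 4)
t_slow = time.perf_counter() - t0

print(fast.shape, "fast:", t_fast, "slow:", t_slow)
```

With a real PIL image, `np.array(img)` performs the same one-step conversion.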

Detect whether a web page has changed

只谈情不闲聊 submitted on 2021-02-07 12:35:00
Question: In my Python application I have to read many web pages to collect data. To reduce the number of HTTP calls I would like to fetch only pages that have changed. My problem is that my code always tells me the pages have changed (code 200) when in reality they have not. This is my code: from models import mytab import re import urllib2 from wsgiref.handlers import format_date_time from datetime import datetime from time import mktime def url_change(): urls = mytab.objects.all() # this is some urls: # http:/
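Many servers simply ignore `If-Modified-Since` (and `If-None-Match`) and answer 200 regardless, which matches the symptom described. A robust fallback is to hash the body actually received and compare it with the hash stored from the previous fetch; conditional headers then become an optimisation, not the only signal. An offline sketch with hypothetical stored values:

```python
import hashlib

def body_fingerprint(body: bytes) -> str:
    """Stable fingerprint of a response body for change detection."""
    return hashlib.sha256(body).hexdigest()

# Hash remembered from the previous fetch (hypothetical stored value).
previous = body_fingerprint(b"<html>old content</html>")

# Body returned by the next fetch -- identical here, so no change detected
# even though the server answered 200.
new_body = b"<html>old content</html>"
changed = body_fingerprint(new_body) != previous
print(changed)  # False
```

When a server does honour conditional requests, a 304 response can short-circuit the hash comparison entirely.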