screen-scraping

BeautifulSoup - scraping a forum page

丶灬走出姿态 submitted on 2021-02-17 09:04:47
Question: I'm trying to scrape a forum discussion and export it as a CSV file, with columns such as "thread title", "user", and "post", where the latter is the actual forum post from each individual. I'm a complete beginner with Python and BeautifulSoup, so I'm having a really hard time with this! My current problem is that all the text is split into one character per row in the CSV file. Is there anyone out there who can help me out? It would be fantastic if someone could give me a hand! Here's the code I
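One character per CSV cell is the classic symptom of passing a bare string to `csv.writer.writerow()`, which iterates the string. A minimal sketch of the fix, using hypothetical scraped values in place of real BeautifulSoup output:

```python
import csv
import io

# writerow() iterates its argument: a bare string becomes one character
# per cell. Each row must be a list or tuple of field values instead.
rows = [
    # hypothetical scraped values -- the real ones would come from BeautifulSoup
    ("Thread title", "some_user", "First post text"),
    ("Thread title", "other_user", "A reply"),
]

buf = io.StringIO()  # stand-in for open("out.csv", "w", newline="")
writer = csv.writer(buf)
writer.writerow(["thread", "user", "post"])  # header row
for row in rows:
    writer.writerow(row)                     # a sequence per row, not a string

print(buf.getvalue())
```

With a real file, pass `newline=""` to `open()` so the csv module controls line endings itself.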

Scrapy - Correct way to change User Agent in Request

旧时模样 submitted on 2021-02-16 14:13:22
Question: I have created a custom middleware in Scrapy by overriding RetryMiddleware, which changes both the proxy and the User-Agent before retrying. It looks like this: class CustomRetryMiddleware(RetryMiddleware): def _retry(self, request, reason, spider): retries = request.meta.get('retry_times', 0) + 1 if retries <= self.max_retry_times: Proxy_UA_Middleware.switch_proxy() Proxy_UA_Middleware.switch_ua() logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s", {'request': request,
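The rotation logic can be sketched standalone, without the Scrapy wiring (pool contents and names below are hypothetical). In a real middleware, Scrapy reads the proxy from `request.meta['proxy']`, while the User-Agent must be set on `request.headers` of the retried request rather than on a global:

```python
import random

# Standalone sketch of per-retry identity rotation. In Scrapy, these values
# would be applied to the retry request inside _retry() before returning it.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]

def switch_identity(request_meta):
    """Pick a fresh proxy and User-Agent for a retried request."""
    request_meta["proxy"] = random.choice(PROXIES)        # read by HttpProxyMiddleware
    request_meta["user_agent"] = random.choice(USER_AGENTS)  # copy into headers['User-Agent']
    return request_meta

meta = switch_identity({"retry_times": 1})
print(meta["proxy"], meta["user_agent"])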

I want to print a proper table out of data scraped using Scrapy

匆匆过客 submitted on 2021-02-11 17:20:38
Question: I have written all the code to scrape the table from [http://www.rarityguide.com/cbgames_view.php?FirstRecord=21][1], but I am getting output like # the output that I get {'EXG': (['17.00', '10.00', '90.00', '9.00', '13.00', '17.00', '16.00', '43.00', '125.00', '16.00', '11.00', '150.00', '17.00', '24.00', '15.00', '24.00', '21.00', '36.00', '270.00', '280.00'],), 'G': ['8.00', '5.00', '38.00', '2.00', '6.00', '7.00', '6.00', '20.00', '40.00', '7.00', '5.00', '70.00', '6.00', '12.00', '7.00',
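Note that the 'EXG' value in the output is a one-element tuple wrapping the list (typically from a stray trailing comma when the value was assigned). Unwrapping it and zipping the columns gives a printable table; a short hypothetical subset of the scraped data stands in below:

```python
# 'EXG' is accidentally a 1-tuple containing the list; 'G' is a plain list.
data = {
    "EXG": (["17.00", "10.00", "90.00"],),   # note the accidental tuple
    "G": ["8.00", "5.00", "38.00"],
}

# Normalise: unwrap any single-element tuples so every value is a list.
columns = {k: (v[0] if isinstance(v, tuple) else v) for k, v in data.items()}

header = "{:>8} {:>8}".format(*columns.keys())
print(header)
for exg, g in zip(columns["EXG"], columns["G"]):
    print("{:>8} {:>8}".format(exg, g))
```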

python requests not getting full page

巧了我就是萌 submitted on 2021-02-11 16:52:08
Question: """THIS IS MY CODE """ import requests from bs4 import BeautifulSoup import random from selenium import webdriver url ="http://www.yopmail.com/en/?smith" request = requests.get(url) soup = BeautifulSoup(request.text, 'html5lib') print(soup) """IT RETURNING THIS OUTPUT """ <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"><head> <meta content="text/html; charset=utf-8" http-equiv=
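`requests` only returns the static HTML: it neither executes JavaScript nor fetches iframes, and on a webmail page like this the inbox content typically lives in a separate iframe document, so the outer page looks "empty". An offline sketch with the standard library, using a hypothetical stand-in for the HTML that `requests` actually received:

```python
from html.parser import HTMLParser

# Hypothetical static HTML as requests would see it: the content is behind
# an iframe whose document would have to be fetched separately (or the page
# rendered with a real browser driver such as Selenium).
html = '<html><body><iframe id="ifmail" src="/inbox?login=smith"></iframe></body></html>'

class IframeFinder(HTMLParser):
    """Collect the src attribute of every iframe in a document."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.sources.append(dict(attrs).get("src"))

finder = IframeFinder()
finder.feed(html)
print(finder.sources)  # the URLs that would need separate requests
```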

Failing to create the data frame and populate its data into the CSV file properly

杀马特。学长 韩版系。学妹 submitted on 2021-02-11 13:56:10
Question: I'm looking to scrape this link for just two simple pieces of information, but I don't know why I get this result; it doesn't give me all the data I'm searching for: particulier_allinfo particulier_tel 0 ABEL KEVIN10 RUE VIRGILE67200 Strasbourg This is the code, thanks for your help: import bs4 as bs import urllib import urllib.request import requests from bs4 import BeautifulSoup import pandas from pandas import DataFrame import csv with open('test_bs_118000.csv', mode='w') as csv_file:
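The single concatenated row suggests the scraped fields were joined before being written. Building one dict per listing and writing them with `csv.DictWriter` keeps each field in its own column; the record values below are hypothetical placeholders, not real scraped data:

```python
import csv
import io

# One dict per scraped listing; field names match the question's two columns,
# the values are hypothetical placeholders.
records = [
    {"particulier_allinfo": "ABEL KEVIN, 10 RUE VIRGILE, 67200 Strasbourg",
     "particulier_tel": "(hypothetical number)"},
]

buf = io.StringIO()  # stand-in for open('test_bs_118000.csv', 'w', newline='')
writer = csv.DictWriter(buf, fieldnames=["particulier_allinfo", "particulier_tel"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

The same list of dicts also drops straight into `pandas.DataFrame(records)` if a data frame is still wanted.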

Scraping Data from a Tableau Map

廉价感情. submitted on 2021-02-09 08:46:27
Question: I am trying to pull the locations and names of Naloxone distribution centers in Illinois for a research project on the opioid crisis. This Tableau-generated dashboard from the Department of Public Health is accessible here: https://idph.illinois.gov/OpioidDataDashboard/ I've tried everything I could find. First, changing the URL to "download" the data using Tableau's interface; that only let me download a PDF map, not the actual dataset behind it. Second, I modified the Python script I've seen
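Scripts that pull data from Tableau dashboards usually replay the `bootstrapSession` request and then parse the response, which arrives as length-prefixed JSON chunks of the form `<count>;<json>`. A minimal offline parser for that shape, using a tiny hypothetical payload (the real length prefix counts bytes; characters are assumed equivalent here for the ASCII sample):

```python
import json

# Hypothetical stand-in for a Tableau bootstrap response: two chunks,
# each prefixed by its length and a semicolon.
raw = '20;{"a": 1, "b": "two"}13;{"c": [3, 4]}'

def parse_chunks(payload):
    """Split a length-prefixed chunk stream into parsed JSON objects."""
    chunks = []
    i = 0
    while i < len(payload):
        semi = payload.index(";", i)          # end of the length prefix
        length = int(payload[i:semi])
        start = semi + 1
        chunks.append(json.loads(payload[start:start + length]))
        i = start + length                    # jump to the next prefix
    return chunks

parts = parse_chunks(raw)
print(parts)
```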

LoadError: cannot load such file -- capybara (standalone code)

独自空忆成欢 submitted on 2021-02-08 15:32:53
Question: I'm working on building a simple post miner using Ruby and the following tutorial (http://ngauthier.com/2014/06/scraping-the-web-with-ruby.html). Here is the code I currently have: #!/usr/bin/ruby require 'capybara' require 'capybara/poltergeist' include Capybara::DSL Capybara.default_driver = :poltergeist visit "http://dilloncarter.com" all(".posts .post ").each do |post| title = post.find("h1 a").text url = post.find("h1 a")["href"] date = post.find("a")["datetime"] summary = post.find("p

What is the most efficient way to capture the screen in Python using modules such as PIL or cv2? It takes up a lot of RAM

99封情书 submitted on 2021-02-08 06:56:23
Question: What is the most efficient way to capture the screen in Python using modules such as PIL or cv2? It takes up a lot of RAM. I wanted to teach an AI to play the dino game in Chrome through screen scraping and NEAT, but it is way too slow... I have tried: import numpy as np from PIL import ImageGrab import cv2 import time last_time = time.time() while True: printscreen_pil = ImageGrab.grab(bbox= (0, 40, 800, 640)) printscreen_numpy = np.array(printscreen_pil.getdata(), dtype = 'uint8').reshape(
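The slow path in the quoted code is `np.array(img.getdata()).reshape(...)`, which copies every pixel through a Python-level sequence; converting the raw buffer in one step avoids that (and a dedicated grabber such as the `mss` library is commonly faster than `ImageGrab` for repeated captures). The sketch below uses a synthetic BGRA buffer in place of a real screen grab so it runs headless:

```python
import time

import numpy as np

# Synthetic 600x800 4-channel buffer standing in for ImageGrab/mss pixel data.
h, w = 600, 800
raw = bytes(h * w * 4)

# Fast: reinterpret the buffer directly, no per-pixel Python objects.
t0 = time.perf_counter()
fast = np.frombuffer(raw, dtype=np.uint8).reshape(h, w, 4)
t_fast = time.perf_counter() - t0

# Slow: materialise every pixel as a Python int first (what getdata() implies).
t0 = time.perf_counter()
slow = np.array(list(raw), dtype=np.uint8).reshape(h, w, 4)
t_slow = time.perf_counter() - t0

print(fast.shape, "fast:", t_fast, "slow:", t_slow)
```

With a real PIL image, `np.array(img)` performs the same one-step conversion.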

Detect whether a web page has changed

只谈情不闲聊 submitted on 2021-02-07 12:35:00
Question: In my Python application I have to read many web pages to collect data. To reduce the number of HTTP calls I would like to fetch only pages that have changed. My problem is that my code always tells me the pages have changed (code 200) when in reality they have not. This is my code: from models import mytab import re import urllib2 from wsgiref.handlers import format_date_time from datetime import datetime from time import mktime def url_change(): urls = mytab.objects.all() # this is some urls: # http:/
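Many servers simply ignore `If-Modified-Since` (and `If-None-Match`) and answer 200 regardless, which matches the symptom described. A robust fallback is to hash the body actually received and compare it with the hash stored from the previous fetch; conditional headers then become an optimisation, not the only signal. An offline sketch with hypothetical stored values:

```python
import hashlib

def body_fingerprint(body: bytes) -> str:
    """Stable fingerprint of a response body for change detection."""
    return hashlib.sha256(body).hexdigest()

# Hash remembered from the previous fetch (hypothetical stored value).
previous = body_fingerprint(b"<html>old content</html>")

# Body returned by the next fetch -- identical here, so no change detected
# even though the server answered 200.
new_body = b"<html>old content</html>"
changed = body_fingerprint(new_body) != previous
print(changed)  # False
```

When a server does honour conditional requests, a 304 response can short-circuit the hash comparison entirely.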