web-scraping

Twython with the 140-character limitation of Twitter

折月煮酒 submitted on 2021-01-28 05:43:19
Question: I am trying to search Twitter using Twython, but it seems that the library has a limitation of 140 characters. With Twitter's new feature, i.e. the 280-character length, what can one do? Answer 1: This is not a limitation of Twython. The Twitter API by default returns the old 140-character limited tweet. In order to see the newer extended tweet you just need to supply this parameter to your search query: tweet_mode=extended Then, you will find the 280-character extended tweet in the full_text field of each returned status.
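A minimal sketch of the suggested tweet_mode=extended usage; the credentials and the search term are hypothetical placeholders, not taken from the original post:

from twython import Twython

APP_KEY = "your-app-key"                          # hypothetical placeholder
APP_SECRET = "your-app-secret"                    # hypothetical placeholder
OAUTH_TOKEN = "your-access-token"                 # hypothetical placeholder
OAUTH_TOKEN_SECRET = "your-access-token-secret"   # hypothetical placeholder

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

# Ask the API for the extended (280-character) tweet representation.
results = twitter.search(q="python", tweet_mode="extended", count=10)

for status in results["statuses"]:
    # Extended tweets carry their text in "full_text" rather than "text".
    print(status["full_text"])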

Webscrape loop on all URLs in Column A

徘徊边缘 submitted on 2021-01-28 05:39:21
Question: I'm trying to scrape the Facebook video titles from a list of URLs. I've got my macro working for a single video in which the URL is built into the code. I'd like the script to instead loop through each URL in Column A and output the video title into Column B. Any help? Current code: Sub ScrapeVideoTitle() Dim appIE As Object Set appIE = CreateObject("internetexplorer.application") With appIE .navigate "https://www.facebook.com/rankertotalnerd/videos/276505496352731/" .Visible = True Do

ValueError: could not convert string to float: (pd.Series)

假如想象 submitted on 2021-01-28 05:00:40
Question: I'm failing to execute a lambda function on the code snippet below. My goal is to split the columns btts_x and btts_y for further maths calculations. The lambda function succeeds on the first column, btts_x (see btts_x_1 & btts_x_2), but fails on column btts_y, as the ValueError in the traceback shows. I think I need to pass a re.sub() inside the lambda function; however, I'm stuck on it and would appreciate help! Note: special character(s) \n\n in Team_x &
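A minimal sketch of the re.sub-inside-lambda idea the asker mentions, using a toy DataFrame; the column contents shown here (percentage-like strings with embedded \n\n) are an assumption, since the real data is not in the excerpt:

import re
import pandas as pd

# Toy data standing in for the scraped table; the real values are not shown in the excerpt.
df = pd.DataFrame({
    "btts_x": ["55%\n\n45%", "60%\n\n40%"],
    "btts_y": ["52%\n\n48%", "47%\n\n53%"],
})

for col in ("btts_x", "btts_y"):
    # Strip everything except digits, dots and whitespace, then split on the whitespace
    # so each half can be converted to float without raising ValueError.
    df[f"{col}_1"] = df[col].apply(lambda c: float(re.sub(r"[^\d.\s]", "", c).split()[0]))
    df[f"{col}_2"] = df[col].apply(lambda c: float(re.sub(r"[^\d.\s]", "", c).split()[1]))

print(df)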

Scrapy: saving cookies between invocations

久未见 submitted on 2021-01-28 03:13:49
Question: Is there a way to preserve cookies between invocations of a Scrapy crawler? The purpose: the site requires a log-in and then maintains the session via cookies. I'd rather reuse the session than re-log-in every time. Answer 1: Please refer to the docs about cookies: the FAQ entry and CookiesMiddleware. Alternatively, you can send Request objects with cookies managed by yourself (you can read the cookies from the Response objects' headers). See the docs about Request and Response objects. Source: https://stackoverflow.com/questions
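A rough sketch of the manage-cookies-yourself approach, persisting the session cookies to a JSON file between runs; the file name, URLs and form fields are hypothetical, not from the original post:

import json
import os
import scrapy

COOKIE_FILE = "cookies.json"  # hypothetical path for the persisted session

class SessionSpider(scrapy.Spider):
    name = "session_spider"

    def start_requests(self):
        if os.path.exists(COOKIE_FILE):
            # Reuse the previously saved session instead of logging in again.
            with open(COOKIE_FILE) as f:
                cookies = json.load(f)
            yield scrapy.Request("https://example.com/private",
                                 cookies=cookies, callback=self.parse_private)
        else:
            yield scrapy.FormRequest("https://example.com/login",
                                     formdata={"user": "me", "pass": "secret"},
                                     callback=self.after_login)

    def after_login(self, response):
        # Read the cookies the site just set and store them for the next run.
        cookies = {}
        for header in response.headers.getlist("Set-Cookie"):
            name, _, value = header.decode().split(";", 1)[0].partition("=")
            cookies[name] = value
        with open(COOKIE_FILE, "w") as f:
            json.dump(cookies, f)
        yield scrapy.Request("https://example.com/private",
                             cookies=cookies, callback=self.parse_private)

    def parse_private(self, response):
        # ... parse the logged-in page here ...
        pass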

Scraping Javascript-Rendered Content in R from a Webpage without Unique URL

為{幸葍}努か submitted on 2021-01-28 03:04:39
Question: I want to scrape historical results of South African LOTTO draws (especially Total Pool Size, Total Sales, etc.) from the South African National Lottery website. By default one sees links to results for the last ten draws, or one can select a date range to pull up a larger set of links to draws (which will still display only ten per page). Hovering in the browser over a link, e.g. 'LOTTO DRAW 2012', we see javascript:void();, so it is clear that the draw results will be rendered using JavaScript

Puppeteer not behaving like in Developer Console

孤人 submitted on 2021-01-27 21:57:19
Question: I am trying to extract, using Puppeteer, the title of this page: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106 I have the code below, (async () => { const browser = await puppet.launch({ headless: true }); const page = await browser.newPage(); await page.goto(req.params[0]); //this is the url title = await page.evaluate(() => { Array.from(document.querySelectorAll("meta")).filter(function ( el ) { return ( (el.attributes.name !== null && el.attributes.name !==

R rvest retrieves an empty table

喜欢而已 submitted on 2021-01-27 21:53:27
Question: I'm trying two strategies to get data from a web table: library(tidyverse) library(rvest) webpage <- read_html('https://markets.cboe.com/us/equities/market_statistics/book/') data <- html_table(webpage, fill=TRUE) data[[2]] library("httr") library("XML") URL <- 'https://markets.cboe.com/us/equities/market_statistics/book/' temp <- tempfile(fileext = ".html") GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp)) df <- readHTMLTable(temp) df <- df[[2]] Both of them are returning an

How to load and parse the whole content of a dynamic page that uses infinite scroll

别等时光非礼了梦想. submitted on 2021-01-27 21:12:40
Question: I have been trying to solve my problem by searching and reading documentation. The problem: I want to get all video titles from a YouTube channel using Python and Beautiful Soup. YouTube loads dynamically, I think with JavaScript; without pyqt5 I just cannot get any title, so I used pyqt5 and was able to get titles from the YouTube channel. The problem is that I need to load all the videos, but I can only load the first 29 or 30. I am thinking of simulating a scroll-down or something like that. I can
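One commonly used approach (not the asker's pyqt5 setup) is to drive a real browser with Selenium and keep scrolling the channel page until no new content loads; a rough sketch, where the channel URL and the "video-title" id are assumptions about YouTube's markup rather than details from the post:

from selenium import webdriver
import time

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://www.youtube.com/c/SOME_CHANNEL/videos")  # hypothetical channel URL

last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
    # Scroll to the bottom so the infinite-scroll loader fetches the next batch of videos.
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(2)  # give the new batch time to render
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break  # nothing new loaded, so we have reached the end of the channel
    last_height = new_height

# Collect the title attribute of every loaded video link.
titles = [el.get_attribute("title") for el in driver.find_elements_by_id("video-title")]
print(len(titles))
driver.quit()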

Selenium only returns an empty list

余生颓废 submitted on 2021-01-27 20:51:22
Question: I'm trying to scrape football team names from betfair.com and, no matter what, it returns an empty list. This is what I've tried most recently. from selenium import webdriver import pandas as pd driver = webdriver.Chrome(r'C:\Users\Tom\Desktop\chromedriver\chromedriver.exe') driver.get('https://www.betfair.com/exchange/plus/football') team = driver.find_elements_by_xpath('//*[@id="main-wrapper"]/div/div[2]/div/ui-view/div/div/div/div/div[1]/div/div[1]/bf-super-coupon/main/ng-include[3]/section
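An empty list from find_elements often means the elements had not rendered yet when the lookup ran. A hedged sketch that waits explicitly before collecting the names; the CSS selector is a guess at the page's markup, not taken from the original post:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(r'C:\Users\Tom\Desktop\chromedriver\chromedriver.exe')
driver.get('https://www.betfair.com/exchange/plus/football')

# Wait up to 20 s for at least one runner element to be present before reading it.
# "ul.runners li" is a hypothetical selector; inspect the page for the real one.
WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "ul.runners li"))
)

teams = [el.text for el in driver.find_elements_by_css_selector("ul.runners li")]
print(teams)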

Selenium Python - Explicit waits not working

*爱你&永不变心* submitted on 2021-01-27 19:04:36
Question: I am unable to get explicit waits to work while waiting for the page to render the JS, so I am forced to use time.sleep() in order for the code to work as intended. I read the docs and still wasn't able to get it to work. http://selenium-python.readthedocs.io/waits.html The commented-out section of code with the time.sleep() works as intended. The WebDriverWait part runs but does not wait. from selenium import webdriver import time from selenium.webdriver.support.ui import WebDriverWait from
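A minimal sketch of an explicit wait wired up the usual way; the URL and element id are hypothetical placeholders, since the asker's locator is cut off in the excerpt. A common cause of "runs but does not wait" is an expected condition that already matches something in the initial HTML, so this condition targets an element the JS is expected to create and waits for it to be visible:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/js-rendered-page")  # hypothetical URL

try:
    # Block for up to 15 s until the JS-created element is actually visible,
    # not merely present in the DOM.
    element = WebDriverWait(driver, 15).until(
        EC.visibility_of_element_located((By.ID, "rendered-content"))
    )
    print(element.text)
finally:
    driver.quit()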