screen-scraping

Screen scraping a Datepicker with Scrapy and Selenium on mouse hover

雨燕双飞 submitted on 2019-12-08 04:37:29
Question: So I need to scrape a page like this, for example, and I am using Scrapy + Selenium to interact with a date-picker calendar. I noticed that if a certain date is available, a price shows in a tooltip, and if it is not available, nothing happens when you hover over it. What is the code to get the price that appears dynamically when you hover over an available day, and how do I tell whether a day is available just from the hover? Answer 1: It is not that straightforward how to approach the problem…
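Before driving the mouse with Selenium's ActionChains (move_to_element), it is often simpler to check the DOM the tooltip reads from: many date pickers mark bookable days with a data attribute. A minimal stdlib sketch, assuming hypothetical markup in which available days carry a data-price attribute:

```python
from html.parser import HTMLParser

# Hypothetical calendar markup: available days carry a data-price attribute.
CALENDAR = """
<table class="datepicker">
  <td class="day" data-day="1" data-price="120.00">1</td>
  <td class="day" data-day="2">2</td>
  <td class="day" data-day="3" data-price="99.50">3</td>
</table>
"""

class DayParser(HTMLParser):
    """Collect day -> price; price is None for unavailable days."""
    def __init__(self):
        super().__init__()
        self.days = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "td" and "data-day" in a:
            self.days[a["data-day"]] = a.get("data-price")  # None -> not bookable

parser = DayParser()
parser.feed(CALENDAR)
available = {d: p for d, p in parser.days.items() if p is not None}
```

If the price really only exists after the hover event fires, this static check will not see it and a real browser (Selenium) is needed; the sketch covers the common case where the tooltip text is already in the markup.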

HttpRequest: pass through AuthLogin

邮差的信 submitted on 2019-12-08 04:26:00
Question: I need to make a simple program that logs in to a certain website with given credentials and then navigates to some element (a link). Is that even possible (I mean this AuthLogin thing)? EDIT: Sorry, I am on my company machine and I cannot click on "Vote" or "Add comment"; the page says "Done, but with errors on page" (IE...). I do appreciate your answers and comments; you have helped me a lot! Answer 1: The main things to do are: start using Fiddler to see what needs to be sent and in what way. Assuming…
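For a plain form login, the flow the answer hints at is: POST the credentials, keep the session cookie, then request the target page with the same cookie jar. A stdlib sketch in which the URL and form field names are hypothetical (Fiddler tells you the real ones); the request is only constructed here, not sent:

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical login endpoint and form fields; inspect the real ones with Fiddler.
form = urllib.parse.urlencode({"username": "alice", "password": "secret"}).encode()
req = urllib.request.Request("https://example.com/login", data=form, method="POST")
req.add_header("Content-Type", "application/x-www-form-urlencoded")

# An opener with a cookie jar keeps the session cookie for later navigation:
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(CookieJar()))
# opener.open(req)                        # would perform the login
# opener.open("https://example.com/app")  # would reuse the session cookie
```

Sites using Windows/NTLM authentication rather than a form need a different mechanism; this sketch covers only the form-POST case.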

Python: load text as python object

☆樱花仙子☆ submitted on 2019-12-08 04:25:58
Question: I have the following text to load: https://sites.google.com/site/iminside1/paste I'd prefer to create a Python dictionary from it, but any object is OK. I tried pickle, json and eval, but didn't succeed. Can you help me with this? Thanks! The results: a = open("the_file", "r").read() json.loads(a) ValueError: Expecting property name: line 1 column 1 (char 1) pickle.loads(a) KeyError: '{' eval(a) File "<string>", line 19 from: {code: 'DME', airport: "Домодедово", city: 'Москва', country: 'Россия'…
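json fails here because the keys are unquoted: the text is a JavaScript object literal, not JSON (and eval chokes on the bare identifiers too). One pragmatic fix, sketched on a shortened sample of the data above, is to quote the bare keys and normalize the quotes before calling json.loads; note the quote replacement breaks if any value contains an apostrophe:

```python
import json
import re

# Shortened sample of the JS-style object from the question.
raw = "{code: 'DME', airport: \"Домодедово\", city: 'Москва', country: 'Россия'}"

# Quote bare identifiers used as keys, then normalize single quotes to double.
quoted = re.sub(r"([{,]\s*)([A-Za-z_]\w*)\s*:", r'\1"\2":', raw)
data = json.loads(quoted.replace("'", '"'))
```

After the rewrite the string is valid JSON, so json.loads returns an ordinary dict.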

Web scraping using VBA

隐身守侯 submitted on 2019-12-08 03:45:21
Question: I would like to extract data from this URL. I want to extract the title, mobile contact number and address from each of the 10 business cards. Here is some code I tried, without success. Public Sub GetValueFromBrowser() On Error Resume Next Dim Sn As Integer Dim ie As Object Dim url As String Dim Doc As HTMLDocument Dim element As IHTMLElement Dim elements As IHTMLElementCollection For Sn = 1 To 1 url = Sheets("Infos").Range("C" & Sn).Value Set ie = CreateObject("InternetExplorer.Application")…
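The card-by-card extraction logic is easy to prototype outside VBA before wiring it into getElementsByClassName calls. A Python sketch with the stdlib parser, under the assumption (hypothetical, not taken from the real page) that each card is a div with class "card" containing title/phone/addr spans:

```python
from html.parser import HTMLParser

# Hypothetical card markup; the real class names must be read from the page source.
PAGE = """
<div class="card"><span class="title">Acme Tools</span>
  <span class="phone">98765 43210</span><span class="addr">12 Main Rd</span></div>
<div class="card"><span class="title">Best Cabs</span>
  <span class="phone">91234 56789</span><span class="addr">7 Park St</span></div>
"""

class CardParser(HTMLParser):
    """Collect one {title, phone, addr} dict per div.card."""
    def __init__(self):
        super().__init__()
        self.cards, self.field = [], None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "div" and cls == "card":
            self.cards.append({})          # start a new card
        elif tag == "span" and cls in ("title", "phone", "addr"):
            self.field = cls               # next text node belongs to this field

    def handle_data(self, data):
        if self.field and self.cards:
            self.cards[-1][self.field] = data.strip()
            self.field = None

parser = CardParser()
parser.feed(PAGE)
```

The VBA equivalent is the same loop: iterate the card elements, then read one child element per field.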

Python Mechanize Browser: HTTP Error 460

心已入冬 submitted on 2019-12-08 02:21:40
Question: I am trying to log into a site using a mechanize browser and getting an HTTP 460 error, which appears to be a made-up status code, so I'm not sure what to make of it. Here's the code: # Browser br = mechanize.Browser() # Cookie Jar cj = cookielib.LWPCookieJar() br.set_cookiejar(cj) # Browser options br.set_handle_equiv(True) br.set_handle_redirect(True) br.set_handle_referer(True) br.set_handle_robots(False) br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1) br.addheaders = [(…
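Non-standard status codes like 460 are often a site rejecting clients that do not look like a browser, which is exactly what the truncated br.addheaders line is for: sending a browser-like User-Agent. The same idea with only the stdlib, against a placeholder URL (the request is built but not sent):

```python
import urllib.request

# Placeholder URL; the point is the browser-like User-Agent header,
# the counterpart of mechanize's br.addheaders.
req = urllib.request.Request(
    "https://example.com/login",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"},
)
# urllib.request.urlopen(req)  # would actually send the request
```

If the header alone does not help, capturing a real browser login with a proxy tool and diffing the requests is the next step.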

Should I use Yahoo-Pipes to scrape the contents of a div?

a 夏天 submitted on 2019-12-08 02:11:55
Question: Given: URL - http://www.contoso.com/search.php?q={param} returns: -html- --body- {...} ---div id='foo'- ----div id='page1'/- ----div id='page2'/- ----div id='page3'/- ----div id='pageN'/- ---/div- {...} --/body- -/html- Wanted: the innerHTML of div id='foo' must be fetched by the client (i.e. JavaScript). It will be split into discrete items (div id='page1' to div id='pageN'). API throttling prevents server-side code from pre-fetching the data, so the parsing and manipulation burden must…
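Whatever fetches the page, the splitting step is just "find div#foo, collect its child div ids". The question wants this done in client-side JavaScript; the parsing logic is sketched here in Python on a toy version of the response shape shown above (the page contents are invented):

```python
from html.parser import HTMLParser

# Toy version of the response shape described in the question.
HTML = """
<html><body><div id='foo'>
  <div id='page1'>a</div><div id='page2'>b</div><div id='page3'>c</div>
</div></body></html>
"""

class PageSplitter(HTMLParser):
    """Record the ids of the divs nested inside div#foo."""
    def __init__(self):
        super().__init__()
        self.in_foo, self.pages = False, []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag != "div":
            return
        if a.get("id") == "foo":
            self.in_foo = True
        elif self.in_foo and "id" in a:
            self.pages.append(a["id"])

splitter = PageSplitter()
splitter.feed(HTML)
```

In the browser the equivalent is a one-liner over document.querySelectorAll, so a service like Yahoo Pipes adds little for this particular task.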

Parsing ajax responses to retrieve final url content in Scrapy?

徘徊边缘 submitted on 2019-12-08 01:39:07
Question: I have the following problem: my scraper starts at a "base" URL. This page contains a dropdown that creates another dropdown via ajax calls, and this cascades 2-3 times until it has all the information needed to get to the "final" page where the actual content I want to scrape is. Rather than clicking things (and having to use Selenium or similar), I use the page's exposed JSON API to mimic this behavior, so instead of clicking dropdowns I simply send a request and read the JSON responses that…
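The cascade itself is just "parse each JSON response, pull out the parameter the next request needs, stop when the final URL appears". A framework-free sketch with canned responses standing in for the ajax endpoints (all field names here are hypothetical):

```python
import json

# Canned JSON responses standing in for the ajax endpoints (hypothetical fields).
RESPONSES = {
    "base":    '{"next": "regions", "value": "EU"}',
    "regions": '{"next": "cities",  "value": "Berlin"}',
    "cities":  '{"next": null,      "final_url": "/listings?city=Berlin"}',
}

def follow_cascade(start):
    """Walk the dropdown cascade until a response carries the final URL."""
    step, params = start, []
    while step is not None:
        payload = json.loads(RESPONSES[step])
        if payload.get("next") is None:
            return payload["final_url"], params
        params.append(payload["value"])
        step = payload["next"]

final_url, params = follow_cascade("base")
```

In Scrapy the same shape is expressed by yielding a Request per step whose callback parses the JSON and yields the next Request.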

Screen scraping in clojure

纵然是瞬间 submitted on 2019-12-07 12:29:49
Question: I googled, but I can't find a satisfactory answer. This SO question is related but rather old, as well as the exact opposite of what I am looking for: a way to do screen-scraping using XPath, not CSS selectors. I've used Enlive for some basic screen-scraping, but sometimes one needs the power of XPath selectors. So here it is: is there any equivalent to Nokogiri or lxml for Clojure (Java)? What is the state of the "pure Java Nokogiri"? Is there any way to use the library from Clojure? Any better…

Scraping multiple table out of webpage in R

烈酒焚心 submitted on 2019-12-07 06:24:26
I am trying to pull mutual fund data into R. My code works for a single table, but when there are multiple tables in a webpage it doesn't work. Link - https://in.finance.yahoo.com/q/pm?s=115748.BO My code: url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO" library(XML) perftable <- readHTMLTable(url, header = T, which = 1, stringsAsFactors = F) but I am getting an error message: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’ In addition: Warning message: XML content does not seem to be XML: ' https:/…
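The "NULL" signature in the error suggests readHTMLTable received no parsed document at all; a common cause is that the XML package does not fetch https URLs itself, so the page content has to be downloaded first and the resulting text parsed. The which = 1 argument then simply indexes one table out of several. That index-the-tables idea, sketched in Python on a toy page (the sample data is invented, not the Yahoo page):

```python
from html.parser import HTMLParser

# Invented two-table page; stands in for a downloaded fund page.
PAGE = """
<table><tr><td>Fund</td><td>NAV</td></tr><tr><td>ABC Growth</td><td>42.1</td></tr></table>
<table><tr><td>1Y</td><td>3Y</td></tr><tr><td>8.2</td><td>11.5</td></tr></table>
"""

class TableCollector(HTMLParser):
    """Collect every table as a list of rows (lists of cell strings)."""
    def __init__(self):
        super().__init__()
        self.tables, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self.row = []
        elif tag == "td":
            self.cell = ""

    def handle_data(self, data):
        if self.cell is not None:
            self.cell += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.row.append(self.cell.strip())
            self.cell = None
        elif tag == "tr":
            self.tables[-1].append(self.row)
            self.row = None

collector = TableCollector()
collector.feed(PAGE)
perftable = collector.tables[0]  # the analogue of which = 1
```

Changing the index picks a different table, which is all "multiple tables" requires once the document actually parses.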

How to extract links from a webpage using lxml, XPath and Python?

徘徊边缘 submitted on 2019-12-07 06:22:37
Question: I've got this XPath query: /html/body//tbody/tr[*]/td[*]/a[@title]/@href It extracts all the links with a title attribute - and gives the href in Firefox's XPath Checker add-on. However, I cannot seem to use it with lxml. from lxml import etree parsedPage = etree.HTML(page) # Create parse tree from the page. # XPath query hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href") for x in hyperlinks: print(x) # Print links in <a> tags containing the title attribute
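With lxml, a query ending in /@href returns the attribute values as plain strings, so the loop above should print the hrefs directly. Even the stdlib can approximate the selection; a sketch with xml.etree.ElementTree, which supports the [@title] predicate but not a trailing /@href step, so the attribute is read with .get() instead (the sample markup is invented):

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample matching the query's structure.
PAGE = """<html><body><table><tbody><tr><td>
<a title="Docs" href="/docs">Docs</a>
<a href="/untitled">no title attribute</a>
</td></tr></tbody></table></body></html>"""

root = ET.fromstring(PAGE)
# ElementTree understands the [@title] predicate; @href must come from .get().
hrefs = [a.get("href") for a in root.iterfind(".//a[@title]")]
```

One classic lxml pitfall with queries like this: real-world pages often omit tbody in the source even though Firefox inserts it into its DOM, so a query anchored on tbody can match nothing in lxml while "working" in the browser.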