screen-scraping

How to get the content of a JavaScript/AJAX-loaded div on a site?

Submitted by 旧巷老猫 on 2019-12-04 11:37:45
Question: I have a PHP script that loads page content from another website using cURL and the simple_html_dom PHP library. This works great: if I echo out the returned HTML, I can see the div content there. However, if I try to select only that div with simple_html_dom, the div always comes back empty. At first I didn't know why; now I know it's because its content is apparently populated by JavaScript/AJAX. How would I get the content of the site and then be able to select the div content AFTER…
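Since the div is filled in by JavaScript after page load, the raw HTML that cURL fetches really does contain an empty div, so any HTML parser will come back empty-handed. A minimal sketch of why (in Python rather than PHP, purely for illustration), using only the standard library:

```python
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    """Collects the text inside a <div> with a given id."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0          # nesting depth inside the target div
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1 if tag == "div" else 0
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)

def div_text(html, div_id):
    p = DivExtractor(div_id)
    p.feed(html)
    return "".join(p.text).strip()

# The server-rendered HTML ships the div empty; JS fills it in later.
raw = '<html><body><div id="prices"></div></body></html>'
print(repr(div_text(raw, "prices")))   # -> '' : nothing to scrape here
```

The practical fix is usually to open the browser's DevTools Network tab, find the XHR request whose response contains the div's data, and fetch that URL with cURL directly instead of the page itself.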

Python WWW macro

Submitted by 送分小仙女□ on 2019-12-04 11:28:24
Question: I need something like iMacros for Python. It would be great to have something like this:

    browse_to('www.google.com')
    type_in_input('search', 'query')
    click_button('search')
    list = get_all('<p>')

Do you know of something like that? Thanks in advance, Etam.

Answer 1: Almost a direct fulfillment of the wishes in the question: twill. twill is a simple language that allows users to browse the Web from a command-line interface. With twill, you can navigate through Web sites that use forms, cookies, and…
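The real answer is twill (and today Selenium or Playwright would be the usual choices). As a hedged illustration only, here is a toy stdlib-only stub with the API shape the asker sketches; the function names come from the question, and the implementation is hypothetical (it builds the GET request a real tool would submit, without any actual browsing):

```python
from urllib.parse import urlencode

class Macro:
    """Toy stand-in for an iMacros-style API; real work would use
    twill, mechanize, or Selenium to actually drive the browser."""
    def __init__(self):
        self.url = None
        self.fields = {}

    def browse_to(self, url):
        self.url = url if "://" in url else "http://" + url
        self.fields = {}

    def type_in_input(self, name, value):
        self.fields[name] = value

    def click_button(self, _name):
        # A real macro tool would submit the form and return the
        # response; we just return the URL it would request.
        return self.url + "?" + urlencode(self.fields)

m = Macro()
m.browse_to("www.google.com/search")
m.type_in_input("q", "query")
print(m.click_button("search"))  # -> http://www.google.com/search?q=query
```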

Heavy iTunes Connect scraping

Submitted by 谁都会走 on 2019-12-04 11:05:26
Question: I'm looking at different options for getting the sales reports and other data out of the iTunes Connect website. Since Apple doesn't provide an API, all the solutions I found are based on scraping the page. As I need the information for a product that we offer, I'm not happy to hand all the iTunes accounts over to a third-party service. This is why I want to scrape it myself or use a product that runs on our servers. My questions are: does someone have experience with how frequently Apple changes the…
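One hedged way to answer "how often does the page change" empirically is to fingerprint the page's tag structure on every scrape and alert when the fingerprint changes, on the assumption that a change in the tag skeleton is a good proxy for a scraper-breaking change:

```python
import hashlib
from html.parser import HTMLParser

class TagSkeleton(HTMLParser):
    """Records only the sequence of opening tag names, ignoring text,
    so the fingerprint is stable across content-only changes."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def layout_fingerprint(html):
    p = TagSkeleton()
    p.feed(html)
    return hashlib.sha256(",".join(p.tags).encode()).hexdigest()

v1 = "<table><tr><td>March sales</td></tr></table>"
v2 = "<table><tr><td>April sales</td></tr></table>"   # content changed only
v3 = "<div><span>March sales</span></div>"            # layout changed

print(layout_fingerprint(v1) == layout_fingerprint(v2))  # -> True
print(layout_fingerprint(v1) == layout_fingerprint(v3))  # -> False
```

Logging the fingerprint alongside each scrape gives a running record of exactly when the layout was last altered.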

Trying to get authentication cookie(s) using HttpWebRequest

Submitted by 血红的双手。 on 2019-12-04 09:44:30
Question: I have to scrape a table from a secure site, and I'm having trouble logging in to the page and retrieving the authentication token and any other associated cookies. Am I doing something wrong here?

    public NameValueCollection LoginToDatrose()
    {
        var loginUriBuilder = new UriBuilder();
        loginUriBuilder.Host = DatroseHostName;
        loginUriBuilder.Path = BuildURIPath(DatroseBasePath, LOGIN_PAGE);
        loginUriBuilder.Scheme = "https";
        var boundary = Guid.NewGuid().ToString();
        var postData = new …
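The C# snippet is cut off, but the general pattern is the same in any language: send the login POST through a client that owns a cookie jar, and reuse that same client for every subsequent request so the auth cookie travels along. A Python sketch of the pattern (the URL and form-field names are placeholders, not taken from the question):

```python
import urllib.request
import urllib.parse
import http.cookiejar
from http.cookies import SimpleCookie

def make_session():
    """An opener that stores cookies across requests; use the SAME
    opener for the login POST and for every request after it."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

def login(opener, url, username, password):
    # Field names are placeholders; match them to the real login form.
    body = urllib.parse.urlencode(
        {"user": username, "pass": password}).encode()
    return opener.open(url, data=body)  # Set-Cookie headers land in the jar

def parse_set_cookie(header_value):
    """Pull name/value pairs out of a raw Set-Cookie header."""
    c = SimpleCookie()
    c.load(header_value)
    return {k: v.value for k, v in c.items()}

# e.g. an auth token coming back as a cookie on the login response:
print(parse_set_cookie("authtoken=abc123; path=/; HttpOnly"))
```

A common mistake is building a fresh request object per call, which silently drops the cookies received during login; the cookie jar must outlive the login request.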

Take screenshots **quickly** from Python

Submitted by 瘦欲@ on 2019-12-04 09:34:52
Question: A PIL.ImageGrab.grab() call takes about 0.5 seconds. That's just to get the data from the screen into my app, without any processing on my part. FRAPS, on the other hand, can take screenshots at up to 30 FPS. Is there any way for me to do the same from a Python program? If not, how about from a C program? (I could potentially interface it with the Python program.)

Answer 1: If you want fast screenshots, you must use a lower-level API, like DirectX or GTK. There are Python wrappers for those, like DirectPython and…
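Whichever capture backend you end up trying (PIL's ImageGrab, a DirectX wrapper, or similar), it is worth benchmarking it the same way FRAPS is measured: frames captured per second over a fixed interval. A small backend-agnostic harness, sketched here with a dummy capture function standing in for a real grab call:

```python
import time

def measure_fps(capture, duration=1.0):
    """Call `capture` repeatedly for `duration` seconds and report FPS.
    `capture` is any zero-argument grab function, e.g.
    lambda: ImageGrab.grab() or a lower-level wrapper call."""
    frames = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        capture()
        frames += 1
    return frames / duration

# Demo with a dummy "capture" that sleeps 5 ms per frame:
fps = measure_fps(lambda: time.sleep(0.005), duration=0.2)
print(round(fps))   # capped near 200 by the 5 ms sleep
```

Running the harness once per backend makes the 0.5 s-per-frame bottleneck, and any improvement from a lower-level API, directly comparable.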

Scrapy + Splash + ScrapyJS

Submitted by ﹥>﹥吖頭↗ on 2019-12-04 08:37:30
I am using Splash 2.0.2 + Scrapy 1.0.5 + ScrapyJS 0.1.1 and I'm still not able to render JavaScript triggered by a click. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf I am still getting the page without the phone number rendered:

    class OlxSpider(scrapy.Spider):
        name = "olx"
        rotate_user_agent = True
        allowed_domains = ["olx.pt"]
        start_urls = ["https://olx.pt/imoveis/"]

        def parse(self, response):
            script = """
            function main(splash)
                splash:go(splash.args.url)
                splash:runjs('document.getElementById("contact_methods")…
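The snippet above is truncated, but a complete version of the Lua script would need to click the reveal element and then wait for the AJAX request to finish before snapshotting the HTML. The sketch below is built on assumptions: that clicking `contact_methods` triggers the phone-number request, and that a 3-second wait is enough; both would need tuning against the real page.

```python
# Sketch of a full Lua script for Splash's /execute endpoint.
# The selector and the wait times are guesses, not taken from the page.
lua_source = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(1)
    -- simulate the click that reveals the phone number
    splash:runjs('document.getElementById("contact_methods").click()')
    -- give the AJAX request triggered by the click time to finish
    splash:wait(3)
    return splash:html()
end
"""

# In the spider it would be sent roughly like this (scrapy_splash):
# yield SplashRequest(url, self.parse_item,
#                     endpoint="execute", args={"lua_source": lua_source})
print("splash:html()" in lua_source)   # -> True
```

The key point is returning `splash:html()` only after the post-click wait, so the snapshot includes the AJAX-rendered number.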

I need to scrape data from a Facebook game using Ruby

Submitted by China☆狼群 on 2019-12-04 08:35:05
Revised (clarified question): I've already spent a few days trying to figure out how to scrape specific information from a Facebook game; however, I've run into brick wall after brick wall. As best I can tell, the main problem is as follows: I can use Chrome's Inspect Element tool to manually find the HTML that I need; it appears nestled inside an iframe. However, when I try to scrape that iframe, it is empty (except for its attributes):

    <iframe id="game_frame" name="game_frame" src="" scrolling="no" ...></iframe>

This is the same output I see if I use a browser's "View page source" tool.
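Because the iframe's src is assigned by JavaScript after load, the static HTML shows `src=""`. One hedged workaround (sketched in Python rather than Ruby) is to search the page's inline scripts for the URL that gets assigned to the frame and then request that URL directly; the assignment pattern below is an assumption about what the page's JS looks like:

```python
import re

def find_iframe_src(page_source, frame_id="game_frame"):
    """Look for a JS assignment like:
         document.getElementById("game_frame").src = "https://...";
    in the raw page source. Returns the URL or None."""
    pattern = (r'getElementById\(["\']%s["\']\)\.src\s*=\s*["\']([^"\']+)["\']'
               % re.escape(frame_id))
    m = re.search(pattern, page_source)
    return m.group(1) if m else None

sample = '''
<iframe id="game_frame" src="" scrolling="no"></iframe>
<script>
  document.getElementById("game_frame").src = "https://apps.example.com/game?u=1";
</script>
'''
print(find_iframe_src(sample))   # -> https://apps.example.com/game?u=1
```

If the URL is built dynamically rather than assigned as a literal, this won't work, and a JS-executing browser driver (Watir or Selenium on the Ruby side) is the remaining option.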

How do search engines find relevant content?

Submitted by 最后都变了- on 2019-12-04 07:24:21
Question: How does Google find relevant content when it parses the web? Let's say, for instance, that Google uses PHP's native DOM library to parse content. What methods could it use to find the most relevant content on a web page? My thought is that it would search for all paragraphs, order them by length, and then work out each paragraph's percentage relevance from possible search strings and query params. Let's say we had this URL: http://domain.tld/posts…
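The heuristic the asker describes (score each paragraph against the query terms) can be sketched directly. A real engine layers on TF-IDF, link analysis, and much more, but the core term-overlap score looks like this:

```python
import re

def score_paragraph(paragraph, query):
    """Fraction of the paragraph's words that are query terms:
    a crude term-frequency score, with no IDF weighting."""
    words = re.findall(r"[a-z0-9]+", paragraph.lower())
    if not words:
        return 0.0
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    hits = sum(1 for w in words if w in terms)
    return hits / len(words)

paragraphs = [
    "Our office hours are nine to five on weekdays.",
    "Python web scraping with the DOM library is straightforward; "
    "scraping tools parse the DOM into elements.",
]
# Rank paragraphs by relevance to the query:
best = max(paragraphs, key=lambda p: score_paragraph(p, "DOM scraping"))
print(best[:30])   # the second paragraph wins
```

Normalizing by paragraph length (rather than using raw hit counts) keeps long, rambling paragraphs from outscoring short, on-topic ones, which matches the asker's "percentage of relevance" intuition.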

Screen scraping: regular expressions or XQuery expressions?

Submitted by 天大地大妈咪最大 on 2019-12-04 06:33:57
I was answering some quiz questions for an interview, and one question was about how I would do screen scraping: that is, picking content out of a web page when you don't have a better-structured way to query the information directly (e.g. a web service). My solution was to use an XQuery expression. The expression was fairly long, because the content I needed was pretty deep in the HTML hierarchy; I had to search up through the ancestors a fair way before I found an element with an id attribute. For example, scraping an Amazon.com page for Product Dimensions looks like this: //a[@id=…
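The "anchor on the nearest id, then walk down" idea behind that long expression works in most XPath-lite engines too. A Python sketch with the stdlib's ElementTree; the markup here is invented to stand in for the real product page:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a product page: well-formed XHTML-ish markup.
page = """
<html><body>
  <div id="sidebar"><p>Ads</p></div>
  <table id="productDetails">
    <tr><td>Product Dimensions</td><td>10 x 5 x 2 cm</td></tr>
    <tr><td>Weight</td><td>150 g</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(page)
# Anchor on the id, then navigate down: far shorter and more robust
# than spelling out the whole ancestor chain in one expression.
details = root.find(".//*[@id='productDetails']")
dims = {row[0].text: row[1].text for row in details.findall("tr")}
print(dims["Product Dimensions"])   # -> 10 x 5 x 2 cm
```

Compared with a regular expression, this survives whitespace and attribute-order changes in the markup; compared with the full ancestor-path XQuery, it breaks only if the id itself changes.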

How to handle a DNS lookup failure in Scrapy

Submitted by 淺唱寂寞╮ on 2019-12-04 05:24:25
I am looking to handle DNS errors when scraping domains with Scrapy. Here's the error that I am seeing: ERROR: Error downloading <GET http://domain.com>: DNS lookup failed: address 'domain.com' not found: [Errno 8] nodename nor servname provided, or not known. How can I be notified when I get an error like this, so that I can handle it myself instead of Scrapy just logging the error and moving on?

Answer 1 (Tasawer Nawaz): Use errback along with callback:

    Request(url, callback=your_callback, errback=your_errorback)

and the errback (note that Scrapy passes it a Twisted Failure, not a response):

    def your_errorback(self, failure):
        # your logic here