screen-scraping

How to reuse a selenium driver instance during parallel processing?

Submitted by 你说的曾经没有我的故事 on 2019-12-11 14:43:08
Question: To scrape a pool of URLs, I am running Selenium in parallel with joblib. In this context, I am facing two challenges. Challenge 1 is to speed up this process: at the moment, my code opens and closes a driver instance for every URL (ideally it would be one per process). Challenge 2 is to get rid of the CPU-intensive while loop that I think I need in order to continue on empty results (I know that this is most likely wrong). Pseudocode: URL_list = [URL1, URL2, URL3, ..., URL100000] # List of URLs to be …
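A common fix for Challenge 1 is to create the driver lazily, once per worker process, and reuse it for every URL that worker handles. With joblib's default loky backend each worker is a separate process, so a module-level cache gives exactly one driver per process. A minimal sketch of that caching logic (function names and the factory are illustrative, not from the question):

```python
# Per-process driver cache: each worker process creates its driver once
# and reuses it for every URL it handles (sketch; names are illustrative).
_DRIVER = None

def get_driver(factory):
    """Return this process's driver, creating it on first use."""
    global _DRIVER
    if _DRIVER is None:
        _DRIVER = factory()
    return _DRIVER

# In a real worker you would pass a Selenium factory, e.g.:
#   from selenium import webdriver
#   driver = get_driver(lambda: webdriver.Chrome())
#   driver.get(url)
# and fan out with joblib:
#   Parallel(n_jobs=4, backend="loky")(delayed(scrape_one)(u) for u in URL_list)
```

For Challenge 2, Selenium's WebDriverWait with an expected condition is the usual replacement for a busy while loop, since it polls with a sleep interval instead of spinning the CPU.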

Regex HTML Extraction C#

Submitted by ﹥>﹥吖頭↗ on 2019-12-11 14:34:31
Question: I have searched and searched about regex, but I can't seem to find something that will let me do this. I need to get the 12.32, 2,300, 4.644 M and 12,444.12 from the following strings in C#: <td class="c-ob-j1a" property="c-value">12.32</td> <td class="c-ob-j1a" property="c-value">2,300</td> <td class="c-ob-j1a" property="c-value">4.644 M</td> <td class="c-ob-j1a" property="c-value">12,444.12 M</td> I got as far as this: MatchCollection valueCollection = Regex.Matches(html, @"<td class=""c-ob …
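Rather than matching the number formats themselves, it is simpler to capture whatever sits between the known cell tags. The pattern is shown in Python below; the same expression should carry over to C#'s Regex.Matches (with the inner quotes doubled inside a verbatim @"..." string). For anything beyond a fixed snippet like this, an HTML parser is the safer tool than regex.

```python
import re

# Capture the cell text between the known <td ...> and </td> markers.
html = (
    '<td class="c-ob-j1a" property="c-value">12.32</td>'
    '<td class="c-ob-j1a" property="c-value">2,300</td>'
    '<td class="c-ob-j1a" property="c-value">4.644 M</td>'
    '<td class="c-ob-j1a" property="c-value">12,444.12 M</td>'
)
values = re.findall(
    r'<td class="c-ob-j1a" property="c-value">([^<]+)</td>', html)
print(values)  # ['12.32', '2,300', '4.644 M', '12,444.12 M']
```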

WWW::Mechanize Extraction Help - PERL

Submitted by 青春壹個敷衍的年華 on 2019-12-11 13:59:14
Question: I'm trying to automate the extraction of a transcript found on a website. The entire transcript sits between dl tags, since the site formatted the interview as a description list. The script I have below lets me search the site and extract the text in plain-text format, but I actually want it to include everything between the dl tags, meaning the dd's, dt's, etc. This will allow us to develop our own CSS for the interview. One thing to note about the page is that there are break …
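The key change is to serialize the matched dl element itself instead of extracting its text: in Perl that is typically WWW::Mechanize for fetching plus HTML::TreeBuilder, whose as_HTML method keeps the markup. The idea, sketched in Python with the stdlib purely for illustration (the sample snippet is invented):

```python
import xml.etree.ElementTree as ET

# Sketch: keep the markup (dt/dd tags) by serializing the <dl> element
# itself rather than extracting plain text from it.
page = "<html><body><dl><dt>Q: Where?</dt><dd>A: Here.</dd></dl></body></html>"
root = ET.fromstring(page)
dl = root.find(".//dl")
transcript_html = ET.tostring(dl, encoding="unicode")
print(transcript_html)  # the <dl> with its <dt>/<dd> children intact
```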

Failed to screen scrape an ASP.NET website while posting data

Submitted by  ̄綄美尐妖づ on 2019-12-11 12:43:25
Question: I am getting an "Invalid postback or callback argument" error while trying to screen scrape a website built on ASP.NET. The first request, for the landing page, has no issue; the exception is raised when I post form data after changing one of the drop-down field values. """ Invalid postback or callback argument. Event validation is enabled using <pages enableEventValidation="true"/> in configuration or <%@ Page EnableEventValidation="true" %> in a page. For security purposes, this feature verifies that …
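ASP.NET event validation checks that the posted __VIEWSTATE and __EVENTVALIDATION values came from the page the server actually served, so every POST must echo back the hidden fields from the most recent response (and, for a drop-down-triggered postback, __EVENTTARGET is normally set to the control's name). A stdlib sketch of the field-collection step (the regex and the sample form values are illustrative; a real scraper should use an HTML parser):

```python
import re

def hidden_fields(page_html):
    """Collect ASP.NET hidden form fields (__VIEWSTATE etc.) to echo back
    in the next POST. Assumes name comes before value in each input tag."""
    pattern = r'<input[^>]*name="(__[A-Z]+)"[^>]*value="([^"]*)"'
    return dict(re.findall(pattern, page_html))

# Sample landing page, values shortened for illustration:
landing = (
    '<input type="hidden" name="__VIEWSTATE" value="dDwtNTI4" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="/wEWAgL" />'
)
form_data = hidden_fields(landing)
form_data["ddlCategory"] = "electronics"  # hypothetical drop-down field
print(sorted(form_data))
```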

Fetch text from the web with AngularJS tags such as ng-view

Submitted by 旧街凉风 on 2019-12-11 11:03:27
Question: I'm trying to fetch all the visible text from a website, and I'm using python-scrapy for this work. However, I observe that Scrapy only works with HTML tags such as div, body, head, etc., and not with AngularJS tags such as ng-view: if there is an element within ng-view tags, then when I right-click on the page and view source, the content inside the tag doesn't appear and it displays as <ng-view> </ng-view>. So how can I use Python to scrape the elements within these ng-view tags? Thanks …
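Scrapy only sees the initial server response, and ng-view is filled in later by AngularJS running in a browser, so that content never reaches Scrapy at all. The usual fix is to render the page first (for example with Selenium, taking driver.page_source, or with scrapy-splash) and then parse the rendered HTML; once rendered, a custom tag like ng-view is parsed like any other element. A stdlib sketch of the extraction step, with the rendered source stubbed in as a string:

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text content, skipping script/style; custom tags such as
    <ng-view> are treated like any other element."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

# In real use, `rendered` would come from e.g. Selenium's driver.page_source:
rendered = "<body><ng-view><div>Visible text</div></ng-view><script>x=1</script></body>"
p = VisibleText()
p.feed(rendered)
print(p.parts)  # ['Visible text']
```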

Parsing: Can I pick up the URL of embedded CSS Background in Nokogiri?

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-11 10:38:45
Question: The HTML I am parsing contains images with inline CSS in a table. Can I use Nokogiri to extract the URL component? Here is a snippet of the code I'd like to parse. tl;dr: I'd like to get the .png in this HTML snippet using Nokogiri: <table border="0" cellspacing="0" cellpadding="0" width="300" height="300" background="http://s3.amazonaws.com/static.example.com/sale/homepage/3166-300x300-1328107072.png" style="background-image:url('http://s3.amazonaws.com/static.example.com/sale/homepage/3166 …
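Nokogiri can read the background attribute directly via hash access on the node (roughly doc.at_css('table')['background']); only the inline-style variant needs a small url(...) parse on the style attribute's value. That parsing step, sketched in Python purely for illustration:

```python
import re

# The style attribute's value from the snippet above:
style = ("background-image:url('http://s3.amazonaws.com/static.example.com/"
         "sale/homepage/3166-300x300-1328107072.png')")

# Pull the URL out of CSS url('...') notation.
match = re.search(r"url\('([^']+)'\)", style)
print(match.group(1))
```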

CasperJS not returning Google Search link titles, but screenshot & source code test works

Submitted by 你说的曾经没有我的故事 on 2019-12-11 10:27:42
Question: I'd appreciate it if someone could help me with this problem. Please see my image to understand further: https://onedrive.live.com/redir?resid=F95DD828CA2E63D7!1326&authkey=!AEbavlKl38fBJYI&v=3&ithint=photo%2cjpg I have a screenshot of the actual CasperJS HTML capture. It shows that CasperJS correctly entered the field in Google. The problem is that CasperJS is not calling my function getLinks: links = this.evaluate(getLinks); returns null. I have tested the actual selector …
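A likely cause is a timing issue: evaluate runs before Google has rendered the result links, so the selector matches nothing and the function returns null. In CasperJS the standard fix is to call casper.waitForSelector(...) on the results selector and run the evaluate inside its callback. The underlying pattern is a polling wait, sketched generically here in Python (the stand-in names are illustrative):

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll predicate until it returns truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# e.g. wait until the page's result links exist before extracting them:
state = {"rendered_links": []}
def fake_render():  # stand-in for the page finishing its JavaScript
    state["rendered_links"] = ["result 1", "result 2"]
fake_render()
print(wait_until(lambda: state["rendered_links"]))  # True
```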

PhantomJS executing JavaScript in a popup for data extraction

Submitted by 大憨熊 on 2019-12-11 08:55:21
Question: So I have a web page with some photos of people. When you click on a photo of a person, JavaScript is executed and produces a popup with some more detailed information, such as a description. The link for each photo is as follows: <a href="javascript:void(0)" data="10019" class="seeMore"></a> First I want to start with the basics of just extracting the description etc. from one single person. So I want to execute the JavaScript above to produce the popup window, and when I'm on the …
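Since each anchor carries the record id in a data attribute, one approach is to collect those ids first, then either trigger the click handler via page.evaluate in PhantomJS or call whatever endpoint populates the popup directly with each id. The id-collection step, sketched with the Python stdlib (the one-anchor sample is from the question; the class name seeMore is taken from it too):

```python
from html.parser import HTMLParser

class SeeMoreIds(HTMLParser):
    """Collect the data attribute of <a class="seeMore"> links."""
    def __init__(self):
        super().__init__()
        self.ids = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and "seeMore" in (a.get("class") or ""):
            self.ids.append(a.get("data"))

page = '<a href="javascript:void(0)" data="10019" class="seeMore"></a>'
parser = SeeMoreIds()
parser.feed(page)
print(parser.ids)  # ['10019']
```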

State of HTML after onload JavaScript

Submitted by 回眸只為那壹抹淺笑 on 2019-12-11 08:32:34
Question: Many webpages use onload JavaScript to manipulate their DOM. Is there a way I can automate accessing the state of the HTML after these JavaScript operations? A tool like wget is not useful here because it just downloads the original source. Is there perhaps a way to use a web browser rendering engine? Ideally I am after a solution that I can interface with from Python. Thanks! Answer 1: The only good way I know to do such things is to automate a browser, for example via Selenium RC. If you have no …

Convert lxml to Scrapy xxs selectors

Submitted by 对着背影说爱祢 on 2019-12-11 08:14:35
Question: How can I convert this pure-Python lxml code to Scrapy's built-in xxs selectors? This version works, but I want to convert it to the Scrapy xxs selectors.

def parse_device_list(self, response):
    self.log("\n\n\n List of devices \n\n\n")
    self.log('Hi, this is the parse_device_list page! %s' % response.url)
    root = lxml.etree.fromstring(response.body)
    for row in root.xpath('//row'):
        allcells = row.xpath('./cell')
        # first cell contains the link to follow
        detail_page_link = allcells[0].get("href")
        yield …
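The conversion is mostly mechanical: lxml.etree.fromstring(response.body) followed by root.xpath(...) becomes Scrapy's selector API (roughly hxs.select('//row') with the old XmlXPathSelector, or response.xpath('//row') in later Scrapy versions), and the attribute access becomes an @href step with .extract(). The XPath expressions themselves are unchanged, which can be checked against a sample document with the stdlib:

```python
import xml.etree.ElementTree as ET

# The same XPath steps Scrapy would run, demonstrated on a sample document:
#   for row in response.xpath('//row'):
#       link = row.xpath('./cell/@href').extract()[0]
doc = "<rows><row><cell href='/device/1'>A</cell><cell>B</cell></row></rows>"
root = ET.fromstring(doc)
links = [row.findall('./cell')[0].get('href') for row in root.findall('.//row')]
print(links)  # ['/device/1']
```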