screen-scraping

Using Python and Mechanize to submit form data and authenticate

徘徊边缘 submitted on 2019-12-17 07:20:13
Question: I want to log in to the website Reddit.com, navigate to a particular area of the page, and submit a comment. I can't see what is wrong with this code, but it is not working: no change is reflected on the Reddit site.

import mechanize
import cookielib

def main():
    # Browser
    br = mechanize.Browser()

    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True
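For reference, a minimal sketch of how a mechanize login and submission usually looks once the browser options are set. The form name "login_login-main" and the field names "user" and "passwd" are assumptions about Reddit's old HTML login form, not details taken from the question:

import mechanize
import cookielib

def login(username, password):
    br = mechanize.Browser()
    br.set_cookiejar(cookielib.LWPCookieJar())
    br.set_handle_robots(False)              # Reddit's robots.txt would otherwise stop mechanize
    br.addheaders = [('User-agent', 'Mozilla/5.0')]

    br.open('https://www.reddit.com/')
    br.select_form(name='login_login-main')  # assumed form name
    br['user'] = username                    # assumed field names
    br['passwd'] = password
    response = br.submit()                   # login cookies stay in the cookie jar
    return br, response

If the submission succeeds, the same br object can then open the comment page and submit the comment form with its cookies intact.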

Perform screen-scrape of WebBrowser control in thread

早过忘川 submitted on 2019-12-17 06:49:33
Question: I am using the technique shown in "WebBrowser Control in a new thread". Trying to get a screen-scrape of a webpage, I have been able to get the following code to work successfully when the WebBrowser control is placed on a WinForm. However, when run inside a thread it fails, producing an arbitrary image of the desktop instead.

Thread browserThread = new Thread(() =>
{
    WebBrowser br = new WebBrowser();
    br.DocumentCompleted += webBrowser1_DocumentCompleted;
    br.ProgressChanged += webBrowser1

How to scroll down with PhantomJS to load dynamic content

老子叫甜甜 submitted on 2019-12-17 03:53:28
Question: I am trying to scrape links from a page that generates content dynamically as the user scrolls down to the bottom (infinite scrolling). I have tried various things with PhantomJS but have not been able to gather links beyond the first page. Let's say the element at the bottom that loads more content has the class .has-more-items. It is available while content is still loading during scrolling and then becomes unavailable in the DOM (display:none). Here are the things I have tried: setting viewportSize to a large
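The question is specifically about PhantomJS, but the same scroll-until-loaded loop can be sketched with Selenium's Python bindings driving PhantomJS (a swapped-in approach; the URL is a placeholder, and the API names below are from older Selenium releases where webdriver.PhantomJS still exists):

import time
from selenium import webdriver

driver = webdriver.PhantomJS()              # assumes the phantomjs binary is on PATH
driver.get('http://example.com/feed')       # placeholder URL, not from the question

# Keep scrolling while the ".has-more-items" marker is still present and visible
while True:
    markers = driver.find_elements_by_css_selector('.has-more-items')
    if not markers or not markers[0].is_displayed():
        break
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)                           # crude wait for the next batch to render

links = [a.get_attribute('href') for a in driver.find_elements_by_tag_name('a')]
driver.quit()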

Scrape web page contents

我的梦境 submitted on 2019-12-17 02:35:11
Question: I am developing a project for which I want to scrape the contents of a website in the background and pull some limited content from that scraped site. For example, my page has "userid" and "password" fields; using those I will access my mail, scrape my inbox contents, and display them on my page. I did the above using JavaScript alone. But when I click the sign-in button, the URL of my page (http://localhost/web/Login.html) is changed to the URL (http://mail.in.com/mails/inbox
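Submitting credentials to another site straight from a browser page runs into exactly this kind of redirect (and same-origin) problem, which is why this sort of scraping is usually done server-side instead. A minimal sketch with Python's requests and BeautifulSoup, where the login URL, field names, and the ".subject" selector are all placeholders rather than real mail.in.com details:

import requests
from bs4 import BeautifulSoup

session = requests.Session()                  # keeps the login cookies between requests

login_url = 'http://mail.example.com/login'   # placeholder endpoint
payload = {'userid': 'me@example.com', 'password': 'secret'}
session.post(login_url, data=payload)

inbox = session.get('http://mail.example.com/mails/inbox')
soup = BeautifulSoup(inbox.text, 'html.parser')

# Pull only the limited content needed, e.g. message subjects (placeholder selector)
subjects = [el.get_text(strip=True) for el in soup.select('.subject')]

Because these requests happen on the server, the user's own page at http://localhost/web/Login.html never navigates away.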

Scraping works well until I get this error: 'ascii' codec can't encode character u'\u2122' in position

情到浓时终转凉″ submitted on 2019-12-14 03:53:26
Question: I only have a few weeks of Python training, so I suspect there is a simple solution to this problem, but it is quite frustrating, and after working on it for several hours I am asking for help. The website I'm trying to scrape is well organized (see https://twam2dcppennla6s.onion.to/), and the code I've written scrapes about half of the 26 pages until I receive this error message:

Traceback (most recent call last):
  File "SR2works4real2.py", line 18, in <module>
    csvWriter
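The truncated traceback points at what looks like the classic Python 2 csv problem: the csv module writes byte strings, so any unicode value containing a character such as u'\u2122' (the ™ sign) has to be encoded before it is written. A minimal sketch of that fix, with made-up row data:

# -*- coding: utf-8 -*-
import csv

rows = [[u'Example Item\u2122', u'9.99']]     # made-up data containing the ™ character

with open('out.csv', 'wb') as f:              # Python 2: csv expects a binary file
    csvWriter = csv.writer(f)
    for row in rows:
        # encode each unicode cell to UTF-8 bytes before handing it to csv
        csvWriter.writerow([cell.encode('utf-8') for cell in row])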

Web scraping using Excel and VBA

微笑、不失礼 submitted on 2019-12-14 03:34:05
Question: I wrote my VBA code in an Excel sheet as below, but it does not scrape the data for me and I don't know why; please can anyone help me. It only gave me the result "click here to read more", and I want to scrape the entire data, such as first name, last name, state, zip code, and so on.

Sub extractTablesData()
    Dim IE As Object, obj As Object
    Dim myState As String
    Dim r As Integer, c As Integer, t As Integer
    Dim elemCollection As Object

    Set IE = CreateObject("InternetExplorer.Application")
    myState = InputBox("Enter the

DOM Parser Foreach

半腔热情 submitted on 2019-12-14 02:42:41
Question: Does anyone know why this wouldn't work?

foreach ($html->find('tbody.result') as $article) {
    // get retail
    $item['Retail'] = trim($article->find('span.price', 0)->plaintext);
    // get soldby
    $item['SoldBy'] = trim($article->find('img', 0)->getAttribute('alt'));
    $articles[] = $item;
}
print_r($articles);

Answer 1: Try this:

$html = file_get_html('http://www.amazon.com/gp/offer-listing/B002UYSHMM');
$articles = array();
foreach ($html->find('table tbody.result tr') as $article) {
    if ($article->find('span

WebRequest NameResolutionFailure

送分小仙女□ submitted on 2019-12-14 02:16:37
Question: I'm attempting to write a small screen-scraping tool for statistics aggregation in C#. I have attempted to use this code (posted many times here, but again for detail):

public static string GetPage(string url)
{
    HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
    request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
    WebResponse response = (HttpWebResponse) request.GetResponse();
    Stream stream = response.GetResponseStream();
    StreamReader reader = new

XPath and wildcards

一曲冷凌霜 submitted on 2019-12-14 02:05:26
Question: I have tried several combinations without success. The full XPath to that data is

.//*[@id='detail_row_seek_37878']/td

The problem is that the number portion '37878' changes for each node, and thus I can't use a foreach to loop through the nodes. Is there some way to use a wildcard and reduce the XPath to .//*[@id='detail followed by a wildcard, in an effort to bypass the literal value portion? I am using Html Agility Pack for this.

HtmlNode ddate = node.SelectSingleNode(".//*[@id='detail_row_seek_37878']/td");
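XPath 1.0, which Html Agility Pack implements, can match on a prefix of an attribute value with starts-with(), which avoids hard-coding the changing number. The snippet below only demonstrates the expression itself, using Python's lxml and a made-up HTML fragment rather than the asker's C# code:

from lxml import html

# made-up fragment mimicking the structure described in the question
doc = html.fromstring("""
<table>
  <tr id="detail_row_seek_37878"><td>2019-12-14</td></tr>
  <tr id="detail_row_seek_40112"><td>2019-12-15</td></tr>
</table>
""")

# starts-with() acts as the wildcard over the changing numeric suffix
cells = doc.xpath(".//*[starts-with(@id, 'detail_row_seek_')]/td")
print([cell.text for cell in cells])    # ['2019-12-14', '2019-12-15']

The same expression can be passed to SelectNodes in Html Agility Pack to loop over every matching row.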

Set session to scrape page

余生颓废 submitted on 2019-12-13 18:51:19
Question:

URL1: https://duapp3.drexel.edu/webtms_du/
URL2: https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX
URL3: https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX

As a personal programming project, I want to scrape my university's course catalog and provide it as a RESTful API. However, I'm running into the following issue. The page that I need to scrape is URL3, but URL3 only returns meaningful information after I visit URL2 (it sets the term there
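The question does not name a language, but the usual fix is to reuse one HTTP session so that whatever URL2 sets (assuming the term is tracked via a session cookie) is also sent with the request for URL3. A minimal sketch with Python's requests, using the URLs from the question; the final print is only illustrative:

import requests

session = requests.Session()    # one session so cookies persist across requests

# Visiting URL2 first lets the server tie the chosen term to our session
session.get('https://duapp3.drexel.edu/webtms_du/Colleges.asp',
            params={'Term': '201125', 'univ': 'DREX'})

# URL3 should now return the real course listing instead of an empty shell
resp = session.get('https://duapp3.drexel.edu/webtms_du/Courses.asp',
                   params={'SubjCode': 'CS', 'CollCode': 'E', 'univ': 'DREX'})

print(resp.status_code, len(resp.text))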