screen-scraping | 易学教程

Python scraping with Beautiful Soup: output to Excel

Question: I want to take a blog (let's use www.forbes.com/sites/zillow as an example) and scrape it for content across all pages, with the output below (in a CSV if possible): link = http://www.forbes.com/sites/zillow/2012/09/14/underwater-and-under-40-a-list-of-the-top-u-s-metros/ title = Underwater and Under 40: A List of the Top U.S. Metros inlinks = #list the links in the article picture = #list either the number of pictures or their links wordcount = #if this is possible Views = #in the html of
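A minimal sketch of the kind of per-article row the question asks for, using only the Python standard library rather than Beautiful Soup. The class name `ArticleStats`, the helpers `article_row` and `rows_to_csv`, and the sample HTML are all hypothetical; the "Views" field is left out because the excerpt does not say where it lives in the page, and fetching the pages themselves is not shown.

```python
import csv
import io
from html.parser import HTMLParser

FIELDS = ["link", "title", "inlinks", "pictures", "wordcount"]


class ArticleStats(HTMLParser):
    """Collect the title, link targets, image sources and a rough word count."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.images = []
        self.words = 0
        self._in_title = False
        self._skip = 0  # depth inside <script>/<style>, whose text is not counted

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif not self._skip:
            self.words += len(data.split())


def article_row(url, html):
    """Build one output row (a dict keyed by FIELDS) from a page's HTML."""
    stats = ArticleStats()
    stats.feed(html)
    return {
        "link": url,
        "title": stats.title,
        "inlinks": " ".join(stats.links),
        "pictures": len(stats.images),
        "wordcount": stats.words,
    }


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical page content, standing in for a downloaded article.
sample_html = (
    "<html><head><title>Example post</title></head><body>"
    '<p>Three short words <a href="http://example.com/a">here</a></p>'
    '<img src="pic.png"></body></html>'
)
row = article_row("http://example.com/post", sample_html)
print(rows_to_csv([row]))
```

The word count here covers all visible text on the page, not just the article body; narrowing it to the article would need a selector specific to the site's markup.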

get data from a website

Question: How can I scrape (get) the data from a website? Example: I have a site, say www.getfinancialdata.com. I want to grab the data by running a script/URL from my system against this website, then sort the data and save it in a spreadsheet. I have done this for a simple website where I can view the HTML content in the body of the web page (after I view the source code). But my problem is a bit more complex: when I view the source, I see it is DOM data (no simple HTML content); there are jQuery functions

Reading HTML page using Libreoffice Basic

Question: I'm new to LibreOffice Basic. I'm trying to write a macro in LibreOffice Calc that will read the name of a noble House of Westeros from a cell (e.g. Stark), and output the Words of that House by looking it up on the relevant page on A Wiki of Ice and Fire. Here is the pseudocode: Read HouseName from column A Open HtmlFile at "http://www.awoiaf.westeros.org/index.php/House_" & HouseName Iterate through HtmlFile to find line which begins "<table class="infobox infobox
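The first steps of that pseudocode can be sketched as follows, in Python rather than LibreOffice Basic, and only as an illustration. The function names `house_url` and `find_infobox_line` and the sample lines are hypothetical. The prefix matched is just `<table class="infobox`, since the full class string is cut off in the excerpt, and the lines are assumed to have been downloaded already.

```python
BASE_URL = "http://www.awoiaf.westeros.org/index.php/House_"
INFOBOX_PREFIX = '<table class="infobox'  # the full class value is truncated in the question


def house_url(house_name):
    """Build the wiki URL for a house name read from a cell."""
    return BASE_URL + house_name.strip()


def find_infobox_line(lines):
    """Return the index of the first line that begins with the infobox table tag, or None."""
    for index, line in enumerate(lines):
        if line.lstrip().startswith(INFOBOX_PREFIX):
            return index
    return None


# Hypothetical page lines, standing in for the downloaded HTML file.
sample_lines = ["<html>", "<body>", '<table class="infobox infobox-example">', "</table>"]
print(house_url("Stark"))
print(find_infobox_line(sample_lines))
```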

Click on element in dropdown with Selenium and Python

Question: With Selenium and the Chrome webdriver on macOS, I need to click a dropdown element, but I always get an error that it can't be found. The page has this HTML where the element is located: <select id="periodoExtrato" name="periodoExtrato" class="EXTtexto" onchange="enviarExtrato(document.formperiodo.periodoExtrato[document.formperiodo.periodoExtrato.selectedIndex].value);"> <option value="03" selected="true">Últimos 3 dias</option> <option value="05">Últimos 5 dias<

Screen scrape a web page that uses javaScript and frames

Question: I want to scrape data from www.marktplaats.nl. I want to analyze the scraped description, price, date and views in Excel/Access. I tried to scrape the data with Ruby (nokogiri, scrapi), but nothing worked (on other sites it worked well). The main problem is that, for example, SelectorGadget and the Firebug add-on (Firefox) don't find any CSS I can use to scrape the page. On other sites I can extract the CSS with SelectorGadget or Firebug and use it with nokogiri or scrapi. Due to lack of experience

C# RegEx on a StreamReader will not return matches

Question: I'm writing myself a simple screen scraping application to play around with the HTMLAgilityPack library, and after getting it to work on several different types of HtmlNodes, I figured I'd get fancy and throw in a Regex for email addresses as well. The only problem is that the application never finds any matches, or maybe it does but is not returning them properly. This happens even on sites known to contain email addresses. Can anyone spot what I'm doing wrong here? string url = String.Format(

extract value from web page

Question: Hi, I have a website's home page that I am reading in using cURL, and I need to grab the number of pages that the site has. The information is in a div: <div class="pager"> <span class="page-numbers current">1</span> <a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a> <a href="/users?page=3" title="go to page 3"><span class="page-numbers">3</span></a> <a href="/users?page=4" title="go to page 4"><span class="page-numbers">4</span></a> <a href="/users?page=5"
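One possible way to pull the page count out of markup like this, sketched in Python with the standard library's `html.parser` rather than the asker's own stack. The names `PagerParser` and `last_page` are hypothetical, and the sample string reuses only the part of the pager visible in the excerpt, closed off by hand after page 4. It assumes the highest number shown in a `page-numbers` span is the page count.

```python
from html.parser import HTMLParser


class PagerParser(HTMLParser):
    """Collect the numeric text of every <span> whose class list includes "page-numbers"."""

    def __init__(self):
        super().__init__()
        self._in_page_span = False
        self.pages = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "page-numbers" in classes:
            self._in_page_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_page_span = False

    def handle_data(self, data):
        text = data.strip()
        # Non-numeric labels inside pager spans are ignored.
        if self._in_page_span and text.isdigit():
            self.pages.append(int(text))


def last_page(html):
    """Return the largest page number found in the pager, or None if there is none."""
    parser = PagerParser()
    parser.feed(html)
    return max(parser.pages) if parser.pages else None


# Only the portion of the pager shown in the question; the closing </div> is added here.
sample = (
    '<div class="pager"> <span class="page-numbers current">1</span> '
    '<a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a> '
    '<a href="/users?page=3" title="go to page 3"><span class="page-numbers">3</span></a> '
    '<a href="/users?page=4" title="go to page 4"><span class="page-numbers">4</span></a></div>'
)
print(last_page(sample))  # prints 4 for this shortened sample
```

Parsing the markup instead of pattern-matching the raw string keeps the lookup tied to the `page-numbers` class, so extra attributes or whitespace in the tags should not matter.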

simple_html_dom.php

Question: I am using "simple_html_dom.php" to scrape data from the Wikipedia site. If I run the code on scraperwiki.com, it throws an error with exit status 139, and if I run the same code on my XAMPP server, the server hangs. I have a set of links, and I'm trying to get the Literacy value from all of the sites. If I run the code with one link, there is no problem and it returns the expected result. If I try to get data from all the sites in one go, I face the above problem. The code is: <?php $test

Chrome Extension: How to pass a variable from Content Script to background.html

Question: I can't figure out how to pass a variable (or an array of variables) from a content script to a background page. What I'm trying to do is find certain DOM elements with my content script, then send them to my background page so that I can make a cross-domain XMLHttpRequest with them (store them in a database on a different site). My code is below. I know that the variable named "serialize" is not being passed (and I don't expect it to be, based on my current code, but I have it in there so it's

PhantomJS querySelectorAll().textcontent returns nothing

Question: I created a simple web scraper to grab data from a website using PhantomJS. It doesn't work when I use querySelectorAll to get the content I want. Here is my whole code: var page = require('webpage').create(); var url = 'https://www.google.com.kh/?gws_rd=cr,ssl&ei=iE7jV87UKsrF0gSDw4zAAg'; page.open(url, function(status){ if(status === 'success'){ var title = page.evaluate(function(){ return document.querySelectorAll('.logo-subtext')[0].textContent; }); console.log(title); }