screen-scraping

Python scraping with Beautiful Soup: output to Excel

前提是你 submitted on 2019-12-13 16:41:20
Question: I want to take a blog (let's use www.forbes.com/sites/zillow as an example) and scrape its content across all pages, producing the output below (in a CSV if possible):
link = http://www.forbes.com/sites/zillow/2012/09/14/underwater-and-under-40-a-list-of-the-top-u-s-metros/
title = Underwater and Under 40: A List of the Top U.S. Metros
inlinks = #list the links in the article
picture = #list either the number of pictures or their links
wordcount = #if this is possible
Views = #in the html of
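
The per-article fields listed in this question (link, title, links in the article, picture count, word count) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard-library `html.parser` rather than Beautiful Soup so that it has no third-party dependency, and the names `ArticleStats`, `article_row`, `rows_to_csv`, the `SAMPLE_HTML` page and the `example.com` URL are all hypothetical. Fetching the real pages and reading the view count are not covered.

```python
import csv
import io
from html.parser import HTMLParser


class ArticleStats(HTMLParser):
    """Collects the title, link targets, image count and word count of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.images = 0
        self.words = 0
        self._in_title = False
        self._skip = 0  # depth inside <script>/<style>, whose text is not counted

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img":
            self.images += 1
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif not self._skip:
            # Naive word count: whitespace-separated tokens of visible text.
            self.words += len(data.split())


FIELDS = ["link", "title", "inlinks", "pictures", "wordcount"]


def article_row(url, html):
    """Build one CSV row (as a dict) from an article URL and its HTML."""
    parser = ArticleStats()
    parser.feed(html)
    parser.close()
    return {
        "link": url,
        "title": parser.title,
        "inlinks": " ".join(parser.links),
        "pictures": parser.images,
        "wordcount": parser.words,
    }


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical stand-in for a downloaded article page.
SAMPLE_HTML = """<html><head><title>Sample post</title></head>
<body><p>Three short words <a href="/other-post">link here</a></p>
<img src="chart.png"></body></html>"""

if __name__ == "__main__":
    print(rows_to_csv([article_row("http://example.com/post", SAMPLE_HTML)]))
```

Writing the result of `rows_to_csv` to a `.csv` file gives something Excel can open directly; one row per article URL would be appended in a loop over the blog's pages.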

get data from a website

断了今生、忘了曾经 submitted on 2019-12-13 09:36:12
Question: How can I scrape (get) the data from a website? Example: I have a site, say www.getfinancialdata.com, and I want to grab its data by running a script/URL from my system against this website, then sort the data and save it in a spreadsheet. I have done this for a simple website where I can see the HTML content in the body of the page (when I view the source code). But my problem is a bit more complex: when I view the source, I see DOM data (no simple HTML content); there are jQuery functions

Reading HTML page using LibreOffice Basic

血红的双手。 submitted on 2019-12-13 08:38:47
Question: I'm new to LibreOffice Basic. I'm trying to write a macro in LibreOffice Calc that will read the name of a noble House of Westeros from a cell (e.g. Stark), and output the Words of that House by looking it up on the relevant page on A Wiki of Ice and Fire. Here is the pseudocode:
Read HouseName from column A
Open HtmlFile at "http://www.awoiaf.westeros.org/index.php/House_" & HouseName
Iterate through HtmlFile to find line which begins "<table class="infobox infobox

Click on element in dropdown with Selenium and Python

做~自己de王妃 submitted on 2019-12-13 08:07:28
Question: With Selenium and the Chrome webdriver on macOS, I need to click an element in a dropdown, but I always get an error saying the element can't be found. This is the HTML on the page where the dropdown is located:
<select id="periodoExtrato" name="periodoExtrato" class="EXTtexto" onchange="enviarExtrato(document.formperiodo.periodoExtrato[document.formperiodo.periodoExtrato.selectedIndex].value);">
<!--<option value="01" >Último dia</option>-->
<option value="03" selected="true">Últimos 3 dias</option>
<option value="05">Últimos 5 dias<

Screen scrape a web page that uses JavaScript and frames

纵然是瞬间 submitted on 2019-12-13 07:56:02
Question: I want to scrape data from www.marktplaats.nl. I want to analyze the scraped description, price, date and views in Excel/Access. I tried to scrape the data with Ruby (nokogiri, scrapi), but nothing worked, although the same approach worked well on other sites. The main problem is that tools such as SelectorGadget and the Firebug add-on (Firefox) don't find any CSS I can use to scrape the page. On other sites I can extract the CSS with SelectorGadget or Firebug and use it with nokogiri or scrapi. Due to lack of experience

C# RegEx on a StreamReader will not return matches

假装没事ソ submitted on 2019-12-13 06:40:41
Question: I'm writing myself a simple screen-scraping application to play around with the HTMLAgilityPack library, and after getting it to work on several different types of HtmlNodes, I figured I'd get fancy and throw in a Regex for email addresses as well. The only problem is that the application never finds any matches, or maybe it does but isn't returning them properly. This happens even on sites known to contain email addresses. Can anyone spot what I'm doing wrong here? string url = String.Format(

extract value from web page

旧时模样 submitted on 2019-12-13 06:29:28
Question: Hi, I am reading in a website's home page using cURL and I need to grab the number of pages that the site has. The information is in a div:
<div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a>
<a href="/users?page=3" title="go to page 3"><span class="page-numbers">3</span></a>
<a href="/users?page=4" title="go to page 4"><span class="page-numbers">4</span></a>
<a href="/users?page=5"
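
One way to pull the page numbers out of a pager like the one quoted in this question is to collect the text of every `span` whose class list contains `page-numbers` and take the largest value. The sketch below is in Python with the standard-library `html.parser`, not the asker's cURL setup. The names `PagerParser`, `last_page` and `PAGER_HTML` are hypothetical, and the sample markup includes only the complete entries from the excerpt (pages 1 to 4) plus an assumed closing `</div>`. Whether the largest visible number equals the total page count depends on the parts of the pager that are cut off in the excerpt.

```python
from html.parser import HTMLParser


class PagerParser(HTMLParser):
    """Collects the integer text of every <span> whose classes include page-numbers."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._in_page_span = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "page-numbers" in classes:
            self._in_page_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_page_span = False

    def handle_data(self, data):
        text = data.strip()
        # Ignore non-numeric labels such as "next" or "prev".
        if self._in_page_span and text.isdigit():
            self.pages.append(int(text))


def last_page(html):
    """Return the highest page number found in the pager, or None if there is none."""
    parser = PagerParser()
    parser.feed(html)
    parser.close()
    return max(parser.pages, default=None)


# Only the complete entries from the excerpt; the closing </div> is assumed.
PAGER_HTML = """<div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a>
<a href="/users?page=3" title="go to page 3"><span class="page-numbers">3</span></a>
<a href="/users?page=4" title="go to page 4"><span class="page-numbers">4</span></a>
</div>"""

if __name__ == "__main__":
    print(last_page(PAGER_HTML))
```

Using an HTML parser instead of a regular expression keeps the match tied to the `page-numbers` class, so unrelated numbers elsewhere on the page are not picked up.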

simple_html_dom.php

戏子无情 submitted on 2019-12-13 06:26:13
Question: I am using "simple_html_dom.php" to scrape data from the Wikipedia site. If I run the code on scraperwiki.com, it throws an error with exit status 139, and if I run the same code on my XAMPP server, the server hangs. I have a set of links, and I'm trying to get the Literacy value from all of them. If I run the code with one link, there is no problem and it returns the expected result. If I try to get the data from all the sites in one go, I face the problem above. The code is: <?php $test

Chrome Extension: How to pass a variable from Content Script to background.html

北城余情 submitted on 2019-12-13 04:12:16
Question: I can't figure out how to pass a variable (or an array of variables) from a content script to a background page. What I'm trying to do is find certain DOM elements with my content script, then send them to my background page so that I can make a cross-domain XMLHttpRequest with them (store them in a database on a different site). My code is below. I know that the variable named "serialize" is not being passed (and I don't expect it to be, based on my current code, but I have it in there so it's

PhantomJS querySelectorAll().textcontent returns nothing

瘦欲@ submitted on 2019-12-13 04:07:10
Question: I created a simple web scraper that uses PhantomJS to grab data from a website. It doesn't work when I use querySelectorAll to get the content I want. Here is my whole code:
var page = require('webpage').create();
var url = 'https://www.google.com.kh/?gws_rd=cr,ssl&ei=iE7jV87UKsrF0gSDw4zAAg';
page.open(url, function(status){
  if(status === 'success'){
    var title = page.evaluate(function(){
      return document.querySelectorAll('.logo-subtext')[0].textContent;
    });
    console.log(title);
  }