screen-scraping

Using Java to pull data from web [closed]

[亡魂溺海] submitted on 2019-12-25 19:43:01

Question (closed: this question needs to be more focused and is not currently accepting answers; closed 5 years ago). I was wondering if there is a way to pull specific data from a website using Java (Eclipse). For example, stock information from Yahoo Finance or from Bloomberg. I've looked around and have found some resources, but I haven't been able to get them to work; perhaps I'm missing

Skip the error while scraping a list of URLs from a CSV

隐身守侯 submitted on 2019-12-25 19:08:23

Question: I managed to scrape a list of URLs from a CSV file, but I have a problem: the scraping stops when it hits a broken link. It also prints a lot of None lines; is it possible to get rid of them? I would appreciate some help here. Thank you in advance! Here is the code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup  # required to parse html
import requests  # required to make request

# read file
with open('urls.csv', 'r') as f:
    csv_raw_cont = f.read()

# split by line
split_csv
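A minimal sketch of one way to survive broken links and drop the None results. It assumes the CSV simply holds one URL per line (the real file layout is cut off above), and the third-party imports (requests, beautifulsoup4) are kept inside the helper so the filtering logic stands on its own:

```python
import csv

def fetch_title(url, timeout=10):
    """Return the page <title>, or None if the request fails or times out."""
    import requests  # third-party: pip install requests
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # treat 404s and other HTTP errors as failures
    except requests.RequestException:
        return None  # broken link: skip it instead of crashing
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None

def drop_none(results):
    """Filter out the None entries produced by broken links."""
    return [r for r in results if r is not None]

def scrape_csv(path="urls.csv"):
    """Read one URL per line from the CSV and print only the reachable titles."""
    with open(path, newline="") as f:
        urls = [row[0] for row in csv.reader(f) if row]
    for title in drop_none(fetch_title(u) for u in urls):
        print(title)
```

Wrapping each request in try/except is what keeps the loop going; the separate drop_none pass is what removes the printed None lines the asker complains about.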

Using PHP's DOMDocument::preserveWhiteSpace = false and still getting whitespace

本小妞迷上赌 submitted on 2019-12-25 18:21:36

Question: I'm scraping this page: http://kat.ph/search/example/?field=seeders&sorder=desc in this way:

...
curl_setopt( $curl, CURLOPT_URL, $url );
$header = array (
    'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding:gzip,deflate,sdch',
    'Accept-Language:en-US,en;q=0.8',
    'Cache-Control:max-age=0',
    'Connection:keep-alive',
    'Host:kat.ph',
    'User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19

Screen scraping through nokogiri or hpricot

可紊 submitted on 2019-12-25 18:11:10

Question: I'm trying to get the actual value of a given XPath. I have the following code in a sample.rb file:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.changebadtogood.com/'))

desc "Trying to get the value of a given xpath"
task :sample do
  begin
    doc.xpath('//*[@id="view_more"]').each do |link|
      puts link.content
    end
  rescue Exception => e
    puts "error"
  end
end

Output is: View more issues .. When I try to get the value for a different XPath, such as:

Mechanize width and height methods

两盒软妹~` submitted on 2019-12-25 08:19:10

Question: I'm using Mechanize to scrape image URLs, and I'm reading http://mechanize.rubyforge.org/Mechanize/Page/Image.html to find out the width and height of the images. In the console I write:

url = "http://www.bbc.co.uk/"
page = Mechanize.new.get(url)
images_url = page.images.map{ |img| img.width }.compact

I get the result:

["1", "84", "432", "432", "432", "432", "432", "432", "432", "304", "144", "144", "144", "144", "144", "144", "432", "432", "432", "432", "432", "432", "432", "336", "62", "62", "62", "62",

Inserting data scraped with PHP cURL into MySQL

我的未来我决定 submitted on 2019-12-25 05:09:28

Question: I have been working on this script for the last couple of days and cannot seem to find a way to insert the data into MySQL. I am a beginner with PHP/MySQL and have only written a couple of simple scripts before. I am able to echo out the scraped data and get no error messages, but when I check phpMyAdmin the query isn't working (the results aren't being inserted into the database). Here is the code I have been working on:

require ("mysqli_connect.php");
include('../simple_html_dom
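The question is about PHP/mysqli and the code is cut off before the insert, but "echoes fine, nothing in the database" usually means the query was never executed or committed, or scraped strings broke the SQL. As a language-neutral sketch of a parameterized insert, written here with Python's stdlib sqlite3 so it runs anywhere (the items table and its columns are invented for illustration):

```python
import sqlite3

def save_rows(conn, rows):
    """Insert scraped (title, price) rows with a parameterized query,
    then commit; a missing execute/commit is a common reason for
    'no error but nothing in the database'."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, price TEXT)")
    # placeholders (?) keep quotes and other characters in scraped
    # text from corrupting the SQL statement
    conn.executemany("INSERT INTO items (title, price) VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
save_rows(conn, [("Widget", "9.99"), ("Gadget", "19.99")])
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # → 2
```

In PHP the same shape is a mysqli prepared statement with bound parameters followed by an explicit execute, and checking the return value of every call instead of assuming success.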

Convert a (nested) HTML unordered list of links to a PHP array of links

雨燕双飞 submitted on 2019-12-25 03:44:31

Question: I have a regular, nested HTML unordered list of links, and I'd like to scrape it with PHP and convert it to an array. The original list looks something like this:

<ul>
  <li><a href="http://someurl.com">First item</a>
    <ul>
      <li><a href="http://someotherurl.com/">Child of First Item</a></li>
      <li><a href="http://someotherurl.com/">Second Child of First Item</a></li>
    </ul>
  </li>
  <li><a href="http://bogusurl.com">Second item</a></li>
  <li><a href="http://bogusurl.com">Third item</a></li>
  <li><a href=
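The asker wants PHP (DOMDocument or SimpleXML would be the usual tools there); as a language-neutral sketch of the recursive structure involved, here is a small Python stdlib parser that turns such a nested <ul> into nested dicts. The text/href/children field names are my own choice, not anything from the question:

```python
from html.parser import HTMLParser

class NestedLinkParser(HTMLParser):
    """Collect a nested <ul> of <a> links into nested dicts of the form
    {'text': ..., 'href': ..., 'children': [...]}."""
    def __init__(self):
        super().__init__()
        self.root = []            # top-level items
        self.stack = [self.root]  # list currently being appended into
        self.current = None       # the <a> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current = {"text": "", "href": dict(attrs).get("href", ""),
                            "children": []}
        elif tag == "ul" and self.stack[-1]:
            # a <ul> nested inside an <li>: descend into the last item seen
            self.stack.append(self.stack[-1][-1]["children"])
        elif tag == "ul":
            self.stack.append(self.stack[-1])  # the outermost list

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.stack[-1].append(self.current)
            self.current = None
        elif tag == "ul" and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data
```

Usage is parser = NestedLinkParser(); parser.feed(markup); then parser.root holds the nested array-of-dicts. The stack of "current list" references is the whole trick, and it ports directly to a PHP DOMDocument traversal.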

htmlunit 404 error for scripts within page

一笑奈何 submitted on 2019-12-25 03:21:05

Question: I am using HtmlUnit to try to open a site, but I keep getting 404 errors. The site works in my Python scripts and in my browser, but not in HtmlUnit for some reason. I think my URL itself is fine, but it seems to be opening another site within the site and failing (example.com/SharedResources/Default/js/coda_bubble/jquery.codabubble.js). For anyone familiar with HtmlUnit: is there any way to get it not to automatically load these other areas of the site? Or to more gracefully handle errors on the

scraping way2sms with mechanize

无人久伴 submitted on 2019-12-25 01:54:01

Question: I am trying to send an SMS by scraping way2sms.com, but I am unable to log in to way2sms.com using mechanize. I am using the following code to submit the login form:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_refresh(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0')]
res = br.open('http://wwwa.way2sms.com/content/prehome.jsp')
link = list(br.links())[5]
res = br.follow_link(link)
br.form = list
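For reference, mechanize can usually address the login form directly with select_form instead of rebuilding br.form from a list. A sketch only: the form index and the username/password field names below are guesses, since way2sms's markup isn't shown above, and mechanize is a third-party install:

```python
def way2sms_login(username, password):
    """Attempt a way2sms login; returns the response after submitting.
    Form index and field names are assumptions; inspect the page first."""
    import mechanize  # third-party: pip install mechanize
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.set_handle_refresh(False)
    br.addheaders = [("User-agent",
                      "Mozilla/5.0 (X11; Linux x86_64; rv:18.0) "
                      "Gecko/20100101 Firefox/18.0")]
    br.open("http://wwwa.way2sms.com/content/prehome.jsp")
    # List the forms once with: for f in br.forms(): print(f)
    # then pick the right index and the real control names.
    br.select_form(nr=0)          # assumed: login form is the first form
    br.form["username"] = username  # assumed field name
    br.form["password"] = password  # assumed field name
    return br.submit()
```

Printing br.forms() interactively is the quickest way to confirm the indices and control names before hard-coding them.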