screen-scraping

Using Java to pull data from web [closed]

[亡魂溺海] submitted on 2019-12-25 19:43:01

Question (closed: this question needs to be more focused and is not currently accepting answers; closed 5 years ago). I was wondering if there is a way to pull specific data from a website using Java (Eclipse). For example, stock information from Yahoo Finance or from Bloomberg. I've looked around and have found some resources, but I haven't been able to get them to work; perhaps I'm missing

Skip the error while scraping a list of URLs from a CSV

隐身守侯 submitted on 2019-12-25 19:08:23

Question: I managed to scrape a list of URLs from a CSV file, but I have a problem: the scraping stops when it hits a broken link. It also prints a lot of None lines; is it possible to get rid of them? I would appreciate some help here. Thank you in advance! Here is the code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup  # required to parse html
import requests  # required to make request

# read file
with open('urls.csv', 'r') as f:
    csv_raw_cont = f.read()

# split by line
split_csv
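A minimal sketch of one way to survive broken links and drop the None results. It assumes the CSV simply holds one URL per line (the real file layout is cut off above), and the third-party imports (requests, beautifulsoup4) are kept inside the helper so the filtering logic stands on its own:

```python
import csv

def fetch_title(url, timeout=10):
    """Return the page <title>, or None if the request fails or times out."""
    import requests  # third-party: pip install requests
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # treat 404s and other HTTP errors as failures
    except requests.RequestException:
        return None  # broken link: skip it instead of crashing
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None

def drop_none(results):
    """Filter out the None entries produced by broken links."""
    return [r for r in results if r is not None]

def scrape_csv(path="urls.csv"):
    """Read one URL per line from the CSV and print only the reachable titles."""
    with open(path, newline="") as f:
        urls = [row[0] for row in csv.reader(f) if row]
    for title in drop_none(fetch_title(u) for u in urls):
        print(title)
```

Wrapping each request in try/except is what keeps the loop going; the separate drop_none pass is what removes the printed None lines the asker complains about.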

Using PHP's DOMDocument::preserveWhiteSpace = false and still getting whitespace

本小妞迷上赌 submitted on 2019-12-25 18:21:36

Question: I'm scraping this page: http://kat.ph/search/example/?field=seeders&sorder=desc in this way:

...
curl_setopt( $curl, CURLOPT_URL, $url );
$header = array (
    'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding:gzip,deflate,sdch',
    'Accept-Language:en-US,en;q=0.8',
    'Cache-Control:max-age=0',
    'Connection:keep-alive',
    'Host:kat.ph',
    'User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19

Screen scraping through nokogiri or hpricot

可紊 submitted on 2019-12-25 18:11:10

Question: I'm trying to get the actual value of a given XPath. I have the following code in a sample.rb file:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.changebadtogood.com/'))

desc "Trying to get the value of a given xpath"
task :sample do
  begin
    doc.xpath('//*[@id="view_more"]').each do |link|
      puts link.content
    end
  rescue Exception => e
    puts "error"
  end
end

Output is: View more issues .. When I try to get the value for a different XPath, such as:

Mechanize width and height methods

两盒软妹~` submitted on 2019-12-25 08:19:10

Question: I'm using Mechanize to scrape image URLs, and I'm reading http://mechanize.rubyforge.org/Mechanize/Page/Image.html to find out the width and height of the images. In the console I write:

url = "http://www.bbc.co.uk/"
page = Mechanize.new.get(url)
images_url = page.images.map{ |img| img.width }.compact

I get the result:

["1", "84", "432", "432", "432", "432", "432", "432", "432", "304", "144", "144", "144", "144", "144", "144", "432", "432", "432", "432", "432", "432", "432", "336", "62", "62", "62", "62",

Inserting data scraped with PHP cURL into MySQL

我的未来我决定 submitted on 2019-12-25 05:09:28

Question: I have been working on this script for the last couple of days and cannot seem to find a way to insert the data into MySQL. I am a beginner with PHP/MySQL and have only written a couple of simple scripts before. I am able to echo out the scraped data and get no error messages, but when I check phpMyAdmin the query isn't working (the results aren't being inserted into the database). Here is the code I have been working on:

require ("mysqli_connect.php");
include('../simple_html_dom
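The question is about PHP/mysqli and the code is cut off before the insert, but "echoes fine, nothing in the database" usually means the query was never executed or committed, or scraped strings broke the SQL. As a language-neutral sketch of a parameterized insert, written here with Python's stdlib sqlite3 so it runs anywhere (the items table and its columns are invented for illustration):

```python
import sqlite3

def save_rows(conn, rows):
    """Insert scraped (title, price) rows with a parameterized query,
    then commit; a missing execute/commit is a common reason for
    'no error but nothing in the database'."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, price TEXT)")
    # placeholders (?) keep quotes and other characters in scraped
    # text from corrupting the SQL statement
    conn.executemany("INSERT INTO items (title, price) VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
save_rows(conn, [("Widget", "9.99"), ("Gadget", "19.99")])
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # → 2
```

In PHP the same shape is a mysqli prepared statement with bound parameters followed by an explicit execute, and checking the return value of every call instead of assuming success.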

Convert a (nested) HTML unordered list of links to a PHP array of links

雨燕双飞 submitted on 2019-12-25 03:44:31

Question: I have a regular, nested HTML unordered list of links, and I'd like to scrape it with PHP and convert it to an array. The original list looks something like this:

<ul>
  <li><a href="http://someurl.com">First item</a>
    <ul>
      <li><a href="http://someotherurl.com/">Child of First Item</a></li>
      <li><a href="http://someotherurl.com/">Second Child of First Item</a></li>
    </ul>
  </li>
  <li><a href="http://bogusurl.com">Second item</a></li>
  <li><a href="http://bogusurl.com">Third item</a></li>
  <li><a href=
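The asker wants PHP (DOMDocument or SimpleXML would be the usual tools there); as a language-neutral sketch of the recursive structure involved, here is a small Python stdlib parser that turns such a nested <ul> into nested dicts. The text/href/children field names are my own choice, not anything from the question:

```python
from html.parser import HTMLParser

class NestedLinkParser(HTMLParser):
    """Collect a nested <ul> of <a> links into nested dicts of the form
    {'text': ..., 'href': ..., 'children': [...]}."""
    def __init__(self):
        super().__init__()
        self.root = []            # top-level items
        self.stack = [self.root]  # list currently being appended into
        self.current = None       # the <a> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current = {"text": "", "href": dict(attrs).get("href", ""),
                            "children": []}
        elif tag == "ul" and self.stack[-1]:
            # a <ul> nested inside an <li>: descend into the last item seen
            self.stack.append(self.stack[-1][-1]["children"])
        elif tag == "ul":
            self.stack.append(self.stack[-1])  # the outermost list

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.stack[-1].append(self.current)
            self.current = None
        elif tag == "ul" and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data
```

Usage is parser = NestedLinkParser(); parser.feed(markup); then parser.root holds the nested array-of-dicts. The stack of "current list" references is the whole trick, and it ports directly to a PHP DOMDocument traversal.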

htmlunit 404 error for scripts within page

一笑奈何 submitted on 2019-12-25 03:21:05

Question: I am using HtmlUnit to try to open a site, but I keep getting 404 errors. The site works in my Python scripts and in my browser, but not in HtmlUnit for some reason. I think my URL itself is fine, but it seems to be opening another site within the site and failing (example.com/SharedResources/Default/js/coda_bubble/jquery.codabubble.js). For anyone familiar with HtmlUnit: is there any way to get it not to automatically load these other areas of the site? Or to more gracefully handle errors on the

scraping way2sms with mechanize

无人久伴 submitted on 2019-12-25 01:54:01

Question: I am trying to send an SMS by scraping way2sms.com, but I am unable to log in to way2sms.com using mechanize. I am using the following code to submit the login form:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_refresh(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0')]
res = br.open('http://wwwa.way2sms.com/content/prehome.jsp')
link = list(br.links())[5]
res = br.follow_link(link)
br.form = list
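For reference, mechanize can usually address the login form directly with select_form instead of rebuilding br.form from a list. A sketch only: the form index and the username/password field names below are guesses, since way2sms's markup isn't shown above, and mechanize is a third-party install:

```python
def way2sms_login(username, password):
    """Attempt a way2sms login; returns the response after submitting.
    Form index and field names are assumptions; inspect the page first."""
    import mechanize  # third-party: pip install mechanize
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.set_handle_refresh(False)
    br.addheaders = [("User-agent",
                      "Mozilla/5.0 (X11; Linux x86_64; rv:18.0) "
                      "Gecko/20100101 Firefox/18.0")]
    br.open("http://wwwa.way2sms.com/content/prehome.jsp")
    # List the forms once with: for f in br.forms(): print(f)
    # then pick the right index and the real control names.
    br.select_form(nr=0)          # assumed: login form is the first form
    br.form["username"] = username  # assumed field name
    br.form["password"] = password  # assumed field name
    return br.submit()
```

Printing br.forms() interactively is the quickest way to confirm the indices and control names before hard-coding them.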