screen-scraping

Some help scraping a page in Java

Submitted by 房东的猫 on 2019-12-30 09:21:52
Question: I need to scrape a web page using Java. I've read that regex is a pretty inefficient way of doing it and that one should instead load the page into a DOM Document and navigate that. I've tried reading the documentation, but it seems too extensive and I don't know where to begin. Could you show me how to scrape this table into an array? I can try to figure out my way from there. A snippet/example would do just fine too. Thanks. Answer 1: You can try jsoup: Java HTML Parser. It is an excellent library with good sample
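jsoup itself is a Java library, and the asker's table is not shown here, so as a language-neutral illustration of the DOM approach the answer recommends (parse, walk rows, collect cells into an array), here is a sketch in Python using only the standard library's `html.parser`, with a made-up table snippet:

```python
from html.parser import HTMLParser

# The DOM-style approach the answer recommends over regex: walk the parsed
# tree and collect each <tr> as a list of cell strings. The table snippet
# fed in below is invented for illustration.
class TableScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []          # the resulting 2-D array
        self._row = None        # cells of the row being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

html = ("<table><tr><th>Name</th><th>Price</th></tr>"
        "<tr><td>Foo</td><td>42</td></tr></table>")
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)  # [['Name', 'Price'], ['Foo', '42']]
```

In jsoup the same thing is shorter (CSS selectors like `doc.select("table tr")`), but the row/cell walk is the same idea.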

PHP function to grab all links inside a <DIV> on remote site using scrape method

Submitted by 亡梦爱人 on 2019-12-30 07:18:45
Question: Does anyone have a PHP function that can grab all links inside a specific DIV on a remote site? Usage might be: $links = grab_links($url, $divname); and it would return an array I can use. Grabbing links I can figure out, but I'm not sure how to restrict it to a specific div. Thanks! Scott Answer 1: Check out PHP XPath. It will let you query a document for the contents of specific tags and so on. The example on the PHP site is pretty straightforward: http://php.net/manual/en/simplexmlelement.xpath.php
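The answer's suggestion is PHP's SimpleXMLElement::xpath; as an illustration of the same "locate the div, then collect its anchors" idea, here is a sketch with Python's stdlib `xml.etree`. Note the caveat: ElementTree needs well-formed markup, so the fragment below is invented and well-formed, while a real remote page would first need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET

def grab_links(html, div_id):
    """Return the href of every <a> inside the <div> with the given id."""
    root = ET.fromstring(html)
    div = root.find(f".//div[@id='{div_id}']")     # locate the target <div>
    if div is None:
        return []
    return [a.get("href") for a in div.iter("a")]  # every <a> beneath it

page = ("<html><body>"
        "<div id='nav'><a href='/home'>Home</a></div>"
        "<div id='content'><a href='/a'>A</a><p><a href='/b'>B</a></p></div>"
        "</body></html>")
print(grab_links(page, "content"))  # ['/a', '/b']
```

The two-step shape (find the container first, then iterate descendants) is what keeps the result restricted to the one div, which was the part the asker was stuck on.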

How does a site like kayak.com aggregate content? [closed]

Submitted by …衆ロ難τιáo~ on 2019-12-29 10:08:21
Question: Closed. This question needs to be more focused. It is not currently accepting answers. Closed 3 years ago. Greetings, I've been toying with an idea for a new project and was wondering if anyone has any idea how a service like Kayak.com is able to aggregate data from so many sources so quickly and accurately. More specifically, do you think Kayak.com is interacting with APIs or are

Using Nokogiri to Split Content on BR tags

Submitted by 情到浓时终转凉″ on 2019-12-29 07:43:10
Question: I have a snippet of code I'm trying to parse with Nokogiri that looks like this: <td class="j"> <a title="title text1" href="http://link1.com">Link 1</a> (info1), Blah 1,<br> <a title="title text2" href="http://link2.com">Link 2</a> (info1), Blah 1,<br> <a title="title text2" href="http://link3.com">Link 3</a> (info2), Blah 1 Foo 2,<br> </td> I have access to the source of the td.j using something like this: data_items = doc.css("td.j") My goal is to split each of those lines up into an array
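The question is about Nokogiri (Ruby), but the underlying technique, iterating over a node's children and flushing a segment each time a `<br>` is seen, is language-independent. Here is the same idea sketched in Python's stdlib `html.parser`, fed with a fragment mirroring the one in the question:

```python
from html.parser import HTMLParser

# Split a node's content on <br>: accumulate text, and every time a <br>
# appears, emit the accumulated buffer as one segment.
class BrSplitter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.segments = []
        self._buf = ""

    def handle_starttag(self, tag, attrs):
        if tag == "br":                  # <br> ends one logical line
            if self._buf.strip():
                self.segments.append(self._buf.strip())
            self._buf = ""

    def handle_data(self, data):
        self._buf += data

snippet = ('<a href="http://link1.com">Link 1</a> (info1), Blah 1,<br>'
           '<a href="http://link2.com">Link 2</a> (info1), Blah 1,<br>')
s = BrSplitter()
s.feed(snippet)
print(s.segments)  # ['Link 1 (info1), Blah 1,', 'Link 2 (info1), Blah 1,']
```

In Nokogiri the equivalent is to iterate `data_items.first.children`, appending text to a buffer and starting a new one whenever `node.name == "br"`.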

Scraping javascript-generated data using Python

Submitted by 混江龙づ霸主 on 2019-12-29 05:26:11
Question: I want to scrape some data from the following URL using Python: http://www.hankyung.com/stockplus/main.php?module=stock&mode=stock_analysis_infomation&itemcode=078340 It's a summary of company information. What I want to scrape is not shown on the first page. By clicking the tab named "재무제표" you can access the financial statement, and by clicking the tab named "현금흐름표" you can access "Cash Flow". I want to scrape the "Cash Flow" data. However, the Cash Flow data is generated by JavaScript at that URL. The
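The usual fix for JavaScript-generated tabs is that the data is fetched from a separate endpoint: open the browser's developer tools, watch the network panel while clicking the tab, and request that endpoint directly, skipping the rendering step entirely. The real endpoint for this site is not shown in the question, so the payload below is an invented stand-in just to illustrate parsing the JSON such an endpoint would typically return:

```python
import json

# Invented stand-in for the JSON a cash-flow endpoint might return;
# in practice you would fetch it with urllib/requests from the URL
# discovered in the browser's network panel.
sample_response = '''
{"itemcode": "078340",
 "cash_flow": [{"year": 2018, "operating": 1200, "investing": -300},
               {"year": 2019, "operating": 1500, "investing": -450}]}
'''

def parse_cash_flow(raw):
    """Map each year to its operating cash flow."""
    data = json.loads(raw)
    return {row["year"]: row["operating"] for row in data["cash_flow"]}

print(parse_cash_flow(sample_response))  # {2018: 1200, 2019: 1500}
```

If the tab's data turns out not to come from a clean JSON endpoint, the fallback is to drive a real browser (e.g. Selenium) and read the rendered DOM.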

scrapy log handler

Submitted by 删除回忆录丶 on 2019-12-28 16:13:08
Question: I seek your help with the following two questions. First, how do I set handlers for the different log levels, as in Python? Currently I have STATS_ENABLED = True STATS_DUMP = True LOG_FILE = 'crawl.log' but the debug messages generated by Scrapy are also added to the log file. They are very long and, ideally, I would like DEBUG-level messages to be left on standard error and INFO messages to be dumped to my LOG_FILE. Secondly, the docs say: The logging service must be explicitly started
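The split the asker wants (DEBUG to stderr, INFO and above to the file) can be sketched with Python's stdlib `logging`, which recent Scrapy versions build on; exactly how to wire it into a given Scrapy version varies, so treat this as the general mechanism rather than a Scrapy setting:

```python
import logging
import os
import sys
import tempfile

# Two handlers on one logger: the file handler's level keeps DEBUG out of
# the file; a filter on the stderr handler lets only DEBUG through there.
logfile = os.path.join(tempfile.gettempdir(), "crawl.log")  # arbitrary path

logger = logging.getLogger("crawler")
logger.setLevel(logging.DEBUG)

file_handler = logging.FileHandler(logfile, mode="w")
file_handler.setLevel(logging.INFO)          # file: INFO and above only
logger.addHandler(file_handler)

stderr_handler = logging.StreamHandler(sys.stderr)
stderr_handler.addFilter(lambda r: r.levelno == logging.DEBUG)  # stderr: DEBUG only
logger.addHandler(stderr_handler)

logger.debug("fetching page")    # goes to stderr only
logger.info("crawl finished")    # goes to the file only

file_handler.flush()
with open(logfile) as f:
    content = f.read()
print("crawl finished" in content, "fetching page" in content)  # True False
```

The key point is that levels and filters live on handlers, not only on the logger, so one logger can route different severities to different destinations.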

Screen Scraping from a web page with a lot of Javascript [closed]

Submitted by 南笙酒味 on 2019-12-27 17:06:51
Question: Closed. This question is off-topic. It is not currently accepting answers. Closed 3 years ago. I have been asked to write an app which screen-scrapes info from an intranet web page and presents certain info from it in a nice, easy-to-view format. The web page is a real mess and requires the user to click on half a dozen icons to discover whether an ordered item has arrived or been receipted. As you can

Is there a PHP equivalent of Perl's WWW::Mechanize?

Submitted by 巧了我就是萌 on 2019-12-27 10:46:31
Question: I'm looking for a library with functionality similar to Perl's WWW::Mechanize, but for PHP. Basically, it should let me submit HTTP GET and POST requests with a simple syntax, then parse the resulting page and return, in a simple format, all forms and their fields along with all links on the page. I know about cURL, but it's a little too barebones, and the syntax is pretty ugly (tons of curl_foo($curl_handle, ...) statements). Clarification: I want something more high-level than the

Is scraping from public Facebook pages legal? [closed]

Submitted by て烟熏妆下的殇ゞ on 2019-12-25 20:01:39
Question: Closed. This question is off-topic. It is not currently accepting answers. Closed 3 years ago. My question is: is scraping from public Facebook pages legal? Why am I asking this: to get the rating of Facebook pages we don't own using the Graph API, we would need a page access token, and that is impossible (because the pages I am talking about are not mine). For that reason I am