How do I screen scrape a website and get data within div?

Question

How can I screen scrape a website using cURL and show the data within a specific div?

Answer 1:

Download the page using cURL (there are a lot of examples in the documentation). Then use a DOM parser, for example Simple HTML DOM or PHP's DOM extension, to extract the value from the div element.
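A minimal sketch of this approach, assuming the target div carries an id attribute. The helper names fetchPage and divTextById, the example URL, and the id "content" are made up for illustration and are not from the original answer.

```php
<?php
// Hypothetical helper: download a page with cURL and return its body as a string.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // give up after 15 seconds
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $html;
}

// Hypothetical helper: return the trimmed text of the first div whose id matches, or null.
function divTextById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; silence parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('id') === $id) {
            return trim($div->textContent);
        }
    }
    return null;
}

// Usage (URL and id are placeholders):
// echo divTextById(fetchPage('http://www.example.com/page.html'), 'content');
```

Keeping the download and the parsing in separate functions makes it possible to try the parsing step on a saved copy of the page before sending real requests.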

Answer 2:

After downloading the page with cURL, use XPath to select the div and extract its content.
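One way this could look with PHP's DOMXPath class, as a rough sketch. The helper name divHtmlByXPath and the class name "price" are assumptions for illustration. The HTML string is expected to have been downloaded already, for example with curl_exec.

```php
<?php
// Hypothetical helper: return the inner HTML of the first node matched by an XPath query, or null.
function divHtmlByXPath(string $html, string $query): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate malformed markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialize each child of the matched node to get its inner HTML.
    $inner = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Usage ($html holds the downloaded page; the class name is a placeholder):
// echo divHtmlByXPath($html, '//div[@class="price"]');
```

XPath expressions can select by id, class, position, or contained text, so the same helper covers queries such as '//div[@id="main"]' or '(//div)[3]'.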

Answer 3:

A possible alternative.

# We will store the web page in a string variable.
var string page

# Read the page into the string variable.
cat "http://www.abczyx.com/path/to/page.ext" > $page

# Output the portion in the third (3rd) instance of "<div...</div>"
stex -r -c "^<div&</div\>^3" $page

This code is in biterscripting. The 3 is only a sample value that extracts the third div. If you want to extract the div that contains a particular string, say "ABC", use this command syntax instead.

stex -r -c "^<div&ABC&</div\>^" $page

Take a look at the script at http://www.biterscripting.com/helppages/SS_ExtractTable.html. It shows how to extract an element (div, table, frame, etc.) when the elements are nested.

Answer 4:

Fetch the website content using a cURL GET request. There's a code sample on the curl_exec manual page.

Use a regular expression to search for the data you need. There's a code sample on the preg_match manual page, but you'll need to read up on regular expressions to build the pattern you need. As Yacoby mentioned (which I hadn't thought of), a better idea may be to examine the DOM of the HTML page using PHP's SimpleXML or DOM parser.

Output the information you've found with the regex or parser in the HTML of your page (within the required div).
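The regex and output steps above can be sketched as follows. This is only an illustration: the helper names divByRegex and renderResult and the ids "news" and "result" are assumptions. The pattern stops at the first closing div tag and expects the id in double quotes, so it does not handle nested divs, which is one reason a DOM parser is usually the safer choice.

```php
<?php
// Hypothetical helper: grab the contents of the first div with the given id using a regex.
// Limitation: matching ends at the first </div>, so nested divs are cut short.
function divByRegex(string $html, string $id): ?string
{
    $pattern = '~<div\b[^>]*\bid="' . preg_quote($id, '~') . '"[^>]*>(.*?)</div>~is';
    return preg_match($pattern, $html, $m) === 1 ? trim($m[1]) : null;
}

// Hypothetical helper: wrap the extracted text in a div for your own page,
// escaping it so scraped markup is not injected into the output.
function renderResult(string $text): string
{
    return '<div id="result">' . htmlspecialchars($text) . '</div>';
}

// Usage ($html holds the page fetched with curl_exec; the id is a placeholder):
// $found = divByRegex($html, 'news');
// echo $found === null ? 'Not found' : renderResult($found);
```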

Source: https://stackoverflow.com/questions/2523096/how-do-i-screen-scrape-a-website-and-get-data-within-div

Tags

php

curl

screen-scraping