Question
Just to note from the start, the content is uncopyrighted and I would like to automate the process of acquiring the text for the purpose of a project.
I'd like to extract the text from a particular, recurring DIV (it has its own 'class', in case that makes it easier) that sits in each page of a simply designed website.
There is a single archive page on the site with a list of all of the pages containing the content I would like.
The site is www.zenhabits.net
I imagine this could be achieved with some sort of script, but have no idea where to start.
I appreciate any help.
-Nathan.
Answer 1:
This is pretty straightforward.
Firstly, get all the links from this site, and throw them all into an array:
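set_time_limit(0); // this could take a while...
ignore_user_abort(true); // keep running in case the browser times out
$links = array(); // start with an empty list so later loops never see an undefined variable
$html_output = file_get_contents("http://zenhabits.net/archives/");
# -- Do a preg_match_all on the HTML, and grab all links:
# -- (.*?) is non-greedy, so each match stops at the first "> instead of the last one on the line
if (preg_match_all('/<a href="http:\/\/zenhabits\.net\/(.*?)">/', $html_output, $matches)) {
    # -- Append Data To Array
    foreach ($matches[1] as $secLink) {
        $links[] = "http://zenhabits.net/" . $secLink;
    }
}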
I tested this for you, and:
//first 3 are returning something weird, but you don't need them - so I shall remove them xD
unset($links[0]);
unset($links[1]);
unset($links[2]);
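If the regular expression keeps picking up odd entries, the same link-gathering step can be sketched with PHP's built-in DOMDocument instead. This is only a rough alternative, not what I ran against the site: extractLinks() and its $baseUrl parameter are names I made up for the example, and the inline sample HTML stands in for the real archive page.

```php
<?php
// Sketch: collect article links with DOMDocument instead of a regular expression.
// extractLinks() and $baseUrl are hypothetical names used for illustration only.
function extractLinks(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Keep only links under the base URL, and skip the base URL itself.
        if (strpos($href, $baseUrl) === 0 && $href !== $baseUrl) {
            $links[$href] = true; // array keys de-duplicate repeated links
        }
    }
    return array_keys($links);
}

// Small inline sample standing in for the archive page (assumed markup).
$sample = '<html><body>'
    . '<a href="http://zenhabits.net/">Home</a>'
    . '<a href="http://zenhabits.net/first-post/" title="First">First</a>'
    . '<a href="http://zenhabits.net/first-post/">First again</a>'
    . '<a href="http://example.com/elsewhere/">Elsewhere</a>'
    . '</body></html>';

print_r(extractLinks($sample, 'http://zenhabits.net/'));
```

For the real site you would pass the result of file_get_contents() on the archive page as the first argument. Because links are filtered by prefix rather than by position, there is no need to unset entries by index.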
Now that's all done, it's time to go through all of THOSE links (in the array $links) and grab each page's content:
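$contentFromPage = array(); // collected contents of each div class="post"
foreach ($links as $contLink) {
    $html_output_c = file_get_contents($contLink); // returns false if the page could not be fetched
    // Note: (.*) is greedy and the "s" flag lets it span lines, so this captures up to the LAST </div> on the page
    if ($html_output_c !== false && preg_match('|<div class="post">(.*)</div>|s', $html_output_c, $c_matches)) {
        # -- Append Data To Array
        echo "data found <br>";
        $contentFromPage[] = $c_matches[1];
    } else {
        echo "no content found in: $contLink -- <br><br><br>";
    }
} // end of foreach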
I've basically just written a whole crawler script for you..
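One caveat with matching a div by regular expression is that nested divs make it hard to know where the block ends. A minimal sketch of the same step using DOMDocument and DOMXPath follows. It returns the plain text of the div rather than its HTML; extractPostText() is a hypothetical helper name, and the sample markup is invented for the example.

```php
<?php
// Sketch: pull the text of the first <div class="post"> using DOMXPath.
// extractPostText() is a hypothetical helper; "post" is the class name used above.
function extractPostText(string $html, string $class = 'post'): ?string
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word, so class="post featured" also qualifies.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null; // no matching div on this page
    }
    // textContent includes the text of nested elements, with the tags removed.
    return trim($nodes->item(0)->textContent);
}

// Invented sample page with a nested div inside the post.
$page = '<html><body><div class="post"><p>a</p><div>b</div><p>c</p></div>'
    . '<div class="footer">d</div></body></html>';

var_dump(extractPostText($page));
```

Inside the crawl loop you would call it on each fetched page and skip any page for which it returns null.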
And now, loop over the content array and do whatever you want with it (here we'll put each entry into its own text file):
// $contentFromPage now holds the content of every div class="post" (as an array), so do what you want with it
foreach ($contentFromPage as $content) {
    # -- We need a name for each text file --
    $textName = rand() . "_content_" . rand() . ".txt"; // we'll just use some random numbers and text
    // define the file path (where you want the txt file to be saved)
    $path = "../"; // the parent of the current working directory
    $full_path = $path . $textName;
    // now save the file..
    file_put_contents($full_path, $content);
    // and that's it
} // end of foreach
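Random file names work, but they make it hard to tell which file came from which page. If you keep each page's URL alongside its content, a name can be derived from the URL instead. The following is just one way to do that, and filenameFromUrl() is a hypothetical helper, not part of the script above.

```php
<?php
// Sketch: build a readable file name from a page URL.
// filenameFromUrl() is a hypothetical helper introduced for this example.
function filenameFromUrl(string $url): string
{
    // Take only the path part of the URL and drop leading/trailing slashes.
    $path = trim((string) parse_url($url, PHP_URL_PATH), '/');
    // Replace every run of characters other than letters, digits, "_" and "-" with "-".
    $slug = preg_replace('/[^A-Za-z0-9_-]+/', '-', $path);
    if ($slug === '' || $slug === null) {
        $slug = 'index'; // fallback for URLs without a usable path
    }
    return $slug . '.txt';
}

echo filenameFromUrl('http://zenhabits.net/first-post/'), "\n";
echo filenameFromUrl('http://zenhabits.net/2012/04/simple/'), "\n";
```

To use it, store results as $contentFromPage[$contLink] = ... in the crawl loop, then iterate with foreach ($contentFromPage as $url => $content) and pass $url to the helper.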
Answer 2:
You can also use the Simple HTML DOM Parser script to extract the content. It is a very useful script that I have been using for about 1.6 years. You can download it from http://simplehtmldom.sourceforge.net/ . It is well documented, with examples. I hope this helps you solve your problem.
Source: https://stackoverflow.com/questions/10295669/extract-text-from-a-div-that-occurs-on-multiple-pages-on-a-website-then-output