Parse and process HTML in PHP: fetching WordPress plugin metadata with a scraper

Question

How do I parse and process HTML in PHP? A simple scraper

I'm currently working on a PHP parser that builds a small preview of a page from a URL given by the user. I'd like to retrieve only the title of the page and a short chunk of information (a bit of text).

The project: build a list of metadata for popular WordPress plugins (cf. https://de.wordpress.org/plugins/browse/popular/) by gathering the first 50 URLs, that is, the 50 plugins that are of interest. The challenge: I want to fetch the metadata of all existing plugins. After the fetch, I want to filter for the plugins with the newest timestamp, i.e. those that were updated most recently. It is all about how current the plugins are.

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database
... and so on.

So, taking one page into consideration, i.e. fetching the metadata of one WordPress plugin: with simple_html_dom (http://simplehtmldom.sourceforge.net/) I guess there is an appropriate way to do this without any other external libraries or classes. So far I have also tried the DOMDocument class (http://docs.php.net/manual/en/domdocument.loadhtml.php), loading the HTML and displaying it on the screen, and now I am wondering about the proper way to do it. I am considering simple_html_dom because it should make this very easy. Here is an example of how to pull the title and the meta text (description).

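<?php
// simple_html_dom.php must be available next to this script
require 'simple_html_dom.php';

$html = file_get_html('https://wordpress.org/plugins/wp-job-manager/');
if ($html === false) {
    exit("Could not load the page.\n");
}

// find($selector, 0) returns the first matching element, or null if there is none
$title = $html->find('h1.plugin-title', 0);
$text  = $html->find('.entry-meta', 0);

echo ($title ? $title->plaintext : '(no title found)') . "<br>\n";
echo ($text ? $text->plaintext : '(no meta text found)') . "\n";
?>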
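
Since the DOMDocument class was also mentioned as an option, here is a minimal sketch of the same extraction using only PHP's built-in DOMDocument and DOMXPath. The sample markup, the helper names `loadXPath` and `firstTextByClass`, and the assumption that the title sits in an `h1` and the meta text in a `div` are all illustrative; the real page structure may differ and should be checked in the page source.

```php
<?php
// Sketch: extract a title and a meta block with DOMDocument + DOMXPath.
// The markup in $sample is a made-up stand-in for a real plugin page.

function loadXPath(string $html): DOMXPath
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    return new DOMXPath($doc);
}

function firstTextByClass(DOMXPath $xpath, string $tag, string $class): ?string
{
    // Match elements whose class attribute contains $class as a whole token.
    $query = sprintf(
        '//%s[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $tag,
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // Collapse runs of whitespace so the text fits on one line.
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}

$sample = '<html><body>
  <h1 class="plugin-title">WP Job Manager</h1>
  <div class="entry-meta">Version: 1.0 <span>Last updated: 19 hours ago</span></div>
</body></html>';

$xpath = loadXPath($sample);
$title = firstTextByClass($xpath, 'h1', 'plugin-title');
$meta  = firstTextByClass($xpath, 'div', 'entry-meta');

echo $title, "\n";
echo $meta, "\n";
```

For a live page, the string returned by `file_get_contents($url)` could be passed to `loadXPath()` in place of `$sample`.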

See the page source of https://wordpress.org/plugins/wp-job-manager/ for reference. Each WordPress plugin page has the following set of metadata:

Version: 1.9.5.12
Installations: 10,000+
WordPress Version: 5.0 or higher
Tested up to: 5.4
PHP Version: 5.6 or higher
Tags: database member sign-up form volunteer
Last updated: 19 hours ago
Plugin ratings
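
Once the meta block is available as plain text, the label/value pairs listed above could be turned into an associative array. The following is a minimal sketch that assumes one "Label: value" pair per line; the function name `parseMetaLines` is hypothetical, and the real markup may need a different split. It also shows that `strtotime()` can turn a relative phrase such as "19 hours ago" into a timestamp, which helps when comparing update times.

```php
<?php
// Sketch: turn "Label: value" lines into an associative array.

function parseMetaLines(string $block): array
{
    $meta = [];
    foreach (preg_split('/\R/', $block) as $line) {
        // Split on the first colon only, so values may contain colons themselves.
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue; // not a "Label: value" line
        }
        $meta[trim($parts[0])] = trim($parts[1]);
    }
    return $meta;
}

// Sample text modelled on the metadata list in the question.
$block = "Version: 1.9.5.12\nInstallations: 10,000+\nTested up to: 5.4\nLast updated: 19 hours ago";
$meta  = parseMetaLines($block);

// strtotime() understands relative phrases such as "19 hours ago".
$updatedAt = strtotime($meta['Last updated']);

echo $meta['Version'], "\n";
echo date('Y-m-d H:i', $updatedAt), "\n";
```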

The project consists of two parts. The looping part: looping over https://de.wordpress.org/plugins/browse/popular/ and gathering approximately 50 to 80 URLs (which seems to be pretty straightforward). The parser part, where I have some issues: properly getting the data for the tags and the plugin rating.
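
The looping part could be sketched roughly as follows: load the listing HTML, walk over all links, and keep those that look like plugin pages. The function name `extractPluginUrls`, the URL pattern, and the sample markup are assumptions for illustration; the real listing page may link to plugins differently.

```php
<?php
// Sketch: collect unique plugin URLs from the HTML of a listing page.

function extractPluginUrls(string $html, int $limit = 50): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Assumed pattern: https://[xx.]wordpress.org/plugins/<slug>/
        if (preg_match('#^https://(?:[a-z-]+\.)?wordpress\.org/plugins/([a-z0-9-]+)/?$#', $href, $m)) {
            $urls[$m[1]] = $href; // keyed by slug, so duplicates collapse
        }
        if (count($urls) >= $limit) {
            break;
        }
    }
    return array_values($urls);
}

// Made-up listing fragment with one duplicate and one unrelated link.
$sample = '<ul>
  <li><a href="https://wordpress.org/plugins/wp-job-manager/">WP Job Manager</a></li>
  <li><a href="https://wordpress.org/plugins/ninja-forms/">Ninja Forms</a></li>
  <li><a href="https://wordpress.org/plugins/ninja-forms/">Ninja Forms (again)</a></li>
  <li><a href="https://wordpress.org/about/">About</a></li>
</ul>';

$urls = extractPluginUrls($sample);
print_r($urls);
```

Each returned URL can then be handed to the parser part, one page at a time.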

Update: the plugins API can help here. A great approach is described in "Getting a list of ALL plugins".
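
A rough sketch of the API route, under stated assumptions: the endpoint `https://api.wordpress.org/plugins/info/1.2/`, the `query_plugins` action, the `browse`, `page` and `per_page` request parameters, and the `plugins`, `slug` and `last_updated` response fields are taken from the public wordpress.org API as I understand it and should be verified against its documentation. The function names are hypothetical, and the sample records below are made up to show the sorting step without a network call.

```php
<?php
// Sketch: query the wordpress.org plugins API and sort plugins by last update.
// Endpoint, parameters and field names are assumptions; verify before relying on them.

function buildQueryUrl(int $page = 1, int $perPage = 50): string
{
    $params = [
        'action'  => 'query_plugins',
        'request' => ['browse' => 'popular', 'page' => $page, 'per_page' => $perPage],
    ];
    return 'https://api.wordpress.org/plugins/info/1.2/?' . http_build_query($params);
}

function fetchPlugins(int $page = 1, int $perPage = 50): array
{
    // Network call; returns an empty array if the request or decoding fails.
    $json = file_get_contents(buildQueryUrl($page, $perPage));
    $data = $json === false ? null : json_decode($json, true);
    return $data['plugins'] ?? [];
}

function sortByLastUpdated(array $plugins): array
{
    usort($plugins, function (array $a, array $b): int {
        // Newest first; dates that cannot be parsed sort last.
        return (strtotime($b['last_updated']) ?: 0) <=> (strtotime($a['last_updated']) ?: 0);
    });
    return $plugins;
}

// Made-up records; the date format returned by the real API may differ.
$sample = [
    ['slug' => 'wp-job-manager',        'last_updated' => '2020-05-01 10:00:00'],
    ['slug' => 'ninja-forms',           'last_updated' => '2020-05-07 21:02:00'],
    ['slug' => 'participants-database', 'last_updated' => '2020-04-20 08:30:00'],
];

$sorted = sortByLastUpdated($sample);
echo implode("\n", array_column($sorted, 'slug')), "\n";
```

With network access, `sortByLastUpdated(fetchPlugins())` would apply the same ordering to live data, which avoids scraping the HTML pages at all.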

Source: https://stackoverflow.com/questions/61679425/parse-and-process-html-in-php-fetching-wordpress-plugin-metadata-with-a-scraper

Tags

php

wordpress

curl

web-scraping