Question
How do I parse and process HTML in PHP? A simple scraper
I'm currently working on a parser in PHP that builds a small preview of a page from a URL given by the user. I'd like to retrieve only the title of the page and a small chunk of information (a bit of text).
The project: I want to build a list of meta-data for popular WordPress plugins (cf. https://de.wordpress.org/plugins/browse/popular/), gathering the first 50 URLs, i.e. the 50 plugins that are of interest. The challenge is that I want to fetch the meta-data of all the existing plugins. After the fetch, I want to filter for the plugins with the newest timestamp, that is, those updated most recently. It is all about being up to date.
https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database
... and so on.
To take one page as an example, fetching the meta-data of a single WordPress plugin: I think simple_html_dom (http://simplehtmldom.sourceforge.net/) offers an appropriate way to do this without any other external libraries or classes. So far I've also tried the DOMDocument class (http://docs.php.net/manual/en/domdocument.loadhtml.php), loading the HTML and displaying it on the screen, and now I am wondering about the proper way to do it. I am leaning towards simple_html_dom because it should make this very easy. Here is my attempt at pulling the title and the meta text (description):
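<?php
require 'simple_html_dom.php';

// Fetch and parse the plugin page; file_get_html() returns false on failure.
$html = file_get_html('https://wordpress.org/plugins/wp-job-manager/');
if (!$html) {
    exit("Could not load the page\n");
}
// find($selector, 0) returns the first match, or null when nothing matches.
$title = $html->find('h1.plugin-title', 0);
$text  = $html->find('.entry-meta', 0);
echo ($title ? $title->plaintext : '(no title found)') . "<br>\n";
echo ($text ? $text->plaintext : '(no meta text found)') . "<br>\n";
?>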
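For comparison, the same extraction can be sketched with PHP's built-in DOMDocument and DOMXPath, which needs no external library. This is only a minimal sketch: it runs against an inline sample string rather than the live page, the class names plugin-title and entry-meta are taken from the attempt above and may not match the real markup, and the helper names firstTextByClass and loadXPath are made up for illustration.

```php
<?php
// Sample markup standing in for a fetched plugin page (an assumption:
// the real page may use different tags and class names).
$sampleHtml = <<<HTML
<html><body>
  <h1 class="plugin-title">WP Job Manager</h1>
  <div class="entry-meta">Version: 1.9.5.12</div>
</body></html>
HTML;

/**
 * Parse an HTML string and return an XPath helper for it.
 */
function loadXPath(string $html): DOMXPath
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return new DOMXPath($doc);
}

/**
 * Return the trimmed text of the first element carrying $class, or null.
 */
function firstTextByClass(DOMXPath $xpath, string $class): ?string
{
    // Match the class as a whole word inside a space-separated class attribute.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$xpath = loadXPath($sampleHtml);
$title = firstTextByClass($xpath, 'plugin-title');
$meta  = firstTextByClass($xpath, 'entry-meta');
echo $title, "\n", $meta, "\n";
```

To use it on a real page, the sample string would be replaced by the result of file_get_contents() on the plugin URL, with a check for a false return value.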
See the page source of https://wordpress.org/plugins/wp-job-manager/. Each WordPress plugin page provides the following set of meta-data:
Version: 1.9.5.12
Installations: 10,000+
WordPress Version: 5.0 or higher
Tested up to: 5.4
PHP Version: 5.6 or higher
Tags: database member sign-up form volunteer
Last updated: 19 hours ago
plugin-ratings
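Once the meta-data block has been extracted as plain text, the "Label: value" lines above can be turned into an array, and a relative phrase such as "19 hours ago" can be converted into a timestamp for sorting. The following is a rough sketch under stated assumptions: the function names parseMetaLines and relativeToTimestamp are hypothetical, the sample text is copied from the list above, and the reference time $now is an arbitrary fixed value so the result is reproducible.

```php
<?php
date_default_timezone_set('UTC');

/**
 * Split "Label: value" lines into an associative array keyed by label.
 */
function parseMetaLines(string $text): array
{
    $meta = [];
    foreach (preg_split('/\R/', $text) as $line) {
        // Split on the first colon only, so values may contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $meta[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $meta;
}

/**
 * Convert a relative phrase such as "19 hours ago" to a Unix timestamp.
 * Returns null when strtotime() cannot interpret the phrase.
 */
function relativeToTimestamp(string $phrase, int $now): ?int
{
    $ts = strtotime($phrase, $now);
    return $ts === false ? null : $ts;
}

$sample = <<<TXT
Version: 1.9.5.12
Tested up to: 5.4
Last updated: 19 hours ago
TXT;

$meta = parseMetaLines($sample);
$now = 1589000000; // arbitrary fixed reference time (assumption, for reproducibility)
$updatedAt = relativeToTimestamp($meta['Last updated'], $now);
echo $meta['Version'], ' updated at ', gmdate('c', $updatedAt), "\n";
```

Relative phrases lose precision ("19 hours ago" is only accurate to the hour), so an absolute timestamp is preferable for sorting whenever the page or an API provides one.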
The project consists of two parts. The looping part: looping over https://de.wordpress.org/plugins/browse/popular/ and gathering approximately 50 to 80 URLs (which seems to be pretty straightforward). The parser part: this is where I have some issues, namely extracting the data for the tags and the plugin rating properly.
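The looping part could be sketched as follows: collect every link on the listing page whose URL has the form .../plugins/<slug>/ and de-duplicate by slug. This is an illustration only. The function name extractPluginUrls is hypothetical, the sample listing markup is invented, and the real listing page may structure its links differently.

```php
<?php
/**
 * Collect unique plugin page URLs from listing-page HTML.
 * Returns an array keyed by plugin slug.
 */
function extractPluginUrls(string $html, int $limit = 50): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Keep only links of the form .../plugins/<slug>/ on wordpress.org.
    $pattern = '#^https?://(?:[a-z]+\.)?wordpress\.org/plugins/([a-z0-9-]+)/?$#';
    $urls = [];
    foreach ((new DOMXPath($doc))->query('//a[@href]') as $link) {
        if (!preg_match($pattern, $link->getAttribute('href'), $m)) {
            continue;
        }
        $slug = $m[1];
        if ($slug === 'browse') {
            continue; // navigation link, not a plugin
        }
        // Normalise to the English URL form; the first occurrence wins.
        $urls[$slug] = $urls[$slug] ?? "https://wordpress.org/plugins/$slug/";
        if (count($urls) >= $limit) {
            break;
        }
    }
    return $urls;
}

// Invented sample listing (assumption: real markup will differ).
$listing = <<<HTML
<ul>
  <li><a href="https://de.wordpress.org/plugins/wp-job-manager/">WP Job Manager</a></li>
  <li><a href="https://de.wordpress.org/plugins/ninja-forms/">Ninja Forms</a></li>
  <li><a href="https://de.wordpress.org/plugins/ninja-forms/">Ninja Forms (duplicate link)</a></li>
  <li><a href="https://de.wordpress.org/plugins/browse/popular/page/2/">Next</a></li>
</ul>
HTML;

$urls = extractPluginUrls($listing);
print_r($urls);
```

To reach 50 to 80 plugins, the same function would be applied to each listing page in turn and the results merged until the desired count is reached.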
Update: the WordPress plugins API can help here. A good approach is described in "Getting a list of ALL plugins".
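A rough sketch of the API route, which avoids HTML parsing altogether: build a query_plugins request, decode the JSON, and sort by last update. The endpoint https://api.wordpress.org/plugins/info/1.2/ and the field names slug, last_updated, rating and tags are assumptions that should be checked against a live response. The function names are hypothetical, and the sample JSON values are invented purely for illustration.

```php
<?php
/**
 * Build a query_plugins URL for the wordpress.org plugins API.
 */
function buildPluginQueryUrl(string $browse = 'popular', int $perPage = 50, int $page = 1): string
{
    $query = http_build_query([
        'action'  => 'query_plugins',
        'request' => ['browse' => $browse, 'per_page' => $perPage, 'page' => $page],
    ]);
    return 'https://api.wordpress.org/plugins/info/1.2/?' . $query;
}

/**
 * Decode an API response and return its plugins sorted newest-first by last_updated.
 */
function pluginsByLastUpdated(string $json): array
{
    $data = json_decode($json, true);
    $plugins = $data['plugins'] ?? [];
    usort($plugins, function (array $a, array $b): int {
        return strtotime($b['last_updated']) <=> strtotime($a['last_updated']);
    });
    return $plugins;
}

// Invented sample payload (assumption: real responses carry many more fields).
$sampleJson = '{"plugins":[
  {"slug":"wp-job-manager","last_updated":"2020-05-01 9:00am GMT","rating":90,"tags":{"jobs":"jobs"}},
  {"slug":"ninja-forms","last_updated":"2020-05-08 3:14pm GMT","rating":88,"tags":{"forms":"forms"}}
]}';

// A live request would look like this (not executed in this sketch):
// $json = file_get_contents(buildPluginQueryUrl('popular', 50));

$sorted = pluginsByLastUpdated($sampleJson);
foreach ($sorted as $plugin) {
    echo $plugin['slug'], ' | ', $plugin['last_updated'], ' | rating ', $plugin['rating'],
        ' | tags: ', implode(', ', $plugin['tags']), "\n";
}
```

Because the tags and the rating arrive as structured fields, this route sidesteps the parser issues mentioned above, at the cost of depending on the API's response format.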
Source: https://stackoverflow.com/questions/61679425/parse-and-process-html-in-php-fetching-wordpress-plugin-metadata-with-a-scraper