I use YQL to fetch some HTML pages and read information out of them. Since today I get the error message "html table is no longer supported. See https://policies.yahoo.com/us/en/yahoo/terms/product-atos/yql/index.htm for YQL Terms of Use"
Example in the console: https://developer.yahoo.com/yql/console/#h=select+*+from+html+where+url%3D%22http%3A%2F%2Fwww.google.de%22
Did Yahoo stop this service? Does anybody know of an announcement from Yahoo? I am wondering whether this is simply a bug or whether they really stopped this service...
All documentation is still there (html scraping): https://developer.yahoo.com/yql/guide/yql-select-xpath.html , https://developer.yahoo.com/yql/
A while ago I posted in a YQL forum run by Yahoo, but it no longer exists (or at least I cannot find it). How can you contact Yahoo to find out whether this service really stopped?
Best regards, hebr3
It looks like Yahoo did indeed end their support of the html library as of 6/8/2017 (according to my error logs). There doesn't appear to be any official announcement of it yet.
Luckily, there is a YQL community library that can be used in place of the official html library with few changes to your codebase. See the htmlstring table in the YQL Console.
Change your YQL query to reference htmlstring instead of html and include the community environment in your REST query. For example:
/*/ Old code /*/
var site = "http://www.test.com/foo.html";
var yql = "select * from html where url='" + site + "' AND xpath='//div'";
var resturl = "https://query.yahooapis.com/v1/public/yql?q="
+ encodeURIComponent(yql) + "&format=json";
/*/ New code /*/
var site = "http://www.test.com/foo.html";
var yql = "select * from htmlstring where url='" + site + "' AND xpath='//div'";
var resturl = "https://query.yahooapis.com/v1/public/yql?q="
+ encodeURIComponent(yql) + "&format=json"
+ "&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys";
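The change above can be wrapped in a small helper so the query and the REST URL are built in one place. This is only a sketch: the function name buildHtmlstringUrl is made up for illustration, and rejecting values that contain a single quote is my own assumption (it avoids guessing at YQL escaping rules), not something documented by Yahoo.

```javascript
// Hypothetical helper (not part of any YQL library): builds the REST URL
// for a query against the community htmlstring table, including the
// community environment parameter shown in the answer above.
function buildHtmlstringUrl(site, xpath) {
  // Assumption: values are embedded in single quotes, so a single quote
  // inside them would break the query. Reject instead of guessing.
  if (site.indexOf("'") !== -1 || xpath.indexOf("'") !== -1) {
    throw new Error("site and xpath must not contain single quotes");
  }
  var yql = "select * from htmlstring where url='" + site +
            "' AND xpath='" + xpath + "'";
  return "https://query.yahooapis.com/v1/public/yql?q=" +
         encodeURIComponent(yql) +
         "&format=json" +
         "&env=" + encodeURIComponent("store://datatables.org/alltableswithkeys");
}

// Usage: same site and XPath as in the example above.
var resturl = buildHtmlstringUrl("http://www.test.com/foo.html", "//div");
```

An XPath that needs quoted strings can use double quotes inside the expression so it still fits between the single quotes of the query.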
Thank you very much for your code.
It helped me to create my own script to read those pages which I need. I never programmed PHP before, but with your code and the wisdom of the internet I could change your script to my needs.
PHP
<?php
header('Access-Control-Allow-Origin: *'); //all
$url = $_GET['url'];
// prefix check that does not depend on a hard-coded length
if (strpos($url, "https://www.xxxx.yy") !== 0) {
echo "Only https://www.xxxx.yy allowed!";
return;
}
$xpathQuery = $_GET['xpath'];
// needs stricter security checks; this is only a basic one
function check($target_url){
$check = curl_init();
//curl_setopt( $check, CURLOPT_HTTPHEADER, array("REMOTE_ADDR: $ip", "HTTP_X_FORWARDED_FOR: $ip"));
//curl_setopt($check, CURLOPT_INTERFACE, "xxx.xxx.xxx.xxx");
curl_setopt($check, CURLOPT_COOKIEJAR, 'cookiemon.txt');
curl_setopt($check, CURLOPT_COOKIEFILE, 'cookiemon.txt');
curl_setopt($check, CURLOPT_TIMEOUT, 40); // CURLOPT_TIMEOUT is in seconds, not milliseconds
curl_setopt($check, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($check, CURLOPT_URL, $target_url);
curl_setopt($check, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($check, CURLOPT_FOLLOWLOCATION, false);
$tmp = curl_exec ($check);
curl_close ($check);
return $tmp;
}
// get html
$html = check($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
// apply xpath filter
$xpath = new DOMXPath($dom);
$elements = $xpath->query($xpathQuery);
$temp_dom = new DOMDocument();
foreach($elements as $n) $temp_dom->appendChild($temp_dom->importNode($n,true));
$renderedHtml = $temp_dom->saveHTML();
// return html in json response
// json structure:
// {html: "xxxx"}
$post_data = array(
'html' => $renderedHtml
);
echo json_encode($post_data);
?>
Javascript
$.ajax({
url: "url of service",
dataType: "json",
data: { url: url,
xpath: "//*"
},
type: 'GET',
success: function(data) {
// data.html holds the HTML matched by the XPath query
},
error: function(jqXHR) {
// handle the failed request here
}
});
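For anyone calling the PHP script without jQuery, the request URL and the response handling can be sketched as two small functions. The names buildProxyUrl and extractHtml and the address https://example.org/proxy.php are placeholders I made up; the only things taken from the script above are the url and xpath GET parameters and the {html: "xxxx"} response shape.

```javascript
// Hypothetical helper: builds the GET URL for the PHP proxy above.
// serviceUrl is wherever the script is hosted (an assumption here).
function buildProxyUrl(serviceUrl, pageUrl, xpath) {
  // URLSearchParams takes care of percent-encoding both values.
  var params = new URLSearchParams({ url: pageUrl, xpath: xpath });
  return serviceUrl + "?" + params.toString();
}

// Hypothetical helper: pulls the rendered HTML out of the JSON response.
// The script returns a body shaped like {"html": "..."}.
function extractHtml(responseText) {
  var payload = JSON.parse(responseText);
  if (!payload || typeof payload.html !== "string") {
    throw new Error("unexpected response: no html field");
  }
  return payload.html;
}

// Usage with placeholder values:
var requestUrl = buildProxyUrl(
  "https://example.org/proxy.php",
  "https://www.xxxx.yy/page.html",
  "//div"
);
var html = extractHtml('{"html":"<div>hello</div>"}');
```

Keeping the URL building separate from the network call makes it easy to check what is actually sent to the service.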
Even though YQL does not support the html table anymore, I've come to realize that instead of making one network call and parsing out the results it's possible to make several calls. For example, my call before would look like this:
select html from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"
That used to return all the information I needed in a single response.
Now I'd have to use these two:
select title from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"
select description from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"
...to get what I want. I don't know why they would deprecate something like this without a clearly listed fallback, but you should be able to get your data this way.
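If you end up making one call per field, the REST URLs can be generated from a list of field names. This is a rough sketch: buildRssFieldUrls is a made-up name, and the endpoint and format=json parameter are borrowed from the other answer here rather than from anything I have verified for the rss table.

```javascript
// Hypothetical helper: returns one YQL REST URL per requested field,
// mirroring the two separate queries shown above.
function buildRssFieldUrls(feedUrl, fields) {
  return fields.map(function (field) {
    var yql = 'select ' + field + ' from rss where url="' + feedUrl + '"';
    return "https://query.yahooapis.com/v1/public/yql?q=" +
           encodeURIComponent(yql) + "&format=json";
  });
}

// Usage: one URL for title, one for description.
var urls = buildRssFieldUrls(
  "http://w1.weather.gov/xml/current_obs/KFLL.rss",
  ["title", "description"]
);
```

Each URL can then be requested on its own and the results merged on the client.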
I recently built an open source tool called CloudQuery (source code) that provides functionality similar to YQL. It can turn most websites into an API with a few clicks.
Source: https://stackoverflow.com/questions/44431212/yql-html-table-is-no-longer-supported