Question
When I input
=IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","//h2")
in my Google Sheet, I get: #N/A Imported content is empty.
However, when I input:
=IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","*")
I get some content, so I can presume that access to the page is not blocked.
The page also contains several h2 tags, without any doubt.
So what's the issue?
Answer 1:
- You want to know the reason for the following behavior:
=IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","//h2")
returns #N/A Imported content is empty.
=IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","*")
returns the content.
If my understanding is correct, how about this answer?
Issue:
When I checked the HTML of http://www.ilgiornale.it/autore/franco-battaglia.html, I found the following problematic line:
window.jQuery || document.write("<script src='/sites/all/modules/jquery_update/replace/jquery/jquery.min.js'>\x3C/script>")
Here, the closing script tag inside the string is written as \x3C/script> instead of </script>. It seems that when IMPORTXML parses this line, it treats the script tag as never being closed. I could confirm that when \x3C is replaced with <, =IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","//h2") correctly returns the values of the h2 tags.
This appears to be why =IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","//h2") returns #N/A Imported content is empty.
As for why =IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","*") returns content: when I used this formula, the result did not include the content of the script tag. That made me suspect the script tag, which is how I found the problem above. I could confirm that when \x3C is replaced with <, =IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","*") returns values that include the content of the script tag.
Workarounds:
To avoid the issue above, \x3C has to be replaced with <. So how about the following workarounds? Both use Google Apps Script. Please think of them as just two of several possible workarounds.
Pattern 1:
In this pattern, the HTML is first downloaded from the URL and the problematic part is fixed. The modified HTML is then saved as a file on Google Drive, the file is shared, and its download URL is retrieved. IMPORTXML is then used with this URL to retrieve the values.
Sample script:
function myFunction() {
  var url = "http://www.ilgiornale.it/autore/franco-battaglia.html";
  // Download the HTML and replace the literal "\x3C" with "<".
  var data = UrlFetchApp.fetch(url).getContentText().replace(/\\x3C/g, "<");
  // Save the modified HTML to Drive and let anyone with the link view it.
  var file = DriveApp.createFile("htmlData.html", data, MimeType.HTML);
  file.setSharing(DriveApp.Access.ANYONE_WITH_LINK, DriveApp.Permission.VIEW);
  // Direct-download URL that is passed to IMPORTXML.
  var endpoint = "https://drive.google.com/uc?id=" + file.getId() + "&export=download";
  Logger.log(endpoint);
}
- To use this script, first run myFunction() and copy the endpoint from the log. As a test, put the endpoint in cell "A1" and put =IMPORTXML(A1,"//h2") in cell "A2". The values are then retrieved.
Pattern 2:
In this pattern, the values of the h2 tags are retrieved directly by parsing the HTML and are then written to the active Spreadsheet.
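function myFunction() {
  var url = "http://www.ilgiornale.it/autore/franco-battaglia.html";
  // Extract every <h2>...</h2> block from the raw HTML; match() returns null if none is found.
  var data = UrlFetchApp.fetch(url).getContentText().match(/<h2[\s\S]+?<\/h2>/g);
  if (!data) return;
  // Wrap the blocks in a single root element so that XmlService can parse them.
  var xml = XmlService.parse("<temp>" + data.join("") + "</temp>");
  // One row per h2, because setValues() expects a two-dimensional array.
  var h2Values = xml.getRootElement().getChildren("h2").map(function(e) {return [e.getValue()]});
  var sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange(sheet.getLastRow() + 1, 1, h2Values.length, 1).setValues(h2Values);
  Logger.log(h2Values);
}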
- When you run the script, the values of the h2 tags are written directly to the active Spreadsheet.
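The extraction step of Pattern 2 can also be checked without Apps Script. The sketch below is plain JavaScript under stated assumptions: the helper name extractH2Texts and the sample HTML are hypothetical, and a simple tag-stripping regex stands in for XmlService, so it is not the same parsing as in the script above.

```javascript
// Hypothetical helper: collect the text of every h2 element in an HTML string.
function extractH2Texts(html) {
  // Same non-greedy pattern as in Pattern 2. match() returns null when
  // nothing is found, so fall back to an empty array.
  var blocks = html.match(/<h2[\s\S]+?<\/h2>/g) || [];
  return blocks.map(function (block) {
    // Remove all tags and trim whitespace. This is cruder than XmlService:
    // entities such as &amp; are left undecoded.
    return block.replace(/<[^>]+>/g, "").trim();
  });
}

// Made-up sample with nested markup and a multi-line heading.
var sampleHtml = "<div><h2 class='t'><a href='#'>First title</a></h2>" +
                 "<p>text</p><h2>\n  Second title\n</h2></div>";

console.log(extractH2Texts(sampleHtml));
```

To write such a result to a sheet with setValues(), each string would still need to be wrapped in its own array, as the map() call in Pattern 2 does.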
References:
- Class UrlFetchApp
- Class XmlService
If I misunderstood your question and this was not the direction you wanted, I apologize.
Source: https://stackoverflow.com/questions/58049531/another-importxml-returning-empty-content