Another IMPORTXML returning empty content

Question

When I input

=IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","//h2")

in my Google Sheet, I get: #N/A Imported content is empty.

However, when I input:

=IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","*")

I get some content, so I presume that access to the page is not blocked.

The page definitely contains several h2 tags.

So what's the issue?

Answer 1:

You want to know the reason for the following behavior.
- =IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","//h2") returns #N/A Imported content is empty.
- =IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","*") returns the content.

If my understanding is correct, how about this answer?

Issue:

When I looked at the HTML of http://www.ilgiornale.it/autore/franco-battaglia.html, I noticed a problem in the following line.

window.jQuery || document.write("<script src='/sites/all/modules/jquery_update/replace/jquery/jquery.min.js'>\x3C/script>")

Here, the closing script tag is written as \x3C/script> instead of </script>. It seems that when IMPORTXML parses this line, it treats the script tag as never closed. I confirmed that when \x3C is converted to <, =IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","//h2") correctly returns the values of the h2 tags.

This appears to be why =IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","//h2") returns #N/A Imported content is empty.

As for why =IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","*") returns content: when I tried this formula, the result did not include the contents of the script tags. That made me suspect the script tag, which is how I found the problem above. I confirmed that when \x3C is converted to <, =IMPORTXML("http://www.ilgiornale.it/autore/franco-battaglia.html","*") returns values that include the contents of the script tags.
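
The effect of the escaped closing tag can be reproduced on a plain string, without fetching the page. The following is only a minimal sketch: the helper name unescapeAngleBrackets is hypothetical, and it assumes the raw HTML contains the literal four characters \x3C, as in the line quoted above.

```javascript
// Hypothetical helper (not part of any library): replace the literal text
// "\x3C" with "<" so that the closing script tag becomes visible to a parser.
function unescapeAngleBrackets(html) {
  // /\\x3C/g matches a backslash followed by "x3C", not the "<" character itself.
  return html.replace(/\\x3C/g, "<");
}

// The line from the page. String.raw keeps the backslash as a literal character.
var line = String.raw`window.jQuery || document.write("<script src='/sites/all/modules/jquery_update/replace/jquery/jquery.min.js'>\x3C/script>")`;

console.log(line.includes("</script>"));                        // false
console.log(unescapeAngleBrackets(line).includes("</script>")); // true
```

Note that the replacement is only meant to help the parser. The page itself presumably uses \x3C on purpose, so that the inline script is not terminated early in a browser.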

Workarounds:

To avoid this issue, \x3C has to be converted to <. How about the following workarounds? Both use Google Apps Script. Please think of them as just two of several possible workarounds.

Pattern 1:

In this pattern, the HTML is first downloaded from the URL and the problem is corrected. The corrected HTML is then saved as a file, the file is shared, and its URL is retrieved. The values are retrieved using this URL.

Sample script:

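function myFunction() {
  var url = "http://www.ilgiornale.it/autore/franco-battaglia.html";
  // Download the page and turn the literal text "\x3C" into "<" so that the script tag is closed.
  var data = UrlFetchApp.fetch(url).getContentText().replace(/\\x3C/g, "<");
  // Save the corrected HTML to Drive and make it viewable by anyone with the link.
  var file = DriveApp.createFile("htmlData.html", data, MimeType.HTML);
  file.setSharing(DriveApp.Access.ANYONE_WITH_LINK, DriveApp.Permission.VIEW);
  // Direct-download URL of the file, to be used as the source for IMPORTXML.
  var endpoint = "https://drive.google.com/uc?id=" + file.getId() + "&export=download";
  Logger.log(endpoint);
}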

To use this script, first run myFunction() and retrieve the endpoint from the log. Then, as a test case, put the endpoint in cell "A1" and put =IMPORTXML(A1,"//h2") in cell "A2". This retrieves the values.

Pattern 2:

In this pattern, the values of the h2 tags are retrieved directly by parsing the HTML and are written to the active Spreadsheet.

Sample script:

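function myFunction() {
  var url = "http://www.ilgiornale.it/autore/franco-battaglia.html";
  // Extract every <h2>...</h2> block from the raw HTML. match() returns null when nothing is found.
  var data = UrlFetchApp.fetch(url).getContentText().match(/<h2[\s\S]+?<\/h2>/g);
  if (!data) return;
  // Wrap the blocks in a single root element so that XmlService can parse them.
  var xml = XmlService.parse("<temp>" + data.join("") + "</temp>");
  var h2Values = xml.getRootElement().getChildren("h2").map(function(e) {return [e.getValue()]});
  // Append the values below the last row of the active sheet, in column A.
  var sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange(sheet.getLastRow() + 1, 1, h2Values.length, 1).setValues(h2Values);
  Logger.log(h2Values);
}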

When you run the script, the values of the h2 tags are written directly to the active Spreadsheet.
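
The extraction step of Pattern 2 can also be tried on a plain string. The sketch below is a variation, not the method of the sample script: instead of XmlService it strips tags with regular expressions, which may be more forgiving if an h2 block is not well-formed XML. The helper name extractH2Rows and the sample HTML are hypothetical, and HTML entities such as &amp; are left undecoded.

```javascript
// Hypothetical helper: return the text of every h2 element as rows
// ([[text], [text], ...]), the two-dimensional shape that setValues expects.
function extractH2Rows(html) {
  // Non-greedy match of each <h2 ...>...</h2> block; fall back to an empty list.
  var blocks = html.match(/<h2[\s>][\s\S]*?<\/h2>/g) || [];
  return blocks.map(function (block) {
    var text = block
      .replace(/<[^>]+>/g, "") // drop the h2 tags and any nested tags
      .replace(/\s+/g, " ")    // collapse line breaks and runs of spaces
      .trim();
    return [text];
  });
}

// Made-up sample input for illustration only.
var sample = "<h2 class='t'><a href='/a'>First\n  title</a></h2><p>x</p><h2>Second</h2>";
console.log(JSON.stringify(extractH2Rows(sample))); // [["First title"],["Second"]]
```

In Apps Script, the string returned by UrlFetchApp.fetch(url).getContentText() could be passed to such a helper in place of the sample.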

References:

Class UrlFetchApp
Class XmlService

If I misunderstood your question and this was not the direction you want, I apologize.

Source: https://stackoverflow.com/questions/58049531/another-importxml-returning-empty-content

Tags

google-sheets-importxml