How to parse an HTML string in Google Apps Script without using XmlService?

Back end · open · 8 answers · 1839 views
清歌不尽 2020-12-14 09:12

I want to create a scraper using Google Spreadsheets with Google Apps Script. I know it is possible and I have seen some tutorials and threads about it.

The main idea

8 Answers
  • 2020-12-14 09:23

    I made cheeriogs for this problem. It works on GAS like cheerio, which provides a jQuery-like API. You can use it like this:

    const content = UrlFetchApp.fetch('https://example.co/').getContentText();
    const $ = Cheerio.load(content);
    Logger.log($('p .blah').first().text()); // blah blah blah ...
    

    See also https://github.com/asciian/cheeriogs

  • 2020-12-14 09:25

    Maybe not the cleanest approach, but simple string processing does the job without XmlService:

    var url = 'https://somewebsite.com/?q=00:11:22:33:44:55';
    var html = UrlFetchApp.fetch(url).getContentText();
    // we want only the link text displayed from here:
    //<td><a href="/company/ubiquiti-networks-inc">Ubiquiti Networks Inc.</a></td>
    var string1 = html.split('<td><a href="/company/')[1]; // all after '<td><a href="/company/'
    var string2 = string1.split('</a></td>')[0];           // all before '</a></td>'
    var string3 = string2.split('>')[1];                   // all after '>'
    Logger.log('link text: '+string3);                     // string3 => "Ubiquiti Networks Inc."
    
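    The chained split() calls above can also be collapsed into a single regular expression. A minimal sketch in plain JavaScript (no Apps Script services are needed for the extraction step itself, so it can be tested outside GAS); the `<td><a href="/company/...">...</a></td>` markup is the sample from this answer, and the helper name is made up:

    ```javascript
    // Hypothetical helper: pull the link text out of the
    // <td><a href="/company/...">...</a></td> pattern in one match.
    function extractCompanyName(html) {
      // Capture everything between the closing '>' of the <a> tag
      // and the '</a></td>' that ends the cell.
      var m = html.match(/<td><a href="\/company\/[^"]*">([^<]*)<\/a><\/td>/);
      return m ? m[1] : null; // null when the pattern is absent
    }
    ```

    Unlike the split() chain, a failed match returns null instead of throwing on an undefined intermediate value.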
  • 2020-12-14 09:32

    I've found a very neat alternative for scraping with Google Apps Script: PhantomJS Cloud. You can call its API via UrlFetchApp, which lets you execute jQuery code against the rendered page and makes life much simpler.

  • 2020-12-14 09:37

    Please be aware that certain websites may not permit automated scraping of their content, so consult their terms of service before using Apps Script to extract it.

    The XmlService only works against valid XML documents, and most HTML (especially HTML5) is not valid XML. A previous version of XmlService, simply called Xml, allowed "lenient" parsing, which let it parse HTML as well. That service was sunset in 2013 but, for the time being, still functions. The reference docs are no longer available, but this old tutorial shows its usage.

    Another alternative is to use a service like Kimono, which handles the scraping and parsing parts and provides a simple API you can call via UrlFetchApp to retrieve the structured data.

  • 2020-12-14 09:39

    I had some good luck today just by massaging the HTML:

    // close unclosed tags
    html = html.replace(/(<(?=link|meta|br|input)[^>]*)(?<!\/)>/ig, '$1/>')
    // force script / style content into cdata
    html = html.replace(/(<(script|style)[^>]*>)/ig, '$1<![CDATA[').replace(/(<\/(script|style)[^>]*>)/ig, ']]>$1')
    // change & to &amp;
    html = html.replace(/&(?!amp;)/g, '&amp;')
    // now it works! (tested with original url)
    let document = XmlService.parse(html)
    
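    The three replacements above are plain JavaScript string operations, so they can be sanity-checked outside Apps Script before handing the result to XmlService.parse. A minimal sketch wrapping them in a helper (the function name and the sample fragments are made up for illustration; note the (?<!\/) lookbehind requires a modern engine such as the Apps Script V8 runtime):

    ```javascript
    // Hypothetical wrapper around the three massaging steps from this answer.
    function massageHtml(html) {
      // 1. self-close void tags: <meta ...> becomes <meta .../>
      html = html.replace(/(<(?=link|meta|br|input)[^>]*)(?<!\/)>/ig, '$1/>');
      // 2. wrap script/style bodies in CDATA so a bare '<' inside them
      //    doesn't break the XML parser
      html = html.replace(/(<(script|style)[^>]*>)/ig, '$1<![CDATA[')
                 .replace(/(<\/(script|style)[^>]*>)/ig, ']]>$1');
      // 3. escape bare ampersands
      html = html.replace(/&(?!amp;)/g, '&amp;');
      return html;
    }
    ```

    For example, `massageHtml('<meta charset="utf-8"><p>a & b</p>')` yields `<meta charset="utf-8"/><p>a &amp; b</p>`, which XmlService can parse.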
  • 2020-12-14 09:44

    I have done this in vanilla JS. It's not real HTML parsing; it just pulls some content out of the fetched string:

    function getLKKBTC() {
      var url = 'https://www.lykke.com/exchange';
      var html = UrlFetchApp.fetch(url).getContentText();
      var searchstring = '<td class="ask_BTCLKK">';
      var index = html.search(searchstring);
      if (index >= 0) {
        var pos = index + searchstring.length;
        var rate = parseFloat(html.substring(pos, pos + 6));
        return 1 / rate;
      }
      throw new Error('Failed to fetch/parse data from ' + url);
    }
    
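    The fixed substring(pos, pos + 6) slice breaks if the rate is ever printed with a different number of characters. A hedged alternative sketch, assuming the same `<td class="ask_BTCLKK">` markup as above, captures the number with a regular expression instead (the helper name is made up):

    ```javascript
    // Hypothetical variant: match the whole number after the marker cell
    // rather than slicing a fixed 6 characters.
    function parseAskRate(html) {
      var m = html.match(/<td class="ask_BTCLKK">\s*([0-9]*\.?[0-9]+)/);
      // Return the inverted rate as in the original, or null if not found.
      return m ? 1 / parseFloat(m[1]) : null;
    }
    ```

    Returning null on a miss lets the caller decide whether to throw, retry, or fall back.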