How to parse an HTML string in Google Apps Script without using XmlService?

Back end · open · 8 answers · 1839 views
清歌不尽 2020-12-14 09:12

I want to create a scraper using Google Spreadsheets with Google Apps Script. I know it is possible and I have seen some tutorials and threads about it.

The main idea

8 Answers
  • 2020-12-14 09:23

    I made cheeriogs for this problem. It works on GAS like cheerio, which provides a jQuery-like API. You can use it like this:

    const content = UrlFetchApp.fetch('https://example.co/').getContentText();
    const $ = Cheerio.load(content);
    Logger.log($('p .blah').first().text()); // blah blah blah ...
    

    See also https://github.com/asciian/cheeriogs

  • 2020-12-14 09:25

    Maybe not the cleanest approach, but simple string processing does the job without XmlService:

    var url = 'https://somewebsite.com/?q=00:11:22:33:44:55';
    var html = UrlFetchApp.fetch(url).getContentText();
    // we want only the link text displayed from here:
    //<td><a href="/company/ubiquiti-networks-inc">Ubiquiti Networks Inc.</a></td>
    var string1 = html.split('<td><a href="/company/')[1]; // all after '<td><a href="/company/'
    var string2 = string1.split('</a></td>')[0];           // all before '</a></td>'
    var string3 = string2.split('>')[1];                   // all after '>'
    Logger.log('link text: '+string3);                     // string3 => "Ubiquiti Networks Inc."
    
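    The chained split() calls above can also be collapsed into a single regular expression. A minimal sketch in plain JavaScript (no Apps Script services are needed for the extraction step itself, so it can be tested outside GAS); the `<td><a href="/company/...">...</a></td>` markup is the sample from this answer, and the helper name is made up:

    ```javascript
    // Hypothetical helper: pull the link text out of the
    // <td><a href="/company/...">...</a></td> pattern in one match.
    function extractCompanyName(html) {
      // Capture everything between the closing '>' of the <a> tag
      // and the '</a></td>' that ends the cell.
      var m = html.match(/<td><a href="\/company\/[^"]*">([^<]*)<\/a><\/td>/);
      return m ? m[1] : null; // null when the pattern is absent
    }
    ```

    Unlike the split() chain, a failed match returns null instead of throwing on an undefined intermediate value.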
  • 2020-12-14 09:32

    I've found a very neat alternative for scraping with Google Apps Script: PhantomJS Cloud. You can call its API via UrlFetchApp, which lets you execute jQuery code against the rendered page and makes life much simpler.

  • 2020-12-14 09:37

    Please be aware that certain websites may not permit automated scraping of their content, so consult their terms of service before using Apps Script to extract it.

    The XmlService only works against valid XML documents, and most HTML (especially HTML5) is not valid XML. A previous version of XmlService, simply called Xml, allowed "lenient" parsing, which let it parse HTML as well. That service was sunset in 2013 but, for the time being, still functions. The reference docs are no longer available, but this old tutorial shows its usage.

    Another alternative is to use a service like Kimono, which handles the scraping and parsing parts and provides a simple API you can call via UrlFetchApp to retrieve the structured data.

  • 2020-12-14 09:39

    I had some good luck today just by massaging the HTML:

    // close unclosed tags
    html = html.replace(/(<(?=link|meta|br|input)[^>]*)(?<!\/)>/ig, '$1/>')
    // force script / style content into cdata
    html = html.replace(/(<(script|style)[^>]*>)/ig, '$1<![CDATA[').replace(/(<\/(script|style)[^>]*>)/ig, ']]>$1')
    // change & to &amp;
    html = html.replace(/&(?!amp;)/g, '&amp;')
    // now it works! (tested with original url)
    let document = XmlService.parse(html)
    
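    The three replacements above are plain JavaScript string operations, so they can be sanity-checked outside Apps Script before handing the result to XmlService.parse. A minimal sketch wrapping them in a helper (the function name and the sample fragments are made up for illustration; note the (?<!\/) lookbehind requires a modern engine such as the Apps Script V8 runtime):

    ```javascript
    // Hypothetical wrapper around the three massaging steps from this answer.
    function massageHtml(html) {
      // 1. self-close void tags: <meta ...> becomes <meta .../>
      html = html.replace(/(<(?=link|meta|br|input)[^>]*)(?<!\/)>/ig, '$1/>');
      // 2. wrap script/style bodies in CDATA so a bare '<' inside them
      //    doesn't break the XML parser
      html = html.replace(/(<(script|style)[^>]*>)/ig, '$1<![CDATA[')
                 .replace(/(<\/(script|style)[^>]*>)/ig, ']]>$1');
      // 3. escape bare ampersands
      html = html.replace(/&(?!amp;)/g, '&amp;');
      return html;
    }
    ```

    For example, `massageHtml('<meta charset="utf-8"><p>a & b</p>')` yields `<meta charset="utf-8"/><p>a &amp; b</p>`, which XmlService can parse.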
  • 2020-12-14 09:44

    I have done this in vanilla JS. It's not real HTML parsing; it just pulls some content out of the fetched string:

    function getLKKBTC() {
      var url = 'https://www.lykke.com/exchange';
      var html = UrlFetchApp.fetch(url).getContentText();
      var searchstring = '<td class="ask_BTCLKK">';
      var index = html.search(searchstring);
      if (index >= 0) {
        var pos = index + searchstring.length;
        var rate = parseFloat(html.substring(pos, pos + 6));
        return 1 / rate;
      }
      throw new Error('Failed to fetch/parse data from ' + url);
    }
    
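    The fixed substring(pos, pos + 6) slice breaks if the rate is ever printed with a different number of characters. A hedged alternative sketch, assuming the same `<td class="ask_BTCLKK">` markup as above, captures the number with a regular expression instead (the helper name is made up):

    ```javascript
    // Hypothetical variant: match the whole number after the marker cell
    // rather than slicing a fixed 6 characters.
    function parseAskRate(html) {
      var m = html.match(/<td class="ask_BTCLKK">\s*([0-9]*\.?[0-9]+)/);
      // Return the inverted rate as in the original, or null if not found.
      return m ? 1 / parseFloat(m[1]) : null;
    }
    ```

    Returning null on a miss lets the caller decide whether to throw, retry, or fall back.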