How to extract/parse HTML using Microdata

自作多情 submitted on 2019-11-30 10:25:26

Try starting at each root itemscope node, collecting its descendant elements that have itemprop attributes, and returning a result object whose items array holds the Microdata items.

This solution is based on the algorithm in the Microdata specification:

7 Converting HTML to other formats

7.1 JSON

Given a list of nodes nodes in a Document, a user agent must run the following algorithm to extract the microdata from those nodes into a JSON form:

Let result be an empty object.

Let items be an empty array.

For each node in nodes, check if the element is a top-level microdata item, and if it is then get the object for that element and add it to items.

Add an entry to result called "items" whose value is the array items.

Return the result of serializing result to JSON in the shortest possible way (meaning no whitespace between tokens, no unnecessary zero digits in numbers, and only using Unicode escapes in strings for characters that do not have a dedicated escape sequence), and with a lowercase "e" used, when appropriate, in the representation of any numbers. [JSON]

This algorithm returns an object with a single property that is an array, instead of just returning an array, so that it is possible to extend the algorithm in the future if necessary.
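
As a rough sketch of the outer steps above, nodes can be modeled as plain objects instead of DOM elements. The `extractMicrodata` name, the `isTopLevelItem` flag, and the `getObject` callback are assumptions made for illustration; none of them come from the specification.

```javascript
// Sketch of the outer algorithm in 7.1 on plain objects (not the DOM).
// `isTopLevelItem` and `getObject` are hypothetical stand-ins for the spec's
// "top-level microdata item" check and "get the object" steps.
function extractMicrodata(nodes, getObject) {
  const result = {};
  const items = [];
  for (const node of nodes) {
    if (node.isTopLevelItem) {
      items.push(getObject(node));
    }
  }
  result.items = items;
  // Without an indent argument, JSON.stringify emits no whitespace between tokens.
  return JSON.stringify(result);
}

const nodes = [
  { isTopLevelItem: true, type: "http://schema.org/Offer" },
  { isTopLevelItem: false }
];
const json = extractMicrodata(nodes, (node) => ({
  type: [node.type],
  properties: {}
}));
console.log(json);
```

Wrapping the array in an object with a single "items" entry mirrors the note above: further keys can be added later without changing the shape callers rely on.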

When the user agent is to get the object for an item item, optionally with a list of elements memory, it must run the following substeps:

Let result be an empty object.

If no memory was passed to the algorithm, let memory be an empty list.

Add item to memory.

If the item has any item types, add an entry to result called "type" whose value is an array listing the item types of item, in the order they were specified on the itemtype attribute.

If the item has a global identifier, add an entry to result called "id" whose value is the global identifier of item.

Let properties be an empty object.

For each element element that has one or more property names and is one of the properties of the item item, in the order those elements are given by the algorithm that returns the properties of an item, run the following substeps:

Let value be the property value of element.

If value is an item, then: If value is in memory, then let value be the string "ERROR". Otherwise, get the object for value, passing a copy of memory, and then replace value with the object returned from those steps.

For each name name in element's property names, run the following substeps:

If there is no entry named name in properties, then add an entry named name to properties whose value is an empty array.

Append value to the entry named name in properties.

Add an entry to result called "properties" whose value is the object properties.

Return result.
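
The "get the object" substeps, including the memory list that guards against cycles, can be sketched without a DOM as follows. The item shape used here (`types`, `id`, and `props` entries carrying `names` and `value`) is an assumption chosen for the example, not something the specification defines.

```javascript
// Sketch of "get the object" on a plain-object model (not the DOM).
// Assumed item shape: { types: [...], id: "...", props: [{ names: [...], value }] }
// where value is either a string or another item object.
function isItem(value) {
  return typeof value === "object" && value !== null && Array.isArray(value.props);
}

function getObject(item, memory = []) {
  const result = {};
  memory.push(item);
  if (item.types && item.types.length > 0) {
    result.type = item.types.slice();
  }
  if (item.id) {
    result.id = item.id;
  }
  const properties = {};
  for (const element of item.props) {
    let value = element.value;
    if (isItem(value)) {
      // An item already in memory means a cycle; otherwise recurse with a copy.
      value = memory.includes(value) ? "ERROR" : getObject(value, memory.slice());
    }
    for (const name of element.names) {
      if (!Object.prototype.hasOwnProperty.call(properties, name)) {
        properties[name] = [];
      }
      properties[name].push(value);
    }
  }
  result.properties = properties;
  return result;
}

// Two items that reference each other, to exercise the cycle check.
const rating = {
  types: ["http://schema.org/AggregateRating"],
  props: [{ names: ["ratingValue"], value: "4" }]
};
const offer = {
  types: ["http://schema.org/Offer"],
  props: [
    { names: ["name"], value: "Blendmagic" },
    { names: ["reviews"], value: rating }
  ]
};
rating.props.push({ names: ["itemReviewed"], value: offer });

const obj = getObject(offer);
console.log(JSON.stringify(obj, null, 2));
```

Passing a copy of memory to each recursive call means an item may legitimately appear in two sibling branches; only an item that contains itself is replaced by "ERROR".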

var result = {};
var items = [];
// top-level items have itemscope but no itemprop; nested items are
// reached through the property that holds them
document.querySelectorAll("[itemscope]:not([itemprop])")
  .forEach(function(el) {
    var item = {
      "type": [el.getAttribute("itemtype")],
      "properties": {}
    };
    el.querySelectorAll("[itemprop]").forEach(function(prop) {
      // skip properties that belong to a nested item rather than to el
      if (prop.parentElement.closest("[itemscope]") !== el) return;
      var name = prop.getAttribute("itemprop");
      var value = prop.content || prop.textContent || prop.src;
      if (prop.matches("[itemscope]")) {
        // the property value is itself an item (one level of nesting only)
        value = {
          "type": [prop.getAttribute("itemtype")],
          "properties": {}
        };
        prop.querySelectorAll("[itemprop]")
          .forEach(function(_prop) {
            value.properties[_prop.getAttribute("itemprop")] = [
              _prop.content || _prop.textContent || _prop.src
            ];
          });
      }
      // append so that repeated property names keep every value
      (item.properties[name] = item.properties[name] || []).push(value);
    });
    items.push(item);
  });

result.items = items;

console.log(result);

document.body
  .insertAdjacentHTML("beforeend", "<pre>" + JSON.stringify(result, null, 2) + "</pre>");

var props = ["Blendmagic", "ratingValue"];

// get the 'content' corresponding to itemprop 'ratingValue'
// for the item whose 'name' property is 'Blendmagic'
var data = result.items.filter(function(value) {
  // keep only items whose name matches, wherever they sit in the array
  return value.properties.name && value.properties.name[0] === props[0];
}).map(function(value) {
  var prop = value.properties.reviews[0].properties;
  var res = {},
    _props = {};
  _props[props[1]] = prop[props[1]];
  res[props[0]] = _props;
  return res;
})[0];

console.log(data);
document.querySelector("pre").insertAdjacentHTML("beforebegin", "<pre>" + JSON.stringify(data, null, 2) + "</pre>");

<!DOCTYPE html>
<html>

<head>
</head>

<body>
  <div itemscope itemtype="http://schema.org/Offer">
    <span itemprop="name">Blendmagic</span>
    <span itemprop="price">$19.95</span>
    <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
      <img data-src="four-stars.jpg" />
      <meta itemprop="ratingValue" content="4" />
      <meta itemprop="bestRating" content="5" />Based on <span itemprop="ratingCount">25</span> user ratings
    </div>
  </div>
  <div itemscope itemtype="http://schema.org/Offer">
    <span itemprop="name">testMagic</span>
    <span itemprop="price">$10.95</span>
    <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
      <img data-src="four-stars.jpg" />
      <meta itemprop="ratingValue" content="4" />
      <meta itemprop="bestRating" content="5" />Based on <span itemprop="ratingCount">25</span> user ratings
    </div>
  </div>
</body>

</html>
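
The lookup step can also be tried outside the browser against the extracted JSON alone. In the sketch below, `sampleResult` is a hand-written object that mirrors the first Offer in the markup above (it is not captured output), and `findNestedProperty` is a hypothetical helper name.

```javascript
// Hand-written stand-in for the extraction result of the first Offer above.
const sampleResult = {
  items: [
    {
      type: ["http://schema.org/Offer"],
      properties: {
        name: ["Blendmagic"],
        price: ["$19.95"],
        reviews: [
          {
            type: ["http://schema.org/AggregateRating"],
            properties: {
              ratingValue: ["4"],
              bestRating: ["5"],
              ratingCount: ["25"]
            }
          }
        ]
      }
    }
  ]
};

// Find the item with the given name, then read one property of the item
// stored under `nestedName`. Returns undefined when anything is missing.
function findNestedProperty(result, itemName, nestedName, propName) {
  const item = result.items.find(
    (it) => it.properties.name && it.properties.name[0] === itemName
  );
  if (!item || !item.properties[nestedName]) {
    return undefined;
  }
  return item.properties[nestedName][0].properties[propName];
}

const ratingValue = findNestedProperty(sampleResult, "Blendmagic", "reviews", "ratingValue");
console.log(ratingValue);
```

Property values stay wrapped in arrays, as in the algorithm above, so a caller that wants the single rating reads `ratingValue[0]`.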

See also Recursion and loops of Microdata items

Check this Fiddle

$("span[itemprop='name']").each(function(index, el) {
    if ($(el).text() === 'Blendmagic') {
        // read the rating from the same item instead of relying on element order
        alert($(el).closest("[itemscope]").find("meta[itemprop='ratingValue']").attr('content'));
    }
});