Indexing static HTML blob storage content with Azure Cognitive Search is not working as expected

依然范特西╮ 提交于 2021-02-11 15:12:37

问题


I'm working on indexing static HTML content in blob storage. The documentation states that preprocessing analyzers will strip surrounding HTML tags when indexing content from that data source. However, our content value is always the entire raw HTML document. I'm also unable to pull out the value of our "meta description" tags. According to the documentation on Indexing Blob Storage, HTML content should automatically produce a metadata_description property, but the value is always null.

I've tried many different indexer configurations, but thus far have not been able to tell if I have something misconfigured or if Azure Search doesn't recognize the content type properly.

All of the files in blob storage have a .html file extension, and the Content Type column shows text/html.

This is the indexer configuration (some bits <redacted>):

{
  "@odata.context": "https://<instance>.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"<tag>\"",
  "name": "<name>",
  "description": null,
  "dataSourceName": "<datasource name>",
  "skillsetName": null,
  "targetIndexName": "<target index>",
  "disabled": null,
  "schedule": {
    "interval": "PT2H",
    "startTime": "0001-01-01T00:00:00Z"
  },
  "parameters": {
    "batchSize": null,
    "maxFailedItems": -1,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "parsingMode": "text",
      "dataToExtract": "contentAndMetadata",
      "excludedFileNameExtensions": ".png .jpg .mpg .pdf",
      "indexedFileNameExtensions": ".html"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "id",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    },
    {
      "sourceFieldName": "metadata_description",
      "targetFieldName": "description",
      "mappingFunction": null
    },
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "url",
      "mappingFunction": {
        "name": "extractTokenAtPosition",
        "parameters": {
          "delimiter": "<delimiter>",
          "position": 1
        }
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null
}

回答1:


This is likely due to the configuration in your indexer "parsingMode": "text"

This parsing mode is for extracting literal text values from the documents. In this case, that includes all of the html tags.

Change that configuration to "parsingMode": "default" to strip html tags from your documents.



来源:https://stackoverflow.com/questions/59633272/indexing-static-html-blob-storage-content-with-azure-cognitive-search-is-not-wor

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!