How to extract metatags from HTML files and index them in SOLR and TIKA

夕颜 2021-01-07 11:29

I am trying to extract the metatags of HTML files and index them into Solr with Tika integration. I am not able to extract those metatags with Tika and not able to displa

3 answers
  •  北恋 (original poster)
    2021-01-07 12:23

    Although this is an older question, I am replying because:

    1. I recently asked a similar question (no replies or comments after several days) that I have since sorted out, and which is relevant to this question.

    2. Solr has changed much over the years, and the existing documentation (where it exists) on this topic is both confusing and sometimes erroneous.

    3. While lengthy, this reply provides a solution to the question with an example and documentation.

    Briefly, my now-deleted StackOverflow question was "Extracting custom (e.g.

    The short answer is that while it is relatively easy to index "standard" HTML elements (a; div; h1; h2; li; meta; p; title; ... https://www.w3.org/TR/2005/WD-xhtml2-20050527/elements.html), it is challenging to include custom tagsets without either the rigid use of properly formatted XML files and update functions in Solr (see, e.g.: https://lucene.apache.org/solr/guide/6_6/uploading-data-with-index-handlers.html#uploading-data-with-index-handlers), or the use of the captureAttr parameter with Apache Tika, which is native to Solr via the ExtractingRequestHandler (described below), or other tools such as Apache Nutch.

    Standard HTML elements such as <title>Solr HTML Indexing Tests</title> are easily indexed; however, non-standard (custom) elements, such as one wrapping the value bt-ic8eew2u, are ignored.

    While you could apply an XML-based solution to index a value such as bt-ic8eew2u, I prefer a simple HTML-based solution; hence the HTML metadata approach.
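
    To make the HTML metadata approach concrete, here is a minimal Python sketch of what it relies on: values carried in <meta name="..." content="..."> tags in the document head. The sample HTML is hypothetical (it is not the test file used below); only the names doc_id and date_pub and their values are borrowed from the indexed output shown later. It uses Python's standard html.parser rather than Tika, purely to illustrate which name/content pairs a parser can see.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        # Skip tags such as <meta charset="..."> that carry no name/content pair.
        if name is not None and content is not None:
            self.meta[name] = content


def extract_meta(html_text):
    """Return a dict of meta name -> content found in html_text."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.meta


# Hypothetical sample document (an assumption, not the author's test file).
SAMPLE_HTML = """<!DOCTYPE html>
<html lang="en-us">
<head>
  <meta charset="UTF-8">
  <title>Solr HTML Indexing Tests</title>
  <meta name="doc_id" content="bt-ic8eeW2U">
  <meta name="date_pub" content="2020-11-16">
</head>
<body><p>I like apples.</p></body>
</html>"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE_HTML))
```

    Each name found this way is a candidate for a similarly named field in the Solr schema.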


    Environment: Arch Linux (x86_64) command-line; Apache Solr 8.7.0; Solr Admin UI (http://localhost:8983/solr/#/gettingstarted/query) in Firefox 83.0

    Test file (solr_test9.html):

    
    
    
      
      Solr HTML Indexing Tests
      
      
      
      
      
    
    
    
    

    Apples

    I like apples.

    Bananas

    I also like bananas.

    This text is located in div element 1.

    This text is located in div element 2.


    Lorem ipsum dolor sit amet, consectetur adipiscing elit.

    Suspendisse efficitur pulvinar elementum.

    My website is BuriedTruth.com.

    Nova Scotia

    Nova Scotia is a province on the east coast of Canada.

    Capital of Nova Scotia

    Halifax is the capital of N.S.

    Halifax is also N.S.'s largest city.

    British Columbia

    Capital of British Columbia

    Victoria is the capital of B.C.

    Vancouver is the largest city in B.C., however.

    Non-terminated sentence (missing period)

    Current date: 2020-11-17


    solrconfig.xml

    Here are the relevant additions to my solrconfig.xml file.

      
      
      
    
      
      
        
          true
          ignored_
          div
          div
          h1
          h1
          h2
          h2_t
          p
          
          p
          
          
          
          
          
        
      
    
      
      
      
        
        
        
        
        
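        <processor class="solr.RegexReplaceProcessorFactory">
          <str name="fieldName">content</str>
          <str name="fieldName">title</str>
          <str name="fieldName">p</str>
          <str name="pattern">\s+</str>
          <str name="replacement"> </str>
          <bool name="literalReplacement">true</bool>
        </processor>

        <processor class="solr.RegexReplaceProcessorFactory">
          <str name="fieldName">content</str>
          <str name="fieldName">title</str>
          <str name="fieldName">p</str>
          <str name="pattern">rect http</str>
          <str name="replacement">http</str>
          <bool name="literalReplacement">true</bool>
        </processor>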
        
        
        
      
    
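
    Settings of this kind can also be sent to the ExtractingRequestHandler as request parameters. The following is a minimal Python sketch, not the configuration itself. It assumes the element-to-field mapping that is visible in the query output further down (div, h1 and p kept under their own names, h2 stored in h2_t) and expresses it with the capture, fmap.* and lowernames parameters. The base URL, the collection name and the helper names are illustrative assumptions.

```python
import urllib.request
from urllib.parse import urlencode

# Assumed local Solr instance and collection, as used elsewhere in this answer.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "gettingstarted"

# Assumption: HTML element -> target field, inferred from the indexed output.
CAPTURE_MAP = {"div": "div", "h1": "h1", "h2": "h2_t", "p": "p"}


def build_extract_url(doc_id, capture_map=CAPTURE_MAP, commit=True):
    """Build an /update/extract URL carrying Solr Cell parameters."""
    params = [
        ("literal.id", doc_id),   # unique key assigned to the document
        ("lowernames", "true"),   # e.g. Content-Type becomes content_type
    ]
    for element, field in capture_map.items():
        params.append(("capture", element))         # index this element separately
        params.append((f"fmap.{element}", field))   # and store it in this field
    if commit:
        params.append(("commit", "true"))
    return f"{SOLR_BASE}/{COLLECTION}/update/extract?{urlencode(params)}"


def post_html(path, doc_id):
    """POST an HTML file to Solr. Requires a running Solr; not called here."""
    with open(path, "rb") as handle:
        request = urllib.request.Request(
            build_extract_url(doc_id),
            data=handle.read(),
            headers={"Content-Type": "text/html"},
            method="POST",
        )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    print(build_extract_url("solr_test9.html"))
```

    Keeping the mapping in one dictionary makes it easy to add or drop captured elements without touching the request code.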

    managed-schema (schema.xml):

    I edited the Solr schema via the Admin UI. Basically, for whatever HTML metadata you want to index, add a similarly-named field (of the appropriate type: e.g., text_general | string | pdate | ...).

    For example, to capture the "doc_id" and "date_pub" metadata I created the following (respective) schema entries:

    
    
    
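
    As an alternative to the Admin UI, fields can be added through Solr's Schema API while the managed schema is in use. The sketch below builds "add-field" commands for the doc_id and date_pub fields seen in the query output. The field types (string, pdate), the indexed/stored settings, the URL and the helper names are assumptions chosen to match how those values appear in the results; they are not the exact schema entries used here.

```python
import json
import urllib.request

# Assumed local collection; adjust to your own setup.
SCHEMA_URL = "http://localhost:8983/solr/gettingstarted/schema"

# Assumption: field types picked to match the values in the query output.
FIELDS = [("doc_id", "string"), ("date_pub", "pdate")]


def add_field_command(name, field_type, multi_valued=False):
    """Return a Schema API 'add-field' command for one field."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": True,
            "multiValued": multi_valued,
        }
    }


def send_schema_command(command, url=SCHEMA_URL):
    """POST a command to the Schema API. Requires a running Solr; not called here."""
    request = urllib.request.Request(
        url,
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    for name, field_type in FIELDS:
        print(json.dumps(add_field_command(name, field_type), indent=2))
```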

    indexing

    Here's how I indexed that HTML test file,

    [victoria@victoria solr-8.7.0]$ date; pwd; ls -l; echo; ls -l server/solr/gettingstarted/conf/
    
    Tue Nov 17 02:18:12 PM PST 2020
    
    /mnt/Vancouver/apps/solr/solr-8.7.0
    total 1792
    drwxr-xr-x  3 victoria victoria   4096 Nov 17 13:26 bin
    -rw-r--r--  1 victoria victoria 946955 Oct 28 02:40 CHANGES.txt
    drwxr-xr-x 12 victoria victoria   4096 Oct 29 07:09 contrib
    drwxr-xr-x  4 victoria victoria   4096 Nov 15 12:33 dist
    drwxr-xr-x  3 victoria victoria   4096 Nov 15 12:33 docs
    drwxr-xr-x  6 victoria victoria   4096 Oct 28 02:40 example
    drwxr-xr-x  2 victoria victoria  36864 Oct 28 02:40 licenses
    -rw-r--r--  1 victoria victoria  12646 Oct 28 02:21 LICENSE.txt
    -rw-r--r--  1 victoria victoria 766662 Oct 28 02:40 LUCENE_CHANGES.txt
    -rw-r--r--  1 victoria victoria  27540 Oct 28 02:21 NOTICE.txt
    -rw-r--r--  1 victoria victoria   7490 Oct 28 02:40 README.txt
    drwxr-xr-x 11 victoria victoria   4096 Nov 15 12:40 server
    
    total 208
    drwxr-xr-x 2 victoria victoria  4096 Oct 28 02:21 lang
    -rw-r--r-- 1 victoria victoria 33888 Nov 17 13:20 managed-schema
    -rw-r--r-- 1 victoria victoria   873 Oct 28 02:21 protwords.txt
    -rw-r--r-- 1 victoria victoria 33788 Nov 17 11:36 schema.xml.2020-11-17.13:01
    -rw-r--r-- 1 victoria victoria 59248 Nov 17 13:16 solrconfig.xml
    -rw-r--r-- 1 victoria victoria 59151 Nov 17 12:59 solrconfig.xml.2020-11-17.13:01
    -rw-r--r-- 1 victoria victoria   781 Oct 28 02:21 stopwords.txt
    -rw-r--r-- 1 victoria victoria  1124 Oct 28 02:21 synonyms.txt
    
    [victoria@victoria solr-8.7.0]$ solr restart; sleep 1; post -c gettingstarted /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html
    
    Sending stop command to Solr running on port 8983 ... waiting up to 180 seconds to allow Jetty process 3511453 to stop gracefully.
    Waiting up to 180 seconds to see Solr running on port 8983 [|]  
    Started Solr server on port 8983 (pid=3572520). Happy searching!
    
    /usr/lib/jvm/java-8-openjdk/jre//bin/java -classpath /mnt/Vancouver/apps/solr/solr-8.7.0/dist/solr-core-8.7.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html
    SimplePostTool version 5.0.0
    Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
    Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
    POSTing file solr_test9.html (text/html) to [base]/extract
    1 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
    Time spent: 0:00:00.755
    
    [victoria@victoria solr-8.7.0]$ 
    

    ... and here is the result (Solr Admin UI: http://localhost:8983/solr/#/gettingstarted/query)

    http://localhost:8983/solr/gettingstarted/select?q=*%3A*
    
    {
      "responseHeader":{
        "status":0,
        "QTime":0,
        "params":{
          "q":"*:*",
          "_":"1605651674401"}},
      "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
          {
            "id":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
            "stream_size":[1428],
            "x_parsed_by":["org.apache.tika.parser.DefaultParser",
              "org.apache.tika.parser.html.HtmlParser"],
            "stream_content_type":["text/html"],
            "date_created":"2019-11-01T00:00:00Z",
            "date_current":["2020-11-17"],
            "resourcename":["/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html"],
            "title":["Solr HTML Indexing Tests"],
            "date_pub":"2020-11-16T00:00:00Z",
            "doc_id":"bt-ic8eeW2U",
            "source_url":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
            "dc_title":["Solr HTML Indexing Tests"],
            "content_encoding":["UTF-8"],
            "content_type":["application/xhtml+xml; charset=UTF-8"],
            "content":[" en-us stream_size 1428 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html date_created 2019-11-01 resourceName /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html date_pub 2020-11-16 doc_id bt-ic8eeW2U source_url /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html dc:title Solr HTML Indexing Tests Content-Encoding UTF-8 Content-Language en-us Content-Type application/xhtml+xml; charset=UTF-8 Solr HTML Indexing Tests Lorem ipsum dolor sit amet, consectetur adipiscing elit. "],
            "div":[" div1 This text is located in div element 1. div2 This text is located in div element 2."],
            "p":[" I like apples. I also like bananas. Suspendisse efficitur pulvinar elementum. My website is https://buriedtruth.com/ BuriedTruth.com . Nova Scotia is a province on the east coast of Canada. Halifax is the capital of N.S. Halifax is also N.S.'s largest city. Victoria is the capital of B.C. Vancouver is the largest city in B.C., however. Non-terminated sentence (missing period) Current date: 2020-11-17"],
            "h1":[" Apples Nova Scotia British Columbia"],
            "h2_t":" Bananas Capital of Nova Scotia Capital of British Columbia",
            "_version_":1683647678197530624}]
      }}
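
    A quick way to confirm that the metadata reached the index is to check each returned document for the expected field names. This is a minimal sketch that works on a trimmed copy of the response above (only a few fields kept); the helper name missing_fields is hypothetical.

```python
import json

# Trimmed copy of the response shown above (most fields omitted).
RESPONSE_TEXT = """
{"response": {"numFound": 1, "start": 0, "docs": [
  {"id": "/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
   "title": ["Solr HTML Indexing Tests"],
   "date_pub": "2020-11-16T00:00:00Z",
   "doc_id": "bt-ic8eeW2U",
   "h2_t": " Bananas Capital of Nova Scotia Capital of British Columbia"}]}}
"""


def missing_fields(response, expected):
    """Map each document id to the expected fields it lacks."""
    report = {}
    for doc in response["response"]["docs"]:
        absent = [name for name in expected if name not in doc]
        if absent:
            report[doc["id"]] = absent
    return report


if __name__ == "__main__":
    parsed = json.loads(RESPONSE_TEXT)
    # An empty dict means every expected field is present in every document.
    print(missing_fields(parsed, ["doc_id", "date_pub", "h2_t"]))
```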
    

    UPDATE -- managed-schema >> schema.xml peculiarities:

    While not related to the original question, the following content is related to my answer (above) -- specifically, peculiarities associated with switching from Solr's managed-schema to the classic (user-managed) schema.xml. It is included here to provide a complete solution.

    First, add

    <schemaFactory class="ClassicIndexSchemaFactory"/>

    to your solrconfig.xml file.

    Then edit this:

    
    

    ... to this:

    
    

    i.e., delete

      name="add-unknown-fields-to-the-schema"
      default="${update.autoCreateFields:true}"
      add-schema-fields
    

    Rename managed-schema to schema.xml, and restart Solr or reload the core to effect the changes.

    To further extend my example (above), here is a sample configuration and the resulting output for the HTML test file that I provided above.

    solrconfig.xml (part):

    
      
      
      
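      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="fieldName">title</str>
        <str name="fieldName">p</str>
        <str name="pattern">\s+</str>
        <str name="replacement"> </str>
        <bool name="literalReplacement">true</bool>
      </processor>

      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="fieldName">title</str>
        <str name="fieldName">p</str>
        <str name="pattern">rect http</str>
        <str name="replacement">http</str>
        <bool name="literalReplacement">true</bool>
      </processor>

      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="fieldName">title</str>
        <str name="pattern">[sS]olr</str>
        <str name="replacement">APPLE</str>
        <bool name="literalReplacement">true</bool>
      </processor>

      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="fieldName">title</str>
        <str name="pattern">HTML</str>
        <str name="replacement">BANANA</str>
        <bool name="literalReplacement">true</bool>
      </processor>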
    
      
    
    

    output

    • Note:

      • changes to "title" (Solr >> APPLE; HTML >> BANANA)
      • removal of "rect " from the URL in "p" (discussed here: Solr ExtractingRequestHandler extracting "rect" in links)
    {
      "responseHeader":{
        "status":0,
        "QTime":32,
        "params":{
          "q":"*:*",
          "_":"1605767164812"}},
      "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
          {
            "id":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
            "stream_size":[1628],
            "x_parsed_by":["org.apache.tika.parser.DefaultParser",
              "org.apache.tika.parser.html.HtmlParser"],
            "stream_content_type":["text/html"],
            "date_created":"2020-11-11T21:36:38Z",
            "date_current":["2020-11-17"],
            "resourcename":["/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html"],
            "title":["APPLE BANANA Indexing Tests"],
            "date_pub":"2020-11-16T21:37:18Z",
            "doc_id":"bt-ic8eeW2U",
            "source_url":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
            "dc_title":["Solr HTML Indexing Tests"],
            "content_encoding":["UTF-8"],
            "content_type":["application/xhtml+xml; charset=UTF-8"],
            "content":[" en-us stream_size 1628 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html date_created 2020-11-11T21:36:38Z resourceName /mnt/Vancouver/programming/datasci/APPLE/test/APPLE_test9.html date_pub 2020-11-16T21:37:18Z doc_id bt-ic8eeW2U source_url /mnt/Vancouver/programming/datasci/APPLE/test/APPLE_test9.html dc:title APPLE BANANA Indexing Tests Content-Encoding UTF-8 Content-Language en-us Content-Type application/xhtml+xml; charset=UTF-8 APPLE BANANA Indexing Tests Lorem ipsum dolor sit amet, consectetur adipiscing elit. "],
            "div":[" div1 This text is located in div element 1. div2 This text is located in div element 2. apple This text is located in the \"apple\" (class) div element. banana This text is located in the \"banana\" (class) div element."],
            "p":[" I like apples. I also like bananas. Suspendisse efficitur pulvinar elementum. My website is https://buriedtruth.com/ BuriedTruth.com . Nova Scotia is a province on the east coast of Canada. Halifax is the capital of N.S. Halifax is also N.S.'s largest city. Victoria is the capital of B.C. Vancouver is the largest city in B.C., however. Non-terminated sentence (missing period) Current date: 2020-11-17"],
            "h1":[" Apples Nova Scotia British Columbia"],
            "h2_t":" Bananas Capital of Nova Scotia Capital of British Columbia",
            "_version_":1683814668971278336}]
      }}
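
    The effect of the four replacements can be reproduced outside Solr. This is a minimal sketch, assuming the values in the configuration above are regex pattern/replacement pairs applied in order to the listed fields, with the replacement text taken literally. It uses Python's re module rather than Solr's Java regex engine, which behaves the same for these simple patterns; the names RULES and apply_rules are hypothetical.

```python
import re

# (fields, pattern, replacement), in the order the rules are applied.
RULES = [
    (("content", "title", "p"), r"\s+", " "),
    (("content", "title", "p"), r"rect http", "http"),
    (("content", "title"), r"[sS]olr", "APPLE"),
    (("content", "title"), r"HTML", "BANANA"),
]


def apply_rules(field, value, rules=RULES):
    """Apply every rule that targets `field`, in order, to `value`."""
    for fields, pattern, replacement in rules:
        if field in fields:
            # A function replacement keeps the text literal (no backreferences).
            value = re.sub(pattern, lambda _match, text=replacement: text, value)
    return value


if __name__ == "__main__":
    print(apply_rules("title", "Solr HTML Indexing Tests"))
    print(apply_rules("p", "My website is rect https://buriedtruth.com/ BuriedTruth.com ."))
    # dc_title is not targeted by any rule, so it is left unchanged.
    print(apply_rules("dc_title", "Solr HTML Indexing Tests"))
```

    This matches the output above: "title" becomes "APPLE BANANA Indexing Tests" while "dc_title", which no rule targets, keeps its original value.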
    
