Solr - Best approach to import 20 million documents from a CSV file

Backend · Open · 5 answers · 980 views
有刺的猬 2020-12-29 09:29

My task at hand is to figure out the best approach to loading millions of documents into Solr. The data file is a database export in CSV format.

Currently, I am t…

5 Answers
  •  清歌不尽
    2020-12-29 10:08

    Unless a database is already part of your solution, I wouldn't add that extra complexity. Quoting the Solr FAQ, it's your servlet container that is issuing the session timeout.

    As I see it, you have a couple of options (in my order of preference):

    Increase container timeout

    Increase the container timeout (the "maxIdleTime" parameter, if you're using the embedded Jetty instance).

    I'm assuming you only occasionally index such large files? Temporarily increasing the timeout might just be the simplest option.
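
    For example, with the Jetty that ships with the Solr 3.x example distribution, the setting lives in example/etc/jetty.xml; the exact file location and connector class depend on your Solr/Jetty version, so treat this as a sketch rather than your exact config:

    <Call name="addConnector">
      <Arg>
        <New class="org.mortbay.jetty.bio.SocketConnector">
          <Set name="port">8983</Set>
          <!-- Idle timeout in milliseconds; raise it while bulk loading -->
          <Set name="maxIdleTime">300000</Set>
        </New>
      </Arg>
    </Call>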

    Split the file

    Here's a simple Unix script that will do the job, splitting the file into 500,000-line chunks:

    # Split the CSV into 500,000-line chunks: split_files.00, split_files.01, ...
    split -d -l 500000 data.csv split_files.
    for file in split_files.*
    do
        # Post each chunk to the CSV update handler and commit it
        curl 'http://localhost:8983/solr/update/csv?fieldnames=id,name,category&commit=true' -H 'Content-type:text/plain; charset=utf-8' --data-binary @"$file"
    done
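
    Once all chunks are posted, a quick query against the same instance confirms the result; the numFound value in the response should match the number of CSV lines you loaded:

    curl 'http://localhost:8983/solr/select?q=*:*&rows=0'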
    

    Parse the file and load in chunks

    The following Groovy script uses opencsv and SolrJ to parse the CSV file and commit changes to Solr every 500,000 lines.

    import au.com.bytecode.opencsv.CSVReader
    
    import org.apache.solr.client.solrj.SolrServer
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
    import org.apache.solr.common.SolrInputDocument
    
    // Grape pulls these dependencies at startup
    @Grapes([
        @Grab(group='net.sf.opencsv', module='opencsv', version='2.3'),
        @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0'),
        @Grab(group='ch.qos.logback', module='logback-classic', version='1.0.0'),
    ])
    
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/")
    
    new File("data.csv").withReader { reader ->
        CSVReader csv = new CSVReader(reader)
        String[] result
        Integer count = 1
        Integer chunkSize = 500000
    
        // readNext() returns one row per call and null at end of file
        while ((result = csv.readNext()) != null) {
            SolrInputDocument doc = new SolrInputDocument()
    
            doc.addField("id",         result[0])
            doc.addField("name_s",     result[1])
            doc.addField("category_s", result[2])
    
            // Added documents are buffered; nothing is searchable until commit
            server.add(doc)
    
            // Commit every chunkSize documents to keep each batch bounded
            if (count.mod(chunkSize) == 0) {
                server.commit()
            }
            count++
        }
        // Final commit for the last partial chunk
        server.commit()
    }
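
    Since the dependencies are declared via @Grapes, the script needs nothing beyond a Groovy installation; Grape downloads opencsv and SolrJ on the first run. Assuming you save it as load.groovy (the name is arbitrary):

    groovy load.groovy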
    
