Splitting Lucene index into two halves

Submitted by ぐ巨炮叔叔 on 2020-01-24 13:42:26

Question


What is the best way to split an existing Lucene index into two halves, i.e. so that each split contains half of the total number of documents in the original index?


Answer 1:


The easiest way to split an existing index (without reindexing all the documents) is to:

  1. Make another copy of the existing index (i.e. cp -r myindex mycopy)
  2. Open the first index, and delete the first half of the documents (doc IDs 0 up to, but not including, maxDoc / 2)
  3. Open the second index, and delete the other half (doc IDs maxDoc / 2 up to, but not including, maxDoc)
  4. Optimize both indices, so that the deleted documents are physically removed

This is probably not the most efficient way, but it requires very little code.
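
The range arithmetic in steps 2 and 3 is easy to get off by one, so here is a minimal sketch of it in plain Java. It is only a model: the class and method names are hypothetical, no Lucene classes are used, and a BitSet stands in for the set of live documents in each copy. In the Lucene 2.x/3.x API the deletions themselves would presumably go through something like IndexReader.deleteDocument(int), but that is an assumption about what the answer intends. Note also that maxDoc counts documents that were already deleted, so the two halves can hold unequal numbers of live documents.

```java
import java.util.BitSet;

// Pure-Java model of the copy-and-delete split; hypothetical names, no Lucene calls.
final class DocIdRangeSplit {

    /** Returns the two half-open docID ranges {from, to} that together cover [0, maxDoc). */
    static int[][] halves(int maxDoc) {
        if (maxDoc < 0) {
            throw new IllegalArgumentException("maxDoc must be >= 0");
        }
        // Integer division: when maxDoc is odd, the second range is one document longer.
        int mid = maxDoc / 2;
        return new int[][] { {0, mid}, {mid, maxDoc} };
    }

    /** Models one copy of the index: deletes docIDs in [from, to) and returns the survivors. */
    static BitSet survivorsAfterDeleting(int maxDoc, int from, int to) {
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc);   // every docID starts out live in this model
        live.clear(from, to);  // "delete" the half-open range
        return live;
    }
}
```

Because the two ranges share the boundary maxDoc / 2 and are half-open, every document survives in exactly one of the two copies, whether maxDoc is even or odd.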




Answer 2:


A fairly robust mechanism is to use a checksum of the document, modulo the number of indexes, to decide which index it will go into.
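
As a rough illustration of that idea, the sketch below routes a document to an index by taking a CRC-32 checksum modulo the number of indexes. The class name, the method name, and the choice to hash a unique document key (rather than the whole document) are assumptions made for the example; the answer does not say which checksum or which input to use.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: picks a target index for a document from a checksum of its key.
final class ShardRouter {

    /** Returns a value in [0, numIndexes) that is stable for a given docKey. */
    static int shardFor(String docKey, int numIndexes) {
        if (numIndexes <= 0) {
            throw new IllegalArgumentException("numIndexes must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(docKey.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long below 2^32, so the remainder is never negative.
        return (int) (crc.getValue() % numIndexes);
    }
}
```

For a two-way split, each document would be written to the index numbered shardFor(key, 2). The same key always maps to the same index, which matters later when a document has to be updated or deleted.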




Answer 3:


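Recent versions of Lucene have dedicated tools to do this: IndexSplitter and MultiPassIndexSplitter, both under contrib/misc. IndexSplitter works at the segment level, copying whole segments into a new index, while MultiPassIndexSplitter splits by document into a given number of parts.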




Answer 4:


This question was one of the first I found when I was researching answers to this problem, so I'm leaving my solution here for future generations. In my case, I needed to split my index along specific lines, not arbitrarily down the middle or into thirds or what have you. This is a C# solution using Lucene.Net 3.0.3.

My app's index is over 300GB in size, which was becoming a little unmanageable. Each document in the index is associated with one of the manufacturing plants that use the app. There is no business reason for one plant ever to search another plant's data, so I needed to cleanly divide the index along those lines. Here's the code I wrote to do so:

var distinctPlantIDs = databaseRepo.GetDistinctPlantIDs();
var sourceDir = GetOldIndexDir();
foreach (var plantID in distinctPlantIDs)
{
    var query = new TermQuery(new Term("PlantID", plantID.ToString()));
    var targetDir = GetNewIndexDirForPlant(plantID); //returns a unique directory where this plant's index will go

    //read each plant's documents and write them to the new index
    using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
    using (var sourceSearcher = new IndexSearcher(sourceDir, true))
    using (var destWriter = new IndexWriter(targetDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var numHits = sourceSearcher.DocFreq(query.Term);
        if (numHits <= 0) continue;
        var hits = sourceSearcher.Search(query, numHits).ScoreDocs;
        foreach (var hit in hits)
        {
            //Doc() returns stored fields only; fields that were indexed but not stored are not copied
            var doc = sourceSearcher.Doc(hit.Doc);
            destWriter.AddDocument(doc);
        }
        destWriter.Optimize();
        destWriter.Commit();
    }

    //delete the documents out of the old index
    using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
    using (var sourceWriter = new IndexWriter(sourceDir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        sourceWriter.DeleteDocuments(query);
        sourceWriter.Commit();
    }
}

The part that deletes the records from the old index is there because, in my case, one plant's records took up the majority of the index (over two-thirds). So in my real version there is some extra code to handle that plant last: instead of splitting it out like the others, it optimizes the remaining index (which by then contains only that plant) and then moves it to its new directory.

Anyway, hope this helps someone out there.



Source: https://stackoverflow.com/questions/2865960/splitting-lucene-index-into-two-halves
