ElasticSearch Indexing 100K documents with BulkRequest API using Java RestHighLevelClient

Posted by 放荡痞女 on 2019-12-06 14:52:49

Question


I am reading more than 100K file paths from the index documents_qa using the scroll API. The actual files are available on my local D:\ drive. Using each file path, I read the actual file, convert it to base64, and reindex the document with the base64 content of the file into another index, document_attachment_qa.

My current implementation reads the filePath, converts the file to base64, and indexes each document along with its fileContent one by one. This is very slow: for example, indexing 4000 documents takes more than 6 hours, and the connection also terminates due to an IO exception.

So now I want to index the documents using the BulkRequest API, but I am using RestHighLevelClient and am not sure how to use the BulkRequest API with it.

Here is my current implementation, which indexes the documents one by one.

jsonMap = new HashMap<String, Object>();
jsonMap.put("id", doc.getId());
jsonMap.put("app_language", doc.getApp_language());
jsonMap.put("fileContent", result);

String id = Long.toString(doc.getId());

IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id) // ATTACHMENT is the index name
        .source(jsonMap) // a single document
        .setPipeline(ATTACHMENT);

IndexResponse response = SearchEngineClient.getInstance3().index(request); // increased timeout

I found the following documentation for BulkRequest:

https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/java-docs-bulk.html

But I am not sure how to implement BulkRequestBuilder bulkRequest = client.prepareBulk(); that is, how to call the client.prepareBulk() method, when using RestHighLevelClient.

UPDATE 1

I am trying to index all 100K documents in one shot, so I create one JSONArray and put all my JSONObjects into the array one by one. Finally, I try to build a BulkRequest, add all my documents (the JSONArray) as the source of the BulkRequest, and index them.

Here I am not sure how to convert my JSONArray to a List of Strings.

private final static String ATTACHMENT = "document_attachment_qa";
private final static String TYPE = "doc";
JSONArray reqJSONArray=new JSONArray();

while (searchHits != null && searchHits.length > 0) { 
...
...
    jsonMap = new HashMap<String, Object>();
    jsonMap.put("id", doc.getId());
    jsonMap.put("app_language", doc.getApp_language());
    jsonMap.put("fileContent", result);

    reqJSONArray.put(jsonMap);
}

String actionMetaData = String.format("{ \"index\" : { \"_index\" : \"%s\", \"_type\" : \"%s\" } }%n", ATTACHMENT, TYPE);
List<String> bulkData =   // not sure how to convert a list of my documents in JSON strings    
StringBuilder bulkRequestBody = new StringBuilder();
for (String bulkItem : bulkData) {
    bulkRequestBody.append(actionMetaData);
    bulkRequestBody.append(bulkItem);
    bulkRequestBody.append("\n");
}

HttpEntity entity = new NStringEntity(bulkRequestBody.toString(), ContentType.APPLICATION_JSON);
try {
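    // Build the endpoint from the constants; the literal "/ATTACHMENT/TYPE/_bulk" would target an index named "ATTACHMENT".
    Response response = SearchEngineClient.getRestClientInstance().performRequest("POST", "/" + ATTACHMENT + "/" + TYPE + "/_bulk", Collections.emptyMap(), entity);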
    return response.getStatusLine().getStatusCode() == HttpStatus.SC_OK;
} catch (Exception e) {
    // do something
}

Answer 1:


You can simply create a new BulkRequest() and add the requests to it, without using BulkRequestBuilder, like this:

BulkRequest request = new BulkRequest();
request.add(new IndexRequest("foo", "bar", "1")
        .source(XContentType.JSON,"field", "foobar"));
request.add(new IndexRequest("foo", "bar", "2")
        .source(XContentType.JSON,"field", "foobar"));
...
BulkResponse bulkResponse = myHighLevelClient.bulk(request, RequestOptions.DEFAULT);
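
Since the question tried to send all 100K documents in one shot, note that a single bulk request carrying that many base64-encoded files would probably exceed Elasticsearch's HTTP request size limit (http.max_content_length, 100mb by default), so the documents are usually split into smaller batches with one BulkRequest per batch. Below is a minimal sketch of the batching step only, in plain Java so that it runs without the client jar. BulkBatcher, partition and the batch size of 500 are assumptions made for illustration, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkBatcher {

    private BulkBatcher() {}

    /**
     * Splits items into consecutive batches of at most batchSize elements.
     * Each batch is meant to become one BulkRequest.
     */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in for the 100K document ids read through the scroll API.
        List<Integer> docIds = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            docIds.add(i);
        }
        // 500 is an assumed batch size; tune it to the size of your files.
        List<List<Integer>> batches = partition(docIds, 500);
        // With the real client: for each batch, create a new BulkRequest,
        // add one IndexRequest per element, then call client.bulk(...).
        System.out.println(batches.size() + " bulk requests of up to 500 documents each");
    }
}
```

With large attachments it is probably safer to pick the batch size by total payload size rather than by document count.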



Answer 2:


In addition to @chengpohi's answer, I would like to add the following points:

A BulkRequest can be used to execute multiple index, update and/or delete operations using a single request.

It requires at least one operation to be added to the Bulk request:

BulkRequest request = new BulkRequest(); 
request.add(new IndexRequest("posts", "doc", "1")  
        .source(XContentType.JSON,"field", "foo"));
request.add(new IndexRequest("posts", "doc", "2")  
        .source(XContentType.JSON,"field", "bar"));
request.add(new IndexRequest("posts", "doc", "3")  
        .source(XContentType.JSON,"field", "baz"));

Note: The Bulk API supports only documents encoded in JSON or SMILE. Providing documents in any other format will result in an error.
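
If the raw _bulk endpoint is called through the low-level client instead, as the question attempts, the JSON body is newline-delimited: one action line, then one source line per document, and the body has to end with a newline. Here is a minimal sketch of building that body, assuming each document has already been serialized to a single-line JSON string (with org.json, calling toString() on each JSONObject should give that). BulkBodyBuilder and build are hypothetical names, and the index and type are assumed to be simple names that need no JSON escaping.

```java
import java.util.Arrays;
import java.util.List;

final class BulkBodyBuilder {

    private BulkBodyBuilder() {}

    /**
     * Builds a newline-delimited bulk body: one action line followed by one
     * source line per document. Each element of jsonDocs must already be a
     * single-line JSON object.
     */
    static String build(String index, String type, List<String> jsonDocs) {
        String action = String.format(
                "{\"index\":{\"_index\":\"%s\",\"_type\":\"%s\"}}", index, type);
        StringBuilder body = new StringBuilder();
        for (String doc : jsonDocs) {
            if (doc.indexOf('\n') >= 0) {
                // A raw newline inside a document would break the line-based format.
                throw new IllegalArgumentException("document must not contain a raw newline");
            }
            body.append(action).append('\n');
            body.append(doc).append('\n'); // every line, including the last, ends with a newline
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Sample documents; the field names follow the question.
        List<String> docs = Arrays.asList(
                "{\"id\":1,\"app_language\":\"en\"}",
                "{\"id\":2,\"app_language\":\"de\"}");
        System.out.print(build("document_attachment_qa", "doc", docs));
    }
}
```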

Synchronous Operation:

BulkResponse bulkResponse = client.bulk(request, RequestOptions.DEFAULT);

Here client is the High-Level REST Client, and the call executes synchronously.

Asynchronous Operation (Recommended Approach):

client.bulkAsync(request, RequestOptions.DEFAULT, listener);

The asynchronous execution of a bulk request requires both the BulkRequest instance and an ActionListener instance to be passed to the asynchronous method.

Listener Example:

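ActionListener<BulkResponse> listener = new ActionListener<BulkResponse>() {
    @Override
    public void onResponse(BulkResponse bulkResponse) {
        // Called when the bulk request completes; individual operations may still have failed.
    }

    @Override
    public void onFailure(Exception e) {
        // Called when the whole bulk request fails.
    }
};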
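
Because bulkAsync returns immediately, the calling code often needs a way to wait for the result or to chain further work. One possible pattern is to let the listener complete a CompletableFuture. This is only a sketch: the Listener interface below is a stand-in that mirrors the two methods of ActionListener so the example compiles without the Elasticsearch jar, and ListenerFutures and completing are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;

/** Stand-in for the client's ActionListener (assumption: same two callbacks). */
interface Listener<T> {
    void onResponse(T response);

    void onFailure(Exception e);
}

final class ListenerFutures {

    private ListenerFutures() {}

    /** Returns a listener that completes the given future with the response or the failure. */
    static <T> Listener<T> completing(final CompletableFuture<T> future) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                future.complete(response);
            }

            @Override
            public void onFailure(Exception e) {
                future.completeExceptionally(e);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> done = new CompletableFuture<>();
        Listener<String> listener = completing(done);
        // With the real client, bulkAsync would invoke the listener from another thread.
        listener.onResponse("bulk finished");
        // get() blocks until the listener has been called.
        System.out.println(done.get());
    }
}
```

With the real client the same idea would apply to an ActionListener<BulkResponse>, and the caller would wait on the future for each batch.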

The returned BulkResponse contains information about the executed operations and lets you iterate over each result as follows:

for (BulkItemResponse bulkItemResponse : bulkResponse) { 
    DocWriteResponse itemResponse = bulkItemResponse.getResponse(); 

    if (bulkItemResponse.getOpType() == DocWriteRequest.OpType.INDEX
            || bulkItemResponse.getOpType() == DocWriteRequest.OpType.CREATE) { 
        IndexResponse indexResponse = (IndexResponse) itemResponse;

    } else if (bulkItemResponse.getOpType() == DocWriteRequest.OpType.UPDATE) { 
        UpdateResponse updateResponse = (UpdateResponse) itemResponse;

    } else if (bulkItemResponse.getOpType() == DocWriteRequest.OpType.DELETE) { 
        DeleteResponse deleteResponse = (DeleteResponse) itemResponse;
    }
}

The following arguments can optionally be provided:

request.timeout(TimeValue.timeValueMinutes(2)); // timeout given as a TimeValue
request.timeout("2m"); // the same timeout given as a String

I hope this helps.



Source: https://stackoverflow.com/questions/51868548/elasticsearch-indexing-100k-documents-with-bulkrequest-api-using-java-resthighle
