Writing to Elasticsearch from Spark is very slow

Submitted by 流过昼夜 on 2019-12-07 07:12:27

First, let's start with what's happening in your application. Apache Spark is reading one (not so big) compressed CSV file. Spark will therefore spend time decompressing and scanning the data before writing it to Elasticsearch.

This creates a Dataset/DataFrame with a single partition (confirmed by the df.rdd.getNumPartitions result you mentioned in the comments).

One straightforward solution is to repartition the data after reading it and cache the result before writing it to Elasticsearch. I don't know what your data looks like, so choosing the number of partitions is something you will have to benchmark on your side.

val input = sqlContext.read.options(readOptions)
                      .csv(inputFile.getAbsolutePath)
                      .repartition(100) // 100 is just an example
                      .cache
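
For completeness, the write itself goes through the elasticsearch-spark (elasticsearch-hadoop) connector. A minimal sketch could look like the following; the index name, host and port are placeholders you would replace with your own values:

    import org.elasticsearch.spark.sql._   // brings saveToEs into scope for DataFrames

    // Hypothetical index/type and connection settings -- adjust to your cluster.
    input.saveToEs("my-index/doc", Map(
      "es.nodes" -> "localhost",
      "es.port"  -> "9200"
    ))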

I'm not sure how much benefit this will bring to your application, because I believe there may be other bottlenecks (network I/O, the disk type backing ES).

PS: I would recommend converting the CSV to Parquet files before building ETL on top of them; there is a real performance gain there (personal opinion and benchmarks).
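
As a rough sketch of that idea (the output path below is made up), the one-off conversion is just a read followed by a Parquet write:

    // Parse the CSV once and persist it in columnar form.
    sqlContext.read.options(readOptions)
              .csv(inputFile.getAbsolutePath)
              .write
              .mode("overwrite")
              .parquet("/data/input-parquet")   // hypothetical output path

    // Later ETL jobs read the Parquet files directly, skipping CSV
    // parsing and decompression entirely.
    val parquetInput = sqlContext.read.parquet("/data/input-parquet")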

Another possible optimization would be to tweak the es.batch.size.entries setting for the elasticsearch-spark connector. The default value is 1000.

You need to be careful when setting this parameter because you might overload Elasticsearch. I strongly advise you to take a look at the available configurations here.
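
For illustration, here is a sketch of passing that setting through the DataFrame writer; the value 500 and the index name are only examples, not recommendations:

    // A smaller batch size means lighter bulk requests against Elasticsearch.
    input.write
         .format("org.elasticsearch.spark.sql")
         .option("es.batch.size.entries", "500") // example value -- benchmark it
         .save("my-index/doc")                   // hypothetical index/type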

I hope this helps!
