I have some Python pseudocode that reads from a Kafka stream and upserts documents into Elasticsearch, incrementing a view counter if the document already exists (TOPIC_NAME and INDEX_NAME are placeholders for my actual values):
    import json
    from kafka import KafkaConsumer
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    consumer = KafkaConsumer(TOPIC_NAME)

    for message in consumer:
        msg = json.loads(message.value)
        print(msg)
        index = INDEX_NAME
        es_id = msg["id"]
        # increment the counter if the doc exists; otherwise insert msg as the new document
        script = {"script": "ctx._source.view += 1", "upsert": msg}
        es.update(index=index, doc_type="test", id=es_id, body=script)
Since I want to run this in a distributed environment, I am using Spark Structured Streaming:
    df.writeStream \
        .format("org.elasticsearch.spark.sql") \
        .queryName("ESquery") \
        .option("es.resource", "credentials/url") \
        .option("checkpointLocation", "checkpoint") \
        .start()
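From skimming the elasticsearch-hadoop configuration page, I suspect the script would have to be passed as extra es.* options on the writer rather than inside the data itself. This is only a guess (written in Scala, untested; es.write.operation, es.mapping.id, es.update.script.inline and es.update.script.lang are the option names I found there, and "id" is the id field of my messages):

    df.writeStream
      .format("org.elasticsearch.spark.sql")
      .queryName("ESquery")
      .option("es.resource", "credentials/url")
      .option("es.write.operation", "upsert")                      // update existing docs, insert new ones
      .option("es.mapping.id", "id")                               // take the document id from the "id" field
      .option("es.update.script.inline", "ctx._source.view += 1")  // script applied when the doc already exists
      .option("es.update.script.lang", "painless")
      .option("checkpointLocation", "checkpoint")
      .start()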
or Spark Streaming in Scala that reads from a Kafka stream:
    // Initializing the Spark Streaming context and the Kafka stream
    sparkConf.setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    [...]
    val messages = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topicsSet, kafkaParams)
    )
    [...]
    val urls = messages.map(record =>
      JsonParser.parse(record.value()).values.asInstanceOf[Map[String, Any]]
    )
    urls.saveToEs("credentials/credential")
saveToEs(...) is the API of elasticsearch-hadoop, documented here. Unfortunately that repo is not well documented, so I cannot figure out where to put the script command.
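My best guess so far is that the script does not go into the data or the saveToEs call itself, but into the es.* settings, e.g. via the saveToEs overload that takes a configuration map. A minimal untested sketch (option names assumed from the configuration reference; older elasticsearch-hadoop versions apparently call the inline-script key es.update.script instead of es.update.script.inline):

    import org.elasticsearch.spark.streaming._  // brings saveToEs into scope for DStreams

    val esCfg = Map(
      "es.write.operation"      -> "upsert",                 // upsert instead of plain index
      "es.mapping.id"           -> "id",                     // document id comes from the "id" field
      "es.update.script.inline" -> "ctx._source.view += 1",  // run only when the document already exists
      "es.update.script.lang"   -> "painless"
    )

    urls.saveToEs("credentials/credential", esCfg)

If I understand the docs correctly, the mapped message itself would then be used as the upsert document when no document with that id exists yet.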
Can anyone tell me whether this is the right place for the script command, or where it should go instead? Thanks in advance.