I have to store some message in ElasticSearch integrate with my python program. Now what I try to store the message is:
d={\"message\":\"this is message\"}
(the other approaches mentioned in this thread use python list for the ES update, which is not a good solution today, especially when you need to add millions of data to ES)
Better approach is using python generators -- process gigs of data without going out of memory or compromising much on speed.
Below is an example snippet from a practical use case - adding data from nginx log file to ES for analysis.
def decode_nginx_log(_nginx_fd):
for each_line in _nginx_fd:
# Filter out the below from each log line
remote_addr = ...
timestamp = ...
...
# Index for elasticsearch. Typically timestamp.
idx = ...
es_fields_keys = ('remote_addr', 'timestamp', 'url', 'status')
es_fields_vals = (remote_addr, timestamp, url, status)
# We return a dict holding values from each line
es_nginx_d = dict(zip(es_fields_keys, es_fields_vals))
# Return the row on each iteration
yield idx, es_nginx_d # <- Note the usage of 'yield'
def es_add_bulk(nginx_file):
# The nginx file can be gzip or just text. Open it appropriately.
...
es = Elasticsearch(hosts = [{'host': 'localhost', 'port': 9200}])
# NOTE the (...) round brackets. This is for a generator.
k = ({
"_index": "nginx",
"_type" : "logs",
"_id" : idx,
"_source": es_nginx_d,
} for idx, es_nginx_d in decode_nginx_log(_nginx_fd))
helpers.bulk(es, k)
# Now, just run it.
es_add_bulk('./nginx.1.log.gz')
This skeleton demonstrates the usage of generators. You can use this even on a bare machine if you need to. And you can go on expanding on this to tailor to your needs quickly.
Python Elasticsearch reference here.