What is the best way to create a subset of my data in Elasticsearch?

Question

I have an index in Elasticsearch containing Apache log data. Here is what I want to do:

1. Identify all visitors (by IP number) that accessed a certain file (e.g. /signup.php).
2. Do a search/query/aggregation on my data, but limit the documents that are examined to those containing an IP number found in step 1.

In the SQL world, I would just create a temporary table and insert all the matching IP numbers from step 1. Next I would query my main table and limit the result set by joining in my temporary table on IP number.

I understand joins are not possible in Elasticsearch. The Elasticsearch documentation suggests a few ways to handle situations like this:

Application-side joins

This does not seem practical, because the list of IP numbers may be very large and it seems inefficient to send the results to the client and then pass them back to Elasticsearch in one huge terms filter.
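For reference, an application-side join comes down to two request bodies. The sketch below builds them as plain Python dictionaries. The field names `request` and `clientip` and the helper names are assumptions for illustration, not taken from the actual index mapping, and the exact-match `term` filter assumes the URL field is mapped as a keyword.

```python
def build_ip_lookup_body(path, url_field="request", ip_field="clientip", max_ips=10000):
    """Step 1: aggregate the distinct IPs that requested `path`."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"bool": {"filter": [{"term": {url_field: path}}]}},
        "aggs": {"visitors": {"terms": {"field": ip_field, "size": max_ips}}},
    }


def extract_ips(response):
    """Pull the bucket keys (the IPs) out of a step-1 search response."""
    return [bucket["key"] for bucket in response["aggregations"]["visitors"]["buckets"]]


def build_subset_body(ips, ip_field="clientip"):
    """Step 2: restrict any further query to documents from those IPs."""
    return {"query": {"bool": {"filter": [{"terms": {ip_field: ips}}]}}}


# Hand-written response shaped like a terms aggregation result.
fake_response = {
    "aggregations": {"visitors": {"buckets": [
        {"key": "10.0.0.1", "doc_count": 3},
        {"key": "10.0.0.2", "doc_count": 1},
    ]}}
}
ips = extract_ips(fake_response)
subset_body = build_subset_body(ips)
```

Both bodies would be sent to the index's `_search` endpoint. The size concern still applies: every IP travels to the client and back again inside the `terms` filter.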

Denormalizing the data

This would involve iterating over the matching IP numbers and updating every document in the index for any given IP number with something like "in_group": true, so I can use that in my query later on. This also seems very impractical and inefficient, especially since the source query (step 1) is dynamic.
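To make the denormalizing option concrete, here is a rough sketch of what the flagging step might look like as a request body for the `_update_by_query` API, followed by the filter that would use the flag afterwards. The field name `clientip` and the helper names are assumptions for illustration only.

```python
def build_flag_update_body(ips, ip_field="clientip"):
    """Body for _update_by_query: set in_group = true on every document from `ips`."""
    return {
        "script": {
            "lang": "painless",
            "source": "ctx._source.in_group = true",
        },
        "query": {"terms": {ip_field: ips}},
    }


def build_in_group_filter():
    """Filter that later queries would use to select the flagged documents."""
    return {"query": {"bool": {"filter": [{"term": {"in_group": True}}]}}}


update_body = build_flag_update_body(["10.0.0.1", "10.0.0.2"])
group_filter = build_in_group_filter()
```

Since the step 1 query is dynamic, the flag would also have to be cleared or recomputed before each new run, which adds to the cost described above.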

Nested objects and/or parent-child relationships

I'm not sure if dynamically creating new documents with nested objects is practical in this case. It seems to me that I would end up copying huge parts of my data.

I'm new to Elasticsearch and NoSQL in general, so perhaps I'm just looking at the problem the wrong way and I shouldn't be trying to emulate a JOIN in the first place.

But this seems like such a common case for segmenting a dataset that it makes me wonder if I am overlooking some other obvious way of doing this.

Any help would be appreciated!

Answer 1:

If I understood your question correctly, you are trying to get a subset of your documents based on a certain condition and then use that subset for further queries, searches, and aggregations.

If so, why would you want to store it in a separate view, as you would in SQL? One of Elasticsearch's main strengths is its ability to cache filters, which greatly reduces query time. With that in mind, every query, search, or aggregation you need to run only requires a term filter expressing the condition from step 1. Any other operations can then be done in the same query, on the already reduced dataset.
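As a minimal sketch of that idea, the condition and the follow-up aggregation can share one request body, with the condition placed in filter context. The field names `request` and `response` and the function name are assumptions for illustration, not part of the original question.

```python
def build_filtered_aggregation(path, url_field="request", agg_field="response"):
    """One search body: a term filter for the condition plus an aggregation on the result."""
    return {
        "size": 0,  # return only the aggregation, not the matching hits
        "query": {"bool": {"filter": [{"term": {url_field: path}}]}},
        "aggs": {"by_value": {"terms": {"field": agg_field}}},
    }


body = build_filtered_aggregation("/signup.php")
```

Note that a filter like this narrows the search to the documents that match the condition themselves, so the aggregation only sees requests for that file.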

If you have other use cases, it may be worth changing how the documents are stored (the mapping) to make retrieval easier and faster.

Answer 2:

This is the workaround I currently use:

Run this bash script to save the IP list from the first query to a temporary index, then use a terms lookup filter (in Kibana) to query using the IP list from step 1.

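#!/usr/bin/env bash

# Connection settings
es_host='https://************'
elk_user='************'
cred=($(pass ELK/************ | tr "\n" " "))  # cred[0] is the password
index_name='iis-************'
index_hostname='"************"'   # keeps its double quotes so it can be pasted into the JSON body
temp_index_path='temp1/_doc/1'    # index/type/id of the document that will hold the IP list
results_limit=1000                # maximum number of IPs (aggregation buckets) to collect
timestamp_gte='"2018-03-20T13:00:00"' # UTC
timestamp_lte='"now"'                 # UTC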



# Step 1: aggregate the distinct client IPs of the matching requests.
resp_data="$(curl -sS -X POST "$es_host/$index_name/_search" -u "$elk_user:${cred[0]}" -H 'Content-Type: application/json; charset=utf-8' -d @- << EOF
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "index_hostname": {
              "query": $index_hostname
            }
          }
        },
        {
          "regexp": {
            "iis.access.url": {
              "value": ".*((jpg)|(jpeg)|(png))"
            }
          }
        }
      ],
      "must_not": {
        "match": {
          "iis.access.agent": {
            "query": "Amazon+CloudFront"
          }
        }
      },
      "filter": {
        "range": {
          "@timestamp": {
            "gte": $timestamp_gte,
            "lte": $timestamp_lte
          }
        }
      }
    }
  },
  "aggs": {
    "whatever": {
      "terms": { "field": "iis.access.remote_ip", "size": $results_limit }
    }
  },
  "size": 0
}
EOF
)"

# Collect the bucket keys (the IPs) into a compact JSON array with jq.
ip_list="$(echo "$resp_data" | jq -c '[.aggregations.whatever.buckets[].key]')"

# Step 2: store the IP list as a single document in the temporary index.
resp_data2="$(curl -sS -X PUT "$es_host/$temp_index_path" -u "$elk_user:${cred[0]}" -H 'Content-Type: application/json; charset=utf-8' -d @- << EOF
{
  "ips": $ip_list
}
EOF
)"

echo "$resp_data2"

Query DSL for the terms lookup filter:

{
 "query": {
   "terms": {
     "iis.access.remote_ip": {
       "id": "1",
       "index": "temp1",
       "path": "ips",
       "type": "_doc"
     }
   }
 }
}
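The same two pieces, the stored document and the lookup query, can also be built programmatically. This is only a sketch: the function names are made up for illustration, and the `type` entry is kept optional on the assumption that newer Elasticsearch versions no longer use mapping types.

```python
def build_lookup_document(ips, path="ips"):
    """Document stored in the temporary index; `path` is the field holding the IP list."""
    return {path: list(ips)}


def build_terms_lookup_query(field, index, doc_id, path="ips", doc_type=None):
    """Terms lookup: match `field` against the values stored at `path` in another document."""
    lookup = {"index": index, "id": doc_id, "path": path}
    if doc_type is not None:
        lookup["type"] = doc_type  # only include a mapping type when one is given
    return {"query": {"terms": {field: lookup}}}


lookup_doc = build_lookup_document(["10.0.0.1", "10.0.0.2"])
lookup_query = build_terms_lookup_query(
    "iis.access.remote_ip", index="temp1", doc_id="1", doc_type="_doc"
)
```

With these arguments, `lookup_query` has the same content as the query shown above, and the IP list never has to be sent back from the client with each search.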

Source: https://stackoverflow.com/questions/33503808/what-is-the-best-way-to-create-a-subset-of-my-data-in-elasticsearch

Tags: sql, ElasticSearch, join, nosql