ElasticSearch: Getting old visitor data into an index

Submitted by 馋奶兔 on 2020-01-17 11:42:08

Question


I'm learning ElasticSearch in the hope of dumping my business data into ES and viewing it with Kibana. After a week of various issues I finally have ES and Kibana working (1.7.0 and 4 respectively) on two clustered Ubuntu 14.04 desktop machines.

The issue I'm having now is how best to get the data into ES. The data flow is that I capture the PHP global variables $_REQUEST and $_SERVER for each visit in a text file named with a unique ID. From there, if the visitor fills in a form, I capture that data in a text file with the same unique ID as its name, in a different directory. Then my customers tell me whether that form fill was any good, with a delay of up to 50 days.

So I'm starting with the visitor data - $_REQUEST and $_SERVER. A lot of it is redundant so I'm really just attempting to capture the timestamp of their arrival, their IP, the IP of the server they visited, the domain they visited, the unique ID, and their User Agent. So I created this mapping:

time_date_mapping = { 'type': 'date_time' }
str_not_analyzed = { 'type': 'string'} # Originally this included 'index': 'not_analyzed' as well

visit_mapping = {
    'properties': {
        'uniqID': str_not_analyzed,
        'pages': str_not_analyzed,
        'domain': str_not_analyzed,
        'Srvr IP': str_not_analyzed,
        'Visitor IP': str_not_analyzed,
        'Agent': { 'type': 'string' },
        'Referrer': { 'type': 'string' },
        'Entrance Time': time_date_mapping, # Stored as a Unix timestamp
        'Request Time': time_date_mapping, # Stored as a Unix timestamp
        'Raw': { 'type': 'string', 'index': 'not_analyzed' },
    },
}

I then enter it into ES with:

es.index(
            index=Visit_to_ElasticSearch.INDEX,
            doc_type=Visit_to_ElasticSearch.DOC_TYPE,
            id=self.uniqID,
            timestamp=int(math.floor(self._visit['Entrance Time'])),
            body=visit
        )

When I look at the data in the index on ES, only Entrance Time, _id, _type, domain, and uniqID are indexed for searching (according to Kibana). All of the data is present in the document, but most of the fields show "Unindexed fields can not be searched."

Additionally, I was attempting to get a pie chart of the Agents, but I couldn't figure out how to get it visualized: no matter which boxes I click, the Agent field is never offered as an option for aggregation. I mention it because the fields that are indexed do show up.

I've been attempting to mimic the mapping examples in the elasticsearch.py example which pulls in GitHub. Can someone correct how I'm using that mapping?

Thanks

------------ Mapping -------------

{
  "visits": {
    "mappings": {
      "visit": {
        "properties": {
          "Agent": {
            "type": "string"
          },
          "Entrance Time": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "Raw": {
            "properties": {
              "Entrance Time": {
                "type": "double"
              },
              "domain": {
                "type": "string"
              },
              "uniqID": {
                "type": "string"
              }
            }
          },
          "Referrer": {
            "type": "string"
          },
          "Request Time": {
            "type": "string"
          },
          "Srvr IP": {
            "type": "string"
          },
          "Visitor IP": {
            "type": "string"
          },
          "domain": {
            "type": "string"
          },
          "uniqID": {
            "type": "string"
          }
        }
      }
    }
  }
}

------------- Update and New Mapping -----------

So I deleted the index and recreated it. The original index had some data in it from before I knew anything about mapping the data to specific field types. This seemed to fix the issue with only a few fields being indexed.

However, parts of my mapping appear to be ignored. Specifically the Agent string mapping:

visit_mapping = {
    'properties': {
        'uniqID': str_not_analyzed,
        'pages': str_not_analyzed,
        'domain': str_not_analyzed,
        'Srvr IP': str_not_analyzed,
        'Visitor IP': str_not_analyzed,
        'Agent': { 'type': 'string', 'index': 'not_analyzed' },
        'Referrer': { 'type': 'string' },
        'Entrance Time': time_date_mapping,
        'Request Time': time_date_mapping,
        'Raw': { 'type': 'string', 'index': 'not_analyzed' },
    },
}

Here's the output of http://localhost:9200/visits_test2/_mapping

{
  "visits_test2": {
    "mappings": {
      "visit": {
        "properties": {
          "Agent": {"type":"string"},
          "Entrance Time": {"type":"date","format":"dateOptionalTime"},
          "Raw": {
            "properties": {
              "Entrance Time": {"type":"double"},
              "domain": {"type":"string"},
              "uniqID": {"type":"string"}
            }
          },
          "Referrer": {"type":"string"},
          "Request Time": {"type":"date","format":"dateOptionalTime"},
          "Srvr IP": {"type":"string"},
          "Visitor IP": {"type":"string"},
          "domain": {"type":"string"},
          "uniqID": {"type":"string"}
        }
      }
    }
  }
}

Note that I've used an entirely new index. I wanted to make sure nothing was carrying over from one index to the next.

Note that I'm using the Python library elasticsearch.py and following their examples for mapping syntax.
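
One way to check whether a declared mapping actually took effect is to compare it, field by field, with what the _mapping endpoint reports. The following is only a sketch: mapping_mismatches is a hypothetical helper (not part of elasticsearch.py), and it assumes the reported dict has the same shape as the _mapping output shown above.

```python
def mapping_mismatches(expected_props, reported, index, doc_type):
    """Return {field: (wanted, reported)} for fields whose reported
    mapping lacks, or disagrees with, a setting that was asked for."""
    actual = reported[index]['mappings'][doc_type]['properties']
    mismatches = {}
    for field, wanted in expected_props.items():
        got = actual.get(field)
        if got is None:
            # The field is not in the index mapping at all.
            mismatches[field] = (wanted, None)
            continue
        # Compare only the keys that were explicitly requested; ES may
        # add defaults such as "format" that we did not ask for.
        for key, value in wanted.items():
            if got.get(key) != value:
                mismatches[field] = (wanted, got)
                break
    return mismatches


# Trimmed copy of the _mapping output shown above.
reported = {
    'visits_test2': {'mappings': {'visit': {'properties': {
        'Agent': {'type': 'string'},
        'Referrer': {'type': 'string'},
        'Raw': {'properties': {'uniqID': {'type': 'string'}}},
    }}}}
}
expected = {
    'Agent': {'type': 'string', 'index': 'not_analyzed'},
    'Referrer': {'type': 'string'},
    'Raw': {'type': 'string', 'index': 'not_analyzed'},
}
problems = mapping_mismatches(expected, reported, 'visits_test2', 'visit')
```

Here problems contains entries for Agent and Raw but not for Referrer, which matches the symptom described: the 'index': 'not_analyzed' setting never reached the index.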

--------- Python Code for Entering Data into ES, per comment request -----------

Below is a file named mapping.py. I have not yet fully commented the code, since it was only meant to test whether this method of data entry into ES was viable. If it is not self-explanatory, let me know and I'll add more comments.

Note that I programmed in PHP for years before picking up Python. To get up and running faster with Python, I created a couple of files with basic string and file manipulation functions and made them into a package. They are written in Python, and each is meant to mimic the behavior of a built-in PHP function. So when you see a call to php_basic_*, it is one of those functions.

# Standard Library Imports
import json, copy, datetime, time, enum, os, sys, numpy, math
from datetime import datetime
from enum import Enum, unique
from elasticsearch import Elasticsearch

# My Library
import basicconfig, mybasics
from mybasics.cBaseClass import BaseClass, BaseClassErrors
from mybasics.cHelpers import HandleErrors, LogLvl

# This imports several constants, a couple of functions, and a helper class
from basicconfig.startup_config import *

# Connect to ElasticSearch
es = Elasticsearch([{'host': 'localhost', 'port': '9200'}])

# Create mappings of a visit
time_date_mapping = { 'type': 'date_time' }
str_not_analyzed = { 'type': 'string'} # This originally included 'index': 'not_analyzed' as well

visit_mapping = {
    'properties': {
        'uniqID': str_not_analyzed,
        'pages': str_not_analyzed,
        'domain': str_not_analyzed,
        'Srvr IP': str_not_analyzed,
        'Visitor IP': str_not_analyzed,
        'Agent': { 'type': 'string', 'index': 'not_analyzed' },
        'Referrer': { 'type': 'string' },
        'Entrance Time': time_date_mapping,
        'Request Time': time_date_mapping,
        'Raw': { 'type': 'string', 'index': 'not_analyzed' },
        'Pages': { 'type': 'string', 'index': 'not_analyzed' },
    },
}


class Visit_to_ElasticSearch(object):
    """

    """

    INDEX = 'visits'
    DOC_TYPE = 'visit'



    def __init__(self, fname, index=True):
        """

        """

        self._visit = json.loads(php_basic_files.file_get_contents(fname))
        self._pages = self._visit.pop('pages')

        self.uniqID = self._visit['uniqID']
        self.domain = self._visit['domain']
        self.entrance_time = self._convert_time(self._visit['Entrance Time'])

        # Get a list of the page IDs
        self.pages = self._pages.keys()

        # Extra IPs and such from a single page
        page = self._pages[self.pages[0]]
        srvr = page['SERVER']
        req = page['REQUEST']

        self.visitor_ip = srvr['REMOTE_ADDR']
        self.srvr_ip = srvr['SERVER_ADDR']
        self.request_time = self._convert_time(srvr['REQUEST_TIME'])

        self.agent = srvr['HTTP_USER_AGENT']

        # Now go grab data that might not be there...
        self._extract_optional()

        if index is True:
            self.index_with_elasticsearch()


    def _convert_time(self, ts):
        """

        """

        try:
            dt = datetime.fromtimestamp(ts)
        except TypeError:
            dt = datetime.fromtimestamp(float(ts))

        return dt.strftime('%Y-%m-%dT%H:%M:%S')         


    def _extract_optional(self):
        """

        """

        self.referrer = ''


    def index_with_elasticsearch(self):
        """

        """

        visit = {
            'uniqID': self.uniqID,
            'pages': [],
            'domain': self.domain,
            'Srvr IP': self.srvr_ip,
            'Visitor IP': self.visitor_ip,
            'Agent': self.agent,
            'Referrer': self.referrer,
            'Entrance Time': self.entrance_time,
            'Request Time': self.request_time,
            'Raw': self._visit,
            'Pages': php_basic_str.implode(', ', self.pages),
        }

        es.index(
            index=Visit_to_ElasticSearch.INDEX,
            doc_type=Visit_to_ElasticSearch.DOC_TYPE,
            id=self.uniqID,
            timestamp=int(math.floor(self._visit['Entrance Time'])),
            body=visit
        )   


es.indices.create(
    index=Visit_to_ElasticSearch.INDEX,
    body={
        'settings': {
            'number_of_shards': 5,
            'number_of_replicas': 1,
        }
    },
    # ignore already existing index
    ignore=400
)

In case it matters this is the simple loop I use to dump the data into ES:

for f in all_files:
    try:
        visit = mapping.Visit_to_ElasticSearch(f)
    except IOError:
        pass

where all_files is a list of all the visit files (full path) I have in my test data set.

Here is a sample visit file from a Googlebot visit:

{u'Entrance Time': 1407551587.7385,
     u'domain': u'############',
     u'pages': {u'6818555600ccd9880bf7acef228c5d47': {u'REQUEST': [],
       u'SERVER': {u'DOCUMENT_ROOT': u'/var/www/####/',
        u'Entrance Time': 1407551587.7385,
        u'GATEWAY_INTERFACE': u'CGI/1.1',
        u'HTTP_ACCEPT': u'*/*',
        u'HTTP_ACCEPT_ENCODING': u'gzip,deflate',
        u'HTTP_CONNECTION': u'Keep-alive',
        u'HTTP_FROM': u'googlebot(at)googlebot.com',
        u'HTTP_HOST': u'############',
        u'HTTP_IF_MODIFIED_SINCE': u'Fri, 13 Jun 2014 20:26:33 GMT',
        u'HTTP_USER_AGENT': u'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        u'PATH': u'/usr/local/bin:/usr/bin:/bin',
        u'PHP_SELF': u'/index.php',
        u'QUERY_STRING': u'',
        u'REDIRECT_SCRIPT_URI': u'http://############/',
        u'REDIRECT_SCRIPT_URL': u'############',
        u'REDIRECT_STATUS': u'200',
        u'REDIRECT_URL': u'############',
        u'REMOTE_ADDR': u'############',
        u'REMOTE_PORT': u'46271',
        u'REQUEST_METHOD': u'GET',
        u'REQUEST_TIME': u'1407551587',
        u'REQUEST_URI': u'############',
        u'SCRIPT_FILENAME': u'/var/www/PIAN/index.php',
        u'SCRIPT_NAME': u'/index.php',
        u'SCRIPT_URI': u'http://############/',
        u'SCRIPT_URL': u'/############/',
        u'SERVER_ADDR': u'############',
        u'SERVER_ADMIN': u'admin@############',
        u'SERVER_NAME': u'############',
        u'SERVER_PORT': u'80',
        u'SERVER_PROTOCOL': u'HTTP/1.1',
        u'SERVER_SIGNATURE': u'<address>Apache/2.2.22 (Ubuntu) Server at ############ Port 80</address>\n',
        u'SERVER_SOFTWARE': u'Apache/2.2.22 (Ubuntu)',
        u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'},
       u'SESSION': {u'Entrance Time': 1407551587.7385,
        u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'}}},
     u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'}

Answer 1:


Now I understand better why the Raw field is an object instead of a simple string: it is assigned self._visit, which in turn was initialized with json.loads(php_basic_files.file_get_contents(fname)), so it is a dict rather than a string.
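
If the intent is for Raw to be a plain string, as the mapping declares, the dict would have to be serialized before indexing. A minimal sketch of that idea, where build_visit_document is a hypothetical helper and only a few of the fields are shown:

```python
import json


def build_visit_document(visit, agent):
    """Build the document body with 'Raw' stored as a JSON string.

    `visit` is the dict parsed from the visit file. Passing the dict
    itself as 'Raw' makes ES treat the field as an object with
    sub-fields, which is what the _mapping output shows.
    """
    return {
        'uniqID': visit['uniqID'],
        'domain': visit['domain'],
        'Agent': agent,
        # Serialize so the field matches a {'type': 'string'} mapping.
        'Raw': json.dumps(visit, sort_keys=True),
    }


sample = {
    'uniqID': 'bbc398716f4703cfabd761cc8d4101a1',
    'domain': 'example.com',  # placeholder; the real domain is masked
    'Entrance Time': 1407551587.7385,
}
doc = build_visit_document(
    sample,
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
)
```

The original structure can still be recovered later with json.loads(doc['Raw']).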

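Anyway, based on all the information you've given above, my take is that the mapping was never installed via put_mapping: the code defines visit_mapping but never passes it to ES, and the index is created with settings only. Without an installed mapping, ES infers the field types dynamically from the documents it receives, so nothing else can work the way you'd like. I suggest you modify your code to install the mapping before you index your first visit document.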
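
A rough sketch of what that could look like, passing the mapping along when the index is created. Several things here are assumptions rather than your code: create_index_with_mapping is a hypothetical helper, the string fields are set to not_analyzed as your comment says they originally were, and the date fields use 'type': 'date' because, as far as I know, date_time is the name of a date format in ES 1.x rather than a field type.

```python
# Assumed field definitions; adjust to taste.
NOT_ANALYZED = {'type': 'string', 'index': 'not_analyzed'}
DATE = {'type': 'date'}  # assumption: 'date' is the intended type

VISIT_MAPPING = {
    'properties': {
        'uniqID': NOT_ANALYZED,
        'pages': NOT_ANALYZED,
        'domain': NOT_ANALYZED,
        'Srvr IP': NOT_ANALYZED,
        'Visitor IP': NOT_ANALYZED,
        'Agent': NOT_ANALYZED,
        'Referrer': {'type': 'string'},
        'Entrance Time': DATE,
        'Request Time': DATE,
        'Raw': NOT_ANALYZED,
        'Pages': NOT_ANALYZED,
    },
}


def create_index_with_mapping(es, index, doc_type, mapping,
                              shards=5, replicas=1):
    """Create `index` with `mapping` installed for `doc_type`.

    `es` is an elasticsearch.Elasticsearch client (or anything that
    exposes a compatible indices.create method).
    """
    body = {
        'settings': {
            'number_of_shards': shards,
            'number_of_replicas': replicas,
        },
        # The mapping travels with the create call, so it is in place
        # before the first document is indexed.
        'mappings': {doc_type: mapping},
    }
    # Deliberately no ignore=400: if the index already exists, the
    # create call fails loudly instead of silently keeping an old
    # mapping.
    return es.indices.create(index=index, body=body)
```

With a real client this would be called once, before the indexing loop, for example create_index_with_mapping(es, 'visits_test3', 'visit', VISIT_MAPPING), where visits_test3 is a made-up index name. For an index that already exists, es.indices.put_mapping is the other route, though ES will generally refuse to change the settings of a field that has already been mapped.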



Source: https://stackoverflow.com/questions/32407467/elasticsearch-getting-old-visitor-data-into-an-index
