Session windows in Apache Beam with Python


Question


I have a stream of user events. I've mapped them into KV{ userId, event }, and assigned timestamps.

This is to run in streaming mode. I would like to be able to produce the following input/output behavior:

session window gap=1

  • input: user=1, timestamp=1, event=a
  • input: user=2, timestamp=2, event=a
  • input: user=2, timestamp=3, event=a
  • input: user=1, timestamp=2, event=b
  • time: lwm=3
  • output: user=1, [ { event=a, timestamp=1 }, { event=b, timestamp=2 } ]
  • time: lwm=4
  • output: user=2, [ { event=a, timestamp=2 }, { event=a, timestamp=3 } ]

That way I can write my function to reduce the list of events in the user's session window, and also use the start and end time of that session window.

How do I write this? (If you answer "look at the examples", that's not a valid answer, because the examples never feed the list of events into the reducer with the window as a parameter.)


Answer 1:


If I understand this correctly, this is a follow-up to this question, and it can be naturally accomplished by adding a Group By Key step to the solution I proposed there.

So, referring to my previous explanation and focusing here on the changes only, if we have a pipeline like this:

events = (p
  | 'Create Events' >> beam.Create(user1_data + user2_data)
  # Attach each event's own timestamp to the element.
  | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp']))
  # Key on user_id so sessions are computed per user.
  | 'keyed_on_user_id' >> beam.Map(lambda x: (x['user_id'], x))
  | 'user_session_window' >> beam.WindowInto(
        window.Sessions(session_gap),
        timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW)
  # Collect all events of a user that fall into the same session window.
  | 'Group' >> beam.GroupByKey()
  | 'analyze_session' >> beam.ParDo(AnalyzeSession()))
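
For completeness: this snippet references a pipeline object p, a session_gap, the AnalyzeSession DoFn defined further below, and two lists of event dictionaries (user1_data, user2_data) that come from the linked answer. Here is a minimal sketch of those missing pieces, with the imports the chain needs; the timestamps and the gap value are made up for illustration, and the field names follow the log output shown further below:

import time
import apache_beam as beam
from apache_beam.transforms import window  # Sessions and TimestampCombiner used above

now = time.time()
# Two users with a few events each; 'user_id', 'timestamp' and 'value' are the
# fields the pipeline above expects.
user1_data = [{'user_id': 'Groot', 'timestamp': now + i, 'value': 'event_%d' % i}
              for i in range(3)]
user2_data = [{'user_id': 'Thanos', 'timestamp': now + 2 * i, 'value': 'event_%d' % i}
              for i in range(3)]

session_gap = 5  # seconds without new events before a session is considered closed

p = beam.Pipeline()  # DirectRunner by default
# ... the 'events' chain shown above is built on p here ...
p.run().wait_until_finish()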

Now the elements are arranged as described in the question, so we can simply log them in AnalyzeSession:

import logging

class AnalyzeSession(beam.DoFn):
  """Logs per-session information."""
  def process(self, element, window=beam.DoFn.WindowParam):
    logging.info(element)
    yield element

to obtain the desired results:

INFO:root:('Groot', [{'timestamp': 1554203778.904401, 'user_id': 'Groot', 'value': 'event_0'}, {'timestamp': 1554203780.904401, 'user_id': 'Groot', 'value': 'event_1'}])
INFO:root:('Groot', [{'timestamp': 1554203786.904402, 'user_id': 'Groot', 'value': 'event_2'}])
INFO:root:('Thanos', [{'timestamp': 1554203792.904399, 'user_id': 'Thanos', 'value': 'event_4'}])
INFO:root:('Thanos', [{'timestamp': 1554203784.904398, 'user_id': 'Thanos', 'value': 'event_3'}, {'timestamp': 1554203777.904395, 'user_id': 'Thanos', 'value': 'event_0'}, {'timestamp': 1554203778.904397, 'user_id': 'Thanos', 'value': 'event_1'}, {'timestamp': 1554203780.904398, 'user_id': 'Thanos', 'value': 'event_2'}])

If you want to avoid redundant information, such as repeating the user_id and timestamp inside the values, those fields can be dropped in the Map step (see the sketch after the output below). As for the complete use case, i.e. reducing the aggregated events on a per-session level, we can, for example, count the number of events or compute the session duration with something like this:

class AnalyzeSession(beam.DoFn):
  """Logs a per-session summary."""
  def process(self, element, window=beam.DoFn.WindowParam):
    user = element[0]
    num_events = len(element[1])
    # The window received here is the merged session window, so its bounds
    # are the session start and end times.
    window_start = window.start.to_utc_datetime()
    window_end = window.end.to_utc_datetime()
    session_duration = window_end - window_start

    logging.info(">>> User %s had %s event(s) in %s session", user, num_events, session_duration)

    yield element

which, for my example, will output the following:

INFO:root:>>> User Groot had 2 event(s) in 0:00:07 session
INFO:root:>>> User Groot had 1 event(s) in 0:00:05 session
INFO:root:>>> User Thanos had 4 event(s) in 0:00:12 session
INFO:root:>>> User Thanos had 1 event(s) in 0:00:05 session
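
As a follow-up to the note above about dropping redundant fields: one option is to have the keying step emit a trimmed value. A minimal sketch, assuming we only want to keep the event name and its timestamp (the helper name to_keyed_trimmed is hypothetical, not part of the original answer):

def to_keyed_trimmed(event):
  # Key on user_id, but keep only the event name and its timestamp in the
  # value, so the grouped per-session lists don't repeat the user_id.
  return (event['user_id'],
          {'value': event['value'], 'timestamp': event['timestamp']})

# In the pipeline above, the keying step would then become:
#   | 'keyed_on_user_id' >> beam.Map(to_keyed_trimmed)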

Full code here



Source: https://stackoverflow.com/questions/55261957/session-windows-in-apache-beam-with-python
