
#Note# Analyzing Twitter Data with Apache Hadoop 系列 1、2、3

Andy erpingwu@gmail.com
2013/09/28-2013/09/30

Markdown syntax highlighting renders incorrectly on the oschina blog but works fine on git.oschina.net: http://git.oschina.net/wuerping/notes/blob/master/2013/2013-09-30/AnalyzingTwitterDatawithApacheHadoop.md

Analyzing Twitter Data with Apache Hadoop

This is the first article in the series. It covers how to design an end-to-end data pipeline for analyzing Twitter data using Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive.

Who is Influential?

  • Now we know the question we want to ask: Which Twitter users get the most retweets? Who is influential within our industry?
  • In plainer terms: figure out who the big-name influencers (the "Big V" accounts) are.

How Do We Answer These Questions?

  • However, querying Twitter data in a traditional RDBMS is inconvenient, since the Twitter Streaming API outputs tweets in a JSON format which can be arbitrarily complex.
  • A traditional RDBMS would work, but the tweets produced by the Twitter Streaming API are in a complex JSON format, which makes them inconvenient to query that way.
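A minimal Python sketch of why nested JSON is awkward for a flat relational schema. The tweet below is a drastically simplified, hypothetical payload, not the full Streaming API output:

```python
import json

# A toy tweet shaped like the Streaming API output: nested objects
# (user, retweeted_status) do not map cleanly onto flat RDBMS rows.
raw = '''
{
  "text": "RT @ScottOstby: Tweets!",
  "user": {"screen_name": "ParvezJugon", "followers_count": 50},
  "retweeted_status": {
    "text": "Tweets!",
    "user": {"screen_name": "ScottOstby"}
  }
}
'''

tweet = json.loads(raw)

# Reaching a nested field takes a chain of lookups; a relational schema
# would need either a JSON-aware engine or flattening into many tables.
original_author = tweet["retweeted_status"]["user"]["screen_name"]
print(original_author)  # ScottOstby
```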

  • The diagram above shows a high-level view of how some of the CDH (Cloudera’s Distribution Including Apache Hadoop) components can be pieced together to build the data pipeline we need to answer the questions we have.

Gathering Data with Apache Flume

  • The two ends of the data flow are sources and sinks.
  • Each individual piece of data (a tweet) is called an event.
  • Sources produce events; events travel from source to sink through a channel.
  • The sink is responsible for writing the data to a predefined destination.
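Wired together, a Flume agent definition for this pipeline might look like the sketch below. The `TwitterAgent`/`MemChannel`/`HDFS` names match the configuration snippets used in these notes; the source class name is the one from Cloudera's cdh-twitter-example and should be checked against the actual build:

```properties
# Sketch: one agent, one source, one channel, one sink
TwitterAgent.sources  = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks    = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.channels.MemChannel.type = memory

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
```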

Sources supported by Flume

Partition Management with Oozie

  • Apache Oozie is a workflow coordination system that can be used to solve this problem.
  • Oozie is an extremely flexible system for designing job workflows, which can be scheduled to run based on a set of criteria.
  • We can configure the workflow to run an ALTER TABLE command that adds a partition containing the last hour’s worth of data into Hive, and we can instruct the workflow to occur every hour.

Apache Oozie is used to add a new partition every hour.
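The command the hourly Oozie workflow runs is an `ALTER TABLE ... ADD PARTITION`, along these lines. The hour value and path are hypothetical; the path layout follows the Flume HDFS sink pattern shown in Part 2:

```sql
-- Register the 2013-09-28 15:00 hour of Flume output as a Hive partition
ALTER TABLE tweets ADD IF NOT EXISTS
  PARTITION (datehour = 2013092815)
  LOCATION '/user/flume/tweets/2013/09/28/15';
```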

Querying Complex Data with Hive

  • Hive expects that input files use a delimited row format, but our Twitter data is in a JSON format, which will not work with the defaults.
  • The schema is only really enforced when we read the data, and we can use the Hive SerDe interface to specify how to interpret what we’ve loaded.
  • Hive defaults to a delimited row format, so how do we handle JSON? With a Hive SerDe. The example JSON is too long to reproduce here; see the original article.
  • An example query:
SELECT created_at, entities, text, user
FROM tweets
WHERE user.screen_name='ParvezJugon'
  AND retweeted_status.user.screen_name='ScottOstby';

Some Results

  • A more complex query (see the original article)

Conclusion

Analyzing Twitter Data with Apache Hadoop, Part 2: Gathering Data with Flume

**This is the second article in the series. Part 1 covered how to integrate the CDH components into one application; this part goes into each component in depth.**

Sources

Sources come in two different flavors:

  • event-driven
  • pollable

The difference between the two is essentially push versus pull:

  • Event-driven sources typically receive events through mechanisms like callbacks or RPC calls.
  • Pollable sources, in contrast, operate by polling for events every so often in a loop.
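The push vs. pull distinction can be sketched in plain Python. This is illustrative only; real Flume sources implement the `EventDrivenSource`/`PollableSource` interfaces in Java:

```python
import queue

# Push (event-driven): the outside world calls us back with each event.
class EventDrivenSource:
    def __init__(self, channel):
        self.channel = channel

    def on_event(self, event):        # callback invoked by an RPC/stream client
        self.channel.put(event)

# Pull (pollable): we sit in a loop and ask "anything new?" on each pass.
class PollableSource:
    def __init__(self, upstream, channel):
        self.upstream = upstream      # anything with a non-blocking poll()
        self.channel = channel

    def process(self):                # the framework calls this repeatedly
        event = self.upstream.poll()
        if event is None:
            return "BACKOFF"          # nothing to do; caller may sleep
        self.channel.put(event)
        return "READY"

class FakeUpstream:
    def __init__(self, items):
        self.items = list(items)
    def poll(self):
        return self.items.pop(0) if self.items else None

channel = queue.Queue()

pushed = EventDrivenSource(channel)
pushed.on_event({"body": "tweet 1"})                        # push style

puller = PollableSource(FakeUpstream([{"body": "tweet 2"}]), channel)
status = puller.process()                                   # pull style

print(channel.qsize(), status)  # 2 READY
```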

Examining the TwitterSource

Configuring the Flume Agent

Channels

This example uses a Memory Channel:

TwitterAgent.channels.MemChannel.type = memory

Sinks

A handy configuration feature:

TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/

The timestamp comes from a header that TwitterSource adds to each event:

headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));

Starting the Agent

/etc/default/flume-ng-agent contains the environment variable FLUME_AGENT_NAME.

$ /etc/init.d/flume-ng-agent start

/user/flume/tweets

natty@hadoop1:~/source/cdh-twitter-example$ hadoop fs -ls /user/flume/tweets/2012/09/20/05
  Found 2 items
  -rw-r--r--   3 flume hadoop   255070 2012-09-20 05:30 /user/flume/tweets/2012/09/20/05/FlumeData.1348143893253
  -rw-r--r--   3 flume hadoop   538616 2012-09-20 05:39 /user/flume/tweets/2012/09/20/05/FlumeData.1348143893254.tmp

Events are first written to a .tmp file; when the event-count or time threshold is reached, the file is rolled (renamed). The relevant parameters are rollCount and rollInterval.
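The rolling behavior is controlled by HDFS sink properties like the following; the values here are illustrative, not the article's:

```properties
# Roll to a new file after 10,000 events or 600 seconds, whichever
# comes first; 0 disables a criterion. rollSize (bytes) is a third
# trigger, disabled here.
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
```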

Conclusion

Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive

This is the third article in the series. It discusses Hive's strengths and weaknesses, and argues that Hive is the right choice for this tweet-analysis application.

Characterizing Data

  • well-structured
  • unstructured, semi-structured, and poly-structured

Complex Data Structures

SELECT array_column[0] FROM foo;
SELECT map_column['map_key'] FROM foo;

SELECT struct_column.struct_field FROM foo;
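For the three queries above to parse, `foo` would need column types along these lines (a hypothetical table, shown only to make the syntax concrete):

```sql
CREATE TABLE foo (
  array_column  ARRAY<STRING>,
  map_column    MAP<STRING, INT>,
  struct_column STRUCT<struct_field: STRING>
);
```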

A Table for Tweets

Table design

CREATE EXTERNAL TABLE tweets (
  ...
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING, name:STRING>>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  ...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';


SELECT entities.user_mentions[0].screen_name FROM tweets;

How JSON objects map to Hive columns

Serializers and Deserializers

In Hive, SerDe is an abbreviation of Serializer and Deserializer.

Putting It All Together

One Thing to Watch Out For…

If it looks like a duck and sounds like a duck, it must be a duck, right? New Hive users should be careful not to mistake Hive for a relational database.

Conclusion
