How to read data files generated by Flume from Twitter

Posted by 时间秒杀一切 on 2019-12-11 23:24:03

Question


I have generated a few Twitter data log files using Flume on HDFS. What is the actual format of the log file? I was expecting data in JSON format, but it looks like this. Could someone help me understand how to read this data, or tell me what is wrong with the way I have done this?
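For reference, the Flume HDFS sink writes SequenceFiles by default (`hdfs.fileType = SequenceFile`), which is one common reason the output does not look like plain JSON. A hedged sketch of sink settings that write raw JSON text instead (the agent and sink names `TwitterAgent`/`HDFS` and the path are assumptions; adjust to your own flume.conf):

```properties
# Assumed agent/sink names; adapt to your configuration.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
# Write events as plain text (one JSON object per line) instead of SequenceFiles.
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
```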


Answer 1:


Download the file (hive-serdes-1.0-SNAPSHOT.jar) from this link:
http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar

Then put this file in your $HIVE_HOME/lib directory and add the jar in the Hive shell:

hive> ADD JAR file:///home/hadoop/work/hive-0.10.0/lib/hive-serdes-1.0-SNAPSHOT.jar;

Create a table in Hive:

hive> CREATE TABLE tweets (
        id BIGINT,
        created_at STRING,
        source STRING,
        favorited BOOLEAN,
        retweeted_status STRUCT<
          text:STRING,
          user:STRUCT<screen_name:STRING,name:STRING>,
          retweet_count:INT>,
        entities STRUCT<
          urls:ARRAY<STRUCT<expanded_url:STRING>>,
          user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
          hashtags:ARRAY<STRUCT<text:STRING>>>,
        text STRING,
        user STRUCT<
          screen_name:STRING,
          name:STRING,
          friends_count:INT,
          followers_count:INT,
          statuses_count:INT,
          verified:BOOLEAN,
          utc_offset:INT,
          time_zone:STRING>,
        in_reply_to_screen_name STRING
      )
      ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe';

Load data into the table from HDFS:

hive> load data inpath '/home/hadoop/work/flumedata' into table tweets;

Now analyze your Twitter data from this table:

hive> select id,text,user from tweets;
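Because the columns are nested structs and arrays, you can drill into them directly. For example, a sketch counting the most frequent hashtags with standard Hive syntax (assuming the `tweets` table defined above):

```sql
hive> SELECT LOWER(ht.text) AS hashtag, COUNT(*) AS cnt
      FROM tweets
      LATERAL VIEW EXPLODE(entities.hashtags) t AS ht
      GROUP BY LOWER(ht.text)
      ORDER BY cnt DESC
      LIMIT 10;
```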

You are done. Note that the SerDe deserializes the raw JSON each time the table is read; to get JSON text back out, you serialize the query results again from the Hive table.
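One way to re-serialize query results outside Hive is a minimal Python sketch; the field names mirror the table above, and the sample record itself is hypothetical:

```python
import json

# A row as it might come back from the tweets table (hypothetical sample data).
row = {
    "id": 123456789,
    "text": "hello #hadoop",
    "user": {"screen_name": "someone", "name": "Some One"},
}

# Re-serialize the row to a single JSON line, the same shape the raw
# Flume output file stores it in (one JSON object per line).
line = json.dumps(row, sort_keys=True)
print(line)
```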




Answer 2:


Create a table in Hive with the SerDe, then load the Twitter log data into the table and analyze it.



Source: https://stackoverflow.com/questions/35809205/how-to-read-data-files-generated-by-flume-from-twitter
