HDFS | 易学教程

Hive Installation and Configuration Guide (with a Detailed Look at the Hive Metastore)


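Personal homepage: http://www.linbingdong.com

This article walks through the whole process of installing and configuring Hive, including MySQL, Hive, and the Metastore, and analyzes how the three Metastore configuration modes differ. Many online articles describe the three modes, but they get them wrong and have misled a lot of readers. I read the Hive Metastore sections of the official Apache and CDH documentation in detail, worked through the pitfalls in practice, and finally got the installation working, so I am recording the process here for reference.

1. Concepts

Hive Metastore has three configuration modes:

Embedded Metastore Database (Derby): embedded mode
Local Metastore Server: local metastore
Remote Metastore Server: remote metastore

1.1 What Metadata and the Metastore Do

Metadata holds the meta-information for the databases, tables, and other objects created with Hive. It is stored in a relational database such as Derby or MySQL.

The Metastore works as follows: a client connects to the metastore service, and the metastore in turn connects to the MySQL database to read and write metadata. With a metastore service in place, multiple clients can connect at the same time, and they do not need to know the MySQL username and password; they only need to connect to the metastore service.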
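The split of responsibilities described above can be sketched as two configuration files. This is a minimal illustration rather than the article's own setup: the host names, database name, and credentials are hypothetical, while the property names (`javax.jdo.option.*`, `hive.metastore.uris`) and the default Thrift port 9083 are standard Hive settings. The point it shows is that only the metastore host carries MySQL credentials.

```python
import xml.etree.ElementTree as ET


def build_site_xml(props):
    """Render a dict of property names/values as a Hadoop-style configuration document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical hosts and credentials, for illustration only.
METASTORE_HOST = "metastore.example.com"
MYSQL_HOST = "mysql.example.com"

# hive-site.xml on the machine running the metastore service: it talks to MySQL.
server_props = {
    "javax.jdo.option.ConnectionURL": f"jdbc:mysql://{MYSQL_HOST}:3306/hive",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "change-me",
}

# hive-site.xml on a client: only the metastore's Thrift address is needed.
client_props = {
    "hive.metastore.uris": f"thrift://{METASTORE_HOST}:9083",
}

server_xml = build_site_xml(server_props)
client_xml = build_site_xml(client_props)
print(client_xml)
```

The client file contains no database URL or password, which is what lets many clients share one metastore without knowing the MySQL credentials.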

Personal Hive Usage Notes (Continuously Updated)


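1. N consecutive days (for example, 12 consecutive login days): sort the dates in ascending order, apply rank, then subtract the rank from the date. Rows that share the same resulting value belong to one consecutive run, and the number of such rows is the number of consecutive days.
2. The data has only the month and that month's count; add a third column holding the sum of the preceding 12 months: sum(ct2.CREATE_PROJECT_CURRENT_MONTH_CNT) over (ORDER BY ct2.CURRENT_MONTH_ID ASC ROWS BETWEEN 12 preceding AND 1 preceding). If the raw data has missing months, fill them in first with a default value of 0.
3. The data has only the month and that month's count; add a third column with the previous month's count and a fourth column with the count for the same month last year: use left join together with case when to fill the values flexibly.
4. The data has only the month and that month's count; add a third column with the year-to-date total: group by year.
5. Columns to rows and rows to columns: concat, concat_ws, collect_set, collect_list, lateral view explode(collection), lateral view explode(split(order_value, ','))
6. Data type conversion: cast(xxx as xxx)
7. Flexible use of case when
8. Data masking: regexp_replace ( selphone , substr ( selphone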
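The first two tricks can be sketched outside Hive to make the logic concrete. The following is a plain-Python illustration, not Hive code; the function names and sample data are made up. `longest_streak` applies the "date minus rank" idea from item 1, and `prior_window_sum` mimics the frame `ROWS BETWEEN 12 preceding AND 1 preceding` from item 2.

```python
from datetime import date, timedelta
from itertools import groupby


def longest_streak(days):
    """Length of the longest run of consecutive calendar days."""
    ordered = sorted(set(days))
    # Within a run of consecutive days, (date - rank) stays constant.
    keyed = [(d - timedelta(days=rank), d) for rank, d in enumerate(ordered)]
    runs = (len(list(group)) for _, group in groupby(keyed, key=lambda kv: kv[0]))
    return max(runs, default=0)


def prior_window_sum(values, size=12):
    """For each position, sum the `size` values before it (current row excluded).

    Note: Hive's SUM over an empty frame yields NULL; this sketch yields 0.
    """
    return [sum(values[max(0, i - size):i]) for i in range(len(values))]


logins = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3),
          date(2024, 1, 5), date(2024, 1, 6)]
print(longest_streak(logins))                   # 3
print(prior_window_sum([1, 2, 3, 4], size=2))   # [0, 1, 3, 5]
```

As item 2 notes, the window counts rows rather than calendar months, so missing months need to be filled with 0 before the sum is meaningful.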

Improve Query Performance From a Large HDFStore Table with Pandas


Question: I have a large (~160 million rows) dataframe that I've stored to disk with something like this:

    import glob
    import pandas as pd

    def fillStore(store, tablename):
        files = glob.glob('201312*.csv')
        names = ["ts", "c_id", "f_id", "resp_id", "resp_len", "s_id"]
        for f in files:
            df = pd.read_csv(f, parse_dates=True, index_col=0, names=names)
            store.append(tablename, df, format='table', data_columns=['c_id', 'f_id'])

The table has a time index and I will query using c_id and f_id in addition to times (via the index). I have another

Load JSON array into Pig


Question: I have a JSON file with the following format:

    [
      {
        "id": 2,
        "createdBy": 0,
        "status": 0,
        "utcTime": "Oct 14, 2014 4:49:47 PM",
        "placeName": "21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia",
        "longitude": 77.5983817,
        "latitude": 12.9832418,
        "createdDate": "Sep 16, 2014 2:59:03 PM",
        "accuracy": 5,
        "loginType": 1,
        "mobileNo": "0000005567"
      },
      {
        "id": 4,
        "createdBy": 0,
        "status": 0,
        "utcTime": "Oct 14, 2014 4:52:48 PM",
        "placeName": "21/F, Cunningham Main Rd, Sampangi Rama

Data lost after shutting down hadoop HDFS?


Question: Hi, I'm learning Hadoop and I have a simple question: after I shut down HDFS (by calling hadoop_home/sbin/stop-dfs.sh), is the data on HDFS lost, or can I get it back?

Answer 1: Data is not lost when you stop HDFS, provided the NameNode and DataNode data is stored in persistent locations specified with these properties: dfs.namenode.name.dir -> Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of
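To make the condition in the answer checkable, here is a small sketch that reads an hdfs-site.xml and flags storage settings that are unset or point under /tmp. It assumes the DataNode property is `dfs.datanode.data.dir` (the excerpt is cut off before naming it) and relies on the fact that the Hadoop defaults place both directories under `hadoop.tmp.dir`, which defaults to `/tmp/hadoop-${user.name}`. The sample XML and paths are made up.

```python
import xml.etree.ElementTree as ET

STORAGE_KEYS = ("dfs.namenode.name.dir", "dfs.datanode.data.dir")


def storage_dirs(hdfs_site_xml):
    """Return {property: [dirs]} for the storage properties set in the XML text."""
    root = ET.fromstring(hdfs_site_xml)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name in STORAGE_KEYS:
            # Values may be comma-delimited lists of directories.
            found[name] = [v.strip() for v in value.split(",") if v.strip()]
    return found


def risky_settings(hdfs_site_xml):
    """List storage properties that are unset or point under /tmp."""
    dirs = storage_dirs(hdfs_site_xml)
    problems = []
    for key in STORAGE_KEYS:
        paths = dirs.get(key)
        if not paths:
            problems.append(f"{key} is not set (default lives under hadoop.tmp.dir)")
        for path in paths or []:
            # Strip an optional file:// scheme before checking the local path.
            local = path[len("file://"):] if path.startswith("file://") else path
            if local.startswith("/tmp/"):
                problems.append(f"{key} points at {path}")
    return problems


# Made-up configuration: the NameNode dir is persistent, the DataNode dir is not.
SAMPLE = """<configuration>
  <property><name>dfs.namenode.name.dir</name><value>file:///data/hdfs/name</value></property>
  <property><name>dfs.datanode.data.dir</name><value>/tmp/hdfs/data</value></property>
</configuration>"""

for problem in risky_settings(SAMPLE):
    print(problem)
```

Stopping HDFS with stop-dfs.sh does not touch these directories; the risk comes from locations such as /tmp that the operating system may clear.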

Flume NG and HDFS


Question: I am very new to Hadoop, so please excuse the basic questions. My understanding is that Hadoop's best use case is large files, which help MapReduce tasks run efficiently. With that in mind, I am somewhat confused about Flume NG. Assume I am tailing a log file and logs are produced every second; the moment the log gets a new line, it is transferred to HDFS via Flume. a) Does this mean that Flume creates a new file on every line that is logged in the log file I am

Miscellaneous Everyday Error Log


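Everyday errors

1. ./cloudera-scm-agent start fails to start. Create the missing folder under this directory:
    cd /opt/cloudera-manager/cm-5.7.0/run
    mkdir cloudera-scm-agent
   Then set ownership:
    chown cloudera-scm:cloudera-scm cloudera-scm-agent
2. ./scm_prepare_database.sh mysql -h myhost1.sf.cloudera.com -utemp -ptemp --scm-host myhost2.sf.cloudera.com scm scm scm fails with a class-not-found exception. The cause is that the mysql-connection.jar package is missing; place the JAR in /opt/cloudera-manager/cm-5.7.0/share/cmf/lib
3. After logging in to CDH, only one managed host is shown. The cause is that server in config.ini under /opt/cloudera-manager/cm-5.7.0/etc/cloudera-scm-agent/ was not changed to the server host's address.
4. To change the MySQL character encoding on Linux, edit my.cnf under /etc and add under [mysqld]: default_character_set=utf8 If there is no [client] section, create [client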

Hadoop Optimization, Part 1: HDFS/MapReduce


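I'm a little embarrassed that this blog has gone a long time (half a year) without an update. I recently set up my own blog as well; I'm still not very familiar with WordPress, so anyone interested is welcome to get in touch. The address is http://www.leocook.org/ I have also created a QQ group, 305994766, and hope that people interested in big data, algorithm development, and system architecture will join so that we can learn and improve together (when joining, please state your company, job, and nickname).

1. Optimizing from the application side

1.1. Reduce unnecessary reduce tasks
If the same data needs to be processed several times, try sorting and partitioning it first, then define a custom InputSplit that feeds one partition to one Map task, process the data in the Map, and set the number of Reduce tasks to zero.

1.2. External file references
Data that must be shared between tasks, such as dictionaries and configuration files, can be distributed with DistributedCache or with the -files option.

1.3. Use a Combiner
A combiner runs on the map side and merges the map output, so less data leaves the map side and less is transferred between map and reduce. Note that a Combiner must not change the job's result; not every MR job can use a Combiner, and whether it can depends on the business logic; and a Combiner runs on the map side, so it cannot work across Map tasks (only Reduce can receive the output of multiple Map tasks).

1.4. Use suitable Writable types
Use binary Writable types wherever possible, for example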
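The effect described in 1.3 can be simulated in a few lines. This is a toy in-memory word count, not Hadoop code; the function names and input are made up. It shows that combining inside each map task shrinks the number of pairs sent to the reduce side while leaving the final result unchanged.

```python
from collections import Counter


def map_words(line):
    """Map step: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def sum_by_key(pairs):
    """Sum values per key; serves as both the combiner and the reducer."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(splits, use_combiner):
    """Run word count over the splits; return (result, pairs sent to reduce)."""
    shuffled = []
    for split in splits:                      # one map task per split
        emitted = [pair for line in split for pair in map_words(line)]
        if use_combiner:
            # Combine within this map task only, never across map tasks.
            emitted = list(sum_by_key(emitted).items())
        shuffled.extend(emitted)
    return sum_by_key(shuffled), len(shuffled)


splits = [["a b a", "a c"], ["b b", "a b"]]   # made-up input, two map tasks
plain_result, plain_pairs = run_job(splits, use_combiner=False)
combined_result, combined_pairs = run_job(splits, use_combiner=True)
print(plain_pairs, combined_pairs)            # 9 5
print(plain_result == combined_result)        # True
```

Summation works here because it is associative and commutative. An operation such as a plain average would give a different result if combined this way, which is why the text says a Combiner is not usable for every job.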

Connecting Eclipse on Windows to HDFS in a Virtual Machine


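1. Edit the configuration file core-site.xml and change localhost to the virtual machine's IP address:
In Ubuntu, open a terminal and run ifconfig to find the virtual machine's IP.
Edit the core-site.xml file under [Hadoop install path]/etc/hadoop.

2. Install Hadoop-Eclipse-Plugin
(The following steps are performed on Windows.)
Download hadoop2x-eclipse-plugin (download: https://github.com/winghc/hadoop2x-eclipse-plugin).
Unzip hadoop2x-eclipse-plugin, copy hadoop-eclipse-plugin-2.6.0.jar from it into the plugins folder of the Eclipse installation directory, and start Eclipse.
Unzip the Hadoop distribution on Windows (download: http://mirror.bit.edu.cn/apache/hadoop/common/); here I unzipped it to D:\hadoop
From the Window menu choose Preference, find Hadoop Map/Reduce on the left, and enter the Hadoop directory you just unzipped.

3. Configure Hadoop-Eclipse-Plugin
From the Window menu choose Show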

The Development Flow Chart You Must Master to Become a Big Data Engineer


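1. Main data processing technologies

Sqoop (pronounced "skup"): an open-source offline data transfer tool, mainly used to move data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL). It can import data from a relational database into Hadoop's HDFS, and it can also export data from HDFS into a relational database.

Flume: an open-source framework for real-time data collection. It is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive volumes of logs, and it is now a top-level Apache project. Flume can collect data such as logs and events and store it centrally for downstream use (especially by stream-processing frameworks such as Storm). A similar framework is Scribe (Facebook's open-source log collection system, which offers a scalable, highly fault-tolerant, simple solution for distributed log collection and unified processing).

Kafka: the rate at which Flume collects data and the rate at which downstream systems process it are usually out of step, so real-time platform architectures place a message broker in between as a buffer, and Kafka is without doubt the most popular and widely used choice. Developed by LinkedIn, it is a distributed messaging system that is widely used for its horizontal scalability and high throughput. The mainstream open-source distributed processing systems (such as Storm and Spark) all support integration with Kafka. Kafka is a distributed publish-subscribe messaging system that is fast, scalable, and durable. Like other publish-subscribe messaging systems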
