partition | 易学教程

Working with Kafka in Go

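Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the activity stream data of a consumer-scale website. It offers high performance, persistence, multi-replica backup, and horizontal scalability. This article describes how to send and receive Kafka messages using Go.

sarama

To connect to Kafka from Go, use the third-party library github.com/Shopify/sarama.

Download and install

go get github.com/Shopify/sarama

Notes

sarama versions after v1.20 add the zstd compression algorithm, which requires cgo. Compiling on Windows then fails with an error similar to the following:

# github.com/DataDog/zstd
exec: "gcc": executable file not found in %PATH%

On Windows, therefore, use sarama v1.19.

Connecting to Kafka and sending a message

package main

import (
	"fmt"

	"github.com/Shopify/sarama"
)

// A Kafka client built on the third-party sarama library
func main() {
	config := sarama.NewConfig()
	config.Producer.RequiredAcks = sarama.WaitForAll // after sending, the leader and the followers must all acknowledge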

spark write to disk with N files less than N partitions

Can we write data to, say, 100 files with 10 partitions in each file? I know we can use repartition or coalesce to reduce the number of partitions, but I have seen some Hadoop-generated Avro data with many more partitions than files. The number of files that get written out is controlled by the parallelism of your DataFrame or RDD. So if your data is split across 10 Spark partitions, you cannot write fewer than 10 files without reducing the partitioning (e.g. with coalesce or repartition). Having said that, when the data is read back in, it could be split into smaller chunks based on your
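
The relationship between partitions and output files can be illustrated with a toy model in plain Python. This is only a sketch and does not use Spark: `write_partitions` and `coalesce` are hypothetical helpers that imitate one output file per partition and a merge of existing partitions into fewer ones.

```python
import os
import tempfile


def write_partitions(partitions, out_dir):
    """Write one part file per partition and return the file paths.

    Toy stand-in for a partitioned writer (not the Spark API).
    """
    paths = []
    for i, rows in enumerate(partitions):
        path = os.path.join(out_dir, f"part-{i:05d}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(f"{row}\n" for row in rows)
        paths.append(path)
    return paths


def coalesce(partitions, n):
    """Merge whole partitions into at most n partitions.

    Rows are never split across the new partitions; existing partitions
    are simply concatenated (toy stand-in for coalesce).
    """
    n = max(1, min(n, len(partitions)))
    merged = [[] for _ in range(n)]
    for i, rows in enumerate(partitions):
        merged[i % n].extend(rows)
    return merged


# 10 partitions holding 5 rows each
partitions = [list(range(p * 5, p * 5 + 5)) for p in range(10)]

with tempfile.TemporaryDirectory() as d:
    files_before = len(write_partitions(partitions, d))

with tempfile.TemporaryDirectory() as d:
    files_after = len(write_partitions(coalesce(partitions, 4), d))

print(files_before, files_after)
```

In this model the only way to get fewer files is to reduce the number of partitions before writing, which mirrors the point made in the answer.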

Kafka (reposted)

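Reposted from: https://fangyeqing.github.io/2016/10/28/kafka---%E4%BB%8B%E7%BB%8D/

Kafka: An Introduction

Kafka is a distributed messaging system. This article is based on version 0.9.0. Newer versions of Kafka add the stream-processing component Kafka Streams, and the latest official documentation describes Kafka as a distributed streaming platform.

Concepts

Broker: a Kafka node. A Kafka cluster contains one or more brokers.
Producer: the producer of messages. It publishes messages to a Kafka broker.
Consumer: the consumer of messages. Each consumer belongs to a specific consumer group (if no group id is specified, it belongs to the default group). With the high-level consumer API, a given message in a topic can be consumed by only one consumer within a consumer group, but multiple consumer groups can consume the same message.
Topic: a message topic. For example, page-view logs, click logs, and conversion logs can each be a topic.
Partition: a physical grouping within a topic. Each topic contains one or more partitions, and the number of partitions can be specified when the topic is created. Each partition is an ordered queue and corresponds to a folder that stores the partition's data and index files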
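
The topic, partition, and consumer-group concepts above can be sketched in plain Python. This is a toy model, not a Kafka client: the `Topic` class, `choose_partition`, and `assign` are hypothetical names, and the CRC32 hash and round-robin assignment are assumptions chosen for simplicity rather than Kafka's actual partitioner or assignor.

```python
import zlib
from collections import defaultdict


def choose_partition(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Topic:
    """A topic modeled as a list of append-only, ordered partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Messages with the same key always land in the same partition,
        # so their relative order is preserved there.
        p = choose_partition(key, len(self.partitions))
        self.partitions[p].append((key, value))
        return p


def assign(num_partitions, consumers):
    """Give each partition to exactly one consumer of a group (round robin)."""
    assignment = defaultdict(list)
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return dict(assignment)


topic = Topic("click-log", 4)
for i in range(8):
    topic.publish(f"user-{i % 3}", f"click-{i}")

# Two independent consumer groups: each group sees every partition,
# but inside a group a partition has a single owner.
group_a = assign(4, ["a1", "a2"])
group_b = assign(4, ["b1"])
print(group_a)
print(group_b)
```

Because every partition has one owner per group, a message is consumed once within a group, while a second group still receives all messages.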

Getting local host information with Python

import wmi

c = wmi.WMI()

for os_info in c.Win32_OperatingSystem():
    # System information
    print(os_info.Caption)
    # System build number
    print(os_info.BuildNumber)
    # 32-bit or 64-bit
    print(os_info.OSArchitecture)
    # Number of processes currently running
    print(os_info.NumberOfProcesses)

# Processor information
for pro in c.Win32_Processor():
    print(pro.DeviceID)
    print(pro.Name.strip())

# Memory information (capacity converted from bytes to MB)
for memory in c.Win32_PhysicalMemory():
    print(int(memory.Capacity) / 1048576)

# Disk partitions
for physical_disk in c.Win32_DiskDrive():
    for partition in physical_disk.associators("Win32_DiskDriveToDiskPartition"):
        for logical_disk in partition.associators("Win32_LogicalDiskToPartition"):
            print(physical_disk.Caption, partition

The shuffle process in MapReduce

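An introduction to the MapReduce shuffle process

"Shuffle" literally means shuffling cards: turning data that has some order into data with as little order as possible, the more random the better. The shuffle in MapReduce is closer to the reverse of that: it turns unordered data into data that follows a defined order.

Why does the MapReduce model need a shuffle?

The MapReduce model generally has two important phases: Map performs the mapping and is responsible for filtering and distributing data; Reduce performs the reduction and is responsible for computing and merging data. Reduce takes its data from Map: the output of Map is the input of Reduce, and Reduce obtains that data through the shuffle. The whole path from Map output to Reduce input can broadly be called the shuffle. The shuffle spans both sides: on the Map side it includes the spill process, and on the Reduce side it includes the copy and sort processes.

Spill process (the Map-side shuffle)

The spill process consists of the collect (output), sort, spill, and merge steps.

Collect

Each Map task continuously writes its output, as key/value pairs, into a ring data structure built in memory. A ring structure is used to make better use of memory, so that as much data as possible can be held in memory.

Ring data structure

The ring data structure is a byte array called Kvbuffer. Besides the data being processed, Kvbuffer also holds index data; the region that holds the index data is called Kvmeta.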
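
The collect, sort, spill, and merge steps described above can be sketched in plain Python. This is a simplified illustration, not Hadoop's implementation: a plain list with a size limit stands in for the Kvbuffer ring buffer, spills are kept in memory instead of on disk, and the names `partition_of`, `map_side_shuffle`, and `buffer_limit` are hypothetical.

```python
import heapq
import zlib


def partition_of(key, num_reducers):
    """Pick the reducer for a key with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side_shuffle(records, num_reducers, buffer_limit):
    """Collect records, spill sorted runs when the buffer fills, then merge.

    Returns the merged output sorted by (partition, key) and the number
    of spills that were produced.
    """
    buffer, spills = [], []

    def spill():
        # Sort: order the buffered records by partition, then by key.
        spills.append(sorted(buffer))
        buffer.clear()

    # Collect: every map output record goes into the in-memory buffer.
    for key, value in records:
        buffer.append((partition_of(key, num_reducers), key, value))
        if len(buffer) >= buffer_limit:
            spill()
    if buffer:
        spill()

    # Merge: combine the sorted runs into a single sorted output.
    merged = list(heapq.merge(*spills))
    return merged, len(spills)


records = [("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1), ("d", 1), ("a", 1)]
merged, n_spills = map_side_shuffle(records, num_reducers=2, buffer_limit=3)
print(n_spills)
for row in merged:
    print(row)
```

After the merge, all records for one reducer are contiguous and sorted by key, which is the shape the Reduce side expects when it copies its partition.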

Creating temp table from another table including partition column in hive

I am creating a temp table from another table using an AS clause, and I include the source table's partition column in the temp table. This produces the error below. In the CREATE statement below, col4 is the partition column of table xyz. When I remove col4 from the CREATE statement, it runs fine. Error: Error while compiling statement: FAILED: NumberFormatException For input string: "__HIVE_DEFAULT_PARTITION__" (state=42000,code=40000) Please help. Example: CREATE TEMPORARY TABLE

Spark Study Notes 03 (Spark job submission flow + wide and narrow dependencies)

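Spark programming: secondary sort and grouped Top N

Wide and narrow dependencies of RDDs

Wide dependency: the data in each partition of a parent RDD may be sent to every partition of the child RDD. This many-to-many relationship is called a wide dependency. Criterion for a wide dependency: a shuffle occurs.

Narrow dependency: an RDD has a one-to-one dependency on its parent RDD, that is, each partition of the RDD depends on only one partition of the parent RDD. This one-to-one relationship is called a narrow dependency. Criterion for a narrow dependency: no shuffle occurs.

Join is a special case: although join is a shuffle operator, it can also produce a narrow dependency, for example when both parent RDDs are already partitioned by the same partitioner.

Lineage: a parent RDD and its child RDD are linked by a direct dependency. This dependency relationship is called lineage, and it also provides fault tolerance between RDDs.

Case study: base station analysis

From the logs generated by users, find the base stations where each user stayed the longest.

19735E1C66.log stores the log records. Fields: phone number, timestamp, base station ID, connection state (1 = connected, 0 = disconnected).
lac_info.txt stores the base station information. Fields: base station ID, longitude, latitude.

Within a given time range, find the Top 2 base stations by dwell time among all the stations each user passed through.

Approach:
1. Read the user log records and split them into fields.
2. Compute the total time each user stayed at each base station.
3. Load the base station information.
4. Join the longitude and latitude into the user data.
5. Compute the Top 2 base stations by dwell time for each user.

Case study: Top N subjects by number of visits in a given time period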
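
Steps 1, 2, and 5 of the base station case can be sketched in plain Python. This is an illustration only, not the Spark job itself: the sample lines in `LOG_LINES` are made up, the comma-separated layout and integer-second timestamps are assumptions, `top_stations` is a hypothetical helper, and the join with longitude and latitude (steps 3 and 4) is left out.

```python
import heapq
from collections import defaultdict

# Made-up sample records: phone number, timestamp (seconds), station ID, state
LOG_LINES = [
    "13800000001,100,S1,1",
    "13800000001,160,S1,0",
    "13800000001,200,S2,1",
    "13800000001,500,S2,0",
    "13800000001,600,S3,1",
    "13800000001,610,S3,0",
    "13800000002,100,S1,1",
    "13800000002,130,S1,0",
]


def top_stations(lines, n=2):
    """Return the n stations with the longest total dwell time per phone."""
    dwell = defaultdict(int)
    for line in lines:
        phone, ts, station, state = line.split(",")
        # A connect event subtracts its timestamp and a disconnect adds it,
        # so the sum per (phone, station) is the total time connected.
        signed = -int(ts) if state == "1" else int(ts)
        dwell[(phone, station)] += signed

    per_user = defaultdict(list)
    for (phone, station), seconds in dwell.items():
        per_user[phone].append((station, seconds))

    return {
        phone: heapq.nlargest(n, stations, key=lambda s: s[1])
        for phone, stations in per_user.items()
    }


result = top_stations(LOG_LINES)
for phone, stations in result.items():
    print(phone, stations)
```

In Spark the first aggregation corresponds to a reduce by the (phone, station) key and the second to a grouping by phone, both of which involve a shuffle and therefore a wide dependency.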

Spark job execution flow

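The DAGScheduler and the TaskScheduler both run on the Driver side (the side where spark-shell is started). When the main function creates the SparkContext, the Driver connects to the Master node. The Master looks for workers in the cluster that satisfy the resources the job requires, then contacts those workers over RPC and tells them to start Executors. The Executors connect to the Driver, and from that point on the work no longer involves the workers or the Master. The Driver then submits Tasks to the Executors.

1. RDD Objects

RDDs are built and go through a series of transformations. When an action is finally reached, the boundary of the DAG is determined and the DAG is formed, after which it is submitted to the DAGScheduler. A DAG (Directed Acyclic Graph) is formed when the original RDD goes through a series of transformations. The DAG is divided into Stages according to the kinds of dependencies between RDDs. For narrow dependencies, the partition transformations are computed within a single Stage. For wide dependencies, because a shuffle is involved, the next computation can start only after the parent RDD has been fully processed. Wide dependencies are therefore the criterion for dividing Stages.

2. DAGScheduler (the scheduler)

It splits the DAG into multiple stages. Splitting criterion: (wide dependencies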
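
The rule that wide dependencies mark Stage boundaries can be illustrated with a small plain-Python sketch. It is a deliberate simplification, not Spark's DAGScheduler: it handles only a linear lineage (a real DAG can have several parents per RDD), and `split_into_stages` and the hand-written dependency labels are hypothetical.

```python
def split_into_stages(lineage):
    """Split a linear lineage into stages at wide dependencies.

    lineage is an ordered list of (operation, dependency) pairs, where
    dependency is "narrow", "wide", or None for the source RDD.
    An operation with a wide dependency starts a new stage.
    """
    stages = [[]]
    for op, dep in lineage:
        if dep == "wide" and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


lineage = [
    ("textFile", None),
    ("flatMap", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),
    ("map", "narrow"),
    ("sortByKey", "wide"),
]

stages = split_into_stages(lineage)
for i, ops in enumerate(stages):
    print(f"stage {i}: {' -> '.join(ops)}")
```

Operations joined by narrow dependencies end up in the same stage and can be pipelined per partition, while each shuffle forces a new stage that waits for the previous one to finish.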

Hive development guidelines

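Common Hive interactive commands

-e: run a SQL statement without entering the Hive interactive shell. Example: bin/hive -e "show tables;"
-f: run the SQL statements in a script file. Example: bin/hive -f "/home/user/hive/tmp/hivef.sql"
!quit: exit the Hive interactive shell
help: get help inside the Hive shell
dfs -ls /; : view the HDFS file system from the Hive CLI

Hive data types

Primitive types

Hive type | Java type | Length | Example
tinyint | byte | 1-byte signed integer | 20
smallint | short | 2-byte signed integer | 20
int | int | 4-byte signed integer | 20
bigint | long | 8-byte signed integer | 20
boolean | boolean | Boolean, true or false | true, false
float | float | Single-precision floating point | 3.14159
double | double | Double-precision floating point | 3.14159
string | string | Character sequence. A character set can be specified. Single or double quotes can be used. | 'now is the time', "for all good men"
timestamp | | Time type |
binary | | Byte array |

Collection types

Data type | Description | Syntax example
struct | Similar to a struct in C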
