Hadoop

How to read a Parquet file, change datatypes, and write to another Parquet file in Hadoop using PySpark

爱⌒轻易说出口 submitted on 2021-02-11 14:10:27
Question: My source Parquet file has everything as string. My destination Parquet file needs these columns converted to different datatypes such as int, string, date, etc. How do I do this? Answer 1: You may want to apply a user-defined schema to speed up data loading. There are two ways to do that: using an input DDL-formatted string, spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet"), or using a StructType schema: customSchema = StructType([ StructField("a", IntegerType(), True), StructField("b", StringType(…
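A minimal PySpark sketch of the cast-and-rewrite approach the answer describes; the column names (a, c, event_date), their target types, and the file paths below are assumptions for illustration, not details taken from the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("retype-parquet").getOrCreate()

# Source Parquet file where every column was written as string.
df = spark.read.parquet("/data/source_strings.parquet")  # hypothetical path

# Cast each column to its target type, then write out a new Parquet file.
typed = (df
         .withColumn("a", col("a").cast("int"))
         .withColumn("c", col("c").cast("double"))
         .withColumn("event_date", col("event_date").cast("date")))  # assumed date column

typed.write.mode("overwrite").parquet("/data/destination_typed.parquet")
```

Casting after the read keeps the source file untouched; alternatively, the StructType schema shown in the answer can be passed to spark.read.schema(...) before .parquet(...).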

Can't use hbase-shaded-client jar because of its internal dependency on log4j 1.2.17 (CVE-2019-17571)

≡放荡痞女 submitted on 2021-02-11 13:52:51
Question: Is there a way to exclude it? I did give it a try but got ClassNotFoundException: org.apache.log4j.Level. I do see that hbase-shaded-client has an slf4j dependency, so there might be a way to exclude log4j and use slf4j, but I have not been able to. Answer 1: Yes, you can exclude log4j, but you must add back in log4j-over-slf4j. <dependency> <groupId>org.apache.hbase</groupId> <artifactId>hbase-client</artifactId> <version>[some version]</version> <exclusions> <exclusion> <artifactId>log4j</artifactId> …

MapReduce Hadoop on Linux - Change Reduce Key

无人久伴 submitted on 2021-02-11 12:55:22
Question: I've been searching for a proper tutorial online about how to use map and reduce, but almost every WordCount example is poorly explained and doesn't really show how to use each function. I've seen everything about the theory, the keys, the map, etc., but there is no CODE doing something other than WordCount. I am using Ubuntu 20.10 on VirtualBox and Hadoop version 3.2.1 (comment if you need any more info). My task is to manage a file that contains several data fields for athletes that… A Hadoop Streaming sketch follows below.
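Since the question explicitly asks for a map/reduce example other than WordCount, here is a hedged Hadoop Streaming sketch in Python. The question does not show the file layout, so the assumed tab-separated fields (name, country, medals) and the paths in the usage note are purely hypothetical; the mapper emits (athlete, medals) pairs and the reducer sums medals per athlete.

```python
#!/usr/bin/env python3
# athlete_medals.py - one script used for both phases of a Hadoop Streaming job.
# Assumed (hypothetical) input layout: name<TAB>country<TAB>medals
import sys

def run_mapper():
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue                        # skip malformed lines
        name, medals = parts[0], parts[2]
        print(f"{name}\t{medals}")          # emit key<TAB>value

def run_reducer():
    # Hadoop Streaming sorts mapper output by key before it reaches the reducer,
    # so lines with the same key arrive contiguously.
    current_key, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key and current_key is not None:
            print(f"{current_key}\t{total}")
            total = 0
        current_key = key
        total += int(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__":
    run_mapper() if sys.argv[1:] == ["map"] else run_reducer()
```

It could be submitted with something like: hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar -files athlete_medals.py -mapper "python3 athlete_medals.py map" -reducer "python3 athlete_medals.py reduce" -input /data/athletes -output /data/medals_per_athlete (input and output paths assumed).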

HBase: need to export data from one cluster and import it into another with a slight modification to the row key

青春壹個敷衍的年華 submitted on 2021-02-11 12:32:05
Question: I am trying to export data from the HBase table 'mytable' whose row keys start with 'abc': scan 'mytable', {ROWPREFIXFILTER => 'abc'}. The exported data needs to be imported into another cluster with the row-key prefix changed from 'abc' to 'def'. Old data: hbase(main):002:0> scan 'mytable', {ROWPREFIXFILTER => 'abc'} ROW COLUMN+CELL abc-6535523 column=track:aid, timestamp=1339121507633, value=some stream/pojos New data (in the other cluster): hbase(main):002:0> get 'mytable', 'def-6535523' ROW …
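The question itself goes through the Export/Import utilities; purely as an alternative illustration of the same copy-and-rename idea, here is a hedged Python sketch using the third-party happybase library (which requires the HBase Thrift server to be enabled on both clusters). The Thrift host names are assumptions, and cell timestamps are not preserved by this sketch.

```python
import happybase

src_conn = happybase.Connection(host="source-cluster-thrift")   # hypothetical hosts
dst_conn = happybase.Connection(host="dest-cluster-thrift")

src_table = src_conn.table("mytable")
dst_table = dst_conn.table("mytable")

OLD, NEW = b"abc", b"def"

# Scan only rows whose key starts with 'abc' (same effect as ROWPREFIXFILTER => 'abc')
# and write them back under a 'def'-prefixed key on the destination cluster.
with dst_table.batch(batch_size=1000) as batch:
    for row_key, columns in src_table.scan(row_prefix=OLD):
        new_key = NEW + row_key[len(OLD):]      # abc-6535523 -> def-6535523
        batch.put(new_key, columns)             # columns: dict of b'cf:qualifier' -> value
```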

Apache Spark + Parquet not Respecting Configuration to use “Partitioned” Staging S3A Committer

妖精的绣舞 submitted on 2021-02-11 12:31:30
Question: I am writing partitioned data (Parquet files) to AWS S3 using Apache Spark (3.0) from my local machine, without Hadoop installed on the machine. I was getting a FileNotFoundException while writing to S3 when I had a lot of files to write across around 50 partitions (partitionBy = date). Then I came across the new S3A committers, so I tried to configure the "partitioned" committer instead. But I can still see that Spark uses ParquetOutputCommitter instead of PartitionedStagingCommitter when the file…
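For reference, a minimal PySpark configuration sketch that asks Spark to use the partitioned staging committer; it assumes the spark-hadoop-cloud and hadoop-aws modules are on the classpath, and the bucket name is a placeholder. The property names follow the Hadoop S3A committer documentation, but the exact values may need adjusting for a given build, which is part of what the question is troubleshooting.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-partitioned-committer")
         # Route Spark's commit protocol through the cloud committers.
         .config("spark.sql.sources.commitProtocolClass",
                 "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
         .config("spark.sql.parquet.output.committer.class",
                 "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
         # Select the "partitioned" staging committer and its conflict handling.
         .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
         .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
         .getOrCreate())

df = spark.createDataFrame([(1, "2021-02-01"), (2, "2021-02-02")], ["id", "date"])
df.write.partitionBy("date").parquet("s3a://my-bucket/output/")  # hypothetical bucket
```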

Error building Ambari 2.7.5 from sources on CentOS 7.8

感情迁移 submitted on 2021-02-11 12:29:26
Question: I am following the guide https://cwiki.apache.org/confluence/display/AMBARI/Installation+Guide+for+Ambari+2.7.5 and, with the information from "Ambari 2.7.5 installation failure on CentOS 7", I managed to overcome the ambari-admin error, but now I am facing a new one: [INFO] Ambari Main 2.7.5.0.0 .............................. SUCCESS [ 2.950 s] [INFO] Apache Ambari Project POM 2.7.5.0.0 ................ SUCCESS [ 0.042 s] [INFO] Ambari Web 2.7.5.0.0 ............................... SUCCESS [01:03 min] …

Is zero-coding report development possible?

爷，独闯天下 submitted on 2021-02-10 18:33:47
To answer this question, we first need to be clear about what counts as "zero coding". Take Excel as an example: if writing Excel formulas (including fairly complex ones) counts as zero coding, while writing Excel VBA counts as coding, then report development can indeed be zero-coding! But there is a prerequisite: it is only zero-coding once the data (sets) have been prepared. Why? Report development has two main stages. The first stage is preparing the data for the report, that is, turning raw data into data sets with SQL or stored procedures. The second stage is writing expressions over the prepared data to render the report: in the report tool's IDE you visually lay out the report, then fill in expressions that bind data to cells. The expressions can be fairly complex, but they are still far simpler than hard coding (the same relationship as Excel formulas versus VBA), so this stage can be done with "zero coding". What about the data preparation for the report? Unfortunately, that stage cannot be zero-coded; it has always been hand-coded. Just think of the nested SQL, stored procedures, and Java programs behind our reports. Why is it that, after so many years of report tool development, report rendering is fully tool-based while data preparation remains so primitive? Because this stage is too complex: it involves not only implementing the calculation logic but also report performance (most report performance problems originate in the data preparation stage). So is report data preparation hopeless? It cannot be made zero-coding, but it can be pushed toward simplicity by making the data preparation stage tool-based as well, so that the conveniences a tool provides can simplify the data preparation work and thereby further simplify report development.

How to provide arguments to an IN clause in Hive

蓝咒 submitted on 2021-02-10 17:33:48
Question: Is there any way to read arguments in a Hive query that can be substituted into an IN clause? I have the query: Select count(*) from table where id in ('1','2','3','4','5'). Is there any way to supply the arguments to the IN clause from a text file? Answer 1: Use in_file: put all ids into a file, one id per row. Select count(*) from table where in_file(id, '/tmp/myfilename'); --local file. You can also pass the list of values as a single parameter to the IN clause: https://stackoverflow.com/a/56963448
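As a complement to the in_file answer, here is a hedged Python sketch that builds the quoted id list from a text file and hands it to the Hive CLI as a --hivevar substitution variable; the file path, table name, and column name are assumptions.

```python
import subprocess

# ids.txt (assumed): one id per line, e.g. 1, 2, 3, 4, 5 on separate lines.
with open("/tmp/ids.txt") as f:
    ids = [line.strip() for line in f if line.strip()]

id_list = ",".join(f"'{i}'" for i in ids)        # -> '1','2','3','4','5'

# ${hivevar:idlist} is expanded by Hive before the query runs.
query = "SELECT count(*) FROM mytable WHERE id IN (${hivevar:idlist});"

subprocess.run(["hive", "--hivevar", f"idlist={id_list}", "-e", query], check=True)
```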

An Introduction to PySpark SQL and Related Concepts

寵の児 submitted on 2021-02-10 16:31:27
Author: foochane. Original link: https://foochane.cn/article/2019060601.html 1 Introduction to Big Data Big data is one of the hottest topics of our time. But what is big data? It describes an enormous data set that keeps growing at an astonishing rate. Besides volume and velocity, variety and veracity are also defining characteristics of big data. Let us discuss volume, velocity, variety, and veracity in detail; these are known as the 4 Vs of big data. 1.1 Volume Volume specifies the amount of data to be processed. Large amounts of data require big machines or distributed systems, and computation time grows with the data volume, so if we can parallelize the computation, a distributed system is preferable. The data may be structured, unstructured, or somewhere in between; with unstructured data, things become more complex and compute-intensive. You may wonder how big "big data" actually is. That is a debatable question, but generally speaking, data that we cannot process with traditional systems is what we call big data. Now let us discuss the velocity of data. 1.2 Velocity More and more organizations are paying attention to data. Huge amounts of data are collected every moment, which means the velocity of data is increasing. How can a system handle this velocity? The problem becomes harder when large volumes of incoming data must be analyzed in real time. Many systems are being developed to handle this massive inflow of data…

Logger is not working inside a Spark UDF on a cluster

陌路散爱 submitted on 2021-02-10 15:54:51
Question: I have placed log.info statements inside my UDF, but it fails on the cluster; locally it works fine. Here is the snippet: def relType = udf((colValue: String, relTypeV: String) => { var relValue = "NA" val relType = relTypeV.split(",").toList val relTypeMap = relType.map { col => val split = col.split(":") (split(0), split(1)) }.toMap // val keySet = relTypeMap relTypeMap.foreach { x => if ((x._1 != null || colValue != null || x._1.trim() != "" || colValue.trim() != "") && colValue…
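The snippet in the question is Scala, but the usual explanation applies to PySpark as well: a UDF runs on the executors, so anything it logs lands in the executor logs (for example, retrieved with yarn logs -applicationId <appId>), not on the driver console where the job was launched. A hedged PySpark illustration of the same pattern follows; the column names and the "key:value,key:value" mapping format are assumptions modeled loosely on the Scala snippet.

```python
import logging
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-logging").getOrCreate()

def rel_type(col_value, rel_type_v):
    # This logger lives on the executor running the task; whether and where its
    # output shows up is controlled by the executors' logging configuration.
    log = logging.getLogger("relType")
    log.info("processing value=%s mapping=%s", col_value, rel_type_v)
    mapping = dict(pair.split(":", 1) for pair in rel_type_v.split(","))
    return mapping.get(col_value, "NA")

rel_type_udf = udf(rel_type, StringType())

df = spark.createDataFrame([("a", "a:x,b:y")], ["colValue", "relTypeV"])
df.withColumn("relValue", rel_type_udf("colValue", "relTypeV")).show()
```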