bigdata

Vertica performance degradation when loading Parquet files versus delimited files from S3 to Vertica

Submitted by 瘦欲 on 2019-12-11 14:10:16
Question: I have Parquet files for 2 billion records with GZIP compression, and the same data with Snappy compression. I also have delimited files for the same 2 billion records. We have 72 Vertica nodes in AWS prod, and we are seeing a huge performance hit for the Parquet files compared to the delimited files when moving the data from S3 to Vertica with the COPY command. Parquet takes 7x more time than the delimited files, even though the delimited files are 50x larger than the Parquet files. Below are the stats for the test we conducted.
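A minimal sketch of the kind of load being compared, using the vertica_python client. The table name, S3 paths, and connection details are placeholders, and the exact COPY options (the PARQUET clause versus a DELIMITER clause) should be checked against your Vertica version's documentation:

```python
# Hypothetical timing comparison of the two COPY paths described above.
# Assumes a table "events" already exists and S3 access is configured in Vertica.
import time
import vertica_python

conn_info = {"host": "vertica-host", "port": 5433, "user": "dbadmin",
             "password": "***", "database": "prod"}

COPY_PARQUET = "COPY events FROM 's3://my-bucket/events/parquet/*' PARQUET"
COPY_DELIMITED = "COPY events FROM 's3://my-bucket/events/delimited/*' DELIMITER ',' DIRECT"

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    for label, stmt in [("parquet", COPY_PARQUET), ("delimited", COPY_DELIMITED)]:
        start = time.time()
        cur.execute(stmt)          # COPY runs on the server; the client just waits
        print(label, round(time.time() - start, 1), "seconds")
```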

Spark Java Accumulator not incrementing

Submitted by 不羁的心 on 2019-12-11 13:37:08
Question: Just getting started with baby steps in Spark with Java. Below is a word-count program that includes a stop-word list used to skip words that appear in the list. I have 2 accumulators to count the skipped and unskipped words. However, the Sysout at the end of the program always reports both accumulator values as 0. Please point out where I am going wrong. public static void main(String[] args) throws FileNotFoundException { SparkConf conf = new SparkConf(); conf.setAppName("Third App - Word Count WITH
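A common cause of accumulators reading 0 is that they are only updated when an action actually forces the transformations to execute. Below is a minimal sketch of that idea in PySpark rather than the asker's Java code; the input path and stop-word list are made up:

```python
# Counting skipped vs. kept words with accumulators; the values are only
# populated after an action (here, collect) has run the pipeline.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("word-count-with-stopwords"))
stop_words = {"the", "a", "an", "of"}          # hypothetical stop-word list

skipped = sc.accumulator(0)
kept = sc.accumulator(0)

def keep(word):
    if word in stop_words:
        skipped.add(1)
        return False
    kept.add(1)
    return True

words = sc.textFile("input.txt").flatMap(lambda line: line.split())
counts = words.filter(keep).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

counts.collect()                               # action: accumulators update here
print("skipped:", skipped.value, "kept:", kept.value)   # non-zero only after the action
```

Reading the accumulators before any action has run (or never calling an action at all) leaves both at 0, which matches the symptom described in the question.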

Hive query to get max of count

Submitted by 為{幸葍}努か on 2019-12-11 13:19:28
Question: My input file is like this:

id,phnName,price,model
1,iphone,2000,abc
2,iphone,3000,abc1
3,nokia,4000,abc2
4,sony,5000,abc3
5,nokia,6000,abc4
6,iphone,7000,abc5
7,nokia,8500,abc6

I want to write a Hive query to get the maximum count of a particular phone. Expected output:

iphone 3
nokia 3

So far I've tried the following query:

select d.phnName, count(*) from phnDetails d group by d.phnName

and got output like this:

iphone 3
nokia 3
sony 1

Help me to retrieve only the max value.

Answer 1: I have the query
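The original answer is cut off above. One common approach (not necessarily the one the answer used) is to rank the grouped counts and keep only the top rank, so that ties such as iphone and nokia (both 3) survive. A sketch in PySpark SQL; the same query should also run in Hive 0.11+ where windowing functions are available, and the table name phnDetails is taken from the question:

```python
# Keep every phone whose count equals the maximum count, ties included.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("max-of-count").enableHiveSupport().getOrCreate()

result = spark.sql("""
    SELECT phnName, cnt
    FROM (
        SELECT phnName,
               COUNT(*) AS cnt,
               RANK() OVER (ORDER BY COUNT(*) DESC) AS rnk
        FROM phnDetails
        GROUP BY phnName
    ) t
    WHERE rnk = 1
""")
result.show()   # expected: iphone 3, nokia 3
```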

Permission denied while executing start-dfs.sh in Ubuntu

Submitted by 谁说我不能喝 on 2019-12-11 13:15:31
Question:

suresh@suresh-laptop:/$ usr/local/hadoop-2.6.0/sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/02/01 00:24:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Incorrect configuration: namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured.
Starting namenodes on []
suresh@localhost's password: localhost: mkdir: cannot create

Optimizing for loop in big data frame

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-11 12:26:39
Question: I have a large data frame (6 million rows) with one row for the entry time and the next row for the exit time of the same unit (id). I need to put them together on one row. The original data looks something like the following (please bear in mind that some ids may enter and exit twice, as in the case of id = 1):

df <- read.table(header=T, text='id time
1 "15/12/2014 06:30"
1 "15/12/2014 06:31"
1 "15/12/2014 06:34"
1 "15/12/2014 06:35"
2 "15/12/2014 06:36"
2 "15/12/2014 06:37"
3 "15/12/2014 06:38"
3 "15/12/2014 06:39"')
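Not the R answer, but the same pairing can be done without a row-by-row loop. A sketch using pandas, assuming rows strictly alternate entry/exit within each id once sorted by time (column names follow the question):

```python
# Pair consecutive rows per id into (entry, exit) columns, vectorised.
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 1, 1, 2, 2, 3, 3],
    "time": ["15/12/2014 06:30", "15/12/2014 06:31", "15/12/2014 06:34",
             "15/12/2014 06:35", "15/12/2014 06:36", "15/12/2014 06:37",
             "15/12/2014 06:38", "15/12/2014 06:39"],
})
df["time"] = pd.to_datetime(df["time"], format="%d/%m/%Y %H:%M")
df = df.sort_values(["id", "time"])

seq = df.groupby("id").cumcount()
df["visit"] = seq // 2                              # 0 for the first entry/exit pair, 1 for the second, ...
df["event"] = (seq % 2).map({0: "entry", 1: "exit"})

wide = (df.pivot_table(index=["id", "visit"], columns="event",
                       values="time", aggfunc="first")
          .reset_index())
print(wide)   # one row per id/visit with entry and exit columns
```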

How to clean old segments from a compacted log in Kafka 0.8.2

Submitted by ぃ、小莉子 on 2019-12-11 12:25:43
Question: I know that in newer Kafka versions we have a new retention policy option, log compaction, which deletes old versions of messages with the same key. But after a long time we will end up with too many compacted log segments holding old messages. How can we clean this compacted log automatically? UPDATE: I should clarify that we need the compacted log and, at the same time, a way to clean up these old messages. I found a discussion of the same problem here http://grokbase.com/t/kafka/users/14bv6gaz0t/kafka-0-8-2-log-cleaner but

split 10 billion line file into 5,000 files by column value in Perl or Python

Submitted by 耗尽温柔 on 2019-12-11 12:17:26
Question: I have a 10-billion-line tab-delimited file that I want to split into 5,000 sub-files based on a column (the first column). How can I do this efficiently in Perl or Python? This has been asked here before, but all the approaches either open a file for every row read or put all the data in memory. Answer 1: This program will do as you ask. It expects the input file as a parameter on the command line, and writes output files whose names are taken from the first column of the input file records. It keeps a
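The answer above is a Perl program; here is a rough Python sketch of the same idea, keeping a bounded cache of open file handles so we neither reopen a file for every line nor hold 5,000 handles at once. The cache size and the file-naming scheme are assumptions (and assume the key values are safe to use as file names):

```python
# Split a tab-delimited file into one output file per distinct first-column value,
# appending line by line and capping the number of simultaneously open handles.
import sys
from collections import OrderedDict

MAX_OPEN = 500                      # assumed safe limit below the OS file-handle cap

def split_by_first_column(path):
    handles = OrderedDict()         # key -> open file handle, kept in LRU order
    with open(path) as infile:
        for line in infile:
            key = line.split("\t", 1)[0]
            fh = handles.pop(key, None)
            if fh is None:
                if len(handles) >= MAX_OPEN:
                    _, oldest = handles.popitem(last=False)   # evict least recently used
                    oldest.close()
                fh = open(f"{key}.txt", "a")   # append, so reopened files keep earlier lines
            handles[key] = fh                  # (re)insert as most recently used
            fh.write(line)
    for fh in handles.values():
        fh.close()

if __name__ == "__main__":
    split_by_first_column(sys.argv[1])
```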

Performance bottleneck of Spark

Submitted by China☆狼群 on 2019-12-11 12:15:13
Question: The paper "Making Sense of Performance in Data Analytics Frameworks", published at NSDI 2015, concludes that the CPU (not I/O or the network) is the performance bottleneck of Spark. In the paper, Kay ran experiments on Spark including BDBench, TPC-DS, and a production workload (only Spark SQL is used?). I wonder whether this conclusion holds for frameworks built on Spark (like Streaming, where a continuous data stream is received over the network, so both network I/O and disk will suffer

How to apply a function to each row in SparkR?

Submitted by ╄→гoц情女王★ on 2019-12-11 12:14:37
Question: I have a file in CSV format which contains a table with the columns "id", "timestamp", "action", "value" and "location". I want to apply a function to each row of the table, and I've already written the code in R as follows:

user <- read.csv(file_path, sep = ";")
num <- nrow(user)
curLocation <- "1"
for (i in 1:num) {
  row <- user[i, ]
  if (user$action != "power")
    curLocation <- row$value
  user[i, "location"] <- curLocation
}

The R script works fine and now I want to apply it in SparkR. However, I couldn't
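Not SparkR, but the same "carry the last non-power value forward" logic can be expressed without a per-row loop. A sketch in PySpark using a window with last(..., ignorenulls=True); column names follow the question, the file path is a placeholder, and a single global ordering by timestamp mirrors the sequential R loop and its initial location of "1":

```python
# Fill "location" with the most recent "value" seen on a row whose action != "power".
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("carry-forward-location").getOrCreate()
user = spark.read.csv("user.csv", sep=";", header=True)   # hypothetical path

# Note: a window with no partitionBy pulls all rows into one partition,
# which matches the sequential scan in the R script but will not scale well.
w = Window.orderBy("timestamp").rowsBetween(Window.unboundedPreceding, Window.currentRow)

user = user.withColumn(
    "location",
    F.coalesce(
        F.last(F.when(F.col("action") != "power", F.col("value")),
               ignorenulls=True).over(w),
        F.lit("1"),                     # initial curLocation from the R script
    ),
)
user.show()
```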

Hive: How to calculate time difference

Submitted by 匆匆过客 on 2019-12-11 11:48:29
Question: My requirement is simple: how to calculate the time difference between two columns in Hive. Example: Time_Start: 10:15:00, Time_End: 11:45:00. I need (Time_End - Time_Start) = 1:30:00. Note that both columns are of String datatype. Kindly help me get the required result. Answer 1: The language manual contains a description of all available date/time functions. The difference in seconds can be calculated in such a way: hour(time_end) * 3600 + minute(time_end) * 60 + second(time_end) - hour(time_start) * 3600 - minute(time_start) * 60 - second(time_start)
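A quick Python check of the same arithmetic (not Hive), which also shows one way to turn the seconds back into an H:MM:SS string for the example in the question:

```python
# Verify the seconds-difference formula for 11:45:00 - 10:15:00.
def to_seconds(hhmmss: str) -> int:
    h, m, s = (int(x) for x in hhmmss.split(":"))
    return h * 3600 + m * 60 + s

diff = to_seconds("11:45:00") - to_seconds("10:15:00")                 # 5400 seconds
print(f"{diff // 3600}:{(diff % 3600) // 60:02d}:{diff % 60:02d}")     # 1:30:00
```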