bigdata

Is Data Lake and Big Data the same?

Submitted on 2019-12-31 02:41:47
Question: I am trying to understand whether there is a real difference between a data lake and big data. If you look at the concepts, both are like a big repository that stores information until it becomes necessary. So when can we say that we are using big data versus a data lake? Thanks in advance.

Answer 1: I can't say I've come across the term 'big repository' before, but to answer the original question: no, a data lake and big data are not the same, although in fairness both terms are thrown around a lot and the…

R: Is it possible to parallelize / speed up the reading of a 20-million-plus-row CSV into R?

Submitted on 2019-12-30 08:34:16
Question: Once the CSV is loaded via read.csv, it's fairly trivial to use multicore, segue, etc. to play around with the data. Reading it in, however, is quite the time sink. I realise it's better to use MySQL etc. Assume the use of an AWS 8xl cluster compute instance running R 2.13, with the following specs. Cluster Compute Eight Extra Large: 88 EC2 Compute Units (eight-core, 2 x Intel Xeon); 60.5 GB of memory; 3,370 GB of instance storage; 64-bit platform; I/O performance: Very High (10…
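The excerpt cuts off before the answers. One widely used option (an assumption here, not necessarily what the thread recommended) is data.table::fread, which parses large CSVs with multiple threads and is typically far faster than read.csv. A minimal sketch; the file name is a placeholder:

```r
library(data.table)

setDTthreads(8)              # cap the parser's thread count on a shared instance
dt <- fread("big_file.csv")  # memory-mapped, multi-threaded CSV parsing
```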

Sorting large text data

Submitted on 2019-12-30 06:14:32
Question: I have a large file (100 million lines of tab-separated values, about 1.5 GB in size). What is the fastest known way to sort it based on one of the fields? I have tried Hive. I would like to see if this can be done faster using Python.

Answer 1: Have you considered using the *nix sort program? In raw terms, it'll probably be faster than most Python scripts. Use -t $'\t' to specify that it's tab-separated, -k n to specify the field, where n is the field number, and -o outputfile if you want to…
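A minimal sketch of the command the answer describes, assuming GNU sort, a file named data.tsv, and a sort on field 2 (both names are placeholders):

```bash
# -t $'\t' : tab is the field delimiter
# -k 2,2   : sort on field 2 only (append n for numeric order, e.g. -k 2,2n)
# -o       : write the result to a file
sort -t $'\t' -k 2,2 -o sorted.tsv data.tsv

# GNU sort can also parallelize and use a larger in-memory buffer:
# sort --parallel=8 -S 2G -t $'\t' -k 2,2 -o sorted.tsv data.tsv
```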

Skipping the first line of the CSV in MapReduce (Java)

Submitted on 2019-12-30 05:02:18
Question: As the mapper function runs for every line, is there a way to skip the first line? For some files it is a column header, which I want to ignore.

Answer 1: In the mapper, while reading the file, the data comes in as a key-value pair. The key is the byte offset at which the line starts; for line 1 it is always zero. So in the mapper function do the following: @Override public void map(LongWritable key, Text value, Context context) throws IOException { try { if (key.get() == 0 && value.toString()…
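A self-contained sketch completing the idea from the truncated answer; the class name and pass-through write are assumptions, and note that offset 0 only identifies the first line of the file's first split:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SkipHeaderMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the current line; offset 0 means
        // this is the first line of the file, i.e. the header row.
        if (key.get() == 0) {
            return; // skip the header
        }
        context.write(key, value); // hypothetical pass-through of data rows
    }
}
```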

How to transform a categorical variable in Spark into a set of columns coded as {0,1}?

Submitted on 2019-12-30 01:20:09
Question: I'm trying to perform a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (in Scala) on a dataset that contains categorical variables, and I discovered that Spark cannot work with that kind of variable directly. In R there is a simple way to deal with this: I convert the variable to a factor (categories), and R creates a set of columns coded as {0,1} indicator variables. How can I do this with Spark?

Answer 1: Using VectorIndexer, you may tell the indexer the number…
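The answer points to VectorIndexer; an alternative sketch using the DataFrame-based spark.ml API (Spark 3.x; df and the column names are assumptions) chains StringIndexer, which maps category strings to numeric indices, with OneHotEncoder, which expands each index into {0,1} indicator columns:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Map the string category column to a numeric index column.
val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

// Expand the index into a sparse {0,1} indicator vector.
val encoded = new OneHotEncoder()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))
  .fit(indexed)
  .transform(indexed)
```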

How to column-bind two ffdf objects

Submitted on 2019-12-29 09:16:05
Question: Suppose two ffdf objects: library(ff); ff1 <- as.ffdf(data.frame(matrix(rnorm(10*10), ncol=10))); ff2 <- ff1; colnames(ff2) <- 1:10. How can I column-bind these without loading them into memory? cbind doesn't work. There is the same question at http://stackoverflow.com/questions/18355686/columnbind-ff-data-frames-in-r, but it has no MWE and the author abandoned it, so I reposted.

Answer 1: You can use the following construct cbind.ffdf2, making sure the column names of the two input ffdf's are not…
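The answer's cbind.ffdf2 helper is cut off in this excerpt. One approach under the same constraint (an assumption, not the answer's code) is to build a new ffdf from the physical ff columns of both inputs, so no data is pulled into RAM; column names must be unique and syntactically valid:

```r
library(ff)

ff1 <- as.ffdf(data.frame(matrix(rnorm(10 * 10), ncol = 10)))
ff2 <- as.ffdf(data.frame(matrix(rnorm(10 * 10), ncol = 10)))
colnames(ff2) <- paste0("Y", 1:10)  # avoid clashing with ff1's X1..X10

# physical() returns the named list of on-disk ff vectors backing each ffdf;
# ffdf() assembles a new ffdf from them without copying data into memory.
ff12 <- do.call(ffdf, c(physical(ff1), physical(ff2)))
dim(ff12)  # 10 rows, 20 columns
```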

What are the differences between Sort Comparator and Group Comparator in Hadoop?

Submitted on 2019-12-29 05:18:28
Question: What are the differences between the Sort Comparator and the Group Comparator in Hadoop?

Answer 1: To understand GroupComparator, see my answer to the question "What is the use of grouping comparator in hadoop map reduce". SortComparator: used to define how map output keys are sorted. Excerpt from the book Hadoop: The Definitive Guide: the sort order for keys is found as follows: if the property mapred.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on the Job, then an…
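A short sketch of how the two comparators are wired into a job driver; the comparator class names are hypothetical placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "secondary sort");
        // Sort comparator: controls the order in which map output keys
        // are sorted before they reach the reducer.
        job.setSortComparatorClass(CompositeKeySortComparator.class);
        // Grouping comparator: controls which consecutive sorted keys are
        // treated as equal, i.e. fed to a single reduce() call.
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
        // ... mapper, reducer, input/output formats, paths ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```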

Is there a way to transpose data in Hive?

Submitted on 2019-12-28 06:54:06
Question: Can data in Hive be transposed? That is, can the rows become columns and the columns become rows? If there is no single built-in function, is there a way to do it in a couple of steps? I have a table like this:

| ID | Names | Proc1 | Proc2 | Proc3 |
| 1  | A1    | x     | b     | f     |
| 2  | B1    | y     | c     | g     |
| 3  | C1    | z     | d     | h     |
| 4  | D1    | a     | e     | i     |

I want it to be like this:

| A1 | B1 | C1 | D1 |
| x  | y  | z  | a  |
| b  | c  | d  | e  |
| f  | g  | h  | i  |

I have been looking at other related questions and they all…
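One hedged way to do this in a couple of steps (my sketch, not the truncated answer): unpivot the Proc columns with a map plus LATERAL VIEW explode, then re-pivot with conditional aggregation. The table name my_table is a placeholder:

```sql
SELECT
  MAX(CASE WHEN Names = 'A1' THEN val END) AS A1,
  MAX(CASE WHEN Names = 'B1' THEN val END) AS B1,
  MAX(CASE WHEN Names = 'C1' THEN val END) AS C1,
  MAX(CASE WHEN Names = 'D1' THEN val END) AS D1
FROM (
  -- explode(map(...)) turns each row into one (proc_name, val) pair per Proc column
  SELECT Names, proc_name, val
  FROM my_table
  LATERAL VIEW EXPLODE(MAP('Proc1', Proc1, 'Proc2', Proc2, 'Proc3', Proc3)) t AS proc_name, val
) unpivoted
GROUP BY proc_name
ORDER BY proc_name;
```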

Sqoop import from Vertica failed

Submitted on 2019-12-25 19:54:12
Question: I am trying to import a dataset from Vertica into HDFS using sqoop2. I am running the following query on the Sqoop machine to import data into HDFS from Vertica v6.0.1-7: sqoop import -m 1 --driver com.vertica.jdbc.Driver --connect "jdbc:vertica://10.10.10.10:5433/MYDB" --password dbpassword --username dbusername --target-dir "/user/my/hdfs/dir" --verbose --query 'SELECT * FROM ORDER_V2 LIMIT 10;' but I am getting an error: 16/02/03 10:33:17 ERROR tool.ImportTool: Encountered IOException running…
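One likely culprit, hedged since the error text is cut off: with a free-form --query, Sqoop 1 requires the literal token $CONDITIONS in the WHERE clause, and the trailing semicolon should be dropped. A sketch reusing the placeholders from the question:

```bash
sqoop import \
  --driver com.vertica.jdbc.Driver \
  --connect "jdbc:vertica://10.10.10.10:5433/MYDB" \
  --username dbusername --password dbpassword \
  --target-dir /user/my/hdfs/dir \
  -m 1 \
  --query 'SELECT * FROM ORDER_V2 WHERE $CONDITIONS LIMIT 10'
```

With -m greater than 1, Sqoop also needs --split-by so it can partition the query across mappers.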

R - Fast Mode Function for use in data.table[,lapply(.SD,Mode),by=.()]

Submitted on 2019-12-25 19:06:54
Question: I'm summarizing data in a data.table with a group-by, where I need to take a single value of a variable for each group. I want this value to be the mode of the group. I think it needs to be the mode because a group usually has 8 rows, of which 2 rows hold one value and the other 6 or so hold another. Here's a simplified example. From this:

key1 2
key1 2
key1 2
key1 8
key1 2
key1 2
key1 2
key1 8

I want this:

key1 2

I was having trouble using the standard mode function provided by base R,…
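The excerpt stops before the solution; a common fast helper (an assumption, not necessarily the thread's final answer) tabulates matches against the unique values and picks the most frequent:

```r
library(data.table)

# Returns the most frequent value; ties go to whichever value appears first.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

dt <- data.table(key = rep("key1", 8L), val = c(2, 2, 2, 8, 2, 2, 2, 8))
dt[, .(val = Mode(val)), by = key]  # -> key1 2
```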