bigdata

Hive - Out of Memory Exception - Java Heap Space

跟風遠走 submitted on 2019-12-10 21:09:52

Question: I am running a Hive insert on top of Parquet files (created using Spark). The Hive insert uses a PARTITIONED BY clause. But at the end, while the screen is printing messages like "Loading partition {=xyz, =123, =abc}", a Java heap space exception is thrown:

java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.createEntry(HashMap.java:901)
    at java.util.HashMap.addEntry(HashMap.java:888)
    at java.util.HashMap.put(HashMap.java:509)
    at org.apache.hadoop.hive.metastore.api.Partition.<init>

R - Big Data - vector exceeds vector length limit

南笙酒味 submitted on 2019-12-10 20:53:32

Question: I have the following R code:

data <- read.csv('testfile.data', header = T)
mat = as.matrix(data)

Some more statistics of my testfile.data:

> ncol(data)
[1] 75713
> nrow(data)
[1] 44771

Since this is a large dataset, I am using an Amazon EC2 instance with 64 GB of RAM, so hopefully memory isn't an issue. I am able to load the data (the first line works). But the as.matrix transformation (the second line) throws the following error:

resulting vector exceeds vector length limit in 'AnswerType'

Any clue what
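The dimensions quoted above already explain the failure. A quick back-of-the-envelope check in plain Python, assuming the classic 2^31 - 1 element limit that the error message refers to:

ncol = 75713
nrow = 44771

cells = ncol * nrow          # elements the coerced matrix would need
limit = 2**31 - 1            # R's length limit for ordinary (non-long) vectors

print(cells)                 # 3389746723
print(limit)                 # 2147483647
print(cells > limit)         # True, hence "vector exceeds vector length limit"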

How to get absolute paths of files in a directory?

别来无恙 submitted on 2019-12-10 20:27:09

Question: I have a directory with files, directories, subdirectories, etc. How can I get the list of absolute paths to all files and directories using the Apache Hadoop API?

Answer 1: Using the HDFS API:

package org.myorg.hdfsdemo;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo { public

When does an action not run on the driver in Apache Spark?

孤街浪徒 submitted on 2019-12-10 20:18:39

Question: I have just started with Spark and am struggling with the concept of tasks. Can anyone please help me understand when an action (say, reduce) does not run in the driver program? From the Spark tutorial: "Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel." I'm currently experimenting with an application which reads a directory on 'n'
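To make the quoted sentence concrete, here is a minimal PySpark sketch (the data and partition count are made up): each partition is reduced by a task on an executor, and the driver only merges the handful of partial results, which is why func has to be commutative and associative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduce-demo").getOrCreate()
sc = spark.sparkContext

# 1000 numbers spread over 4 partitions; in the question this would be
# the records read from a directory of files.
rdd = sc.parallelize(range(1, 1001), numSlices=4)

# Each of the 4 tasks sums its own partition on an executor; only the
# 4 partial sums travel back to the driver to be merged into the result.
total = rdd.reduce(lambda a, b: a + b)
print(total)  # 500500

spark.stop()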

Anonymization of Account Numbers in 2TB of CSV's

回眸只為那壹抹淺笑 submitted on 2019-12-10 20:13:19

Question: I have ~2 TB of CSVs where the first two columns contain two ID numbers. These need to be anonymized so the data can be used in academic research. The anonymization can be (but does not have to be) irreversible. These are NOT medical records, so I do not need the fanciest cryptographic algorithm. The question: standard hashing algorithms produce really long strings, but I will have to do a lot of ID-matching (i.e. 'for the subset of rows containing ID XXX, do...') to process the anonymized
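One way to keep anonymized IDs short enough for that kind of matching is to replace each distinct ID with a sequential surrogate integer instead of a long hash. A rough streaming sketch in Python (file names and the two-column layout are assumptions; in practice this loop would run over every CSV in the 2 TB set):

import csv

id_map = {}  # original ID -> short surrogate integer

def pseudonym(original_id):
    # Assign a new surrogate the first time an ID is seen, reuse it afterwards
    # so that the same ID always maps to the same value across all files.
    if original_id not in id_map:
        id_map[original_id] = len(id_map) + 1
    return id_map[original_id]

with open("input.csv", newline="") as src, open("anonymized.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        row[0] = pseudonym(row[0])   # first ID column
        row[1] = pseudonym(row[1])   # second ID column
        writer.writerow(row)

The surrogate table has to persist across files (in memory or in a small key-value store), since matching only works if every occurrence of an ID receives the same surrogate.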

Big matrix and memory problems

て烟熏妆下的殇ゞ submitted on 2019-12-10 15:53:03

Question: I am working on a huge dataset and would like to derive the distribution of a test statistic. Hence I need to do calculations with huge matrices (200000 x 200000), and as you might predict, I have memory issues. More precisely, I get the following: Error: cannot allocate vector of size ... Gb. I work with the 64-bit version of R and my RAM is 8 GB. I tried to use the bigmemory package, but without much success. The first issue comes when I have to calculate the distance matrix. I found this nice
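The allocation error follows directly from the size of the object being requested. A quick check in plain Python, assuming 8 bytes per numeric cell as R uses for doubles:

n = 200_000            # rows and columns of the distance matrix
bytes_per_cell = 8     # double precision, as in R numeric vectors

total_bytes = n * n * bytes_per_cell
print(total_bytes)               # 320000000000 bytes
print(total_bytes / 1024**3)     # roughly 298 GiB for a single dense matrix

So a dense 200000 x 200000 matrix is far beyond 8 GB of RAM no matter which package allocates it; it has to be file-backed, processed in blocks, or avoided altogether.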

updating Hive external table with HDFS changes

大城市里の小女人 submitted on 2019-12-10 14:40:00

Question: Let's say I created a Hive external table "myTable" from the file myFile.csv (located in HDFS). myFile.csv changes every day, so I'd like "myTable" to be updated once a day too. Is there any HiveQL query that tells Hive to update the table every day? Thank you. P.S. I would like to know whether it works the same way with directories: let's say I create a Hive partition from the HDFS directory "myDir" when "myDir" contains 10 files. The next day "myDir" contains 20 files (10 files were added). Should I update

Storing a deep directory tree in a database

柔情痞子 submitted on 2019-12-10 14:32:40

Question: I am working on a desktop application that is much like WinDirStat or voidtools' Everything - it maps hard drives, i.e. builds a deeply nested dictionary out of the directory tree. The desktop application should then store the directory trees in some kind of database, so that a web application can browse them from the root, depth level by depth level. Assume both applications run locally on the same machine for the time being. The question that comes to mind is how the data should be
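For concreteness, the nested dictionary described above could be produced by a small recursive walk like the sketch below (a hedged illustration; the real application presumably records richer metadata than just file sizes):

import os

def map_tree(path):
    # Build a nested dict mirroring the directory tree rooted at `path`:
    # subdirectories map to dicts of their children, files map to their size.
    tree = {}
    try:
        entries = list(os.scandir(path))
    except PermissionError:
        return tree  # skip directories we are not allowed to read
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            tree[entry.name] = map_tree(entry.path)
        else:
            tree[entry.name] = entry.stat(follow_symlinks=False).st_size
    return tree

drive_map = map_tree("/")   # or a drive root such as "C:\\" on Windows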

Pyspark: how to duplicate a row n times in a dataframe?

对着背影说爱祢 submitted on 2019-12-10 14:22:06

Question: I've got a dataframe like this, and I want to duplicate each row n times if the column n is bigger than one:

A B n
1 2 1
2 9 1
3 8 2
4 1 1
5 3 3

And transform it like this:

A B n
1 2 1
2 9 1
3 8 2
3 8 2
4 1 1
5 3 3
5 3 3
5 3 3

I think I should use explode, but I don't understand how it works... Thanks

Answer 1: The explode function returns a new row for each element in the given array or map. One way to exploit this function is to use a udf to create a list of size n for each row. Then explode the
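A sketch of the approach the answer describes, using the column names from the question (the udf turns n into a list with n entries so that explode emits n copies of the row):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.appName("duplicate-rows").getOrCreate()

df = spark.createDataFrame(
    [(1, 2, 1), (2, 9, 1), (3, 8, 2), (4, 1, 1), (5, 3, 3)],
    ["A", "B", "n"],
)

# e.g. n = 3 becomes [3, 3, 3]
n_to_array = udf(lambda n: [n] * n, ArrayType(IntegerType()))

# replacing n with that list and exploding it yields n copies of each row
result = df.withColumn("n", n_to_array("n")).withColumn("n", explode("n"))
result.show()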

Is there an alternative to Twitter Storm that is written in Python? [closed]

梦想的初衷 submitted on 2019-12-10 14:16:11

Question: Closed 6 years ago. I couldn't find much after various searches for an alternative to Twitter Storm, specifically a streaming big data processing library