bigdata

Hive - Out of Memory Exception - Java Heap Space

跟風遠走 submitted on 2019-12-10 21:09:52

Question: I am running a Hive insert on top of Parquet files (created using Spark). The Hive insert uses a PARTITIONED BY clause. But at the end, while the screen is printing messages like "Loading partition {=xyz, =123, =abc}", a Java heap space exception is thrown:

java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.createEntry(HashMap.java:901)
    at java.util.HashMap.addEntry(HashMap.java:888)
    at java.util.HashMap.put(HashMap.java:509)
    at org.apache.hadoop.hive.metastore.api.Partition.<init>

R - Big Data - vector exceeds vector length limit

南笙酒味 submitted on 2019-12-10 20:53:32

Question: I have the following R code:

data <- read.csv('testfile.data', header = T)
mat = as.matrix(data)

Some more statistics of my testfile.data:

> ncol(data)
[1] 75713
> nrow(data)
[1] 44771

Since this is a large dataset, I am using an Amazon EC2 instance with 64 GB of RAM, so hopefully memory isn't an issue. I am able to load the data (the first line works). But the as.matrix transformation (the second line) throws the following error:

resulting vector exceeds vector length limit in 'AnswerType'

Any clue what
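The dimensions quoted above already explain the failure. A quick back-of-the-envelope check in plain Python, assuming the classic 2^31 - 1 element limit that the error message refers to:

ncol = 75713
nrow = 44771

cells = ncol * nrow          # elements the coerced matrix would need
limit = 2**31 - 1            # R's length limit for ordinary (non-long) vectors

print(cells)                 # 3389746723
print(limit)                 # 2147483647
print(cells > limit)         # True, hence "vector exceeds vector length limit"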

How to get absolute paths of files in a directory?

别来无恙 submitted on 2019-12-10 20:27:09

Question: I have a directory with files, directories, subdirectories, etc. How can I get the list of absolute paths to all files and directories using the Apache Hadoop API?

Answer 1: Using the HDFS API:

package org.myorg.hdfsdemo;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo { public

When does an action not run on the driver in Apache Spark?

孤街浪徒 submitted on 2019-12-10 20:18:39

Question: I have just started with Spark and am struggling with the concept of tasks. Can anyone please help me understand when an action (say, reduce) does not run in the driver program? From the Spark tutorial: "Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel." I'm currently experimenting with an application which reads a directory on 'n'
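To make the quoted sentence concrete, here is a minimal PySpark sketch (the data and partition count are made up): each partition is reduced by a task on an executor, and the driver only merges the handful of partial results, which is why func has to be commutative and associative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduce-demo").getOrCreate()
sc = spark.sparkContext

# 1000 numbers spread over 4 partitions; in the question this would be
# the records read from a directory of files.
rdd = sc.parallelize(range(1, 1001), numSlices=4)

# Each of the 4 tasks sums its own partition on an executor; only the
# 4 partial sums travel back to the driver to be merged into the result.
total = rdd.reduce(lambda a, b: a + b)
print(total)  # 500500

spark.stop()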

Anonymization of Account Numbers in 2TB of CSV's

回眸只為那壹抹淺笑 submitted on 2019-12-10 20:13:19

Question: I have ~2 TB of CSVs where the first two columns contain two ID numbers. These need to be anonymized so the data can be used in academic research. The anonymization can be (but does not have to be) irreversible. These are NOT medical records, so I do not need the fanciest cryptographic algorithm. The question: standard hashing algorithms produce really long strings, but I will have to do a lot of ID-matching (i.e. 'for the subset of rows containing ID XXX, do...') to process the anonymized
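One way to keep anonymized IDs short enough for that kind of matching is to replace each distinct ID with a sequential surrogate integer instead of a long hash. A rough streaming sketch in Python (file names and the two-column layout are assumptions; in practice this loop would run over every CSV in the 2 TB set):

import csv

id_map = {}  # original ID -> short surrogate integer

def pseudonym(original_id):
    # Assign a new surrogate the first time an ID is seen, reuse it afterwards
    # so that the same ID always maps to the same value across all files.
    if original_id not in id_map:
        id_map[original_id] = len(id_map) + 1
    return id_map[original_id]

with open("input.csv", newline="") as src, open("anonymized.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        row[0] = pseudonym(row[0])   # first ID column
        row[1] = pseudonym(row[1])   # second ID column
        writer.writerow(row)

The surrogate table has to persist across files (in memory or in a small key-value store), since matching only works if every occurrence of an ID receives the same surrogate.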

Big matrix and memory problems

て烟熏妆下的殇ゞ submitted on 2019-12-10 15:53:03

Question: I am working on a huge dataset and would like to derive the distribution of a test statistic. Hence I need to do calculations with huge matrices (200000 x 200000), and as you might predict, I have memory issues. More precisely, I get the following: Error: cannot allocate vector of size ... Gb. I work with the 64-bit version of R and my RAM is 8 GB. I tried to use the bigmemory package, but without much success. The first issue comes when I have to calculate the distance matrix. I found this nice
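The allocation error follows directly from the size of the object being requested. A quick check in plain Python, assuming 8 bytes per numeric cell as R uses for doubles:

n = 200_000            # rows and columns of the distance matrix
bytes_per_cell = 8     # double precision, as in R numeric vectors

total_bytes = n * n * bytes_per_cell
print(total_bytes)               # 320000000000 bytes
print(total_bytes / 1024**3)     # roughly 298 GiB for a single dense matrix

So a dense 200000 x 200000 matrix is far beyond 8 GB of RAM no matter which package allocates it; it has to be file-backed, processed in blocks, or avoided altogether.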

updating Hive external table with HDFS changes

大城市里の小女人 submitted on 2019-12-10 14:40:00

Question: Let's say I created a Hive external table "myTable" from the file myFile.csv (located in HDFS). myFile.csv changes every day, so I'd like "myTable" to be updated once a day too. Is there any HiveQL query that tells Hive to update the table every day? Thank you. P.S. I would like to know whether it works the same way with directories: let's say I create a Hive partition from the HDFS directory "myDir" when "myDir" contains 10 files. The next day "myDir" contains 20 files (10 files were added). Should I update

Storing a deep directory tree in a database

柔情痞子 submitted on 2019-12-10 14:32:40

Question: I am working on a desktop application that is much like WinDirStat or voidtools' Everything - it maps hard drives, i.e. builds a deeply nested dictionary out of the directory tree. The desktop application should then store the directory trees in some kind of database, so that a web application can browse them from the root, depth level by depth level. Assume both applications run locally on the same machine for the time being. The question that comes to mind is how the data should be
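For concreteness, the nested dictionary described above could be produced by a small recursive walk like the sketch below (a hedged illustration; the real application presumably records richer metadata than just file sizes):

import os

def map_tree(path):
    # Build a nested dict mirroring the directory tree rooted at `path`:
    # subdirectories map to dicts of their children, files map to their size.
    tree = {}
    try:
        entries = list(os.scandir(path))
    except PermissionError:
        return tree  # skip directories we are not allowed to read
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            tree[entry.name] = map_tree(entry.path)
        else:
            tree[entry.name] = entry.stat(follow_symlinks=False).st_size
    return tree

drive_map = map_tree("/")   # or a drive root such as "C:\\" on Windows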

Pyspark: how to duplicate a row n times in a dataframe?

对着背影说爱祢 submitted on 2019-12-10 14:22:06

Question: I've got a dataframe like this, and I want to duplicate each row n times if the column n is bigger than one:

A B n
1 2 1
2 9 1
3 8 2
4 1 1
5 3 3

And transform it like this:

A B n
1 2 1
2 9 1
3 8 2
3 8 2
4 1 1
5 3 3
5 3 3
5 3 3

I think I should use explode, but I don't understand how it works... Thanks

Answer 1: The explode function returns a new row for each element in the given array or map. One way to exploit this function is to use a udf to create a list of size n for each row. Then explode the
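A sketch of the approach the answer describes, using the column names from the question (the udf turns n into a list with n entries so that explode emits n copies of the row):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.appName("duplicate-rows").getOrCreate()

df = spark.createDataFrame(
    [(1, 2, 1), (2, 9, 1), (3, 8, 2), (4, 1, 1), (5, 3, 3)],
    ["A", "B", "n"],
)

# e.g. n = 3 becomes [3, 3, 3]
n_to_array = udf(lambda n: [n] * n, ArrayType(IntegerType()))

# replacing n with that list and exploding it yields n copies of each row
result = df.withColumn("n", n_to_array("n")).withColumn("n", explode("n"))
result.show()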

Is there an alternative to Twitter Storm that is written in Python? [closed]

梦想的初衷 submitted on 2019-12-10 14:16:11

Question: Closed 6 years ago. I couldn't find much after various searches for an alternative to Twitter Storm, specifically a streaming big data processing library