bigdata

Apache Spark architecture

守給你的承諾、 posted on 2019-12-03 13:25:23
Question: I am trying to find complete documentation on the internal architecture of Apache Spark, but have found nothing. For example, I am trying to understand the following: assume we have a 1 TB text file on HDFS (3 nodes in a cluster, replication factor 1). This file will be split into 128 MB chunks, and each chunk will be stored on only one node. We run Spark workers on these nodes. I know that Spark tries to work with data stored in HDFS on the same node (to avoid network I/O). For example
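A quick way to see how HDFS blocks map onto Spark partitions is to load the file and count the partitions. The snippet below is only a minimal PySpark sketch: the HDFS path is hypothetical, and it assumes a running cluster with the default 128 MB block size.

    from pyspark import SparkContext

    sc = SparkContext(appName="locality-check")
    rdd = sc.textFile("hdfs:///data/big_file.txt")   # hypothetical 1 TB file

    # With 128 MB HDFS blocks, a 1 TB file yields roughly 8192 input partitions;
    # Spark then tries to schedule each partition's task on a node holding that block.
    print(rdd.getNumPartitions())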

Extend numpy mask by n cells to the right for each bad value, efficiently

坚强是说给别人听的谎言 posted on 2019-12-03 13:12:31
Let's say I have a length-30 array with 4 bad values in it. I want to create a mask for those bad values, but since I will be using rolling-window functions, I'd also like a fixed number of subsequent indices after each bad value to be marked as bad; in the example below, n = 3. I would like to do this as efficiently as possible, because this routine will be run many times on large data series containing billions of data points, so I need something as close to a vectorized numpy solution as possible in order to avoid Python loops. To avoid retyping, here is the array: import numpy as np a = np
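One loop-free way to widen a boolean mask to the right is a convolution with a window of ones. This is only a sketch; the example array and the choice of n below are made up, not the asker's data.

    import numpy as np

    def extend_mask(mask, n):
        # Mark the n positions following every True as True as well.
        # Convolving the 0/1 mask with a window of n+1 ones makes every position
        # within n steps after a bad value nonzero.
        kernel = np.ones(n + 1, dtype=int)
        return np.convolve(mask.astype(int), kernel)[:len(mask)] > 0

    a = np.random.rand(30)
    bad = a < 0.1                  # example "bad value" criterion
    bad_extended = extend_mask(bad, 3)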

Apache Drill vs Spark

被刻印的时光 ゝ posted on 2019-12-03 12:45:01
Question: I have some experience with Apache Spark and Spark SQL. Recently I found the Apache Drill project. Could you describe the most significant advantages/differences between them? I have already read Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill), but this topic is still unclear to me. Answer 1: Here's an article I came across that discusses some of the SQL technologies: http://www.zdnet.com/article/sql-and-hadoop-its-complicated/ Drill is fundamentally different in

How can I save an RDD into HDFS and later read it back?

天涯浪子 posted on 2019-12-03 12:03:19
Question: I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD to HDFS and later read it back in a Spark program. Is that possible? And if so, how? Answer 1: It is possible. An RDD has saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can parse them back later. Reading can be done with the textFile function from SparkContext and then a .map to eliminate the (). So: Version 1: rdd.saveAsTextFile("hdfs
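The answer above refers to the Scala API; as an illustration of the same round trip, here is a minimal PySpark sketch. The HDFS output path and the sample tuples are hypothetical.

    import ast
    from pyspark import SparkContext

    sc = SparkContext(appName="save-and-reload")
    rdd = sc.parallelize([(1, "alpha"), (2, "beta")])

    # Each (long, string) pair is written out in its text form, e.g. "(1, 'alpha')".
    rdd.saveAsTextFile("hdfs:///tmp/example_rdd")

    # Read the text back and parse each "(value1, 'value2')" line into a tuple again.
    restored = sc.textFile("hdfs:///tmp/example_rdd").map(ast.literal_eval)
    print(restored.collect())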

Hive execution hook

丶灬走出姿态 posted on 2019-12-03 10:13:48
I need to hook a custom execution hook into Apache Hive. Please let me know if somebody knows how to do it. The current environment I am using is given below: Hadoop: Cloudera version 4.1.2; operating system: CentOS. Thanks, Arun. There are several types of hooks, depending on the stage at which you want to inject your custom code: driver run hooks (pre/post), semantic analyzer hooks (pre/post), execution hooks (pre/failure/post), and the client statistics publisher. If you run a script, the processing flow looks as follows: Driver.run() takes the command; HiveDriverRunHook.preDriverRun() (HiveConf
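Once a hook class (for execution hooks, an implementation of Hive's ExecuteWithHookContext interface) is on Hive's classpath, it is wired in through configuration. The following is only a sketch; the com.example class names are placeholders, not real hooks.

    -- set per session (or in hive-site.xml); the com.example classes are hypothetical
    SET hive.exec.pre.hooks=com.example.MyPreExecHook;
    SET hive.exec.post.hooks=com.example.MyPostExecHook;
    SET hive.exec.failure.hooks=com.example.MyFailureHook;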

Doing PCA on a very large data set in R

喜你入骨 posted on 2019-12-03 09:59:43
Question: I have a very large training set (~2 GB) in a CSV file. The file is too large to read directly into memory (read.csv() brings the computer to a halt), and I would like to reduce the size of the data file using PCA. The problem is that (as far as I can tell) I need to read the file into memory in order to run a PCA algorithm (e.g., princomp()). I have tried the bigmemory package
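The question is about R, but the underlying idea, fitting PCA incrementally on chunks so the whole CSV never has to sit in memory at once, can be sketched in Python as below. The file name, chunk size, and component count are made-up values, and the CSV is assumed to be purely numeric.

    import pandas as pd
    from sklearn.decomposition import IncrementalPCA

    ipca = IncrementalPCA(n_components=10)

    # Feed the large CSV to the PCA in 50,000-row chunks instead of loading it whole.
    for chunk in pd.read_csv("training_set.csv", chunksize=50_000):
        ipca.partial_fit(chunk.values)

    # Later, project chunks (or new data) onto the reduced space with ipca.transform(...).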

Reverse Sorting Reducer Keys

偶尔善良 posted on 2019-12-03 08:38:13
What is the best approach to get the map output keys to a reducer in reverse order? By default the reducer receives all keys in ascending order. Any help or comments are greatly appreciated. In simple terms: in the normal scenario, if a map emits the keys 1,4,3,5,2, the reducer receives them as 1,2,3,4,5. I would like the reducer to receive 5,4,3,2,1 instead. In Hadoop 1.x, you can specify a custom comparator class for your outputs using JobConf.setOutputKeyComparatorClass. Your comparator must implement the RawComparator interface. With Hadoop 2.x, this is done by using Job

Understanding and building a social network algorithm

不打扰是莪最后的温柔 posted on 2019-12-03 07:54:27
Question: I am not sure whether this is the right platform to ask this question, but my problem statement is: I have a book shop and x clients (x is huge). A client can tell me whether a book is good or bad (not recommended). I have internal logic to group similar books together, so if a client says a book is bad, he is effectively saying that similar books are bad too and should not be shown to him. I oblige and hide those books. Clients can also interact among themselves, and have a mutual confidence level between

Why is Kafka so fast? [closed]

青春壹個敷衍的年華 posted on 2019-12-03 07:43:43
Given the same hardware, should I use Kafka or our current solution (ServiceMix/Camel)? Is there any difference? Can Kafka handle "bigger" data than it can? Why? There is an article about how fast it can be, Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines), but I still don't clearly understand why Kafka is so fast compared to other solutions. Kafka is fast for a number of reasons. To name a few: zero copy, see https://en.wikipedia.org/wiki/Zero-copy ; basically it calls the OS kernel directly rather than going through the application layer to move data fast. Batching data in chunks -
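Batching is also something the client opts into. As a rough illustration only (not from the question or answer above), a kafka-python producer can be asked to accumulate records before sending; the broker address, topic, and tuning values below are made up.

    from kafka import KafkaProducer

    # Group many small records into larger requests: wait up to 50 ms and collect
    # up to 64 KB per partition before shipping a batch to the broker.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        batch_size=64 * 1024,
        linger_ms=50,
    )

    for i in range(100_000):
        producer.send("test-topic", value=("message %d" % i).encode("utf-8"))
    producer.flush()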

How to produce a massive amount of data?

孤者浪人 posted on 2019-12-03 06:55:09
I'm doing some testing with Nutch and Hadoop, and I need a massive amount of data. I want to start with 20 GB, go to 100 GB, 500 GB, and eventually reach 1-2 TB. The problem is that I don't have this amount of data, so I'm thinking of ways to produce it. The data itself can be of any kind. One idea is to take an initial set of data and duplicate it, but that is not good enough because I need files that differ from one another (identical files are ignored). Another idea is to write a program that creates files filled with dummy data. Any other ideas? Iterator: This may be a better question for the
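As a starting point for the "program that creates dummy files" idea, here is a minimal sketch. The output directory, file count, and sizes are arbitrary, and each file gets its own random seed so that no two files are identical.

    import os
    import random
    import string

    def write_dummy_file(path, size_bytes, seed):
        # Fill one file with pseudo-random lowercase text until roughly size_bytes is reached.
        rng = random.Random(seed)
        alphabet = string.ascii_lowercase + " "
        with open(path, "w") as f:
            written = 0
            while written < size_bytes:
                line = "".join(rng.choice(alphabet) for _ in range(99)) + "\n"
                f.write(line)
                written += len(line)

    os.makedirs("dummy_data", exist_ok=True)
    for i in range(200):  # 200 files x ~100 MB ≈ 20 GB
        write_dummy_file(os.path.join("dummy_data", "part-%05d.txt" % i), 100 * 1024 * 1024, seed=i)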