bigdata

Why does Spark SQL consider the support of indexes unimportant?

和自甴很熟 submitted on 2019-11-29 22:55:30
Quoting the Spark DataFrames, Datasets and SQL manual: "A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL." Being new to Spark, I'm a bit baffled by this for two reasons: Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases
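For context, the "in-memory computational model" refers to Spark caching intermediate results across queries rather than relying on on-disk index structures, and cached data can spill to disk when it does not fit in RAM. A minimal Scala sketch, assuming a hypothetical Parquet path and made-up column names, purely to illustrate that cached data is not strictly memory-only:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-sketch")
      .master("local[*]")                      // local setup, for illustration only
      .getOrCreate()

    // Hypothetical input path; any columnar source behaves the same way.
    val events = spark.read.parquet("/data/events.parquet")

    // MEMORY_AND_DISK keeps as many partitions in memory as possible and
    // spills the rest to local disk, so the dataset does not have to fit
    // entirely in RAM.
    val cached = events.persist(StorageLevel.MEMORY_AND_DISK)

    // Repeated queries reuse the materialized partitions instead of
    // re-reading and re-parsing the source files ("amount" and "category"
    // are made-up columns).
    cached.filter("amount > 0").count()
    cached.groupBy("category").count().show()

    spark.stop()
  }
}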

How does the Apache Spark scheduler split files into tasks?

冷暖自知 submitted on 2019-11-29 21:50:10
Question: At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals; slide 17 shows a stage being split into 4 tasks, as below. Here I want to know three things about how a stage is split into tasks: in the example above, it seems that the number of tasks is based on the number of files, am I right? If I'm right in point 1, then if there were just 3 files under the names directory, would it just create 3 tasks? If I'm right in point 2, what if there is just
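Broadly, Spark runs one task per partition of the stage's last RDD, and for file input the partition count comes from the input splits (roughly one per HDFS block), not simply from the number of files. A hedged Scala sketch, with a hypothetical input directory, that makes the partition-to-task relationship visible:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionInspection {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-inspection").setMaster("local[*]"))

    // Hypothetical directory; each file is cut into one or more input
    // splits, and each split becomes one partition of the RDD.
    val lines = sc.textFile("/data/names")
    println(s"default partitions: ${lines.getNumPartitions}")

    // A minimum partition count can be requested explicitly, so the task
    // count is not tied one-to-one to the file count.
    val moreSplits = sc.textFile("/data/names", minPartitions = 12)
    println(s"with minPartitions = 12: ${moreSplits.getNumPartitions}")

    // One task is launched per partition; this shows how many records
    // each partition (and therefore each task) would process.
    lines
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx -> $n lines") }

    sc.stop()
  }
}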

MapReduce or Spark? [closed]

别等时光非礼了梦想. submitted on 2019-11-29 21:26:21
I have tested Hadoop and MapReduce with Cloudera and found it pretty cool; I thought it was the most recent and relevant big-data solution. But a few days ago, I found this: https://spark.incubator.apache.org/ A "lightning fast cluster computing" system, able to work on top of a Hadoop cluster, and apparently able to crush MapReduce. I saw that it works more in RAM than MapReduce. I think that MapReduce is still relevant when you have to do cluster computing to overcome the I/O problems you can have on a single machine. But since Spark can do the jobs that MapReduce does, and may be way more
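The "works more in RAM" point matters most for iterative or multi-pass jobs: MapReduce writes intermediate results back to HDFS between jobs, while Spark can keep a working set cached in memory across passes. A small Scala sketch with a hypothetical log file and an artificial multi-pass loop, purely to illustrate the caching idea:

import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

    // Hypothetical input; cache() keeps the filtered records in executor
    // memory after the first pass instead of re-reading them from disk.
    val errors = sc.textFile("/data/app.log")
      .filter(_.contains("ERROR"))
      .cache()

    // Each pass reuses the cached RDD; in classic MapReduce every pass
    // would be a separate job reading its input from HDFS again.
    for (keyword <- Seq("timeout", "refused", "out of memory")) {
      println(s"$keyword: ${errors.filter(_.contains(keyword)).count()}")
    }

    sc.stop()
  }
}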

How to compare two dataframes and print columns that are different in Scala

寵の児 submitted on 2019-11-29 21:05:04
We have two data frames here. The expected dataframe:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

and the actual data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
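One common way to approach this (a sketch, not necessarily the answer the asker ended up with) is to join the two DataFrames on emp_id and compare each column pair, printing only the mismatches. Assuming expected and actual are DataFrames with the schema shown above and emp_id is a unique key:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Minimal sketch: prints, for every non-key column, the rows where the
// expected and actual values disagree (null-unsafe comparison).
def columnsThatDiffer(expected: DataFrame, actual: DataFrame, key: String = "emp_id"): Unit = {
  val joined       = expected.alias("e").join(actual.alias("a"), Seq(key))
  val valueColumns = expected.columns.filter(_ != key)

  valueColumns.foreach { c =>
    val mismatches = joined.filter(col(s"e.$c") =!= col(s"a.$c"))
    if (mismatches.count() > 0) {
      println(s"Column '$c' differs for these $key values:")
      mismatches
        .select(col(key), col(s"e.$c").as("expected"), col(s"a.$c").as("actual"))
        .show(false)
    }
  }
}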

Is there something like Redis DB, but not limited by RAM size? [closed]

丶灬走出姿态 submitted on 2019-11-29 19:37:06
I'm looking for a database matching these criteria:
- May be non-persistent;
- Almost all keys of the DB need to be updated once every 3-6 hours (100M+ keys with a total size of 100 GB);
- Ability to quickly select data by key (or primary key);
- Needs to be a DBMS (so LevelDB doesn't fit);
- While data is being written, the DB cluster must be able to serve queries (single nodes can be blocked, though);
- Not in-memory – our dataset will exceed the RAM limits;
- Horizontal scaling and replication;
- Support for a full rewrite of all the data (MongoDB doesn't clear space after deleting data);
- C# and Java support.
Here's my process of working

Importance of PCA or SVD in machine learning

耗尽温柔 submitted on 2019-11-29 19:33:41
All this time (especially in the Netflix contest), I keep coming across blogs (or leaderboard forums) where they mention how applying a simple SVD step to the data helped them reduce sparsity in the data or, in general, improved the performance of the algorithm at hand. I have been trying to think about it (for a long time), but I cannot guess why this is so. In general, the data I get is very noisy (which is also the fun part of big data), and I do know some basic feature-scaling stuff like log transformation and mean normalization. But how does something like SVD help? So let's say I have
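To make the dimensionality-reduction idea concrete, here is a hedged Scala sketch using Spark MLlib's RowMatrix.computeSVD on a tiny, made-up user-by-item matrix; the data and the choice of rank k = 2 are purely illustrative:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.{SparkConf, SparkContext}

object SvdSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("svd-sketch").setMaster("local[*]"))

    // Toy user-by-item rating matrix; real data would be far larger and
    // much sparser, which is where the low-rank approximation pays off.
    val rows = sc.parallelize(Seq(
      Vectors.dense(5.0, 0.0, 0.0, 1.0),
      Vectors.dense(4.0, 0.0, 0.0, 1.0),
      Vectors.dense(0.0, 5.0, 4.0, 0.0),
      Vectors.dense(0.0, 4.0, 5.0, 0.0)
    ))
    val mat = new RowMatrix(rows)

    // Keep only the top-2 singular values/vectors: each user is projected
    // onto 2 dense "concept" coordinates, which smooths out noise and
    // removes much of the sparsity.
    val svd = mat.computeSVD(2, computeU = true)
    println(s"singular values: ${svd.s}")
    svd.U.rows.collect().foreach(println)   // users in the reduced space

    sc.stop()
  }
}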

Prepare my bigdata with Spark via Python

折月煮酒 submitted on 2019-11-29 17:01:39
My quantized data, 100m in size:

('1424411938', [3885, 7898])
('3333333333', [3885, 7898])

Desired result:

(3885, [3333333333, 1424411938])
(7898, [3333333333, 1424411938])

So what I want is to transform the data so that I group 3885 (for example) with all the data[0] values that have it. Here is what I did in Python:

def prepare(data):
    result = []
    for point_id, cluster in data:
        for index, c in enumerate(cluster):
            found = 0
            for res in result:
                if c == res[0]:
                    found = 1
            if found == 0:
                result.append((c, []))
            for res in result:
                if c == res[0]:
                    res[1].append(point_id)
    return result

but when I
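The grouping described above maps naturally onto flatMap + groupByKey in Spark, avoiding the quadratic scan over result; a hedged sketch of that shape (shown in Scala for illustration, the PySpark calls are analogous), using the toy records from the question:

import org.apache.spark.{SparkConf, SparkContext}

object GroupByCluster {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("group-by-cluster").setMaster("local[*]"))

    // Toy version of the quantized data: (point_id, list of cluster ids).
    val data = sc.parallelize(Seq(
      ("1424411938", Seq(3885, 7898)),
      ("3333333333", Seq(3885, 7898))
    ))

    // Emit one (cluster_id, point_id) pair per membership, then group by
    // cluster id; this replaces the nested loops over `result`.
    val grouped = data
      .flatMap { case (pointId, clusters) => clusters.map(c => (c, pointId)) }
      .groupByKey()
      .mapValues(_.toList)

    grouped.collect().foreach(println)
    // e.g. (3885,List(1424411938, 3333333333))
    //      (7898,List(1424411938, 3333333333))

    sc.stop()
  }
}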

Spark RDDs - how do they work

拥有回忆 submitted on 2019-11-29 16:29:06
Question: I have a small Scala program that runs fine on a single node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how RDDs work in Spark, so this question is based around theory and may not be 100% correct. Let's say I create an RDD: val rdd = sc.textFile(file) Now, once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)? Secondly, I
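Roughly speaking, sc.textFile(file) does not ship any data anywhere by itself: it only records where the input lives and how to split it. When an action runs, each partition is read by the executor that runs the corresponding task, preferably on a node where that split is local. A small Scala sketch, with a hypothetical shared path, showing the laziness and the partition metadata:

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // Nothing is read yet: textFile only records the source and how to
    // split it. The path must be visible to every node (HDFS, NFS, ...).
    val rdd = sc.textFile("/shared/data/input.txt")

    // Partition metadata is available without touching the data.
    println(s"partitions: ${rdd.getNumPartitions}")

    // Transformations are lazy too; they only extend the lineage.
    val words = rdd.flatMap(_.split("\\s+"))

    // The first action triggers the actual reads: each partition is
    // processed by one task, scheduled close to where its split lives.
    println(s"word count: ${words.count()}")

    sc.stop()
  }
}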

How to insert big data in Laravel?

情到浓时终转凉″ submitted on 2019-11-29 15:58:02
I am using Laravel 5.6. My script to insert big data looks like this:

...
$insert_data = [];
foreach ($json['value'] as $value) {
    $posting_date = Carbon::parse($value['Posting_Date']);
    $posting_date = $posting_date->format('Y-m-d');
    $data = [
        'item_no'      => $value['Item_No'],
        'entry_no'     => $value['Entry_No'],
        'document_no'  => $value['Document_No'],
        'posting_date' => $posting_date,
        ....
    ];
    $insert_data[] = $data;
}
\DB::table('items_details')->insert($insert_data);

I have tried to insert 100 records with the script, and it works; it successfully inserts the data. But if I try to insert 50000 records with the

How to column bind two ffdf

China☆狼群 submitted on 2019-11-29 15:57:06
Suppose two ffdf files:

library(ff)
ff1 <- as.ffdf(data.frame(matrix(rnorm(10*10), ncol=10)))
ff2 <- ff1
colnames(ff2) <- 1:10

How can I column-bind these without loading them into memory? cbind doesn't work. There is the same question at http://stackoverflow.com/questions/18355686/columnbind-ff-data-frames-in-r but it does not have an MWE and the author abandoned it, so I reposted it.

Audrey: You can use the following construct, cbind.ffdf2, making sure the column names of the two input ffdf objects are not duplicated:

library(ff)
ff1 <- as.ffdf(data.frame(letA = letters[1:5], numA = 1:5))
ff2 <- as.ffdf