bigdata

Clustering a very large dataset in R

Submitted by 可紊 on 2019-11-28 23:54:21
I have a dataset of 70,000 numeric values representing distances ranging from 0 to 50, and I want to cluster these numbers. However, with the classical clustering approach I would have to build a 70,000 x 70,000 distance matrix holding the distance between every pair of numbers in my dataset, and that won't fit in memory. Is there a smart way to solve this problem without resorting to stratified sampling? I also tried the bigmemory and biganalytics libraries in R but still can't fit the data into memory. You can use kmeans, which normally
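The cut-off answer points at kmeans, which clusters the raw values directly and never builds a pairwise distance matrix. Below is a minimal sketch of that idea in Python (the question itself is about R; the random data and the choice of 5 clusters are placeholders):

```python
# Sketch only: k-means on 70,000 one-dimensional values needs O(n * k) work per
# iteration, not a 70,000 x 70,000 distance matrix.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
values = rng.uniform(0, 50, size=70_000).reshape(-1, 1)  # stand-in for the real distances

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(values)
print(np.bincount(km.labels_))       # how many values fell into each cluster
print(km.cluster_centers_.ravel())   # the cluster centres
```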

Active tasks is a negative number in Spark UI

Submitted by 二次信任 on 2019-11-28 21:18:12
When using spark-1.6.2 and pyspark, I saw this: where you can see that the active tasks are a negative number (the difference between the total tasks and the completed tasks). What is the source of this error? Note that I have many executors. However, it seems that one task has been idle (I don't see any progress), while another identical task completed normally. Also related: that mail. I can confirm that many tasks are being created, since I am using 1k or 2k executors. The error I am getting is a bit different: 16/08/15 20:03:38 ERROR LiveListenerBus: Dropping

Clustering Keys in Cassandra

Submitted by 不羁的心 on 2019-11-28 18:49:05
On a given physical node, rows for a given partition key are stored in the order induced by the clustering keys, making the retrieval of rows in that clustering order particularly efficient. http://cassandra.apache.org/doc/cql3/CQL.html#createTableStmt What kind of ordering is induced by clustering keys? Suppose your clustering keys are k1 t1, k2 t2, ..., kn tn, where ki is the ith key name and ti is the ith key type. Then the data is stored in lexicographic order, where each dimension is compared using the comparator for that type. So (a1, a2, ..., an) < (b1, b2, ..., bn) if a1 < b1
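To make the lexicographic rule concrete, here is a tiny illustration in Python, whose tuples compare element by element in exactly the same way (illustration only, not Cassandra code):

```python
# Rows within one partition, keyed by clustering columns (k1, k2).
# Lexicographic order compares the first component; only on a tie does it
# compare the second, and so on.
rows = [("b", 2), ("a", 3), ("a", 1), ("b", 1)]
print(sorted(rows))  # [('a', 1), ('a', 3), ('b', 1), ('b', 2)]
```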

MongoDB as file storage

Submitted by 吃可爱长大的小学妹 on 2019-11-28 18:46:05
I'm trying to find the best solution for scalable storage of big files. File sizes can vary from 1-2 megabytes up to 500-600 gigabytes. I have found some information about Hadoop and its HDFS, but it looks a bit complicated, because I don't need any Map/Reduce jobs or many of the other features. Now I'm thinking of using MongoDB and its GridFS as a file-storage solution. And now the questions: What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage.) Will files from GridFS be cached
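For reference, a minimal GridFS read/write round trip with pymongo looks roughly like this (a sketch assuming a local MongoDB instance and a hypothetical database name; it does not answer the locking or caching questions):

```python
# Sketch: store a file in GridFS and read it back with pymongo.
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["filestore"]  # hypothetical database name
fs = gridfs.GridFS(db)

with open("big_input.bin", "rb") as f:             # placeholder file name
    file_id = fs.put(f, filename="big_input.bin")  # streamed into GridFS as chunks

data = fs.get(file_id).read()                      # read the whole file back
print(len(data))
```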

Haskell: Can I perform several folds over the same lazy list without keeping list in memory?

Submitted by 泪湿孤枕 on 2019-11-28 18:21:47
My context is bioinformatics, next-generation sequencing in particular, but the problem is generic, so I will use a log file as an example. The file is very large (gigabytes, compressed, so it will not fit in memory) but easy to parse (each line is an entry), so we can easily write something like: parse :: Lazy.ByteString -> [LogEntry] Now, I have a lot of statistics that I would like to compute from the log file. It is easiest to write separate functions such as: totalEntries = length nrBots = sum . map fromEnum . map isBotEntry averageTimeOfDay = histogram . map extractHour All of
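The tension here is that each statistic, written separately, traverses the list once, and sharing the lazily parsed list between them keeps it all in memory. Below is a Python sketch of the usual way out, folding every statistic in a single pass over a lazily produced stream (an analogue only; the question itself is about Haskell, and isBotEntry/extractHour are replaced by hypothetical stand-ins):

```python
# Sketch: one pass over a lazy stream, updating every accumulator as we go,
# so no entry has to be kept around for a second traversal.
from collections import Counter

def parse(lines):                 # stand-in for parse :: Lazy.ByteString -> [LogEntry]
    for line in lines:
        yield line.rstrip("\n")

def stats(entries):
    total = 0
    bots = 0
    hours = Counter()             # histogram over the hour-of-day field
    for entry in entries:
        total += 1
        bots += entry.startswith("bot")   # hypothetical isBotEntry
        hours[entry[:2]] += 1             # hypothetical extractHour
    return total, bots, hours

with open("access.log") as f:     # placeholder log file
    print(stats(parse(f)))
```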

How does the pyspark mapPartitions function work?

Submitted by 怎甘沉沦 on 2019-11-28 17:36:44
So I am trying to learn Spark using Python (PySpark). I want to know how the function mapPartitions works, that is, what input it takes and what output it gives. I couldn't find any proper example on the internet. Let's say I have an RDD containing lists, such as: [[1, 2, 3], [3, 2, 4], [5, 2, 7]] I want to remove the element 2 from all the lists; how would I achieve that using mapPartitions? mapPartitions should be thought of as a map operation over partitions and not over the elements of the partition. Its input is the set of current partitions; its output will be another set
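A minimal pyspark sketch of that description: the function handed to mapPartitions receives an iterator over one partition's elements and must return (or yield) an iterable of output elements. Assuming a local SparkContext:

```python
# Sketch: remove the value 2 from every list, one partition at a time.
from pyspark import SparkContext

sc = SparkContext("local[2]", "mapPartitions-demo")
rdd = sc.parallelize([[1, 2, 3], [3, 2, 4], [5, 2, 7]], numSlices=2)

def drop_twos(partition):
    # `partition` is an iterator over the lists that landed in this partition.
    for lst in partition:
        yield [x for x in lst if x != 2]

print(rdd.mapPartitions(drop_twos).collect())  # [[1, 3], [3, 4], [5, 7]]
```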

MapReduce or Spark? [closed]

Submitted by 元气小坏坏 on 2019-11-28 17:30:01
Question: I have tested Hadoop and MapReduce with Cloudera and found it pretty cool; I thought it was the most recent and relevant Big Data solution. But a few days ago I found this: https://spark.incubator.apache.org/ A "Lightning fast cluster computing system", able to work on top of a Hadoop cluster and apparently able to crush MapReduce. I saw that it works more in RAM than MapReduce. I think that MapReduce is still relevant when you have to do cluster computing to overcome the I/O problems you

Why does Spark SQL consider the support of indexes unimportant?

Submitted by 丶灬走出姿态 on 2019-11-28 17:09:07
Question: Quoting the Spark DataFrames, Datasets and SQL manual: A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL. Being new to Spark, I'm a bit baffled by this, for two reasons: Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what

How to compare two dataframes and print columns that are different in Scala

Submitted by 烂漫一生 on 2019-11-28 16:50:58
Question: We have two data frames here. The expected dataframe:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

and the actual data frame: +------+-
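One common pattern, sketched here in pyspark rather than the Scala of the question, is to join the two frames on the key and check each column for mismatches. Since the actual dataframe is cut off above, a hypothetical stand-in with one changed cell is used:

```python
# Sketch: join expected and actual on emp_id, then report columns whose values differ.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").appName("df-diff").getOrCreate()

cols = ["emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site"]
expected = spark.createDataFrame(
    [(3, "Chennai", "rahman", "9848022330", 45000, "SanRamon"),
     (1, "Hyderabad", "ram", "9848022338", 50000, "SF"),
     (2, "Hyderabad", "robin", "9848022339", 40000, "LA"),
     (4, "sanjose", "romin", "9848022331", 45123, "SanRamon")], cols)

# Hypothetical "actual" frame: same data with one differing cell, since the real
# table is truncated in the question.
actual = expected.withColumn(
    "emp_site", F.when(F.col("emp_id") == 4, "Chennai").otherwise(F.col("emp_site")))

joined = expected.alias("e").join(actual.alias("a"), "emp_id")
for c in cols[1:]:
    if joined.filter(F.col("e." + c) != F.col("a." + c)).count() > 0:
        print("column differs:", c)   # prints: column differs: emp_site
```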

“Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used” on an EMR cluster with 75GB of memory

Submitted by 半腔热情 on 2019-11-28 15:18:20
I'm running a 5-node Spark cluster on AWS EMR, each node an m3.xlarge (1 master, 4 slaves). I successfully ran through a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5 GB bzip2 CSV file on this cluster, but I'm receiving this error: 16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting
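The log line is cut off above, so the exact recommendation it goes on to make isn't shown. One commonly suggested mitigation for this class of error on Spark-on-YARN (an assumption here, not taken from the truncated answer, and the property names vary by Spark version) is to reserve more off-heap overhead per executor, for example:

```python
# Sketch: give each YARN container more headroom beyond the JVM heap.
# "spark.yarn.executor.memoryOverhead" is the pre-2.3 property name; newer
# versions use "spark.executor.memoryOverhead". The values are examples only.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("bzip2-csv-aggregation")
        .set("spark.executor.memory", "8g")
        .set("spark.yarn.executor.memoryOverhead", "2048"))  # MB of off-heap headroom

sc = SparkContext(conf=conf)
```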