bigdata

Incremental PCA on big data

限于喜欢 submitted on 2019-11-27 14:57:59
Question: I just tried using the IncrementalPCA from sklearn.decomposition, but it threw a MemoryError just like the PCA and RandomizedPCA before it. My problem is that the matrix I am trying to load is too big to fit into RAM. Right now it is stored in an HDF5 database as a dataset of shape ~(1000000, 1000), so I have 1,000,000,000 float32 values. I thought IncrementalPCA loads the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be…
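A minimal sketch of feeding IncrementalPCA batch by batch straight from an HDF5 dataset, assuming h5py; the file name, dataset name, n_components, and batch size below are placeholders, not from the original question:

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Hypothetical file/dataset names; adjust to your HDF5 layout.
f = h5py.File('train.h5', 'r')
dset = f['data']              # shape ~(1000000, 1000), float32, stays on disk

ipca = IncrementalPCA(n_components=50)
batch_size = 10000            # rows per chunk; must be >= n_components

# First pass: fit the model one slice at a time, never loading the full matrix.
for start in range(0, dset.shape[0], batch_size):
    chunk = dset[start:start + batch_size]   # only this slice is read into RAM
    ipca.partial_fit(chunk)

# Second pass (optional): transform in the same chunked fashion.
# reduced = np.vstack([ipca.transform(dset[i:i + batch_size])
#                      for i in range(0, dset.shape[0], batch_size)])
f.close()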

What methods can we use to reshape VERY large data sets?

浪尽此生 submitted on 2019-11-27 14:23:22
Question: When, due to very large data, calculations take a long time and we therefore don't want them to crash, it would be valuable to know beforehand which reshape method to use. Lately, methods for reshaping data have been developed further with regard to performance, e.g. data.table::dcast and tidyr::spread. dcast.data.table in particular seems to set the tone [1], [2], [3], [4]. This makes other methods, such as base R's reshape, seem outdated and almost useless in benchmarks [5]. Theory: However, I've…
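The question itself is about R, but as a rough illustration of what long-to-wide reshaping means, here is a hedged pandas analogue (column names and values are made up for the example, and this says nothing about the R benchmarks being discussed):

import pandas as pd

# Hypothetical long-format data: one row per (id, key) pair.
long_df = pd.DataFrame({
    'id':    [1, 1, 2, 2],
    'key':   ['height', 'weight', 'height', 'weight'],
    'value': [170, 65, 180, 80],
})

# Long -> wide, the same shape change dcast/spread perform in R.
wide_df = long_df.pivot_table(index='id', columns='key', values='value')
print(wide_df)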

Active tasks is a negative number in Spark UI

荒凉一梦 submitted on 2019-11-27 13:39:59
Question: When using spark-1.6.2 and pyspark, I saw this [screenshot not included], where the active tasks are a negative number (the difference of the total tasks from the completed tasks). What is the source of this error? Note that I have many executors. However, there seems to be a task that has been idle (I don't see any progress), while another identical task completed normally. Also related: that mail. I can confirm that many tasks are being created, since I am using 1k or 2k…

Reading big data with fixed width

南笙酒味 submitted on 2019-11-27 12:47:13
How can I read big data formatted with fixed width? I read this question and tried some tips, but all the answers are for delimited data (such as .csv), and that's not my case. The data is 558 MB, and I don't know how many lines it has. I'm using: dados <- read.fwf('TS_MATRICULA_RS.txt', width=c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2, 9, 9, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1), stringsAsFactors=FALSE, comment.char='', colClasses=c('integer', 'integer', 'integer', 'integer',…
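The question is about R's read.fwf, but as a sketch of the same idea in Python (an alternative approach, not the poster's code), pandas.read_fwf can read a fixed-width file in chunks so the 558 MB file never has to be held in memory at once; the widths list is truncated to a placeholder and process() is a hypothetical per-chunk handler:

import pandas as pd

# Placeholder widths -- use the full width vector from the question in practice.
widths = [5, 13, 14, 3, 3, 5, 4, 6, 6, 6]

chunks = pd.read_fwf('TS_MATRICULA_RS.txt', widths=widths,
                     header=None, chunksize=100_000)   # 100k rows at a time
for chunk in chunks:
    process(chunk)   # hypothetical: aggregate, filter, or append to disk here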

MongoDB as file storage

旧城冷巷雨未停 submitted on 2019-11-27 11:48:36
Question: I'm trying to find the best solution for creating scalable storage for big files. File sizes can vary from 1-2 megabytes up to 500-600 gigabytes. I have found some information about Hadoop and its HDFS, but it looks a little bit complicated, because I don't need any Map/Reduce jobs and many other features. Now I'm thinking of using MongoDB and its GridFS as the file storage solution. And now the questions: What will happen with GridFS when I try to write a few files concurrently? Will there be any…
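A minimal sketch of writing and reading a file through GridFS with pymongo; the connection string, database name, and file name are hypothetical. GridFS stores each file as metadata in fs.files plus fixed-size chunks in fs.chunks, which is what allows files far larger than the document size limit:

import gridfs
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
db = client['filestore']          # hypothetical database name
fs = gridfs.GridFS(db)

# Write: the file is streamed into fs.files / fs.chunks documents.
with open('big_input.bin', 'rb') as f:
    file_id = fs.put(f, filename='big_input.bin')

# Read it back by id (or look it up by name with fs.get_last_version).
stored = fs.get(file_id)
data = stored.read()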

Clustering Keys in Cassandra

江枫思渺然 submitted on 2019-11-27 11:40:53
Question: "On a given physical node, rows for a given partition key are stored in the order induced by the clustering keys, making the retrieval of rows in that clustering order particularly efficient." (http://cassandra.apache.org/doc/cql3/CQL.html#createTableStmt) What kind of ordering is induced by clustering keys? Answer 1: Suppose your clustering keys are k1 t1, k2 t2, ..., kn tn, where ki is the i-th key name and ti is the i-th key type. Then the order the data is stored in is lexicographic ordering, where each…
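As an illustration only (not Cassandra code), lexicographic ordering over the clustering-key tuple is the same ordering Python applies when sorting tuples: compare on the first key, break ties on the second, and so on. The keys and payloads below are invented:

# Rows keyed by a (k1, k2) clustering-key tuple; values are arbitrary payloads.
rows = [
    (('2019-11-27', 3), 'c'),
    (('2019-11-26', 9), 'a'),
    (('2019-11-27', 1), 'b'),
]

# Lexicographic order on the clustering-key tuple: first by k1, then by k2.
for key, payload in sorted(rows, key=lambda r: r[0]):
    print(key, payload)
# The '2019-11-26' row comes first, then the two '2019-11-27' rows ordered by k2.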

Haskell: Can I perform several folds over the same lazy list without keeping the list in memory?

青春壹個敷衍的年華 submitted on 2019-11-27 11:15:59
Question: My context is bioinformatics, next-generation sequencing in particular, but the problem is generic, so I will use a log file as an example. The file is very large (gigabytes large, compressed, so it will not fit in memory), but is easy to parse (each line is an entry), so we can easily write something like: parse :: Lazy.ByteString -> [LogEntry] Now, I have a lot of statistics that I would like to compute from the log file. It is easiest to write separate functions such as: totalEntries =…
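The question is Haskell-specific, but the underlying concern, computing several statistics in a single pass so the parsed list is never retained, can be sketched in Python with a generator and one loop that feeds all the accumulators. The log format, file name, and the two statistics here are made up for illustration:

def parse(lines):
    """Hypothetical parser: lazily yields one entry per line."""
    for line in lines:
        yield line.rstrip('\n')

total = 0      # fold #1: count entries
longest = 0    # fold #2: length of the longest entry

with open('huge.log') as f:          # hypothetical log file
    for entry in parse(f):           # one pass; each entry is dropped after use
        total += 1
        longest = max(longest, len(entry))

print(total, longest)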

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

别来无恙 submitted on 2019-11-27 10:27:56
What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism? I have tried to set both of them in SparkSQL, but the task number of the second stage is always 200. From the answer here, spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations, while spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. Note that spark.default.parallelism seems to only work for raw RDDs and is…
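A small sketch of setting both properties from PySpark (the values 64 and 32 are arbitrary); spark.sql.shuffle.partitions is a SQL conf and can also be changed on a running session, whereas spark.default.parallelism is read when the context is created:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('partitions-demo')
         .config('spark.default.parallelism', '64')      # partitions for raw RDD operations
         .config('spark.sql.shuffle.partitions', '64')   # partitions after DataFrame/SQL shuffles
         .getOrCreate())

# The SQL setting can also be adjusted per session at runtime:
spark.conf.set('spark.sql.shuffle.partitions', '32')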

How does the pyspark mapPartitions function work?

落爺英雄遲暮 submitted on 2019-11-27 10:23:11
Question: So I am trying to learn Spark using Python (PySpark). I want to know how the function mapPartitions works, that is, what input it takes and what output it gives. I couldn't find any proper example on the internet. Let's say I have an RDD object containing lists, such as below: [ [1, 2, 3], [3, 2, 4], [5, 2, 7] ] And I want to remove element 2 from all the lists; how would I achieve that using mapPartitions? Answer 1: mapPartitions should be thought of as a map operation over partitions and not…
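A small sketch of the poster's example (not taken from the original answer): mapPartitions hands the function an iterator over all elements of one partition, here the inner lists, and the function must return (or yield) an iterator of results:

from pyspark import SparkContext

sc = SparkContext('local[2]', 'mapPartitions-demo')
rdd = sc.parallelize([[1, 2, 3], [3, 2, 4], [5, 2, 7]])

def drop_twos(partition):
    # 'partition' is an iterator over the lists that landed in this partition.
    for lst in partition:
        yield [x for x in lst if x != 2]

print(rdd.mapPartitions(drop_twos).collect())
# [[1, 3], [3, 4], [5, 7]]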

Confusion in hashing used by LSH

不问归期 submitted on 2019-11-27 08:37:23
Question: Matrix M is the signature matrix, produced via MinHashing of the actual data; it has documents as columns and words as rows, so a column represents a document. Now it says that every stripe (b in number, r in length) has its columns hashed, so that a column falls into a bucket. If two columns fall into the same bucket for >= 1 stripes, then they are potentially similar. Does that mean I should create b hash tables and find b independent hash functions? Or is just one enough, and every…
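A minimal sketch of the banding step being asked about, with a toy signature matrix (illustrative only, not a specific library's API): each band (stripe) of r rows gets its own bucket table, and the same hash function can be reused across bands precisely because the tables are kept separate:

from collections import defaultdict

# Toy signature matrix: rows are MinHash values, columns are documents.
M = [
    [1, 1, 2, 4],
    [3, 3, 3, 1],
    [5, 2, 5, 9],
    [7, 7, 7, 7],
]
b, r = 2, 2                     # 2 bands (stripes) of 2 rows each
n_docs = len(M[0])

candidates = set()
for band in range(b):
    buckets = defaultdict(list)             # one bucket table per band
    rows = M[band * r:(band + 1) * r]
    for doc in range(n_docs):
        key = hash(tuple(row[doc] for row in rows))   # hash this band's column slice
        buckets[key].append(doc)
    for docs in buckets.values():           # same bucket in >= 1 band => candidate pair
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                candidates.add((docs[i], docs[j]))

print(candidates)   # {(0, 1), (0, 2)} for the toy matrix above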