large-data

Preallocating a large array in a MATLAB matfile with something other than zeroes

佐手、 Posted on 2019-12-23 09:46:41
Question: I need to write an array that is too large to fit into memory to a .mat binary file. This can be accomplished with the matfile command, which allows random access to a .mat file on disk. I am trying to preallocate the array in this file, and the approach recommended by a MathWorks blog is: matObj = matfile('myBigData.mat','Writable',true); matObj.X(10000,10000) = 0; This works, but leaves me with a large array of zeroes, which is risky, as some of the genuine values that I will be populating…
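The underlying concern, telling "not yet written" apart from a genuine zero, can be illustrated outside MATLAB as well. Below is a minimal Python/h5py sketch (an analogue only, not the matfile API; v7.3 .mat files are HDF5-based) that preallocates an on-disk array with NaN as the fill value; the file and dataset names are chosen purely for illustration:

    import numpy as np
    import h5py

    # Hypothetical file/dataset names, for illustration only.
    with h5py.File("myBigData.h5", "w") as f:
        # Preallocate a 10000 x 10000 float64 dataset on disk.
        # fillvalue=NaN means any cell that is never written stays NaN,
        # so it cannot be confused with a genuine 0 later on.
        X = f.create_dataset("X", shape=(10000, 10000), dtype="float64",
                             fillvalue=np.nan, chunks=True)
        # Write one block of real values; everything else remains NaN.
        X[0, :100] = np.arange(100, dtype="float64")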

numpy save/load corrupting an array

一笑奈何 Posted on 2019-12-23 08:56:16
Question: I am trying to save a large numpy array and reload it. Using numpy.save and numpy.load, the array values are corrupted/changed. The shape and data type of the array pre-saving and post-loading are the same, but the post-loading array has the vast majority of its values zeroed. The array is (22915, 22915), the values are float64, it takes 3.94 GB as a .npy file, and the data entries average about 0.1 (not tiny floats that might reasonably get converted to zeroes). I am using numpy 1.5.1. Any help…
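One way to narrow down whether the save/load round trip itself is at fault is to verify the reloaded array explicitly rather than trusting it. A minimal sketch, with the array shrunk for illustration (the question's array is (22915, 22915)):

    import numpy as np

    # Smaller stand-in for the (22915, 22915) float64 array in the question.
    a = np.random.random((2000, 2000))

    np.save("test.npy", a)      # writes test.npy
    b = np.load("test.npy")     # reads it back

    # Verify the round trip instead of assuming it worked.
    assert b.shape == a.shape and b.dtype == a.dtype
    print("values identical:", np.array_equal(a, b))
    print("fraction zeroed: ", np.mean(b == 0.0))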

How to design HTTP API to push massive data?

不打扰是莪最后的温柔 Posted on 2019-12-23 04:35:40
Question: I need to provide an HTTP API for clients to push massive data, in the shape of a set of records. My first idea was to provide a set of three calls, like: "BeginPushData" (no parameters, returns an id), "PushSomeData" (parameters: id, subset of data, no return value), "EndPushData" (parameter: id). The first call should be used to initialize some temporary data structure and give the user an identifier, so that subsequent calls can refer to it and data from multiple users doesn't get mixed up. The…
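A minimal client-side sketch of that three-call protocol, using Python's requests library; the base URL, endpoint paths, payload shape, and chunk size are assumptions for illustration, not part of the question:

    import requests

    BASE = "https://example.com/api"   # hypothetical endpoint, illustration only

    def push_in_chunks(records, chunk_size=1000):
        # BeginPushData: server allocates a temporary upload and returns its id.
        upload_id = requests.post(f"{BASE}/uploads").json()["id"]

        # PushSomeData: send the records in manageable chunks, tagged with the id
        # so data from different clients cannot get mixed up.
        for start in range(0, len(records), chunk_size):
            chunk = records[start:start + chunk_size]
            requests.post(f"{BASE}/uploads/{upload_id}/records",
                          json=chunk).raise_for_status()

        # EndPushData: tell the server the upload is complete so it can commit it.
        requests.post(f"{BASE}/uploads/{upload_id}/complete").raise_for_status()
        return upload_id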

Load large amount of data to DB - Android

给你一囗甜甜゛ Posted on 2019-12-23 01:14:28
Question: I'm building a new application for the Android platform and I need a suggestion. I have a large amount of data that I want to import into a SQLite DB, something like 2000 rows. My question is: what is the best method to load the data into the DB? Two ways I can think of: an XML file that holds all the data, loaded with a DocumentBuilder object, or a CSV file, loaded with a StringTokenizer. Found that post. I want it to be an external file, because when I release an update I will replace…

dcast efficiently large datasets with multiple variables

江枫思渺然 Posted on 2019-12-23 00:24:32
Question: I am trying to dcast a large dataset (millions of rows). I have one row for arrival time and origin, and another row for departure time and destination. There is an id to identify the unit in both cases. It looks similar to this:

    id  time              movement  origin  dest
    1   10/06/2011 15:54  ARR       15      15
    1   10/06/2011 16:14  DEP       15      29
    2   10/06/2011 17:59  ARR       73      73
    2   10/06/2011 18:10  DEP       73      75
    2   10/06/2011 21:10  ARR       75      75
    2   10/06/2011 21:20  DEP       75      73
    3   10/06/2011 17:14  ARR       17      17
    3   10/06/2011 18:01  DEP       17      48
    4   10…
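As an aside, the same reshape (spreading ARR/DEP rows into columns per id) can be sketched in pandas; this is only a Python analogue of dcast, not the data.table code from the question, and it uses the column names from the sample above:

    import pandas as pd

    # Toy version of the sample data above.
    df = pd.DataFrame({
        "id":       [1, 1, 2, 2],
        "time":     ["10/06/2011 15:54", "10/06/2011 16:14",
                     "10/06/2011 17:59", "10/06/2011 18:10"],
        "movement": ["ARR", "DEP", "ARR", "DEP"],
        "origin":   [15, 15, 73, 73],
        "dest":     [15, 29, 73, 75],
    })

    # Number the ARR/DEP legs within each id, then spread movement into
    # columns (the pandas counterpart of dcast(id + leg ~ movement, ...)).
    df["leg"] = df.groupby(["id", "movement"]).cumcount()
    wide = df.pivot_table(index=["id", "leg"], columns="movement",
                          values=["time", "origin", "dest"], aggfunc="first")
    print(wide)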

SQL Server - Merging large tables without locking the data

岁酱吖の Posted on 2019-12-22 03:45:16
Question: I have a very large set of data (~3 million records) which needs to be merged with updates and new records on a daily schedule. I have a stored procedure that breaks the record set into 1000-record chunks and uses the MERGE command with temp tables in an attempt to avoid locking the live table while the data is updating. The problem is that it doesn't exactly help. The table still "locks up", and our website that uses the data receives timeouts when attempting to access it. I…

numpy.memmap for an array of strings?

和自甴很熟 Posted on 2019-12-21 17:57:28
Question: Is it possible to use numpy.memmap to map a large disk-based array of strings into memory? I know it can be done for floats and suchlike, but this question is specifically about strings. I am interested in solutions for both fixed-length and variable-length strings. The solution is free to dictate any reasonable file format. Answer 1: If all the strings have the same length, as suggested by the term "array", this is easily possible: a = numpy.memmap("data", dtype="S10") would be an example for…
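A slightly fuller sketch of the fixed-length case from Answer 1, with the file name and 10-byte record width chosen only for illustration:

    import numpy as np

    # Write some fixed-width (10-byte) records to disk first.
    records = np.array([b"alpha", b"beta", b"gamma", b"delta"], dtype="S10")
    records.tofile("strings.dat")

    # Map the file back without loading it all into RAM; each element is one
    # 10-byte string, so the mapping is simply dtype="S10".
    a = np.memmap("strings.dat", dtype="S10", mode="r")
    print(len(a), a[2])        # -> 4 b'gamma'

    # Variable-length strings have no fixed record size, so a plain memmap
    # cannot index them directly; an offset index or padding to a fixed width
    # would be needed (not shown here).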

fread protection stack overflow error

雨燕双飞 Posted on 2019-12-21 10:33:48
Question: I'm using fread in data.table (1.8.8, R 3.0.1) in an attempt to read very large files. The file in question has 313 rows and ~6.6 million columns of numeric data, and the file is around 12 GB. This is a CentOS 6.4 machine with 512 GB of RAM. When I attempt to read in the file: g=fread('final.results',header=T,sep=' ') 'header' changed by user from 'auto' to TRUE Error: protect(): protection stack overflow I tried starting R with --max-ppsize 500000, which is the maximum, but got the same error. I also…

How much data can R handle? [closed]

此生再无相见时 Posted on 2019-12-20 19:43:13
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 3 years ago. By "handle" I mean manipulate multi-columnar rows of data. How does R stack up against tools like Excel, SPSS, SAS, and others? Is R a viable tool for looking at "BIG DATA" (hundreds of millions to billions of rows)? If not, which statistical programming tools are best suited for analysis of large data sets? Answer 1: …