large-data

Large fixed effects binomial regression in R

最后都变了 submitted on 2019-12-20 09:38:50
Question: I need to run a logistic regression on a relatively large data frame with 480,000 entries and 3 fixed effect variables. Fixed effect var A has 3233 levels, var B has 2326 levels, var C has 811 levels. So all in all I have 6370 fixed effects. The data is cross-sectional. I can't run this regression using the normal glm function because the regression matrix seems too large for my memory (I get the message "Error: cannot allocate vector of size 22.9 Gb"). I am looking for alternative ways …
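
The dense model matrix is the bottleneck here: 480,000 rows times roughly 6,370 dummy columns is what blows past memory. As a hedged illustration of the sparse design-matrix idea (sketched in Python rather than the asker's R; the file name, the continuous predictor x and the outcome y are hypothetical stand-ins):

```python
# Sketch: one-hot encode the high-cardinality fixed effects into a SPARSE matrix
# so the ~480,000 x 6,370 dummy matrix is never materialised densely.
# Column names and the data file are hypothetical.
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("panel.csv")                   # hypothetical columns: A, B, C, x, y
enc = OneHotEncoder(handle_unknown="ignore")    # returns a scipy sparse matrix
X_fe = enc.fit_transform(df[["A", "B", "C"]])   # sparse dummies for the fixed effects
X = sparse.hstack([X_fe, sparse.csr_matrix(df[["x"]].values)], format="csr")

# Large C makes the L2 penalty negligible, approximating an unpenalised fit.
model = LogisticRegression(solver="lbfgs", max_iter=1000, C=1e6)
model.fit(X, df["y"].values)
```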

Best of breed indexing data structures for Extremely Large time-series

江枫思渺然 submitted on 2019-12-20 08:45:06
Question: I'd like to ask fellow SO'ers for their opinions regarding best-of-breed data structures to be used for indexing time-series (aka column-wise data, aka flat linear). Two basic types of time-series exist based on the sampling/discretisation characteristic: regular discretisation (every sample is taken at a common frequency) and irregular discretisation (samples are taken at arbitrary time points). Queries that will be required: all values in the time range [t0,t1]; all values in the time range [t0 …
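
Whatever structure ends up on top, the listed range queries over [t0,t1] reduce, for a single in-memory series, to binary search on sorted timestamps. A minimal Python sketch of that baseline, assuming timestamps are kept sorted (the data below is made up):

```python
# Baseline: sorted timestamp array + binary search for range queries.
from bisect import bisect_left, bisect_right

class TimeSeriesIndex:
    def __init__(self, timestamps, values):
        # Assumes timestamps are already sorted in ascending order.
        self.t = list(timestamps)
        self.v = list(values)

    def range(self, t0, t1):
        """All values with t0 <= t <= t1, found in O(log n + k)."""
        lo = bisect_left(self.t, t0)
        hi = bisect_right(self.t, t1)
        return self.v[lo:hi]

ts = TimeSeriesIndex([1.0, 2.5, 2.7, 9.0], [10, 20, 30, 40])
print(ts.range(2.0, 3.0))   # [20, 30]
```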

D3: How to show large dataset

独自空忆成欢 submitted on 2019-12-20 08:35:07
Question: I have a large dataset comprising 10^5 data points, and I'm considering the following question related to large datasets: is there any efficient way to visualize a very large dataset? In my case I have a set of users and each user has 10^3 items; there are 10^5 items in total. I want to show all the items for each user at a time to enable quick comparison between users. Somebody suggested using a list, but I don't think a list is the only choice when dealing with a dataset this big. Note that I want to …
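
One common tactic with this many points is to aggregate on the server before handing anything to D3, so the browser only ever draws a few values per user (for example, one heatmap row per user). A hedged Python sketch of that pre-aggregation step; the column names and bucket count are hypothetical:

```python
# Sketch: bin each user's ~10^3 items into a fixed number of summary buckets
# before sending the data to the page for D3 to draw. Column names are made up.
import pandas as pd

df = pd.read_csv("items.csv")            # hypothetical columns: user_id, item_index, value
BINS = 50                                # ~1000 items per user -> 50 buckets each

df["bucket"] = pd.cut(df["item_index"], bins=BINS, labels=False)
summary = (df.groupby(["user_id", "bucket"])["value"]
             .mean()
             .unstack(fill_value=0))     # one row per user, one column per bucket

summary.to_json("heatmap.json", orient="split")   # compact payload for the D3 page
```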

MATLAB randomly permuting columns differently

允我心安 submitted on 2019-12-20 02:32:57
Question: I have a very large matrix A with N rows and M columns. I basically want to do the following operation, but fast and efficiently: for k = 1:N, A(k,:) = A(k, randperm(M)); end. (Both M and N are very large, and this is only an inner loop in a more massive outer loop.) More context: I am trying to implement a permutation test for a correlation matrix (http://en.wikipedia.org/wiki/Resampling_%28statistics%29). My data is very large and I am very impatient. If anyone knows of a fast way to implement …
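
The question is about MATLAB, but the usual vectorised trick carries over: argsort a matrix of random numbers to get an independent permutation per row, then gather, instead of calling randperm in a loop. A hedged NumPy sketch with made-up sizes:

```python
# Vectorised per-row permutation: argsort of random numbers gives one
# independent permutation per row; take_along_axis applies them all at once.
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 500                                # stand-ins for the "very large" sizes
A = rng.standard_normal((N, M))

perm = np.argsort(rng.random((N, M)), axis=1)   # one random permutation per row
A_shuffled = np.take_along_axis(A, perm, axis=1)

# Newer NumPy (>= 1.20) also offers rng.permuted(A, axis=1) for the same effect.
```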

Insert large amount of data to BigQuery via bigquery-python library

谁都会走 submitted on 2019-12-19 18:36:13
Question: I have large CSV and Excel files. I read them and dynamically create the needed CREATE TABLE script depending on the fields and types they contain, then insert the data into the created table. I have read this and understood that I should send them with jobs.insert() instead of tabledata.insertAll() for large amounts of data. This is how I call it (it works for smaller files but not large ones): result = client.push_rows(datasetname, table_name, insertObject) # insertObject is a list of dictionaries
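
The excerpt stops at the push_rows call, so as a hedged alternative sketch: with the official google-cloud-bigquery client (a different library from the bigquery-python wrapper used above), a large CSV is typically sent as a load job rather than streamed row by row. The project, dataset and file names below are hypothetical:

```python
# Sketch with the official google-cloud-bigquery client (not bigquery-python):
# send a large CSV as a load job instead of streaming individual rows.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"   # hypothetical target table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,          # infer the schema instead of building it by hand
)

with open("large_file.csv", "rb") as f:
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)

load_job.result()             # block until the load job finishes
```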

Big Satellite Image Processing

我只是一个虾纸丫 submitted on 2019-12-19 10:11:35
Question: I'm trying to run Mort Canty's (http://mcanty.homepage.t-online.de/) Python iMAD implementation on bitemporal RapidEye multispectral images, which basically calculates the canonical correlation for the two images and then subtracts them. The problem I'm having is that the images are 5000 x 5000 x 5 (bands) pixels. If I try to run this on the whole image I get a memory error. Would the use of something like pyTables help me with this? What Mort Canty's code tries to do is load the …
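
Independently of pyTables, the usual way around this kind of memory error is to process the 5000 x 5000 x 5 array in tiles so that only one block of rows is resident at a time. A hedged sketch using NumPy memory-mapped files; the file names, dtype and the per-tile computation are placeholder assumptions, not Mort Canty's actual code:

```python
# Sketch: iterate over the bitemporal images in row blocks via memory-mapped
# arrays, keeping only one tile of each image in RAM at a time.
import numpy as np

shape = (5000, 5000, 5)   # rows x cols x bands
img1 = np.memmap("rapideye_t1.dat", dtype=np.uint16, mode="r", shape=shape)
img2 = np.memmap("rapideye_t2.dat", dtype=np.uint16, mode="r", shape=shape)

block = 500                                    # rows per tile
for r in range(0, shape[0], block):
    a = img1[r:r + block].astype(np.float64)   # loads just this tile into memory
    b = img2[r:r + block].astype(np.float64)
    diff = a - b                               # placeholder for the per-tile work
    # ... accumulate whatever statistics the iMAD iteration needs from `diff` ...
```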

How to find all the unique substrings of a very long string?

安稳与你 submitted on 2019-12-18 09:17:11
Question: I have a very long string and I want to find all the unique substrings of this string. I wrote code that uses a set (Python) to store all the substrings to ensure uniqueness. I get correct results for many medium and large strings; however, for very large strings I get a MemoryError. I googled a bit and found out that the set data structure in Python has a large RAM footprint, and maybe that's why I am getting a MemoryError. Here is my code: a = set() for i in …
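
A set of all substrings is unavoidably close to quadratic in the string length, whatever its per-element overhead. If the real goal is the number of distinct substrings rather than materialising them, a suffix automaton gets that count in linear space; a hedged sketch of that swapped-in technique (not the asker's code):

```python
# Count distinct substrings with a suffix automaton: O(n) states, no substring
# is ever stored explicitly.
def count_distinct_substrings(s):
    length, link, trans = [0], [-1], [{}]   # one entry per automaton state
    last = 0
    for ch in s:
        cur = len(length)
        length.append(length[last] + 1); link.append(-1); trans.append({})
        p = last
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:                            # split state q with a clone
                clone = len(length)
                length.append(length[p] + 1)
                link.append(link[q])
                trans.append(dict(trans[q]))
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    # Each state v contributes length[v] - length[link[v]] distinct substrings.
    return sum(length[v] - length[link[v]] for v in range(1, len(length)))

print(count_distinct_substrings("abab"))     # 7: a, b, ab, ba, aba, bab, abab
```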

Writing Panda Dataframes to csv file in chunks

∥☆過路亽.° submitted on 2019-12-18 02:51:30
Question: I have a set of large data files (1M rows x 20 cols). However, only 5 or so columns of that data are of interest to me. I figure I can make things easier for myself by creating copies of these files with only the columns of interest, so I have smaller files to work with for post-processing. My plan was to read the file into a dataframe and then write it to a CSV file. I've been looking into reading large data files in chunks into a dataframe, but I haven't been able to find anything on how to write …
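
A minimal sketch of the chunked read-then-append pattern with pandas, assuming hypothetical file and column names: only the wanted columns are read, and the output header is written once on the first chunk.

```python
# Sketch: stream the input in chunks, keep only the columns of interest, and
# append each chunk to the output CSV. File and column names are hypothetical.
import pandas as pd

wanted = ["a", "b", "c", "d", "e"]            # the ~5 columns of interest
reader = pd.read_csv("big_input.csv", usecols=wanted, chunksize=100_000)

for i, chunk in enumerate(reader):
    chunk.to_csv("small_output.csv",
                 mode="w" if i == 0 else "a",  # overwrite once, then append
                 header=(i == 0),              # header only on the first chunk
                 index=False)
```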
