large-data

Finding the transpose of a very, very large matrix

拟墨画扇 submitted on 2019-12-05 22:02:35
I have this huge two-dimensional array of data. It is stored in row order:

    A(1,1) A(1,2) A(1,3) ..... A(n-2,n) A(n-1,n) A(n,n)

I want to rearrange it into column order:

    A(1,1) A(2,1) A(3,1) ..... A(n,n-2) A(n,n-1) A(n,n)

The data set is rather large - more than will fit in the RAM of a computer. (n is about 10,000, but each data item takes about 1K of space.) Does anyone know any slick or efficient algorithms to do this?

Create n empty files (reserve enough space for n elements, if you can). Iterate through your original matrix. Append element (i,j) to file j. Once you are done with that, append the
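The answer excerpt above cuts off mid-sentence. Purely as an illustration of the bucket-per-column idea it describes (not the original answerer's code), here is a minimal Python sketch, assuming fixed-size binary records of about 1 KB each:

```python
import os

ITEM_SIZE = 1024  # assumed fixed record size (~1 KB per element)

def transpose_out_of_core(src_path, dst_path, n, workdir="buckets"):
    """Transpose an n x n row-major file of fixed-size records on disk.

    Pass 1: stream the source row by row, appending element (i, j) to a
    per-column bucket file j.  Pass 2: concatenate the buckets; bucket j
    then holds column j contiguously, i.e. row j of the transposed output.
    """
    os.makedirs(workdir, exist_ok=True)
    # NB: for n ~ 10,000 this exceeds typical OS open-file limits; a real
    # implementation would process a band of columns per pass instead.
    buckets = [open(os.path.join(workdir, f"col_{j}.bin"), "wb") for j in range(n)]
    try:
        with open(src_path, "rb") as src:
            for i in range(n):
                row = src.read(n * ITEM_SIZE)            # one row fits in RAM
                for j in range(n):
                    buckets[j].write(row[j * ITEM_SIZE:(j + 1) * ITEM_SIZE])
    finally:
        for f in buckets:
            f.close()
    with open(dst_path, "wb") as dst:                    # pass 2: concatenate buckets
        for j in range(n):
            with open(os.path.join(workdir, f"col_{j}.bin"), "rb") as f:
                while chunk := f.read(1 << 20):
                    dst.write(chunk)
```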

Find common third on large data set

北城以北 submitted on 2019-12-05 17:12:43
I have a large dataframe like

    df <- data.frame(group = c("a","a","b","b","b","c"),
                     person = c("Tom","Jerry","Tom","Anna","Sam","Nic"),
                     stringsAsFactors = FALSE)
    df
      group person
    1     a    Tom
    2     a  Jerry
    3     b    Tom
    4     b   Anna
    5     b    Sam
    6     c    Nic

and would like to get as a result

    df.output
      pers1 pers2 person_in_common
    1  Anna Jerry              Tom
    2 Jerry   Sam              Tom
    3   Sam   Tom             Anna
    4  Anna   Tom              Sam
    6  Anna   Sam              Tom

The result dataframe is basically a table of all pairs of persons who have another person in common. I found a way to do it in SQL but it takes an awfully long time, so I wonder if there is an efficient way to do it in
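The question is truncated before the SQL attempt. As a hedged illustration of the underlying logic only (build the "shares a group" edges, then pair up each person's neighbours), here is a sketch in Python/pandas rather than R or SQL:

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame({"group":  ["a", "a", "b", "b", "b", "c"],
                   "person": ["Tom", "Jerry", "Tom", "Anna", "Sam", "Nic"]})

# 1. edges: unordered pairs of people that share at least one group
edges = set()
for _, members in df.groupby("group")["person"]:
    for p, q in combinations(sorted(members.unique()), 2):
        edges.add((p, q))

# 2. adjacency: who shares a group with whom
neighbours = {}
for p, q in edges:
    neighbours.setdefault(p, set()).add(q)
    neighbours.setdefault(q, set()).add(p)

# 3. for every "person in common", pair up their neighbours
rows = set()
for common, friends in neighbours.items():
    for p1, p2 in combinations(sorted(friends), 2):
        rows.add((p1, p2, common))

out = pd.DataFrame(sorted(rows), columns=["pers1", "pers2", "person_in_common"])
print(out)
```

Run on the example data, this reproduces the five rows of df.output shown above.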

Replacing punctuation in a data frame based on punctuation list [duplicate]

有些话、适合烂在心里 submitted on 2019-12-05 11:45:58
This question already has answers here: Fast punctuation removal with pandas (3 answers). Closed last year.

Using Canopy and Pandas, I have data frame a which is defined by:

    a = pd.read_csv('text.txt')
    df = pd.DataFrame(a)
    df.columns = ["test"]

test.txt is a single-column file that contains a list of strings containing text, numbers and punctuation. Assuming df looks like:

    test
    %hgh&12
    abc123!!!
    porkyfries

I want my results to be:

    test
    hgh12
    abc123
    porkyfries

Effort so far:

    from string import punctuation  # import punctuation list from Python itself
    a = pd.read_csv('text.txt')
    df = pd.DataFrame(a)
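The "effort so far" excerpt stops before any replacement is applied. A common vectorised approach (a sketch using the pandas string accessor, not necessarily the linked accepted answer) looks like this:

```python
import pandas as pd
from string import punctuation

# stand-in for pd.read_csv('text.txt')
df = pd.DataFrame({"test": ["%hgh&12", "abc123!!!", "porkyfries"]})

# Option 1: regex via the vectorised .str accessor
df["clean_regex"] = df["test"].str.replace(r"[^\w\s]", "", regex=True)

# Option 2: str.translate with a deletion table built from string.punctuation
table = str.maketrans("", "", punctuation)
df["clean_translate"] = df["test"].str.translate(table)

print(df)
#          test clean_regex clean_translate
# 0     %hgh&12       hgh12           hgh12
# 1   abc123!!!      abc123          abc123
# 2  porkyfries  porkyfries      porkyfries
```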

Symfony2 / Doctrine make $statement->execute() not “buffer” all values

流过昼夜 submitted on 2019-12-05 02:32:56
Question: I've got a basic codeset like this (inside a controller):

    $sql = 'select * from someLargeTable limit 1000';
    $em = $this->getDoctrine()->getManager();
    $conn = $em->getConnection();
    $statement = $conn->prepare($sql);
    $statement->execute();

My difficulty is that when the resultset is only a few records, the memory usage is not that bad. I echoed some debugging information before and after running the $statement->execute(); part of the code, and found for my implementation that I have the

When writing a large array directly to disk in MATLAB, is there any need to preallocate?

﹥>﹥吖頭↗ submitted on 2019-12-05 00:08:45
I need to write an array that is too large to fit into memory to a .mat binary file. This can be accomplished with the matfile function, which allows random access to a .mat file on disk. Normally, the accepted advice is to preallocate arrays, because expanding them on every iteration of a loop is slow. However, when I was asking how to do this, it occurred to me that this may not be good advice when writing to disk rather than RAM. Will the same performance hit from growing the array apply, and if so, will it be significant when compared to the time it takes to write to disk anyway? (Assume
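The excerpt ends before the answer. As a language-neutral illustration of the "preallocate the on-disk array once, then fill it in chunks" pattern the question asks about — shown here with Python's numpy.memmap rather than MATLAB's matfile, purely as an analogy — a minimal sketch is:

```python
import numpy as np

rows, cols, chunk = 100_000, 1_000, 5_000

# Preallocate the full array on disk once; on most filesystems the file is
# created sparse, so this is typically cheap despite the large logical size.
out = np.lib.format.open_memmap("big.npy", mode="w+",
                                dtype="float64", shape=(rows, cols))

# Fill it chunk by chunk so only `chunk` rows are ever resident in RAM.
for start in range(0, rows, chunk):
    stop = min(start + chunk, rows)
    out[start:stop, :] = np.random.rand(stop - start, cols)  # stand-in for real data
    out.flush()
```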

Computing the null space of a bigmatrix in R

你说的曾经没有我的故事 submitted on 2019-12-04 23:26:25
I cannot find any function or package to calculate the null space (or QR decomposition) of a bigmatrix (from library(bigmemory)) in R. For example:

    library(bigmemory)
    a <- big.matrix(1000000, 1000, type='double', init=0)

I tried the following but got the errors shown. How can I find the null space of a bigmemory object?

    a.qr <- Matrix::qr(a)
    # Error in as.vector(data) :
    #   no method for coercing this S4 class to a vector
    q.null <- MASS::Null(a)
    # Error in as.vector(data) :
    #   no method for coercing this S4 class to a vector

If you want to compute the full SVD of the matrix, you can use package
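The answer excerpt is cut off at the package name. Independently of that package, one memory-friendly idea for a tall n x p matrix (n huge, p = 1,000) is to accumulate the small p x p cross-product A'A in row chunks and read the null space off its near-zero eigenvalues, since null(A) = null(A'A). A hedged Python sketch of that idea (not the R answer; note that forming A'A squares the condition number):

```python
import numpy as np

def null_space_tall(get_chunks, p, tol=1e-10):
    """Right null space of a tall matrix streamed in row chunks.

    get_chunks yields (m_i, p) blocks of rows; only the p x p Gram matrix
    A'A is ever held in memory, so n can be arbitrarily large.
    """
    gram = np.zeros((p, p))
    for block in get_chunks():
        gram += block.T @ block
    eigvals, eigvecs = np.linalg.eigh(gram)           # A'A is symmetric PSD
    return eigvecs[:, eigvals < tol * eigvals.max()]  # columns spanning null(A)

# toy usage: a rank-deficient matrix streamed in four chunks
rng = np.random.default_rng(0)
base = rng.standard_normal((2000, 8))
A = np.hstack([base, base[:, :2]])                    # 10 columns, rank 8
chunks = lambda: (A[i:i + 500] for i in range(0, 2000, 500))
N = null_space_tall(chunks, p=10)
print(N.shape)               # (10, 2) -- basis of the 2-dimensional null space
print(np.abs(A @ N).max())   # ~0
```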

Python fork(): passing data from child to parent

走远了吗. submitted on 2019-12-04 19:23:22
Question: I have a main Python process, and a bunch of workers created by the main process using os.fork(). I need to pass large and fairly involved data structures from the workers back to the main process. What existing libraries would you recommend for that? The data structures are a mix of lists, dictionaries, numpy arrays, custom classes (which I can tweak) and multi-layer combinations of the above. Disk I/O should be avoided. If I could also avoid creating copies of the data -- for example by
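The question is cut off before any recommendation. As a minimal sketch of one standard option — a multiprocessing pipe, which pickles the structure on the way back — assuming the code can switch from raw os.fork() to the multiprocessing module:

```python
import multiprocessing as mp
import numpy as np

def worker(conn):
    # Build the (potentially large) result in the child ...
    result = {"stats": [1, 2, 3],
              "matrix": np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)}
    # ... and send it back over the pipe; multiprocessing pickles it for us.
    conn.send(result)
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe(duplex=False)   # parent receives, child sends
    p = mp.Process(target=worker, args=(child_conn,))
    p.start()
    child_conn.close()             # parent keeps only the receiving end
    result = parent_conn.recv()    # blocks until the child has sent
    p.join()
    print(result["matrix"].shape)  # (1000, 1000)
```

When copies of large numpy arrays must be avoided entirely, multiprocessing.shared_memory (Python 3.8+) or memory-mapped arrays are the usual routes instead of pickling over a pipe.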

How can I cluster thousands of documents using the R tm package?

醉酒当歌 submitted on 2019-12-04 19:14:12
I have about 25,000 documents which need to be clustered, and I was hoping to be able to use the R tm package. Unfortunately I am running out of memory at about 20,000 documents. The following function shows what I am trying to do using dummy data. I run out of memory when I call the function with n = 20 on a Windows machine with 16GB of RAM. Are there any optimizations I could make? Thank you for any help.

    make_clusters <- function(n) {
      require(tm)
      require(slam)
      docs <- unlist(lapply(letters[1:n], function(x) rep(x, 1000)))
      tdf <- TermDocumentMatrix(Corpus(VectorSource(docs)), control = list
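The dummy-data function above is truncated mid-call, and the R-specific fix is not reproduced here. As an illustration of the general principle — keep the term matrix sparse and cluster incrementally instead of densifying it — here is a hedged sketch in Python with scikit-learn, a different toolchain from the tm package the question asks about:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

# dummy corpus of 25,000 short documents
docs = ["dummy document %d about topic %s" % (i, "abcde"[i % 5]) for i in range(25_000)]

# TfidfVectorizer returns a scipy sparse matrix, so the 25k x vocab
# term matrix never has to be densified.
X = TfidfVectorizer(max_features=20_000).fit_transform(docs)

# MiniBatchKMeans clusters in small batches, keeping memory roughly constant.
km = MiniBatchKMeans(n_clusters=5, batch_size=1_000, n_init=3, random_state=0).fit(X)
print(km.labels_[:10])
```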

Crash on Core Data Migration

末鹿安然 submitted on 2019-12-04 15:50:54
Some of our users experience crashes during Core Data migration. There are already several questions about Core Data migration and crashes, mainly about memory usage and UI responsiveness:

    Migrating large Core Data database crash
    Out-Of-Memory while doing Core Data migration
    Core Data causing app to crash while migrating
    Core Data lightweight migration crash

For high memory peaks, Apple suggests a multiple-passes solution, and here is another solution for large datasets. When I try to reproduce the problem, such as migrating large datasets using lightweight migration, Xcode will sometimes terminate my app due to memory usage.

Using R with tidyquant and massive data

你。 submitted on 2019-12-04 15:40:14
While working with R I encountered a strange problem: I am processing data in the following manner: reading data from a database into a dataframe, filling missing values, grouping and nesting the data by a combined primary key, creating a time series and forecasting it for every group, ungrouping and cleaning the data, and writing it back into the DB. Something like this: https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html For small data sets this works like a charm, but with larger ones (over about 100,000 entries) I get the "R Session Aborted" screen from R
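The report above ends mid-sentence. One common way to avoid the memory blow-up that the nest-everything-then-forecast pattern can cause is to stream one group at a time and write each result back immediately. A hedged Python sketch of that shape — the database name (sales.db), table and column names (sales, item_id, qty, month) and the naive_forecast stand-in are hypothetical, not taken from the question:

```python
import sqlite3
import pandas as pd

def naive_forecast(series, horizon=12):
    # Stand-in forecaster (mean of the last 12 observations); the real
    # pipeline would call a proper time-series model here.
    return [series.tail(12).mean()] * horizon

conn = sqlite3.connect("sales.db")                     # hypothetical database
keys = pd.read_sql("SELECT DISTINCT item_id FROM sales", conn)["item_id"].tolist()

for item in keys:                                      # one group at a time
    ts = pd.read_sql(
        "SELECT month, qty FROM sales WHERE item_id = ? ORDER BY month",
        conn, params=(item,))
    fc = pd.DataFrame({"item_id": item,
                       "step": list(range(1, 13)),
                       "forecast": naive_forecast(ts["qty"])})
    # write the forecast back immediately so memory stays bounded per group
    fc.to_sql("forecasts", conn, if_exists="append", index=False)

conn.close()
```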