large-data

Compare two differently sized matrices to make one large matrix - Speed Improvements?

*爱你&永不变心* submitted on 2019-12-11 00:42:35
Question: I have two matrices that I need to use to create a larger matrix. Each matrix is simply a tab-delimited text file that is read in. Both matrices have the same 48 column identifiers but different numbers of rows: the first matrix is 108887x48 and the second is 55482x48. The entries at each position in each matrix are binary, either 0 or 1. The final output should have the first matrix's row IDs as the rows and the second matrix's row IDs as the columns, so the final matrix is
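The excerpt is cut off before it defines how each output cell is computed, so the sketch below assumes cell (i, j) counts the columns in which row i of the first matrix and row j of the second are both 1; the file names are placeholders. Under that assumption, one matrix product per block replaces a nested loop over rows:

# Hedged R sketch, assuming the first column of each file holds the row identifiers.
A <- as.matrix(read.table("matrix1.txt", header = TRUE, row.names = 1, sep = "\t"))
B <- as.matrix(read.table("matrix2.txt", header = TRUE, row.names = 1, sep = "\t"))

# The full 108887 x 55482 result (~48 GB of doubles) will not fit in RAM,
# so compute it in row-blocks of B and write each block to disk.
block_size <- 5000
for (start in seq(1, nrow(B), by = block_size)) {
  idx   <- start:min(start + block_size - 1, nrow(B))
  block <- A %*% t(B[idx, , drop = FALSE])   # counts of shared 1s per (row of A, row of B) pair
  write.table(block, sprintf("result_cols_%06d.txt", start), sep = "\t", quote = FALSE)
}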

Import text file using ff package

≯℡__Kan透↙ submitted on 2019-12-10 13:19:16
Question: I have a text file of 4.5 million rows and 90 columns to import into R. Using read.table I get the "cannot allocate vector of size..." error message, so I am trying to import using the ff package before subsetting the data to extract the observations that interest me (see my previous question for more details: Add selection criteria to read.table). So, I use the following code to import: test<-read.csv2.ffdf("FD_INDCVIZC_2010.txt", header=T) but this returns the following error message: Error in
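One likely cause, offered as a guess rather than the accepted fix: in read.csv2.ffdf() the first positional argument is x (an existing ffdf to append to), not the file name, so the path has to be passed explicitly as file =. A minimal sketch:

library(ff)

test <- read.csv2.ffdf(file = "FD_INDCVIZC_2010.txt",
                       header = TRUE,
                       first.rows = 10000,   # small first chunk to infer column classes
                       next.rows  = 50000)   # then stream the rest in larger chunks
dim(test)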

How do I design a table which will store very large data?

让人想犯罪 __ submitted on 2019-12-10 10:43:52
Question: I need to design a table in Oracle that will store 2-5 TB of data per day. It can grow to 200 TB, and records will be purged once it crosses 200 TB. Is it feasible to keep it in an OLTP database, or do I need to move it to a data warehouse DB? Please advise on considerations I should keep in mind when designing the schema of this table and the database. Please also advise for the case of SQL Server, as I can use either database. Answer 1: That size puts you in VLDB territory (very large databases). Things

Comparison of extra-large subsets of strings

别来无恙 submitted on 2019-12-10 10:06:16
Question: Hi guys :) I am really confused by one task :-/ There is a daily file of 2,000,000 to 4,000,000 lines, each containing a unique 15-digit number, one per line, like this:

850025000010145
401115000010152
400025000010166
770025555010152
512498004158752

From the beginning of the current year you accumulate some number of such files. So I have to compare every line of today's file with all previous files from the beginning of the year and return only those numbers which have never appeared before in all checked
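The question does not name a language; as an R sketch of the comparison logic only (file names are placeholders), one can collect everything seen in earlier files into a single vector and keep today's lines that are not in it. For a full year of files that vector may no longer fit in memory, in which case a database or on-disk index would replace it:

previous_files <- list.files("archive", full.names = TRUE)   # all files since January 1st
seen  <- unique(unlist(lapply(previous_files, readLines)))   # every number seen so far
today <- readLines("today.txt")

new_numbers <- today[!(today %in% seen)]   # numbers never met before
writeLines(new_numbers, "new_numbers.txt")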

Find common third on large data set

扶醉桌前 submitted on 2019-12-10 08:54:06
Question: I have a large dataframe like

df <- data.frame(group = c("a","a","b","b","b","c"),
                 person = c("Tom","Jerry","Tom","Anna","Sam","Nic"),
                 stringsAsFactors = FALSE)

df
  group person
1     a    Tom
2     a  Jerry
3     b    Tom
4     b   Anna
5     b    Sam
6     c    Nic

and would like to get as a result

df.output
  pers1 pers2 person_in_common
1  Anna Jerry              Tom
2 Jerry   Sam              Tom
3   Sam   Tom             Anna
4  Anna   Tom              Sam
6  Anna   Sam              Tom

The result dataframe is basically a table of all pairs of persons who have another person in common. I found a way
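A base-R sketch of the logic (not the poster's approach, and probably too slow for a really large data frame): build all co-membership pairs with a self-join on group, then join those pairs on the shared person:

# every (person, co-member) pair that shares at least one group
edges <- merge(df, df, by = "group")
edges <- unique(edges[edges$person.x != edges$person.y, c("person.x", "person.y")])
names(edges) <- c("person", "common")

# two people who both share a group with the same third person
res <- merge(edges, edges, by = "common")
res <- unique(res[res$person.x < res$person.y, c("person.x", "person.y", "common")])
names(res) <- c("pers1", "pers2", "person_in_common")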

How to improve performance of populating a massive tree view?

只愿长相守 submitted on 2019-12-10 03:45:23
Question: First of all, I am answering my own question Q&A style, so I don't necessarily need anyone to answer this; it's something I've learned and many can make use of it. I have a tree view which consists of many different nodes. Each node has an object behind it in its Data property, and the objects refer to different levels of a hierarchy from one master list of objects, which is quite large (many thousands of items). One node represents a specific property on this main listed object, where the

When writing a large array directly to disk in MATLAB, is there any need to preallocate?

穿精又带淫゛_ submitted on 2019-12-10 01:16:25
Question: I need to write an array that is too large to fit into memory to a .mat binary file. This can be accomplished with the matfile function, which allows random access to a .mat file on disk. Normally, the accepted advice is to preallocate arrays, because expanding them on every iteration of a loop is slow. However, when I was asking how to do this, it occurred to me that this may not be good advice when writing to disk rather than to RAM. Will the same performance hit from growing the array apply,

How can I cluster thousands of documents using the R tm package?

…衆ロ難τιáo~ submitted on 2019-12-09 23:14:27
Question: I have about 25,000 documents which need to be clustered, and I was hoping to be able to use the R tm package. Unfortunately I am running out of memory at about 20,000 documents. The following function shows what I am trying to do using dummy data. I run out of memory when I call the function with n = 20 on a Windows machine with 16 GB of RAM. Are there any optimizations I could make? Thank you for any help. make_clusters <- function(n) { require(tm) require(slam) docs <- unlist(lapply(letters
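As a hedged sketch (not the poster's function): the usual way to keep tm's memory use down is to leave the document-term matrix sparse and drop rare terms before any dense conversion. my_documents, the sparsity threshold, and the number of clusters below are illustrative placeholders:

library(tm)

corpus <- VCorpus(VectorSource(my_documents))          # my_documents: a character vector
dtm    <- DocumentTermMatrix(corpus,
                             control = list(weighting = weightTfIdf))
dtm    <- removeSparseTerms(dtm, 0.98)                 # keep terms appearing in >= 2% of documents
m      <- as.matrix(dtm)                               # small enough to densify after pruning
clusters <- kmeans(m, centers = 20)$cluster            # one cluster id per document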

Using R with tidyquant and massive data

与世无争的帅哥 submitted on 2019-12-09 21:41:40
Question: While working with R I encountered a strange problem. I am processing data in the following manner: reading data from a database into a dataframe, filling missing values, grouping and nesting the data by a combined primary key, creating a time series and forecasting it for every group, then ungrouping and cleaning the data and writing it back into the DB. Something like this: https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html For small data sets this works like a
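For reference, a condensed version of the grouped-forecast pattern from the linked sweep vignette; the column names (key, date), the start/frequency values, and the ets model are placeholders for whatever the real data uses:

library(dplyr); library(tidyr); library(purrr)
library(timetk); library(forecast); library(sweep)

result <- df %>%
  group_by(key) %>%
  nest() %>%
  mutate(ts   = map(data, tk_ts, select = -date, start = 2015, frequency = 12),
         fit  = map(ts, ets),                  # one model per group
         fcst = map(fit, forecast, h = 12),    # 12-step-ahead forecast per group
         tidy = map(fcst, sw_sweep)) %>%       # back to a tidy tibble
  unnest(tidy)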

PHP and the million array baby

旧巷老猫 submitted on 2019-12-09 15:05:02
Question: Imagine you have the following array of integers: array(1, 2, 1, 0, 0, 1, 2, 4, 3, 2, [...] ); The array goes on up to one million entries; only instead of being hardcoded they've been pre-generated and stored in a JSON-formatted file (approximately 2 MB in size). The order of these integers matters; I can't randomly generate it every time because it should be consistent and always have the same values at the same indexes. If this file is read back in PHP afterwards (e.g. using file_get