bigdata

How to speed up GLM estimation?

Submitted by 倖福魔咒の on 2019-11-27 03:29:07
Question: I am using RStudio 0.97.320 (R 2.15.3) on Amazon EC2. My data frame has 200k rows and 12 columns, and I am trying to fit a logistic regression with approximately 1,500 parameters. R is using 7% CPU, has 60+ GB of memory available, and is still taking a very long time. Here is the code:

glm.1.2 <- glm(formula = Y ~ factor(X1) * log(X2) * (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) + ((X6 + I(X6^2)) * factor(X7))), family = binomial(logit), data = df[1:150000,])

Any suggestions to speed this up by a significant …
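R-side packages such as speedglm or biglm are often suggested for exactly this situation. As an alternative route, here is a hedged Python sketch of the same idea of keeping the design matrix sparse and using an iterative solver; the file name and column names are assumptions, and only a simplified subset of the formula's terms is shown.

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("data.csv")          # assumed input; same columns as the R data frame
train = df.iloc[:150000]

# Encode the categorical factors as sparse dummy columns so the ~1,500
# parameters stay cheap to store and to fit.
enc = OneHotEncoder(handle_unknown="ignore")
X_cat = enc.fit_transform(train[["X1", "X7"]])
X_num = sparse.csr_matrix(np.column_stack([np.log(train["X2"]), train["X3"]]))
X = sparse.hstack([X_cat, X_num]).tocsr()

model = LogisticRegression(solver="lbfgs", max_iter=500)
model.fit(X, train["Y"])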

clustering very large dataset in R

Submitted by 泪湿孤枕 on 2019-11-27 02:21:52
Question: I have a dataset of 70,000 numeric values representing distances ranging from 0 to 50, and I want to cluster these numbers. However, with the classical clustering approach I would have to build a 70,000 x 70,000 distance matrix holding the distance between every pair of values, which will not fit in memory. Is there any smart way to solve this problem without resorting to stratified sampling? I also tried bigmemory and big …
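Since the values are one-dimensional, no pairwise distance matrix is needed at all; a centroid-based method works directly on the raw numbers. A minimal sketch in Python, where the number of clusters is an arbitrary assumption:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

values = np.random.uniform(0, 50, size=70_000)      # stand-in for the real distances
X = values.reshape(-1, 1)                           # one feature, 70,000 samples

km = MiniBatchKMeans(n_clusters=5, random_state=0)  # mini-batches keep memory flat
labels = km.fit_predict(X)
print(np.bincount(labels))                          # cluster sizes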

Is Spark's KMeans unable to handle bigdata?

Submitted by 一笑奈何 on 2019-11-27 02:14:29
KMeans has several training parameters, with the initialization mode defaulting to kmeans||. The problem is that it marches quickly (less than 10 minutes) through the first 13 stages, but then hangs completely without yielding an error! Minimal example which reproduces the issue (it will succeed if I use 1,000 points or random initialization):

from pyspark.context import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs

if __name__ == "__main__":
    sc = SparkContext(appName='kmeansMinimalExample')
    # same with 10000 points
    data = RandomRDDs …
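As the question itself notes, random initialization avoids the stage where the job stalls. A hedged PySpark sketch of that workaround, with the data dimensions and k as placeholders:

from pyspark.context import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs

if __name__ == "__main__":
    sc = SparkContext(appName="kmeansMinimalExample")
    data = RandomRDDs.uniformVectorRDD(sc, numRows=10_000_000, numCols=64)
    # "random" skips the k-means|| initialization stage that appears to hang.
    model = KMeans.train(data, k=100, maxIterations=10,
                         initializationMode="random")
    sc.stop()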

iPad - Parsing an extremely huge JSON file (between 50 and 100 MB)

Submitted by 一世执手 on 2019-11-27 01:10:25
Question: I'm trying to parse an extremely big JSON file on an iPad. The file size will vary between 50 and 100 MB (there is an initial file, and a new full set of data will arrive every month, to be downloaded, parsed and saved into Core Data). I'm building this app for a company as an enterprise solution - the JSON file contains sensitive customer data and needs to be saved locally on the iPad so the app works even offline. It worked when the file was below 20 MB, but now the set of data …
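A concept sketch only (Python with the ijson library rather than Objective-C, since the idea transfers): parse the file incrementally instead of loading the whole document into memory. The top-level key "customers" and the file name are assumptions about the file layout.

import ijson

count = 0
with open("customers.json", "rb") as f:
    # Yield one record at a time; nothing close to 100 MB is ever resident.
    for customer in ijson.items(f, "customers.item"):
        count += 1          # persist each record (e.g. into a local store) here
print(count)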

Best way to delete millions of rows by ID

Submitted by 时间秒杀一切 on 2019-11-27 00:12:35
I need to delete about 2 million rows from my PostgreSQL database, and I have a list of the IDs to delete. However, every way I try to do this takes days. I tried putting the IDs in a table and deleting in batches of 100; four days later, it is still running with only 297,268 rows deleted (I select 100 IDs from an ID table, delete where IN that list, then delete those 100 from the IDs table). I also tried:

DELETE FROM tbl WHERE id IN (select * from ids)

That's taking forever, too. It's hard to gauge how long, since I can't see its progress until it's done, but the query was still running after 2 days. Just …
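One commonly recommended pattern, sketched here in Python with psycopg2 (the connection string, file name, and table names are placeholders): load the IDs into an indexed temporary table and delete with a single set-based join, instead of many small IN-list batches.

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
with conn, conn.cursor() as cur:
    cur.execute("CREATE TEMP TABLE ids_to_delete (id bigint PRIMARY KEY)")
    with open("ids.txt") as f:                            # one ID per line
        cur.copy_expert("COPY ids_to_delete (id) FROM STDIN", f)
    cur.execute("ANALYZE ids_to_delete")
    # One set-based delete instead of thousands of small transactions.
    cur.execute("DELETE FROM tbl USING ids_to_delete d WHERE tbl.id = d.id")
conn.close()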

Serious Memory Leak When Iteratively Parsing XML Files

Submitted by 眉间皱痕 on 2019-11-26 23:08:08
Question: Context: when iterating over a set of Rdata files (each containing a character vector of HTML code) that are loaded, analyzed (via XML functionality) and then removed from memory again, I experience a significant increase in the R process's memory consumption (eventually killing the process). It seems that freeing objects via free(), removing them via rm() and running gc() have no effect, so the memory consumption accumulates until there is no memory left. EDIT 2012-02-13 23:30 …
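A concept sketch in Python rather than R (the file path and the counted tag are placeholders): the general cure for this pattern is to stream-parse and discard nodes as soon as they have been inspected, so no full document tree accumulates across iterations.

import xml.etree.ElementTree as ET

def count_links(path):
    """Stream-parse one XML/XHTML file and drop nodes after inspecting them."""
    n = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag.endswith("a"):        # count anchor tags, as an example
            n += 1
        elem.clear()                      # free the subtree immediately
    return n

print(count_links("page.xhtml"))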

Hbase quickly count number of rows

Submitted by 我的未来我决定 on 2019-11-26 22:37:07
Question: Right now I implement a row count over a ResultScanner like this:

for (Result rs = scanner.next(); rs != null; rs = scanner.next()) { number++; }

If the data reaches millions of rows, the computation takes a long time. I want the count in real time, and I don't want to use MapReduce. How can I quickly count the number of rows?

Answer 1: Use RowCounter in HBase. RowCounter is a MapReduce job that counts all the rows of a table. It is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if …
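If MapReduce really must be avoided, one client-side alternative is to scan only row keys. A hedged sketch with the happybase Thrift client (host and table name are placeholders); RowCounter remains the more scalable option for very large tables.

import happybase

connection = happybase.Connection("localhost")   # Thrift server host (assumed)
table = connection.table("mytable")

count = 0
# Key-only filters avoid shipping cell values, so the scan moves far less data.
for _key, _data in table.scan(
        filter="KeyOnlyFilter() AND FirstKeyOnlyFilter()",
        batch_size=1000):
    count += 1
print(count)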

Spark parquet partitioning : Large number of files

Submitted by 梦想与她 on 2019-11-26 22:36:03
Question: I am trying to leverage Spark partitioning. I was trying to do something like:

data.write.partitionBy("key").parquet("/location")

The issue is that each partition creates a huge number of parquet files, which results in slow reads when reading from the root directory. To avoid that I tried:

data.coalesce(numPart).write.partitionBy("key").parquet("/location")

This, however, creates numPart parquet files in each partition. Now my partition sizes are different, so I would ideally like to have …
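A hedged PySpark sketch of one way to get a single file per key (the paths and the column name are placeholders): repartition on the same column used in partitionBy, so all rows for a key end up in one task and therefore one output file.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-partitioning").getOrCreate()
data = spark.read.parquet("/source")      # assumed input location

(data.repartition("key")                  # shuffle: all rows for a key in one partition
     .write.mode("overwrite")
     .partitionBy("key")
     .parquet("/location"))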

How can I import a large (14 GB) MySQL dump file into a new MySQL database?

Submitted by 吃可爱长大的小学妹 on 2019-11-26 22:28:47
Question: How can I import a large (14 GB) MySQL dump file into a new MySQL database?

Answer 1: I've searched around, and only this solution helped me:

mysql -u root -p
set global net_buffer_length=1000000; -- Set network buffer length to a large byte number
set global max_allowed_packet=1000000000; -- Set maximum allowed packet size to a large byte number
SET foreign_key_checks = 0; -- Disable foreign key checking to avoid delays, errors and unwanted behaviour
source file.sql -- Import your SQL dump file
SET …
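For repeat runs, the same import can also be driven from a script; a hedged Python sketch (user, database and file name are placeholders, and credentials are assumed to come from an option file such as ~/.my.cnf):

import subprocess

with open("file.sql", "rb") as dump:
    subprocess.run(
        ["mysql", "--max_allowed_packet=1G", "-u", "root", "targetdb"],
        stdin=dump,             # stream the dump into the mysql client
        check=True,             # stop with an error if the import fails
    )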

Reading big data with fixed width

Submitted by 好久不见. on 2019-11-26 22:22:23
Question: How can I read big data formatted with fixed width? I read this question and tried some of the tips, but all the answers are for delimited data (such as .csv), which is not my case. The file is 558 MB, and I don't know how many lines it has. I'm using:

dados <- read.fwf('TS_MATRICULA_RS.txt', width=c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2, 9, 9, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1 …
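A hedged sketch using pandas instead of R's read.fwf: reading the fixed-width file in chunks means the 558 MB file never has to fit in memory at once. The widths shown are only the first few from the question, as placeholders for the full layout.

import pandas as pd

widths = [5, 13, 14, 3, 3, 5, 4, 6, 6, 6]        # placeholder: use the full width vector

chunks = pd.read_fwf("TS_MATRICULA_RS.txt", widths=widths,
                     header=None, chunksize=100_000)
n_rows = 0
for chunk in chunks:
    n_rows += len(chunk)                         # process each chunk here
print(n_rows)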