bigdata

Scala immutable Map slow

柔情痞子 submitted on 2019-12-06 05:13:35
Question: I have a piece of code where I create a map like: val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap Then I use this map to create my object: case class MyObject(val attribute1: String, val attribute2: Map[String, String]) I'm reading millions of lines and converting them to MyObjects using an iterator, like MyObject("1", map). When I do this it is really slow: more than 1 hour for 2,000,000 entries. I removed the map from the object creation, but still I do the…
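A minimal sketch of the parsing step, assuming the ;-separated key"value layout shown in the question (the sample field and the ParseDemo wrapper are invented); building the Map through a builder instead of split/collect/toMap is one common way to cut per-line allocation when millions of small immutable Maps are constructed:

    object ParseDemo extends App {
      case class MyObject(attribute1: String, attribute2: Map[String, String])

      // Hypothetical attribute field, standing in for gtfLineArr(8).
      val field = "gene_id \"ENSG0001\";gene_name \"ABC\""

      // Builder-based variant: avoids the intermediate arrays and tuples
      // that split/collect/toMap allocate for every input line.
      def parseAttributes(s: String): Map[String, String] = {
        val b = Map.newBuilder[String, String]
        s.split(";").foreach { kv =>
          kv.split("\"") match {
            case Array(k, v) => b += (k.trim -> v)
            case _           => // ignore malformed pairs
          }
        }
        b.result()
      }

      println(MyObject("1", parseAttributes(field)))
    }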

clusterExport to single thread in R parallel

∥☆過路亽.° submitted on 2019-12-06 04:52:32
I would like to split a large data.frame into chunks and pass each one individually to a different member of the cluster. Something like: library(parallel) cl <- makeCluster(detectCores()) for (i in 1:detectCores()) { clusterExport(cl, mydata[indices[[i]]], <extra option to specify a thread/process>) } Is this possible? Here is an example that uses clusterCall inside a for loop to send a different chunk of the data frame to each of the workers: library(parallel) cl <- makeCluster(detectCores()) df <- data.frame(a=1:10, b=1:10) ix <- splitIndices(nrow(df), length(cl)) for (i in seq_along(cl)) {…

Apache Spark - How does the internal job scheduler in Spark define what users and pools are

时间秒杀一切 submitted on 2019-12-06 03:51:22
Question: I am sorry about being a little general here, but I am a little confused about how job scheduling works internally in Spark. From the documentation here I get that it is some sort of implementation of the Hadoop Fair Scheduler. I am unable to understand who exactly the users are here (are they Linux users, Hadoop users, Spark clients?). I am also unable to understand how the pools are defined here. For example, in my Hadoop cluster I have given resource allocation to two different…
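For concreteness, a small sketch of how pools come into play inside one application (the pool name "production" is invented; pools themselves would be declared in a fairscheduler.xml referenced via spark.scheduler.allocation.file). A "user" in this per-application sense is simply whichever thread submits a job on the shared SparkContext:

    import org.apache.spark.{SparkConf, SparkContext}

    object PoolDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("PoolDemo")
          .setMaster("local[*]")
          .set("spark.scheduler.mode", "FAIR") // enable the fair scheduler
        val sc = new SparkContext(conf)

        // Jobs submitted by this thread are now scheduled in the
        // "production" pool rather than the default pool.
        sc.setLocalProperty("spark.scheduler.pool", "production")
        sc.parallelize(1 to 1000000).sum()

        // Clearing the property returns later jobs to the default pool.
        sc.setLocalProperty("spark.scheduler.pool", null)
        sc.stop()
      }
    }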

Shuffled vs non-shuffled coalesce in Apache Spark

女生的网名这么多〃 submitted on 2019-12-06 03:33:41
Question: What is the difference between the following transformations when they are executed right before writing an RDD to a file? coalesce(1, shuffle = true) coalesce(1, shuffle = false) Code example: val input = sc.textFile(inputFile) val filtered = input.filter(doSomeFiltering) val mapped = filtered.map(doSomeMapping) mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile) vs mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile) And how does it compare with collect()? I'm fully aware…
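A runnable sketch contrasting the two calls on a toy pipeline (file paths and transformations are invented). With shuffle = false the coalesce is pushed up into the preceding stage, so filter and map also run in a single task; with shuffle = true the upstream work keeps its parallelism and one shuffle moves the results into the single output partition:

    import org.apache.spark.{SparkConf, SparkContext}

    object CoalesceDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("CoalesceDemo").setMaster("local[*]")
        val sc = new SparkContext(conf)
        val mapped = sc.textFile("input.txt").filter(_.nonEmpty).map(_.toUpperCase)

        // One task end to end: the whole pipeline collapses to 1 partition.
        mapped.coalesce(1, shuffle = false).saveAsTextFile("out-narrow")

        // filter/map run in parallel; a shuffle then merges into 1 partition.
        mapped.coalesce(1, shuffle = true).saveAsTextFile("out-shuffled")
        sc.stop()
      }
    }

Unlike collect(), both variants keep the data on the executors and write straight to the output path instead of pulling everything into the driver.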

Is Hive faster than Spark?

风流意气都作罢 submitted on 2019-12-06 03:32:51
Question: After reading What is hive, Is it a database?, a colleague yesterday mentioned that he was able to filter a 15B-row table, join it with another table after doing a "group by", which resulted in 6B records, in only 10 minutes! I wonder if this would be slower in Spark, since now, with DataFrames, the two may be comparable, but I am not sure, thus the question. Is Hive faster than Spark? Or is this question meaningless? Sorry for my ignorance. He uses the latest Hive, which from … seems to be…
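Purely to make the comparison concrete, here is a rough Spark SQL sketch of the workload described (every table and column name is invented); whether Hive or Spark wins depends far more on the execution engine under Hive and on the cluster than on the API:

    import org.apache.spark.sql.SparkSession

    object HiveVsSparkDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("HiveVsSparkDemo")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        val big = spark.table("events")    // stands in for the 15B-row table
        val dim = spark.table("customers") // the table joined afterwards

        val result = big
          .filter($"status" === "ok")      // the filter step
          .groupBy($"customer_id").count() // the "group by"
          .join(dim, "customer_id")        // the join producing ~6B records

        result.write.parquet("/out/result")
        spark.stop()
      }
    }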

check number of unique values in each column of a matrix in spark

假装没事ソ submitted on 2019-12-06 03:30:26
I have a CSV file currently stored as a DataFrame in Spark: scala> df res11: org.apache.spark.sql.DataFrame = [2013-03-25 12:49:36.000: string, OES_PSI603_EC1: string, 250.3315__SI: string, 250.7027__SI: string, 251.0738__SI: string, 251.4448__SI: string, 251.8159__SI: string, 252.1869__SI: string, 252.5579__SIF: string, 252.9288__SI: string, 253.2998__SIF: string, 253.6707__SIF: string, 254.0415__CI2: string, 254.4124__CI2: string, 254.7832__CI2: string, 255.154: string, 255.5248__NO: string, 255.8955__NO: string, 256.2662__NO: string, 256.6369: string, 257.0075: string, 257.3782: string, 257…
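One way to answer the title question in a single pass over the DataFrame (Spark 2.0+ syntax; the file path is a placeholder, and countDistinct can be swapped for approx_count_distinct if an estimate is good enough for data this wide):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, countDistinct}

    val spark = SparkSession.builder().appName("UniqueCounts").getOrCreate()
    val df = spark.read.option("header", "true").csv("data.csv") // hypothetical path

    // One aggregation across all columns; each output column holds the
    // number of distinct values found in the corresponding input column.
    val counts = df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
    counts.show()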

SPARK read.json throwing java.io.IOException: Too many bytes before newline

↘锁芯ラ submitted on 2019-12-06 03:25:44
I am getting the following error when reading a large 6 GB single-line JSON file: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648 Spark does not read JSON files with newlines, hence the entire 6 GB JSON file is on a single line: jf = sqlContext.read.json("jlrn2.json") Configuration: spark.driver.memory 20g Yep, you have more than Integer.MAX_VALUE bytes in your line. You need to split it up. Keep in mind that Spark expects each line to be a…
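If upgrading is possible, one workaround sketch (Spark 2.2+ syntax; the file name comes from the question): read the file in multiLine mode so Spark parses it as a whole JSON document instead of one record per line, which avoids the newline scan entirely, though the 6 GB document is then still parsed by a single task:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("BigJson").getOrCreate()

    // multiLine mode treats the whole file as one JSON document, so the
    // "Too many bytes before newline" limit never applies. The trade-off:
    // a single task reads all 6 GB, so pre-splitting the file into
    // line-delimited records remains the more scalable fix.
    val jf = spark.read.option("multiLine", "true").json("jlrn2.json")
    jf.printSchema()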

Azure 4 min timeout in web app

怎甘沉沦 submitted on 2019-12-06 03:06:05
Question: My project is an ASP.NET MVC 4 project. While it works fine on localhost, when I host it in Azure I get a timeout on AJAX calls that take more than 4 minutes. I am sure that the problem is with Azure, because it doesn't matter what I'm doing on the server; even if I just call Thread.Sleep(300000) I get a timeout. I read in https://azure.microsoft.com/en-us/blog/new-configurable-idle-timeout-for-azure-load-balancer/ that a common practice to keep the connection active for a longer period is to use…

fitting a linear mixed model to a very large data set

喜夏-厌秋 submitted on 2019-12-06 03:01:14
Question: I want to run a mixed model (using lme4::lmer) on 60M observations of the following format; all predictor variables are categorical (factors) apart from the continuous dependent variable tc; patient is the grouping variable for a random intercept term. I have 64-bit R and 16 GB RAM and I'm working from a central server. RStudio is the most recent server version. model <- lmer(tc~sex+age+lho+atc+(1|patient), data=master, REML=TRUE) Sample data: lho sex tc age atc patient / 18 M 16.61 45-54 H 628143…

Error when enabling data encryption using local key MONGODB

南楼画角 submitted on 2019-12-06 01:25:09
Question: I have successfully encrypted communication in MongoDB, but when I try to enable data encryption I get errors. I am using the Enterprise edition of MongoDB, version 3.2.4. I get the following message in the console: ERROR: child process failed, exited with error number 14 But when I look at the logs I see a detailed error as follows: Unable to retrieve key .system, error: there are existing data files, but no valid keystore could be located. Fatal Assertion 28561 The following is…