bigdata

Scala immutable Map slow

柔情痞子 submitted on 2019-12-06 05:13:35
Question: I have a piece of code where I create a map like: val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap Then I use this map to create my object: case class MyObject(val attribute1: String, val attribute2: Map[String, String]) I'm reading millions of lines and converting them to MyObjects using an iterator, like MyObject("1", map). When I do this it is really slow: more than 1 hour for 2,000,000 entries. I removed the map from the object creation, but still I do the…
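A minimal sketch of the parsing step, assuming the ;-separated key"value layout shown in the question (the sample field and the ParseDemo wrapper are invented); building the Map through a builder instead of split/collect/toMap is one common way to cut per-line allocation when millions of small immutable Maps are constructed:

    object ParseDemo extends App {
      case class MyObject(attribute1: String, attribute2: Map[String, String])

      // Hypothetical attribute field, standing in for gtfLineArr(8).
      val field = "gene_id \"ENSG0001\";gene_name \"ABC\""

      // Builder-based variant: avoids the intermediate arrays and tuples
      // that split/collect/toMap allocate for every input line.
      def parseAttributes(s: String): Map[String, String] = {
        val b = Map.newBuilder[String, String]
        s.split(";").foreach { kv =>
          kv.split("\"") match {
            case Array(k, v) => b += (k.trim -> v)
            case _           => // ignore malformed pairs
          }
        }
        b.result()
      }

      println(MyObject("1", parseAttributes(field)))
    }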

clusterExport to single thread in R parallel

∥☆過路亽.° submitted on 2019-12-06 04:52:32
I would like to split a large data.frame into chunks and pass each one individually to a different member of the cluster. Something like: library(parallel) cl <- makeCluster(detectCores()) for (i in 1:detectCores()) { clusterExport(cl, mydata[indices[[i]]], <extra option to specify a thread/process>) } Is this possible? Here is an example that uses clusterCall inside a for loop to send a different chunk of the data frame to each of the workers: library(parallel) cl <- makeCluster(detectCores()) df <- data.frame(a=1:10, b=1:10) ix <- splitIndices(nrow(df), length(cl)) for (i in seq_along(cl)) {…

Apache Spark - How does the internal job scheduler in Spark define what users and pools are

时间秒杀一切 submitted on 2019-12-06 03:51:22
Question: I am sorry about being a little general here, but I am a little confused about how job scheduling works internally in Spark. From the documentation here I get that it is some sort of implementation of the Hadoop Fair Scheduler. I am unable to understand who exactly the users are here (are they Linux users, Hadoop users, Spark clients?). I am also unable to understand how the pools are defined here. For example, in my Hadoop cluster I have given resource allocation to two different…
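For concreteness, a small sketch of how pools come into play inside one application (the pool name "production" is invented; pools themselves would be declared in a fairscheduler.xml referenced via spark.scheduler.allocation.file). A "user" in this per-application sense is simply whichever thread submits a job on the shared SparkContext:

    import org.apache.spark.{SparkConf, SparkContext}

    object PoolDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("PoolDemo")
          .setMaster("local[*]")
          .set("spark.scheduler.mode", "FAIR") // enable the fair scheduler
        val sc = new SparkContext(conf)

        // Jobs submitted by this thread are now scheduled in the
        // "production" pool rather than the default pool.
        sc.setLocalProperty("spark.scheduler.pool", "production")
        sc.parallelize(1 to 1000000).sum()

        // Clearing the property returns later jobs to the default pool.
        sc.setLocalProperty("spark.scheduler.pool", null)
        sc.stop()
      }
    }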

Shuffled vs non-shuffled coalesce in Apache Spark

女生的网名这么多〃 submitted on 2019-12-06 03:33:41
Question: What is the difference between the following transformations when they are executed right before writing an RDD to a file? coalesce(1, shuffle = true) coalesce(1, shuffle = false) Code example: val input = sc.textFile(inputFile) val filtered = input.filter(doSomeFiltering) val mapped = filtered.map(doSomeMapping) mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile) vs mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile) And how does it compare with collect()? I'm fully aware…
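A runnable sketch contrasting the two calls on a toy pipeline (file paths and transformations are invented). With shuffle = false the coalesce is pushed up into the preceding stage, so filter and map also run in a single task; with shuffle = true the upstream work keeps its parallelism and one shuffle moves the results into the single output partition:

    import org.apache.spark.{SparkConf, SparkContext}

    object CoalesceDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("CoalesceDemo").setMaster("local[*]")
        val sc = new SparkContext(conf)
        val mapped = sc.textFile("input.txt").filter(_.nonEmpty).map(_.toUpperCase)

        // One task end to end: the whole pipeline collapses to 1 partition.
        mapped.coalesce(1, shuffle = false).saveAsTextFile("out-narrow")

        // filter/map run in parallel; a shuffle then merges into 1 partition.
        mapped.coalesce(1, shuffle = true).saveAsTextFile("out-shuffled")
        sc.stop()
      }
    }

Unlike collect(), both variants keep the data on the executors and write straight to the output path instead of pulling everything into the driver.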

Is Hive faster than Spark?

风流意气都作罢 submitted on 2019-12-06 03:32:51
Question: After reading What is hive, Is it a database?, a colleague yesterday mentioned that he was able to filter a 15B-row table, join it with another table after doing a "group by", which resulted in 6B records, in only 10 minutes! I wonder if this would be slower in Spark, since now, with DataFrames, the two may be comparable, but I am not sure, thus the question. Is Hive faster than Spark? Or is this question meaningless? Sorry for my ignorance. He uses the latest Hive, which from … seems to be…
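Purely to make the comparison concrete, here is a rough Spark SQL sketch of the workload described (every table and column name is invented); whether Hive or Spark wins depends far more on the execution engine under Hive and on the cluster than on the API:

    import org.apache.spark.sql.SparkSession

    object HiveVsSparkDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("HiveVsSparkDemo")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        val big = spark.table("events")    // stands in for the 15B-row table
        val dim = spark.table("customers") // the table joined afterwards

        val result = big
          .filter($"status" === "ok")      // the filter step
          .groupBy($"customer_id").count() // the "group by"
          .join(dim, "customer_id")        // the join producing ~6B records

        result.write.parquet("/out/result")
        spark.stop()
      }
    }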

check number of unique values in each column of a matrix in spark

假装没事ソ submitted on 2019-12-06 03:30:26
I have a CSV file currently stored as a DataFrame in Spark: scala> df res11: org.apache.spark.sql.DataFrame = [2013-03-25 12:49:36.000: string, OES_PSI603_EC1: string, 250.3315__SI: string, 250.7027__SI: string, 251.0738__SI: string, 251.4448__SI: string, 251.8159__SI: string, 252.1869__SI: string, 252.5579__SIF: string, 252.9288__SI: string, 253.2998__SIF: string, 253.6707__SIF: string, 254.0415__CI2: string, 254.4124__CI2: string, 254.7832__CI2: string, 255.154: string, 255.5248__NO: string, 255.8955__NO: string, 256.2662__NO: string, 256.6369: string, 257.0075: string, 257.3782: string, 257…
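One way to answer the title question in a single pass over the DataFrame (Spark 2.0+ syntax; the file path is a placeholder, and countDistinct can be swapped for approx_count_distinct if an estimate is good enough for data this wide):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, countDistinct}

    val spark = SparkSession.builder().appName("UniqueCounts").getOrCreate()
    val df = spark.read.option("header", "true").csv("data.csv") // hypothetical path

    // One aggregation across all columns; each output column holds the
    // number of distinct values found in the corresponding input column.
    val counts = df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
    counts.show()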

SPARK read.json throwing java.io.IOException: Too many bytes before newline

↘锁芯ラ submitted on 2019-12-06 03:25:44
I am getting the following error when reading a large 6 GB single-line JSON file: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648 Spark does not read JSON files with newlines, hence the entire 6 GB JSON file is on a single line: jf = sqlContext.read.json("jlrn2.json") Configuration: spark.driver.memory 20g Yep, you have more than Integer.MAX_VALUE bytes in your line. You need to split it up. Keep in mind that Spark expects each line to be a…
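If upgrading is possible, one workaround sketch (Spark 2.2+ syntax; the file name comes from the question): read the file in multiLine mode so Spark parses it as a whole JSON document instead of one record per line, which avoids the newline scan entirely, though the 6 GB document is then still parsed by a single task:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("BigJson").getOrCreate()

    // multiLine mode treats the whole file as one JSON document, so the
    // "Too many bytes before newline" limit never applies. The trade-off:
    // a single task reads all 6 GB, so pre-splitting the file into
    // line-delimited records remains the more scalable fix.
    val jf = spark.read.option("multiLine", "true").json("jlrn2.json")
    jf.printSchema()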

Azure 4 min timeout in web app

怎甘沉沦 submitted on 2019-12-06 03:06:05
Question: My project is an ASP.NET MVC 4 project. While it works fine on localhost, when I host it in Azure I get a timeout on AJAX calls that take more than 4 minutes. I am sure that the problem is with Azure, because it doesn't matter what I'm doing on the server; even if I just call Thread.Sleep(300000) I get a timeout. I read in https://azure.microsoft.com/en-us/blog/new-configurable-idle-timeout-for-azure-load-balancer/ that a common practice to keep the connection active for a longer period is to use…

fitting a linear mixed model to a very large data set

喜夏-厌秋 submitted on 2019-12-06 03:01:14
Question: I want to run a mixed model (using lme4::lmer) on 60M observations of the following format; all predictor variables are categorical (factors) apart from the continuous dependent variable tc; patient is the grouping variable for a random intercept term. I have 64-bit R and 16 GB RAM and I'm working from a central server. RStudio is the most recent server version. model <- lmer(tc~sex+age+lho+atc+(1|patient), data=master, REML=TRUE) Sample data: lho sex tc age atc patient / 18 M 16.61 45-54 H 628143…

Error when enabling data encryption using local key MONGODB

南楼画角 submitted on 2019-12-06 01:25:09
Question: I have successfully encrypted communication in MongoDB, but when I try to enable data encryption I get errors. I am using the Enterprise edition of MongoDB, version 3.2.4. I get the following message in the console: ERROR: child process failed, exited with error number 14 But when I look at the logs I see a detailed error as follows: Unable to retrieve key .system, error: there are existing data files, but no valid keystore could be located. Fatal Assertion 28561 The following is…