bigdata

Is Hive faster than Spark?

喜你入骨 submitted on 2019-12-04 09:11:39
After reading What is hive, Is it a database?, a colleague yesterday mentioned that he was able to filter a 15B-row table, join it with another table after doing a "group by", which resulted in 6B records, in only 10 minutes! I wonder whether this would be slower in Spark; now with DataFrames they may be comparable, but I am not sure, hence the question. Is Hive faster than Spark? Or does this question not even make sense? Sorry for my ignorance. He uses the latest Hive, which seems to be using Tez. Hive is just a framework that gives SQL functionality to MapReduce-type workloads. These
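
As a rough illustration only (not from the original post), the filter + group-by + join workload described above could be expressed in Spark roughly as follows; the table names (big_table, other_table) and columns (key, value) are assumptions:

    import org.apache.spark.sql.SparkSession

    // Sketch of the filter + group-by + join workload; all table and column names are made up.
    val spark = SparkSession.builder()
      .appName("hive-vs-spark-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val filtered = spark.table("big_table").filter("value > 0")    // filter the large table
    val grouped = filtered.groupBy("key").count()                  // the "group by" step
    val joined = grouped.join(spark.table("other_table"), "key")   // join with the second table
    joined.write.mode("overwrite").saveAsTable("joined_result")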

creating partition in external table in hive

北城余情 submitted on 2019-12-04 08:34:05
Question: I have successfully created and added dynamic partitions to an internal table in Hive, i.e. by using the following steps: 1) created a source table; 2) loaded data from local into the source table; 3) created another table with partitions (partition_table); 4) inserted the data into this table from the source table, resulting in all the partitions being created dynamically. My question is: how do I do this with an external table? I have read so many articles on this, but I am confused about whether I have to specify a path to
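
As a hedged sketch only (not from the original question), a dynamic-partition insert into an external table might look like the following when run through a Hive-enabled Spark session; the table names, columns, and HDFS location are placeholders, and the same statements can equally be run from the Hive CLI:

    import org.apache.spark.sql.SparkSession

    // Sketch of dynamic-partition inserts into an external table; names and paths are placeholders.
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS partition_ext (id INT, name STRING)
      PARTITIONED BY (dt STRING)
      STORED AS PARQUET
      LOCATION '/user/hive/warehouse/external/partition_ext'
    """)

    // The partition column (dt) goes last in the SELECT; partitions are created dynamically from its values.
    spark.sql("""
      INSERT INTO TABLE partition_ext PARTITION (dt)
      SELECT id, name, dt FROM source_table
    """)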

Does anyone have a List of hive error codes?

蓝咒 submitted on 2019-12-04 08:33:50
Does anyone have the list of Hive error codes? For example, if we get a "table not found" error in Hive, the value of "echo $?" will be 17. If you look at https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ErrorMsg.java you'll see most of the error codes, although in order to map these to an exit code you'd probably have to walk through the CLI code to trace that. Source: https://stackoverflow.com/questions/49063409/does-anyone-have-a-list-of-hive-error-codes
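
Purely as an illustration (not part of the original answer), the exit code mentioned above can also be captured programmatically; the query is hypothetical and the expected code 17 is the value reported in the post:

    import scala.sys.process._

    // Run the Hive CLI and capture its exit status; "!" returns the process exit code.
    val exitCode = Seq("hive", "-e", "SELECT * FROM table_that_does_not_exist").!
    println(s"hive exited with code $exitCode")   // reportedly 17 for a missing table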

How do I determine the size of my HBase tables? Is there any command to do so?

匆匆过客 submitted on 2019-12-04 08:26:29
Question: I have multiple tables in my HBase shell that I would like to copy onto my file system. Some tables exceed 100 GB. However, I only have 55 GB of free space left on my local file system. Therefore, I would like to know the size of my HBase tables so that I can export only the smaller ones. Any suggestions are appreciated. Thanks, gautham
Answer 1: Try hdfs dfs -du -h /hbase/data/default/ (or /hbase/, depending on the HBase version you use). This will show how much space is used by the files of your tables.
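
As a sketch only (not from the original answer), the same size check can be done with the Hadoop FileSystem API; the path follows the answer's suggestion, the table name my_table is a placeholder, and the Hadoop configuration on the classpath is assumed to point at the cluster:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sum the bytes under a table's HDFS directory; adjust the path for older HBase layouts.
    val fs = FileSystem.get(new Configuration())
    val tableDir = new Path("/hbase/data/default/my_table")
    val bytes = fs.getContentSummary(tableDir).getLength
    println(f"my_table uses ${bytes / math.pow(1024, 3)}%.2f GB on HDFS")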

In spark, how does broadcast work?

烈酒焚心 submitted on 2019-12-04 08:19:13
Question: This is a very simple question: in Spark, broadcast can be used to send variables to executors efficiently. How does this work? More precisely: when are the values sent, as soon as I call broadcast, or only when they are used? Where exactly is the data sent, to all executors or only to the ones that will need it? Where is the data stored, in memory or on disk? Is there a difference in how simple variables and broadcast variables are accessed? What happens under the hood when I call
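
For reference (not part of the original question), a minimal broadcast usage looks roughly like this; the lookup map contents are made up:

    import org.apache.spark.sql.SparkSession

    // Minimal broadcast sketch: a small lookup map is broadcast from the driver
    // and read on the executors via .value inside the map function.
    val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()
    val sc = spark.sparkContext

    val lookup = Map(1 -> "one", 2 -> "two", 3 -> "three")
    val bcLookup = sc.broadcast(lookup)

    val rdd = sc.parallelize(Seq(1, 2, 3, 2, 1))
    val named = rdd.map(id => bcLookup.value.getOrElse(id, "unknown"))
    named.collect().foreach(println)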

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

血红的双手。 submitted on 2019-12-04 08:19:12
I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components. My vectors are composed of one email and 30 qualitative variables. Each qualitative variable has 4 classes: 0, 1, 2 and 3. So the first thing I'm doing is to load the FactoMineR library and to load my data:

    library(FactoMineR)
    mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")

Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):

    for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}

I'm removing the emails from my vectors: mydata2 = mydata[2

Shuffled vs non-shuffled coalesce in Apache Spark

十年热恋 submitted on 2019-12-04 07:43:56
What is the difference between the following transformations when they are executed right before writing an RDD to a file?

    coalesce(1, shuffle = true)
    coalesce(1, shuffle = false)

Code example:

    val input = sc.textFile(inputFile)
    val filtered = input.filter(doSomeFiltering)
    val mapped = filtered.map(doSomeMapping)

    mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile)

vs

    mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)

And how does it compare with collect()? I'm fully aware that Spark save methods will store it with HDFS-style structure; however, I'm more interested in data
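
One fact worth noting here (not stated in the excerpt, but true of the RDD API): repartition(n) is defined as coalesce(n, shuffle = true). A rough sketch of the two variants, with placeholder paths and an existing SparkContext sc assumed, as in the snippet above:

    // Sketch with placeholder paths; both variants end up with a single output part file.
    val mapped = sc.textFile("/tmp/input").map(_.toUpperCase)

    // shuffle = true: an extra shuffle stage moves the data to one task, but the
    // upstream map work keeps its original parallelism (same as repartition(1)).
    mapped.coalesce(1, shuffle = true).saveAsTextFile("/tmp/out_shuffled")

    // shuffle = false: no shuffle; partitions are merged, which can also collapse
    // the upstream computation into a single task.
    mapped.coalesce(1, shuffle = false).saveAsTextFile("/tmp/out_merged")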

Azure 4 min timeout in web app

和自甴很熟 submitted on 2019-12-04 07:39:37
My project is an ASP.NET MVC 4 project. While it works fine on localhost, when I host it in Azure I get a timeout on AJAX calls that take more than 4 minutes. I am sure that the problem is with Azure, because it doesn't matter what I do on the server; even just setting Thread.Sleep(300000) gives a timeout. I read in https://azure.microsoft.com/en-us/blog/new-configurable-idle-timeout-for-azure-load-balancer/ that a common practice to keep the connection active for a longer period is to use TCP keep-alive, and that there is no other option for web apps. So I guess that what I need is help to keep

Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill)

北慕城南 submitted on 2019-12-04 07:27:29
Question: I want to do some "near real-time" data analysis (OLAP-like) on data stored in HDFS. My research showed that the three frameworks mentioned report significant performance gains compared to Apache Hive. Does anyone have practical experience with any one of those? Not only concerning performance, but also with respect to stability?
Answer 1: A comparison between Hive and Impala or Spark or Drill sometimes sounds inappropriate to me. The goals behind developing Hive and these tools were

fitting a linear mixed model to a very large data set

我怕爱的太早我们不能终老 submitted on 2019-12-04 07:21:13
I want to run a mixed model (using lme4::lmer) on 60M observations of the following format; all predictor/dependent variables are categorical (factors) apart from the continuous dependent variable tc; patient is the grouping variable for a random intercept term. I have 64-bit R and 16 GB of RAM and I'm working from a central server. RStudio is the most recent server version.

    model <- lmer(tc ~ sex + age + lho + atc + (1|patient), data = master, REML = TRUE)

    lho sex    tc   age atc patient
     18   M 16.61 45-54   H  628143
      7   F 10.52 12-15   G 2013855
     30   M 92.73 35-44   N 2657693
     19   M 24.92 70-74   G 2420965
     12   F 17.44 65-69   A