bigdata

Skewed tables in Hive

笑着哭i submitted on 2019-12-04 20:46:55
Question: I am learning Hive and came across skewed tables. Help me understand them. What are skewed tables in Hive? How do we create skewed tables? How do they affect performance? Answer 1: What are skewed tables in Hive? A skewed table is a special type of table where the values that appear very often (heavy skew) are split out into separate files and the rest of the values go to some other file. How do we create skewed tables? create table <T> (schema) skewed by (keys) on ('value1', 'value2') [STORED as …
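A hedged sketch of how such a table might be created, issued here through the Hive JDBC driver; the connection URL, table name, columns, and skewed values are placeholders, not taken from the answer:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateSkewedTable {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Assumes an unsecured HiveServer2 instance at a placeholder address.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // '1' and '5' stand in for the heavily skewed key values. With
                // STORED AS DIRECTORIES, Hive list-buckets the table, keeping the
                // skewed values in their own subdirectories.
                stmt.execute(
                    "CREATE TABLE t (key STRING, value STRING) " +
                    "SKEWED BY (key) ON ('1', '5') " +
                    "STORED AS DIRECTORIES");
            }
        }
    }

Queries that filter on one of the listed values can then read only that value's files, which is where the performance benefit is intended to come from.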

Package bigmemory not installing on R 64 3.0.2 [duplicate]

江枫思渺然 submitted on 2019-12-04 20:23:35
This question already has answers here: How should I deal with “package 'xxx' is not available (for R version x.y.z)” warning? (15 answers) Closed 5 years ago. I am trying to install the bigmemory package in R 64 version 3.0.2 on Windows. I get the following error: install.packages('bigmemory') Installing package into ‘C:/Users/Audrey/Documents/R/win-library/3.0’ (as ‘lib’ is unspecified) Warning message: package ‘bigmemory’ is not available (for R version 3.0.2) > library(bigmemory) Error in library(bigmemory) : there is no package called ‘bigmemory’ Any help or insight will be much …

How to load xls data from multiple xls file into hive?

旧巷老猫 submitted on 2019-12-04 19:40:39
I am learning to use Hadoop for performing Big Data related operations. I need to perform some queries on a collection of data sets split across 8 xls files. Each xls file has multiple sheets and the query concerns only one of the sheets. The dataset can be downloaded here: http://www.census.gov/hhes/www/hlthins/data/utilization/tables.html I am not using any commercial distro of Hadoop for my tasks; I just have one master and one slave VM set up in VMware with Hadoop, Hive and Pig on them. I am a novice with Hadoop and Big Data, so if anyone could guide me with how to proceed further I'd be very …
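The excerpt stops before any answer; as one possible direction (my sketch, not from the post), each .xls sheet can be exported to CSV so that Hive can load it into an ordinary delimited table. This assumes the Apache POI library and a placeholder file name:

    import java.io.File;
    import java.io.PrintWriter;
    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.DataFormatter;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.ss.usermodel.WorkbookFactory;

    public class XlsToCsv {
        public static void main(String[] args) throws Exception {
            DataFormatter formatter = new DataFormatter();
            try (Workbook wb = WorkbookFactory.create(new File("tables.xls")); // placeholder name
                 PrintWriter out = new PrintWriter("tables.csv")) {
                Sheet sheet = wb.getSheetAt(0); // the one sheet the query actually needs
                for (Row row : sheet) {
                    StringBuilder line = new StringBuilder();
                    for (Cell cell : row) {
                        if (line.length() > 0) line.append(',');
                        // Note: no CSV quoting here, so cells containing commas
                        // would need extra handling in a real conversion.
                        line.append(formatter.formatCellValue(cell));
                    }
                    out.println(line);
                }
            }
        }
    }

The resulting CSV files could then be copied to HDFS and loaded with LOAD DATA (or LOAD DATA LOCAL) into a table declared ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.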

C++ buffered file reading

会有一股神秘感。 submitted on 2019-12-04 16:56:52
I wonder if reading a large text file line by line (e.g., std::getline or fgets) can be buffered with a predefined read buffer size, or whether one must use special byte-wise functions? I mean reading very large files while optimizing the number of I/O operations (e.g., reading 32 MB from the HDD at a time). Of course I can handcraft buffered reading, but I thought standard file streams had that capability. Neither line-by-line nor special byte-wise functions. Instead, the following should do your job: std::ifstream file("input.txt"); std::istream_iterator<char> begin(file), end; std::vector<char> buffer …

How to perform Standard Deviation and Mean operations on a Java Spark RDD?

柔情痞子 submitted on 2019-12-04 16:54:36
I have a JavaRDD which looks like this: [ [A,8] [B,3] [C,5] [A,2] [B,8] ... ... ] I want my result to be the mean per key: [ [A,5] [B,5.5] [C,5] ] How do I do this using Java RDDs only? P.S.: I want to avoid the groupBy operation, so I am not using DataFrames. Here you go: import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.util.StatCounter; import scala.Tuple2; import scala.Tuple3; import java.util.Arrays; import java.util.List; public class …
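The answer's code is cut off right after its imports; those imports suggest Spark's StatCounter, so here is a hedged sketch along the same lines (the sample data is parallelized in the driver purely for illustration):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.util.StatCounter;
    import scala.Tuple2;

    public class KeyedStats {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("KeyedStats").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                List<Tuple2<String, Double>> data = Arrays.asList(
                    new Tuple2<>("A", 8.0), new Tuple2<>("B", 3.0), new Tuple2<>("C", 5.0),
                    new Tuple2<>("A", 2.0), new Tuple2<>("B", 8.0));
                JavaPairRDD<String, Double> pairs = sc.parallelizePairs(data);

                // Fold every value into a per-key StatCounter; the merge is
                // associative, so no groupBy is needed and values combine map-side.
                JavaPairRDD<String, StatCounter> stats = pairs.aggregateByKey(
                    new StatCounter(),
                    (acc, v) -> acc.merge(v),
                    (acc1, acc2) -> acc1.merge(acc2));

                // Both mean and standard deviation come out of the same counters.
                stats.mapValues(StatCounter::mean).collect()
                     .forEach(t -> System.out.println(t._1() + " mean=" + t._2()));
                stats.mapValues(StatCounter::stdev).collect()
                     .forEach(t -> System.out.println(t._1() + " stdev=" + t._2()));
            }
        }
    }

With the sample data this prints a mean of 5.0 for A, 5.5 for B and 5.0 for C, matching the expected output in the question.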

Sqoop import job fails due to task timeout

微笑、不失礼 submitted on 2019-12-04 16:03:22
I was trying to import a 1 TB table from MySQL to HDFS using Sqoop. The command used was: sqoop import --connect jdbc:mysql://xx.xx.xxx.xx/MyDB --username myuser --password mypass --table mytable --split-by rowkey -m 14 After executing the bounding vals query, all the mappers start, but after some time the tasks get killed due to a timeout (1200 seconds). This, I think, is because the SELECT query running in each mapper takes longer than the configured timeout (in Sqoop it seems to be 1200 seconds); the mapper therefore fails to report status, and the task subsequently gets …
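Not from the excerpt, but one commonly suggested direction for this failure mode is to raise the MapReduce task timeout so the long-running SELECT has time to return. The sketch below sets it on a plain Hadoop Configuration object; with the sqoop command line the analogous setting would be passed as a generic option (-D mapreduce.task.timeout=...) right after "import". The 3,600,000 ms value is only an example:

    import org.apache.hadoop.conf.Configuration;

    public class TimeoutConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Give each task an hour before the framework kills it for not
            // reporting status; the value is an illustration, not a recommendation.
            conf.setLong("mapreduce.task.timeout", 3600000L);
            System.out.println(conf.get("mapreduce.task.timeout"));
        }
    }

Choosing a --split-by column backed by an index would likely attack the underlying cause (slow range queries per mapper) rather than the symptom.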

custom inputformat for reading json in hadoop

旧巷老猫 submitted on 2019-12-04 16:02:20
I am a beginner with Hadoop. I have been told to create a custom InputFormat class to read JSON data. I have googled and learnt how to create a custom InputFormat class to read data from a file, but I am stuck on parsing the JSON data. My JSON data looks like this: [ { "_count": 30, "_start": 0, "_total": 180, "values": [ { "attachment": { "contentDomain": "techcarnival2013.eventbrite.com", "contentUrl": "http://techcarnival2013.eventbrite.com/", "imageUrl": "http://ebmedia.eventbrite.com/s3-s3/static/images/django/logos/eb_home_tm-trans-fb.png", "summary": "Get to know a few thousand of …
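The excerpt stops mid-document, so below is only a hedged sketch of the parsing step the question is stuck on, not a full InputFormat. It assumes the Jackson databind library (the post does not name one) and that each record handed to it is a complete JSON document:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonValueParser {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Walks the sample structure: a top-level array whose objects hold a
        // "values" array, each entry carrying an "attachment" object.
        public static void parse(String json) throws Exception {
            JsonNode root = MAPPER.readTree(json);
            for (JsonNode record : root) {
                for (JsonNode value : record.path("values")) {
                    String url = value.path("attachment").path("contentUrl").asText();
                    System.out.println(url);
                }
            }
        }
    }

Inside a custom RecordReader, the same readTree call would be applied to whatever chunk of the file the reader hands out as a single record.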

spark unix_timestamp data type mismatch

99封情书 submitted on 2019-12-04 15:49:48
Could someone help guide me on what data type or format I need to pass for the Spark from_unixtime() function to work? When I try the following it works, but it responds with the expression name rather than the current timestamp: from_unixtime(current_timestamp()) The response is below: fromunixtime(currenttimestamp(),yyyy-MM-dd HH:mm:ss) When I try to input from_unixtime(1392394861,"yyyy-MM-dd HH:mm:ss.SSSS") it simply fails with a type mismatch: error: type mismatch; found : Int(1392394861) required: org.apache.spark.sql.Column from_unixtime(1392394861,"yyyy-MM-dd HH:mm:ss.SSSS") What am I missing? I've …
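The error message itself points at the fix: from_unixtime expects a Column as its first argument, so a literal epoch value has to be wrapped with lit(). The question's output looks like it comes from the Scala shell, so the Java sketch below is a translation rather than the asker's code:

    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.from_unixtime;
    import static org.apache.spark.sql.functions.lit;

    public class FromUnixtimeExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("FromUnixtimeExample")
                    .master("local[*]")
                    .getOrCreate();

            // lit() turns the integer literal into a Column, which is exactly what
            // "required: org.apache.spark.sql.Column" is asking for.
            Column ts = from_unixtime(lit(1392394861), "yyyy-MM-dd HH:mm:ss.SSSS");

            // Evaluating the Column inside a query prints the formatted date
            // instead of just the expression's name.
            spark.range(1).select(ts.alias("ts")).show(false);
            spark.stop();
        }
    }

Note that from_unixtime takes seconds since the epoch, and that a Column printed on its own only shows its expression string, which is why the first attempt appeared to return fromunixtime(currenttimestamp(),...) rather than a value.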

How to subtract months from date in HIVE

五迷三道 submitted on 2019-12-04 15:36:16
I am looking for a method that helps me subtract months from a date in Hive. I have a date, 2015-02-01. Now I need to subtract 2 months from this date so that the result is 2014-12-01. Can you guys help me out here? Manoj R: select add_months('2015-02-01',-2); if you need to go back to the first day of the resulting month: select add_months(trunc('2015-02-01','MM'),-2); Please try the add_months date function and pass -2 as the months argument. Internally add_months uses the Java Calendar.add method, which supports adding or subtracting (by passing a negative integer). https://cwiki.apache.org/confluence/display/Hive
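As a plain-Java illustration of the Calendar.add behaviour the answer refers to (my sketch, not part of the answer):

    import java.text.SimpleDateFormat;
    import java.util.Calendar;

    public class SubtractMonths {
        public static void main(String[] args) throws Exception {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            Calendar cal = Calendar.getInstance();
            cal.setTime(fmt.parse("2015-02-01"));
            cal.add(Calendar.MONTH, -2);                   // a negative amount subtracts
            System.out.println(fmt.format(cal.getTime())); // 2014-12-01
        }
    }

This mirrors what add_months('2015-02-01', -2) computes in the Hive query above.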

Lambda Architecture - Why batch layer

爷，独闯天下 submitted on 2019-12-04 14:55:22
I am going through the Lambda architecture and trying to understand how it can be used to build fault-tolerant big data systems. I am wondering how the batch layer is useful when everything can be stored in the real-time view and the results generated from it. Is it because real-time storage can't be used to store all of the data, since then it wouldn't be real-time, as the time taken to retrieve the data depends on how much data has been stored? Why a batch layer? To save time and money! It basically has two functions: to manage the master dataset (assumed to be immutable), and to pre-compute the batch …