bigdata

Skewed tables in Hive

笑着哭i submitted on 2019-12-04 20:46:55
Question: I am learning Hive and came across skewed tables. Help me understand them. What are skewed tables in Hive? How do we create skewed tables? How do they affect performance? Answer 1: What are skewed tables in Hive? A skewed table is a special type of table where the values that appear very often (heavy skew) are split out into separate files and the rest of the values go to some other file. How do we create skewed tables? create table <T> (schema) skewed by (keys) on ('value1', 'value2') [STORED as …
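A hedged sketch of how such a table might be created, issued here through the Hive JDBC driver; the connection URL, table name, columns, and skewed values are placeholders, not taken from the answer:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateSkewedTable {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Assumes an unsecured HiveServer2 instance at a placeholder address.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // '1' and '5' stand in for the heavily skewed key values. With
                // STORED AS DIRECTORIES, Hive list-buckets the table, keeping the
                // skewed values in their own subdirectories.
                stmt.execute(
                    "CREATE TABLE t (key STRING, value STRING) " +
                    "SKEWED BY (key) ON ('1', '5') " +
                    "STORED AS DIRECTORIES");
            }
        }
    }

Queries that filter on one of the listed values can then read only that value's files, which is where the performance benefit is intended to come from.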

Package bigmemory not installing on R 64 3.0.2 [duplicate]

江枫思渺然 submitted on 2019-12-04 20:23:35
This question already has answers here: How should I deal with “package 'xxx' is not available (for R version x.y.z)” warning? (15 answers) Closed 5 years ago. I am trying to install the bigmemory package in R 64 version 3.0.2 on Windows. I get the following error: install.packages('bigmemory') Installing package into ‘C:/Users/Audrey/Documents/R/win-library/3.0’ (as ‘lib’ is unspecified) Warning message: package ‘bigmemory’ is not available (for R version 3.0.2) > library(bigmemory) Error in library(bigmemory) : there is no package called ‘bigmemory’ Any help or insight will be much …

How to load xls data from multiple xls file into hive?

旧巷老猫 submitted on 2019-12-04 19:40:39
I am learning to use Hadoop for performing Big Data related operations. I need to perform some queries on a collection of data sets split across 8 xls files. Each xls file has multiple sheets and the query concerns only one of the sheets. The dataset can be downloaded here: http://www.census.gov/hhes/www/hlthins/data/utilization/tables.html I am not using any commercial distro of Hadoop for my tasks; I just have one master and one slave VM set up in VMware with Hadoop, Hive and Pig on them. I am a novice with Hadoop and Big Data, so if anyone could guide me with how to proceed further I'd be very …
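The excerpt stops before any answer; as one possible direction (my sketch, not from the post), each .xls sheet can be exported to CSV so that Hive can load it into an ordinary delimited table. This assumes the Apache POI library and a placeholder file name:

    import java.io.File;
    import java.io.PrintWriter;
    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.DataFormatter;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.ss.usermodel.WorkbookFactory;

    public class XlsToCsv {
        public static void main(String[] args) throws Exception {
            DataFormatter formatter = new DataFormatter();
            try (Workbook wb = WorkbookFactory.create(new File("tables.xls")); // placeholder name
                 PrintWriter out = new PrintWriter("tables.csv")) {
                Sheet sheet = wb.getSheetAt(0); // the one sheet the query actually needs
                for (Row row : sheet) {
                    StringBuilder line = new StringBuilder();
                    for (Cell cell : row) {
                        if (line.length() > 0) line.append(',');
                        // Note: no CSV quoting here, so cells containing commas
                        // would need extra handling in a real conversion.
                        line.append(formatter.formatCellValue(cell));
                    }
                    out.println(line);
                }
            }
        }
    }

The resulting CSV files could then be copied to HDFS and loaded with LOAD DATA (or LOAD DATA LOCAL) into a table declared ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.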

C++ buffered file reading

会有一股神秘感。 submitted on 2019-12-04 16:56:52
I wonder if reading a large text file line by line (e.g., std::getline or fgets) can be buffered with a predefined read buffer size, or whether one must use special byte-wise functions? I mean reading very large files while optimizing the number of I/O operations (e.g., reading 32 MB from the HDD at a time). Of course I can handcraft buffered reading, but I thought standard file streams had that capability. Neither line-by-line nor special byte-wise functions. Instead, the following should do your job: std::ifstream file("input.txt"); std::istream_iterator<char> begin(file), end; std::vector<char> buffer …

How to perform Standard Deviation and Mean operations on a Java Spark RDD?

柔情痞子 submitted on 2019-12-04 16:54:36
I have a JavaRDD which looks like this: [ [A,8] [B,3] [C,5] [A,2] [B,8] ... ... ] I want my result to be the mean per key: [ [A,5] [B,5.5] [C,5] ] How do I do this using Java RDDs only? P.S.: I want to avoid the groupBy operation, so I am not using DataFrames. Here you go: import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.util.StatCounter; import scala.Tuple2; import scala.Tuple3; import java.util.Arrays; import java.util.List; public class …
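The answer's code is cut off right after its imports; those imports suggest Spark's StatCounter, so here is a hedged sketch along the same lines (the sample data is parallelized in the driver purely for illustration):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.util.StatCounter;
    import scala.Tuple2;

    public class KeyedStats {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("KeyedStats").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                List<Tuple2<String, Double>> data = Arrays.asList(
                    new Tuple2<>("A", 8.0), new Tuple2<>("B", 3.0), new Tuple2<>("C", 5.0),
                    new Tuple2<>("A", 2.0), new Tuple2<>("B", 8.0));
                JavaPairRDD<String, Double> pairs = sc.parallelizePairs(data);

                // Fold every value into a per-key StatCounter; the merge is
                // associative, so no groupBy is needed and values combine map-side.
                JavaPairRDD<String, StatCounter> stats = pairs.aggregateByKey(
                    new StatCounter(),
                    (acc, v) -> acc.merge(v),
                    (acc1, acc2) -> acc1.merge(acc2));

                // Both mean and standard deviation come out of the same counters.
                stats.mapValues(StatCounter::mean).collect()
                     .forEach(t -> System.out.println(t._1() + " mean=" + t._2()));
                stats.mapValues(StatCounter::stdev).collect()
                     .forEach(t -> System.out.println(t._1() + " stdev=" + t._2()));
            }
        }
    }

With the sample data this prints a mean of 5.0 for A, 5.5 for B and 5.0 for C, matching the expected output in the question.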

Sqoop import job fails due to task timeout

微笑、不失礼 submitted on 2019-12-04 16:03:22
I was trying to import a 1 TB table from MySQL to HDFS using Sqoop. The command used was: sqoop import --connect jdbc:mysql://xx.xx.xxx.xx/MyDB --username myuser --password mypass --table mytable --split-by rowkey -m 14 After executing the bounding vals query, all the mappers start, but after some time the tasks get killed due to a timeout (1200 seconds). This, I think, is because the SELECT query running in each mapper takes longer than the configured timeout (in Sqoop it seems to be 1200 seconds); the mapper therefore fails to report status, and the task subsequently gets …
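Not from the excerpt, but one commonly suggested direction for this failure mode is to raise the MapReduce task timeout so the long-running SELECT has time to return. The sketch below sets it on a plain Hadoop Configuration object; with the sqoop command line the analogous setting would be passed as a generic option (-D mapreduce.task.timeout=...) right after "import". The 3,600,000 ms value is only an example:

    import org.apache.hadoop.conf.Configuration;

    public class TimeoutConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Give each task an hour before the framework kills it for not
            // reporting status; the value is an illustration, not a recommendation.
            conf.setLong("mapreduce.task.timeout", 3600000L);
            System.out.println(conf.get("mapreduce.task.timeout"));
        }
    }

Choosing a --split-by column backed by an index would likely attack the underlying cause (slow range queries per mapper) rather than the symptom.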

custom inputformat for reading json in hadoop

旧巷老猫 submitted on 2019-12-04 16:02:20
I am a beginner with Hadoop. I have been told to create a custom InputFormat class to read JSON data. I have googled and learnt how to create a custom InputFormat class to read data from a file, but I am stuck on parsing the JSON data. My JSON data looks like this: [ { "_count": 30, "_start": 0, "_total": 180, "values": [ { "attachment": { "contentDomain": "techcarnival2013.eventbrite.com", "contentUrl": "http://techcarnival2013.eventbrite.com/", "imageUrl": "http://ebmedia.eventbrite.com/s3-s3/static/images/django/logos/eb_home_tm-trans-fb.png", "summary": "Get to know a few thousand of …
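The excerpt stops mid-document, so below is only a hedged sketch of the parsing step the question is stuck on, not a full InputFormat. It assumes the Jackson databind library (the post does not name one) and that each record handed to it is a complete JSON document:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonValueParser {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Walks the sample structure: a top-level array whose objects hold a
        // "values" array, each entry carrying an "attachment" object.
        public static void parse(String json) throws Exception {
            JsonNode root = MAPPER.readTree(json);
            for (JsonNode record : root) {
                for (JsonNode value : record.path("values")) {
                    String url = value.path("attachment").path("contentUrl").asText();
                    System.out.println(url);
                }
            }
        }
    }

Inside a custom RecordReader, the same readTree call would be applied to whatever chunk of the file the reader hands out as a single record.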

spark unix_timestamp data type mismatch

99封情书 submitted on 2019-12-04 15:49:48
Could someone help guide me on what data type or format I need to pass for the Spark from_unixtime() function to work? When I try the following it works, but it responds with the expression name rather than the current timestamp: from_unixtime(current_timestamp()) The response is below: fromunixtime(currenttimestamp(),yyyy-MM-dd HH:mm:ss) When I try to input from_unixtime(1392394861,"yyyy-MM-dd HH:mm:ss.SSSS") it simply fails with a type mismatch: error: type mismatch; found : Int(1392394861) required: org.apache.spark.sql.Column from_unixtime(1392394861,"yyyy-MM-dd HH:mm:ss.SSSS") What am I missing? I've …
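The error message itself points at the fix: from_unixtime expects a Column as its first argument, so a literal epoch value has to be wrapped with lit(). The question's output looks like it comes from the Scala shell, so the Java sketch below is a translation rather than the asker's code:

    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.from_unixtime;
    import static org.apache.spark.sql.functions.lit;

    public class FromUnixtimeExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("FromUnixtimeExample")
                    .master("local[*]")
                    .getOrCreate();

            // lit() turns the integer literal into a Column, which is exactly what
            // "required: org.apache.spark.sql.Column" is asking for.
            Column ts = from_unixtime(lit(1392394861), "yyyy-MM-dd HH:mm:ss.SSSS");

            // Evaluating the Column inside a query prints the formatted date
            // instead of just the expression's name.
            spark.range(1).select(ts.alias("ts")).show(false);
            spark.stop();
        }
    }

Note that from_unixtime takes seconds since the epoch, and that a Column printed on its own only shows its expression string, which is why the first attempt appeared to return fromunixtime(currenttimestamp(),...) rather than a value.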

How to subtract months from date in HIVE

五迷三道 submitted on 2019-12-04 15:36:16
I am looking for a method that helps me subtract months from a date in Hive. I have a date, 2015-02-01. Now I need to subtract 2 months from this date so that the result is 2014-12-01. Can you guys help me out here? Manoj R: select add_months('2015-02-01',-2); if you need to go back to the first day of the resulting month: select add_months(trunc('2015-02-01','MM'),-2); Please try the add_months date function and pass -2 as the months argument. Internally add_months uses the Java Calendar.add method, which supports adding or subtracting (by passing a negative integer). https://cwiki.apache.org/confluence/display/Hive
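As a plain-Java illustration of the Calendar.add behaviour the answer refers to (my sketch, not part of the answer):

    import java.text.SimpleDateFormat;
    import java.util.Calendar;

    public class SubtractMonths {
        public static void main(String[] args) throws Exception {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            Calendar cal = Calendar.getInstance();
            cal.setTime(fmt.parse("2015-02-01"));
            cal.add(Calendar.MONTH, -2);                   // a negative amount subtracts
            System.out.println(fmt.format(cal.getTime())); // 2014-12-01
        }
    }

This mirrors what add_months('2015-02-01', -2) computes in the Hive query above.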

Lambda Architecture - Why batch layer

爷，独闯天下 submitted on 2019-12-04 14:55:22
I am going through the Lambda architecture and trying to understand how it can be used to build fault-tolerant big data systems. I am wondering how the batch layer is useful when everything can be stored in the real-time view and the results generated from it. Is it because real-time storage can't be used to store all of the data, since then it wouldn't be real-time, as the time taken to retrieve the data depends on how much data has been stored? Why a batch layer? To save time and money! It basically has two functions: to manage the master dataset (assumed to be immutable), and to pre-compute the batch …