bigdata

Oozie Sqoop action fails to import data to Hive

Submitted by 前提是你 on 2019-12-07 17:58:27
Question: I am facing an issue while executing an Oozie Sqoop action. In the logs I can see that Sqoop is able to import the data into a temp directory and that it then creates the Hive scripts to load that data. The action fails while importing the temp data into Hive, and the logs show no exception. Below is the Sqoop action I am using. <workflow-app name="testSqoopLoadWorkflow" xmlns="uri:oozie:workflow:0.4"> <credentials> <credential name='hive_credentials' type='hcat'> <property> <name>hcat.metastore.uri</name> <value>${HIVE_THRIFT_URL

Check the number of unique values in each column of a matrix in Spark

Submitted by 蓝咒 on 2019-12-07 16:31:13
Question: I have a CSV file currently stored as a DataFrame in Spark: scala> df res11: org.apache.spark.sql.DataFrame = [2013-03-25 12:49:36.000: string, OES_PSI603_EC1: string, 250.3315__SI: string, 250.7027__SI: string, 251.0738__SI: string, 251.4448__SI: string, 251.8159__SI: string, 252.1869__SI: string, 252.5579__SIF: string, 252.9288__SI: string, 253.2998__SIF: string, 253.6707__SIF: string, 254.0415__CI2: string, 254.4124__CI2: string, 254.7832__CI2: string, 255.154: string, 255.5248__NO: string
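The question text is cut off above, but for the task in the title — counting the distinct values in every column of a Spark DataFrame — a single aggregation with countDistinct is the usual approach. A minimal sketch, assuming df is the DataFrame shown above:

import org.apache.spark.sql.functions.{col, countDistinct}

// one row containing one distinct-count per column of df
val distinctCounts = df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
distinctCounts.show()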

Why do we need a coarse quantizer?

Submitted by 大憨熊 on 2019-12-07 16:13:27
In Product Quantization for Nearest Neighbor Search, when it comes to section IV.A, it says that they will use a coarse quantizer too (which, the way I see it, is just a much smaller product quantizer, smaller w.r.t. k, the number of centroids). I don't really get why this helps the search procedure, and the cause might be that I don't get the way they use it. Any ideas, please? As mentioned in the NON EXHAUSTIVE SEARCH section, approximate nearest neighbor search with product quantizers is fast and significantly reduces the memory requirements for storing the descriptors.
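For reference, the coarse quantizer in that paper is not used on its own but combined with the product quantizer in the IVFADC scheme: the coarse quantizer q_c partitions the space into a relatively small number of cells, and the product quantizer q_p then encodes only the residual of each vector within its cell. Roughly, as a sketch of that formulation rather than a quote from the paper:

r(x) = x - q_c(x)                                (residual after coarse quantization)
x is stored as ( id of q_c(x), q_p(r(x)) )        (inverted-list id plus the coded residual)
d(y, x) ≈ || (y - q_c(x)) - q_p(r(x)) ||          (distance estimated at query time)

At query time only the few cells whose coarse centroids are closest to the query are scanned, which is what makes the search non-exhaustive, and quantizing residuals rather than raw vectors keeps the product-quantization error small even though the coarse codebook is tiny.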

Split a dataset by rows into smaller files in R

Submitted by ▼魔方 西西 on 2019-12-07 15:37:52
Question: I am analyzing a dataset of 1.14 GB (1,232,705,653 bytes). When reading the data in R with trade = read.csv("commodity_trade_statistics_data.csv") one can see that it has 8,225,871 rows and 10 attributes. Since I intend to analyze the dataset with a data-wrangling web app that limits imports to 100 MB, I am wondering how I can split the data into files of at most 100 MB each. The split should be by rows, and each file should contain the header. Answer 1: Split up the
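The answer above is truncated. As a language-neutral illustration of the idea it presumably develops — write the rows out in chunks and repeat the header at the top of every chunk — here is a sketch in Scala (the language used elsewhere in this digest); the 100 MB threshold, the output file names, and the use of character counts as an approximation of bytes are all assumptions:

import java.io.{BufferedWriter, FileWriter}
import scala.io.Source

val maxBytes = 100L * 1024 * 1024                       // target size per output file
val src      = Source.fromFile("commodity_trade_statistics_data.csv")
val lines    = src.getLines()
val header   = lines.next()

var part = 0
var written = maxBytes                                  // forces the first file to be opened
var out: BufferedWriter = null

for (line <- lines) {
  if (written + line.length + 1 > maxBytes) {           // start a new chunk, header first
    if (out != null) out.close()
    part += 1
    out = new BufferedWriter(new FileWriter(f"trade_part_$part%03d.csv"))
    out.write(header); out.newLine()
    written = header.length + 1L
  }
  out.write(line); out.newLine()
  written += line.length + 1
}
if (out != null) out.close()
src.close()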

Spark read.json throwing java.io.IOException: Too many bytes before newline

Submitted by Deadly on 2019-12-07 14:42:43
Question: I am getting the following error when reading a large 6 GB single-line JSON file: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648 Spark does not read JSON files with new lines, hence the entire 6 GB JSON file is on a single line: jf = sqlContext.read.json("jlrn2.json") Configuration: spark.driver.memory 20g Answer 1: Yep, you have more than Integer.MAX
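The answer is cut off at Integer.MAX; the number in the exception, 2147483648, is one more than Integer.MAX_VALUE, so the underlying line-based record reader cannot buffer a single record of that size. If upgrading is possible, Spark 2.2+ can read each file as one JSON document instead of relying on newline-delimited records; a minimal Scala sketch, assuming a SparkSession named spark and the same file name:

// Spark 2.2+: treat each file as a single JSON document
val jf = spark.read
  .option("multiLine", true)
  .json("jlrn2.json")
jf.printSchema()

Note that with multiLine a whole file is handled by one task, so a 6 GB document still needs a correspondingly large executor; converting the source to newline-delimited JSON (one record per line) remains the more scalable layout.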

Element-wise mean of several big.matrix objects in R

Submitted by 狂风中的少年 on 2019-12-07 13:55:03
Question: I have 17 file-backed big.matrix objects (dim 10985 x 52598, 4.3 GB each) for which I would like to calculate the element-wise mean. The result can be stored in another big.matrix (gcm.res.outputM). biganalytics::apply() doesn't work, as its MARGIN can be set to 1 or 2 only. I tried to use two for loops as shown here: gcm.res.outputM <- filebacked.big.matrix(10958, 52598, separated = FALSE, backingfile = "gcm.res.outputM.bin", backingpath = NULL, descriptorfile = "gcm.res.outputM.desc",
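The R code above is cut off. Independent of the bigmemory specifics, the underlying pattern is to stream the matrices in fixed-size chunks, keep a running sum, and divide by the number of matrices, so memory use stays constant. Below is a generic sketch of that pattern in Scala over flat binary files of doubles; the file names, byte order (big-endian here), and storage layout are assumptions, and this is not a reader for big.matrix backing files:

import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream, FileInputStream, FileOutputStream}

val inputPaths   = (1 to 17).map(i => s"matrix_$i.bin")   // hypothetical input files
val totalDoubles = 10985L * 52598L                        // dimensions from the question
val chunkLen     = 1 << 20                                // doubles processed per pass

val ins = inputPaths.map(p => new DataInputStream(new BufferedInputStream(new FileInputStream(p))))
val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream("mean.bin")))

var done = 0L
while (done < totalDoubles) {
  val n   = math.min(chunkLen.toLong, totalDoubles - done).toInt
  val acc = new Array[Double](n)
  for (in <- ins) {                                       // accumulate this chunk from every matrix
    var i = 0
    while (i < n) { acc(i) += in.readDouble(); i += 1 }
  }
  var j = 0
  while (j < n) { out.writeDouble(acc(j) / ins.size); j += 1 }
  done += n
}
ins.foreach(_.close())
out.close()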

How to use Spark SQL to parse the JSON array of objects

Submitted by 五迷三道 on 2019-12-07 11:55:34
Question: I now have JSON data as follows: {"Id":11,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000},{"package":"com.browser7","activetime":1205000}]} {"Id":12,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000}]} ...... This JSON records the activation time of apps; the purpose is to analyze the total activation time of each app. I use Spark SQL to parse the JSON (Scala): val sqlContext = sc.sqlContext val
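The Scala snippet is cut off above. One common way to get the total per package (a sketch, assuming Spark 2.x with a SparkSession named spark and that the records are stored one JSON object per line in a file such as apps.json, a hypothetical path) is to let Spark infer the nested schema, explode the data array, and aggregate:

import org.apache.spark.sql.functions.{col, explode, sum}

val apps = spark.read.json("apps.json")
val totals = apps
  .select(explode(col("data")).as("d"))                  // one row per array element
  .groupBy(col("d.package"))
  .agg(sum(col("d.activetime")).as("total_activetime"))
totals.show()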

(R error) Error: cons memory exhausted (limit reached?)

Submitted by 眉间皱痕 on 2019-12-07 11:38:52
Question: I am working with big data and I have a 70 GB JSON file. I am using the jsonlite library to load the file into memory. I tried an AWS EC2 x1.16xlarge machine (976 GB RAM) to perform this load, but R breaks with the error Error: cons memory exhausted (limit reached?) after loading 1,116,500 records. Thinking that I did not have enough RAM, I tried to load the same JSON on a bigger EC2 machine with 1.95 TB of RAM. The process still broke after loading 1,116,500 records. I am using R version
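The question is cut off at the R version. Since the load fails at exactly the same record count on a machine with twice the RAM, the bottleneck looks like a per-process limit rather than physical memory. One workaround, plainly a different tool than jsonlite, is to parse the file with Spark, which streams and distributes the work; a minimal Scala sketch, assuming the 70 GB file is newline-delimited JSON (one record per line), a SparkSession named spark, and a hypothetical path:

// Distributed alternative to loading the whole file into one R session
val records = spark.read.json("/data/big.json")
records.printSchema()
println(records.count())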

Efficient operations on big non-sparse matrices in Matlab

Submitted by 五迷三道 on 2019-12-07 11:21:16
Question: I need to operate on big 3-dimensional non-sparse matrices in Matlab. Using pure vectorization gives a high computation time, so I have tried to split the operations into 10 blocks and then combine the results. I was surprised to see that pure vectorization does not scale well with the data size, as presented in the following figure. I include an example of the two approaches. % Parameters: M = 1e6; N = 50; L = 4; K = 10; % Method 1: Pure vectorization mat1 = randi(L,[M,N,L]); mat2 = repmat

HDFS as a volume in the Cloudera QuickStart Docker image

Submitted by 戏子无情 on 2019-12-07 08:12:31
Question: I am fairly new to both Hadoop and Docker. I have been working on extending the cloudera/quickstart Docker image's Dockerfile and wanted to mount a directory from the host and map it to an HDFS location, so that performance is increased and data persist locally. When I mount a volume anywhere with -v /localdir:/someDir everything works fine, but that's not my goal. When I do -v /localdir:/var/lib/hadoop-hdfs, both the datanode and namenode fail to start and I get: "cd /var/lib/hadoop-hdfs: