bigdata

How to output a file using tab delimiter in Netezza NZSQL

Submitted by 亡梦爱人 on 2019-12-12 11:18:51
Question: I am trying to output some files using the NZSQL CLI but am not able to produce tab-delimited files. Can somebody who has worked with Netezza share thoughts on the command below? Tried so far:

    nzsql -o sample.txt -F= -A -t -c "SELECT * FROM DW_ETL.USER WHERE datasliceid % 20 = 2 LIMIT 5;"

Answer 1: To specify tab as the delimiter, use $'\t' (the shell's ANSI-C quoting) in conjunction with the -F option:

    nzsql -o sample.txt -F $'\t' -A -t -c "SELECT * FROM DW_ETL.USER WHERE datasliceid % 20 = 2 LIMIT 5;"

This is documented in the nzsql

How to optimize the Spark code below (Scala)?

Submitted by 笑着哭i on 2019-12-12 10:15:05
Question: I have some huge files (19 GB, 40 GB, etc.). I need to execute the following algorithm on these files:

1. Read the file.
2. Sort it on the basis of one column.
3. Take the first 70% of the data: a) take all the distinct records of the subset of the columns, b) write them to a train file.
4. Take the last 30% of the data: a) take all the distinct records of the subset of the columns, b) write them to a test file.

I tried running the following code in Spark (using Scala).

    import scala.collection.mutable.ListBuffer
    import java.io
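
The original code is cut off above. Purely as an illustration of the steps described, here is a minimal Spark (Scala) sketch of the pipeline; the input/output paths and column names (sortCol, colA, colB) are placeholders, not taken from the question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object TrainTestSplitSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("train-test-split").getOrCreate()

        // Read and globally sort on one column (placeholder name "sortCol").
        val df = spark.read.option("header", "true").csv("/path/to/huge_file.csv")
        val sorted = df.orderBy(col("sortCol"))

        // Attach a stable position to every row; cache because it is reused below.
        val indexed = sorted.rdd.zipWithIndex().cache()
        val total   = indexed.count()
        val cutoff  = (total * 0.7).toLong

        // First 70% of the sorted rows -> train, last 30% -> test.
        val schema = sorted.schema
        val train = spark.createDataFrame(indexed.filter(_._2 < cutoff).map(_._1), schema)
        val test  = spark.createDataFrame(indexed.filter(_._2 >= cutoff).map(_._1), schema)

        // Distinct records over the subset of columns (placeholders), then write out.
        train.select("colA", "colB").distinct().write.mode("overwrite").csv("/path/to/train")
        test.select("colA", "colB").distinct().write.mode("overwrite").csv("/path/to/test")

        spark.stop()
      }
    }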

spark unix_timestamp data type mismatch

Submitted by 折月煮酒 on 2019-12-12 09:05:09
Question: Could someone help guide me on what data type or format I need to pass so that the Spark from_unixtime() function works? When I try the following, it runs but does not respond with the current timestamp:

    from_unixtime(current_timestamp())

The response is below:

    fromunixtime(currenttimestamp(), yyyy-MM-dd HH:mm:ss)

When I try to input

    from_unixtime(1392394861, "yyyy-MM-dd HH:mm:ss.SSSS")

the above simply fails with a type mismatch: error: type mismatch; found : Int(1392394861) required:
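
The entry is truncated before the answer. For illustration only, a hedged sketch of the usual fixes in the Scala DataFrame API: from_unixtime expects a Column of epoch seconds, so a bare Int literal must be wrapped in lit(), and rendering the current time as a string is more directly done with date_format (or via unix_timestamp()):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object FromUnixtimeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("from-unixtime-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // from_unixtime takes a Column of epoch seconds: wrap the literal in lit().
        Seq(1).toDF("dummy")
          .select(from_unixtime(lit(1392394861L), "yyyy-MM-dd HH:mm:ss.SSSS").as("ts"))
          .show(false)

        // current_timestamp() already returns a timestamp, not epoch seconds;
        // format it directly, or go through unix_timestamp(), which does return epoch seconds.
        Seq(1).toDF("dummy")
          .select(
            date_format(current_timestamp(), "yyyy-MM-dd HH:mm:ss").as("now_formatted"),
            from_unixtime(unix_timestamp()).as("now_via_unixtime"))
          .show(false)

        spark.stop()
      }
    }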

Oozie s3 as job folder

Submitted by 。_饼干妹妹 on 2019-12-12 07:24:34
Question: Oozie is failing with the following error when workflow.xml is provided from S3, but the same works when workflow.xml is provided from HDFS. The same setup worked with earlier versions of Oozie; has anything changed starting from Oozie version 4.3?

Env: HDP 3.1.0, Oozie 4.3.1

    oozie.service.HadoopAccessorService.supported.filesystems=*

Job.properties:

    nameNode=hdfs://ambari-master-1a.xdata.com:8020
    jobTracker=ambari-master-2a.xdata.com:8050
    queue=default
    #OOZIE job details
    basepath=s3a://mybucket/test/oozie

How to insert big data in Laravel?

Submitted by 我的梦境 on 2019-12-12 07:07:28
Question: I am using Laravel 5.6. My script to insert big data is like this:

    ...
    $insert_data = [];
    foreach ($json['value'] as $value) {
        $posting_date = Carbon::parse($value['Posting_Date']);
        $posting_date = $posting_date->format('Y-m-d');
        $data = [
            'item_no' => $value['Item_No'],
            'entry_no' => $value['Entry_No'],
            'document_no' => $value['Document_No'],
            'posting_date' => $posting_date,
            ....
        ];
        $insert_data[] = $data;
    }
    \DB::table('items_details')->insert($insert_data);

I have tried to insert 100 record

Generating a co-occurrence matrix in R on a LARGE dataset

Submitted by 妖精的绣舞 on 2019-12-12 06:13:12
Question: I'm trying to create a co-occurrence matrix in R on a very large dataset (26M lines) that looks basically like this:

    ID      Observation
    11000   ficus
    11112   cherry
    11112   ficus
    12223   juniper
    12223   olive
    12223   juniper
    12223   ficus
    12334   olive
    12334   cherry
    12334   olive
    ...     ...

And on for a long time. I want to consolidate the observations by ID and generate a co-occurrence matrix of observations observed by observer ID. I managed this on a subset of the data, but some of the stuff I did "manually" that it
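
The question asks for an R solution and is cut off before the code. Only to make the shape of the computation concrete (not as an answer to the R question), here is a small Spark (Scala) sketch of the same idea - deduplicate (ID, Observation) pairs, self-join on ID, then pivot the pair counts into a matrix:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object CooccurrenceSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cooccurrence-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // A tiny stand-in for the 26M-line (ID, Observation) table.
        val obs = Seq(
          (11000, "ficus"), (11112, "cherry"), (11112, "ficus"),
          (12223, "juniper"), (12223, "olive"), (12223, "ficus"),
          (12334, "olive"), (12334, "cherry")
        ).toDF("ID", "Observation").distinct()   // one row per (ID, Observation)

        // Self-join on ID: two observations co-occur when the same ID reported both.
        val pairs = obs.as("a")
          .join(obs.as("b"), $"a.ID" === $"b.ID")
          .select($"a.Observation".as("obs1"), $"b.Observation".as("obs2"))

        // Count each pair and pivot into a matrix layout (diagonal = per-item ID counts).
        val matrix = pairs.groupBy("obs1").pivot("obs2").count().na.fill(0)
        matrix.show(false)

        spark.stop()
      }
    }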

Clean one column from a long and big data set

Submitted by 拜拜、爱过 on 2019-12-12 04:59:30
Question: I am trying to clean only one column from long and big data sets. The data has 18 columns and more than 10k+ rows across hundreds of CSV files, of which I want to clean only one column.

Input fields (only a few from the long list):

    userLocation, userTimezone, Coordinates,
    India, Hawaii, {u'type': u'Point', u'coordinates': [73.8567, 18.5203]}
    California, USA , New Delhi, Ft. Sam Houston,Mountain Time (US & Canada),{u'type': u'Point', u'coordinates': [86.99643, 23.68088]}
    Kathmandu,Nepal, Kathmandu, {u

Hive - Bucketing and Partitioning

Submitted by 本小妞迷上赌 on 2019-12-12 04:52:25
Question: What should be the basis for deciding whether to use partitioning or bucketing on a set of columns in Hive? Suppose we have a huge data set where two columns are queried most often - so my obvious choice might be to partition based on these two columns. But if this would result in a huge number of small files created across a huge number of directories, then it would be a wrong decision to partition the data on these columns, and maybe bucketing would have been a
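
The question is cut off above. As a purely illustrative sketch (table and column names are placeholders, not from the question), the usual rule of thumb can be written down as DDL - partition on a low-cardinality column such as a date, bucket on a high-cardinality column such as an id - here issued through Spark with Hive support:

    import org.apache.spark.sql.SparkSession

    object HivePartitionBucketSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-ddl-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Partitioning creates one directory per distinct value, so keep the key
        // low-cardinality; bucketing hashes a high-cardinality key into a fixed
        // number of files, avoiding the small-files explosion.
        spark.sql("""
          CREATE TABLE IF NOT EXISTS events (
            user_id BIGINT,
            payload STRING
          )
          PARTITIONED BY (event_date STRING)      -- few distinct values -> one directory each
          CLUSTERED BY (user_id) INTO 64 BUCKETS  -- many distinct values -> hashed into 64 files
          STORED AS ORC
        """)

        spark.stop()
      }
    }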

KafkaSpout tuple replay throws null pointer exception

Submitted by 社会主义新天地 on 2019-12-12 04:31:36
Question: I am using Storm 1.0.1 and Kafka 0.10.0.0 with storm-kafka-client 1.0.3. Please find the code/configuration I have below.

    kafkaConsumerProps.put(KafkaSpoutConfig.Consumer.KEY_DESERIALIZER, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    kafkaConsumerProps.put(KafkaSpoutConfig.Consumer.VALUE_DESERIALIZER, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    KafkaSpoutStreams kafkaSpoutStreams = new KafkaSpoutStreamsNamedTopics.Builder(new Fields(fieldNames), topics)
        .build(

Result of hdfs dfs -ls command

Submitted by 只谈情不闲聊 on 2019-12-12 03:48:09
Question: When executing the hdfs dfs -ls command, I would like to know whether the result is all the files stored in the cluster or just the partitions on the node where it is executed. I'm a newbie in Hadoop and I'm having some problems searching for the partitions on each node. Thank you.

Answer 1: Regarding "...if the result are all the files stored in the cluster or...": what you see from the ls command is all the files stored in the cluster. More specifically, what you see is a bunch of file paths and names. These
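
The answer is truncated here. As an aside (not part of the original answer), the same cluster-wide view can be confirmed programmatically: a listing goes through the NameNode's namespace, not through any single DataNode. A minimal Scala sketch using the Hadoop FileSystem API, with a placeholder path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ListHdfsDir {
      def main(args: Array[String]): Unit = {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        val conf = new Configuration()
        val fs = FileSystem.get(conf)

        // Like `hdfs dfs -ls`, this asks the NameNode for namespace entries,
        // so it lists files across the whole cluster regardless of which node runs it.
        val statuses = fs.listStatus(new Path("/user/someuser"))
        statuses.foreach(s => println(s"${s.getLen}\t${s.getPath}"))

        fs.close()
      }
    }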