bigdata

How to output a file using tab delimiter in Netezza NZSQL

Submitted by 亡梦爱人 on 2019-12-12 11:18:51
Question: I am trying to output some files using the NZSQL CLI but am not able to produce tab-delimited files. Can somebody who has worked with Netezza share thoughts on the command below? Tried so far:

    nzsql -o sample.txt -F= -A -t -c "SELECT * FROM DW_ETL.USER WHERE datasliceid % 20 = 2 LIMIT 5;"

Answer 1: To specify tab as the delimiter, use $'\t' (the shell's ANSI-C quoting) in conjunction with the -F option:

    nzsql -o sample.txt -F $'\t' -A -t -c "SELECT * FROM DW_ETL.USER WHERE datasliceid % 20 = 2 LIMIT 5;"

This is documented in the nzsql

How to optimize the Spark code below (Scala)?

Submitted by 笑着哭i on 2019-12-12 10:15:05
Question: I have some huge files (19 GB, 40 GB, etc.). I need to execute the following algorithm on these files:

1. Read the file.
2. Sort it on the basis of one column.
3. Take the first 70% of the data: a) take all the distinct records of the subset of the columns, b) write them to a train file.
4. Take the last 30% of the data: a) take all the distinct records of the subset of the columns, b) write them to a test file.

I tried running the following code in Spark (using Scala).

    import scala.collection.mutable.ListBuffer
    import java.io
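
The original code is cut off above. Purely as an illustration of the steps described, here is a minimal Spark (Scala) sketch of the pipeline; the input/output paths and column names (sortCol, colA, colB) are placeholders, not taken from the question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object TrainTestSplitSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("train-test-split").getOrCreate()

        // Read and globally sort on one column (placeholder name "sortCol").
        val df = spark.read.option("header", "true").csv("/path/to/huge_file.csv")
        val sorted = df.orderBy(col("sortCol"))

        // Attach a stable position to every row; cache because it is reused below.
        val indexed = sorted.rdd.zipWithIndex().cache()
        val total   = indexed.count()
        val cutoff  = (total * 0.7).toLong

        // First 70% of the sorted rows -> train, last 30% -> test.
        val schema = sorted.schema
        val train = spark.createDataFrame(indexed.filter(_._2 < cutoff).map(_._1), schema)
        val test  = spark.createDataFrame(indexed.filter(_._2 >= cutoff).map(_._1), schema)

        // Distinct records over the subset of columns (placeholders), then write out.
        train.select("colA", "colB").distinct().write.mode("overwrite").csv("/path/to/train")
        test.select("colA", "colB").distinct().write.mode("overwrite").csv("/path/to/test")

        spark.stop()
      }
    }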

spark unix_timestamp data type mismatch

Submitted by 折月煮酒 on 2019-12-12 09:05:09
Question: Could someone help guide me on what data type or format I need to pass so that the Spark from_unixtime() function works? When I try the following, it runs but does not respond with the current timestamp:

    from_unixtime(current_timestamp())

The response is below:

    fromunixtime(currenttimestamp(), yyyy-MM-dd HH:mm:ss)

When I try to input

    from_unixtime(1392394861, "yyyy-MM-dd HH:mm:ss.SSSS")

the above simply fails with a type mismatch: error: type mismatch; found : Int(1392394861) required:
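
The entry is truncated before the answer. For illustration only, a hedged sketch of the usual fixes in the Scala DataFrame API: from_unixtime expects a Column of epoch seconds, so a bare Int literal must be wrapped in lit(), and rendering the current time as a string is more directly done with date_format (or via unix_timestamp()):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object FromUnixtimeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("from-unixtime-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // from_unixtime takes a Column of epoch seconds: wrap the literal in lit().
        Seq(1).toDF("dummy")
          .select(from_unixtime(lit(1392394861L), "yyyy-MM-dd HH:mm:ss.SSSS").as("ts"))
          .show(false)

        // current_timestamp() already returns a timestamp, not epoch seconds;
        // format it directly, or go through unix_timestamp(), which does return epoch seconds.
        Seq(1).toDF("dummy")
          .select(
            date_format(current_timestamp(), "yyyy-MM-dd HH:mm:ss").as("now_formatted"),
            from_unixtime(unix_timestamp()).as("now_via_unixtime"))
          .show(false)

        spark.stop()
      }
    }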

Oozie s3 as job folder

Submitted by 。_饼干妹妹 on 2019-12-12 07:24:34
Question: Oozie is failing with the following error when workflow.xml is provided from S3, but the same works when workflow.xml is provided from HDFS. The same setup worked with earlier versions of Oozie; has anything changed starting from Oozie version 4.3?

Env: HDP 3.1.0, Oozie 4.3.1

    oozie.service.HadoopAccessorService.supported.filesystems=*

Job.properties:

    nameNode=hdfs://ambari-master-1a.xdata.com:8020
    jobTracker=ambari-master-2a.xdata.com:8050
    queue=default
    #OOZIE job details
    basepath=s3a://mybucket/test/oozie

How to insert big data in Laravel?

Submitted by 我的梦境 on 2019-12-12 07:07:28
Question: I am using Laravel 5.6. My script to insert big data is like this:

    ...
    $insert_data = [];
    foreach ($json['value'] as $value) {
        $posting_date = Carbon::parse($value['Posting_Date']);
        $posting_date = $posting_date->format('Y-m-d');
        $data = [
            'item_no' => $value['Item_No'],
            'entry_no' => $value['Entry_No'],
            'document_no' => $value['Document_No'],
            'posting_date' => $posting_date,
            ....
        ];
        $insert_data[] = $data;
    }
    \DB::table('items_details')->insert($insert_data);

I have tried to insert 100 record

Generating a co-occurrence matrix in R on a LARGE dataset

Submitted by 妖精的绣舞 on 2019-12-12 06:13:12
Question: I'm trying to create a co-occurrence matrix in R on a very large dataset (26M lines) that looks basically like this:

    ID      Observation
    11000   ficus
    11112   cherry
    11112   ficus
    12223   juniper
    12223   olive
    12223   juniper
    12223   ficus
    12334   olive
    12334   cherry
    12334   olive
    ...     ...

And on for a long time. I want to consolidate the observations by ID and generate a co-occurrence matrix of observations observed by observer ID. I managed this on a subset of the data, but some of the stuff I did "manually" that it
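
The question asks for an R solution and is cut off before the code. Only to make the shape of the computation concrete (not as an answer to the R question), here is a small Spark (Scala) sketch of the same idea - deduplicate (ID, Observation) pairs, self-join on ID, then pivot the pair counts into a matrix:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object CooccurrenceSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cooccurrence-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // A tiny stand-in for the 26M-line (ID, Observation) table.
        val obs = Seq(
          (11000, "ficus"), (11112, "cherry"), (11112, "ficus"),
          (12223, "juniper"), (12223, "olive"), (12223, "ficus"),
          (12334, "olive"), (12334, "cherry")
        ).toDF("ID", "Observation").distinct()   // one row per (ID, Observation)

        // Self-join on ID: two observations co-occur when the same ID reported both.
        val pairs = obs.as("a")
          .join(obs.as("b"), $"a.ID" === $"b.ID")
          .select($"a.Observation".as("obs1"), $"b.Observation".as("obs2"))

        // Count each pair and pivot into a matrix layout (diagonal = per-item ID counts).
        val matrix = pairs.groupBy("obs1").pivot("obs2").count().na.fill(0)
        matrix.show(false)

        spark.stop()
      }
    }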

Clean one column from a long and big data set

Submitted by 拜拜、爱过 on 2019-12-12 04:59:30
Question: I am trying to clean only one column from long and big data sets. The data has 18 columns and more than 10k+ rows across hundreds of CSV files, of which I want to clean only one column.

Input fields (only a few from the long list):

    userLocation, userTimezone, Coordinates,
    India, Hawaii, {u'type': u'Point', u'coordinates': [73.8567, 18.5203]}
    California, USA , New Delhi, Ft. Sam Houston,Mountain Time (US & Canada),{u'type': u'Point', u'coordinates': [86.99643, 23.68088]}
    Kathmandu,Nepal, Kathmandu, {u

Hive - Bucketing and Partitioning

Submitted by 本小妞迷上赌 on 2019-12-12 04:52:25
Question: What should be the basis for deciding whether to use partitioning or bucketing on a set of columns in Hive? Suppose we have a huge data set where two columns are queried most often - so my obvious choice might be to partition based on these two columns. But if this would result in a huge number of small files created across a huge number of directories, then it would be a wrong decision to partition the data on these columns, and maybe bucketing would have been a
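
The question is cut off above. As a purely illustrative sketch (table and column names are placeholders, not from the question), the usual rule of thumb can be written down as DDL - partition on a low-cardinality column such as a date, bucket on a high-cardinality column such as an id - here issued through Spark with Hive support:

    import org.apache.spark.sql.SparkSession

    object HivePartitionBucketSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-ddl-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Partitioning creates one directory per distinct value, so keep the key
        // low-cardinality; bucketing hashes a high-cardinality key into a fixed
        // number of files, avoiding the small-files explosion.
        spark.sql("""
          CREATE TABLE IF NOT EXISTS events (
            user_id BIGINT,
            payload STRING
          )
          PARTITIONED BY (event_date STRING)      -- few distinct values -> one directory each
          CLUSTERED BY (user_id) INTO 64 BUCKETS  -- many distinct values -> hashed into 64 files
          STORED AS ORC
        """)

        spark.stop()
      }
    }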

KafkaSpout tuple replay throws null pointer exception

Submitted by 社会主义新天地 on 2019-12-12 04:31:36
Question: I am using Storm 1.0.1 and Kafka 0.10.0.0 with storm-kafka-client 1.0.3. Please find the code/configuration I have below.

    kafkaConsumerProps.put(KafkaSpoutConfig.Consumer.KEY_DESERIALIZER, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    kafkaConsumerProps.put(KafkaSpoutConfig.Consumer.VALUE_DESERIALIZER, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    KafkaSpoutStreams kafkaSpoutStreams = new KafkaSpoutStreamsNamedTopics.Builder(new Fields(fieldNames), topics)
        .build(

Result of hdfs dfs -ls command

Submitted by 只谈情不闲聊 on 2019-12-12 03:48:09
Question: When executing the hdfs dfs -ls command, I would like to know whether the result is all the files stored in the cluster or just the partitions on the node where it is executed. I'm a newbie in Hadoop and I'm having some problems searching for the partitions on each node. Thank you.

Answer 1: Regarding "...if the result are all the files stored in the cluster or...": what you see from the ls command is all the files stored in the cluster. More specifically, what you see is a bunch of file paths and names. These
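
The answer is truncated here. As an aside (not part of the original answer), the same cluster-wide view can be confirmed programmatically: a listing goes through the NameNode's namespace, not through any single DataNode. A minimal Scala sketch using the Hadoop FileSystem API, with a placeholder path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ListHdfsDir {
      def main(args: Array[String]): Unit = {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        val conf = new Configuration()
        val fs = FileSystem.get(conf)

        // Like `hdfs dfs -ls`, this asks the NameNode for namespace entries,
        // so it lists files across the whole cluster regardless of which node runs it.
        val statuses = fs.listStatus(new Path("/user/someuser"))
        statuses.foreach(s => println(s"${s.getLen}\t${s.getPath}"))

        fs.close()
      }
    }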