bigdata

Job queue for Hive action in Oozie

Submitted by 丶灬走出姿态 on 2019-11-28 02:21:33
Question: I have an Oozie workflow, and I submit all the Hive actions with
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
For a few Hive actions, however, the launched job does not run in the specified queue; it runs in the default queue. What causes this behaviour, and how can it be fixed?

Answer (Samson Scharfrichter): A. Oozie specifics. Oozie propagates the "regular" Hadoop properties to a "regular" MapReduce action. But for other types of action (Shell, Hive, Java, etc.), where Oozie runs a single mapper task in YARN, it does not consider that a real MapReduce job. Hence it uses a different …
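
The excerpt breaks off before the fix. As a sketch only (not the original answer's wording), assuming it goes on to describe Oozie's oozie.launcher.* prefix, which routes properties to the launcher job rather than to the child jobs, the Hive action's <configuration> block would set both the launcher queue and the queue used by the MapReduce jobs that Hive itself spawns:

    <configuration>
      <!-- queue for the single-mapper launcher job Oozie starts for the Hive action -->
      <property>
        <name>oozie.launcher.mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
      <!-- queue for the MapReduce jobs spawned by Hive -->
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>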

jq --stream filter on multiple values of same key

Submitted by ≯℡__Kan透↙ on 2019-11-28 02:17:57
Question: I am processing a very large JSON file in which I need to filter the inner JSON objects by the value of a key. My JSON looks like the following:
    {"userActivities":{"L3ATRosRdbDgSmX75Z":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate":"2018-01-20"},"L3ATSFGrpAYRkIIKqrh":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate":"2018-01-21"},"L3AVHvmReBBPNGluvHl":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate":"2018-01-22"},"L3AVIcqaDpZxLf6ispK":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate": …
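
The excerpt ends before any filter is shown. A minimal sketch of the streaming approach, assuming jq 1.5+, an input file named input.json, and that the goal is to keep activities whose localDate matches one of several values (the depth 2 matches the userActivities nesting shown above):

    jq -cn --stream '
      fromstream(2 | truncate_stream(inputs))
      | select(.localDate == "2018-01-20" or .localDate == "2018-01-21")
    ' input.json

Each inner activity object ({"deviceId": ..., "dow": ..., "localDate": ...}) is reconstructed and tested one at a time, so the whole document is never held in memory, which is the point of --stream.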

find all two-word phrases that appear in more than one row in a dataset

Submitted by 笑着哭i on 2019-11-28 01:54:19
Question: We would like to run a query that returns two-word phrases that appear in more than one row. For example, take the string "Data Ninja": since it appears in more than one row in our dataset, the query should return it. The query should find all such phrases across all rows by looking at every pair of adjacent words (forming a phrase) in the rows of the dataset we loaded into BigQuery. How can we …
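
The question is cut off, but the general shape of such a query in BigQuery Standard SQL is to split each row into words, join the word list to itself on adjacent positions, and keep phrases seen in more than one row. A sketch, assuming a hypothetical table my_dataset.docs with a row identifier doc_id and a STRING column text:

    WITH words AS (
      SELECT doc_id, word, pos
      FROM `my_dataset.docs`,
           UNNEST(SPLIT(LOWER(text), ' ')) AS word WITH OFFSET AS pos
    ),
    bigrams AS (
      SELECT w1.doc_id, CONCAT(w1.word, ' ', w2.word) AS phrase
      FROM words AS w1
      JOIN words AS w2
        ON w1.doc_id = w2.doc_id AND w2.pos = w1.pos + 1
    )
    SELECT phrase, COUNT(DISTINCT doc_id) AS rows_containing
    FROM bigrams
    GROUP BY phrase
    HAVING COUNT(DISTINCT doc_id) > 1
    ORDER BY rows_containing DESC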

How can I calculate exact median with Apache Spark?

Submitted by 寵の児 on 2019-11-27 23:35:12
Question: This page contains some statistics functions (mean, stdev, variance, etc.), but it does not include the median. How can I calculate the exact median? Thanks.

Answer: You need to sort the RDD and take the element in the middle, or the average of the two middle elements. Here is an example with RDD[Int]:
    import org.apache.spark.SparkContext._
    val rdd: RDD[Int] = ???
    val sorted = rdd.sortBy(identity).zipWithIndex().map { case (v, idx) => (idx, v) }
    val count = sorted.count()
    val median: Double = if (count % 2 == 0) {
      val l = count / 2 - 1
      val r = l + 1
      (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
    } else sorted.lookup …
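
The answer is truncated mid-expression. For orientation, a self-contained sketch of the same sort-and-lookup approach; the odd-count branch is my completion, not the original text:

    import org.apache.spark.rdd.RDD

    def exactMedian(rdd: RDD[Int]): Double = {
      // index every value after sorting so we can look up by position
      val sorted = rdd.sortBy(identity).zipWithIndex().map { case (v, idx) => (idx, v) }
      val count = sorted.count()
      if (count % 2 == 0) {
        val l = count / 2 - 1
        val r = l + 1
        (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
      } else {
        sorted.lookup(count / 2).head.toDouble
      }
    }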

Serious Memory Leak When Iteratively Parsing XML Files

Submitted by 三世轮回 on 2019-11-27 22:13:19
Question: Context: when iterating over a set of Rdata files (each containing a character vector of HTML code) that are loaded, analyzed (via XML functionality) and then removed from memory again, I see a significant increase in the R process's memory consumption, which eventually kills the process. It seems that freeing objects via free(), removing them via rm() and running gc() have no effect, so memory consumption accumulates until there is no memory left. EDIT 2012-02-13 23:30:00: thanks to valuable insight shared by the author and maintainer of the XML package, Duncan Temple Lang …
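
The excerpt is cut off before that insight is given. For orientation only, a minimal sketch of the kind of loop being described, with the explicit cleanup calls the poster mentions (the file and variable names here are assumptions):

    library(XML)
    for (f in rdata_files) {            # rdata_files: character vector of .Rdata paths (assumed)
      load(f)                           # assumed to create a character vector `html`
      doc <- htmlParse(paste(html, collapse = "\n"), asText = TRUE)
      ## ... extract data with xpathSApply(doc, ...) ...
      free(doc)                         # release the C-level document held by package XML
      rm(doc, html)
      gc()
    }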

Would Spark unpersist the RDD itself when it realizes it won't be used anymore?

Submitted by 你离开我真会死。 on 2019-11-27 20:28:15
Question: We can persist an RDD to memory and/or disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed? I notice that if I call the unpersist function myself, I get slower performance.

Answer: Yes, Apache Spark will unpersist the RDD when it is garbage collected. In RDD.persist you can see:
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
This puts a WeakReference to the RDD in a ReferenceQueue, leading to ContextCleaner.doCleanupRDD when the RDD is garbage collected.
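
If you want explicit control rather than waiting for the ContextCleaner, the usual pattern is a manual unpersist once the last action that needs the cached data has run. A small sketch (the RDD name is illustrative):

    import org.apache.spark.storage.StorageLevel

    val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)   // rdd: an existing RDD
    // ... run the actions that reuse `cached` ...
    cached.unpersist(blocking = false)   // asynchronous; a blocking unpersist can add a visible pause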

Spark parquet partitioning: Large number of files

Submitted by 自作多情 on 2019-11-27 18:01:28
Question: I am trying to leverage Spark partitioning. I was doing something like
    data.write.partitionBy("key").parquet("/location")
The issue is that each partition creates a huge number of Parquet files, which makes reads slow when I read from the root directory. To avoid that I tried
    data.coalesce(numPart).write.partitionBy("key").parquet("/location")
This, however, creates numPart Parquet files in each partition. Since my partition sizes differ, I would ideally like a separate coalesce per partition, but that does not look easy. I need to visit all …
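
The question is cut off, but a common way to end up with one file per key in each output directory is to repartition by the same column before writing. A sketch, not an answer quoted from the thread, assuming data is a DataFrame with a key column as above:

    import org.apache.spark.sql.functions.col

    data
      .repartition(col("key"))        // all rows with the same key land in a single task
      .write
      .partitionBy("key")
      .parquet("/location")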

HBase quickly count number of rows

Submitted by 允我心安 on 2019-11-27 17:58:20
Question: Right now I implement the row count over a ResultScanner like this:
    for (Result rs = scanner.next(); rs != null; rs = scanner.next()) { number++; }
When the data reaches millions of rows the computation takes a long time. I want to compute the count in (near) real time, and I don't want to use MapReduce. How can I quickly count the number of rows?

Answer (Basil Saju): Use RowCounter in HBase. RowCounter is a MapReduce job that counts all the rows of a table. It is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are concerns about metadata inconsistency. It will run the MapReduce all in a single process …
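
The answer is truncated; the RowCounter it refers to is launched from the command line. A sketch of the two usual quick options (the table name is a placeholder):

    # MapReduce-based counter shipped with HBase
    $ hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'my_table'

    # HBase shell count, scanning client-side with a larger cache
    hbase> count 'my_table', INTERVAL => 100000, CACHE => 10000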

is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server?

Submitted by 时光怂恿深爱的人放手 on 2019-11-27 17:24:23
Question: Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server? I want to import a big JSON file into es-server.

Answer: You should use the Bulk API. Note that you will need to add an action header line before each JSON document.
    $ cat requests
    { "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
    { "field1" : "value1" }
    $ curl -s -XPOST localhost:9200/_bulk --data-binary @requests; echo
    {"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version":1,"ok":true}}]}
As dadoonet already mentioned, the bulk API is probably the way to go. To transform your …
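
The excerpt stops mid-sentence. One common way to transform a plain JSON array into the bulk format is a small jq pass; a sketch, where docs.json is a placeholder file holding an array of documents, the index and type names simply mirror the example above, and Elasticsearch auto-generates the _id values:

    jq -c '.[] | {"index": {"_index": "test", "_type": "type1"}}, .' docs.json > requests
    curl -s -XPOST localhost:9200/_bulk -H 'Content-Type: application/x-ndjson' --data-binary @requests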

Recommended package for very large dataset processing and machine learning in R

Submitted by 纵然是瞬间 on 2019-11-27 17:10:24
Question: It seems that R is really designed to handle datasets it can pull entirely into memory. Which R packages are recommended for signal processing and machine learning on very large datasets that cannot be pulled into memory? If R is simply the wrong tool for this, I am open to other robust free suggestions (e.g. scipy, if there is some nice way to handle very large datasets there).

Answer: Have a look at the "Large memory and out-of-memory data" subsection of the High-Performance Computing task view on CRAN. bigmemory and ff are two popular packages. For bigmemory (and the related biganalytics, and …
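
The excerpt is cut off, but a minimal taste of the bigmemory approach it is introducing (file names are placeholders): a file-backed matrix is created once and then analysed without loading everything into RAM.

    library(bigmemory)
    library(biganalytics)

    x <- read.big.matrix("big.csv", header = TRUE, type = "double",
                         backingfile = "big.bin", descriptorfile = "big.desc")
    colmean(x)        # column means computed on the file-backed matrix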