bigdata

Job queue for Hive action in Oozie

Submitted by 丶灬走出姿态 on 2019-11-28 02:21:33
Question: I have an Oozie workflow, and I submit all the Hive actions with
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
For a few Hive actions, however, the launched job does not run in the specified queue; it runs in the default queue. What causes this behaviour, and how can it be fixed?

Answer (Samson Scharfrichter): A. Oozie specifics. Oozie propagates the "regular" Hadoop properties to a "regular" MapReduce action. But for other types of action (Shell, Hive, Java, etc.), where Oozie runs a single mapper task in YARN, it does not consider that a real MapReduce job. Hence it uses a different …
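
The excerpt breaks off before the fix. As a sketch only (not the original answer's wording), assuming it goes on to describe Oozie's oozie.launcher.* prefix, which routes properties to the launcher job rather than to the child jobs, the Hive action's <configuration> block would set both the launcher queue and the queue used by the MapReduce jobs that Hive itself spawns:

    <configuration>
      <!-- queue for the single-mapper launcher job Oozie starts for the Hive action -->
      <property>
        <name>oozie.launcher.mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
      <!-- queue for the MapReduce jobs spawned by Hive -->
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>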

jq --stream filter on multiple values of same key

Submitted by ≯℡__Kan透↙ on 2019-11-28 02:17:57
Question: I am processing a very large JSON file in which I need to filter the inner JSON objects by the value of a key. My JSON looks like the following:
    {"userActivities":{"L3ATRosRdbDgSmX75Z":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate":"2018-01-20"},"L3ATSFGrpAYRkIIKqrh":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate":"2018-01-21"},"L3AVHvmReBBPNGluvHl":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate":"2018-01-22"},"L3AVIcqaDpZxLf6ispK":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate": …
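
The excerpt ends before any filter is shown. A minimal sketch of the streaming approach, assuming jq 1.5+, an input file named input.json, and that the goal is to keep activities whose localDate matches one of several values (the depth 2 matches the userActivities nesting shown above):

    jq -cn --stream '
      fromstream(2 | truncate_stream(inputs))
      | select(.localDate == "2018-01-20" or .localDate == "2018-01-21")
    ' input.json

Each inner activity object ({"deviceId": ..., "dow": ..., "localDate": ...}) is reconstructed and tested one at a time, so the whole document is never held in memory, which is the point of --stream.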

find all two-word phrases that appear in more than one row in a dataset

Submitted by 笑着哭i on 2019-11-28 01:54:19
Question: We would like to run a query that returns two-word phrases that appear in more than one row. For example, take the string "Data Ninja": since it appears in more than one row in our dataset, the query should return it. The query should find all such phrases across all rows by looking at every pair of adjacent words (forming a phrase) in the rows of the dataset we loaded into BigQuery. How can we …
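
The question is cut off, but the general shape of such a query in BigQuery Standard SQL is to split each row into words, join the word list to itself on adjacent positions, and keep phrases seen in more than one row. A sketch, assuming a hypothetical table my_dataset.docs with a row identifier doc_id and a STRING column text:

    WITH words AS (
      SELECT doc_id, word, pos
      FROM `my_dataset.docs`,
           UNNEST(SPLIT(LOWER(text), ' ')) AS word WITH OFFSET AS pos
    ),
    bigrams AS (
      SELECT w1.doc_id, CONCAT(w1.word, ' ', w2.word) AS phrase
      FROM words AS w1
      JOIN words AS w2
        ON w1.doc_id = w2.doc_id AND w2.pos = w1.pos + 1
    )
    SELECT phrase, COUNT(DISTINCT doc_id) AS rows_containing
    FROM bigrams
    GROUP BY phrase
    HAVING COUNT(DISTINCT doc_id) > 1
    ORDER BY rows_containing DESC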

How can I calculate exact median with Apache Spark?

Submitted by 寵の児 on 2019-11-27 23:35:12
Question: This page contains some statistics functions (mean, stdev, variance, etc.), but it does not include the median. How can I calculate the exact median? Thanks.

Answer: You need to sort the RDD and take the element in the middle, or the average of the two middle elements. Here is an example with RDD[Int]:
    import org.apache.spark.SparkContext._
    val rdd: RDD[Int] = ???
    val sorted = rdd.sortBy(identity).zipWithIndex().map { case (v, idx) => (idx, v) }
    val count = sorted.count()
    val median: Double = if (count % 2 == 0) {
      val l = count / 2 - 1
      val r = l + 1
      (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
    } else sorted.lookup …
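
The answer is truncated mid-expression. For orientation, a self-contained sketch of the same sort-and-lookup approach; the odd-count branch is my completion, not the original text:

    import org.apache.spark.rdd.RDD

    def exactMedian(rdd: RDD[Int]): Double = {
      // index every value after sorting so we can look up by position
      val sorted = rdd.sortBy(identity).zipWithIndex().map { case (v, idx) => (idx, v) }
      val count = sorted.count()
      if (count % 2 == 0) {
        val l = count / 2 - 1
        val r = l + 1
        (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
      } else {
        sorted.lookup(count / 2).head.toDouble
      }
    }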

Serious Memory Leak When Iteratively Parsing XML Files

Submitted by 三世轮回 on 2019-11-27 22:13:19
Question: Context: when iterating over a set of Rdata files (each containing a character vector of HTML code) that are loaded, analyzed (via XML functionality) and then removed from memory again, I see a significant increase in the R process's memory consumption, which eventually kills the process. It seems that freeing objects via free(), removing them via rm() and running gc() have no effect, so memory consumption accumulates until there is no memory left. EDIT 2012-02-13 23:30:00: thanks to valuable insight shared by the author and maintainer of the XML package, Duncan Temple Lang …
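
The excerpt is cut off before that insight is given. For orientation only, a minimal sketch of the kind of loop being described, with the explicit cleanup calls the poster mentions (the file and variable names here are assumptions):

    library(XML)
    for (f in rdata_files) {            # rdata_files: character vector of .Rdata paths (assumed)
      load(f)                           # assumed to create a character vector `html`
      doc <- htmlParse(paste(html, collapse = "\n"), asText = TRUE)
      ## ... extract data with xpathSApply(doc, ...) ...
      free(doc)                         # release the C-level document held by package XML
      rm(doc, html)
      gc()
    }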

Would Spark unpersist the RDD itself when it realizes it won't be used anymore?

Submitted by 你离开我真会死。 on 2019-11-27 20:28:15
Question: We can persist an RDD to memory and/or disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed? I notice that if I call the unpersist function myself, I get slower performance.

Answer: Yes, Apache Spark will unpersist the RDD when it is garbage collected. In RDD.persist you can see:
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
This puts a WeakReference to the RDD in a ReferenceQueue, leading to ContextCleaner.doCleanupRDD when the RDD is garbage collected.
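
If you want explicit control rather than waiting for the ContextCleaner, the usual pattern is a manual unpersist once the last action that needs the cached data has run. A small sketch (the RDD name is illustrative):

    import org.apache.spark.storage.StorageLevel

    val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)   // rdd: an existing RDD
    // ... run the actions that reuse `cached` ...
    cached.unpersist(blocking = false)   // asynchronous; a blocking unpersist can add a visible pause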

Spark parquet partitioning: Large number of files

Submitted by 自作多情 on 2019-11-27 18:01:28
Question: I am trying to leverage Spark partitioning. I was doing something like
    data.write.partitionBy("key").parquet("/location")
The issue is that each partition creates a huge number of Parquet files, which makes reads slow when I read from the root directory. To avoid that I tried
    data.coalesce(numPart).write.partitionBy("key").parquet("/location")
This, however, creates numPart Parquet files in each partition. Since my partition sizes differ, I would ideally like a separate coalesce per partition, but that does not look easy. I need to visit all …
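
The question is cut off, but a common way to end up with one file per key in each output directory is to repartition by the same column before writing. A sketch, not an answer quoted from the thread, assuming data is a DataFrame with a key column as above:

    import org.apache.spark.sql.functions.col

    data
      .repartition(col("key"))        // all rows with the same key land in a single task
      .write
      .partitionBy("key")
      .parquet("/location")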

HBase quickly count number of rows

Submitted by 允我心安 on 2019-11-27 17:58:20
Question: Right now I implement the row count over a ResultScanner like this:
    for (Result rs = scanner.next(); rs != null; rs = scanner.next()) { number++; }
When the data reaches millions of rows the computation takes a long time. I want to compute the count in (near) real time, and I don't want to use MapReduce. How can I quickly count the number of rows?

Answer (Basil Saju): Use RowCounter in HBase. RowCounter is a MapReduce job that counts all the rows of a table. It is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are concerns about metadata inconsistency. It will run the MapReduce all in a single process …
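
The answer is truncated; the RowCounter it refers to is launched from the command line. A sketch of the two usual quick options (the table name is a placeholder):

    # MapReduce-based counter shipped with HBase
    $ hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'my_table'

    # HBase shell count, scanning client-side with a larger cache
    hbase> count 'my_table', INTERVAL => 100000, CACHE => 10000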

is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server?

Submitted by 时光怂恿深爱的人放手 on 2019-11-27 17:24:23
Question: Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server? I want to import a big JSON file into es-server.

Answer: You should use the Bulk API. Note that you will need to add an action header line before each JSON document.
    $ cat requests
    { "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
    { "field1" : "value1" }
    $ curl -s -XPOST localhost:9200/_bulk --data-binary @requests; echo
    {"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version":1,"ok":true}}]}
As dadoonet already mentioned, the bulk API is probably the way to go. To transform your …
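
The excerpt stops mid-sentence. One common way to transform a plain JSON array into the bulk format is a small jq pass; a sketch, where docs.json is a placeholder file holding an array of documents, the index and type names simply mirror the example above, and Elasticsearch auto-generates the _id values:

    jq -c '.[] | {"index": {"_index": "test", "_type": "type1"}}, .' docs.json > requests
    curl -s -XPOST localhost:9200/_bulk -H 'Content-Type: application/x-ndjson' --data-binary @requests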

Recommended package for very large dataset processing and machine learning in R

Submitted by 纵然是瞬间 on 2019-11-27 17:10:24
Question: It seems that R is really designed to handle datasets it can pull entirely into memory. Which R packages are recommended for signal processing and machine learning on very large datasets that cannot be pulled into memory? If R is simply the wrong tool for this, I am open to other robust free suggestions (e.g. scipy, if there is some nice way to handle very large datasets there).

Answer: Have a look at the "Large memory and out-of-memory data" subsection of the High-Performance Computing task view on CRAN. bigmemory and ff are two popular packages. For bigmemory (and the related biganalytics, and …
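
The excerpt is cut off, but a minimal taste of the bigmemory approach it is introducing (file names are placeholders): a file-backed matrix is created once and then analysed without loading everything into RAM.

    library(bigmemory)
    library(biganalytics)

    x <- read.big.matrix("big.csv", header = TRUE, type = "double",
                         backingfile = "big.bin", descriptorfile = "big.desc")
    colmean(x)        # column means computed on the file-backed matrix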