bigdata

Elasticsearch Mapping - Rename existing field

旧时模样 submitted on 2019-12-03 15:52:05
Is there any way I can rename a field in an existing Elasticsearch mapping without having to add a new field? If so, what's the best way to do it in order to avoid breaking the existing mapping? E.g. from fieldCamelcase to fieldCamelCase:

{
  "myType": {
    "properties": {
      "timestamp": { "type": "date", "format": "date_optional_time" },
      "fieldCamelcase": { "type": "string", "index": "not_analyzed" },
      "field_test": { "type": "double" }
    }
  }
}

You could do this by creating an ingest pipeline that contains a Rename processor, in combination with the Reindex API: PUT _ingest/pipeline/my_rename …
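A minimal sketch of that pipeline-plus-reindex approach, using Python's requests library against the REST API (the pipeline name my_rename comes from the answer above; the cluster URL and the source_index/dest_index names are placeholder assumptions):

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# 1. Create an ingest pipeline whose only processor renames the field.
requests.put(f"{ES}/_ingest/pipeline/my_rename", json={
    "description": "rename fieldCamelcase to fieldCamelCase",
    "processors": [
        {"rename": {"field": "fieldCamelcase", "target_field": "fieldCamelCase"}}
    ],
})

# 2. Reindex into a new index, passing every document through the pipeline.
#    The existing mapping cannot be changed in place; searches or aliases
#    are pointed at the new index afterwards.
requests.post(f"{ES}/_reindex", json={
    "source": {"index": "source_index"},                     # placeholder
    "dest": {"index": "dest_index", "pipeline": "my_rename"} # placeholder
})
```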

using clojure-csv.core to parse a huge csv file

折月煮酒 submitted on 2019-12-03 15:34:48
So far I have:

(:require [clojure-csv.core :as csv])
(:require [clojure.java.io :as io])

(def csv-file (.getFile (clojure.java.io/resource "verbs.csv")))

(defn process-csv [file]
  (with-open [rdr (io/reader file)]
    (csv/parse-csv rdr)))

But I am getting java.io.IOException: Stream closed. I am using clojure-csv, which exposes two methods; the first of them, parse-csv, is the one I am using, and its doc says: "Takes a CSV as a char sequence or string, and returns a lazy sequence of vectors of strings." What I think I know: with-open is lazy, and the rdr in (csv/parse-csv rdr) is a single line of the csv …
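The exception comes from a lazy value escaping the scope that owns the stream: with-open closes rdr as soon as the body returns, and the still-unrealized sequence from parse-csv then tries to read from a closed reader. A rough Python analogue of the same pitfall, as a sketch (verbs.csv follows the question; csv.reader plays the role of the lazy sequence):

```python
import csv

def lazy_rows(path):
    # Broken analogue: the lazy reader escapes the `with` block, and the file
    # is closed before any row is consumed -> "I/O operation on closed file".
    with open(path, newline="") as f:
        return csv.reader(f)

def eager_rows(path):
    # Fix: realize (or fully process) the rows while the file is still open --
    # the same idea as forcing the sequence inside with-open (e.g. with doall)
    # or reducing over it there instead of returning it.
    with open(path, newline="") as f:
        return list(csv.reader(f))

rows = eager_rows("verbs.csv")
```

For a file too large to hold in memory, the processing itself (not just the parse) has to happen inside the open scope, so that only the per-row results escape it.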

Flink Streaming: How to output one data stream to different outputs depending on the data?

时光毁灭记忆、已成空白 submitted on 2019-12-03 15:16:19
Question: In Apache Flink I have a stream of tuples. Let's assume a really simple Tuple1<String>. The tuple can have an arbitrary value in its value field (e.g. 'P1', 'P2', etc.). The set of possible values is finite, but I don't know the full set beforehand (so there could be a 'P362'). I want to write that tuple to a certain output location depending on the value inside the tuple, so e.g. I would like to have the following file structure: /output/P1, /output/P2, and so on. In the documentation I only found …
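Outside of any Flink-specific API, the routing being asked for boils down to keeping one writer per observed key and creating writers lazily, since the set of values (P1, P2, ... possibly P362) is not known up front. A plain-Python sketch of that bucketing pattern (the /output layout mirrors the question; everything else is illustrative, not Flink code):

```python
import os

class KeyedWriter:
    """Lazily opens one output file per key, e.g. output/P1, output/P2, ..."""

    def __init__(self, base_dir="output"):
        self.base_dir = base_dir
        self.files = {}            # key -> open file handle

    def write(self, key, record):
        if key not in self.files:  # first time this value (e.g. 'P362') is seen
            os.makedirs(self.base_dir, exist_ok=True)
            self.files[key] = open(os.path.join(self.base_dir, key), "a")
        self.files[key].write(record + "\n")

    def close(self):
        for f in self.files.values():
            f.close()

writer = KeyedWriter()
for key, value in [("P1", "a"), ("P2", "b"), ("P1", "c")]:
    writer.write(key, value)
writer.close()
```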

Memory map file in MATLAB?

*爱你&永不变心* submitted on 2019-12-03 15:11:02
I have decided to use memmapfile because my data (typically 30 GB to 60 GB) is too big to fit in a computer's memory. My data files consist of two columns of data that correspond to the outputs of two sensors, and I have them in both .bin and .txt formats.

m = memmapfile('G:\E-Stress Research\Data\2013-12-18\LD101_3\EPS/LD101_3.bin', 'format', 'int32')
m.data(1)

I used the above code to memory-map my data to a variable "m", but I have no idea which data format to use ('int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'single', or 'double'). In fact I tried all of the data formats …
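The point the question is circling is that the format passed to memmapfile has to match exactly how the .bin file was written by the acquisition software; the wrong element type reads the same bytes as nonsense. The same idea in Python/NumPy, as a sketch (the file name and the assumed int32, two-interleaved-column layout mirror the question and are not facts about the asker's sensors):

```python
import numpy as np

# Memory-map the raw binary without loading 30-60 GB into RAM.
# dtype must match how the file was written; int32 is only an assumption here.
m = np.memmap("LD101_3.bin", dtype=np.int32, mode="r")

# Two interleaved columns: column 0 = sensor 1, column 1 = sensor 2 (assumed).
# If m.size is not divisible by 2, or the values look like noise,
# the assumed dtype (or layout) is wrong and another format should be tried.
two_cols = m.reshape(-1, 2)
print(two_cols[:5])   # spot-check the first few samples for plausibility
```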

How many partitions does Spark create when a file is loaded from S3 bucket?

橙三吉。 submitted on 2019-12-03 14:38:51
If the file is loaded from HDFS, by default Spark creates one partition per block. But how does Spark decide partitions when a file is loaded from an S3 bucket? See the code of org.apache.hadoop.mapred.FileInputFormat.getSplits(). The block size depends on the S3 file system implementation (see FileStatus.getBlockSize()). E.g. S3AFileStatus just sets it to 0 (and then FileInputFormat.computeSplitSize() comes into play). Also, you don't get splits at all if your InputFormat is not splittable :) Spark will treat S3 as if it were a block-based filesystem, so partitioning rules for HDFS and S3 inputs are the …
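A quick way to see what those split rules produce in practice, sketched with PySpark (the s3a://my-bucket/data.csv path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()
sc = spark.sparkContext

# Load the file from S3 and ask the resulting RDD how many partitions it got.
rdd = sc.textFile("s3a://my-bucket/data.csv")
print(rdd.getNumPartitions())

# minPartitions is only a lower bound; the split size computed by
# FileInputFormat.computeSplitSize() still decides the actual boundaries.
rdd2 = sc.textFile("s3a://my-bucket/data.csv", minPartitions=16)
print(rdd2.getNumPartitions())
```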

speed up large result set processing using rmongodb

*爱你&永不变心* submitted on 2019-12-03 14:36:27
I'm using rmongodb to get every document in a particular collection. It works, but I'm working with millions of small documents, potentially 100M or more. I'm using the method suggested by the author on the website cnub.org/rmongodb.ashx:

count <- mongo.count(mongo, ns, query)
cursor <- mongo.find(mongo, query)
name <- vector("character", count)
age <- vector("numeric", count)
i <- 1
while (mongo.cursor.next(cursor)) {
  b <- mongo.cursor.value(cursor)
  name[i] <- mongo.bson.value(b, "name")
  age[i] <- mongo.bson.value(b, "age")
  i <- i + 1
}
df <- as.data.frame(list(name=name, age=age))

This …
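For comparison only (this is pymongo/pandas in Python, not rmongodb), the same iterate-a-cursor-into-columns idea with an explicit projection and large cursor batches, which keeps the per-document overhead down; the database, collection, and field names follow the excerpt and are otherwise placeholders:

```python
import pandas as pd
from pymongo import MongoClient

client = MongoClient()                      # assumed local MongoDB
coll = client["mydb"]["people"]             # placeholder names

# Project only the two needed fields and fetch documents in large batches,
# so network round trips are amortised over many documents.
cursor = coll.find({}, {"name": 1, "age": 1, "_id": 0}).batch_size(10_000)

names, ages = [], []
for doc in cursor:
    names.append(doc.get("name"))
    ages.append(doc.get("age"))

df = pd.DataFrame({"name": names, "age": ages})
```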

Is there a good way to avoid memory deep copy or to reduce time spent in multiprocessing?

人走茶凉 submitted on 2019-12-03 13:40:49
Question: I am making a memory-based, real-time calculation module for "big data" using the Pandas module in Python, so response time is the key quality of this module and is very critical. To process a large data set, I split the data and process the sub-splits in parallel. In the part that stores the result of the sub data, much time is spent (line 21). I think that internally a memory deep copy arises, or the sub data passed is not shared in memory. If I wrote the module in C or C++, I would use …
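One way to avoid the deep copy being described is to place the large array in shared memory, so worker processes read and write it without it being pickled back and forth. A minimal sketch using Python's multiprocessing.shared_memory (available from Python 3.8), shown with a NumPy array for simplicity; a DataFrame's underlying numeric values could be shared the same way. The sizes and the doubling "computation" are placeholders:

```python
import numpy as np
from multiprocessing import Process, shared_memory

def worker(shm_name, shape, dtype, start, stop):
    # Attach to the existing shared block: no copy of the big array is made.
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    data[start:stop] *= 2.0        # write the sub-result in place instead of returning it
    shm.close()

if __name__ == "__main__":
    arr = np.arange(1_000_000, dtype=np.float64)

    # Copy the array into shared memory once.
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    shared = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    shared[:] = arr

    # Each process works on its own slice of the shared buffer.
    half = arr.size // 2
    procs = [Process(target=worker, args=(shm.name, arr.shape, arr.dtype, 0, half)),
             Process(target=worker, args=(shm.name, arr.shape, arr.dtype, half, arr.size))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    print(shared[:3])              # sub-results are visible in the parent without a copy back

    shm.close()
    shm.unlink()
```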

Skewed tables in Hive

荒凉一梦 submitted on 2019-12-03 13:40:26
I am learning Hive and came across skewed tables. Help me understand them. What are skewed tables in Hive? How do we create skewed tables? How do they affect performance?

Tariq: What are skewed tables in Hive? A skewed table is a special type of table in which the values that appear very often (heavy skew) are split out into separate files, and the rest of the values go to some other file. How do we create skewed tables?

create table <T> (schema) skewed by (keys) on ('value1', 'value2') [STORED as DIRECTORIES];

Example:

create table T (c1 string, c2 string) skewed by (c1) on ('x1')

How does it …

Why is Kafka so fast? [closed]

徘徊边缘 submitted on 2019-12-03 13:33:59
Question (closed as needing to be more focused; not currently accepting answers; closed 4 years ago): If I have the same hardware, should I use Kafka or our current solution (ServiceMix/Camel)? Is there any difference? Can Kafka handle "bigger" data than ServiceMix/Camel? Why? There is an article that talks about how fast it can be, but I still don't get clearly why Kafka is so fast compared to other …

Does a flatMap in Spark cause a shuffle?

旧街凉风 submitted on 2019-12-03 13:28:54
Does flatMap in Spark behave like the map function and therefore cause no shuffling, or does it trigger a shuffle? I suspect it does cause shuffling. Can someone confirm it? There is no shuffling with either map or flatMap. The operations that do cause a shuffle are: repartition operations (repartition, coalesce), ByKey operations except for counting (groupByKey, reduceByKey), and join operations (cogroup, join). Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of the partitions themselves, the ordering of these elements is not. If one …
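A quick empirical check of that claim, sketched in PySpark (the input strings are placeholders): the lineage printed by toDebugString() shows no new stage after flatMap, while a ShuffledRDD appears only once reduceByKey is applied.

```python
from pyspark import SparkContext

sc = SparkContext(appName="shuffle-check")

rdd = sc.parallelize(["a b", "b c", "a c"], numSlices=4)

words = rdd.flatMap(lambda line: line.split())     # narrow: same stage, no shuffle
pairs = words.map(lambda w: (w, 1))                # still narrow
counts = pairs.reduceByKey(lambda x, y: x + y)     # wide: introduces a shuffle boundary

print(words.toDebugString().decode())              # no ShuffledRDD in this lineage
print(counts.toDebugString().decode())             # ShuffledRDD marks the shuffle
```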