bigdata

Elasticsearch Mapping - Rename existing field

旧时模样 submitted on 2019-12-03 15:52:05
Is there any way I can rename a field in an existing Elasticsearch mapping without having to add a new field? If so, what's the best way to do it in order to avoid breaking the existing mapping? E.g. from fieldCamelcase to fieldCamelCase:

{
  "myType": {
    "properties": {
      "timestamp": { "type": "date", "format": "date_optional_time" },
      "fieldCamelcase": { "type": "string", "index": "not_analyzed" },
      "field_test": { "type": "double" }
    }
  }
}

You could do this by creating an ingest pipeline that contains a Rename processor, in combination with the Reindex API: PUT _ingest/pipeline/my_rename …
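A minimal sketch of that pipeline-plus-reindex approach, using Python's requests library against the REST API (the pipeline name my_rename comes from the answer above; the cluster URL and the source_index/dest_index names are placeholder assumptions):

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# 1. Create an ingest pipeline whose only processor renames the field.
requests.put(f"{ES}/_ingest/pipeline/my_rename", json={
    "description": "rename fieldCamelcase to fieldCamelCase",
    "processors": [
        {"rename": {"field": "fieldCamelcase", "target_field": "fieldCamelCase"}}
    ],
})

# 2. Reindex into a new index, passing every document through the pipeline.
#    The existing mapping cannot be changed in place; searches or aliases
#    are pointed at the new index afterwards.
requests.post(f"{ES}/_reindex", json={
    "source": {"index": "source_index"},                     # placeholder
    "dest": {"index": "dest_index", "pipeline": "my_rename"} # placeholder
})
```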

using clojure-csv.core to parse a huge csv file

折月煮酒 submitted on 2019-12-03 15:34:48
So far I have:

(:require [clojure-csv.core :as csv])
(:require [clojure.java.io :as io])

(def csv-file (.getFile (clojure.java.io/resource "verbs.csv")))

(defn process-csv [file]
  (with-open [rdr (io/reader file)]
    (csv/parse-csv rdr)))

But I am getting java.io.IOException: Stream closed. I am using clojure-csv, which exposes two methods; the first of them, parse-csv, is the one I am using, and its doc says: "Takes a CSV as a char sequence or string, and returns a lazy sequence of vectors of strings." What I think I know: with-open is lazy, and the rdr in (csv/parse-csv rdr) is a single line of the csv …
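The exception comes from a lazy value escaping the scope that owns the stream: with-open closes rdr as soon as the body returns, and the still-unrealized sequence from parse-csv then tries to read from a closed reader. A rough Python analogue of the same pitfall, as a sketch (verbs.csv follows the question; csv.reader plays the role of the lazy sequence):

```python
import csv

def lazy_rows(path):
    # Broken analogue: the lazy reader escapes the `with` block, and the file
    # is closed before any row is consumed -> "I/O operation on closed file".
    with open(path, newline="") as f:
        return csv.reader(f)

def eager_rows(path):
    # Fix: realize (or fully process) the rows while the file is still open --
    # the same idea as forcing the sequence inside with-open (e.g. with doall)
    # or reducing over it there instead of returning it.
    with open(path, newline="") as f:
        return list(csv.reader(f))

rows = eager_rows("verbs.csv")
```

For a file too large to hold in memory, the processing itself (not just the parse) has to happen inside the open scope, so that only the per-row results escape it.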

Flink Streaming: How to output one data stream to different outputs depending on the data?

时光毁灭记忆、已成空白 submitted on 2019-12-03 15:16:19
Question: In Apache Flink I have a stream of tuples. Let's assume a really simple Tuple1<String>. The tuple can have an arbitrary value in its value field (e.g. 'P1', 'P2', etc.). The set of possible values is finite, but I don't know the full set beforehand (so there could be a 'P362'). I want to write that tuple to a certain output location depending on the value inside the tuple, so e.g. I would like to have the following file structure: /output/P1, /output/P2, and so on. In the documentation I only found …
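Outside of any Flink-specific API, the routing being asked for boils down to keeping one writer per observed key and creating writers lazily, since the set of values (P1, P2, ... possibly P362) is not known up front. A plain-Python sketch of that bucketing pattern (the /output layout mirrors the question; everything else is illustrative, not Flink code):

```python
import os

class KeyedWriter:
    """Lazily opens one output file per key, e.g. output/P1, output/P2, ..."""

    def __init__(self, base_dir="output"):
        self.base_dir = base_dir
        self.files = {}            # key -> open file handle

    def write(self, key, record):
        if key not in self.files:  # first time this value (e.g. 'P362') is seen
            os.makedirs(self.base_dir, exist_ok=True)
            self.files[key] = open(os.path.join(self.base_dir, key), "a")
        self.files[key].write(record + "\n")

    def close(self):
        for f in self.files.values():
            f.close()

writer = KeyedWriter()
for key, value in [("P1", "a"), ("P2", "b"), ("P1", "c")]:
    writer.write(key, value)
writer.close()
```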

Memory map file in MATLAB?

*爱你&永不变心* submitted on 2019-12-03 15:11:02
I have decided to use memmapfile because my data (typically 30 GB to 60 GB) is too big to fit in a computer's memory. My data files consist of two columns of data that correspond to the outputs of two sensors, and I have them in both .bin and .txt formats.

m = memmapfile('G:\E-Stress Research\Data\2013-12-18\LD101_3\EPS/LD101_3.bin', 'format', 'int32')
m.data(1)

I used the above code to memory-map my data to a variable "m", but I have no idea which data format to use ('int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'single', or 'double'). In fact I tried all of the data formats …
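The point the question is circling is that the format passed to memmapfile has to match exactly how the .bin file was written by the acquisition software; the wrong element type reads the same bytes as nonsense. The same idea in Python/NumPy, as a sketch (the file name and the assumed int32, two-interleaved-column layout mirror the question and are not facts about the asker's sensors):

```python
import numpy as np

# Memory-map the raw binary without loading 30-60 GB into RAM.
# dtype must match how the file was written; int32 is only an assumption here.
m = np.memmap("LD101_3.bin", dtype=np.int32, mode="r")

# Two interleaved columns: column 0 = sensor 1, column 1 = sensor 2 (assumed).
# If m.size is not divisible by 2, or the values look like noise,
# the assumed dtype (or layout) is wrong and another format should be tried.
two_cols = m.reshape(-1, 2)
print(two_cols[:5])   # spot-check the first few samples for plausibility
```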

How many partitions does Spark create when a file is loaded from S3 bucket?

橙三吉。 submitted on 2019-12-03 14:38:51
If the file is loaded from HDFS, by default Spark creates one partition per block. But how does Spark decide partitions when a file is loaded from an S3 bucket? See the code of org.apache.hadoop.mapred.FileInputFormat.getSplits(). The block size depends on the S3 file system implementation (see FileStatus.getBlockSize()). E.g. S3AFileStatus just sets it to 0 (and then FileInputFormat.computeSplitSize() comes into play). Also, you don't get splits at all if your InputFormat is not splittable :) Spark will treat S3 as if it were a block-based filesystem, so partitioning rules for HDFS and S3 inputs are the …
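A quick way to see what those split rules produce in practice, sketched with PySpark (the s3a://my-bucket/data.csv path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()
sc = spark.sparkContext

# Load the file from S3 and ask the resulting RDD how many partitions it got.
rdd = sc.textFile("s3a://my-bucket/data.csv")
print(rdd.getNumPartitions())

# minPartitions is only a lower bound; the split size computed by
# FileInputFormat.computeSplitSize() still decides the actual boundaries.
rdd2 = sc.textFile("s3a://my-bucket/data.csv", minPartitions=16)
print(rdd2.getNumPartitions())
```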

speed up large result set processing using rmongodb

*爱你&永不变心* submitted on 2019-12-03 14:36:27
I'm using rmongodb to get every document in a particular collection. It works, but I'm working with millions of small documents, potentially 100M or more. I'm using the method suggested by the author on the website cnub.org/rmongodb.ashx:

count <- mongo.count(mongo, ns, query)
cursor <- mongo.find(mongo, query)
name <- vector("character", count)
age <- vector("numeric", count)
i <- 1
while (mongo.cursor.next(cursor)) {
  b <- mongo.cursor.value(cursor)
  name[i] <- mongo.bson.value(b, "name")
  age[i] <- mongo.bson.value(b, "age")
  i <- i + 1
}
df <- as.data.frame(list(name=name, age=age))

This …
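For comparison only (this is pymongo/pandas in Python, not rmongodb), the same iterate-a-cursor-into-columns idea with an explicit projection and large cursor batches, which keeps the per-document overhead down; the database, collection, and field names follow the excerpt and are otherwise placeholders:

```python
import pandas as pd
from pymongo import MongoClient

client = MongoClient()                      # assumed local MongoDB
coll = client["mydb"]["people"]             # placeholder names

# Project only the two needed fields and fetch documents in large batches,
# so network round trips are amortised over many documents.
cursor = coll.find({}, {"name": 1, "age": 1, "_id": 0}).batch_size(10_000)

names, ages = [], []
for doc in cursor:
    names.append(doc.get("name"))
    ages.append(doc.get("age"))

df = pd.DataFrame({"name": names, "age": ages})
```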

Is there a good way to avoid memory deep copy or to reduce time spent in multiprocessing?

人走茶凉 submitted on 2019-12-03 13:40:49
Question: I am making a memory-based, real-time calculation module for "big data" using the Pandas module in Python, so response time is the key quality of this module and is very critical. To process a large data set, I split the data and process the sub-splits in parallel. In the part that stores the result of the sub data, much time is spent (line 21). I think that internally a memory deep copy arises, or the sub data passed is not shared in memory. If I wrote the module in C or C++, I would use …
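One way to avoid the deep copy being described is to place the large array in shared memory, so worker processes read and write it without it being pickled back and forth. A minimal sketch using Python's multiprocessing.shared_memory (available from Python 3.8), shown with a NumPy array for simplicity; a DataFrame's underlying numeric values could be shared the same way. The sizes and the doubling "computation" are placeholders:

```python
import numpy as np
from multiprocessing import Process, shared_memory

def worker(shm_name, shape, dtype, start, stop):
    # Attach to the existing shared block: no copy of the big array is made.
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    data[start:stop] *= 2.0        # write the sub-result in place instead of returning it
    shm.close()

if __name__ == "__main__":
    arr = np.arange(1_000_000, dtype=np.float64)

    # Copy the array into shared memory once.
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    shared = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    shared[:] = arr

    # Each process works on its own slice of the shared buffer.
    half = arr.size // 2
    procs = [Process(target=worker, args=(shm.name, arr.shape, arr.dtype, 0, half)),
             Process(target=worker, args=(shm.name, arr.shape, arr.dtype, half, arr.size))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    print(shared[:3])              # sub-results are visible in the parent without a copy back

    shm.close()
    shm.unlink()
```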

Skewed tables in Hive

荒凉一梦 submitted on 2019-12-03 13:40:26
I am learning Hive and came across skewed tables. Help me understand them. What are skewed tables in Hive? How do we create skewed tables? How do they affect performance?

Tariq: What are skewed tables in Hive? A skewed table is a special type of table in which the values that appear very often (heavy skew) are split out into separate files, and the rest of the values go to some other file. How do we create skewed tables?

create table <T> (schema) skewed by (keys) on ('value1', 'value2') [STORED as DIRECTORIES];

Example:

create table T (c1 string, c2 string) skewed by (c1) on ('x1')

How does it …

Why is Kafka so fast? [closed]

徘徊边缘 submitted on 2019-12-03 13:33:59
Question (closed as needing to be more focused; not currently accepting answers; closed 4 years ago): If I have the same hardware, should I use Kafka or our current solution (ServiceMix/Camel)? Is there any difference? Can Kafka handle "bigger" data than ServiceMix/Camel? Why? There is an article that talks about how fast it can be, but I still don't get clearly why Kafka is so fast compared to other …

Does a flatMap in Spark cause a shuffle?

旧街凉风 submitted on 2019-12-03 13:28:54
Does flatMap in Spark behave like the map function and therefore cause no shuffling, or does it trigger a shuffle? I suspect it does cause shuffling. Can someone confirm it? There is no shuffling with either map or flatMap. The operations that do cause a shuffle are: repartition operations (repartition, coalesce), ByKey operations except for counting (groupByKey, reduceByKey), and join operations (cogroup, join). Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of the partitions themselves, the ordering of these elements is not. If one …
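A quick empirical check of that claim, sketched in PySpark (the input strings are placeholders): the lineage printed by toDebugString() shows no new stage after flatMap, while a ShuffledRDD appears only once reduceByKey is applied.

```python
from pyspark import SparkContext

sc = SparkContext(appName="shuffle-check")

rdd = sc.parallelize(["a b", "b c", "a c"], numSlices=4)

words = rdd.flatMap(lambda line: line.split())     # narrow: same stage, no shuffle
pairs = words.map(lambda w: (w, 1))                # still narrow
counts = pairs.reduceByKey(lambda x, y: x + y)     # wide: introduces a shuffle boundary

print(words.toDebugString().decode())              # no ShuffledRDD in this lineage
print(counts.toDebugString().decode())             # ShuffledRDD marks the shuffle
```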