bigdata

When to use DynamoDB - Use Cases

荒凉一梦 submitted on 2019-12-05 07:26:38
I've tried to figure out which use cases are the best fit for Amazon DynamoDB. When I googled, most blogs say DynamoDB should be used only for large amounts of data (Big Data). I have a relational DB background and NoSQL is new to me, so I've tried to relate this to my relational knowledge. Most of the DynamoDB concepts come down to creating a schema-less table with partition keys/sort keys and querying based on those keys. Also, there is no concept of a stored procedure to make queries easier and simpler. If we are managing such huge data, doing such
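The excerpt breaks off, but the key-based access pattern it mentions is worth illustrating. A minimal sketch with boto3, assuming a hypothetical orders table with partition key customer_id and sort key order_date (none of these names come from the question):

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table: partition key "customer_id", sort key "order_date".
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("orders")

# Key-based access: every query targets one partition key value and may
# narrow the result further on the sort key.
response = table.query(
    KeyConditionExpression=Key("customer_id").eq("c-42")
    & Key("order_date").begins_with("2019-")
)
for item in response["Items"]:
    print(item)
```

Access patterns like this, where the partition key is always known up front, are where DynamoDB fits well regardless of data volume; ad-hoc joins and aggregations remain simpler in a relational database.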

Switch from MySQL to MongoDB with 200 million rows

五迷三道 submitted on 2019-12-05 07:13:24
Question: We are trying to move from MySQL to MongoDB. The MySQL structure is: id_src int, id_dest int, unique key (id_src, id_dest). There are about 200 million rows in MySQL. Example data ({id_src, id_dest}): {1,2} {1,3} {1,10} {2,3} {2,10} {4,3}. We need to retrieve ({id_dest, count}): {3,3} {10,2} {2,1}. I started by reproducing the MySQL structure in MongoDB. Insert performance was huge (very good): about 1 hour to insert 200 million rows. But I needed to use map-reduce to get the group by. Map reduce
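The excerpt is cut off, but for a group-by/count like this the aggregation framework is the usual alternative to MongoDB's map-reduce. A minimal sketch with pymongo, assuming the documents keep the {id_src, id_dest} shape and live in a collection called edges (the database and collection names are assumptions):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
edges = client["graph"]["edges"]  # hypothetical database/collection names

# Equivalent of: SELECT id_dest, COUNT(*) FROM t GROUP BY id_dest ORDER BY COUNT(*) DESC
pipeline = [
    {"$group": {"_id": "$id_dest", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
# allowDiskUse lets the $group stage spill to disk on a 200M-document collection.
for doc in edges.aggregate(pipeline, allowDiskUse=True):
    print(doc["_id"], doc["count"])
```

Even with the aggregation framework this is a full scan of 200 million documents, so if the counts are needed often, pre-aggregating at insert time (for example, $inc on a per-id_dest counter document) is the usual schema-level answer.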

Apache Spark: In Spark SQL, are queries vulnerable to SQL injection? [duplicate]

笑着哭i submitted on 2019-12-05 06:43:18
This question already has an answer here: Spark SQL security considerations (1 answer). Closed 2 years ago. Scenario: say there is a table in Hive, and it is queried using the Spark SQL below in Apache Spark, where the table name is passed as an argument and concatenated into the query. For a non-distributed system I have a basic understanding of the SQL-injection vulnerability, and in the JDBC context I understand the use of createStatement/preparedStatement in that kind of scenario. But what about this scenario with Spark SQL: is this code vulnerable? Any insights? def main(args
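The code is cut off, but the pattern the question describes looks roughly like the sketch below (written in PySpark rather than the question's Scala; the function and table names are made up). Concatenating a caller-supplied table name does let the caller run arbitrary SQL, and since the table name is an identifier rather than a bind value, one common mitigation is to validate it against the catalog before interpolating it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-loader").enableHiveSupport().getOrCreate()

def load_table_unsafe(table_name: str):
    # Vulnerable pattern: an argument like
    # "t WHERE 1=0 UNION SELECT * FROM secrets" is executed as-is.
    return spark.sql("SELECT * FROM " + table_name)

def load_table_safer(table_name: str):
    # Allow only table names that actually exist in the current database.
    known = {t.name for t in spark.catalog.listTables()}
    if table_name not in known:
        raise ValueError(f"unknown table: {table_name}")
    return spark.sql(f"SELECT * FROM `{table_name}`")
```

The query runs with whatever permissions the Spark job has against the metastore and storage, so the injection mechanism itself works the same way as in the JDBC case.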

Transferring files from remote node to HDFS with Flume

你离开我真会死。 submitted on 2019-12-05 02:03:19
I have a bunch of binary files compressed in *.gz format. These are generated on a remote node and must be transferred to HDFS on one of the datacenter's servers. I'm exploring the option of sending the files with Flume; I looked at doing this with a Spooling Directory source, but apparently that only works when the spooled directory is local to the node running the agent. Any suggestions on how to tackle this problem? arghtype: There is no out-of-the-box solution for such a case, but you could try these workarounds: you could create your own source implementation for such
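The answer is truncated, so the configuration below is not necessarily the workaround it goes on to describe; it is a commonly used alternative: run a second Flume agent on the remote node with an Avro sink that forwards events to an Avro source on an agent inside the cluster, which then writes to HDFS. All host names, ports and paths are assumptions, and binary *.gz files would additionally need a blob-style deserializer on the spooling source (check the Flume documentation for the exact class) so files are not split on line boundaries:

```properties
# --- remote-agent.properties (runs on the node that produces the *.gz files) ---
remote.sources = spool
remote.channels = mem
remote.sinks = fwd

remote.sources.spool.type = spooldir
remote.sources.spool.spoolDir = /data/outgoing
remote.sources.spool.channels = mem

remote.channels.mem.type = memory
remote.channels.mem.capacity = 10000

remote.sinks.fwd.type = avro
remote.sinks.fwd.hostname = collector.example.com
remote.sinks.fwd.port = 4545
remote.sinks.fwd.channel = mem

# --- collector-agent.properties (runs inside the cluster, next to HDFS) ---
collector.sources = in
collector.channels = mem
collector.sinks = out

collector.sources.in.type = avro
collector.sources.in.bind = 0.0.0.0
collector.sources.in.port = 4545
collector.sources.in.channels = mem

collector.channels.mem.type = memory
collector.channels.mem.capacity = 10000

collector.sinks.out.type = hdfs
collector.sinks.out.hdfs.path = hdfs://namenode:8020/flume/ingest
collector.sinks.out.hdfs.fileType = DataStream
collector.sinks.out.channel = mem
```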

Using clojure-csv.core to parse a huge CSV file

淺唱寂寞╮ submitted on 2019-12-04 23:50:54
Question: So far I have: (:require [clojure-csv.core :as csv]) (:require [clojure.java.io :as io]) (def csv-file (.getFile (clojure.java.io/resource "verbs.csv"))) (defn process-csv [file] (with-open [rdr (io/reader file)] (csv/parse-csv rdr))) But I am getting java.io.IOException: Stream closed. I am using clojure-csv, and it exposes two methods; the first of which, parse-csv, is the one I am using. The doc says: "Takes a CSV as a char sequence or string, and returns a lazy sequence of vectors of strings". What

How to parse a JSON string from a column with Pig

我怕爱的太早我们不能终老 submitted on 2019-12-04 23:42:02
Question: I have TSV log files where one column is populated by a JSON string. I want to parse that column with JsonLoader in a Pig script. I have seen many examples where JsonLoader is used in cases where each row is only a JSON string; here I have other columns I want to skip and I don't know how to do that. The file looks like this: foo bar {"version":1; "type":"an event"; "count": 1} foo bar {"version":1; "type":"another event"; "count": 1} How can I do that? Answer 1: You are looking for the JsonStringToMap UDF

Is a MySQL view faster for querying a DB that contains 40 million rows? [closed]

家住魔仙堡 submitted on 2019-12-04 22:09:49
I have a table that contains 40 million rows and I want to reduce the query time. Is it possible to do that using views? If yes, could you please explain how? Yes, a view can sometimes reduce query time, because its definition can either be merged into the outer query or materialized into a temporary result that the outer query then reads. The MySQL statement for creating a view is: CREATE ALGORITHM = MERGE VIEW my_view AS SELECT ... where the algorithm is one of MERGE, TEMPTABLE, or UNDEFINED. However, as always, the answer to "how do I increase performance?" is "it depends". We really need more details on the

Elasticsearch Mapping - Rename existing field

大憨熊 submitted on 2019-12-04 21:59:18
Question: Is there any way I can rename a field in an existing Elasticsearch mapping without having to add a new field? If so, what is the best way to do it in order to avoid breaking the existing mapping? E.g. from fieldCamelcase to fieldCamelCase: { "myType": { "properties": { "timestamp": { "type": "date", "format": "date_optional_time" }, "fieldCamelcase": { "type": "string", "index": "not_analyzed" }, "field_test": { "type": "double" } } } } Answer 1: You could do this by creating an Ingest pipeline,
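The answer is cut off right after it mentions an ingest pipeline. A minimal sketch of that approach (cluster URL, index and pipeline names are assumptions) pairs a rename processor with a reindex into a new index, since a field in an existing mapping cannot be renamed in place:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster; index names below are made up

# 1) Ingest pipeline whose rename processor maps the old field name to the new one.
requests.put(
    f"{ES}/_ingest/pipeline/rename-fieldcamelcase",
    json={
        "description": "rename fieldCamelcase -> fieldCamelCase",
        "processors": [
            {"rename": {"field": "fieldCamelcase", "target_field": "fieldCamelCase"}}
        ],
    },
).raise_for_status()

# 2) Existing field mappings are immutable, so create a new index with the corrected
#    mapping and reindex every document through the pipeline.
requests.post(
    f"{ES}/_reindex",
    json={
        "source": {"index": "my_index"},
        "dest": {"index": "my_index_v2", "pipeline": "rename-fieldcamelcase"},
    },
).raise_for_status()
```

Afterwards an index alias can be switched from the old index to the new one so clients keep using the same name.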

What is Big Data & what classifies as Big Data? [closed]

霸气de小男生 submitted on 2019-12-04 21:26:15
Closed. This question is opinion-based and is not currently accepting answers. Closed 3 years ago. I have gone through a lot of articles, but I don't seem to get a perfectly clear answer on what exactly Big Data is. On one page I saw "any data which is too big for your usage is big data, i.e. 100 MB is considered big data for your mailbox but not for your hard disk", whereas another article said "big data is usually more than 1 TB, with different volume / variety / velocity

How many partitions does Spark create when a file is loaded from an S3 bucket?

末鹿安然 submitted on 2019-12-04 20:47:29
Question: If a file is loaded from HDFS, Spark by default creates one partition per block. But how does Spark decide the partitioning when a file is loaded from an S3 bucket? Answer 1: See the code of org.apache.hadoop.mapred.FileInputFormat.getSplits(). Block size depends on the S3 file system implementation (see FileStatus.getBlockSize()). E.g. S3AFileStatus just sets it equal to 0 (and then FileInputFormat.computeSplitSize() comes into play). Also, you don't get splits at all if your InputFormat is not splittable :) Answer 2:
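To see this behaviour from the Spark side, a small sketch (the bucket path and numbers are made up, and fs.s3a.block.size only plays this role on Hadoop versions where S3A reports it as the block size) checks the resulting partition count and the minPartitions hint that feeds into getSplits():

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("s3-partitions")
    # Assumption: on Hadoop builds where S3A reports a block size, this value is what
    # FileInputFormat.computeSplitSize() works with for s3a:// paths.
    .config("spark.hadoop.fs.s3a.block.size", str(64 * 1024 * 1024))
    .getOrCreate()
)
sc = spark.sparkContext

path = "s3a://my-bucket/logs/2019/*.txt"  # hypothetical path; .gz files give one partition per file

rdd = sc.textFile(path)
print("default partitions:", rdd.getNumPartitions())

# minPartitions is only a hint passed down to getSplits(); the final count still depends
# on the computed split size and on whether the input format is splittable.
rdd_hint = sc.textFile(path, minPartitions=32)
print("with minPartitions=32:", rdd_hint.getNumPartitions())
```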