bigdata

Zookeeper cluster on AWS

Submitted by 旧巷老猫 on 2019-12-04 14:49:10
Question: I am trying to set up a ZooKeeper cluster on 3 AWS EC2 machines, but I keep getting the same error: 2016-10-19 16:30:23,177 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@382] - Cannot open channel to 3 at election address /xxx.31.34.102:3888 java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl
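On EC2 this error is commonly caused by each node trying to bind its election port to an address it does not own, or by the security group not allowing ports 2888/3888 between the nodes. A minimal zoo.cfg sketch of the usual workaround, assuming three nodes in the same VPC (the private IPs and paths below are placeholders, not taken from the question):

```
# zoo.cfg on the node whose myid is 2 (all IPs are placeholders)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=10.0.1.11:2888:3888
server.2=0.0.0.0:2888:3888     # this node lists itself as 0.0.0.0 so it binds locally
server.3=10.0.1.13:2888:3888
```

Each node keeps the other two entries pointing at their private IPs and replaces only its own entry with 0.0.0.0; ports 2181, 2888 and 3888 must also be open between the instances in the security group.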

Hive execution hook

Submitted by £可爱£侵袭症+ on 2019-12-04 14:49:03
Question: I need to plug a custom execution hook into Apache Hive. Please let me know if somebody knows how to do it. The current environment I am using is given below: Hadoop: Cloudera version 4.1.2; Operating system: CentOS. Thanks, Arun Answer 1: There are several types of hooks depending on the stage at which you want to inject your custom code: Driver run hooks (Pre/Post), Semantic analyzer hooks (Pre/Post), Execution hooks (Pre/Failure/Post), Client statistics publisher. If you run a script the processing
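As an illustration of the execution-hook variant, here is a minimal sketch of a post-execution hook, assuming the ExecuteWithHookContext interface available in Hive of that era (the package and class names below are made up for the example):

```java
package com.example.hooks;

import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Minimal sketch of a post-execution hook: logs the query text after it runs.
public class AuditPostExecHook implements ExecuteWithHookContext {
  @Override
  public void run(HookContext hookContext) throws Exception {
    System.err.println("Query finished: "
        + hookContext.getQueryPlan().getQueryString());
  }
}
```

The jar is then put on Hive's classpath (e.g. via ADD JAR or hive.aux.jars.path) and registered with set hive.exec.post.hooks=com.example.hooks.AuditPostExecHook; the hive.exec.pre.hooks and hive.exec.failure.hooks properties work the same way for the pre- and failure-stage hooks.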

Remove single quotes from data using Pig

Submitted by 时间秒杀一切 on 2019-12-04 14:23:56
This is what my data looks like: (10, 'ACCOUNTING', 'NEW YORK') (20, 'RESEARCH', 'DALLAS') (30, 'SALES', 'CHICAGO') (40, 'OPERATIONS', 'BOSTON') I want to remove ( , ) and ' from this data using a Pig script. I want my data to look like this: 10, ACCOUNTING, NEW YORK 20, RESEARCH, DALLAS 30, SALES, CHICAGO 40, OPERATIONS, BOSTON I have been stuck on this for quite a long time. Please help. Thanks in advance. Can you try the REPLACE function with the regex below? Explanation: In regex, a few characters have special meanings: \ ^ $ . , | ? * + ( ) [ {. These special characters are called "
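A minimal Pig sketch of that approach, assuming each record can be loaded as a single chararray line (the file name below is a placeholder):

```pig
-- Load each record as one line and strip ( ) and ' with a single regex character class.
raw   = LOAD 'dept.txt' AS (line:chararray);
clean = FOREACH raw GENERATE REPLACE(line, '[()\']', '');
DUMP clean;
-- (10, 'ACCOUNTING', 'NEW YORK')  becomes  10, ACCOUNTING, NEW YORK
```

Inside a character class, ( and ) need no escaping, so the pattern stays short; if the fields are already split into separate columns, the same REPLACE can be applied per field instead.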

Scala immutable Map slow

Submitted by 时光毁灭记忆、已成空白 on 2019-12-04 12:58:11
I have a piece of code where I create a map like: val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap Then I use this map to create my object: case class MyObject(val attribute1: String, val attribute2: Map[String, String]) I'm reading millions of lines and converting them to MyObjects using an iterator, like MyObject("1", map). When I do this it is really slow: more than 1 h for 2,000,000 entries. I removed the map from the object creation, but I still do the split process (section 1): val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k,
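One common way to cut the per-line allocation overhead is to build the map directly instead of going through map + collect + toMap, which creates several intermediate arrays per line. A minimal sketch, assuming the field really is a ';'-separated list of key"value pairs as the split calls suggest (the helper name is made up):

```scala
// Builds the attribute map in one pass; splitting on a Char avoids regex handling,
// and Map.newBuilder skips the intermediate Array[(String, String)] and toMap copy.
def parseAttributes(field: String): Map[String, String] = {
  val builder = Map.newBuilder[String, String]
  val parts = field.split(';')
  var i = 0
  while (i < parts.length) {
    val kv = parts(i).split('"')
    if (kv.length == 2) builder += (kv(0) -> kv(1))
    i += 1
  }
  builder.result()
}

// val obj = MyObject("1", parseAttributes(gtfLineArr(8)))
```

If it is still slow after this, the cost is probably elsewhere in the per-line pipeline rather than in the case-class construction itself.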

Why does the user need write permission on the location of external hive table?

Submitted by 北慕城南 on 2019-12-04 12:39:34
In Hive, you can create two kinds of tables: managed and external. In the case of a managed table, you own the data, and hence when you drop the table the data is deleted. In the case of an external table, you don't have ownership of the data, and hence when you drop such a table the underlying data is not deleted; only the metadata is deleted. Now, recently I have observed that you cannot create an external table over a location for which you don't have write (modification) permission in HDFS. I completely fail to understand this. Use case: It is quite common that the data you are churning is huge and read
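For reference, this is the kind of statement the question is about; a minimal sketch, with placeholder columns and path:

```sql
-- External table over a directory that may be read-only for the Hive user.
CREATE EXTERNAL TABLE web_logs (
  ts  STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/shared/web_logs';

-- Dropping it removes only the metastore entry; the files under
-- /data/shared/web_logs are left untouched.
DROP TABLE web_logs;
```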

How do you import Big Data public data sets into AWS?

Submitted by 二次信任 on 2019-12-04 12:31:38
Loading any of Amazon's listed public data sets (http://aws.amazon.com/datasets) would take a lot of resources and bandwidth. What's the best way to import them into AWS so you can start working with them quickly? bardiir: You will need to create a new EBS volume using the snapshot ID for the public dataset. That way you won't need to pay for transfer. But be careful: some data sets are only available in one region, most likely denoted by a note similar to this: "These datasets are hosted in the us-east-1 region." You should then launch your EC2 instance in the same region. If you process these
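A minimal sketch of that workflow with the AWS CLI, assuming the dataset is published as an EBS snapshot (all IDs below are placeholders):

```sh
# Create a volume from the public snapshot, in the same AZ as your instance,
# then attach and mount it; the data never leaves the region.
aws ec2 create-volume \
    --snapshot-id snap-0123456789abcdef0 \
    --availability-zone us-east-1a \
    --volume-type gp2

aws ec2 attach-volume \
    --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 \
    --device /dev/sdf

# On the instance (the device may appear as /dev/xvdf):
#   sudo mkdir -p /data && sudo mount /dev/xvdf /data
```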

How to produce massive amount of data?

Submitted by 心不动则不痛 on 2019-12-04 11:09:18
Question: I'm doing some testing with Nutch and Hadoop and I need a massive amount of data. I want to start with 20 GB, go to 100 GB, 500 GB and eventually reach 1-2 TB. The problem is that I don't have this amount of data, so I'm thinking of ways to produce it. The data itself can be of any kind. One idea is to take an initial set of data and duplicate it. But it's not good enough, because I need files that are different from one another (identical files are ignored). Another idea is to write a program
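A minimal sketch of that second idea in Scala: a small generator that writes files of random text so no two files are identical (the sizes, path and line length below are arbitrary, and the loop is illustrative rather than tuned for speed):

```scala
import java.io.PrintWriter
import scala.util.Random

// Writes `files` plain-text files of roughly `targetBytesPerFile` each,
// filled with random alphanumeric lines so every file differs.
object GenerateTestData extends App {
  val targetBytesPerFile = 1L * 1024 * 1024 * 1024   // ~1 GB per file
  val files = 20                                     // ~20 GB total

  for (f <- 0 until files) {
    val out = new PrintWriter(s"/data/generated/part-$f.txt")
    var written = 0L
    while (written < targetBytesPerFile) {
      val line = Random.alphanumeric.take(100).mkString
      out.println(line)
      written += line.length + 1
    }
    out.close()
  }
}
```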

How to do a join in Elasticsearch — or at the Lucene level

Submitted by 前提是你 on 2019-12-04 11:01:43
Question: What's the best way to do the equivalent of an SQL join in Elasticsearch? I have an SQL setup with two large tables: Persons and Items. A Person can own many Items. Both Person and Item rows can change (i.e. be updated). I have to run searches which filter by aspects of both the person and the item. In Elasticsearch, it looks like you could make Person a nested document of Item and then use has_child. But: if you then update a Person, I think you'd need to update every Item they own (which
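For context, the parent/child route (as opposed to nested documents) avoids exactly that re-indexing problem: a parent can be updated without touching its children. A minimal sketch in the Elasticsearch 2.x-era console syntax the question is contemporary with (index, type and field names are placeholders):

```
PUT /inventory
{
  "mappings": {
    "person": {
      "properties": { "name": { "type": "string" } }
    },
    "item": {
      "_parent": { "type": "person" },
      "properties": { "label": { "type": "string" } }
    }
  }
}

GET /inventory/item/_search
{
  "query": {
    "has_parent": {
      "parent_type": "person",
      "query": { "match": { "name": "alice" } }
    }
  }
}
```

Each item is indexed with a parent=&lt;person id&gt; routing parameter; filtering on item fields and person fields together then combines an ordinary query with has_parent (or, from the person side, has_child).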

Best way to prepare for Design and Architecture questions related to big data [closed]

Submitted by 筅森魡賤 on 2019-12-04 09:39:54
Question: Recently, I attended an onsite interview at a company and I was asked design questions related to big data, e.g.: get me the list

Read n lines of a big text file

Submitted by 匆匆过客 on 2019-12-04 09:25:35
Question: The smallest file I have has > 850k lines, and every line is of unknown length. The goal is to read n lines from this file in the browser. Reading it fully is not going to happen. Here is the HTML: <input type="file" name="file" id="file"> and the JS I have: var n = 10; var reader = new FileReader(); reader.onload = function(progressEvent) { // Entire file console.log(this.result); // By lines var lines = this.result.split('\n'); for (var line = 0; line < n; line++) { console.log(lines[line]);
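The snippet above still reads the entire file before splitting it. A minimal sketch of reading only enough of the file to obtain the first n lines, using File.slice to pull the file in chunks (the chunk size and function name are arbitrary):

```javascript
// Reads the file in 64 KB chunks until n lines have been seen, so the whole
// file never has to be loaded into memory.
function readFirstLines(file, n, callback) {
  var CHUNK = 64 * 1024;
  var offset = 0;
  var text = '';
  var reader = new FileReader();

  reader.onload = function () {
    text += reader.result;
    var lines = text.split('\n');
    if (lines.length > n || offset >= file.size) {
      callback(lines.slice(0, n));   // got n complete lines (or hit end of file)
    } else {
      readNext();                    // not enough lines yet, read another chunk
    }
  };

  function readNext() {
    // Note: slicing can split a multi-byte UTF-8 character across chunks;
    // this is fine for ASCII data, otherwise decode the bytes yourself.
    var blob = file.slice(offset, offset + CHUNK);
    offset += CHUNK;
    reader.readAsText(blob);
  }

  readNext();
}

// Usage with the <input type="file" id="file"> from the question:
document.getElementById('file').addEventListener('change', function (e) {
  readFirstLines(e.target.files[0], 10, function (lines) {
    lines.forEach(function (line) { console.log(line); });
  });
});
```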