bigdata

How do I output the results of a HiveQL query to CSV using a shell script?

别说谁变了你拦得住时间么 submitted on 2019-12-01 09:27:38
Question: I would like to run multiple Hive queries, preferably in parallel rather than sequentially, and store the output of each query in a CSV file. For example, query1 output in csv1, query2 output in csv2, etc. I would run these queries after leaving work, with the goal of having output to analyze during the next business day. I am interested in using a bash shell script, because then I'd be able to set up a cron task to run it at a specific time of day. I know how to store the results of
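A minimal sketch of the parallel part, in Python rather than the bash the asker mentions, since most other examples on this page are Python. It assumes the hive CLI is on PATH; the query/output pairs are hypothetical, and Hive's default tab-delimited output is kept as-is rather than converted to commas:

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical query/output pairs; hive -e runs one query and prints to stdout.
jobs = [
    ("SELECT * FROM table1", "csv1.csv"),
    ("SELECT * FROM table2", "csv2.csv"),
]

def run(query, outfile):
    with open(outfile, "w") as f:
        subprocess.run(["hive", "-e", query], stdout=f, check=True)

with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    for query, outfile in jobs:
        pool.submit(run, query, outfile)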

Subtract all pairs of values from two arrays

你。 submitted on 2019-12-01 09:21:41
I have two vectors, v1 and v2. I'd like to subtract each value of v2 from each value of v1 and store the results in another vector. I also would like to work with very large vectors (e.g. 1e6 in size), so I think I should be using numpy for performance. Up until now I have:

import numpy
v1 = numpy.random.uniform(-1, 1, size=100)
v2 = numpy.random.uniform(-1, 1, size=100)
vdiff = []
for value in v1:
    vdiff.extend([value - v2])

This creates a list with 100 entries, each entry being an array of size 100. I don't know if this is the most efficient way to do this, though. I'd
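A hedged sketch of the usual vectorized alternative: broadcasting builds the full n-by-m difference matrix in one operation, with row i holding v1[i] - v2, and avoids the Python-level loop entirely:

import numpy

v1 = numpy.random.uniform(-1, 1, size=100)
v2 = numpy.random.uniform(-1, 1, size=100)
vdiff = v1[:, numpy.newaxis] - v2  # shape (100, 100); row i is v1[i] - v2
print(vdiff.shape)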

What are the differences between kappa-architecture and lambda-architecture

纵然是瞬间 submitted on 2019-12-01 08:30:33
If the Kappa architecture does its analysis directly on the stream instead of splitting the data into two paths, where is the data stored then? In a messaging system like Kafka, or can it be in a database for recomputing? And is a separate batch layer faster than recomputing with a stream-processing engine for batch analytics? "A very simple case to consider is when the algorithms applied to the real-time data and to the historical data are identical. Then it is clearly very beneficial to use the same code base to process historical and real-time data, and therefore to implement the use-case using the
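To make the quoted point concrete, a minimal sketch with hypothetical names and plain Python standing in for a real stream processor: in a Kappa-style setup the same function consumes both the replayed log and the live stream, so recomputation is just a replay of the log through identical code:

def process(events):
    # One code path for both historical and real-time data.
    counts = {}
    for user, action in events:
        counts[(user, action)] = counts.get((user, action), 0) + 1
    return counts

replayed_log = [("alice", "click"), ("bob", "view")]  # historical data, re-read from the log
live_stream = [("alice", "view")]                     # events as they arrive
print(process(replayed_log))
print(process(live_stream))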

Is it a good idea to generate per day collections in mongodb

时间秒杀一切 submitted on 2019-12-01 06:19:14
Is it a good idea to create per-day collections for data on a given day (we could start with per-day and then move to per-hour if there is too much data)? Is there a limit on the number of collections we can create in MongoDB, or does it result in performance loss (is it an overhead for MongoDB to maintain so many collections)? Does a large number of collections have any adverse effect on performance? To give you more context, the data will be more like Facebook feeds, and only the latest data (say the last week or month) is more important to us. Making per-day collections keeps the number of
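For the "only recent data matters" requirement, one commonly suggested alternative to per-day collections is a single collection with a TTL index, so MongoDB expires old documents itself. A minimal pymongo sketch, with hypothetical database and field names:

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod
db = client.mydb        # hypothetical database name

# Documents expire roughly 30 days after their created_at timestamp.
db.feeds.create_index("created_at", expireAfterSeconds=30 * 24 * 3600)
db.feeds.insert_one({
    "user": "alice",
    "text": "hello",
    "created_at": datetime.now(timezone.utc),
})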

How can I one hot encode multiple variables with big data in R?

*爱你&永不变心* submitted on 2019-12-01 06:09:32
Question: I currently have a dataframe with 260,000 rows and 50 columns, where 3 columns are numeric and the rest are categorical. I want to one-hot encode the categorical columns in order to perform PCA and use regression to predict the class. How can I accomplish something like the example below in R?

Example: V1 V2 V3 V4 V5 .... VN-1 VN becomes V1_a V1_b V2_a V2_b V2_c V3_a V3_b and so on.

Answer 1: You can use model.matrix or sparse.model.matrix. Something like this: sparse.model.matrix(~ . - 1, data = your_data
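For comparison only (the answer above is R; the other examples on this page are Python), a rough pandas analogue of the same sparse one-hot encoding, with hypothetical column names:

import pandas as pd

df = pd.DataFrame({
    "V1": ["a", "b", "a"],   # categorical
    "V2": ["a", "c", "b"],   # categorical
    "V3": [1.0, 2.0, 3.0],   # numeric, left untouched
})
encoded = pd.get_dummies(df, columns=["V1", "V2"], sparse=True)
print(encoded.columns.tolist())  # ['V3', 'V1_a', 'V1_b', 'V2_a', 'V2_b', 'V2_c']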

Iterating an RDD and updating a mutable collection returns an empty collection

不想你离开。 submitted on 2019-12-01 05:54:34
Question: I am new to Scala and Spark and would like some help in understanding why the code below isn't producing my desired outcome. I am comparing two tables. My desired output schema is: case class DiscrepancyData(fieldKey: String, fieldName: String, val1: String, val2: String, valExpected: String) When I run the code below step by step manually, I actually end up with my desired outcome, which is a List[DiscrepancyData] completely populated with my desired output. However, I must be missing something in
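The title's symptom usually comes from mutating a driver-side collection inside an RDD operation, which runs on the executors, not the driver. A hedged sketch in PySpark (not the asker's Scala) of the pitfall and the usual fix:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3])

results = []
rdd.foreach(lambda x: results.append(x))  # appends to executor-side copies
print(results)                            # [] on the driver

print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6]: transform, then collect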

What happens if an RDD can't fit into memory in Spark? [duplicate]

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-01 05:51:00
This question already has an answer here: What will Spark do if I don't have enough memory? (3 answers) As far as I know, Spark tries to do all computation in memory unless you call persist with a disk storage option. If, however, we don't use any persist, what does Spark do when an RDD doesn't fit in memory? What if we have very huge data? How will Spark handle it without crashing? Sachin Gaikwad: From the Apache Spark FAQ: Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either
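A minimal sketch of explicitly opting into spill-friendly caching, for the case where you do persist and want partitions written to disk rather than dropped when memory runs out (PySpark, hypothetical job):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("spill").getOrCreate()
rdd = spark.sparkContext.range(0, 10_000_000)

# Partitions that don't fit in memory are spilled to disk instead of dropped.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(rdd.count())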

Finding Minimum hamming distance of a set of strings in python

一个人想着一个人 submitted on 2019-12-01 04:58:56
I have a set of n (~1000000) strings (DNA sequences) stored in a list trans. I have to find the minimum Hamming distance of all sequences in the list. I implemented a naive brute-force algorithm, which has been running for more than a day and has not yet given a solution. My code is:

dmin = len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i + 1, len(trans)):
        dist = hamdist(trans[i][:-1], trans[j][:-1])
        if dist < dmin:
            dmin = dist

Is there a more efficient method to do this? Here hamdist is a function I wrote to find Hamming distances. It is:

def hamdist(str1, str2):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1), len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
            diffs += 1
    return diffs
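A hedged sketch of the same O(n^2) comparison vectorized with NumPy: pack the (assumed equal-length) sequences into a byte matrix and compare each row against all later rows at once, stopping early if a distance of 0 is found:

import numpy as np

def min_hamming(seqs):
    # Assumes all sequences have the same length.
    arr = np.frombuffer("".join(seqs).encode(), dtype=np.uint8)
    arr = arr.reshape(len(seqs), -1)
    best = arr.shape[1]
    for i in range(len(arr) - 1):
        d = (arr[i + 1:] != arr[i]).sum(axis=1).min()
        if d < best:
            best = d
            if best == 0:
                break
    return best

print(min_hamming(["ACGT", "ACGA", "TTTT"]))  # 1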

How to convert a Date String from UTC to Specific TimeZone in HIVE?

好久不见. submitted on 2019-12-01 04:09:39
My Hive table has a date column with UTC date strings. I want to get all rows for a specific EST date. I am trying to do something like the below:

SELECT * FROM TableName T WHERE TO_DATE(ConvertToESTTimeZone(T.date)) = "2014-01-12"

I want to know if there is a function for ConvertToESTTimeZone, or how I can achieve that. I tried the following, but it doesn't work (my default timezone is CST):

TO_DATE(from_utc_timestamp(T.Date)) = "2014-01-12"
TO_DATE(from_utc_timestamp(to_utc_timestamp(unix_timestamp(T.date), 'CST'), 'EST'))

Thanks in advance. Update: Strange behavior. When I do this: select
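Away from Hive for a moment, a small Python sketch of the conversion being attempted, to show why the EST calendar date can differ from the UTC one (hypothetical timestamp value; zoneinfo needs Python 3.9+):

from datetime import datetime
from zoneinfo import ZoneInfo

utc_str = "2014-01-12 03:30:00"  # hypothetical UTC string from the table
dt_utc = datetime.strptime(utc_str, "%Y-%m-%d %H:%M:%S").replace(tzinfo=ZoneInfo("UTC"))
dt_est = dt_utc.astimezone(ZoneInfo("America/New_York"))
print(dt_est.date())  # 2014-01-11: a different calendar date than the UTC string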