bigdata

Spark - Joining 2 PairRDD elements

Submitted by 夙愿已清 on 2019-12-12 01:56:21
Question: I have a JavaPairRDD with 2 elements: ("TypeA", List<jsonTypeA>), ("TypeB", List<jsonTypeB>). I need to combine the 2 pairs into 1 pair of type ("TypeA_B", List<jsonCombinedAPlusB>), i.e. combine the 2 lists into 1 list, where each pair of JSONs (1 of type A and 1 of type B) shares some common field I can join on. Consider that the list of type A is significantly smaller than the other, and the join should be inner, so the result list should be as small as the list of type A. What is the most …
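The inner-join idea in the excerpt can be sketched in plain Java, independent of Spark: build a hash index from the smaller list, then probe it with the larger one. The class name `InnerJoinSketch` and the join field `"id"` are hypothetical stand-ins for the question's unnamed common field.

```java
import java.util.*;

public class InnerJoinSketch {
    // Joins two lists of records on a shared "id" field; the result size
    // is bounded by the smaller list (inner-join semantics).
    static List<Map<String, String>> join(List<Map<String, String>> small,
                                          List<Map<String, String>> large) {
        Map<String, Map<String, String>> index = new HashMap<>();
        for (Map<String, String> a : small) {
            index.put(a.get("id"), a);       // build lookup from the small side
        }
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> b : large) {
            Map<String, String> a = index.get(b.get("id"));
            if (a != null) {                 // inner join: keep matches only
                Map<String, String> merged = new HashMap<>(a);
                merged.putAll(b);
                out.add(merged);
            }
        }
        return out;
    }
}
```

In Spark terms this corresponds to broadcasting the small side rather than shuffling both lists, which is usually the cheaper plan when one side is much smaller.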

Build an undirected weighted graph by matching N vertices

Submitted by 别来无恙 on 2019-12-12 01:33:11
Question: I want to suggest the top 10 most compatible matches for a particular user, by comparing his/her 'interests' with the interests of all others. I'm building an undirected weighted graph between users, where the weight = match score between the two users. I already have a set of N users, S. For any user U in S, I have a set of interests I. After a long time (a week?) I create a new user U with a set of interests and add it to S. To generate a graph for this new user, I'm comparing interest …
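The excerpt never defines the "match score"; one common, minimal choice for set-valued interests is Jaccard similarity, sketched below. The class and method names are illustrative, not from the question.

```java
import java.util.*;

public class MatchScore {
    // Jaccard similarity between two interest sets: |A ∩ B| / |A ∪ B|,
    // a value in [0, 1] usable directly as an edge weight in the graph.
    static double score(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);                  // intersection
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                     // union
        return (double) inter.size() / union.size();
    }
}
```

With a score in hand, the top-10 suggestion reduces to computing this against all users in S and keeping the 10 largest weights.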

Oozie shell action issue while creating directories

Submitted by 落爺英雄遲暮 on 2019-12-12 01:29:37
Question: I am unable to add/delete any files or directories on HDFS from a shell script which I am executing from an Oozie workflow. The username is "scitest" and the HDFS path I am trying to edit/add/delete under is /user/scitest/. In the shell script I am trying to delete a folder named test123456 from the path /user/scitest/. Error from the Oozie log: 429737-oozie-oozi-W@shell-node] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1] 2016 …

historyserver not able to read log after enabling kerberos

Submitted by 旧城冷巷雨未停 on 2019-12-12 00:14:26
Question: I enabled Kerberos on the cluster and it is working fine. But due to some issue, the mapred user is not able to read and display logs on the JobHistory server. I checked the logs of the job history server and it gives an access error: org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=READ_EXECUTE, inode="/user/history/done_intermediate/prakul":prakul:hadoop:drwxrwx--- As we can see, the directory grants access to the hadoop group, and mapred is in the hadoop group, yet even then …

How do I make a large 3D array without running out of memory?

Submitted by 时光毁灭记忆、已成空白 on 2019-12-11 20:25:31
Question: I have the following method: public static void createGiantArray(int size) { int[][][] giantArray = new int[size][size][size]; } When I call it with a size of 10,000, like so: createGiantArray(10000); I get the following error: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space How can I create an array that is 10,000 x 10,000 x 10,000 while bypassing the memory exception? (I am aware my current method is pointless and loses scope. I didn't post the extra code that goes …
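The arithmetic explains why no heap setting will help: 10,000³ ints is 10¹² elements × 4 bytes ≈ 4 TB. If most cells are zero (an assumption the question doesn't state), a sparse map over packed coordinates is one workable sketch; the class name `SparseInt3D` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseInt3D {
    // A dense 10,000^3 int array needs 10^12 * 4 bytes ≈ 4 TB of heap,
    // so store only the non-zero cells, keyed by a packed coordinate.
    private final Map<Long, Integer> cells = new HashMap<>();
    private final long size;

    SparseInt3D(long size) { this.size = size; }

    // Packs (x, y, z) into one long; for size = 10,000 the maximum key
    // is just under 10^12, well within long range.
    private long key(long x, long y, long z) { return (x * size + y) * size + z; }

    void set(int x, int y, int z, int v) {
        if (v == 0) cells.remove(key(x, y, z));   // don't store default values
        else cells.put(key(x, y, z), v);
    }

    int get(int x, int y, int z) {
        return cells.getOrDefault(key(x, y, z), 0); // unset cells read as 0
    }
}
```

Memory then scales with the number of non-zero entries rather than with size³; for truly dense data, the honest answer is that the array has to live on disk or across machines, not in one heap.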

Reading a CSV file, looping through the rows, using connections

Submitted by 泪湿孤枕 on 2019-12-11 17:37:12
Question: So I have a large CSV file that my computer cannot open without RStudio terminating. To solve this, I am trying to iterate through the rows of the file in order to do my calculations one row at a time, storing each value before moving on to the next row. On a smaller file I can normally achieve this by simply reading and storing the whole CSV file within RStudio and running a simple for loop. It is, however, the size of this stored data that I am trying …
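The question is about R, but the row-at-a-time pattern it is reaching for is language-agnostic: stream the file through a buffered reader and keep only one row in memory. A minimal Java sketch, assuming a numeric first column and a header row (both assumptions, since the excerpt doesn't describe the data):

```java
import java.io.*;

public class CsvRowSum {
    // Streams CSV input one row at a time, accumulating a running result
    // without ever holding the whole file in memory.
    static double sumFirstColumn(BufferedReader reader) {
        double total = 0;
        try {
            String line = reader.readLine();            // skip the header row
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                total += Double.parseDouble(fields[0]); // process this row only
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

In R itself the analogous tools are `readLines` with a connection and a chunk size, or `read.csv` on an open connection, so each iteration reads only the next batch of rows.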

How to iterate among text in the for loop and find count of a particular text in MapReduce()

Submitted by 醉酒当歌 on 2019-12-11 16:16:23
Question: So here is a piece of reduce() code for a particular dataset which has a bunch of designations as the 'key' and the salary of a particular named person with that designation as the 'value': public static class ReduceEmployee extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } }
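The reducer above sums the values for each designation; the title asks for a count instead. The change is simply to increment once per value rather than add the value itself. Below is a plain-Java analogue of that loop (no Hadoop on the classpath here), with grouped input standing in for the reducer's Iterable<IntWritable>:

```java
import java.util.*;

public class DesignationCount {
    // Reduce-side sketch: for each designation key, count how many salary
    // records arrived instead of summing them. In a real Hadoop reducer,
    // this inner loop runs over Iterable<IntWritable> values and the
    // result is written via context.write(key, new IntWritable(count)).
    static Map<String, Integer> countPerKey(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int count = 0;
            for (Integer ignored : e.getValue()) count++;  // count, don't sum
            counts.put(e.getKey(), count);
        }
        return counts;
    }
}
```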

Forecasting using Multiple Regression in BigQuery

Submitted by 假如想象 on 2019-12-11 15:56:45
Question: It's a pity Google BigQuery still doesn't have a function such as forecast() that we see in Spreadsheets. Don't look down on it yet; given the statistical know-how, a surprising amount of smoothing and seasonality can be added to forecasting in spreadsheets. BigQuery lets you compute standard deviation, correlation, and intercept metrics. Using those, one can create a prediction model (refer to this and this). But that uses a linear regression model, so we are not happy with the seasonality …
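For reference, the linear model the excerpt describes (built from standard deviation, correlation, and intercept) is ordinary least squares, which can be written out directly. A Java sketch of the same computation, with illustrative names:

```java
public class LinearForecast {
    // Ordinary least squares over (x, y) pairs: slope = cov(x,y) / var(x),
    // intercept = mean(y) - slope * mean(x). This is the same model the
    // STDDEV/CORR-based BigQuery approach produces.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double cov = 0, var = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            var += (x[i] - mx) * (x[i] - mx);
        }
        double slope = cov / var;
        return new double[] { slope, my - slope * mx };  // {slope, intercept}
    }

    static double forecast(double[] coef, double x) {
        return coef[0] * x + coef[1];                    // point prediction
    }
}
```

Seasonality, as the question notes, needs more than this: either deseasonalize the series first (divide out per-period indices, fit the trend, re-apply the indices) or move to a richer model.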

HIVE very long field gives OOM Heap

Submitted by 冷暖自知 on 2019-12-11 15:27:12
Question: We are storing string fields whose length varies from small (a few kB) to very long (<400 MB) in a HIVE table. Now we are facing an OOM when copying data from one table to another (without any conditions or joins), which is not exactly what we run in production, but it is the simplest use case where the problem occurs. So the HQL is basically just: INSERT INTO new_table SELECT * FROM old_table; The container and Java heap were set to 16 GB, and we tried different file formats …

Algorithm for finding trends in data? [closed]

Submitted by 白昼怎懂夜的黑 on 2019-12-11 14:14:23
Question: [Closed as needing more focus; not accepting answers.] I'm looking for an algorithm that is able to find trends in large amounts of data. For instance, given a time t and a variable x as pairs (t, x), with input such as {(1,1), (2,4), (3,9), (4,16)}, it should be able to figure out that the value of x for t=5 is 25. How is this …
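The worked example in the excerpt ({(1,1), (2,4), (3,9), (4,16)} → 25) is exactly what finite differences handle: for a polynomial trend of degree k, the k-th differences are constant, and summing the last entry of each difference row yields the next term. A sketch, with an illustrative class name:

```java
public class DifferenceExtrapolate {
    // Extrapolates the next term of a sequence whose k-th finite
    // differences become constant (i.e., a degree-k polynomial trend).
    static long next(long[] seq) {
        int n = seq.length;
        long[][] diffs = new long[n][];
        diffs[0] = seq.clone();
        int depth = 0;
        // Build difference rows until a row is constant (or rows run out).
        while (depth < n - 1 && !isConstant(diffs[depth])) {
            long[] prev = diffs[depth];
            long[] d = new long[prev.length - 1];
            for (int i = 0; i < d.length; i++) d[i] = prev[i + 1] - prev[i];
            diffs[++depth] = d;
        }
        // The next term is the sum of the last element of each row:
        // for {1,4,9,16} the rows end in 16, 7, 2, giving 16 + 7 + 2 = 25.
        long next = 0;
        for (int k = 0; k <= depth; k++) next += diffs[k][diffs[k].length - 1];
        return next;
    }

    private static boolean isConstant(long[] a) {
        for (int i = 1; i < a.length; i++) if (a[i] != a[0]) return false;
        return true;
    }
}
```

This only captures polynomial trends in noiseless data; for noisy real-world series, least-squares fitting or a smoothing method is the usual next step, which is likely why the question was closed as too broad.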