bigdata

How to get the specified output without combineByKey and aggregateByKey in spark RDD

Submitted by ≯℡__Kan透↙ on 2019-12-13 22:56:21
Question: Below is my data: val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D"). Now I want the below types of output, but without using combineByKey and aggregateByKey: 1) Array[(String, Int)] = Array((foo,5), (bar,3)) 2) Array((foo,Set(B, A)), (bar,Set(C, D))) Below is my attempt: scala> val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", | "bar=D", "bar=D") scala> val sample=keysWithValuesList.map(_.split("=")).map(p=>(p(0)
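One way to get both outputs without combineByKey or aggregateByKey is to fall back on reduceByKey and groupByKey. A minimal sketch, assuming the array is first turned into an RDD with sc.parallelize (the ordering returned by collect() is not guaranteed):

import org.apache.spark.SparkContext

def withoutCombineByKey(sc: SparkContext, keysWithValuesList: Array[String]): Unit = {
  val pairs = sc.parallelize(keysWithValuesList)
    .map(_.split("="))
    .map(p => (p(0), p(1)))                    // ("foo","A"), ("foo","A"), ...

  // 1) occurrences per key, e.g. Array((foo,5), (bar,3))
  val countsPerKey = pairs.mapValues(_ => 1).reduceByKey(_ + _).collect()

  // 2) distinct values per key, e.g. Array((foo,Set(A, B)), (bar,Set(C, D)))
  val distinctValuesPerKey = pairs.groupByKey().mapValues(_.toSet).collect()

  countsPerKey.foreach(println)
  distinctValuesPerKey.foreach(println)
}

reduceByKey keeps the aggregation on the map side; groupByKey shuffles all values, which is acceptable for the second output because only the distinct set per key is kept afterwards.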

Updating unique id column for newly added records in table in hive

Submitted by 荒凉一梦 on 2019-12-13 20:14:55
Question: I have a table in which I want a unique identifier to be added automatically as a new record is inserted into it. Assume the column for the unique identifier is already created. Answer 1: Hive can't update the table, but you can create a temporary table or overwrite your first table. You can also use the concat function to join two different columns or strings. Here is the example - function: concat(string A, string B…) return: string hive> select concat('abc','def','gh') from dual; abcdefgh HQL &result

Handling a very big dataframe

Submitted by 半世苍凉 on 2019-12-13 20:05:30
Question: Right now I'm having trouble working out how to process my data and transform it into a dataframe. Basically what I'm trying to do is to read the data first: data = pd.read_csv(querylog, sep=" ", header=None) then group it: query_group = data.groupby('Query') ip_group = data.groupby('IP') and lastly create a blank dataframe to map their values: df = pd.DataFrame(columns=query_group.groups, index=range(0, len(ip_group.groups))) index = 0 for name, group in ip_group: df.set_value(index, 'IP', name) index +

Oozie Workflow EL function timestamp() does not give seconds

Submitted by 北慕城南 on 2019-12-13 19:24:14
Question: I have the following Oozie workflow: <workflow-app name="${workflow_name}" xmlns="uri:oozie:workflow:0.4"> <global> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <configuration> <property> <name>mapred.job.queue.name</name> <value>${launcherQueueName}</value> </property> <property> <name>mapred.queue.name</name> <value>${launcherQueueName}</value> </property> </configuration> </global> <start to="email-1" /> <action name="email-1"> <email xmlns="uri:oozie:email

SQL Update large table

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-13 18:42:28
Question: I have a question. I need to update two large tables - t_contact (170 million rows) and t_participants (11 million rows). These tables both have the column CUSTOMER_ID. Some of these IDs are wrong and I need to update them; there are about 140 thousand wrong IDs. I understand that if I use UPDATE TABLE it will take a long time, but these two tables mustn't be unavailable for long. What should I do? Answer 1: If you have the wrong IDs stored somewhere, you should use MERGE: MERGE INTO t_contact D USING

How to shuffle a text file on disk in Python

Submitted by 旧街凉风 on 2019-12-13 17:28:36
Question: I am working with a text file of about 12*10^6 rows which is stored on my hard disk. The structure of the file is: data|data|data|...|data\n data|data|data|...|data\n data|data|data|...|data\n ... data|data|data|...|data\n There's no header, and there's no id to uniquely identify the rows. Since I want to use it for machine learning purposes, I need to make sure that there's no order in the text file which may affect the stochastic learning. Usually I upload such files into memory,

Efficiently calculating a segmented regression on a large dataset

Submitted by 我的梦境 on 2019-12-13 16:13:23
Question: I currently have a large data set for which I need to calculate a segmented regression (or fit a piecewise linear function in some similar way). However, I have both a large data set and a very large number of pieces. Currently I have the following approach: Let s_i be the end of segment i. Let (x_i, y_i) denote the i-th data point. Assume the data point x_k lies within segment j; then I can create a vector from x_k as (s_1, s_2 - s_1, s_3 - s_2, ..., x_k - s_{j-1}, 0, 0, ...). To do a
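Restating that construction in one formula (the intercept β_0 and per-segment slope coefficients β_i below are my notation, not from the original post): for x_k lying in segment j,

\phi(x_k) = \bigl(s_1,\ s_2 - s_1,\ \dots,\ s_{j-1} - s_{j-2},\ x_k - s_{j-1},\ 0,\ \dots,\ 0\bigr), \qquad \hat{y}_k = \beta_0 + \sum_{i=1}^{j} \beta_i\, \phi(x_k)_i

Each nonzero entry of φ(x_k) is the length of a fully covered segment, with the last one being the partial length inside segment j, so the dot product with per-segment slopes β_i accumulates the rise over every segment and yields a fit that is continuous and piecewise linear in x_k.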

Syntax to Sqoop import 5 out of 100 tables present in database - don't use exclude keyword?

Submitted by 懵懂的女人 on 2019-12-13 09:11:15
Question: I have 100 tables in a database. I want to import only 5 tables. I can't/don't want to use the "--exclude" option. Answer 1: This can be done with a shell script. 1) Prepare an input file which has the list of 5 DBNAME.TABLENAME entries. 2) The shell script takes this file as input, iterates line by line, and executes a sqoop statement for each line. while read line; do DBNAME=`echo $line | cut -d'.' -f1` tableName=`echo $line | cut -d'.' -f2` sqoop import -Dmapreduce.job.queuename=$RM_QUEUE_NAME --connect '$JDBC_URL

What is the difference between Hbase checkAndPut and checkAndMutate?

Submitted by 偶尔善良 on 2019-12-13 08:43:35
Question: In HBase 1.2.4, what is the difference between checkAndPut and checkAndMutate? Answer 1: checkAndPut - compares the given value with the current value in HBase according to the passed CompareOp (e.g. CompareOp=EQUAL) and applies the Put only if the expected value matches. checkAndMutate - compares the given value with the current value in HBase according to the passed CompareOp (e.g. CompareOp=EQUAL) and applies the RowMutations only if the expected value matches. You can add multiple Put and Delete
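A minimal Scala sketch of the distinction against the HBase 1.x Table API (the table handle, row key, and column names here are made up for illustration, not taken from the original answer): checkAndPut guards a single Put, while checkAndMutate guards a whole RowMutations batch, so several Puts/Deletes succeed or fail together under one check.

import org.apache.hadoop.hbase.client.{Delete, Put, RowMutations, Table}
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp
import org.apache.hadoop.hbase.util.Bytes

def demo(table: Table): Unit = {
  val row  = Bytes.toBytes("row1")
  val cf   = Bytes.toBytes("cf")
  val qual = Bytes.toBytes("status")

  // checkAndPut: apply ONE Put, but only if cf:status currently equals "old"
  val put = new Put(row).addColumn(cf, qual, Bytes.toBytes("new"))
  val putApplied = table.checkAndPut(row, cf, qual, Bytes.toBytes("old"), put)

  // checkAndMutate: apply SEVERAL mutations atomically under the same check
  val mutations = new RowMutations(row)
  mutations.add(new Put(row).addColumn(cf, Bytes.toBytes("updated_by"), Bytes.toBytes("job42")))
  mutations.add(new Delete(row).addColumns(cf, Bytes.toBytes("stale_flag")))
  val batchApplied = table.checkAndMutate(row, cf, qual, CompareOp.EQUAL, Bytes.toBytes("old"), mutations)

  println(s"put applied: $putApplied, batch applied: $batchApplied")
}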

Retrieving nth qualifier in hbase using java

Submitted by 不问归期 on 2019-12-13 08:31:44
Question: This question is quite out of the box, but I need it. In a List (collection), we can retrieve the nth element with list.get(i); similarly, is there any method in HBase, using the Java API, where I can get the nth qualifier given the row id and column family name? NOTE: I have a million qualifiers in a single row in a single column family. Answer 1: Sorry for being unresponsive. Busy with something important. Try this for right now: package org.myorg.hbasedemo; import java.io.IOException; import java.util
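The answer's code is cut off above. As an alternative sketch (not the original answer's code), HBase ships a ColumnPaginationFilter(limit, offset) that fetches a fixed number of columns starting at a given offset server-side, which avoids pulling a million qualifiers to the client; the helper name and its table/row/family parameters below are hypothetical.

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.{Get, Table}
import org.apache.hadoop.hbase.filter.ColumnPaginationFilter
import org.apache.hadoop.hbase.util.Bytes

// Returns the qualifier at 0-based position n within one column family of one row,
// without materializing every qualifier on the client.
def nthQualifier(table: Table, rowKey: String, family: String, n: Int): Option[String] = {
  val get = new Get(Bytes.toBytes(rowKey))
  get.addFamily(Bytes.toBytes(family))
  get.setFilter(new ColumnPaginationFilter(1, n))   // limit = 1 column, starting at offset n
  val result = table.get(get)
  result.rawCells().headOption.map(cell => Bytes.toString(CellUtil.cloneQualifier(cell)))
}

Note that "the nth qualifier" is only well defined with respect to the qualifiers' byte-wise sort order, since that is the order in which HBase stores and pages through them.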