apache-pig

How to cancel a command in the Grunt shell

Submitted by 一世执手 on 2019-12-21 20:31:07
Question: This is probably a more general question: many tools in Linux have their own shells. In my case, I use Pig and HBase. Sometimes a command executed in the shell returns a lot of results and I want to cancel it. Say, for example, you run cat 'a.txt' and that file is huge. What's the best way to cancel it without exiting the shell? If I press Ctrl+C, it exits the shell.

Answer 1: kill <job_id> will kill a MapReduce job with the specified id. It's not exactly what you are looking for…
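
For reference, a minimal sketch of the kill workaround mentioned in the answer. Note that kill only aborts MapReduce jobs launched by statements such as DUMP or STORE; it does not help with fs-style commands like cat, and the job id below is hypothetical (Grunt prints the real one when the job is submitted):

    grunt> A = LOAD 'a.txt';
    grunt> DUMP A;                          -- launches a MapReduce job; note the job id it prints
    -- from a second Grunt session (the first one is busy streaming output):
    grunt> kill job_201912210001_0042       -- hypothetical job id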

Using Hive with Pig

Submitted by 百般思念 on 2019-12-21 19:23:55
Question: My Hive query has multiple outer joins and takes very long to execute. I was wondering whether it would make sense to break it into multiple smaller queries and use Pig to do the transformations. Is there a way I could query Hive tables, or read Hive table data, within a Pig script? Thanks.

Answer 1: The goal of the Howl project is to allow Pig and Hive to share a single metadata repository. Once Howl is mature, you'll be able to run Pig Latin and HiveQL queries over the same tables. For now, you can try…
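
Howl later grew into HCatalog, which is the usual Pig/Hive bridge today. A minimal sketch of reading a Hive table from a Pig script through it (the database/table names are hypothetical, and the loader's package name varies between HCatalog releases):

    -- start Pig with HCatalog on the classpath: pig -useHCatalog
    views  = LOAD 'my_db.page_views' USING org.apache.hive.hcatalog.pig.HCatLoader();
    recent = FILTER views BY view_date >= '2012-02-01';
    DUMP recent;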

Still getting “Unable to load realm info from SCDynamicStore” after bug fix

Submitted by 混江龙づ霸主 on 2019-12-21 07:41:05
Question: I installed Hadoop and Pig using brew install hadoop and brew install pig. I read here that you will get the Unable to load realm info from SCDynamicStore error message unless you add:

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

to your hadoop-env.sh file, which I have. However, when I run hadoop namenode -format, I still see:

java[1548:1703] Unable to load realm info from SCDynamicStore

amongst the output. Anyone know why I…
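
For reference, a minimal sketch of where the workaround quoted in the question is supposed to live (the exact path depends on the Homebrew layout, and the realm/KDC values are simply the ones quoted above, not values specific to your machine):

    # $HADOOP_CONF_DIR/hadoop-env.sh
    export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

    # then re-run the command that showed the warning
    hadoop namenode -format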

Pig Latin: bag to tuple after GROUP BY

Submitted by 与世无争的帅哥 on 2019-12-21 05:36:18
Question: I have the following data with schema (t0: chararray, t1: int, t2: int):

(B,4,2)
(A,2,3)
(A,3,2)
(B,2,2)
(A,1,2)
(B,1,2)

I'd like to generate the following results (grouped by t0 and ordered by t1):

(A,((1,2),(2,3),(3,2)))
(B,((1,2),(2,2),(4,2)))

Please note I want tuples in the second component, not bags. Please help.

Answer 1: You should be able to do it like this:

-- A: (t0: chararray, t1: int, t2: int)
B = GROUP A BY t0;
C = FOREACH B {
    -- Project out the first column of A.
    projected =…
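
The answer is cut off above; a complete sketch along the same lines, assuming Pig 0.11+ for the built-in BagToTuple function (relation and file names are hypothetical), might look like the following. Note that BagToTuple flattens all fields of the bag into one tuple, e.g. (A,(1,2,2,3,3,2)); producing the exact nested form ((1,2),(2,3),(3,2)) shown in the question would still need a small custom UDF.

    A = LOAD 'data.txt' USING PigStorage(',') AS (t0:chararray, t1:int, t2:int);
    B = GROUP A BY t0;
    C = FOREACH B {
        -- keep only t1 and t2, then sort within the group by t1
        projected = A.(t1, t2);
        ordered   = ORDER projected BY t1;
        GENERATE group, BagToTuple(ordered);
    };
    DUMP C;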

How to use Cassandra's MapReduce with or without Pig?

Submitted by 拥有回忆 on 2019-12-21 03:29:08
Question: Can someone explain how MapReduce works with Cassandra 0.6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client" end. https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/ For instance, let's say I'm using Python and Pycassa: how would I load in a new MapReduce function, and then call it? Does my MapReduce function have to be Java that's installed on the Cassandra server? If so, how do I call it from Pycassa?…
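
On the "with Pig" side, Cassandra's contrib tree also shipped a Pig load function, so a job can be expressed in Pig Latin instead of hand-written Java. A heavily hedged sketch (the keyspace/column-family names are hypothetical, and the jar name, loader class, and cassandra:// URL format vary between Cassandra versions):

    REGISTER /path/to/cassandra-pig-contrib.jar;     -- jar built from contrib/pig
    rows    = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage();
    grouped = GROUP rows ALL;
    total   = FOREACH grouped GENERATE COUNT(rows);
    DUMP total;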

Access an HDFS file from a UDF

Submitted by 笑着哭i on 2019-12-21 02:46:11
Question: I'd like to access a file from my UDF call. This is my script:

files = LOAD '$docs_in' USING PigStorage(';') AS (id, stopwords, id2, file);
buzz = FOREACH files GENERATE pigbuzz.Buzz(file, id) as file:bag{(year:chararray, word:chararray, count:long)};

The jar is registered. The path is relative to my HDFS, where the files really exist. The call is made, but it seems that the file is not discovered, maybe because I'm trying to access the file on HDFS. How can I access a file in HDFS from my UDF…
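
One common approach is to open the file through Hadoop's FileSystem API instead of java.io, resolving the path against the job's configuration. A minimal sketch, not the asker's pigbuzz.Buzz implementation; the class name, return type, and HDFS path are all hypothetical:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.util.UDFContext;

    public class HdfsFileReader extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            // Use the job configuration so the path resolves against HDFS, not the local disk.
            Configuration conf = UDFContext.getUDFContext().getJobConf();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/me/stopwords.txt");   // hypothetical HDFS path
            StringBuilder sb = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        }
    }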

How can I partition a table with Hive?

Submitted by 走远了吗. on 2019-12-21 01:42:31
Question: I've been playing with Hive for a few days now, but I still have a hard time with partitions. I've been recording Apache logs (combined format) in Hadoop for a few months. They are stored in raw text format, partitioned by date (via Flume):

/logs/yyyy/mm/dd/hh/*

Example:

/logs/2012/02/10/00/Part01xx (02/10/2012 12:00 am)
/logs/2012/02/10/00/Part02xx
/logs/2012/02/10/13/Part0xxx (02/10/2012 01:00 pm)

The date in the combined log file follows this format: [10/Feb/2012:00:00:00 -0800]. How can I…
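
One common way to handle a layout like this is an external table whose partitions point at the existing directories, added one per hour. A minimal HiveQL sketch (the table and column names are hypothetical, and each existing directory has to be registered, typically from a generated script):

    CREATE EXTERNAL TABLE apache_logs (
      log_line STRING
    )
    PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING);

    -- register one of the existing directories as a partition
    ALTER TABLE apache_logs ADD PARTITION (year='2012', month='02', day='10', hour='00')
    LOCATION '/logs/2012/02/10/00';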

Pig: Control number of mappers

Submitted by こ雲淡風輕ζ on 2019-12-20 16:31:57
Question: I can control the number of reducers by using the PARALLEL clause in the statements that result in reducers. I want to control the number of mappers. The data source is already created, and I cannot reduce the number of parts in the data source. Is it possible to control the number of maps spawned by my Pig statements? Can I keep a lower and upper cap on the number of maps spawned? Is it a good idea to control this? I tried using pig.maxCombinedSplitSize, mapred.min.split.size, mapred…
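
For reference, properties like these are usually set at the top of the Pig script (or passed with -D on the command line). A minimal sketch, with byte sizes chosen purely for illustration:

    -- combine small input files so fewer map tasks are spawned
    SET pig.splitCombination true;             -- split combining is on by default in recent Pig versions
    SET pig.maxCombinedSplitSize 268435456;    -- cap one combined split (one mapper's input) at 256 MB
    SET mapred.min.split.size 134217728;       -- ask for splits of at least 128 MB

    data = LOAD '/data/source' USING PigStorage('\t');
    DUMP data;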

Debugging in a Pig UDF

Submitted by 大城市里の小女人 on 2019-12-20 14:06:30
Question: I am new to Hadoop/Pig and have a basic question: is there a logging facility in Pig UDFs? I have written a UDF whose flow I need to verify, so I would like to log certain statements. Is a logging facility available? If yes, where are the Pig logs kept?

Answer 1: Assuming your UDF extends EvalFunc, you can use the Logger returned from EvalFunc.getLogger(). The log output should be visible in the associated map/reduce task that Pig executes (if the job executes in more than a single…
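
A minimal sketch of the approach described in the answer (the UDF itself is hypothetical; getLogger() returns a commons-logging Log in the Pig versions this assumes):

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class UpperCase extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                // appears in the map/reduce task logs (or on the console in local mode)
                getLogger().warn("UpperCase called with an empty tuple");
                return null;
            }
            String value = (String) input.get(0);
            getLogger().info("UpperCase processing: " + value);
            return value == null ? null : value.toUpperCase();
        }
    }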