apache-pig

How to cancel a command in the Grunt shell

Submitted by 一世执手 on 2019-12-21 20:31:07
Question: This is probably a more general question: many tools in Linux have their own shells. In my case, I use Pig and HBase. Sometimes a command executed in the shell returns a lot of results and I want to cancel it. Say, for example, you run cat 'a.txt' and that file is huge. What's the best way to cancel it without exiting the shell? If I press Ctrl+C, it exits the shell.

Answer 1: kill <job_id> will kill a MapReduce job with the specified id. It's not exactly what you are looking for…
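
For reference, a minimal sketch of the kill workaround mentioned in the answer. Note that kill only aborts MapReduce jobs launched by statements such as DUMP or STORE; it does not help with fs-style commands like cat, and the job id below is hypothetical (Grunt prints the real one when the job is submitted):

    grunt> A = LOAD 'a.txt';
    grunt> DUMP A;                          -- launches a MapReduce job; note the job id it prints
    -- from a second Grunt session (the first one is busy streaming output):
    grunt> kill job_201912210001_0042       -- hypothetical job id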

Using Hive with Pig

Submitted by 百般思念 on 2019-12-21 19:23:55
Question: My Hive query has multiple outer joins and takes very long to execute. I was wondering whether it would make sense to break it into multiple smaller queries and use Pig to do the transformations. Is there a way I could query Hive tables, or read Hive table data, within a Pig script? Thanks.

Answer 1: The goal of the Howl project is to allow Pig and Hive to share a single metadata repository. Once Howl is mature, you'll be able to run Pig Latin and HiveQL queries over the same tables. For now, you can try…
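
Howl later grew into HCatalog, which is the usual Pig/Hive bridge today. A minimal sketch of reading a Hive table from a Pig script through it (the database/table names are hypothetical, and the loader's package name varies between HCatalog releases):

    -- start Pig with HCatalog on the classpath: pig -useHCatalog
    views  = LOAD 'my_db.page_views' USING org.apache.hive.hcatalog.pig.HCatLoader();
    recent = FILTER views BY view_date >= '2012-02-01';
    DUMP recent;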

Still getting “Unable to load realm info from SCDynamicStore” after bug fix

Submitted by 混江龙づ霸主 on 2019-12-21 07:41:05
Question: I installed Hadoop and Pig using brew install hadoop and brew install pig. I read here that you will get the Unable to load realm info from SCDynamicStore error message unless you add:

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

to your hadoop-env.sh file, which I have. However, when I run hadoop namenode -format, I still see:

java[1548:1703] Unable to load realm info from SCDynamicStore

amongst the output. Anyone know why I…
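
For reference, a minimal sketch of where the workaround quoted in the question is supposed to live (the exact path depends on the Homebrew layout, and the realm/KDC values are simply the ones quoted above, not values specific to your machine):

    # $HADOOP_CONF_DIR/hadoop-env.sh
    export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

    # then re-run the command that showed the warning
    hadoop namenode -format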

Pig Latin: bag to tuple after GROUP BY

Submitted by 与世无争的帅哥 on 2019-12-21 05:36:18
Question: I have the following data with schema (t0: chararray, t1: int, t2: int):

(B,4,2)
(A,2,3)
(A,3,2)
(B,2,2)
(A,1,2)
(B,1,2)

I'd like to generate the following results (grouped by t0 and ordered by t1):

(A,((1,2),(2,3),(3,2)))
(B,((1,2),(2,2),(4,2)))

Please note I want tuples in the second component, not bags. Please help.

Answer 1: You should be able to do it like this:

-- A: (t0: chararray, t1: int, t2: int)
B = GROUP A BY t0;
C = FOREACH B {
    -- Project out the first column of A.
    projected =…
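
The answer is cut off above; a complete sketch along the same lines, assuming Pig 0.11+ for the built-in BagToTuple function (relation and file names are hypothetical), might look like the following. Note that BagToTuple flattens all fields of the bag into one tuple, e.g. (A,(1,2,2,3,3,2)); producing the exact nested form ((1,2),(2,3),(3,2)) shown in the question would still need a small custom UDF.

    A = LOAD 'data.txt' USING PigStorage(',') AS (t0:chararray, t1:int, t2:int);
    B = GROUP A BY t0;
    C = FOREACH B {
        -- keep only t1 and t2, then sort within the group by t1
        projected = A.(t1, t2);
        ordered   = ORDER projected BY t1;
        GENERATE group, BagToTuple(ordered);
    };
    DUMP C;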

How to use Cassandra's MapReduce with or without Pig?

Submitted by 拥有回忆 on 2019-12-21 03:29:08
Question: Can someone explain how MapReduce works with Cassandra 0.6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client" end. https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/ For instance, let's say I'm using Python and Pycassa: how would I load in a new MapReduce function, and then call it? Does my MapReduce function have to be Java that's installed on the Cassandra server? If so, how do I call it from Pycassa?…
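
On the "with Pig" side, Cassandra's contrib tree also shipped a Pig load function, so a job can be expressed in Pig Latin instead of hand-written Java. A heavily hedged sketch (the keyspace/column-family names are hypothetical, and the jar name, loader class, and cassandra:// URL format vary between Cassandra versions):

    REGISTER /path/to/cassandra-pig-contrib.jar;     -- jar built from contrib/pig
    rows    = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage();
    grouped = GROUP rows ALL;
    total   = FOREACH grouped GENERATE COUNT(rows);
    DUMP total;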

Access an HDFS file from a UDF

Submitted by 笑着哭i on 2019-12-21 02:46:11
Question: I'd like to access a file from my UDF call. This is my script:

files = LOAD '$docs_in' USING PigStorage(';') AS (id, stopwords, id2, file);
buzz = FOREACH files GENERATE pigbuzz.Buzz(file, id) as file:bag{(year:chararray, word:chararray, count:long)};

The jar is registered. The path is relative to my HDFS, where the files really exist. The call is made, but it seems that the file is not discovered, maybe because I'm trying to access the file on HDFS. How can I access a file in HDFS from my UDF…
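
One common approach is to open the file through Hadoop's FileSystem API instead of java.io, resolving the path against the job's configuration. A minimal sketch, not the asker's pigbuzz.Buzz implementation; the class name, return type, and HDFS path are all hypothetical:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.util.UDFContext;

    public class HdfsFileReader extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            // Use the job configuration so the path resolves against HDFS, not the local disk.
            Configuration conf = UDFContext.getUDFContext().getJobConf();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/me/stopwords.txt");   // hypothetical HDFS path
            StringBuilder sb = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        }
    }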

How can I partition a table with Hive?

Submitted by 走远了吗. on 2019-12-21 01:42:31
Question: I've been playing with Hive for a few days now, but I still have a hard time with partitions. I've been recording Apache logs (combined format) in Hadoop for a few months. They are stored in raw text format, partitioned by date (via Flume):

/logs/yyyy/mm/dd/hh/*

Example:

/logs/2012/02/10/00/Part01xx (02/10/2012 12:00 am)
/logs/2012/02/10/00/Part02xx
/logs/2012/02/10/13/Part0xxx (02/10/2012 01:00 pm)

The date in the combined log file follows this format: [10/Feb/2012:00:00:00 -0800]. How can I…
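
One common way to handle a layout like this is an external table whose partitions point at the existing directories, added one per hour. A minimal HiveQL sketch (the table and column names are hypothetical, and each existing directory has to be registered, typically from a generated script):

    CREATE EXTERNAL TABLE apache_logs (
      log_line STRING
    )
    PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING);

    -- register one of the existing directories as a partition
    ALTER TABLE apache_logs ADD PARTITION (year='2012', month='02', day='10', hour='00')
    LOCATION '/logs/2012/02/10/00';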

Pig: Control number of mappers

Submitted by こ雲淡風輕ζ on 2019-12-20 16:31:57
Question: I can control the number of reducers by using the PARALLEL clause in the statements that result in reducers. I want to control the number of mappers. The data source is already created, and I cannot reduce the number of parts in the data source. Is it possible to control the number of maps spawned by my Pig statements? Can I keep a lower and upper cap on the number of maps spawned? Is it a good idea to control this? I tried using pig.maxCombinedSplitSize, mapred.min.split.size, mapred…
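
For reference, properties like these are usually set at the top of the Pig script (or passed with -D on the command line). A minimal sketch, with byte sizes chosen purely for illustration:

    -- combine small input files so fewer map tasks are spawned
    SET pig.splitCombination true;             -- split combining is on by default in recent Pig versions
    SET pig.maxCombinedSplitSize 268435456;    -- cap one combined split (one mapper's input) at 256 MB
    SET mapred.min.split.size 134217728;       -- ask for splits of at least 128 MB

    data = LOAD '/data/source' USING PigStorage('\t');
    DUMP data;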

Debugging in a Pig UDF

Submitted by 大城市里の小女人 on 2019-12-20 14:06:30
Question: I am new to Hadoop/Pig and have a basic question: is there a logging facility in Pig UDFs? I have written a UDF whose flow I need to verify, so I would like to log certain statements. Is a logging facility available? If yes, where are the Pig logs kept?

Answer 1: Assuming your UDF extends EvalFunc, you can use the Logger returned from EvalFunc.getLogger(). The log output should be visible in the associated map/reduce task that Pig executes (if the job executes in more than a single…
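
A minimal sketch of the approach described in the answer (the UDF itself is hypothetical; getLogger() returns a commons-logging Log in the Pig versions this assumes):

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class UpperCase extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                // appears in the map/reduce task logs (or on the console in local mode)
                getLogger().warn("UpperCase called with an empty tuple");
                return null;
            }
            String value = (String) input.get(0);
            getLogger().info("UpperCase processing: " + value);
            return value == null ? null : value.toUpperCase();
        }
    }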