apache-pig | 易学教程

Parsing a nested XML string from a Hive table using PIG

Question: I'm trying to use Pig to extract some XML from a field in a Hive table, rather than from an XML file (which is what most of the examples I have read assume). The XML comes from a table arranged as follows: ID, {XML_string}. The XML string contains n rows, each with at least one and up to 10 attributes. We can assume that attribute #1 will always be present and will be unique.

    <row>
        <att1></att1>
        <att2></att2>
        ...
    </row>
    <row>
        <att1></att1>
        <att2></att2>
        ...
    </row>
    ...

Composite key in Cassandra with Pig

Question: We have a CQL table that looks something like this:

    CREATE table data (
        occurday text,
        seqnumber int,
        occurtimems bigint,
        unique bigint,
        fields map<text, text>,
        primary key ((occurday, seqnumber), occurtimems, unique)
    )

I can query this table from cqlsh like this:

    select * from data where seqnumber = 10 AND occurday = '2013-10-01';

This query works and returns the expected data. If I execute this query as part of a LOAD from within Pig, however, things don't work.

    -- Need to URL encode the

How to FILTER Cassandra TimeUUID/UUID in Pig

Question: Here is my Cassandra schema, using Datastax Enterprise:

    CREATE KEYSPACE applications WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};
    USE applications;
    CREATE TABLE events(
        bucket text,
        id timeuuid,
        app_id uuid,
        event text,
        PRIMARY KEY(bucket, id)
    );

I want to FILTER in Pig by app_id (UUID) and id (TimeUUID). Here is my Pig script:

    events = LOAD 'cql://applications/events' USING CqlStorage() AS (bucket: chararray, id: chararray, app_id: chararray, event: chararray);

Heap Space Issue while Running a Pig Script

Question: I am trying to execute a Pig script on around 30 million records and I am getting the heap space error below:

    ERROR 2998: Unhandled internal error. Java heap space

    java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2367)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
        at java.lang.AbstractStringBuilder.append

Escaping a Dollar Sign in Pig?

Question: This wasn't a problem in 0.9.2, but in 0.10, when I try to access a key in a map that has a dollar sign in it, I get errors saying that I haven't defined the variable. Specifically:

    blah = FOREACH meh GENERATE source, json_post_id#'$id' AS post_id;

returns:

    Undefined parameter : id

That's fine and makes sense, but when I amend it to:

    blah = FOREACH meh GENERATE source, json_post_id#'\$id' AS post_id;

I get:

    Unexpected character '$'

Ideas?

[Edit] Forgot to mention: have tried with 2

Pig FILTER returns empty bag that I can't COUNT

Question: I'm trying to count how many values in a data set match a filter condition, but I'm running into issues when the filter matches no entries. There are a lot of columns in my data structure, but only three are relevant to this example:

- key: data key for the set (not unique)
- value: float value as recorded
- nominal_value: float representing the nominal value

Our use case right now is to find the number of values that are 10% or more below the nominal value. I'm doing something like this:
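The asker's script is cut off in this excerpt. As a hedged sketch of one common way to get a count of 0 instead of a missing row: group by key first, then filter inside a nested FOREACH, so that COUNT is applied to an empty bag. The file name `measurements.tsv` and the aliases `data`, `grouped`, `low` and `low_counts` are hypothetical; only the three column names come from the question.

```pig
-- Hypothetical input: tab-separated key, value, nominal_value.
data = LOAD 'measurements.tsv' USING PigStorage('\t')
       AS (key:chararray, value:float, nominal_value:float);

-- Group before filtering so every key keeps a row in the output.
grouped = GROUP data BY key;

low_counts = FOREACH grouped {
    -- Values that are 10% or more below the nominal value.
    low = FILTER data BY value <= 0.9 * nominal_value;
    -- COUNT runs on the (possibly empty) filtered bag for this key.
    GENERATE group AS key, COUNT(low) AS low_count;
};

DUMP low_counts;
```

Filtering before the GROUP drops keys with no matching rows entirely, which is why a count taken afterwards has nothing to report for them; filtering inside the nested block keeps the key and should report 0 for it.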

ERROR 1066: Unable to open iterator for alias in certain fields, but works for others

Question: I am unable to use my UDF on some fields, yet I can use it on others. If I use my first field, ipAddress, the UDF works as intended. However, if I change it to date, I get the 1066 error. Here is my script.

Pig script that works and calls the UDF:

    REGISTER myudfs.jar;
    DEFINE HOUR myudfs.HOUR;
    A = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray, dash2:chararray, date:chararray, date1:chararray, getRequset:chararray, location:chararray, http:chararray, code:int,

Apache Pig: unable to run my own pig.jar and pig-withouthadoop.jar

Question: I have a cluster running Hadoop 0.20.2 and Pig 0.10. I'd like to add some logging to Pig's source code and run my own Pig version on the cluster. What I did:

- built the project with the 'ant' command
- got pig.jar and pig-withouthadoop.jar
- copied the jars to the Pig home directory on the cluster's namenode
- ran a job

Then I got the following standard output:

    2013-03-25 06:35:05,226 [main] WARN org.apache.pig.backend.hadoop20.PigJobControl - falling back to default JobControl (not using hadoop 0.20 ?)

Pig in grunt mode

Question: I have installed Cygwin, Hadoop and Pig on Windows. The configuration seems OK, as I can run Pig scripts in batch and embedded mode. When I try to run Pig in grunt mode, something strange happens. Let me explain. I try to run a simple command like:

    grunt> A = load 'passwd' using PigStorage(':');

When I press Enter, nothing happens. The cursor goes to the next line and the grunt> prompt does not appear again. It is as if I were typing in a text editor. Has anything similar ever happened

properly loading datetime in pig

Question: I'm loading a TSV file with a datetime column and a long column with:

    A = LOAD 'tweets-clean.txt' USING PigStorage('\t') AS (date:datetime, userid:long);
    DUMP A;

An example line of input:

    Tue Feb 11 05:02:10 +0000 2014 205291417

The corresponding line of output:

    , 205291417

How do I do this properly?

Answer 1: You'd want to load date as a chararray (date:chararray) and then convert it to a datetime using FOREACH GENERATE along with the ToDate Pig built-in function. The format string is based on the
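The approach the answer describes can be sketched as follows, assuming the tab-separated input shown in the question and a Pig version that provides the ToDate built-in (0.11 or later). The format pattern is inferred from the sample input line rather than taken from the excerpt, and the field names `raw_date` and `created_at` and the alias `B` are hypothetical.

```pig
-- Load the date column as plain text first; a direct load as datetime
-- produced an empty field for this input in the question.
A = LOAD 'tweets-clean.txt' USING PigStorage('\t')
    AS (raw_date:chararray, userid:long);

-- Convert the text to a datetime. The pattern is an assumption meant to
-- match values such as: Tue Feb 11 05:02:10 +0000 2014
B = FOREACH A GENERATE
        ToDate(raw_date, 'EEE MMM dd HH:mm:ss Z yyyy') AS created_at,
        userid;

DUMP B;
```

If some rows do not match the pattern, the conversion will fail for them, so it is worth checking the result on a small sample before running the full job.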