Question
I am trying to execute a Pig script over around 30 million records and I am getting the heap space error below:
> ERROR 2998: Unhandled internal error. Java heap space
>
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2367)
> at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
> at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
> at java.lang.StringBuilder.append(StringBuilder.java:132)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.shiftStringByTabs(LogicalPlanPrinter.java:223)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:108)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirstLP(LogicalPlanPrinter.java:83)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.visit(LogicalPlanPrinter.java:69)
> at org.apache.pig.newplan.logical.relational.LogicalPlan.getLogicalPlanString(LogicalPlan.java:148)
> at org.apache.pig.newplan.logical.relational.LogicalPlan.getSignature(LogicalPlan.java:133)
> at org.apache.pig.PigServer.execute(PigServer.java:1295)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:375)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:353)
> at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
> at org.apache.pig.Main.run(Main.java:607)
> at org.apache.pig.Main.main(Main.java:156)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> ================================================================================
I ran the same code with 10 million records and it ran fine.
So what are the possible ways to avoid the above issue?
Does compression help in avoiding the heap space issue?
I have tried splitting the code into multiple fragments and I still get the error. So even if we increase the heap memory allocation, does that guarantee it will keep working when we run the same script against a larger volume of data?
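From the stack trace, the error seems to come from the Pig front end itself (PigServer / LogicalPlanPrinter) rather than from a map or reduce task, so I am assuming the setting that matters here is the client JVM heap. A minimal sketch of what I mean by "increasing the heap allocation", assuming the stock bin/pig launcher (which, as far as I know, reads PIG_HEAPSIZE in MB) and a hypothetical script name:
# give the Pig client JVM 4 GB of heap (value in MB) before launching the script
export PIG_HEAPSIZE=4096
pig myscript.pig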
Answer 1:
You can increase the number of mappers by setting mapred.map.tasks to any number you want, and then run your script.
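For example (a sketch; the number and script name are hypothetical, newer Hadoop versions call the property mapreduce.job.maps, and mapred.map.tasks is only a hint to the framework, not a hard setting):
# pass the property on the pig command line when launching the script
pig -Dmapred.map.tasks=100 myscript.pig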
Answer 2:
One reason for that could be that you have a gigantic line in your data that doesn't fit into memory.
So try to check for that. You can run this bash command on a node of your cluster:
hdfs dfs -cat '/some/path/to/hdfs.file' | awk '{if (length($0) > SOME_REASONABLE_VALUE) print $0}' > large_lines
If your data is not in a single file you can use a * wildcard, for example /some/path/to/hdfs.dir/part*.
Then you should check if there are some huge lines:
less large_lines
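A quicker variant of the same check, if you first want to see how long the longest line actually is before picking a threshold (same placeholder path as above):
hdfs dfs -cat '/some/path/to/hdfs.file' | awk '{ if (length($0) > max) max = length($0) } END { print "longest line:", max, "chars" }'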
Answer 3:
I guess you are appending some data to a global StringBuilder variable, either to build a summary of the results or to collect some logs.
If you need some output, just log/get a summary of the first x lines.
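If the summary is only needed for a quick look, another option is to not build it up in memory at all and simply cut the stored job output down afterwards (a sketch, with a hypothetical output path):
# keep only the first 100 lines of the job output as a summary
hdfs dfs -cat '/some/path/to/output/part*' | head -n 100 > summary.txt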
Source: https://stackoverflow.com/questions/31065622/heap-space-issue-while-running-a-pig-script