How to efficiently query a hive table in spark using hive context?

Submitted by 送分小仙女 on 2019-12-07 23:50:55

Question


I have a 1.6 TB Hive table with time-series data. I am using Hive 1.2.1 and Spark 1.6.1 with Scala.

Following is the query in my code, but it always fails with a Java OutOfMemoryError:

val sid_data_df = hiveContext.sql(s"SELECT time, total_field, sid, year, date FROM tablename WHERE sid = '$stationId' ORDER BY time LIMIT 4320000  ")

By iteratively selecting a few records at a time from the Hive table, I am trying to do a sliding window over the resulting DataFrame.

I have a cluster of 4 nodes, each with 122 GB of memory and 44 vCores; I am using 425 GB of the 488 GB available. I am launching spark-submit with the following parameters:

--num-executors 16 --driver-memory 4g --executor-memory 22G --executor-cores 10 \
--conf "spark.sql.shuffle.partitions=1800" \
--conf "spark.shuffle.memory.fraction=0.6" \
--conf "spark.storage.memoryFraction=0.4" \
--conf "spark.yarn.executor.memoryOverhead=2600" \
--conf "spark.yarn.nodemanager.resource.memory-mb=123880" \
--conf "spark.yarn.nodemanager.resource.cpu-vcores=43"

Kindly give me suggestions on how to optimize this and successfully fetch the data from the Hive table.

Thanks


Answer 1:


The problem is likely here:

LIMIT 4320000

You should avoid using LIMIT to subset a large number of records. In Spark, LIMIT moves all of the requested rows to a single partition and is likely to cause serious performance and stability issues.

See, for example, How to optimize below spark code (scala)?
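
To illustrate the alternative, here is a minimal sketch against the Spark 1.6 HiveContext from the question. It drops the global ORDER BY ... LIMIT and bounds the scan with predicates instead; the date range is a hypothetical placeholder, not something from the question:

// Sketch: bound the scan with predicates (pushed down to Hive, and able to
// prune partitions if the table is partitioned by year/date) instead of a
// global ORDER BY ... LIMIT. The date range below is a hypothetical example.
val sid_data_df = hiveContext.sql(
  s"""SELECT time, total_field, sid, year, date
     |FROM tablename
     |WHERE sid = '$stationId'
     |AND date BETWEEN '2017-01-01' AND '2017-01-31'
     |""".stripMargin)

// Sort each partition locally (sortWithinPartitions is available since
// Spark 1.6); unlike ORDER BY ... LIMIT, this does not funnel every row
// through a single partition.
val sorted = sid_data_df.sortWithinPartitions("time")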

By iteratively selecting a few records at a time from the Hive table, I am trying to do a sliding window over the resulting DataFrame.

This doesn't sound right. Sliding-window operations can usually be achieved with some combination of window functions and timestamp-based window buckets.
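
As a concrete sketch, a row-based sliding window with Spark SQL window functions (which require HiveContext in Spark 1.6, as already used in the question) could look like the following; the 720-row window and the rolling_avg column are arbitrary examples, not taken from the question:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

// Partition the series per station, order by time, and aggregate over the
// current row plus the 719 preceding rows (the window size is an example).
val w = Window
  .partitionBy("sid")
  .orderBy("time")
  .rowsBetween(-719, 0)

val withRollingAvg = sid_data_df
  .withColumn("rolling_avg", avg(col("total_field")).over(w))

For timestamp-based buckets, the same idea works with a derived bucket column (for example, an epoch timestamp divided by the bucket width) used in partitionBy or a groupBy.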



Source: https://stackoverflow.com/questions/45886205/how-to-efficiently-query-a-hive-table-in-spark-using-hive-context
