Question
How does one execute Spark SQL queries from routines that are not the driver portion of the program?
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

def doWork(rec):
    # Attempt a nested SQL query per record -- this is the part that fails,
    # because sqlContext is not available on the executors.
    data = sqlContext.sql(
        "select * from zip_data where STATEFP ='{sfp}' and COUNTYFP = '{cfp}' "
        .format(sfp=rec[0], cfp=rec[1]))
    for item in data.collect():
        print(item)
        # do something
    return (rec[0], rec[1])

if __name__ == "__main__":
    sc = SparkContext(appName="Some app")
    print("Starting some app")
    sqlContext = SQLContext(sc)
    parquetFile = sqlContext.read.parquet("/path/to/data/")
    parquetFile.registerTempTable("zip_data")
    df = sqlContext.sql(
        "select distinct STATEFP, COUNTYFP from zip_data where STATEFP IN ('12') ")
    rslts = df.map(doWork)
    for rslt in rslts.collect():
        print(rslt)
In this example I'm querying the same table, but I would also like to query other tables registered in Spark SQL.
Answer 1:
You cannot execute nested operations on a distributed data structure; it is simply not supported in Spark. You have to use joins, local (optionally broadcast) data structures, or access external data directly instead.
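For the query in the question, a join expresses the same logic without calling sqlContext.sql inside a map function. Here is a minimal sketch of that approach, reusing the zip_data table and variable names from the question (the final collect-and-print step is only illustrative):

keys = sqlContext.sql(
    "select distinct STATEFP, COUNTYFP from zip_data where STATEFP IN ('12')")

# Joining the distinct keys back against the full table replaces the
# per-record nested query with a single distributed operation.
matched = keys.join(parquetFile, ["STATEFP", "COUNTYFP"])

for row in matched.collect():
    print(row)

If the key set is small, the same idea works with a broadcast join, or with a broadcasted local dictionary that the executors probe directly; that is what "local (optionally broadcast) data structures" refers to.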
Answer 2:
In case you can't accomplish your task with joins and want to run the SQL queries in memory:
You can consider using an in-memory database such as H2, Apache Derby, or Redis to execute fast parallel SQL queries without losing the benefits of in-memory computation.
In-memory databases will provide faster access than databases such as MySQL or PostgreSQL.
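As an illustration of that idea, here is a hedged sketch that looks records up in Redis from inside a partition-level function, assuming the reference rows have already been loaded into Redis under keys shaped like zip:<STATEFP>:<COUNTYFP>. The key layout, host name, and port are assumptions for the sketch, not part of the answer, and it requires the redis-py client:

import redis

def lookup_partition(rows):
    # Open one connection per partition rather than one per record;
    # the client object cannot be serialized from the driver anyway.
    r = redis.Redis(host="redis-host", port=6379)  # hypothetical host/port
    for rec in rows:
        key = "zip:{0}:{1}".format(rec[0], rec[1])
        yield (rec[0], rec[1], r.get(key))

# df is the DataFrame of distinct (STATEFP, COUNTYFP) pairs from the question.
rslts = df.rdd.mapPartitions(lookup_partition)
for rslt in rslts.collect():
    print(rslt)

Using mapPartitions rather than map keeps the connection setup cost to one per partition, which is the usual pattern for accessing an external store from executors.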
Source: https://stackoverflow.com/questions/35234465/how-to-execute-a-spark-sql-query-from-a-map-function-python