How to connect to Presto JDBC in PySpark?

Deadly 提交于 2019-12-24 08:12:38

问题


I want to connect to Presto server using JDBC in PySpark. I followed a tutorial which is written in Java. I am trying to do the same in my Python3 code but getting an error:

: java.sql.SQLException: No suitable driver

I have tried to execute the following code:

jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:presto://my_machine_ip:8080/hive/default") \
    .option("user", "airflow") \
    .option("dbtable", "may30_1") \
    .load()

It should be noted that I am using Spark on EMR and so, spark is already provided to me.

The code from the aforementioned tutorial is:

final String JDBC_DRIVER = "com.facebook.presto.jdbc.PrestoDriver";
    final String DB_URL = "jdbc:presto://localhost:9000/catalogName/schemaName";
    //  Database credentials
    final String USER = "username";
    final String PASS = "password";
    Connection conn = null;
    Statement stmt = null;
    try {
      //Register JDBC driver
      Class.forName(JDBC_DRIVER);

Kindly notice the JDBC_DRIVER in the above code, I have not been able to deduce the corresponding assignment in Python3 i.e. in PySpark.

Nor have I added any dependency in any configuration whatsoever.

I expect to successfully connect to my presto. Right now, I am getting the following error, complete stacktrace:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 172, in load
    return self._df(self._jreader.load())
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o90.load.
: java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:315)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:104)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

回答1:


You need to perform the following steps for connecting presto to pyspark.

  1. Download Presto JDBC driver jar from official website and put it into spark master node.

  2. Start pyspark shell with following params:

bin/pyspark --driver-class-path com.facebook.presto.jdbc.PrestoDriver --jars /path/to/presto/jdbc/driver/jar/file
  1. Try to connect to presto using spark
jdbcDF = spark.read \
    .format("jdbc") \
    .option("driver", "com.facebook.presto.jdbc.PrestoDriver") \  <-- presto driver class
    .option("url", "jdbc:presto://<machine_ip>:8080/hive/default") \
    .option("user", "airflow") \
    .option("dbtable", "may30_1") \
    .load()


来源:https://stackoverflow.com/questions/56804113/how-to-connect-to-presto-jdbc-in-pyspark

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!