Transfer data from database to Spark using sparklyr

Sparklyr >= 0.6.0

You can use spark_read_jdbc.
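
For example, a minimal sketch, assuming the same placeholder host, database, table, and credentials as the low-level example below:

    library(sparklyr)
    
    sc <- spark_connect(master = "local")
    
    # Read the table over JDBC and register it in Spark as "foo"
    foo <- spark_read_jdbc(
      sc, name = "foo",
      options = list(
        url      = "jdbc:postgresql://host/database",
        dbtable  = "table",
        user     = "scott",
        password = "tiger",
        driver   = "org.postgresql.Driver"
      )
    )

The JDBC driver still has to be visible to Spark, for example via spark.jars.packages as described in the first step below.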

Sparklyr < 0.6.0

I hope there is a more elegant solution out there, but here is a minimal example using the low-level API:

  • Make sure that Spark has access to the required JDBC driver, for example by adding its coordinates to spark.jars.packages. With PostgreSQL, for instance, you could add (adjust the version to match your database):

    spark.jars.packages org.postgresql:postgresql:9.4.1212
    

    to SPARK_HOME/conf/spark-defaults.conf
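
    Alternatively, the same property can be set from R when creating the connection; a minimal sketch, assuming a local master:

    config <- spark_config()
    config[["spark.jars.packages"]] <- "org.postgresql:postgresql:9.4.1212"
    sc <- spark_connect(master = "local", config = config)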

  • Load data and register as temporary view:

    name <- "foo"
    
    spark_session(sc) %>% 
      invoke("read") %>% 
      # JDBC URL and table name
      invoke("option", "url", "jdbc:postgresql://host/database") %>% 
      invoke("option", "dbtable", "table") %>% 
      # Optional credentials
      invoke("option", "user", "scott") %>%
      invoke("option", "password", "tiger") %>% 
      # Driver class, here for PostgreSQL
      invoke("option", "driver", "org.postgresql.Driver") %>% 
      # Specify the JDBC source and load the data
      invoke("format", "jdbc") %>% 
      invoke("load") %>% 
      # Register as a temporary view (Spark 2.x; registerTempTable in 1.x)
      invoke("createOrReplaceTempView", name)
    

    You can pass multiple options at once using an environment:

    invoke("options", as.environment(list(
      user="scott", password="tiger", url="jdbc:..."
    )))
    
  • Load the temporary view with dplyr:

    dplyr::tbl(sc, name)
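
    Queries built on the resulting tbl are translated to Spark SQL and executed lazily in Spark; a sketch, where id is a hypothetical column:

    dplyr::tbl(sc, name) %>%
      dplyr::filter(id > 100) %>%   # runs in Spark, not in R
      dplyr::collect()              # brings the result back as an R data frame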
    
  • Be sure to read about further JDBC options, with a focus on partitionColumn, lowerBound, upperBound, and numPartitions.
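
    For example, with sparklyr >= 0.6.0, a parallel read could look like this sketch (column name and bounds are placeholders; the bounds control partition strides only, they do not filter rows):

    spark_read_jdbc(
      sc, name = "foo_partitioned",
      options = list(
        url             = "jdbc:postgresql://host/database",
        dbtable         = "table",
        driver          = "org.postgresql.Driver",
        partitionColumn = "id",     # hypothetical numeric column
        lowerBound      = "1",
        upperBound      = "100000",
        numPartitions   = "8"
      )
    )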

  • For additional details see, for example, How to use JDBC source to write and read data in (Py)Spark? and How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
