Converting a MySQL table to a Spark Dataset is very slow compared to doing the same from a CSV file

谎友^ 2020-12-01 15:20

I have a CSV file in Amazon S3 which is 62 MB in size (114,000 rows). I am converting it into a Spark Dataset and taking the first 500 rows from it. The code is below.
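
The snippet itself is not preserved in this capture; what follows is a minimal sketch of the steps described, assuming Spark 2.x, Scala, and an s3a:// path (the bucket, key, and reader options are placeholders):

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: read a CSV from S3 and take its first 500 rows.
    // Bucket and key are placeholders; inferSchema adds an extra pass over the file.
    val spark = SparkSession.builder().appName("csv-vs-jdbc").getOrCreate()

    val ds = spark.read
      .option("header", "true")       // first row holds column names
      .option("inferSchema", "true")  // infer column types
      .csv("s3a://your-bucket/your-file.csv")

    val first500 = ds.take(500)       // materialize the first 500 rows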



        
2 Answers
  •  -上瘾入骨i
    2020-12-01 16:06

    Please follow the steps below:

    1. Download a copy of the JDBC connector for MySQL (I believe you already have one):

    wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.38/mysql-connector-java-5.1.38.jar
    

    2. Create a db-properties.flat file in the format below:

    jdbcUrl=jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}
    user=
    password=
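
    For illustration only, a filled-in version of that file might look like this (host, port, database name, and credentials are made-up placeholders):

    jdbcUrl=jdbc:mysql://localhost:3306/mydb
    user=spark_reader
    password=change_me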
    

    3. Create an empty table first, where you want to load the data (a hypothetical DDL sketch follows).
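
    For reference, a hypothetical DDL for such a target table (the table name, columns, and partition column are placeholders, chosen to match the final write step); it can be issued through the HiveContext's sql method once the shell below is up:

    // Hypothetical ORC target table; all names are placeholders.
    sql("""
      CREATE TABLE IF NOT EXISTS your_target_table_name (
        id   INT,
        name STRING
      )
      PARTITIONED BY (your_partition_column_name STRING)
      STORED AS ORC
    """)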

    Invoke the Spark shell with the connector on the driver class path:

    spark-shell --driver-class-path mysql-connector-java-5.1.38.jar
    

    Then import all the required packages:

    import java.io.{File, FileInputStream}
    import java.util.Properties
    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.{SparkConf, SparkContext}
    

    Initialize a HiveContext (or a SQLContext):

    val sQLContext = new HiveContext(sc)
    import sQLContext.implicits._
    import sQLContext.sql
    

    Set the Hive dynamic-partitioning properties:

    sQLContext.setConf("hive.exec.dynamic.partition", "true")
    sQLContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    

    Load the MySQL DB properties from the file:

    val dbProperties = new Properties()
    dbProperties.load(new FileInputStream(new File("your_path_to/db-properties.flat")))
    val jdbcurl = dbProperties.getProperty("jdbcUrl")
    

    Create a query to read the data from your table and pass it to the read method of the SQLContext. This is where you can manage your WHERE clause (see the filtered variant below):

    val df1 = "(SELECT * FROM your_table_name) as s1"
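
    For example, to push a filter down to MySQL instead of pulling the whole table, the subquery could carry the WHERE clause (the column name and cutoff value are placeholders):

    // Alternative subquery with a pushed-down filter; column and value are placeholders.
    val df1 = "(SELECT * FROM your_table_name WHERE your_date_column >= '2020-01-01') as s1"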
    

    Pass the JDBC URL, the select query, and the DB properties to the read method:

    val df2 = sQLContext.read.jdbc(jdbcurl, df1, dbProperties)
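
    If the plain JDBC read above is too slow (the asker's underlying complaint), the same read method has an overload that splits the query across several parallel connections, so the call could be replaced with something like this sketch, assuming a numeric column your_id_column with roughly known bounds (the column name, bounds, and partition count are placeholders):

    // Parallel JDBC read: Spark opens numPartitions connections, each
    // fetching one slice of [lowerBound, upperBound] on the given column.
    val df2 = sQLContext.read.jdbc(
      jdbcurl,
      df1,               // the "(SELECT ...) as s1" subquery from above
      "your_id_column",  // placeholder: a numeric, evenly distributed column
      0L,                // lowerBound (placeholder)
      1000000L,          // upperBound (placeholder)
      8,                 // numPartitions
      dbProperties)

    Without these options the whole result set comes through a single connection in a single task, which is usually why a JDBC read feels far slower than a CSV read that Spark can split across S3 blocks.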
    

    Write it to your target table:

    df2.write.format("orc").partitionBy("your_partition_column_name").mode(SaveMode.Append).saveAsTable("your_target_table_name")
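
    To sanity-check the load afterwards, you can count the rows that landed in the target table (a quick, assumed check using the sql import from above):

    sql("SELECT COUNT(*) FROM your_target_table_name").show()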
    
