How to create a Schema file in Spark

谁都会走 submitted on 2019-11-29 16:26:58

To create a schema from a text file, first write a function that maps each type name to the corresponding Spark DataType:

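import org.apache.spark.sql.types._

// Map a type name from the schema file to a Spark DataType;
// unrecognised names fall back to StringType.
def getType(raw: String): DataType = {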
  raw match {
    case "ByteType" => ByteType
    case "ShortType" => ShortType
    case "IntegerType" => IntegerType
    case "LongType" => LongType
    case "FloatType" => FloatType
    case "DoubleType" => DoubleType
    case "BooleanType" => BooleanType
    case "TimestampType" => TimestampType
    case _ => StringType
  }
}

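Next, read the schema file and build a list of StructField entries. The parsing below expects comma-separated, optionally quoted entries of the form name type, such as "Age IntegerType","Name StringType":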

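import scala.io.Source

// Each entry is "name type"; trim so a space after a comma does not yield an empty name.
val schema = Source.fromFile("schema.txt").getLines().toList
  .flatMap(_.split(",")).map(_.replaceAll("\"", "").trim.split(" "))
  .map(x => StructField(x(0), getType(x(1)), true))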
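
The layout of schema.txt is not shown here, so as a rough sketch, assume it holds one line of comma-separated, quoted name/type pairs. The snippet below runs the same split-and-strip steps on an in-memory string (no Spark needed), with whitespace trimming and a length check, and returns plain (name, type) pairs. The names sampleLine and parseFields are illustrative only and not part of the answer.

```scala
// Assumed content of schema.txt (hypothetical; the answer does not show the file).
val sampleLine = "\"Age IntegerType\",\"Name StringType\""

// Split on commas, strip quotes, trim, then split each entry into name and type.
def parseFields(line: String): List[(String, String)] =
  line.split(",").toList
    .map(_.replaceAll("\"", "").trim)
    .filter(_.nonEmpty)
    .map { entry =>
      val parts = entry.split("\\s+")
      // Fail early on malformed entries instead of hitting an index error later.
      require(parts.length == 2, s"expected 'name type', got: $entry")
      (parts(0), parts(1))
    }

val fields = parseFields(sampleLine)
```

Each pair could then be turned into a StructField with getType, as in the answer's own code.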

Now read the CSV file with that schema:

spark.read
  // samplingRatio only matters for schema inference; it has no effect with an explicit schema
  .option("samplingRatio", "0.01")
  .option("delimiter", "|")
  .option("nullValue", "NULL")
  .schema(StructType(schema))
  .csv("data.csv")

Hope this helps!
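
As an alternative sketch: if the schema file can be written as a DDL-style column list instead, Spark can parse it without a custom type mapper. This assumes a reasonably recent Spark (2.3 or later) and a hypothetical one-line file such as schema.ddl. The DDL text is inlined here so the snippet does not depend on that file.

```scala
import org.apache.spark.sql.types.StructType

// Assumed content of a hypothetical schema.ddl file, inlined for illustration.
val ddl = "Age INT, Name STRING"

// StructType.fromDDL parses a comma-separated "name type" column list.
val ddlSchema: StructType = StructType.fromDDL(ddl)

// ddlSchema can be passed to spark.read.schema(...) in place of StructType(schema).
```

The trade-off is that the file must use SQL type names (INT, STRING, ...) rather than the Scala class names (IntegerType, StringType, ...) used above.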

Something like this is a little more robust, since it uses the Hive metastore:

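    import org.apache.hadoop.hive.metastore.api.FieldSchema
    import org.apache.spark.sql.types.StructType

    // Convert each Spark field to a Hive FieldSchema (name, type string, comment).
    def sparkToHiveSchema(schema: StructType): List[FieldSchema] = {
        schema.map(field => new FieldSchema(field.name, field.dataType.catalogString, field.getComment().getOrElse(""))).toList
    }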


Ghost9

You can also specify the schema directly in code. First import the types:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

For example:

val schema = new StructType(
  Array(
    StructField("Age", IntegerType, true),
    StructField("Name", StringType, true)
  )
)

val data = spark.read.option("header", "false").schema(schema).csv("filename.csv")
data.show()

This reads the file directly into a DataFrame with the given schema.
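
Since the question is about a schema file, one possible sketch is to store the schema as JSON and load it back later. It relies on the json method of StructType and on DataType.fromJson. The field names mirror the example above, and writing the string to disk (for instance to a hypothetical schema.json) is left out so the snippet stays self-contained.

```scala
import org.apache.spark.sql.types.{DataType, IntegerType, StringType, StructField, StructType}

val original = StructType(Array(
  StructField("Age", IntegerType, true),
  StructField("Name", StringType, true)
))

// Serialise the schema to a JSON string; this is what would be saved to a file.
val schemaJson: String = original.json

// Parse the JSON back; fromJson returns a DataType, so cast it to StructType.
val restored: StructType = DataType.fromJson(schemaJson).asInstanceOf[StructType]
```

The restored value can be passed to spark.read.schema(...) just like a schema built by hand.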
