I have a large Excel(xlsx and xls) file with multiple sheet and I need convert it to RDD or Dataframe so that it can be joined to othe
The solution to your problem is to use Spark Excel dependency in your project.
Spark Excel has flexible options to play with.
I have tested the following code to read from excel and convert it to dataframe and it just works perfect
def readExcel(file: String): DataFrame = sqlContext.read
.format("com.crealytics.spark.excel")
.option("location", file)
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "False")
.load()
val data = readExcel("path to your excel file")
data.show(false)
you can give sheetname as option if your excel sheet has multiple sheets
.option("sheetName", "Sheet2")
I hope its helpful