Question
I have a CSV of size 6 GB. So far I was using the following line to copy it with java.io; when I check the file's size on DBFS after the copy it still shows 6 GB, so I assumed the copy was correct. But when I do spark.read.csv(samplePath) it reads only 18 million rows instead of 66 million.
Files.copy(Paths.get(_outputFile), Paths.get("/dbfs" + _outputFile))
So I tried dbutils to copy the file as shown below, but it gives an error. I have added the dbutils Maven dependency and imported it in the object where I call this line. Is there anywhere else I need to make a change in order to use dbutils in Scala code running on Databricks?
dbutils.fs.cp("file:" + _outputFile, _outputFile)
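As far as I understand, in a compiled Scala jar (outside a notebook) dbutils has to be reached through DBUtilsHolder from the dbutils-api artifact, so what I am attempting is roughly the sketch below (not sure whether this is the right entry point):
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

object CopyToDbfs {
  // copy the file from the driver's local filesystem to DBFS
  def copyOutput(outputFile: String): Unit =
    dbutils.fs.cp("file:" + outputFile, "dbfs:" + outputFile)
}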
Databricks automatically assumes that when you do spark.read.csv(path), the path is looked up on DBFS by default. How can I make it read this path from the driver's local filesystem instead of DBFS? I suspect the file copy is not actually copying all the rows, because of the 2 GB size limit when using java.io with Databricks.
Can I use this:
spark.read.csv("file:/databricks/driver/sampleData.csv")
Any suggestions around this?
Thanks.
Answer 1:
Note: Local file I/O APIs only support files less than 2GB in size. If you use local file I/O APIs to read or write files larger than 2GB you might see corrupted files. Instead, access files larger than 2GB using the DBFS CLI, dbutils.fs, or Spark APIs.
When you’re using Spark APIs, you reference files with "/mnt/training/file.csv" or "dbfs:/mnt/training/file.csv". If you’re using local file APIs, you must provide the path under /dbfs, for example: "/dbfs/mnt/training/file.csv". You cannot use a path under /dbfs with the Spark APIs.
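For example, a minimal Scala sketch contrasting the two conventions (the paths are just placeholders):
import scala.io.Source

// Spark APIs: use a DBFS path, with or without the dbfs: scheme
val df = spark.read.csv("dbfs:/mnt/training/file.csv")

// Local file APIs: go through the FUSE mount under /dbfs
val source = Source.fromFile("/dbfs/mnt/training/file.csv")
val firstLine = source.getLines().next()
source.close()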
There are multiple ways to solve this issue.
Option 1: Access DBFS using local file APIs.
You can use local file APIs to read and write to DBFS paths. Azure Databricks configures each cluster node with a FUSE mount that allows processes running on cluster nodes to read and write to the underlying distributed storage layer with local file APIs. For example:
Python:
# write a file to DBFS using Python I/O APIs
with open("/dbfs/tmp/test_dbfs.txt", 'w') as f:
    f.write("Apache Spark is awesome!\n")
    f.write("End of example!")

# read the file back
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
    for line in f_read:
        print(line)
Scala:
import scala.io.Source

val filename = "/dbfs/tmp/test_dbfs.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}
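The write side works the same way in Scala through plain java.io; a minimal sketch (same /dbfs/tmp path as above, and still subject to the 2 GB note at the top):
import java.io.PrintWriter

// write a small file to DBFS through the /dbfs FUSE mount
val writer = new PrintWriter("/dbfs/tmp/test_dbfs.txt")
try {
  writer.write("Apache Spark is awesome!\n")
  writer.write("End of example!")
} finally {
  writer.close()
}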
Option 2: Read large DBFS-mounted files using Python APIs.
Move the file from dbfs:// to the local file system (file://), then read it using the Python API. For example:
- Copy the file from dbfs:// to file://:
%fs cp dbfs:/mnt/large_file.csv file:/tmp/large_file.csv
- Read the file with the pandas API:
import pandas as pd

pd.read_csv('/tmp/large_file.csv').head()
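For the original Scala scenario (a 6 GB CSV sitting on the driver's local disk), the same idea works in the other direction: copy the local file into DBFS with dbutils.fs.cp, then read it with the Spark API. A minimal sketch, assuming the file is at /databricks/driver/sampleData.csv as in the question and using dbfs:/tmp as an example target path:
// copy from the driver's local filesystem into DBFS (not subject to the 2 GB local I/O limit)
dbutils.fs.cp("file:/databricks/driver/sampleData.csv", "dbfs:/tmp/sampleData.csv")

// read it back with the Spark API using the DBFS path
val df = spark.read.csv("dbfs:/tmp/sampleData.csv")
println(df.count())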
Hope this helps.
Source: https://stackoverflow.com/questions/57116963/databricks-error-to-copy-and-read-file-from-to-dbfs-that-is-2gb