ORC

Converting CSV to ORC with Spark

Submitted on 2019-11-29 23:37:06
Question: I've seen this blog post by Hortonworks about ORC support in Spark 1.2 through data sources. It covers version 1.2, and it addresses creating ORC files from objects, not converting a CSV to ORC. I have also seen ways, as intended, to do these conversions in Hive. Could someone please provide a simple example of how to load a plain CSV file in Spark 1.6+, save it as ORC, and then load it back as a DataFrame in Spark?

Answer 1: I'm going to omit the CSV reading part because
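A minimal sketch of the full round trip, assuming Spark 2.x+ where the CSV reader is built in (the file paths are placeholders):

```scala
// Sketch: CSV -> ORC -> DataFrame round trip (assumes Spark 2.x+,
// where the CSV reader is built in; paths are placeholders).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CsvToOrc")
  .getOrCreate()

// Read the plain CSV file; inferSchema is convenient but costs an
// extra pass over the data.
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/input.csv")

// Save the DataFrame as ORC.
csvDf.write.orc("/path/to/output_orc")

// Load the ORC files back as a DataFrame.
val orcDf = spark.read.orc("/path/to/output_orc")
orcDf.show()
```

On Spark 1.6 the same write/read calls work against a HiveContext, but reading the CSV requires the external spark-csv package instead of the built-in reader.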

Aggregating multiple columns with custom function in Spark

Submitted by 亡梦爱人 on 2019-11-29 20:10:26
Question: I was wondering if there is some way to specify a custom aggregation function for Spark DataFrames over multiple columns. I have a table of (name, item, price) rows like this:

john | tomato | 1.99
john | carrot | 0.45
bill | apple | 0.99
john | banana | 1.29
bill | taco | 2.59

I would like to aggregate each item and its cost per person into a list like this:

john | (tomato, 1.99), (carrot, 0.45), (banana, 1.29)
bill | (apple, 0.99), (taco, 2.59)

Is this possible with DataFrames? I recently learned about collect_list, but it appears to work for only one column.

Answer 1: The easiest way to do
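A sketch of the approach that works here, using the DataFrame API rather than a custom aggregation function: wrap both columns in a struct and apply collect_list to that (assumes Spark 2.x+, where struct columns are supported inside collect_list):

```scala
// Sketch: collect (item, price) pairs per name by wrapping both columns
// in a struct and applying collect_list to it.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

val spark = SparkSession.builder().appName("MultiColAgg").getOrCreate()
import spark.implicits._

val df = Seq(
  ("john", "tomato", 1.99),
  ("john", "carrot", 0.45),
  ("bill", "apple", 0.99),
  ("john", "banana", 1.29),
  ("bill", "taco", 2.59)
).toDF("name", "item", "price")

// One row per name, with a list of (item, price) structs.
val grouped = df
  .groupBy("name")
  .agg(collect_list(struct($"item", $"price")).as("items"))

grouped.show(truncate = false)
// john | [[tomato, 1.99], [carrot, 0.45], [banana, 1.29]]
// bill | [[apple, 0.99], [taco, 2.59]]
```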

How to create a Schema file in Spark

Submitted by 谁都会走 on 2019-11-29 16:26:58
Question: I am trying to read a schema file (a plain text file) and apply it to my CSV file, which has no header. Since I already have a schema file, I don't want to use the inferSchema option, which is an overhead. My input schema file looks like this:

"num IntegerType","letter StringType"

I am trying the code below to create the schema:

val schema_file = spark.read.textFile("D:\\Users\\Documents\\schemaFile.txt")
val struct_type = schema_file
  .flatMap(x => x.split(","))
  .map(b => (
    b.split(" ")(0).stripPrefix("\"").asInstanceOf[String],
    b.split(" ")(1).stripSuffix("\"").asInstanceOf[org.apache.spark.sql.types.DataType]
  ))
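That cast cannot work: a String cannot be cast to a DataType. A working sketch instead maps each type name in the file to the corresponding DataType object explicitly (it assumes the schema file only uses a handful of known type names, and the CSV path below is hypothetical):

```scala
// Sketch: parse "num IntegerType","letter StringType" entries into a
// StructType, then apply it to a headerless CSV. Extend the match for
// any other type names your schema files use.
import org.apache.spark.sql.types._

val schemaLine = spark.read
  .textFile("D:\\Users\\Documents\\schemaFile.txt")
  .collect()
  .mkString(",")

val fields = schemaLine.split(",").map { entry =>
  // Each entry looks like "num IntegerType" (quotes included).
  val Array(name, typeName) =
    entry.stripPrefix("\"").stripSuffix("\"").split(" ")
  val dataType = typeName match {
    case "IntegerType" => IntegerType
    case "StringType"  => StringType
    case "DoubleType"  => DoubleType
    case other => throw new IllegalArgumentException(s"Unhandled type: $other")
  }
  StructField(name, dataType, nullable = true)
}

val schema = StructType(fields)

// Apply the schema to the headerless CSV; no inferSchema pass needed.
val df = spark.read
  .option("header", "false")
  .schema(schema)
  .csv("D:\\Users\\Documents\\data.csv") // hypothetical data file
```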

How do I Combine or Merge Small ORC files into Larger ORC file?

Submitted by 可紊 on 2019-11-26 16:56:55
Question: Most questions and answers on SO and around the web discuss using Hive to combine a bunch of small ORC files into a larger one; however, my ORC files are log files separated by day, and I need to keep the days separate. I only want to "roll up" the ORC files within each day (the days are directories in HDFS). I most likely need to write the solution in Java, and I have come across OrcFileMergeOperator, which may be what I need to use, but it is still too early to tell. What is the best approach to solving this issue?
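One common workaround, sketched below in Scala rather than the Java/OrcFileMergeOperator route the question mentions: read one day's directory with Spark and rewrite it coalesced into fewer, larger ORC files (the paths are hypothetical):

```scala
// Sketch of one common workaround (not OrcFileMergeOperator): read a
// single day's small ORC files and rewrite them as fewer, larger files.
// The per-day directory stays separate; paths are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OrcDailyRollup").getOrCreate()

val dayDir = "hdfs:///logs/2019-11-26"

// coalesce(1) yields a single output file per day; raise the number if
// one file per day would be too large.
spark.read.orc(dayDir)
  .coalesce(1)
  .write
  .orc(dayDir + "_rolled_up")

// After verifying the output, replace the original directory with the
// rolled-up one (e.g. an HDFS rename) to keep the day boundary intact.
```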