apache-spark

How to list file keys in Databricks dbfs **without** dbutils

…衆ロ難τιáo~ submitted on 2021-01-07 01:21:08
Question: Apparently dbutils cannot be used in command-line spark-submits; you must use Jar Jobs for that. But I MUST use spark-submit-style jobs due to other requirements, yet I still need to list and iterate over file keys in dbfs to make some decisions about which files to use as input to a process... Using Scala, what library in Spark or Hadoop can I use to retrieve a list of dbfs:/ file keys matching a particular pattern? import org.apache.hadoop.fs.Path import org.apache.spark.sql.SparkSession def ls
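
The excerpt cuts off at the `ls` helper, but a minimal Scala sketch of one common approach is to resolve a Hadoop `FileSystem` from the Spark session's Hadoop configuration and glob over `dbfs:/` paths; this works from a plain spark-submit job with no dbutils. The root path `dbfs:/mnt/input` and the `*.parquet` pattern below are placeholders, not from the original question.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object ListDbfsKeys {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("list-dbfs-keys").getOrCreate()

    // Bind a FileSystem to the cluster's Hadoop configuration so that
    // dbfs:/ paths resolve without going through dbutils.
    val conf = spark.sparkContext.hadoopConfiguration
    val root = new Path("dbfs:/mnt/input")                  // placeholder root
    val fs   = root.getFileSystem(conf)

    // globStatus accepts wildcard patterns over the file keys.
    val keys = fs.globStatus(new Path(root, "*.parquet"))   // placeholder pattern
      .map(_.getPath.toString)

    keys.foreach(println)
    spark.stop()
  }
}
```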

How to run this code on Spark Cluster mode

我的梦境 submitted on 2021-01-06 08:01:28
Question: I want to run my code on a cluster. My code: import java.util.Properties import edu.stanford.nlp.ling.CoreAnnotations._ import edu.stanford.nlp.pipeline._ import org.apache.spark.{SparkConf, SparkContext} import scala.collection.JavaConversions._ import scala.collection.mutable.ArrayBuffer object Pre2 { def plainTextToLemmas(text: String, pipeline: StanfordCoreNLP): Seq[String] = { val doc = new Annotation(text) pipeline.annotate(doc) val lemmas = new ArrayBuffer[String]() val sentences = doc
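
The excerpt ends before the Spark setup, but the usual blockers when moving this kind of CoreNLP job to cluster mode are a hard-coded setMaster("local[...]") and a StanfordCoreNLP pipeline built on the driver (it is not serializable). A minimal sketch of the common pattern, with placeholder input/output paths and annotator list:

```scala
import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import org.apache.spark.{SparkConf, SparkContext}

object Pre2 {
  // A singleton object is initialized independently in each JVM, so the
  // pipeline is constructed on the executors instead of being serialized
  // from the driver.
  lazy val pipeline: StanfordCoreNLP = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma")  // placeholder annotators
    new StanfordCoreNLP(props)
  }

  def main(args: Array[String]): Unit = {
    // No setMaster here: the master and deploy mode come from spark-submit,
    // e.g.  spark-submit --master yarn --deploy-mode cluster --class Pre2 app.jar <in> <out>
    val conf = new SparkConf().setAppName("Pre2")
    val sc   = new SparkContext(conf)

    val lines = sc.textFile(args(0))        // placeholder input path
    val lemmatized = lines.mapPartitions { iter =>
      val nlp = pipeline                    // forces one pipeline per executor JVM
      iter.map(line => line)                // the question's plainTextToLemmas(line, nlp) would go here
    }
    lemmatized.saveAsTextFile(args(1))      // placeholder output path
    sc.stop()
  }
}
```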

scala api for delta lake optimize command

我只是一个虾纸丫 submitted on 2021-01-06 07:38:54
Question: The Databricks docs say that you can change the Z-ordering of a Delta table by doing: spark.read.table(connRandom) .write.format("delta").saveAsTable(connZorder) followed by sql(s"OPTIMIZE $connZorder ZORDER BY (src_ip, src_port, dst_ip, dst_port)"). The problem with this is the switching between the Scala and SQL APIs, which is gross. What I want to be able to do is: spark.read.table(connRandom) .write.format("delta").saveAsTable(connZorder) .optimize.zorderBy("src_ip", "src_port", "dst_ip", "dst_port") but I
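
Delta Lake later added a Scala builder for exactly this; it lives in the delta-core library rather than Spark itself and was not yet available when this question was posted. A minimal sketch assuming Delta Lake 2.0+ (or a Databricks runtime that ships io.delta.tables.DeltaTable.optimize()); the table name value is a placeholder:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zorder-in-scala").getOrCreate()

// connZorder is the table name used in the question; placeholder value here.
val connZorder = "conn_zorder"

// Stay entirely in the Scala API: OPTIMIZE ... ZORDER BY via the builder,
// no string-templated SQL needed.
DeltaTable.forName(spark, connZorder)
  .optimize()
  .executeZOrderBy("src_ip", "src_port", "dst_ip", "dst_port")
```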

How to correctly transform spark dataframe by mapInPandas

随声附和 submitted on 2021-01-06 03:51:57
Question: I'm trying to transform a Spark DataFrame with 10k rows using the new Spark 3.0.1 function mapInPandas. Expected output: the mapped pandas_function() transforms one row into three, so the output transformed_df should have 30k rows. Current output: I'm getting 3 rows with 1 core and 24 rows with 8 cores. INPUT: respond_sdf has 10k rows
+-----+-------------------------------------------------------------------+
|url  |content                                                            |
+-----+-------------------------------------------------------------------+
|api_1|{
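
mapInPandas hands each task an iterator of pandas DataFrames (one per Arrow batch), so one likely explanation for the counts above is a function that returns a fixed three-row frame per batch rather than three rows per input row (1 partition gives 3 rows, 8 partitions give 24). mapInPandas exists only in the Python API, so this sketch is PySpark; the input DataFrame below is a placeholder standing in for respond_sdf and its real schema:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapInPandas-example").getOrCreate()

# Placeholder input standing in for respond_sdf from the question.
respond_sdf = spark.createDataFrame(
    [(f"api_{i}", "{}") for i in range(10_000)], ["url", "content"]
)

def pandas_function(iterator):
    # Each element of `iterator` is a pandas DataFrame covering one Arrow
    # batch of input rows; repeat every row three times within the batch.
    for pdf in iterator:
        yield pdf.loc[pdf.index.repeat(3)].reset_index(drop=True)

transformed_df = respond_sdf.mapInPandas(pandas_function, schema=respond_sdf.schema)
print(transformed_df.count())  # 30000 expected for a 10k-row input
```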
