spark-dataframe

Why is pyspark picking up a variable that was not broadcast?

Submitted by 安稳与你 on 2019-12-12 03:24:41

Question: I'm using pyspark to analyse a dataset and I'm a little surprised as to why the following code works correctly even though I'm using a variable that was not broadcast. The variable in question is video, which is used in the filter function after the join.

seed = random.randint(0,999)
# df is a dataframe
# video is just one randomly sampled element
video = df.sample(False,0.001,seed).head()
# just a python list
otherVideos = [ (22,0.32),(213,0.43) ]
# transform the python list into an rdd
…
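The question is PySpark, but the mechanism is the same in Scala: Spark serializes any local variable referenced inside a closure and ships it with each task, so the code works without an explicit broadcast; broadcasting is an optimization that sends the value to each executor once instead of with every task. A minimal sketch (the data and variable names are placeholders, not the question's code):

```scala
import org.apache.spark.sql.SparkSession

object ClosureVsBroadcast {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("closure-vs-broadcast").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val data = sc.parallelize(Seq((1, 0.5), (22, 0.32), (213, 0.43)))

    // A plain local variable: Spark serializes it into the task closure
    // and ships a copy with every task, so no explicit broadcast is needed.
    val videoId = 22
    val filteredWithClosure = data.filter { case (id, _) => id == videoId }

    // Broadcasting is an optimization: the value is sent to each executor
    // once and reused by all tasks running on it.
    val videoIdBc = sc.broadcast(videoId)
    val filteredWithBroadcast = data.filter { case (id, _) => id == videoIdBc.value }

    println(filteredWithClosure.collect().mkString(", "))
    println(filteredWithBroadcast.collect().mkString(", "))
    spark.stop()
  }
}
```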

Spark convert PairRDD to data frame

Submitted by 守給你的承諾、 on 2019-12-12 02:14:17

Question: How can I convert a pair RDD of the following type to a data frame?

joinResult
res16: org.apache.spark.api.java.JavaPairRDD[com.vividsolutions.jts.geom.Polygon,java.util.HashSet[com.vividsolutions.jts.geom.Polygon]] = org.apache.spark.api.java.JavaPairRDD@264b550

See https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/GeoSpark.scala#L72-L75. joinResult.toDF().show will not work, as well as …

Source: https://stackoverflow.com/questions/42689512/spark-convert-pairrdd-to-data
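One common workaround is to map the JTS geometries to a serializable representation (for example WKT strings) before calling toDF, since Spark has no built-in encoder for Polygon. A minimal sketch, assuming a SparkSession named spark and the JavaPairRDD joinResult from the question are already in scope:

```scala
import scala.collection.JavaConverters._
import com.vividsolutions.jts.geom.Polygon

// spark: an existing SparkSession; joinResult: the JavaPairRDD from the question
import spark.implicits._

val df = joinResult.rdd                         // JavaPairRDD -> RDD[(Polygon, java.util.HashSet[Polygon])]
  .map { case (poly, neighbours) =>
    // Convert geometries to WKT so the row only contains encodable types
    (poly.toText, neighbours.asScala.map(_.toText).toSeq)
  }
  .toDF("polygon_wkt", "neighbour_wkts")

df.show(false)
```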

Apache Spark RDD substitution

Submitted by 与世无争的帅哥 on 2019-12-12 01:52:44

Question: I'm trying to solve a problem where I have a dataset like this:

(1, 3)
(1, 4)
(1, 7)
(1, 2) <-
(2, 7) <-
(6, 6)
(3, 7) <-
(7, 4) <-
...

Since (1 -> 2) and (2 -> 7), I would like to replace the pair (2, 7) with (1, 7). Similarly, since (3 -> 7) and (7 -> 4), I would like to replace (7, 4) with (3, 4). Hence, my dataset becomes:

(1, 3)
(1, 4)
(1, 7)
(1, 2)
(1, 7)
(6, 6)
(3, 7)
(3, 4)
...

Any idea how to solve or tackle this? Thanks.

Answer 1: This problem looks like a transitive closure of a graph, represented in the …
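A minimal sketch of a single join-based substitution pass, under the assumption that a pair's key should be replaced by some "parent" key found elsewhere in the dataset. Note that this rewrites every pair whose key appears as a value somewhere, so it is slightly more aggressive than the question's example (for instance (3, 7) would also become (1, 7)), and chains of substitutions need the transitive-closure approach the answer points to:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-substitution").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq((1, 3), (1, 4), (1, 7), (1, 2), (2, 7), (6, 6), (3, 7), (7, 4)))

// parent(x) = some p such that (p, x) is in the dataset (arbitrary choice if several exist)
val parents = pairs.map { case (p, x) => (x, p) }.reduceByKey((a, _) => a)

// Replace the first element of each pair by its parent when one exists.
val substituted = pairs
  .leftOuterJoin(parents)            // (key, (value, Option[parent]))
  .map { case (key, (value, parentOpt)) => (parentOpt.getOrElse(key), value) }

substituted.collect().foreach(println)
```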

DataFrameReader.csv(path: String) option for skipping blank lines

Submitted by 好久不见. on 2019-12-12 01:48:14

Question: Does org.apache.spark.sql.DataFrameReader.csv(path: String) have an option for skipping blank lines? In particular, a blank line as the last line?

Answer 1: You could try setting mode to "DROPMALFORMED", as in:

val df = sqlContext.read.format("com.databricks.spark.csv").option("mode", "DROPMALFORMED")...

In Python:

df = sqlContext.read.format('com.databricks.spark.csv').options(mode = "DROPMALFORMED")...

which, according to the documentation, "...drops lines which have fewer or more tokens than …
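Spark 2.x ships a built-in CSV data source, so the same mode can be set without the external Databricks package. A minimal sketch (the path and header option are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-dropmalformed").master("local[*]").getOrCreate()

// DROPMALFORMED drops rows whose token count does not match the expected schema,
// which covers a trailing blank line.
val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv("/path/to/file.csv")   // hypothetical path

df.show()
```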

Spark UDF how to convert Map to column

Submitted by 被刻印的时光 ゝ on 2019-12-12 01:42:58

Question: I am using an Apache Zeppelin notebook, so Spark is basically running in interactive mode. I can't use a closure variable here, since Zeppelin throws org.apache.spark.SparkException: Task not serializable as it tries to serialize the whole paragraph (a bigger closure). So, without the closure approach, the only option I have is to pass the map as a column to a UDF. I have the following map collected from a paired RDD:

final val idxMap = idxMapRdd.collectAsMap

which is being used in one of the Spark transformations here:

def …
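A minimal Scala sketch of the map-as-column approach, assuming Spark 2.2+ (for typedLit); the map contents, column names and lookup logic are placeholders, not the question's code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, typedLit, udf}

val spark = SparkSession.builder().appName("map-as-column").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the map collected from the paired RDD in the question
val idxMap: Map[String, Long] = Map("a" -> 0L, "b" -> 1L)

val df = Seq("a", "b", "c").toDF("key")

// The map is passed as a literal column, so nothing from the enclosing
// scope (the Zeppelin paragraph) has to be serialized with the closure.
val lookup = udf((m: Map[String, Long], k: String) => m.getOrElse(k, -1L))

df.withColumn("idx", lookup(typedLit(idxMap), col("key"))).show()
```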

PySpark insert overwrite issue

Submitted by 邮差的信 on 2019-12-11 20:08:12

Question: Below are the last two lines of my PySpark ETL code:

df_writer = DataFrameWriter(usage_fact)
df_writer.partitionBy("data_date", "data_product").saveAsTable(usageWideFactTable, format=fileFormat, mode=writeMode, path=usageWideFactpath)

where writeMode = append and fileFormat = orc. I wanted to use insert overwrite in place of this so that my data does not get appended when I re-run the code. Hence I have used this:

usage_fact.createOrReplaceTempView("usage_fact")
fact = spark.sql("insert overwrite …
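The question is PySpark, but the same effect can be sketched in Scala with the DataFrameWriter: SaveMode.Overwrite replaces the existing data instead of appending. The table name, path and source below are placeholders; note that overwriting only the partitions being written, rather than the whole table, additionally requires dynamic partition overwrite, which only became available in Spark 2.3.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("overwrite-fact").enableHiveSupport().getOrCreate()

// usage_fact: the DataFrame produced by the ETL job (hypothetical source here)
val usageFact = spark.table("staging.usage_fact")

usageFact.write
  .mode(SaveMode.Overwrite)                  // replaces existing data instead of appending
  .format("orc")
  .partitionBy("data_date", "data_product")
  .option("path", "s3://bucket/usage_fact")  // hypothetical external location
  .saveAsTable("analytics.usage_wide_fact")  // hypothetical table name
```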

AWS Glue Error | Not able to read Glue tables from developer endpoints using Spark

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-11 18:09:17

Question: I am not able to access AWS Glue tables even though I have given all the required IAM permissions. I can't even list the databases. Here is the code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# New recommendation from AWS Support 2018-03-22
newconf = sc._conf.set("spark.sql.catalogImplementation", "in-memory")
sc.stop()
sc = sc.getOrCreate(newconf)
# …
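The excerpt ends before the root cause is established. As a first diagnostic step, it can help to check what the Spark session's catalog actually exposes; a minimal Scala sketch of that check, assuming the endpoint is meant to use the Glue Data Catalog as its metastore:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("glue-catalog-check").enableHiveSupport().getOrCreate()

// If the endpoint is wired to the Glue Data Catalog, these should list the Glue databases;
// an empty or default-only result points at catalog configuration rather than IAM.
spark.sql("show databases").show(false)
spark.catalog.listDatabases().show(false)
```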

What is the best way to access Azure Blob Storage and list its files

Submitted by 前提是你 on 2019-12-11 17:59:02

Question: I'm working with Scala and Spark and need to access Azure Blob Storage and get its list of files. What is the best way to do that, knowing the Spark version is 2.11?

Answer 1: For Spark running locally, there is an official blog post which introduces how to access Azure Blob Storage from Spark. The key is that you need to configure the Azure Storage account as HDFS-compatible storage in the core-site.xml file and add the two jars hadoop-azure & azure-storage to your classpath, so that HDFS can be accessed via the wasb[s] protocol …
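A minimal Scala sketch of the approach the answer describes, setting the storage key programmatically instead of in core-site.xml and listing a container through the Hadoop FileSystem API. The account, container and key are placeholders, and hadoop-azure plus azure-storage must be on the classpath (depending on the Hadoop version, the wasbs filesystem implementation may also need to be configured):

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("list-blobs").master("local[*]").getOrCreate()

// Equivalent to the core-site.xml entry; account, container and key are placeholders.
val account   = "mystorageaccount"
val container = "mycontainer"
spark.sparkContext.hadoopConfiguration.set(
  s"fs.azure.account.key.$account.blob.core.windows.net", "<storage-account-key>")

val root = s"wasbs://$container@$account.blob.core.windows.net/"
val fs   = FileSystem.get(new URI(root), spark.sparkContext.hadoopConfiguration)

// List the files at the container root
fs.listStatus(new Path(root)).foreach(status => println(status.getPath))
```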

Using Apache Phoenix and Spark to save a CSV in HBase #spark2.2 #intelliJIdea

Submitted by 五迷三道 on 2019-12-11 17:10:31

Question: I have been trying to load data from a CSV using Spark and write it to HBase. I can do it easily in Spark 1.6, but not in Spark 2.2. I have tried multiple approaches, and ultimately everything leads me to the same error with Spark 2.2:

Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string

Any idea why this is happening? Sharing a code snippet:

def main(args : Array[String]) {
  val spark = SparkSession.builder
    .appName("PhoenixSpark …
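The excerpt stops before the failing code is shown, so the sketch below does not diagnose the Path error itself; it is just one commonly used Spark 2.x write path through the phoenix-spark connector. The table name, ZooKeeper quorum and CSV path are placeholders, the connector must be on the classpath, and the Phoenix table must already exist:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("PhoenixSpark").master("local[*]").getOrCreate()

// Read the CSV; header/inferSchema options are illustrative
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/input.csv")          // hypothetical path

// Write through Phoenix into the HBase-backed table
csvDf.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "MY_TABLE")        // hypothetical Phoenix table
  .option("zkUrl", "zkhost:2181")     // hypothetical ZooKeeper quorum
  .save()
```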

sparklyr spark_read_parquet Reading String Fields as Lists

Submitted by 北城以北 on 2019-12-11 17:01:00

Question: I have a number of Hive files in Parquet format that contain both string and double columns. I can read most of them into a Spark data frame with sparklyr using the syntax below:

spark_read_parquet(sc, name = "name", path = "path", memory = FALSE)

However, I have one file where all of the string values get converted to unrecognizable lists that look like this when collected into an R data frame and printed:

s_df <- spark_read_parquet(sc, name = "s_df", path = "hdfs:/ …
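The excerpt is cut off before the problem output, but one way to narrow this down is to look at how Spark itself reads the file, outside sparklyr. A minimal Scala sketch; the path is a placeholder, and the mention of spark.sql.parquet.binaryAsString is a possible cause to check, not a confirmed diagnosis:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("inspect-parquet").getOrCreate()

// Inspect how Spark sees the file: if the offending columns show up as binary
// rather than string, the file may have been written without string annotations,
// in which case spark.sql.parquet.binaryAsString (or an explicit cast) can help.
val df = spark.read.parquet("hdfs:/path/to/problem_file")   // hypothetical path; the question's path is truncated

df.printSchema()
df.show(5, truncate = false)
```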