spark-dataframe

Spark and Cassandra Java Application Exception Provider org.apache.hadoop.fs.s3.S3FileSystem not found

拥有回忆 submitted on 2019-12-06 15:59:52
Question: I want to load a Cassandra table into a DataFrame in Spark. I followed the sample programs below (found in this answer), but I am getting the exception mentioned below. I also tried to load the table into an RDD first and then convert it to a DataFrame; loading the RDD is successful, but when I try to convert it to a DataFrame I hit the same exception as in the first approach. Any suggestions? I am using Spark 2.0.0, Cassandra 3.7, and Java 8. public class SparkCassandraDatasetApplication {
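For illustration, a minimal Scala sketch of reading a Cassandra table straight into a DataFrame with the spark-cassandra-connector (the question's code is Java, but the data source name is the same); the connection host, keyspace, and table names are placeholders rather than values from the question:

```scala
import org.apache.spark.sql.SparkSession

object CassandraToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkCassandraDatasetApplication")
      .config("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
      .getOrCreate()

    // Read the table directly as a DataFrame via the Cassandra data source;
    // "my_keyspace" and "my_table" are placeholders.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "my_keyspace")
      .option("table", "my_table")
      .load()

    df.show(10)
    spark.stop()
  }
}
```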

Change the Datatype of columns in PySpark dataframe

a 夏天 submitted on 2019-12-06 15:53:38
I have an input dataframe ( ip_df ); the data in it looks like this:
id col_value
1 10
2 11
3 12
The data type of both id and col_value is String. I need to get another dataframe ( output_df ) with id as string and col_value as decimal(15,4). There is no data transformation, just a data type conversion. Can I do this with PySpark? Any help will be appreciated. Try using the cast method: from pyspark.sql.types import DecimalType <your code> output_df = ip_df.withColumn("col_value", ip_df["col_value"].cast(DecimalType())) neeraj bhadani: try the statement below. output_df = ip

Spark Structured streaming- Using different Windows for different GroupBy Keys

邮差的信 submitted on 2019-12-06 12:39:10
Currently I have the following table after reading from a Kafka topic via Spark structured streaming:
key,timestamp,value
-----------------------------------
key1,2017-11-14 07:50:00+0000,10
key1,2017-11-14 07:50:10+0000,10
key1,2017-11-14 07:51:00+0000,10
key1,2017-11-14 07:51:10+0000,10
key1,2017-11-14 07:52:00+0000,10
key1,2017-11-14 07:52:10+0000,10
key2,2017-11-14 07:50:00+0000,10
key2,2017-11-14 07:51:00+0000,10
key2,2017-11-14 07:52:10+0000,10
key2,2017-11-14 07:53:00+0000,10
I would like to use a different window for each of the keys and perform aggregation; for example key1 would be
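One workaround sketch, not taken from the question itself: split the stream by key, apply a different window to each branch, and run each branch as its own streaming query. The column names follow the excerpt; the window lengths and the console sink are illustrative.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum, window}

// `events` is assumed to be the streaming DataFrame read from Kafka,
// already parsed into columns key, timestamp, value.
def startPerKeyWindows(events: DataFrame): Unit = {
  // 1-minute window for key1 (illustrative length).
  val key1Agg = events
    .filter(col("key") === "key1")
    .groupBy(col("key"), window(col("timestamp"), "1 minute"))
    .agg(sum("value").as("total"))

  // 3-minute window for key2 (illustrative length).
  val key2Agg = events
    .filter(col("key") === "key2")
    .groupBy(col("key"), window(col("timestamp"), "3 minutes"))
    .agg(sum("value").as("total"))

  // Each branch becomes its own streaming query, so the different
  // windows never have to coexist in a single aggregation.
  key1Agg.writeStream.outputMode("complete").format("console").start()
  key2Agg.writeStream.outputMode("complete").format("console").start()
}
```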

Performance of UDAF versus Aggregator in Spark

浪子不回头ぞ submitted on 2019-12-06 12:09:29
I am trying to write some performance-minded code in Spark and am wondering whether I should write an Aggregator or a user-defined aggregate function (UDAF) for my rollup operations on a DataFrame. I have not been able to find any data on how fast each of these approaches is and which one you should use for Spark 2.0+. Source: https://stackoverflow.com/questions/45356452/performance-of-udaf-versus-aggregator-in-spark
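For reference, this is roughly the shape of the Row-based UDAF side of that comparison in Spark 2.x; a minimal averaging sketch over a hypothetical Double column, not code from the question. A commonly cited difference is that a UDAF's buffer goes through Spark's generic Row representation on every update, while a typed Aggregator works on JVM objects through encoders.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// A minimal Spark 2.x UDAF computing an average over a Double column.
object AverageUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(
    StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0
    buffer(1) = 0L
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  def evaluate(buffer: Row): Double =
    if (buffer.getLong(1) == 0L) 0.0 else buffer.getDouble(0) / buffer.getLong(1)
}

// Usage sketch: df.groupBy("key").agg(AverageUdaf(col("value")))
```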

How to do custom partition in spark dataframe with saveAsTextFile

▼魔方 西西 submitted on 2019-12-06 11:35:51
I have created data in Spark and then performed a join operation; finally I have to save the output to partitioned files. I am converting the data frame into an RDD and then saving it as a text file, which allows me to use a multi-character delimiter. My question is how to use dataframe columns as a custom partition in this case. I cannot use the option below for custom partitioning because it does not support a multi-character delimiter: dfMainOutput.write.partitionBy("DataPartiotion","StatementTypeCode") .format("csv") .option("delimiter", "^") .option("nullValue", "") .option("codec", "gzip") .save("s3://trfsdisu/SPARK
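One workaround sketch under the assumptions of the excerpt (dfMainOutput and the two partition columns exist; the delimiter and output path are placeholders): build the delimited record yourself with concat_ws, keep the partition columns, and write through the text source, which has no single-character delimiter restriction.

```scala
import org.apache.spark.sql.functions.{col, concat_ws}

// Join every column into one string using the multi-character delimiter,
// casting to string first; "|^|" and the output path are placeholders.
val delimited = dfMainOutput.withColumn(
  "value",
  concat_ws("|^|", dfMainOutput.columns.map(c => col(c).cast("string")): _*)
)

delimited
  .select(col("DataPartiotion"), col("StatementTypeCode"), col("value"))
  .write
  .partitionBy("DataPartiotion", "StatementTypeCode")
  .option("compression", "gzip")
  .text("s3://bucket/output") // placeholder path
```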

spark: What is the difference between Aggregator and UDAF?

∥☆過路亽.° submitted on 2019-12-06 11:27:34
Question: In Spark's documentation, Aggregator is: abstract class Aggregator[-IN, BUF, OUT] extends Serializable. A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value. UserDefinedAggregateFunction is: abstract class UserDefinedAggregateFunction extends Serializable. The base class for implementing user-defined aggregate functions (UDAF). According to Dataset Aggregator - Databricks, “an Aggregator is
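To make the quoted definition concrete, a minimal typed Aggregator sketch; the case classes and the averaging logic are illustrative, not taken from the documentation excerpt:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical input and buffer types for a typed average.
case class Record(key: String, value: Double)
case class AvgBuffer(sum: Double, count: Long)

object AverageAggregator extends Aggregator[Record, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, r: Record): AvgBuffer = AvgBuffer(b.sum + r.value, b.count + 1)
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
  def finish(b: AvgBuffer): Double = if (b.count == 0) 0.0 else b.sum / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage sketch on a Dataset[Record]:
// val ds = spark.createDataset(Seq(Record("a", 1.0), Record("a", 3.0)))
// ds.select(AverageAggregator.toColumn).show()
```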

Spark Java: How to move data from HTTP source to Couchbase sink?

做~自己de王妃 submitted on 2019-12-06 11:10:44
Question: I have a .gz file available on a web server that I want to consume in a streaming manner and insert the data into Couchbase. The .gz archive contains only one file, which in turn contains one JSON object per line. Since Spark doesn't have an HTTP receiver, I wrote one myself (shown below). I'm using the Couchbase Spark connector to do the insertion. However, when running, the job is not actually inserting anything. I suspect this is due to my inexperience with Spark and not knowing how to
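Since the receiver code itself is cut off above, here is a minimal Scala sketch of the general pattern only: a custom Receiver that streams the gzipped HTTP response line by line and hands each line to Spark with store(). The URL is a placeholder, and the Couchbase write step is only referenced in a comment rather than shown.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.URL
import java.util.zip.GZIPInputStream

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Streams a gzipped file over HTTP, one line (one JSON object) at a time.
class HttpGzipReceiver(url: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Read on a separate thread so onStart() returns immediately,
    // as the Receiver contract requires.
    new Thread("HTTP GZip Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* nothing to clean up; the read loop ends with the stream */ }

  private def receive(): Unit = {
    try {
      val in = new GZIPInputStream(new URL(url).openStream())
      val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // each line is one JSON object
        line = reader.readLine()
      }
      reader.close()
    } catch {
      case e: Exception => restart("Error reading from " + url, e)
    }
  }
}

// Usage sketch: ssc.receiverStream(new HttpGzipReceiver("http://example.com/data.json.gz"))
// The resulting DStream[String] can then be converted to documents and
// written out with the Couchbase Spark connector.
```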

Spark: executor memory exceeds physical limit

随声附和 submitted on 2019-12-06 09:43:23
Question: My input dataset is about 150G. I am setting --conf spark.cores.max=100 --conf spark.executor.instances=20 --conf spark.executor.memory=8G --conf spark.executor.cores=5 --conf spark.driver.memory=4G, but since the data is not evenly distributed across executors, I kept getting Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Here are my questions: 1. Did I not set up enough memory in the first place? I think 20 * 8G > 150G, but it's hard to make perfect
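The limit YARN enforces is roughly executor memory plus the off-heap memory overhead (by default about 10% of executor memory), so a usual first step is to raise that overhead explicitly and/or repartition the skewed data, rather than only increasing executor memory. A sketch of what that could look like in Spark 2.x; all values are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative settings only; the right numbers depend on the cluster.
// In Spark 2.x on YARN the overhead key is spark.yarn.executor.memoryOverhead
// (renamed to spark.executor.memoryOverhead in later releases).
val spark = SparkSession.builder()
  .appName("SkewedJob")
  .config("spark.executor.instances", "20")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", "5")
  .config("spark.yarn.executor.memoryOverhead", "2048") // MB, above the ~10% default
  .getOrCreate()

// Repartitioning after the skewed read spreads records more evenly across executors.
// val balanced = spark.read.parquet("s3://bucket/input").repartition(800) // placeholder path
```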

'RDD' object has no attribute '_jdf' pyspark RDD

喜欢而已 submitted on 2019-12-06 09:14:24
I'm new to PySpark. I would like to perform some machine learning on a text file. from pyspark import Row from pyspark.context import SparkContext from pyspark.sql.session import SparkSession from pyspark import SparkConf sc = SparkContext spark = SparkSession.builder.appName("ML").getOrCreate() train_data = spark.read.text("20ng-train-all-terms.txt") td= train_data.rdd #transform df to rdd tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:])) from pyspark.ml.feature import CountVectorizer vectorizer = CountVectorizer(inputCol ="words", outputCol=

How to generate a DataFrame with random content and N rows?

时光总嘲笑我的痴心妄想 submitted on 2019-12-06 08:54:24
How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in the range (1, 100)? I know how to create a DataFrame manually, but I cannot automate it: val df = sc.parallelize(Seq((1,20, 40), (60, 10, 80), (30, 15, 30))).toDF("col1", "col2", "col3") Here you go, Seq.fill is your friend: def randomInt1to100 = scala.util.Random.nextInt(100)+1 val df = sc.parallelize( Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)} ).toDF("col1", "col2", "col3") You can simply use scala.util.Random to generate the random numbers within the range and loop for
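An alternative sketch for the same result that stays in the DataFrame API, assuming an existing SparkSession named spark, using spark.range and rand instead of building a local Seq:

```scala
import org.apache.spark.sql.functions.rand

// 100 rows, 3 columns of random integers in [1, 100], generated on the cluster
// rather than collected from a local collection.
val df = spark.range(100).select(
  (rand() * 100 + 1).cast("int").as("col1"),
  (rand() * 100 + 1).cast("int").as("col2"),
  (rand() * 100 + 1).cast("int").as("col3")
)
df.show(5)
```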