rdd

A bad issue with kafka and Spark Streaming on Python

Submitted by 拟墨画扇 on 2021-01-07 02:45:47
Question: N.B. This is NOT the same issue that I had in my first post on this site; however, it is the same project. I'm ingesting some files into PostgreSQL from Kafka using Spark Streaming. These are my steps for the project: 1- creating a script for the Kafka producer (done, it works fine); 2- creating a Python script that reads files from the Kafka producer; 3- sending the files to PostgreSQL. For the connection between Python and PostgreSQL I use psycopg2. I am also using Python 3 and Java jdk1.8.0_261 and
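For reference, a minimal sketch in Scala of the same pipeline shape, using Structured Streaming and the built-in JDBC writer (the question itself uses Python with psycopg2; the topic name, JDBC URL, table and credentials below are made-up placeholders, and the PostgreSQL JDBC driver is assumed to be on the classpath):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("kafka-to-postgres").getOrCreate()

// Read each Kafka record's value as a string column.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "files-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS payload")

// foreachBatch hands each micro-batch over as a plain DataFrame, which can be
// appended to PostgreSQL through the JDBC writer.
val query = stream.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "ingested_files")
      .option("user", "postgres")
      .option("password", "postgres")
      .mode("append")
      .save()
  }
  .start()

query.awaitTermination()
```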

“Task not serializable” with java time in Spark-shell (or zeppelin) but not in spark-submit

Submitted by  ̄綄美尐妖づ on 2020-12-15 08:59:16
Question: Oddly, I have found several times that there is a difference between running with spark-submit and running with spark-shell (or Zeppelin), hard as that is to believe. With some code, spark-shell (or Zeppelin) can throw this exception while spark-submit works fine: org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner
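A commonly cited cause, shown as a minimal sketch (assumes a spark-shell session with a SparkContext named sc; the RDD and the cutoff date are invented for illustration): in the shell, every top-level val lives inside a REPL line-wrapper object, and a closure that references it can drag that whole non-serializable wrapper onto the executors, whereas the same code compiled into an application and launched with spark-submit serializes only what it actually needs.

```scala
import java.time.LocalDate

// Top-level val in the REPL: the LocalDate itself is serializable, but it is
// stored as a field of the shell's (non-serializable) line-wrapper object.
val cutoff = LocalDate.of(2020, 1, 1)

val dates = sc.parallelize(Seq(LocalDate.of(2019, 6, 1), LocalDate.of(2021, 3, 15)))

// Can throw "Task not serializable" in spark-shell or Zeppelin, because the
// closure pulls in the enclosing wrapper object along with `cutoff`:
// dates.filter(d => d.isAfter(cutoff)).collect()

// Common workaround: copy the value into a local val inside a method or block,
// so the closure captures only that local copy.
def afterCutoff(in: org.apache.spark.rdd.RDD[LocalDate]) = {
  val localCutoff = cutoff
  in.filter(d => d.isAfter(localCutoff))
}

afterCutoff(dates).collect().foreach(println)
```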

in spark streaming must i call count() after cache() or persist() to force caching/persistence to really happen?

Submitted by 送分小仙女□ on 2020-12-06 06:36:35
Question: Watching this very good video on Spark internals, the presenter says that unless one performs an action on one's RDD after caching it, caching will not really happen. I never see count() being called in any other circumstances, so I'm guessing that he is only calling count() after cache() to force persistence in the simple example he is giving, and that it is not necessary to do this every time one calls cache() or persist() in one's code. Is this right? Answer 1: unless one performs an action on one's RDD
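A minimal sketch of the point being made (assumes a spark-shell session with a SparkContext named sc; the input path is hypothetical): cache() and persist() are lazy and only mark the RDD, and any action, not count() specifically, triggers the actual materialization, so in normal code whatever action runs next will fill the cache.

```scala
// cache() only records the desired storage level; nothing is computed or stored yet.
val words = sc.textFile("hdfs:///data/input.txt").flatMap(_.split("\\s+"))
val cached = words.cache()

// The first action materializes the RDD and fills the cache; count() is simply
// a cheap, convenient action to use for that in a demo.
cached.count()

// Later actions reuse the cached partitions instead of re-reading the file.
println(cached.distinct().count())
```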

How to display a KeyValueGroupedDataset in Spark?

Submitted by 痞子三分冷 on 2020-11-30 06:46:30
Question: I am trying to learn Datasets in Spark. One thing I can't figure out is how to display a KeyValueGroupedDataset, as show doesn't work for it. Also, what is the equivalent of map for a KeyValueGroupedDataset? I would appreciate it if someone gave some examples. Answer 1: OK, I got the idea from examples given here and here. Below is a simple example that I've written: val x = Seq(("a", 36), ("b", 33), ("c", 40), ("a", 38), ("c", 39)).toDS x: org.apache.spark.sql.Dataset[(String, Int)] = [_1:
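Completing the idea with a minimal sketch (assumes a spark-shell session where spark.implicits._ is in scope): a KeyValueGroupedDataset has no show() of its own, so it has to be turned back into an ordinary Dataset first, and mapValues/mapGroups play the role that map plays on a plain Dataset.

```scala
import spark.implicits._

val x = Seq(("a", 36), ("b", 33), ("c", 40), ("a", 38), ("c", 39)).toDS()

// groupByKey returns a KeyValueGroupedDataset[String, (String, Int)].
val grouped = x.groupByKey(_._1)

// mapGroups turns each (key, iterator of values) pair back into a Dataset row,
// which can then be displayed with show().
grouped.mapGroups((key, rows) => (key, rows.map(_._2).sum)).show()

// mapValues is the closest analogue of map: it transforms the values while
// keeping the grouping key; reduceGroups then yields one row per key.
grouped.mapValues(_._2).reduceGroups(_ + _).show()
```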