apache-spark

java.lang.ClassNotFoundException: com.johnsnowlabs.nlp.DocumentAssembler spark in Pycharm with conda env

◇◆丶佛笑我妖孽 submitted on 2021-02-11 12:28:35
Question: I saved a pre-trained model from spark-nlp, and now I'm trying to load it in a Python script run from PyCharm with an Anaconda env:

    Model_path = "./xxx"
    model = PipelineModel.load(Model_path)

But I got the following error (I tried pyspark 2.4.4 with spark-nlp 2.4.4, and pyspark 2.4.4 with spark-nlp 2.5.4; same error both times):

21/02/05 13:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache
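The excerpt cuts off before the rest of the stack trace, but a ClassNotFoundException for com.johnsnowlabs.nlp.DocumentAssembler generally means the spark-nlp jar is not on the Spark classpath when the session is created. A minimal sketch, assuming the sparknlp Python package is installed in the conda env and reusing the placeholder model path from the question:

    import sparknlp
    from pyspark.ml import PipelineModel

    # start() builds a SparkSession with a matching spark-nlp jar on the classpath,
    # so the Scala classes behind the saved pipeline stages can be resolved.
    spark = sparknlp.start()

    Model_path = "./xxx"  # placeholder path from the question
    model = PipelineModel.load(Model_path)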

How to run spark-submit in virtualenv for pyspark?

喜欢而已 submitted on 2021-02-11 12:21:57
Question: Is there a way to run spark-submit (Spark v2.3.2 from HDP 3.1.0) while in a virtualenv? I have a Python file that uses python3 (and some specific libs) in a virtualenv (to isolate lib versions from the rest of the system). I would like to run this file with /bin/spark-submit, but attempting to do so I get...

    [me@airflowetl tests]$ source ../venv/bin/activate; /bin/spark-submit sparksubmit.test.py
    File "/bin/hdp-select", line 255
    print "ERROR: Invalid package - " + name
    ^
    SyntaxError
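The excerpt ends at the SyntaxError, so the sketch below is only one plausible direction rather than the asker's accepted fix: pointing PySpark at the virtualenv interpreter from inside the job via the standard PYSPARK_PYTHON environment variable and the spark.pyspark.python setting. The interpreter path is hypothetical.

    import os
    from pyspark.sql import SparkSession

    venv_python = "/home/me/venv/bin/python3"  # hypothetical virtualenv interpreter

    # Ensure executors launch the virtualenv's python3; the driver side is whatever
    # interpreter runs this script (ideally the same venv python).
    os.environ["PYSPARK_PYTHON"] = venv_python

    spark = (SparkSession.builder
             .appName("venv-test")
             .config("spark.pyspark.python", venv_python)
             .getOrCreate())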

SQL or Pyspark - Get the last time a column had a different value for each ID

爱⌒轻易说出口 submitted on 2021-02-11 12:14:05
Question: I am using pyspark, so I have tried both pyspark code and SQL. I am trying to get the time at which the ADDRESS column last held a different value, grouped by USER_ID. The rows are ordered by TIME. Take the table below:

    +---+-------+-------+----+
    | ID|USER_ID|ADDRESS|TIME|
    +---+-------+-------+----+
    |  1|      1|      A|  10|
    |  2|      1|      B|  15|
    |  3|      1|      A|  20|
    |  4|      1|      A|  40|
    |  5|      1|      A|  45|
    +---+-------+-------+----+

The correct new column I would like is as below:

    +---+-------+-------+----+---------+
    | ID|USER_ID
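The expected output table is cut off above, so the sketch below shows one plausible reading of the requirement: for every row, the TIME of the most recent preceding row whose ADDRESS differed from the current one, per USER_ID. The column name LAST_DIFF_TIME is invented for illustration.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 1, "A", 10), (2, 1, "B", 15), (3, 1, "A", 20), (4, 1, "A", 40), (5, 1, "A", 45)],
        ["ID", "USER_ID", "ADDRESS", "TIME"])

    w = Window.partitionBy("USER_ID").orderBy("TIME")

    # At each point where ADDRESS changes, remember the previous row's TIME
    # (the last moment the address was something else), then carry that value
    # forward through the run of identical addresses.
    prev = (df.withColumn("prev_addr", F.lag("ADDRESS").over(w))
              .withColumn("prev_time", F.lag("TIME").over(w)))
    change_time = F.when(F.col("prev_addr") != F.col("ADDRESS"), F.col("prev_time"))
    result = (prev.withColumn("LAST_DIFF_TIME",
                              F.last(change_time, ignorenulls=True).over(w))
                  .drop("prev_addr", "prev_time"))
    result.show()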

How to load data in weka Instances from a spark dataframe

那年仲夏 submitted on 2021-02-11 08:27:20
Question: I have a Spark DataFrame. Now I want to do some processing using Weka, so I want to load the data from the DataFrame into Weka Instances and finally return the result as a DataFrame. As both the structure and the data types are different, I am wondering whether anybody can help me with the conversion. The code snippet may look like below:

    val df: DataFrame = data
    val data: Instances = process(df)

Source: https://stackoverflow.com/questions/58160584/how-to-load-data-in-weka-instances-from-a-spark-dataframe
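The excerpt ends at the source link without an answer. One low-tech bridge (a sketch, not the asker's or any answerer's method) is to write the DataFrame out in a format Weka already loads, such as CSV for Weka's CSVLoader, and read the Weka output back into Spark afterwards. The paths below are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("input.parquet")  # placeholder for the existing DataFrame

    # Write a single headered CSV that Weka's CSVLoader can turn into Instances.
    (df.coalesce(1)
       .write.mode("overwrite")
       .option("header", True)
       .csv("/tmp/weka_input"))

    # After the Weka processing step has written its result, read it back into Spark.
    processed = (spark.read
                 .option("header", True)
                 .option("inferSchema", True)
                 .csv("/tmp/weka_output"))  # hypothetical output written by the Weka job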

Spark Streaming: How Spark and Kafka communication happens?

萝らか妹 submitted on 2021-02-11 07:46:14
Question: I would like to understand how the communication between the Kafka and Spark (Streaming) nodes takes place. I have the following questions: If the Kafka servers and the Spark nodes are in two separate clusters, how does the communication take place, and what steps are needed to configure them? If both are in the same cluster but on different nodes, how does the communication happen? By communication I mean whether it is RPC or socket communication. I would like to understand the internal
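The question is conceptual and gets cut off; as a concrete anchor for how Spark reaches Kafka, the sketch below uses the Structured Streaming Kafka source (an assumption, since the question is about DStream-era streaming). Spark tasks act as regular Kafka consumer clients and talk to the brokers over Kafka's own TCP protocol, so the only wiring is the broker addresses reachable from the Spark nodes. Host names and the topic are placeholders, and the spark-sql-kafka package must be on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-read").getOrCreate()

    # Each executor task opens TCP connections to the brokers discovered through
    # kafka.bootstrap.servers, exactly like any other Kafka consumer.
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "kafka-broker1:9092,kafka-broker2:9092")
              .option("subscribe", "events")
              .load())

    query = (stream.selectExpr("CAST(value AS STRING)")
                   .writeStream.format("console").start())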

Spark-shell : The number of columns doesn't match

岁酱吖の submitted on 2021-02-11 07:44:22
Question: I have a CSV-format file separated by the pipe delimiter "|", and the dataset has 2 columns, like below:

    Column1|Column2
    1|Name_a
    2|Name_b

But sometimes we receive only one column value and the other is missing, like below:

    Column1|Column2
    1|Name_a
    2|Name_b
    3
    4
    5|Name_c
    6
    7|Name_f

Any row with a mismatched number of columns is a garbage value for us; in the above example those are the rows with column values 3, 4, and 6, and we want to discard them. Is there any direct way I can discard those
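The excerpt stops mid-question, so this is only a sketch of one straightforward approach (in pyspark rather than the Scala shell the title mentions, and not necessarily what any answer used): read the file as plain text, keep only lines that split into exactly two pipe-separated fields, and discard the rest. The file path is an assumption.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read raw lines, drop the header and any line that does not split into
    # exactly two fields, then name the two columns.
    raw = spark.read.text("data.psv").withColumnRenamed("value", "line")
    parts = F.split(F.col("line"), r"\|")
    clean = (raw.filter(F.col("line") != "Column1|Column2")
                .filter(F.size(parts) == 2)
                .select(parts.getItem(0).alias("Column1"),
                        parts.getItem(1).alias("Column2")))
    clean.show()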

Detecting repeating consecutive values in large datasets with Spark

廉价感情. submitted on 2021-02-10 23:46:17
Question: Cheers! Recently I have been trying out Spark, and so far I have observed quite interesting results, but currently I am stuck with the famous groupByKey OOM problem. Basically, the job tries to search large datasets for periods where the measured value increases consecutively at least N times. I managed to get rid of the problem by writing the results to disk, but the application now runs much slower (which is expected due to the disk IO). Now the question: is
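The question text ends mid-sentence, so the sketch below is not the asker's solution; it shows one window-function alternative to groupByKey for finding stretches where a value increases consecutively, using a run-id built from a running sum of "break" flags. The column names (sensor_id, ts, value), the input path, and N are assumptions.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    N = 3                                            # assumed minimum run length
    df = spark.read.parquet("measurements.parquet")  # assumed input with (sensor_id, ts, value)

    w = Window.partitionBy("sensor_id").orderBy("ts")

    # Flag every row that does NOT continue an increase; the running sum of the
    # flags gives each increasing stretch its own run_id.
    flagged = (df.withColumn("prev_value", F.lag("value").over(w))
                 .withColumn("run_break",
                             F.when(F.col("prev_value") < F.col("value"), 0).otherwise(1)))
    runs = flagged.withColumn("run_id", F.sum("run_break").over(w))

    # Keep only the stretches spanning at least N rows.
    long_runs = (runs.groupBy("sensor_id", "run_id")
                     .agg(F.count("*").alias("run_len"),
                          F.min("ts").alias("start_ts"),
                          F.max("ts").alias("end_ts"))
                     .filter(F.col("run_len") >= N))
    long_runs.show()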