pyspark

AssertionError: col should be Column

笑着哭i submitted on 2020-01-31 04:45:31
Question: How do I create a new column in PySpark and fill it with today's date? This is what I tried:

    import datetime
    now = datetime.datetime.now()
    df = df.withColumn("date", str(now)[:10])

I get this error: AssertionError: col should be Column

Answer 1: "How to create a new column in PySpark and fill this column with the date of today?" There is already a function for that:

    from pyspark.sql.functions import current_date
    df.withColumn("date", current_date().cast("string"))
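The error happens because withColumn expects a Column object, not a plain Python string. A minimal runnable sketch (the tiny example DataFrame is an assumption) showing both the lit() fix and the current_date() approach from the answer:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit, current_date
    import datetime

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["id"])  # tiny example DataFrame (assumption)

    # Fix 1: withColumn needs a Column, so wrap the plain Python string with lit()
    today = str(datetime.datetime.now())[:10]
    df_lit = df.withColumn("date", lit(today))

    # Fix 2: let Spark compute the date, as in the answer above
    df_cur = df.withColumn("date", current_date().cast("string"))
    df_cur.show()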

How to take a random row from a PySpark DataFrame?

∥☆過路亽.° submitted on 2020-01-31 02:59:27
Question: How can I get a random row from a PySpark DataFrame? I only see the method sample(), which takes a fraction as a parameter. Setting this fraction to 1/numberOfRows leads to random results, where sometimes I won't get any row. On an RDD there is a method takeSample() that takes as a parameter the number of elements you want the sample to contain. I understand that this might be slow, as you have to count each partition, but is there a way to get something like this on a DataFrame?

Answer 1: You can
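The answer is cut off above. As a hedged sketch, two common ways to pull a single random row (the example DataFrame is an assumption):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import rand

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)  # example DataFrame with a single "id" column (assumption)

    # Option 1: drop to the RDD, which has takeSample(withReplacement, num)
    row = df.rdd.takeSample(False, 1)[0]

    # Option 2: stay in the DataFrame API; this shuffles the whole DataFrame, so it can be slow
    row_df = df.orderBy(rand()).limit(1).collect()[0]
    print(row, row_df)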

pyspark read format jdbc generates ORA-00903: invalid table name Error

╄→гoц情女王★ submitted on 2020-01-30 12:12:05
Question: With pyspark running on a remote server, I am able to connect to an Oracle database on another server with JDBC, but any valid query I run returns an ORA-00903: invalid table name error. I am able to connect to the database from my local machine with cx_Oracle or pyodbc, and when I connect locally those same queries run without problems. I've varied the queries I run both locally and remotely, but no matter what type of valid query I run

    ORACLE_JAR = "ojdbc7
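The question's code is cut off above. As a hedged sketch only: with Spark's JDBC reader, ORA-00903 often shows up when a full SELECT statement is passed as the dbtable option, which expects a table name or a parenthesized, aliased subquery. The URL, credentials, schema, and table name below are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # "dbtable" must be a table name or a parenthesized subquery with an alias;
    # passing a bare SELECT here is a common source of ORA-00903.
    # All connection details below are placeholders.
    dbtable = "(SELECT * FROM some_schema.some_table) t"

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/service_name")
          .option("dbtable", dbtable)
          .option("user", "some_user")
          .option("password", "some_password")
          .option("driver", "oracle.jdbc.driver.OracleDriver")
          .load())
    df.printSchema()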

Pyspark: \Anaconda3\envs\xgboost\python.exe] was unexpected at this time

一曲冷凌霜 submitted on 2020-01-30 12:01:09
Question: I am trying to install PySpark on Windows. I applied setx to the following:

    PYSPARK_DRIVER_PYTHON "C:\Users\Sade D\Anaconda3\envs\xgboost\Scripts\jupyter.exe"
    HADOOP_HOME "C:\spark\hadoop"
    JAVA_HOME "C:\Program Files\Java\jdk1.8.0_172"
    PYSPARK_DRIVER_PYTHON_OPTS "notebook"
    PYSPARK_PYTHON "C:\Users\Sade D\Anaconda3\envs\xgboost\python.exe"
    SCALA_HOME "C:\spark\scala"
    SPARK_HOME "C:\spark\spark"
    JAVA_HOME "C:\Program Files\Java\jdk1.8.0_172"

In the system variables, in Path, I have attached the
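The listing is cut off above. On Windows, a "] was unexpected at this time" message usually means cmd mis-parsed one of these variables, commonly because of the unquoted space in "C:\Users\Sade D\..." or stray quotes stored inside the value. A small Python check (a sketch for diagnosis, not the fix itself) to see exactly what values ended up in the environment:

    import os

    # Print the PySpark-related variables exactly as the shell exported them;
    # a value with unbalanced quotes or an unquoted space is the usual culprit.
    for name in ["PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON",
                 "PYSPARK_DRIVER_PYTHON_OPTS", "SPARK_HOME",
                 "HADOOP_HOME", "JAVA_HOME", "SCALA_HOME"]:
        print(f"{name}={os.environ.get(name)!r}")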

Pyspark Structured streaming processing

 ̄綄美尐妖づ submitted on 2020-01-30 11:50:47
Question: I am trying to build a structured streaming application with Spark. The main idea is to read from a Kafka source, process the input, and write back to another topic. I have successfully made Spark read from and write to Kafka; however, my problem is with the processing part. I have tried the foreach function to capture every row and process it before writing back to Kafka, but it always only does the foreach part and never writes back to Kafka. If I, however, remove the foreach part from the
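The question is cut off above. A hedged sketch of the usual structure (assuming the spark-sql-kafka package is on the classpath; broker address, topic names, and the checkpoint path are placeholders): express the per-row work as DataFrame transformations, because foreach() is itself a sink and nothing placed after it reaches a Kafka writer.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, upper

    spark = SparkSession.builder.getOrCreate()

    # Read from the source topic
    source = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "input_topic")
              .load())

    # Do the processing as DataFrame transformations instead of inside foreach()
    processed = source.select(upper(col("value").cast("string")).alias("value"))

    # Write the transformed stream back to Kafka (the sink needs a "value" column)
    query = (processed.writeStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("topic", "output_topic")
             .option("checkpointLocation", "/tmp/checkpoints/kafka-example")
             .start())
    query.awaitTermination()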

Grouping pyspark dataframe by intersection [duplicate]

亡梦爱人 submitted on 2020-01-30 10:59:52
Question: This question already has an answer here: How to group by common element in array? (1 answer). Closed 7 months ago. I need to group a PySpark dataframe by the intersection of the arrays in a column. For example, from a dataframe like this:

    v1 | [1, 2, 3]
    v2 | [4, 5]
    v3 | [1, 7]

the result should be:

    [v1, v3] | [1, 2, 3, 7]
    [v2]     | [4, 5]

because rows 1 and 3 have the value 1 in common. Is there a method like "group by when intersection"? Thank you in advance for ideas and suggestions on how to solve this.

Answer 1: from
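The answer is cut off at "from". As a hedged sketch of one common approach (not necessarily the one in the truncated answer): treat each row as a vertex, connect rows whose arrays share an element, and group by connected component. This assumes the graphframes package is available and Spark 2.4+ for flatten and array_distinct:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from graphframes import GraphFrame

    spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

    df = spark.createDataFrame(
        [("v1", [1, 2, 3]), ("v2", [4, 5]), ("v3", [1, 7])], ["id", "values"])

    # Build edges between rows that share at least one array element
    exploded = df.select("id", F.explode("values").alias("v"))
    edges = (exploded.alias("a").join(exploded.alias("b"), "v")
             .where(F.col("a.id") < F.col("b.id"))
             .select(F.col("a.id").alias("src"), F.col("b.id").alias("dst"))
             .distinct())

    # Connected components give one label per group of overlapping rows
    components = GraphFrame(df.select("id"), edges).connectedComponents()

    result = (df.join(components, "id")
              .groupBy("component")
              .agg(F.collect_list("id").alias("ids"),
                   F.array_distinct(F.flatten(F.collect_list("values"))).alias("values")))
    result.show(truncate=False)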

How to pass an array column and convert it to a numpy array in pyspark

人走茶凉 submitted on 2020-01-30 10:32:14
Question: I have a data frame like below:

    from pyspark import SparkContext, SparkConf, SQLContext
    import numpy as np
    from scipy.spatial.distance import cosine
    from pyspark.sql.functions import lit, countDistinct, udf, array, struct
    import pyspark.sql.functions as F

    config = SparkConf("local")
    sc = SparkContext(conf=config)
    sqlContext = SQLContext(sc)

    @udf("float")
    def myfunction(x):
        y = np.array([1, 3, 9])
        x = np.array(x)
        return cosine(x, y)

    df = sqlContext.createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2"
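The listing is cut off above (doc_2's values are missing, so the ones below are placeholders). A hedged sketch of how an array column is typically handed to such a UDF: combine the numeric columns with array(), and cast the numpy result back to a plain Python float so the FloatType return type can be serialized:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array, col, udf
    import numpy as np
    from scipy.spatial.distance import cosine

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # doc_2's values are placeholders; the original listing is cut off
    df = spark.createDataFrame(
        [("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 0, 8, 6)],
        ["id", "x", "y", "z"])

    @udf("float")
    def my_cosine(v):
        y = np.array([1, 3, 9])
        # v arrives as a Python list; cast the numpy result back to a plain float
        return float(cosine(np.array(v), y))

    result = df.withColumn("distance", my_cosine(array(col("x"), col("y"), col("z"))))
    result.show()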
