pyspark

AssertionError: col should be Column

笑着哭i submitted on 2020-01-31 04:45:31
Question: How do I create a new column in PySpark and fill it with today's date? This is what I tried:

    import datetime
    now = datetime.datetime.now()
    df = df.withColumn("date", str(now)[:10])

I get this error: AssertionError: col should be Column

Answer 1: "How to create a new column in PySpark and fill this column with the date of today?" There is already a function for that:

    from pyspark.sql.functions import current_date
    df.withColumn("date", current_date().cast("string"))
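The error happens because withColumn expects a Column object, not a plain Python string. A minimal runnable sketch (the tiny example DataFrame is an assumption) showing both the lit() fix and the current_date() approach from the answer:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit, current_date
    import datetime

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["id"])  # tiny example DataFrame (assumption)

    # Fix 1: withColumn needs a Column, so wrap the plain Python string with lit()
    today = str(datetime.datetime.now())[:10]
    df_lit = df.withColumn("date", lit(today))

    # Fix 2: let Spark compute the date, as in the answer above
    df_cur = df.withColumn("date", current_date().cast("string"))
    df_cur.show()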

How to take a random row from a PySpark DataFrame?

∥☆過路亽.° submitted on 2020-01-31 02:59:27
Question: How can I get a random row from a PySpark DataFrame? I only see the method sample(), which takes a fraction as a parameter. Setting this fraction to 1/numberOfRows leads to random results, where sometimes I won't get any row. On an RDD there is a method takeSample() that takes as a parameter the number of elements you want the sample to contain. I understand that this might be slow, as you have to count each partition, but is there a way to get something like this on a DataFrame?

Answer 1: You can
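The answer is cut off above. As a hedged sketch, two common ways to pull a single random row (the example DataFrame is an assumption):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import rand

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)  # example DataFrame with a single "id" column (assumption)

    # Option 1: drop to the RDD, which has takeSample(withReplacement, num)
    row = df.rdd.takeSample(False, 1)[0]

    # Option 2: stay in the DataFrame API; this shuffles the whole DataFrame, so it can be slow
    row_df = df.orderBy(rand()).limit(1).collect()[0]
    print(row, row_df)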

pyspark read format jdbc generates ORA-00903: invalid table name Error

╄→гoц情女王★ submitted on 2020-01-30 12:12:05
Question: With pyspark running on a remote server, I am able to connect to an Oracle database on another server with JDBC, but any valid query I run returns an ORA-00903: invalid table name error. I am able to connect to the database from my local machine with cx_Oracle or pyodbc, and when I connect locally those same queries run without problems. I've varied the queries I run both locally and remotely, but no matter what type of valid query I run

    ORACLE_JAR = "ojdbc7
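The question's code is cut off above. As a hedged sketch only: with Spark's JDBC reader, ORA-00903 often shows up when a full SELECT statement is passed as the dbtable option, which expects a table name or a parenthesized, aliased subquery. The URL, credentials, schema, and table name below are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # "dbtable" must be a table name or a parenthesized subquery with an alias;
    # passing a bare SELECT here is a common source of ORA-00903.
    # All connection details below are placeholders.
    dbtable = "(SELECT * FROM some_schema.some_table) t"

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/service_name")
          .option("dbtable", dbtable)
          .option("user", "some_user")
          .option("password", "some_password")
          .option("driver", "oracle.jdbc.driver.OracleDriver")
          .load())
    df.printSchema()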

Pyspark: \Anaconda3\envs\xgboost\python.exe] was unexpected at this time

一曲冷凌霜 submitted on 2020-01-30 12:01:09
Question: I am trying to install PySpark on Windows. I applied setx to the following:

    PYSPARK_DRIVER_PYTHON "C:\Users\Sade D\Anaconda3\envs\xgboost\Scripts\jupyter.exe"
    HADOOP_HOME "C:\spark\hadoop"
    JAVA_HOME "C:\Program Files\Java\jdk1.8.0_172"
    PYSPARK_DRIVER_PYTHON_OPTS "notebook"
    PYSPARK_PYTHON "C:\Users\Sade D\Anaconda3\envs\xgboost\python.exe"
    SCALA_HOME "C:\spark\scala"
    SPARK_HOME "C:\spark\spark"
    JAVA_HOME "C:\Program Files\Java\jdk1.8.0_172"

In the system variables, in Path, I have attached the
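The listing is cut off above. On Windows, a "] was unexpected at this time" message usually means cmd mis-parsed one of these variables, commonly because of the unquoted space in "C:\Users\Sade D\..." or stray quotes stored inside the value. A small Python check (a sketch for diagnosis, not the fix itself) to see exactly what values ended up in the environment:

    import os

    # Print the PySpark-related variables exactly as the shell exported them;
    # a value with unbalanced quotes or an unquoted space is the usual culprit.
    for name in ["PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON",
                 "PYSPARK_DRIVER_PYTHON_OPTS", "SPARK_HOME",
                 "HADOOP_HOME", "JAVA_HOME", "SCALA_HOME"]:
        print(f"{name}={os.environ.get(name)!r}")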

Pyspark Structured streaming processing

 ̄綄美尐妖づ submitted on 2020-01-30 11:50:47
Question: I am trying to build a structured streaming application with Spark. The main idea is to read from a Kafka source, process the input, and write back to another topic. I have successfully made Spark read from and write to Kafka; however, my problem is with the processing part. I have tried the foreach function to capture every row and process it before writing back to Kafka, but it always only does the foreach part and never writes back to Kafka. If I, however, remove the foreach part from the
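The question is cut off above. A hedged sketch of the usual structure (assuming the spark-sql-kafka package is on the classpath; broker address, topic names, and the checkpoint path are placeholders): express the per-row work as DataFrame transformations, because foreach() is itself a sink and nothing placed after it reaches a Kafka writer.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, upper

    spark = SparkSession.builder.getOrCreate()

    # Read from the source topic
    source = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "input_topic")
              .load())

    # Do the processing as DataFrame transformations instead of inside foreach()
    processed = source.select(upper(col("value").cast("string")).alias("value"))

    # Write the transformed stream back to Kafka (the sink needs a "value" column)
    query = (processed.writeStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("topic", "output_topic")
             .option("checkpointLocation", "/tmp/checkpoints/kafka-example")
             .start())
    query.awaitTermination()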

Grouping pyspark dataframe by intersection [duplicate]

亡梦爱人 submitted on 2020-01-30 10:59:52
Question: This question already has an answer here: How to group by common element in array? (1 answer). Closed 7 months ago. I need to group a PySpark dataframe by the intersection of the arrays in a column. For example, from a dataframe like this:

    v1 | [1, 2, 3]
    v2 | [4, 5]
    v3 | [1, 7]

the result should be:

    [v1, v3] | [1, 2, 3, 7]
    [v2]     | [4, 5]

because rows 1 and 3 have the value 1 in common. Is there a method like "group by when intersection"? Thank you in advance for ideas and suggestions on how to solve this.

Answer 1: from
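The answer is cut off at "from". As a hedged sketch of one common approach (not necessarily the one in the truncated answer): treat each row as a vertex, connect rows whose arrays share an element, and group by connected component. This assumes the graphframes package is available and Spark 2.4+ for flatten and array_distinct:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from graphframes import GraphFrame

    spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

    df = spark.createDataFrame(
        [("v1", [1, 2, 3]), ("v2", [4, 5]), ("v3", [1, 7])], ["id", "values"])

    # Build edges between rows that share at least one array element
    exploded = df.select("id", F.explode("values").alias("v"))
    edges = (exploded.alias("a").join(exploded.alias("b"), "v")
             .where(F.col("a.id") < F.col("b.id"))
             .select(F.col("a.id").alias("src"), F.col("b.id").alias("dst"))
             .distinct())

    # Connected components give one label per group of overlapping rows
    components = GraphFrame(df.select("id"), edges).connectedComponents()

    result = (df.join(components, "id")
              .groupBy("component")
              .agg(F.collect_list("id").alias("ids"),
                   F.array_distinct(F.flatten(F.collect_list("values"))).alias("values")))
    result.show(truncate=False)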

How to pass an array column and convert it to a numpy array in pyspark

人走茶凉 submitted on 2020-01-30 10:32:14
Question: I have a data frame like below:

    from pyspark import SparkContext, SparkConf, SQLContext
    import numpy as np
    from scipy.spatial.distance import cosine
    from pyspark.sql.functions import lit, countDistinct, udf, array, struct
    import pyspark.sql.functions as F

    config = SparkConf("local")
    sc = SparkContext(conf=config)
    sqlContext = SQLContext(sc)

    @udf("float")
    def myfunction(x):
        y = np.array([1, 3, 9])
        x = np.array(x)
        return cosine(x, y)

    df = sqlContext.createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2"
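The listing is cut off above (doc_2's values are missing, so the ones below are placeholders). A hedged sketch of how an array column is typically handed to such a UDF: combine the numeric columns with array(), and cast the numpy result back to a plain Python float so the FloatType return type can be serialized:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array, col, udf
    import numpy as np
    from scipy.spatial.distance import cosine

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # doc_2's values are placeholders; the original listing is cut off
    df = spark.createDataFrame(
        [("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 0, 8, 6)],
        ["id", "x", "y", "z"])

    @udf("float")
    def my_cosine(v):
        y = np.array([1, 3, 9])
        # v arrives as a Python list; cast the numpy result back to a plain float
        return float(cosine(np.array(v), y))

    result = df.withColumn("distance", my_cosine(array(col("x"), col("y"), col("z"))))
    result.show()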
