pyspark

Replace string in PySpark

Submitted by 老子叫甜甜 on 2020-02-24 07:06:26
Question: I have a dataframe with numbers in European format, which I imported as strings. Comma as decimal separator and vice versa.

from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.types import FloatType

df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
|  revenue|
+---------+
|-1.269,75|
+---------+

df.printSchema()
root
 |-- revenue: string (nullable = true)

Desired output:

df.show()
+---------+
|  revenue|
+---------+
| -1269.75|
+---------+

df
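No answer is included above; a minimal sketch of the usual approach (using the column and session names from the question) is to drop the thousands separator, swap the decimal comma for a dot with regexp_replace, and then cast the string to a float:

from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.types import FloatType

df = spark.createDataFrame([('-1.269,75',)], ['revenue'])

# Remove the thousands separator ('.'), then turn the decimal comma into a dot.
df = df.withColumn('revenue', regexp_replace(col('revenue'), '\\.', ''))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ',', '.'))

# Finally cast the cleaned string to a numeric type.
df = df.withColumn('revenue', col('revenue').cast(FloatType()))
df.show()   # | -1269.75|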

Pyspark: count on pyspark.sql.dataframe.DataFrame takes long time

Submitted by 你离开我真会死。 on 2020-02-24 05:50:29
Question: I have a pyspark.sql.dataframe.DataFrame like the following:

df.show()
+--------------------+----+----+---------+----------+---------+----------+---------+
|                  ID|Code|bool|      lat|       lon|       v1|        v2|       v3|
+--------------------+----+----+---------+----------+---------+----------+---------+
|5ac52674ffff34c98...|IDFA|   1|42.377167| -71.06994|17.422535|1525319638|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37747|-71.069824|17.683573|1525319639|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37757| -71.06942|22
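The question body and answer are truncated above. One general point that applies to slow counts: count() is an action, so Spark re-executes the whole upstream lineage each time it is called. If the DataFrame is reused, persisting it first usually makes repeated counts much cheaper. A minimal sketch (the df name follows the question):

# count() re-runs every upstream transformation; persist the result first
# if the same DataFrame is counted or otherwise reused several times.
df = df.cache()          # or df.persist() with an explicit StorageLevel
n = df.count()           # first call materializes the cache
n_again = df.count()     # later calls read the cached data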

MySQL cannot insert Chinese characters, or Chinese characters come out garbled when reading from or writing to MySQL with PySpark

Submitted by 落花浮王杯 on 2020-02-23 18:33:43
First, my MySQL version is as follows. I use PySpark to connect to the database and print the contents of a table:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """Query data from MySQL"""
    spark = SparkSession.builder.getOrCreate()
    url = "jdbc:mysql://192.168.1.105:3306/tylg?serverTimezone=Asia/Shanghai"
    user = "root"
    password = "123456"
    # If you use mysql-connector 8.0 the driver class needs the ".cj" part; for 5.x it does not
    driver = "com.mysql.cj.jdbc.Driver"
    # Create the JDBC connection and read all rows
    mysql_df = spark.read.format("jdbc").option("url", url).option("driver", driver)\
        .option("dbtable", "customerinfo").option("user", user).option("password", password).load()
    print(type(mysql_df))
    mysql_df.show()

In this case the output came out garbled. The row with 付航 is the one I have already fixed, while the uname in the row below it is still garbled.
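The post does not show the author's final fix. A common cause of this kind of garbling is the connection character set rather than PySpark itself; a sketch of the usual remedy (assuming the table and columns are already stored as utf8/utf8mb4 in MySQL) is to declare the encoding on the JDBC URL:

# Tell the MySQL JDBC driver to use UTF-8 on the connection;
# the rest of the read is unchanged.
url = ("jdbc:mysql://192.168.1.105:3306/tylg"
       "?serverTimezone=Asia/Shanghai"
       "&useUnicode=true&characterEncoding=utf-8")

mysql_df = (spark.read.format("jdbc")
            .option("url", url)
            .option("driver", "com.mysql.cj.jdbc.Driver")
            .option("dbtable", "customerinfo")
            .option("user", "root")
            .option("password", "123456")
            .load())
mysql_df.show()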

When submitting a job with pyspark, how to access static files uploaded with the --files argument?

Submitted by 纵然是瞬间 on 2020-02-23 08:34:28
Question: For example, I have a folder:

/
 - test.py
 - test.yml

and the job is submitted to the Spark cluster with:

gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py"

In test.py I want to access the static file I uploaded:

with open('test.yml') as test_file:
    logging.info(test_file.read())

but I get the following exception:

IOError: [Errno 2] No such file or directory: 'test.yml'

How can I access the file I uploaded?

Answer 1: Files distributed using SparkContext.addFile (and --files ) can be
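The answer is cut off above. The usual way to resolve files shipped with --files (or SparkContext.addFile) is through SparkFiles, which returns the absolute local path of the staged copy; roughly:

import logging
from pyspark import SparkFiles

# Files distributed with --files land in a per-job staging directory;
# SparkFiles.get resolves the absolute local path to the staged copy.
with open(SparkFiles.get('test.yml')) as test_file:
    logging.info(test_file.read())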

Common PySpark commands

Submitted by 孤街浪徒 on 2020-02-23 03:38:33
1. Read files

# define the schema
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, StringType, IntegerType

schema = StructType([
    StructField('x1', StringType()),
    StructField('x2', DoubleType())
])

# read csv
sel_col = ['x1']
xs = spark.read.schema(schema)\
    .option('header', 'false')\
    .csv(path.format(s3_bucket), sep='\\t')\
    .select(*sel_col)

2. Add columns

from pyspark.sql.window import Window as W
from pyspark.sql import functions as F

# add columns
df = df.withColumn('new_col', F.monotonically_increasing_id())\
    .withColumn('row_number', F.row_number()
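The snippet is cut off mid-call. row_number() is a window function and must be applied over a Window specification, so a complete version of that pattern (with placeholder partition and ordering columns) would look roughly like:

from pyspark.sql.window import Window as W
from pyspark.sql import functions as F

# 'group_col' and 'order_col' are hypothetical column names for illustration.
w = W.partitionBy('group_col').orderBy('order_col')

df = df.withColumn('new_col', F.monotonically_increasing_id())\
       .withColumn('row_number', F.row_number().over(w))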

TypeError: 'Column' object is not callable using WithColumn

Submitted by 落爺英雄遲暮 on 2020-02-21 11:22:54
Question: I would like to append a new column to dataframe "df" using the function get_distance:

def get_distance(x, y):
    dfDistPerc = hiveContext.sql("select column3 as column3, \
                                  from tab \
                                  where column1 = '" + x + "' \
                                  and column2 = " + y + " \
                                  limit 1")
    result = dfDistPerc.select("column3").take(1)
    return result

df = df.withColumn(
    "distance",
    lit(get_distance(df["column1"], df["column2"]))
)

But I get this:

TypeError: 'Column' object is not callable

I think it happens because x and y are Column objects
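No answer is shown above. The underlying problem is that Column objects are being spliced into a SQL string, and a function that issues its own SQL cannot run per row inside withColumn. A common reworking, sketched here under the assumption that tab is accessible as a table with the same columns as in the question, is to express the lookup as a join instead:

from pyspark.sql import functions as F

# Load the lookup table as a DataFrame (assumed to exist in the metastore).
tab_df = hiveContext.table("tab").select("column1", "column2", "column3")

# Join instead of per-row SQL: each (column1, column2) pair in df is matched
# against tab, and column3 becomes the new 'distance' column.
df = (df.join(tab_df, on=["column1", "column2"], how="left")
        .withColumnRenamed("column3", "distance"))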

How to use XGboost in PySpark Pipeline

Submitted by 匆匆过客 on 2020-02-19 14:26:30
Question: I want to update my PySpark code. In PySpark the base model has to be put into a pipeline; the official pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the Pipeline API. How can I do something like this in PySpark?

from xgboost import XGBClassifier
...
model = XGBClassifier()
model.fit(X_train, y_train)
pipeline = Pipeline(stages=[..., model, ...])
...

It is convenient to use the Pipeline API, so can anybody give some advice?
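No answer is included above. One relevant point: pyspark.ml.Pipeline stages must be Spark Estimators or Transformers, so a plain xgboost.XGBClassifier cannot be dropped in directly; you need either a Spark-aware XGBoost wrapper (such as xgboost4j-spark) or one of Spark's built-in gradient-boosted-tree estimators. A minimal sketch of the latter, swapped-in alternative (column names are placeholders):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier

# Assemble raw feature columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")

# Spark's own gradient-boosted trees: a valid Pipeline stage,
# unlike a plain xgboost.XGBClassifier.
gbt = GBTClassifier(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, gbt])
model = pipeline.fit(train_df)        # train_df: Spark DataFrame with f1, f2, f3, label
predictions = model.transform(test_df)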

Creating a Pyspark Schema involving an ArrayType

Submitted by 两盒软妹~` on 2020-02-19 10:33:12
Question: I'm trying to create a schema for my new DataFrame and have tried various combinations of brackets and keywords, but have been unable to figure out how to make this work. My current attempt:

from pyspark.sql.types import *

schema = StructType([
    StructField("User", IntegerType()),
    ArrayType(StructType([
        StructField("user", StringType()),
        StructField("product", StringType()),
        StructField("rating", DoubleType())]))
])

comes back with the error:

elementType should be DataType
Traceback (most
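The traceback is cut off above. The immediate problem is that every top-level element of a StructType must be a StructField, so the ArrayType has to be wrapped in a named field rather than sitting next to one. A sketch of a schema that does this (the field name "Ratings" is made up for illustration):

from pyspark.sql.types import (StructType, StructField, ArrayType,
                               IntegerType, StringType, DoubleType)

# Each top-level element is a StructField; the ArrayType is the *type*
# of a named field, not a sibling of StructField.
schema = StructType([
    StructField("User", IntegerType()),
    StructField("Ratings", ArrayType(StructType([
        StructField("user", StringType()),
        StructField("product", StringType()),
        StructField("rating", DoubleType())
    ])))
])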