pyspark

Casting a column to JSON/dict and flattening JSON values in a column in pyspark

99封情书 submitted on 2020-01-14 05:29:06
Question: I am new to Pyspark and I am figuring out how to cast a column to dict type and then flatten that column into multiple columns using explode. Here's what my dataframe looks like:
col1   | col2
-----------------------
test:1 | {"test1":[{"Id":"17","cName":"c1"},{"Id":"01","cName":"c2","pScore":0.003609}], "test8":[{"Id":"1","cName":"c11","pScore":0.0},{"Id":"012","cName":"c2","pScore":0.003609}]}
test:2 | {"test1:subtest2":[{"Id":"18","cName":"c13","pScore":0.00203}]}
Right now, the schema
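A minimal sketch of one way to approach this, assuming a reasonably recent Spark (`from_json` with a `MapType` schema) and that col2 holds a JSON string mapping a test name to an array of structs; the schema and the single sample row below are assumptions based on the excerpt, not the asker's code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import MapType, ArrayType, StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("flatten-json-col").getOrCreate()

df = spark.createDataFrame(
    [("test:1", '{"test1":[{"Id":"17","cName":"c1"},{"Id":"01","cName":"c2","pScore":0.003609}]}')],
    ["col1", "col2"],
)

# col2 is assumed to be a JSON string: a map from test name to an array of structs.
inner = ArrayType(StructType([
    StructField("Id", StringType()),
    StructField("cName", StringType()),
    StructField("pScore", DoubleType()),
]))
schema = MapType(StringType(), inner)

parsed = df.withColumn("col2", from_json(col("col2"), schema))

# Explode the map into (test, records) rows, then explode the array of structs
# and pull the struct fields out as top-level columns.
flat = (parsed
        .select("col1", explode("col2").alias("test", "records"))
        .select("col1", "test", explode("records").alias("r"))
        .select("col1", "test", col("r.Id"), col("r.cName"), col("r.pScore")))

flat.show(truncate=False)
```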

Check whether dataframe contains any null values

瘦欲@ submitted on 2020-01-14 03:32:28
Question: I have a dataframe and need to see if it contains null values. There are plenty of posts on the same topic, but nearly all of them use the count action or the show method. count operations are prohibitively expensive in my case, as the data volume is large; the same goes for the show method. Is there a way I can ask Spark to look for null values and raise an error as soon as it encounters the first one? The solutions in other posts give the count of missing values in each column. I don't
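The excerpt is cut off, but one fail-fast sketch (not from the original thread) is to build a single "any column is null" predicate and fetch at most one matching row; take(1) scans partitions incrementally, so it usually stops well short of a full pass over the data:

```python
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("null-check").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, None), (3, "c")], ["id", "val"])  # stand-in data

# One predicate that is true if ANY column in the row is null.
any_null = reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns])

# take(1) runs incrementally (a few partitions at a time), so it can return
# as soon as one offending row is found instead of counting everything.
offending = df.filter(any_null).take(1)
if offending:
    raise ValueError("DataFrame contains null values, e.g. %s" % offending[0])
```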

spark - Converting dataframe to list improving performance

为君一笑 submitted on 2020-01-14 03:03:41
Question: I need to convert a column of a Spark dataframe to a list to use later with matplotlib: df.toPandas()[col_name].values.tolist() It looks like there is a high performance overhead; this operation takes around 18 seconds. Is there another way to do this, or a way to improve the performance?
Answer 1: If you really need a local list there is not much you can do here, but one improvement is to collect only a single column, not the whole DataFrame: df.select(col_name).flatMap(lambda x: x).collect()
Answer 2: You can do it this way: >>
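As a side note, DataFrame.flatMap was dropped from the Python API after the 1.x line, so on current PySpark the single-column collect from answer 1 goes through the underlying RDD; a small sketch with stand-in data and column name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("col-to-list").getOrCreate()
df = spark.range(5).withColumnRenamed("id", "col_name")  # stand-in data

# Collect just the one column; on Spark 2.x+ drop down to .rdd, or simply
# unpack the Row objects in plain Python after collect().
values = df.select("col_name").rdd.flatMap(lambda row: row).collect()
# equivalent: values = [row[0] for row in df.select("col_name").collect()]
print(values)
```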

Write spark dataframe to single parquet file

穿精又带淫゛_ submitted on 2020-01-14 01:56:06
Question: I am trying to do something very simple and I'm having some very stupid struggles. I think it must have to do with a fundamental misunderstanding of what Spark is doing. I would greatly appreciate any help or explanation. I have a very large (~3 TB, ~300MM rows, 25k partitions) table, saved as parquet in S3, and I would like to give someone a tiny sample of it as a single parquet file. Unfortunately, this is taking forever to finish and I don't understand why. I have tried the following: tiny
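The attempts in the post are cut off, but a sketch of the usual approach is below: take the small sample first, and only then collapse to one partition, so the heavy part of the job stays parallel. Paths and the sample size are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiny-sample").getOrCreate()

# Hypothetical paths; the source is the large parquet table in S3.
big = spark.read.parquet("s3://bucket/path/to/big_table")

# limit() first so only a small amount of data survives, then repartition(1)
# so the write produces a single part file.
tiny = big.limit(1000).repartition(1)
tiny.write.mode("overwrite").parquet("s3://bucket/path/to/tiny_sample")
```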

Missing SLF4J logger on spark workers

不打扰是莪最后的温柔 submitted on 2020-01-14 01:37:22
Question: I am trying to run a job via spark-submit. The error that results from this job is:
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2625)
at java.lang.Class.getMethod0(Class.java:2866)
at java.lang.Class.getMethod(Class.java:1676)
at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
at sun.launcher.LauncherHelper.checkAndLoadMain

Advanced pyspark DataFrame operations

假如想象 submitted on 2020-01-14 01:04:27
Table of contents
1. Some DataFrame operations
 1.1 Adding a column
 1.2 udf
 1.3 Multi-row aggregation
 1.4 Single-row aggregation
 1.5 From Row structures to a DataFrame
 1.6 Cross-tabulation (`crosstab`)
 1.7 Dropping duplicate rows (`dropDuplicates`)
 1.8 groupby combinations (`rollup` & `GROUPING_ID`)
2. Simple exploration of numeric data
 2.1 summary
 2.2 Fast approximate percentiles (`approxQuantile`)
3. Exporting and importing by partition

1. Some DataFrame operations
These operations all rely on functions loaded from pyspark.sql.functions. Imports and sample data:
import sys, os
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, exp, rand, expr
from pyspark.sql.functions import udf, sum, grouping_id
from pyspark.sql.types import ArrayType, StringType, DoubleType, StructType, StructField, LongType
from pyspark.sql import Row
import pandas as pd
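To give the outline above something concrete, here is a small sketch exercising a few of the listed operations on made-up data; the column names and values are illustrative, not from the original post:

```python
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("df-ops-demo").getOrCreate()
df = spark.createDataFrame(
    [Row(name="a", x=1, y=2.0), Row(name="b", x=2, y=3.5), Row(name="a", x=1, y=2.0)]
)

df = df.withColumn("flag", lit(1))                      # 1.1 add a column
upper_udf = udf(lambda s: s.upper(), StringType())      # 1.2 a simple udf
df = df.withColumn("name_upper", upper_udf("name"))

df.crosstab("name", "x").show()                         # 1.6 cross-tabulation
df.dropDuplicates().show()                              # 1.7 drop duplicate rows
print(df.approxQuantile("y", [0.25, 0.5, 0.75], 0.01))  # 2.2 approximate percentiles
```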

Read in CSV in Pyspark with correct Datatypes

£可爱£侵袭症+ submitted on 2020-01-13 10:59:26
Question: When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only include integers and a timestamp type. To be more specific, the CSV looks like this:
"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000
I have found code that should work in this question, but when I execute it all the entries are returned as NULL. I use the following to create a custom schema:
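A sketch of a custom schema for a file with this header; one frequent cause of the all-NULL result is the dd.MM.yyyy dates, which the CSV reader will not parse without an explicit dateFormat. The type choices and the file path below are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, DateType, LongType

spark = SparkSession.builder.appName("csv-schema").getOrCreate()

schema = StructType([
    StructField("Customer",    IntegerType(), True),
    StructField("TransDate",   DateType(),    True),
    StructField("Quantity",    IntegerType(), True),
    StructField("PurchAmount", DoubleType(),  True),
    StructField("Cost",        DoubleType(),  True),
    StructField("TransID",     LongType(),    True),
    StructField("TransKey",    LongType(),    True),
])

# Values like 15.11.2005 need an explicit date format, otherwise parsing fails
# and (in permissive mode) the whole row comes back as NULL.
df = (spark.read
      .option("header", "true")
      .option("dateFormat", "dd.MM.yyyy")
      .schema(schema)
      .csv("transactions.csv"))  # hypothetical path

df.printSchema()
```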

Pyspark StructType is not defined

两盒软妹~` submitted on 2020-01-13 07:31:23
Question: I'm trying to construct a schema for db testing, and StructType apparently isn't working for some reason. I'm following a tutorial, and it doesn't import any extra module. <type 'exceptions.NameError'>, NameError("name 'StructType' is not defined",), <traceback object at 0x2b555f0>) I'm on Spark 1.4.0 and Ubuntu 12, if that has anything to do with the problem. How would I fix this? Thank you in advance.
Answer 1: Did you import StructType? If not, from pyspark.sql.types import StructType should
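To make the answer concrete, a minimal sketch of the import plus a schema definition; the field names are just placeholders:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id",   IntegerType(), True),
    StructField("name", StringType(),  True),
])
# On Spark 1.4 the entry point is SQLContext rather than SparkSession, so the
# schema would then be applied via sqlContext.createDataFrame(rdd, schema).
```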

Converting pandas Dataframe with Numpy values to pysparkSQL.DataFrame

岁酱吖の submitted on 2020-01-13 07:19:06
Question: I created a two-column pandas df with the random.int method, then generated a second two-column dataframe by applying groupby operations. df.col1 is a series of lists and df.col2 a series of integers; the elements inside the lists are of type 'numpy.int64', and the same holds for the elements of the second column, as a result of random.int.
df.a df.b
3    7
5    2
1    8
...
groupby operations
df.col1      df.col2
[1,2,3...]   1
[2,5,6...]   2
[6,4,....]   3
...
When I try to create the pyspark.sql dataframe with spark.createDataFrame(df), I get
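The error text is cut off, but with frames like this the usual stumbling block is Spark's type inference rejecting numpy.int64 values. A sketch of one workaround on made-up data: convert to plain Python types and pass an explicit schema (column names follow the excerpt):

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, LongType

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# Made-up data mimicking the question: a column of numpy int arrays
# and a column of numpy ints.
pdf = pd.DataFrame({
    "col1": [np.array([1, 2, 3]), np.array([2, 5, 6]), np.array([6, 4, 9])],
    "col2": np.array([1, 2, 3]),
})

# Convert numpy scalars/arrays to plain Python types before handing the rows
# to Spark, and supply an explicit schema so nothing has to be inferred.
rows = [([int(v) for v in a], int(b)) for a, b in zip(pdf["col1"], pdf["col2"])]
schema = StructType([
    StructField("col1", ArrayType(LongType()), True),
    StructField("col2", LongType(), True),
])

sdf = spark.createDataFrame(rows, schema)
sdf.printSchema()
```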