How to slice a pyspark dataframe in two row-wise

走了就别回头了 2020-12-09 18:58

I am working in Databricks.

I have a dataframe which contains 500 rows, and I would like to create two dataframes: one containing 100 rows and the other containing the remaining 400 rows.

5 Answers
  • 2020-12-09 19:27

    Providing a much less complicated solution here, closer to what was asked:

    (Works in Spark 2.4+)

    # Starting
    print('Starting row count:',df.count())
    print('Starting column count:',len(df.columns))
    
    # Slice rows
    df2 = df.limit(3)
    print('Sliced row count:',df2.count())
    
    # Slice columns
    cols_list = df.columns[0:1]
    df3 = df.select(cols_list)
    print('Sliced column count:',len(df3.columns))
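
    Applied to the original question, a minimal sketch (assuming df is the 500-row dataframe, Spark 2.4+ for exceptAll, and the names df_first / df_rest are just illustrative):

    # First 100 rows
    df_first = df.limit(100)
    # Remaining rows: everything in df that is not in df_first
    # (exceptAll keeps duplicate rows, unlike subtract)
    df_rest = df.exceptAll(df_first)
    print(df_first.count(), df_rest.count())  # expected: 100 400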
    
  • 2020-12-09 19:30

    Spark dataframes cannot be indexed the way you describe. You can use the head method to take the top n rows. This returns a list of Row() objects, not a dataframe, so you can convert them back into a dataframe and then use subtract on the original dataframe to take the rest of the rows.

    # Take the top 100 rows and convert them back to a dataframe.
    # Pass the schema explicitly to avoid inference errors.
    df1 = sqlContext.createDataFrame(df.head(100), df.schema)
    
    #Take the rest of the rows
    df2 = df.subtract(df1)
    

    You can also use SparkSession instead of sqlContext if you are on Spark 2.0+. And if you are not specifically interested in the first 100 rows but just want a random split, you can use randomSplit like this:

    df1, df2 = df.randomSplit([0.20, 0.80], seed=1234)
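
    Worth noting: randomSplit normalizes the weights, so [0.20, 0.80] gives roughly a 1:4 split, and the seed only makes it reproducible; the resulting sizes will not be exactly 100 and 400. A quick check:

    # Sanity check -- the split sizes are approximate, not exactly 100 and 400
    print(df1.count(), df2.count())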
    
  • 2020-12-09 19:30

    If you don't mind the possibility of the same rows appearing in both dataframes, you can use sample. For example, with a dataframe of 354 rows:

    >>> df.count()
    354
    
    >>> df.sample(False, 0.5, 0).count()  # approx. 50%
    179

    >>> df.sample(False, 0.1, 0).count()  # approx. 10%
    34
    

    Alternatively, if you want a strict split with no row appearing in both dataframes, you could do:

    df1 = df.limit(100)     # first 100 rows
    df2 = df.subtract(df1)  # remaining rows
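
    One caveat: without an explicit ordering, limit(100) does not guarantee which 100 rows you get, so repeated runs may slice differently. If that matters, a minimal sketch that pins the order first (the 'id' column is only an assumed sortable key, not something from the question):

    ordered = df.orderBy('id')   # 'id' is illustrative; use any stable sort key you have
    df1 = ordered.limit(100)     # first 100 rows under that ordering
    df2 = ordered.subtract(df1)  # remaining rows (note: subtract also drops duplicate rows)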
    
  • 2020-12-09 19:35

    Initially I misunderstood and thought you wanted to slice the columns. If you want to select a subset of rows, one method is to create an index column using monotonically_increasing_id(). From the docs:

    The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.

    You can use this ID to sort the dataframe and subset it using limit() to ensure you get exactly the rows you want.

    For example:

    import pyspark.sql.functions as f
    import string
    
    # create a dummy df with 500 rows and 2 columns
    N = 500
    numbers = [i%26 for i in range(N)]
    letters = [string.ascii_uppercase[n] for n in numbers]
    
    df = sqlCtx.createDataFrame(
        zip(numbers, letters),
        ('numbers', 'letters')
    )
    
    # add an index column
    df = df.withColumn('index', f.monotonically_increasing_id())
    
    # sort ascending and take first 100 rows for df1
    df1 = df.sort('index').limit(100)
    
    # sort descending and take 400 rows for df2
    df2 = df.sort('index', ascending=False).limit(400)
    

    Just to verify that this did what you wanted:

    df1.count()
    #100
    df2.count()
    #400
    

    We can also verify that the index values don't overlap:

    df1.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
    #+---+---+
    #|min|max|
    #+---+---+
    #|  0| 99|
    #+---+---+
    
    df2.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
    #+---+----------+
    #|min|       max|
    #+---+----------+
    #|100|8589934841|
    #+---+----------+
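
    Because the generated IDs are not consecutive, another option for an exact 100/400 cut is to convert them into consecutive positions with row_number(). A sketch reusing the index column created above (the unpartitioned window funnels all rows through one partition, which is fine at 500 rows but won't scale to large data):

    from pyspark.sql.window import Window

    # turn the sparse index into consecutive positions 1..N
    w = Window.orderBy('index')
    df_pos = df.withColumn('pos', f.row_number().over(w))

    df1 = df_pos.filter('pos <= 100').drop('pos').drop('index')
    df2 = df_pos.filter('pos > 100').drop('pos').drop('index')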
    
  • 2020-12-09 19:38

    Try it this way:

    df1_list = df.collect()[:100]                     # collect() returns a list of Row objects; take the first 100
    df1 = spark.createDataFrame(df1_list, df.schema)  # convert the list back into a Spark dataframe
    

    and similarly for the remaining rows:

    df2_list = df.collect()[100:]                     # the remaining 400 rows
    df2 = spark.createDataFrame(df2_list, df.schema)
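
    Keep in mind that collect() pulls every row onto the driver, so this is only practical for small dataframes. A sketch of an alternative that assigns exact, consecutive positions without collecting (assuming spark is the active SparkSession; the 'idx' column name is just illustrative):

    from pyspark.sql.types import LongType

    # zipWithIndex pairs each Row with its 0-based position without moving data to the driver
    indexed = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],))
    df_idx = spark.createDataFrame(indexed, df.schema.add('idx', LongType()))

    df1 = df_idx.filter('idx < 100').drop('idx')
    df2 = df_idx.filter('idx >= 100').drop('idx')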
    