Making histogram with Spark DataFrame column

盖世英雄少女心 2020-12-16 03:18

I am trying to make a histogram with a column from a dataframe which looks like

DataFrame[C0: int, C1: int, ...]

If I were to make a histogram with one of these columns, say C1, how should I do it?

6 Answers
  • 2020-12-16 03:25

    You can use histogram_numeric Hive UDAF:

    import random
    
    from pyspark.sql import HiveContext
    
    random.seed(323)
    
    sqlContext = HiveContext(sc)  # assumes an existing SparkContext `sc`
    n = 3  # number of buckets
    df = sqlContext.createDataFrame(
        sc.parallelize(enumerate(random.random() for _ in range(1000))),
        ["id", "v"]
    )
    
    hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))
    
    hists.show(1, False)
    ## +------------------------------------------------------------------------------------+
    ## |histogram_numeric(v,3)                                                              |
    ## +------------------------------------------------------------------------------------+
    ## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
    ## +------------------------------------------------------------------------------------+
    

    You can also extract the column of interest and use the histogram method on the underlying RDD:

    df.select("v").rdd.flatMap(lambda x: x).histogram(n)
    ## ([0.002028109534323752,
    ##  0.33410233677189705,
    ##  0.6661765640094703,
    ##  0.9982507912470436],
    ## [327, 326, 347])
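    For intuition, the equal-width bucketing that the RDD `histogram(n)` form performs can be sketched in plain Python (an illustrative re-implementation, not Spark's actual code):

```python
# Sketch of equal-width bucketing as done by RDD.histogram(n):
# n buckets of equal width between the min and max of the data.
def equal_width_histogram(values, n):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    edges = [lo + i * width for i in range(n)] + [hi]
    counts = [0] * n
    for v in values:
        # Clamp so the maximum value falls into the last bucket.
        i = min(int((v - lo) / width), n - 1)
        counts[i] += 1
    return edges, counts

edges, counts = equal_width_histogram([0.1, 0.2, 0.4, 0.5, 0.8, 0.9], 3)
# edges has n + 1 entries and counts has n, matching the
# (buckets, counts) pair returned by RDD.histogram.
```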
    
  • 2020-12-16 03:30

    What worked for me is

    df.groupBy("C1").count().rdd.values().histogram(10)  # histogram requires a bucket count, e.g. 10
    

    I have to convert to an RDD because the histogram method is defined on pyspark.RDD, not in the Spark SQL module.

  • 2020-12-16 03:30

    Let's say your values in C1 are between 1 and 1000 and you want a histogram with 10 bins. You can do something like: df.withColumn("bins", floor(df.C1 / 100)).groupBy("bins").count() (using pyspark.sql.functions.floor so each bin label is an integer). If your binning is more complex you can write a UDF for it, and at worst you may need to analyze the column first, e.g. with describe or some other method.
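    The integer-division binning described above can be illustrated without Spark (a toy sketch; the sample values are made up):

```python
from collections import Counter

# Toy illustration of fixed-width binning by integer division:
# values in 1-1000 mapped to 10 bins of width 100 (bin labels 0-9).
values = [5, 150, 155, 420, 999, 1000]
bins = Counter(min(v // 100, 9) for v in values)  # clamp 1000 into the last bin
```

    On the Spark side, groupBy("bins").count() then counts the rows per bin label.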

  • 2020-12-16 03:31

    One easy way could be

    import pandas as pd
    x = df.select('symboling').toPandas()  # symboling is the column for histogram
    x.plot(kind='hist')
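    Note that toPandas() collects the whole column to the driver, so this only works when the column fits in memory; for large data, aggregate in Spark first. The pandas side can be sketched with a small stand-in sample (the values below are made up):

```python
import pandas as pd

# Hypothetical small sample standing in for df.select('symboling').toPandas()
x = pd.DataFrame({"symboling": [0, 1, 1, 2, 3, 3, 3]})

# The same 4-bin split that x.plot(kind='hist', bins=4) would draw,
# computed explicitly with pd.cut:
counts = pd.cut(x["symboling"], bins=4).value_counts().sort_index()
```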
    
  • 2020-12-16 03:37

    If you want to plot the histogram, you could use the pyspark_dist_explore package:

    import matplotlib.pyplot as plt
    from pyspark_dist_explore import hist
    
    fig, ax = plt.subplots()
    hist(ax, df.groupBy("C1").count().select("count"))
    

    If you would like the data in a pandas DataFrame you could use:

    from pyspark_dist_explore import pandas_histogram
    
    pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))
    
  • 2020-12-16 03:49

    The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency you can use this bit of code to plot a simple histogram.

    import matplotlib.pyplot as plt
    # Show histogram of the 'C1' column
    bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)
    
    # This is a bit awkward but I believe this is the correct way to do it 
    plt.hist(bins[:-1], bins=bins, weights=counts)
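    The weights trick can be checked without Spark or matplotlib: re-binning one sample per left bin edge, weighted by its precomputed count, gives the counts back unchanged (the numbers below are made up):

```python
import numpy as np

# Pre-computed histogram, as returned by RDD.histogram (toy values):
bins = [0.0, 1.0, 2.0, 3.0]
counts = [5, 2, 7]

# One sample at each left edge, weighted by its count, re-bins to the
# same counts -- which is exactly what plt.hist(...) draws above.
hist, edges = np.histogram(bins[:-1], bins=bins, weights=counts)
```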
    