Pyspark: show histogram of a data frame column

情书的邮戳 2020-12-14 01:04

With a pandas DataFrame, I use the following code to plot a histogram of a column:

my_df.hist(column='field_1')

Is there something that

5 Answers
  •  难免孤独
    2020-12-14 01:42

    The histogram method for RDDs returns the bin ranges and the bin counts. Here's a function that takes this histogram data and plots it as a histogram.

    import numpy as np
    import matplotlib.pyplot as mplt
    import matplotlib.ticker as mtick

    def plotHistogramData(data):
        # data is the (bin edges, bin counts) pair returned by RDD.histogram
        binSides, binCounts = data

        N = len(binCounts)
        ind = np.arange(N)
        width = 1

        fig, ax = mplt.subplots()
        # Bar i spans [i, i+1], so tick i marks the left edge of bin i
        rects1 = ax.bar(ind, binCounts, width, align='edge', color='b')

        ax.set_ylabel('Frequencies')
        ax.set_title('Histogram')
        ax.set_xticks(np.arange(N+1))
        # Format the edge labels directly; a major formatter on the x-axis
        # would replace them with the tick positions 0..N
        ax.set_xticklabels(['%.2e' % side for side in binSides])
        ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))

        mplt.show()
    

    (This code assumes that bins have equal length.)
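
    If the bins are not all the same width (for example, when an explicit list of bucket edges is passed to histogram), one option is to draw each bar at its real position on the x-axis. The following is only a sketch of that idea: bar_geometry and plot_histogram_edges are hypothetical helper names, and the usage comment assumes my_df is a Spark DataFrame with a numeric, non-null column field_1.

```python
def bar_geometry(bin_sides):
    """Return (left_edges, widths) for the bars described by bin_sides."""
    if len(bin_sides) < 2:
        raise ValueError("need at least two bin edges")
    lefts = list(bin_sides[:-1])
    widths = [hi - lo for lo, hi in zip(bin_sides[:-1], bin_sides[1:])]
    return lefts, widths


def plot_histogram_edges(data):
    """Plot (bin_sides, bin_counts) on a real x-axis; bins may differ in width."""
    # Imported here so bar_geometry can be used without matplotlib installed
    import matplotlib.pyplot as plt

    bin_sides, bin_counts = data
    lefts, widths = bar_geometry(bin_sides)

    fig, ax = plt.subplots()
    # align='edge' anchors each bar at its left bin edge
    ax.bar(lefts, bin_counts, width=widths, align='edge',
           color='b', edgecolor='k')
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequencies')
    ax.set_title('Histogram')
    plt.show()


# Possible usage (assumes my_df and field_1 as in the question):
# data = my_df.select('field_1').rdd.flatMap(lambda row: row).histogram(10)
# plot_histogram_edges(data)
```

    Because the bars sit at their actual edge values, matplotlib's default tick locator labels the axis and no manual tick labels are needed.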
