I am trying to make a histogram with a column from a dataframe which looks like
DataFrame[C0: int, C1: int, ...]
If I were to make a histog
You can use histogram_numeric
Hive UDAF:
import random
random.seed(323)
sqlContext = HiveContext(sc)
n = 3 # Number of buckets
df = sqlContext.createDataFrame(
sc.parallelize(enumerate(random.random() for _ in range(1000))),
["id", "v"]
)
hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))
hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3) |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+
You can also extract the column of interest and use histogram
method on RDD
:
df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
## 0.33410233677189705,
## 0.6661765640094703,
## 0.9982507912470436],
## [327, 326, 347])
What worked for me is
df.groupBy("C1").count().rdd.values().histogram()
I have to convert to RDD because I found histogram
method in pyspark.RDD class, but not in spark.SQL module
Let's say your values in C1 are between 1-1000 and you want to get a histogram of 10 bins. You can do something like: df.withColumn("bins", df.C1/100).groupBy("bins").count() If your binning is more complex you can make a UDF for it (and at worse, you might need to analyze the column first, e.g. by using describe or through some other method).
One easy way could be
import pandas as pd
x = df.select('symboling').toPandas() # symboling is the column for histogram
x.plot(kind='hist')
If you want a to plot the Histogram, you could use the pyspark_dist_explore package:
fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))
If you would like the data in a pandas DataFrame you could use:
pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))
The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency you can use this bit of code to plot a simple histogram.
import matplotlib.pyplot as plt
# Show histogram of the 'C1' column
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)
# This is a bit awkward but I believe this is the correct way to do it
plt.hist(bins[:-1], bins=bins, weights=counts)