> I'm creating a new DataFrame with a handful of records from a Join.
>
>     val joined_df = first_df.join(second_df, first_df.col("key") === second_df.col("key"))
As others have mentioned, the operations before `count` are "lazy": they only register a transformation, rather than actually forcing a computation.
When you call `count`, the computation is triggered. This is when Spark reads your data, performs all previously registered transformations, and calculates the result you requested (in this case, a count).
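As a rough sketch of where the work actually happens (the `SparkSession` setup and the Parquet paths here are assumptions for illustration, not from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lazy-evaluation-demo")
  .master("local[*]")  // assumption: run locally for illustration
  .getOrCreate()

// Hypothetical inputs; the paths are placeholders.
val first_df  = spark.read.parquet("/path/to/first")   // lazy: builds a plan, reads nothing
val second_df = spark.read.parquet("/path/to/second")  // lazy

// join is a transformation: it is only registered, not executed.
val joined_df = first_df.join(second_df, first_df.col("key") === second_df.col("key"))

// count is an action: only now does Spark read the data, apply the
// registered transformations, and compute the requested result.
val n = joined_df.count()
```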
> The RDD conversion kicks in and literally takes hours to complete
I think the term "conversion" is perhaps a bit inaccurate. What actually happens is that the DataFrame transformations you registered are translated into RDD operations, and these are applied to the RDD that underlies your DataFrame. There is no conversion per se in the code you have given here.
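If you want to see that translation for yourself, the query plan makes it visible; a minimal sketch, assuming the `joined_df` from the question:

```scala
// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
// Nothing is executed; this only shows how the registered DataFrame
// transformations will be run over the underlying RDDs.
joined_df.explain(true)
```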
As an aside, it is possible to explicitly convert a DataFrame to an RDD via the `DataFrame.rdd` property. As mentioned in this answer, this is generally a bad idea, since you lose some of the benefits (in both performance and API) of having well-structured data.
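For completeness, that explicit conversion looks like this (a sketch, again assuming the `joined_df` from the question):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Drops down from the DataFrame to its underlying RDD of Rows.
// From this point on you bypass the Catalyst optimizer and the column API,
// which is why it is usually better to stay in the DataFrame world.
val rowRdd: RDD[Row] = joined_df.rdd
```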