I have 100,000 PDF documents and I want to create a Spark dataset from them and save it. The use case of my project is that when I receive a PDF as input, I have to check the cosine