Efficiently batching Spark dataframes to call an API


Question


I am fairly new to Spark and I'm trying to call the Spotify API using Spotipy. I have a list of artist ids which can be used to fetch artist info. The Spotify API allows batch calls of up to 50 ids at a time. I load the artist ids from a MySQL database and store them in a dataframe.

My problem now is that I do not know how to efficiently batch that dataframe into pieces of 50 or less rows.

In the example below I'm turning the dataframe into a regular Python list, from which I can call the API in batches of 50.

Any ideas how I could do this without going back to a Python list?

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from pyspark.sql import SparkSession
import os

spark = SparkSession\
        .builder\
        .appName("GetArtists")\
        .getOrCreate()

df = spark.read.format('jdbc') \
    .option("url", "jdbc:mysql://"+os.getenv("DB_SERVER")+":"+os.getenv("DB_PORT")+"/spotify_metadata")\
    .option("user", os.getenv("DB_USER"))\
    .option("password", os.getenv("DB_PW"))\
    .option("query", "SELECT artist_id FROM artists")\
    .load()

sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

ids = [row['artist_id'] for row in df.collect()]

batch_size = 50
for i in range(0, len(ids), batch_size):
    artists = sp.artists(ids[i:i+batch_size])

    # process the JSON response

I thought about using foreach and calling the API for each id, but that results in unnecessary requests. Also, the results are stored back in the database, which means I would be writing many single rows to the database.
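Would something like the following work instead, batching ids inside each partition with mapPartitions? This is only a rough sketch: it assumes spotipy is installed on the workers, that SpotifyClientCredentials can pick up the client id/secret from environment variables on the executors, and the fields kept at the end are just examples.

from itertools import islice

def fetch_artist_batches(rows):
    # One Spotify client per partition instead of one request per id
    # (spotipy / SpotifyClientCredentials are imported above).
    sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())
    rows = iter(rows)
    while True:
        # Take up to 50 artist ids at a time from this partition.
        batch = [row['artist_id'] for row in islice(rows, 50)]
        if not batch:
            break
        for artist in sp.artists(batch)['artists']:
            # Example fields; keep whatever parts of the JSON response you need.
            yield (artist['id'], artist['name'], artist['popularity'])

artists_df = df.rdd.mapPartitions(fetch_artist_batches) \
    .toDF(['artist_id', 'name', 'popularity'])

That would also let me write artists_df back to the database in one go instead of many single-row writes.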


Answer 1:


If you want to divide the dataframe based on the row number, you can do it like this:

from pyspark.sql import functions as f
from pyspark.sql import Window

df = df.withColumn('row_num', f.row_number().over(Window.orderBy(f.lit(1))))
total = df.count()

for i in range(0, total, 50):
    # row_number() starts at 1, so each batch covers rows i+1 .. i+50;
    # the comparisons need parentheses because & binds tighter than >= / <=.
    batch_df = df.filter((f.col('row_num') > i) & (f.col('row_num') <= i + 50))
    # api logic goes here

But if you can pass the dataframe to the API directly, then pass the df itself; otherwise collect each filtered batch, which will hold at most 50 values at a time.
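For the second option, a minimal sketch of collecting each filtered batch and calling the API on it, reusing the sp client from the question and the row_num column from above:

for i in range(0, total, 50):
    batch_df = df.filter((f.col('row_num') > i) & (f.col('row_num') <= i + 50))
    # Each collect returns at most 50 ids, which fits one sp.artists() call.
    batch_ids = [row['artist_id'] for row in batch_df.collect()]
    artists = sp.artists(batch_ids)
    # process the JSON response for this batch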



Source: https://stackoverflow.com/questions/62107654/efficiently-batching-spark-dataframes-to-call-an-api
