How to get values from RDD dynamically with Python?

Question

Below is a sample record for a book in our campus system. Each book record is a text file. I have loaded the records with:

books = sc.wholeTextFiles("file:///data/dir/*/*/*/")

This gives me an RDD. One record in the RDD looks like this:

[['Call No: 56CB',
  'Title:  Global Warming',
  'Type: Serial,',
  'Database:  AWS898,',
  'Microfilm:  Y,',
  'Access:  Public ,',
]]

I am trying to extract the values at positions 4 through N of each record. Positions 0 through 4 are always present, but a record may be missing position 5 and beyond, like this:

[['Call No: 56CB',
  'Title:  Science 101',
  'Type: Serial,',
  'Database:  AWS898,',
  'Microfilm:  Y,',
]]

So the code has to handle records of variable length. I have the following code, which gets me positions 4 and 5, but it is not flexible enough when a record has positions 4 through 15:

Summary1 = books.map(lambda x: (x[4]))
Summary2 = books.map(lambda x: (x[5]))

I can get the length of each record in the RDD with:

LenRDD = books.map(lambda x: len(x)).collect()

Can you help me write Python code that dynamically gets positions 4 through LenRDD of each record?

Here is an example of one of the files:

Call No: 56CB
Title:  Global Warming
Type: Serial
Database:  AWS894
Microfilm:  Y
Access:  Public
Location: Oxford
Size:  987 MB
Key:  677867IPOIO

Answer 1:

As I understand your question, you want to drop the first 4 lines of each text file and keep the remaining lines in an RDD. If that is correct, read the files as you are already doing:

books = sc.wholeTextFiles("file:///data/dir/*/*/*/")

Then write a function that deletes the first four elements from a list:

def delete(x):
    # Drop the first four elements in place.
    # Lists with four or fewer elements are returned unchanged.
    if len(x) > 4:
        del x[:4]
    return x

Then use that function to drop the first four lines of each file and keep the rest as an RDD. Each element produced by wholeTextFiles is a (path, content) pair, so x[1] is the file's text:

summary1 = books.map(lambda x: delete(x[1].split("\n"))).map(lambda x: "\n".join(x))

You should get what you are looking for.
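As a possible variation, the same idea can be written with a slice instead of in-place deletion, which also makes it easy to split each remaining line into a key and a value. The following is only a minimal sketch: the helper name remaining_fields and its skip parameter are hypothetical, and it assumes every line has the "Key: value" shape shown in the example file. It runs on a plain string so it can be tried without a Spark cluster.

```python
def remaining_fields(text, skip=4):
    """Return (key, value) pairs for each non-empty line after the first `skip` lines.

    `remaining_fields` and `skip` are illustrative names, not part of Spark.
    """
    # Ignore blank lines so a trailing newline does not produce an empty field
    lines = [ln for ln in text.splitlines() if ln.strip()]
    pairs = []
    # Slicing past the end of a list yields [], so short records need no special case
    for line in lines[skip:]:
        key, _, value = line.partition(":")
        pairs.append((key.strip(), value.strip()))
    return pairs


# Sample content in the shape of the example file from the question
sample = "\n".join([
    "Call No: 56CB",
    "Title:  Global Warming",
    "Type: Serial",
    "Database:  AWS894",
    "Microfilm:  Y",
    "Access:  Public",
])

print(remaining_fields(sample))

# With Spark, assuming `books` comes from sc.wholeTextFiles as above,
# the function could be applied to the content half of each (path, content) pair:
# summary = books.mapValues(remaining_fields)
```

Unlike delete, this version returns an empty list when a file has four or fewer lines, rather than returning the lines unchanged, so choose whichever behavior fits your data.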

Source: https://stackoverflow.com/questions/48959246/how-to-get-values-from-rdd-dynamically-with-python

Tags

python

apache-spark

pyspark