How to get values from RDD dynamically with Python?

Submitted by 风流意气都作罢 on 2020-01-03 02:21:29

Question


Below is a sample record for a book in our campus system. Each book record is a text file. I loaded the records with:

books = sc.wholeTextFiles("file:///data/dir/*/*/*/")

This gives me an RDD. One record in the RDD looks like this:

[['Call No: 56CB',
  'Title:  Global Warming',
  'Type: Serial,',
  'Database:  AWS898,',
  'Microfilm:  Y,',
  'Access:  Public ,',
]]

I am trying to extract the values at positions 4 through N of each record. Positions 0 through 4 are always present, but a record may be missing position 5 and beyond, like this:

[['Call No: 56CB',
  'Title:  Science 101',
  'Type: Serial,',
  'Database:  AWS898,',
  'Microfilm:  Y,',
]]

So the code has to handle records of variable length. The following code gets me positions 4 and 5, but it is not flexible enough when a record has positions 4 through 15:

Summary1 = books.map(lambda x: (x[4]))
Summary2 = books.map(lambda x: (x[5]))

I can get the length of each record in the RDD with:

LenRDD = books.map(lambda x: len(x)).collect()

Can you help me write Python code that dynamically gets positions 4 through LenRDD?

Here is an example of one of the files:

Call No: 56CB
Title:  Global Warming
Type: Serial
Database:  AWS894
Microfilm:  Y
Access:  Public
Location: Oxford
Size:  987 MB
Key:  677867IPOIO

Answer 1:


As I understand your question, you want to drop the first 4 lines of each text file and keep the remaining lines in an RDD. If that is correct, read the files as you are already doing:

books = sc.wholeTextFiles("file:///data/dir/*/*/*/")

Then write a function that deletes the first four elements from a list:

def delete(x):
    # Drop the first four lines in place; lists with four or
    # fewer lines are returned unchanged.
    if len(x) > 4:
        del x[:4]
    return x

Then use that function to drop the first four lines of each text file and keep the remaining lines as an RDD:

summary1 = books.map(lambda x: delete(x[1].split("\n"))).map(lambda x: "\n".join(x))

This should give you what you are looking for.
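As a rough sketch of the same idea without in-place deletion: sc.wholeTextFiles yields (path, content) pairs, so the file text is the second element of each record, and a plain slice can drop the leading lines. The helper name tail_fields and its skip parameter are made up for illustration, and the sample text is the example file from the question. The function is ordinary Python, so it can be tried locally before being passed to Spark.

```python
def tail_fields(content, skip=4):
    """Return (key, value) pairs for every non-empty line after the first `skip`.

    `tail_fields` and `skip` are hypothetical names used for illustration.
    """
    # splitlines() avoids the empty trailing element that split("\n")
    # produces when the file ends with a newline.
    lines = [ln.strip() for ln in content.splitlines() if ln.strip()]
    pairs = []
    for ln in lines[skip:]:  # slicing is safe even if the record is short
        key, sep, value = ln.partition(":")
        if sep:  # keep only lines that actually contain a "Key: value" pair
            pairs.append((key.strip(), value.strip()))
    return pairs


# The example file from the question, as one string.
sample = (
    "Call No: 56CB\n"
    "Title:  Global Warming\n"
    "Type: Serial\n"
    "Database:  AWS894\n"
    "Microfilm:  Y\n"
    "Access:  Public\n"
    "Location: Oxford\n"
    "Size:  987 MB\n"
    "Key:  677867IPOIO\n"
)

print(tail_fields(sample))

# Assumed Spark usage, with `books` loaded via sc.wholeTextFiles as above:
# summary = books.mapValues(tail_fields)
```

Unlike delete, which returns short lists unchanged, the slice returns an empty list when a file has no more than skip lines; which behavior is preferable depends on how such files should be treated downstream.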



Source: https://stackoverflow.com/questions/48959246/how-to-get-values-from-rdd-dynamically-with-python
