Parsing CSV from SFTP server too slow, how to improve efficiency? [duplicate]

Submitted by 南楼画角 on 2020-05-13 14:32:12

Question


So I have an SFTP server that hosts a single CSV file that contains data about multiple courses. The data is in the following format (4 columns):

Activity Name,Activity Code,Completion Status,Full Name
Safety with Lasers, 3XX1, 10-Jul-20, "Person, Name"
Safety with Lasers, 3XX1, NaN, "OtherP, OtherName"
How to use wrench, 7NPA, 10-Aug-19, "OtherName, Person"
etc...

I am using Paramiko to access the file using the following code:

file = sftp.open('Data.csv')

But the issue I am having is that it returns an SFTPFile object. How can I go about parsing the data from it? I need to extract the names of the courses and keep track of how many people have and have not completed each one. I am using the following code at the moment, but it is horrendously slow. Any suggestions would be appreciated:

import pandas

Courses = ['']
Total = [0]
Compl = [0]
i = 0

csvreal = pandas.read_csv(file)
for index, row in csvreal.iterrows():
    string = csvreal.loc[[index]].to_string(index=False, header=False)
    if Courses[i] != string.split('  ')[0]:
        i += 1
        Courses.append(string.split('  ')[0])
        Total.append(0)
        Compl.append(0)
    if len(string.split('  ')[2]) > 3:  # incomplete courses have no completion date, so the field is NaN
        Compl[i] += 1
    Total[i] += 1

I know it is terrible; I'm new and have no idea what I am doing. Any advice on relevant documentation or tutorials to read up on would be appreciated. Thank you!


Answer 1:


sftp.open opens the file on the remote server, so every read takes place over the network, which is very slow compared to reading from local disk. It is more efficient to copy the file to your local machine with sftp.get and then read it locally, so you don't pay the network overhead on every read. If you need to update the file, you can update the local copy and then copy it back to the server with sftp.put.

The code would be something like this (untested, as I don't have an SFTP server to hand):

# Retrieve a local copy, then read it from disk.
# Note: Paramiko's sftp.get returns None; it writes the remote file to the local path.
sftp.get('Data.csv', 'local-copy-Data.csv')
csvreal = pandas.read_csv('local-copy-Data.csv')

# After updating the local copy, push it back to the remote server
sftp.put('local-copy-Data.csv', 'Data.csv')
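
As a side note, once the file is on local disk the counting itself doesn't need the row-by-row iterrows loop either. Here is a minimal sketch, assuming the column headers from the sample data in the question ('Activity Name' and 'Completion Status') and that a missing completion date means the course was not completed:

import pandas

# Read the local copy produced by sftp.get above; skipinitialspace handles
# the spaces after the commas in the sample data
df = pandas.read_csv('local-copy-Data.csv', skipinitialspace=True)

# 'size' counts every row per course, 'count' counts only rows with a
# completion date (incomplete courses have NaN in 'Completion Status')
counts = df.groupby('Activity Name')['Completion Status'].agg(
    Total='size', Completed='count')
print(counts)

This replaces the whole loop with one grouped aggregation, which is typically much faster than calling to_string on every row.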



Answer 2:


You'll need a C compiler installed to do this. I tried using some Cython to make things a little quicker:

import pandas as pd
import time

start_time = time.time()

csvreal = pd.read_csv("Data.csv")

# Courses, Total and Compl must stay Python lists, since they are appended to below;
# typing the loop counter is where Cython can help
Courses = ['']
Total = [0]
Compl = [0]
cdef int i = 0

for index, row in csvreal.iterrows():
    string = csvreal.loc[[index]].to_string(index=False, header=False)
    if Courses[i] != string.split('  ')[0]:
        i += 1
        Courses.append(string.split('  ')[0])
        Total.append(0)
        Compl.append(0)
    if len(string.split('  ')[2]) > 3:  # incomplete courses have no completion date, so the field is NaN
        Compl[i] += 1
    Total[i] += 1

print("--- %s seconds ---" % (time.time() - start_time))

You'll need to install Cython with pip install Cython, save the code above as a .pyx file, and compile it with a setup.py file in the same directory. The tutorial covering all of this is in the Cython documentation.
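
For reference, a minimal setup.py along those lines might look like this (the .pyx filename here is just a placeholder; use whatever you named the file above):

# setup.py -- minimal Cython build script
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("parse_courses.pyx"))

Then build the extension in place with python setup.py build_ext --inplace and import the resulting module from a normal Python script.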

Hope this helps!



Source: https://stackoverflow.com/questions/61162671/parsing-csv-from-sftp-server-too-slow-how-to-improve-efficiency
