Memory Error when fitting the data using sklearn package

Question

Following my earlier question (http://stackoverflow.com/questions/37844596/avoid-memory-error-when-dealing-with-large-arrays), I was able to get past the MemoryError caused by array operations by splitting them across several lines; thanks to everyone who responded. The problem now is that a MemoryError is raised when fitting the data with scikit-learn, e.g. when calling km.fit(arr_3d[i]) in the code below.

The array is 3D and I'm looping through it, so why am I getting this error, and how can I fix it? Note that it doesn't happen every time; sometimes the code runs fine with no error, and I'm not sure why.

The whole code is:

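import csv
import io

import numpy as np
import pandas as pd
from django.conf import settings
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer

# UploadForm and number_cluster are defined elsewhere in the project (not shown).

def home(request):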
    if request.method=="POST":
        img = UploadForm(request.POST, request.FILES)
        no_clus = int(request.POST.get('num_clusters', 10))

        if img.is_valid():

            # Read the BASE_NAME column of the uploaded CSV file.
            paramFile = io.TextIOWrapper(request.FILES['File'].file)
            portfolio1 = csv.DictReader(paramFile)
            users = [row["BASE_NAME"] for row in portfolio1]

            my_list = users
            vectorizer = CountVectorizer()
            dtm = vectorizer.fit_transform(my_list)

            lsa = TruncatedSVD(n_components=100)
            dtm_lsa = lsa.fit_transform(dtm)
            dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
            # Cosine distance between the row-normalised LSA vectors.
            product = np.dot(dtm_lsa, dtm_lsa.T)
            dist1 = 1 - product
            k = len(my_list)  # length is 5362
            data2 = np.asarray(dist1)
            arr_3d = data2.reshape((1, k, k))

            print(arr_3d) ### shown below
            print(len(arr_3d))
            no_cluster = number_cluster(request,len(my_list))
            print(no_cluster)
            for i in range(len(arr_3d)):
                # Commented-out alternatives:
                # km = AgglomerativeClustering(n_clusters=no_clus, linkage='ward')
                # km = AgglomerativeClustering(n_clusters=no_cluster, linkage='average')
                # km = KMeans(n_clusters=no_clus, init='k-means++')
                # km = MeanShift()
                km = AgglomerativeClustering(n_clusters=no_cluster, linkage='complete')
                km = km.fit(arr_3d[i])  # the MemoryError is raised here
                labels = km.labels_

            csvfile = settings.MEDIA_ROOT +'\\'+ 'images\\export.csv'

            csv_input = pd.read_csv(csvfile, encoding='latin-1')
            csv_input['cluster_ID'] = labels
            csv_input['BASE_NAME'] = my_list
            csv_input.to_csv(settings.MEDIA_ROOT +'/'+ 'output.csv', index=False)
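
One detail worth checking in the code above: unless told that the input is precomputed, AgglomerativeClustering treats each row of the matrix as a feature vector, not as a row of distances. Below is a minimal sketch of passing a 2D distance matrix as precomputed distances. The keyword was `affinity` in older scikit-learn releases and `metric` in newer ones, so the sketch checks which one the installed version accepts. The helper name `cluster_precomputed` and the toy points are made up for illustration, and this is not claimed to resolve the MemoryError by itself.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering


def cluster_precomputed(dist, n_clusters, linkage="complete"):
    """Cluster from a square distance matrix (hypothetical helper).

    'ward' linkage does not accept precomputed distances, so the default
    here is 'complete', as in the question.
    """
    params = inspect.signature(AgglomerativeClustering.__init__).parameters
    # Newer releases use `metric`; older ones only know `affinity`.
    key = "metric" if "metric" in params else "affinity"
    model = AgglomerativeClustering(
        n_clusters=n_clusters, linkage=linkage, **{key: "precomputed"}
    )
    return model.fit(dist).labels_


# Toy data: two well-separated pairs of points (illustrative only).
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
labels = cluster_precomputed(dist, n_clusters=2)
print(labels)
```

Since arr_3d holds a single k x k matrix, the loop runs only once; the 2D array could be passed to fit directly, without the reshape.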

arr_3d is:

 [[[  0.00000000e+00   9.87752905e-01   1.00070800e+00 ...,   8.93937985e-01
     1.00352321e+00   1.00481892e+00]
  [  9.87752905e-01  -2.22044605e-16   1.00107768e+00 ...,   9.80156085e-01
     1.00047940e+00   1.00059883e+00]
  [  1.00070800e+00   1.00107768e+00  -6.66133815e-16 ...,   9.97548342e-01
     9.99890765e-01   1.00143594e+00]
  ..., 
  [  8.93937985e-01   9.80156085e-01   9.97548342e-01 ...,  -2.22044605e-16
     2.34431311e-01   9.87267801e-01]
  [  1.00352321e+00   1.00047940e+00   9.99890765e-01 ...,   2.34431311e-01
    -2.22044605e-16   1.00152421e+00]
  [  1.00481892e+00   1.00059883e+00   1.00143594e+00 ...,   9.87267801e-01
     1.00152421e+00   3.33066907e-16]]]
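
For a sense of scale, the matrix printed above is k x k with k = 5362. The sketch below estimates its footprint, assuming float64 (NumPy's default), and shows one way to avoid holding a second full copy when computing 1 - product. The helper names `dense_matrix_mib` and `cosine_distance_inplace` are made up for illustration. Whether this is enough to avoid the error depends on how much memory the process can address, which the question does not state.

```python
import numpy as np


def dense_matrix_mib(k, dtype=np.float64):
    """Size in MiB of a dense k x k matrix of the given dtype."""
    return k * k * np.dtype(dtype).itemsize / 1024 ** 2


def cosine_distance_inplace(unit_rows):
    """Return 1 - X @ X.T for row-normalised X, reusing one buffer."""
    dist = np.dot(unit_rows, unit_rows.T)  # k x k similarity matrix
    np.subtract(1.0, dist, out=dist)       # overwrite with distances in place
    return dist


k = 5362  # number of rows reported in the question
print(f"float64: {dense_matrix_mib(k):.1f} MiB per copy")
print(f"float32: {dense_matrix_mib(k, np.float32):.1f} MiB per copy")

# Tiny demonstration on random unit vectors (hypothetical data).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
x /= np.linalg.norm(x, axis=1, keepdims=True)
d = cosine_distance_inplace(x)
```

Each float64 copy comes to about 219 MiB. In the code above, product and dist1 are separate arrays, so two such copies are alive at once before anything scikit-learn allocates internally; data2 and arr_3d are views and add no extra copy.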

Source: https://stackoverflow.com/questions/37890921/memory-error-when-fitting-the-data-using-sklearn-package

Tags

python

memory

scikit-learn

data-fitting