Memory Error when fitting the data using sklearn package

Submitted by 别说谁变了你拦得住时间么 on 2019-12-08 05:42:30

Question


Following up on my earlier question (http://stackoverflow.com/questions/37844596/avoid-memory-error-when-dealing-with-large-arrays): I was able to get past the Memory Error caused by the array operations by splitting them across several lines, thanks to everyone who responded. The problem now is that a Memory Error is thrown when fitting the data with sklearn, e.g. when calling km.fit(arr_3d[i]) in the code below.

The array is 3D and I'm looping through it, so why am I getting this error, and how can I fix it? Note that it doesn't happen every time; sometimes the fit runs with no error, and I'm not sure why.

The full code is:

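import csv
import io

import numpy as np
import pandas as pd
from django.conf import settings
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer

# UploadForm and number_cluster are defined elsewhere in the project (not shown).

def home(request):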
    if request.method=="POST":
        img = UploadForm(request.POST, request.FILES)
        no_clus = int(request.POST.get('num_clusters', 10))

        if img.is_valid():

            paramFile =io.TextIOWrapper(request.FILES['File'].file)
            portfolio1 = csv.DictReader(paramFile)
            users = []
            users = [row["BASE_NAME"] for row in portfolio1]


            my_list = users
            vectorizer = CountVectorizer()
            dtm = vectorizer.fit_transform(my_list)

            lsa = TruncatedSVD(n_components=100)
            dtm_lsa = lsa.fit_transform(dtm)
            dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
            product= (np.dot(dtm_lsa, dtm_lsa.T))
            dist1 = (1 - product)
            k = len(my_list) ### length is 5362 
            data2 = np.asarray(dist1)
            arr_3d = data2.reshape((1, k, k))

            print(arr_3d) ### shown below
            print(len(arr_3d))
            no_cluster = number_cluster(request,len(my_list))
            print(no_cluster)
            for i in range(len(arr_3d)):
                #km = AgglomerativeClustering(n_clusters=no_clus, linkage='ward')
                #km = km.fit(arr_3d[i])
              #  km = KMeans(n_clusters=no_cluster, init='k-means++')
                km = AgglomerativeClustering(n_clusters=no_cluster, linkage='complete')
                km = km.fit(arr_3d[i])
                #km = AgglomerativeClustering(n_clusters=no_cluster, linkage='average').fit(arr_3d[i])
                # km = AgglomerativeClustering(n_clusters=no_clus, linkage='complete').fit(arr_3d[i])
                # km = MeanShift()
                # km = KMeans(n_clusters=no_clus, init='k-means++')
                # km = MeanShift()
                #  km = km.fit(arr_3d[i])
                # print km
                labels = km.labels_

            csvfile = settings.MEDIA_ROOT +'\\'+ 'images\\export.csv'

            csv_input = pd.read_csv(csvfile, encoding='latin-1')
            csv_input['cluster_ID'] = labels
            csv_input['BASE_NAME'] = my_list
            csv_input.to_csv(settings.MEDIA_ROOT +'/'+ 'output.csv', index=False)
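In the code above, product and dist1 are each a full k x k array, so the distance matrix exists twice before the fit even starts. A minimal sketch of building the same matrix with a single allocation follows. It assumes the rows are already L2-normalised (as Normalizer produces) and that float32 precision is acceptable for clustering. The function name cosine_distance_inplace and the small random sample are illustrative only and not part of the original code.

```python
import numpy as np

def cosine_distance_inplace(unit_rows, dtype=np.float32):
    """Return 1 - X @ X.T while allocating only one (k, k) array.

    unit_rows is assumed to hold L2-normalised rows.
    The float32 default is an illustrative choice that halves the footprint.
    """
    x = np.asarray(unit_rows, dtype=dtype)
    dist = x @ x.T                      # the single (k, k) allocation
    np.subtract(1.0, dist, out=dist)    # overwrite in place: dist = 1 - dist
    np.clip(dist, 0.0, None, out=dist)  # clear tiny negative rounding values
    return dist

# Small stand-in for dtm_lsa: 6 unit-length rows with 4 components each.
rng = np.random.default_rng(0)
sample = rng.normal(size=(6, 4))
sample /= np.linalg.norm(sample, axis=1, keepdims=True)

d = cosine_distance_inplace(sample)
print(d.shape, d.dtype)
```

The clip step addresses the slightly negative diagonal entries (around -2e-16) that show up in the printed array; whether this reduction is enough to avoid the error depends on how much memory the fit itself needs.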

arr_3d is:

 [[[  0.00000000e+00   9.87752905e-01   1.00070800e+00 ...,   8.93937985e-01
     1.00352321e+00   1.00481892e+00]
  [  9.87752905e-01  -2.22044605e-16   1.00107768e+00 ...,   9.80156085e-01
     1.00047940e+00   1.00059883e+00]
  [  1.00070800e+00   1.00107768e+00  -6.66133815e-16 ...,   9.97548342e-01
     9.99890765e-01   1.00143594e+00]
  ..., 
  [  8.93937985e-01   9.80156085e-01   9.97548342e-01 ...,  -2.22044605e-16
     2.34431311e-01   9.87267801e-01]
  [  1.00352321e+00   1.00047940e+00   9.99890765e-01 ...,   2.34431311e-01
    -2.22044605e-16   1.00152421e+00]
  [  1.00481892e+00   1.00059883e+00   1.00143594e+00 ...,   9.87267801e-01
     1.00152421e+00   3.33066907e-16]]]
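For a sense of scale, the size of one such k x k matrix can be estimated directly. The sketch below assumes NumPy's default float64 (8 bytes per element) and takes k = 5362 from the comment in the code; square_matrix_bytes is a hypothetical helper, not something from the original post.

```python
def square_matrix_bytes(k, itemsize=8):
    """Bytes needed for a dense k x k array with the given element size."""
    return k * k * itemsize

k = 5362                      # len(my_list) according to the code comment
one = square_matrix_bytes(k)  # a single float64 matrix

print(f"{one:,} bytes, about {one / 2**20:.1f} MiB per matrix")
print(f"about {2 * one / 2**20:.1f} MiB for product and dist1 together")
```

That comes to roughly 219 MiB per matrix. Since np.asarray and reshape return views of an existing array here rather than copies, product and dist1 are the two full-size allocations, about 439 MiB in total. This figure does not include whatever the clustering fit allocates internally.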

Source: https://stackoverflow.com/questions/37890921/memory-error-when-fitting-the-data-using-sklearn-package
