How can I read LIBSVM models (saved using LIBSVM) into PySpark?

前端未结

关注

 1  1490

离开以前 2020-12-11 11:12

I have a LIBSVM scaling model (generated with svm-scale) that I would like to port over to PySpark. I\'ve naively tried the following:

scaler_path = \"path t


      
      
        
          1条回答        

        
                    
            
            
                         
                
              
              
                
                   死守一世寂寞
                                             
                
                
                (楼主)
            
              
              
                2020-12-11 11:41
              

            
            
                        
First, the file presented isn't in libsvm format. The correct format of a libsvm file is the following : 

 : : ... :


Thus your data preparation is incorrect to start with.

Secondly, the class method load(path) that you are using with MinMaxScaler reads an ML instance from the input path. 

Remember that : MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel. The model can then transform each feature individually such that it is in the given range. 

e.g :

from pyspark.ml.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.feature import MinMaxScaler
df = spark.createDataFrame([(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])) ,(0.0, Vectors.dense([1.01, 2.02, 3.03]))],['label','features'])

df.show(truncate=False)
# +-----+---------------------+
# |label|features             |
# +-----+---------------------+
# |1.1  |(3,[0,2],[1.23,4.56])|
# |0.0  |[1.01,2.02,3.03]     |
# +-----+---------------------+

mmScaler = MinMaxScaler(inputCol="features", outputCol="scaled")
temp_path = "/tmp/spark/"
minMaxScalerPath = temp_path + "min-max-scaler"
mmScaler.save(minMaxScalerPath)   


The snippet above will save the MinMaxScaler feature transformer so it can be loaded after with the class method load. 

Now, let's take a look at what actually happened. The class method save will create the following file structure :

/tmp/spark/
└── min-max-scaler
    └── metadata
        ├── part-00000
        └── _SUCCESS


Let's check the content of that part-0000 file :

$ cat /tmp/spark/min-max-scaler/metadata/part-00000 | python -m json.tool
{
    "class": "org.apache.spark.ml.feature.MinMaxScaler",
    "paramMap": {
        "inputCol": "features",
        "max": 1.0,
        "min": 0.0,
        "outputCol": "scaled"
    },
    "sparkVersion": "2.0.0",
    "timestamp": 1480501003244,
    "uid": "MinMaxScaler_42e68455a929c67ba66f"
}


So actually when you load the transformer :

loadedMMScaler = MinMaxScaler.load(minMaxScalerPath)


You are actually load that file. It won't take a libsvm file !

Now you can apply your transformer to create the model and transform your DataFrame :

model = loadedMMScaler.fit(df)

model.transform(df).show(truncate=False)                                    
# +-----+---------------------+-------------+
# |label|features             |scaled       |
# +-----+---------------------+-------------+
# |1.1  |(3,[0,2],[1.23,4.56])|[1.0,0.0,1.0]|
# |0.0  |[1.01,2.02,3.03]     |[0.0,1.0,0.0]|
# +-----+---------------------+-------------+


Now let's get back to that libsvm file and let us create some dummy data and save it to a libsvm format using MLUtils

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.util import MLUtils
data = sc.parallelize([LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])), LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))])
MLUtils.saveAsLibSVMFile(data, temp_path + "data")


Back to our file structure :

/tmp/spark/
├── data
│   ├── part-00000
│   ├── part-00001
│   ├── part-00002
│   ├── part-00003
│   ├── part-00004
│   ├── part-00005
│   ├── part-00006
│   ├── part-00007
│   └── _SUCCESS
└── min-max-scaler
    └── metadata
        ├── part-00000
        └── _SUCCESS


You can check the content of those file which is in libsvm format now :

$ cat /tmp/spark/data/part-0000*
1.1 1:1.23 3:4.56
0.0 1:1.01 2:2.02 3:3.03


Now let's load that data and apply :

loadedData = MLUtils.loadLibSVMFile(sc, temp_path + "data")
loadedDataDF = spark.createDataFrame(loadedData.map(lambda lp : (lp.label, lp.features.asML())), ['label','features'])

loadedDataDF.show(truncate=False)
# +-----+----------------------------+                                            
# |label|features                    |
# +-----+----------------------------+
# |1.1  |(3,[0,2],[1.23,4.56])       |
# |0.0  |(3,[0,1,2],[1.01,2.02,3.03])|
# +-----+----------------------------+


Note that converting MLlib Vectors to ML Vectors is very important. You can read more about it here.

model.transform(loadedDataDF).show(truncate=False)
# +-----+----------------------------+-------------+
# |label|features                    |scaled       |
# +-----+----------------------------+-------------+
# |1.1  |(3,[0,2],[1.23,4.56])       |[1.0,0.0,1.0]|
# |0.0  |(3,[0,1,2],[1.01,2.02,3.03])|[0.0,1.0,0.0]|
# +-----+----------------------------+-------------+


I hope that this answers your question!
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                    
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复