Optimal way to create an ML pipeline in Apache Spark for a dataset with a high number of columns


I am working with Spark 2.1.1 on a dataset with ~2000 features and trying to create a basic ML Pipeline, consisting of some Transformers and a Classifier.

Let's assu…

2 Answers
  • 2020-12-25 10:58

    The Janino error is due to the number of constant variables created during the optimizer's code-generation process. The maximum number of constant-pool entries allowed in a JVM class is (2^16) - 1 = 65535. If this limit is exceeded, you get Constant pool for class ... has grown past JVM limit of 0xFFFF

    The JIRA tracking the fix for this issue is SPARK-18016, but it's still in progress at this time.

    Your code is most likely failing during the VectorAssembler stage, when it has to process thousands of columns in a single optimization task.

    The workaround that I developed for this problem is to create a "vector of vectors": assemble subsets of the columns into sub-vectors, then bring those sub-vectors together at the end into a single feature vector. This prevents any single optimization task from exceeding the JVM constant-pool limit. It's not elegant, but I've used it on datasets with up to ~10k columns.

    This method also allows you to still keep a single pipeline, though it requires some additional steps to make it work (creating the sub-vectors). After you've created the feature vector from the sub-vectors, you can drop the original source columns if desired, as shown after the example code.

    Example Code:

    // IMPORT DEPENDENCIES (assumes an existing SparkSession named `spark`, e.g. in spark-shell)
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.VectorAssembler
    
    // Create first example dataframe
    val exampleDF = spark.createDataFrame(Seq(
      (1, 1, 2, 3, 8, 4, 5, 1, 3, 2, 0, 4, 2, 8, 1, 1, 2, 3, 8, 4, 5),
      (2, 4, 3, 8, 7, 9, 8, 2, 3, 3, 2, 6, 5, 4, 2, 4, 3, 8, 7, 9, 8),
      (3, 6, 1, 9, 2, 3, 6, 3, 8, 5, 1, 2, 3, 5, 3, 6, 1, 9, 2, 3, 6),
      (4, 7, 8, 6, 9, 4, 5, 4, 9, 8, 2, 4, 9, 2, 4, 7, 8, 6, 9, 4, 5),
      (5, 9, 2, 7, 8, 7, 3, 5, 3, 4, 8, 0, 6, 2, 5, 9, 2, 7, 8, 7, 3),
      (6, 1, 1, 4, 2, 8, 4, 6, 3, 9, 8, 8, 9, 3, 6, 1, 1, 4, 2, 8, 4)
    )).toDF("uid", "col1", "col2", "col3", "col4", "col5", 
            "col6", "col7", "col8", "col9", "colA", "colB", 
            "colC", "colD", "colE", "colF", "colG", "colH", 
            "colI", "colJ", "colK")
    
    // Create multiple column lists using the sliding method
    val Array(colList1, colList2, colList3, colList4) = exampleDF.columns.filter(_ != "uid").sliding(5,5).toArray
    
    // Create a vector assembler for each column list
    val colList1_assembler = new VectorAssembler().setInputCols(colList1).setOutputCol("colList1_vec")
    val colList2_assembler = new VectorAssembler().setInputCols(colList2).setOutputCol("colList2_vec")
    val colList3_assembler = new VectorAssembler().setInputCols(colList3).setOutputCol("colList3_vec")
    val colList4_assembler = new VectorAssembler().setInputCols(colList4).setOutputCol("colList4_vec")
    
    // Create a vector assembler using column list vectors as input
    val features_assembler = new VectorAssembler().setInputCols(Array("colList1_vec","colList2_vec","colList3_vec","colList4_vec")).setOutputCol("features")
    
    // Create the pipeline with column list vector assemblers first, then the final vector of vectors assembler last
    val pipeline = new Pipeline().setStages(Array(colList1_assembler,colList2_assembler,colList3_assembler,colList4_assembler,features_assembler))
    
    // Fit and transform the data
    val featuresDF = pipeline.fit(exampleDF).transform(exampleDF)
    
    // Get the number of features in "features" vector
    val featureLength = featuresDF.schema(featuresDF.schema.fieldIndex("features")).metadata.getMetadata("ml_attr").getLong("num_attrs")
    
    // Print the number of features in the "features" vector (20 in this example)
    println(featureLength)
    
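    For instance, once the pipeline has run, a simple select keeps only the id and the assembled vector. This is a sketch using the names from the example above:

    // Drop the source columns and intermediate sub-vectors, keeping uid + features
    val slimDF = featuresDF.select("uid", "features")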

    (Note: The method of creating the column lists should really be done programmatically, but I've kept this example simple for the sake of understanding the concept. A sketch of the programmatic version follows.)
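    Here is one way to generate the sub-assemblers programmatically. This is a minimal sketch, not part of the original answer: the chunk size of 5 and the output-column prefix sub_vec_ are arbitrary choices, and exampleDF is the dataframe from the example above.

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.VectorAssembler

    val groupSize = 5  // arbitrary; tune so no single assembler sees too many columns
    val featureCols = exampleDF.columns.filter(_ != "uid")

    // One VectorAssembler per chunk of feature columns
    val subAssemblers: Array[PipelineStage] = featureCols
      .grouped(groupSize)
      .zipWithIndex
      .map { case (cols, i) =>
        new VectorAssembler().setInputCols(cols).setOutputCol(s"sub_vec_$i")
      }
      .toArray

    // Final assembler that merges the sub-vectors into a single feature vector
    val subVecCols = subAssemblers.indices.map(i => s"sub_vec_$i").toArray
    val featuresAssembler = new VectorAssembler()
      .setInputCols(subVecCols)
      .setOutputCol("features")

    val autoPipeline = new Pipeline().setStages(subAssemblers :+ featuresAssembler)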

  • 2020-12-25 11:09

    The Janino error you are getting occurs because, as the feature set grows, the generated code becomes larger and eventually exceeds JVM limits.

    I'd separate the steps into different pipelines, drop the unnecessary features, and save intermediate models such as a fitted StringIndexer or OneHotEncoder so they can be loaded again at prediction time (see the sketch below). This also helps because the transformations will be faster for the data that has to be predicted.
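    A minimal sketch of that save/load idea, assuming the Spark 2.x ML persistence API; the column names (category, categoryIdx) and the path /models/prep are placeholders, as are trainDF and newDF:

    import org.apache.spark.ml.{Pipeline, PipelineModel}
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // Fit the preprocessing pipeline once on the training data (trainDF is a placeholder)
    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
    val encoder = new OneHotEncoder().setInputCol("categoryIdx").setOutputCol("categoryVec")
    val prepModel: PipelineModel = new Pipeline().setStages(Array(indexer, encoder)).fit(trainDF)

    // Save the fitted model so the prediction job can reuse it without refitting
    prepModel.write.overwrite().save("/models/prep")

    // At prediction time: reload and transform the incoming data (newDF is a placeholder)
    val reloaded = PipelineModel.load("/models/prep")
    val prepared = reloaded.transform(newDF)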

    Finally, you don't need to keep the individual feature columns after the VectorAssembler stage, since it packs them into a single feature vector; together with the label column, that is all you need to run predictions (see the one-liner below).
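    For example (assembledDF, label, and features are placeholder names here):

    // After VectorAssembler, keep only the label and the assembled feature vector
    val trainReady = assembledDF.select("label", "features")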

    Example of a Pipeline in Scala with saving of intermediate steps (older Spark API)

    Also, if you are using an older version of Spark such as 1.6.0, check that you are on a patched release (i.e. 2.1.1, 2.2.0, or 1.6.4); otherwise you will hit the Janino error even with only around 400 feature columns.
