Question
I'm doing an exercise where I'm required to count the number of crimes per year based on a file that has more than 13 million lines (in case that's important info). For that, I did this and it's working fine:
JavaRDD<String> anoRDD = arquivo.map(s -> {
    String[] campos = s.split(";");
    return campos[2];
});
System.out.println(anoRDD.countByValue());
But the next question to be answered is "How many NARCOTICS crimes happen per YEAR?". I managed to filter and count the total, but not per year. I did the following:
JavaRDD<String> NarcoticsRDD = arquivo.map(s -> {
    String[] campos = s.split(";");
    return campos[4];
});
JavaRDD<String> JustNarcotics = NarcoticsRDD.filter(s -> s.equals("NARCOTICS"));
System.out.println(JustNarcotics.countByValue());
How can I do this type of filter in Spark using Java?
Thanks!
Answer 1:
So the first thing you would want to do is to map your data to a bean class.
Step 1: Let's create a bean class to represent your data. It should implement Serializable and must have public getters and setters.
public class CrimeInfo implements Serializable {
    private Integer year;
    private String crimeType;

    public CrimeInfo(Integer year, String crimeType) {
        this.year = year;
        this.crimeType = crimeType;
    }

    public Integer getYear() {
        return year;
    }

    public void setYear(Integer year) {
        this.year = year;
    }

    public String getCrimeType() {
        return crimeType;
    }

    public void setCrimeType(String crimeType) {
        this.crimeType = crimeType;
    }
}
Step 2: Let's create your data. I just created a dummy dataset here, but you can read from your data source.
List<String> crimes = new ArrayList<>();
crimes.add("1998; Robbery");
crimes.add("1998; Robbery");
crimes.add("1998; Narcotics");
JavaRDD<String> crimesRdd = javaSparkContext().parallelize(crimes);
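In the asker's case, the 13-million-line file would be the source instead of a hard-coded list. A minimal sketch, assuming a semicolon-delimited text file at a hypothetical path:
// Hypothetical file path; each line is expected to be semicolon-delimited.
JavaRDD<String> arquivo = javaSparkContext().textFile("crimes.csv");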
Step 3: Let's now map it to the bean class.
JavaRDD<CrimeInfo> crimeInfoRdd = crimesRdd.map(entry -> {
String[] crimeInfo = entry.split(";");
return new CrimeInfo(Integer.parseInt(crimeInfo[0]), crimeInfo[1]);
});
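Adapting that mapping to the field layout from the question (campos[2] = year, campos[4] = crime type, per the asker's code) would look roughly like this:
JavaRDD<CrimeInfo> crimeInfoRdd = arquivo.map(s -> {
    String[] campos = s.split(";");
    // Field positions taken from the question: index 2 = year, index 4 = crime type.
    return new CrimeInfo(Integer.parseInt(campos[2].trim()), campos[4].trim());
});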
Step 4: Let's use a DataFrame to simplify the aggregation.
Dataset<Row> crimeInfoDataset = sparkSession.createDataFrame(crimeInfoRdd, CrimeInfo.class);
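The snippet above assumes an existing SparkSession named sparkSession; a minimal sketch of creating one (the application name is just a placeholder):
// org.apache.spark.sql.SparkSession
SparkSession sparkSession = SparkSession.builder()
        .appName("CrimeCounts")
        .getOrCreate();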
Step 5: Let's group by year and crime type to see the result.
crimeInfoDataset.groupBy("year", "crimeType").count().show(false);
+----+----------+-----+
|year|crimeType |count|
+----+----------+-----+
|1998| Robbery |2 |
|1998| Narcotics|1 |
+----+----------+-----+
If you only want to see activity for a few crime types, just apply a filter to the dataset above, as in the sketch below.
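For the original question (NARCOTICS counts per year), that filter could look like the following sketch; the "NARCOTICS" literal comes from the question's code:
import static org.apache.spark.sql.functions.col;

// Keep only NARCOTICS rows, then count them per year.
crimeInfoDataset
        .filter(col("crimeType").equalTo("NARCOTICS"))
        .groupBy("year")
        .count()
        .show(false);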
Source: https://stackoverflow.com/questions/64979082/how-to-filter-in-spark-using-two-conditions