Question
I'm doing an exercise where I'm required to count the number of crimes per year based on a file that has more than 13 million lines (in case that's important info). For that, I did this and it's working fine:
JavaRDD<String> anoRDD = arquivo.map(s -> {
    String[] campos = s.split(";");
    return campos[2];
});
System.out.println(anoRDD.countByValue());
But the next question to be answered is "How many NARCOTICS crimes happen per YEAR?". I managed to filter and count the total, but not per year. I did the following:
JavaRDD<String> NarcoticsRDD = arquivo.map(s -> {
    String[] campos = s.split(";");
    return campos[4];
});
JavaRDD<String> JustNarcotics = NarcoticsRDD.filter(s -> s.equals("NARCOTICS"));
System.out.println(JustNarcotics.countByValue());
How can I do this type of filter in Spark using Java?
Thanks!
Answer 1:
So the first thing you would want to do is to map your data to a bean class.
Step 1: Let's create a bean class to represent your data. It should implement Serializable and must have public getters and setters.
public class CrimeInfo implements Serializable {
    private Integer year;
    private String crimeType;

    public CrimeInfo(Integer year, String crimeType) {
        this.year = year;
        this.crimeType = crimeType;
    }

    public Integer getYear() {
        return year;
    }

    public void setYear(Integer year) {
        this.year = year;
    }

    public String getCrimeType() {
        return crimeType;
    }

    public void setCrimeType(String crimeType) {
        this.crimeType = crimeType;
    }
}
Step 2: Let's create your data. I just created a dummy dataset here, but you can read from your data source.
List<String> crimes = new ArrayList<>();
crimes.add("1998; Robbery");
crimes.add("1998; Robbery");
crimes.add("1998; Narcotics");
JavaRDD<String> crimesRdd = javaSparkContext().parallelize(crimes);
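In the asker's case, the 13-million-line file would be the source instead of a hard-coded list. A minimal sketch, assuming a semicolon-delimited text file at a hypothetical path:
// Hypothetical file path; each line is expected to be semicolon-delimited.
JavaRDD<String> arquivo = javaSparkContext().textFile("crimes.csv");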
Step 3: Let's now map it to the bean class.
JavaRDD<CrimeInfo> crimeInfoRdd = crimesRdd.map(entry -> {
String[] crimeInfo = entry.split(";");
return new CrimeInfo(Integer.parseInt(crimeInfo[0]), crimeInfo[1]);
});
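Adapting that mapping to the field layout from the question (campos[2] = year, campos[4] = crime type, per the asker's code) would look roughly like this:
JavaRDD<CrimeInfo> crimeInfoRdd = arquivo.map(s -> {
    String[] campos = s.split(";");
    // Field positions taken from the question: index 2 = year, index 4 = crime type.
    return new CrimeInfo(Integer.parseInt(campos[2].trim()), campos[4].trim());
});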
Step 4: Let's use a DataFrame to simplify the aggregation.
Dataset<Row> crimeInfoDataset = sparkSession.createDataFrame(crimeInfoRdd, CrimeInfo.class);
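The snippet above assumes an existing SparkSession named sparkSession; a minimal sketch of creating one (the application name is just a placeholder):
// org.apache.spark.sql.SparkSession
SparkSession sparkSession = SparkSession.builder()
        .appName("CrimeCounts")
        .getOrCreate();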
Step 5: Let's group by year and crime type to see the result.
crimeInfoDataset.groupBy("year", "crimeType").count().show(false);
+----+----------+-----+
|year|crimeType |count|
+----+----------+-----+
|1998| Robbery |2 |
|1998| Narcotics|1 |
+----+----------+-----+
If you only want to see activity for a few crime types, just apply a filter to the dataset above, as in the sketch below.
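For the original question (NARCOTICS counts per year), that filter could look like the following sketch; the "NARCOTICS" literal comes from the question's code:
import static org.apache.spark.sql.functions.col;

// Keep only NARCOTICS rows, then count them per year.
crimeInfoDataset
        .filter(col("crimeType").equalTo("NARCOTICS"))
        .groupBy("year")
        .count()
        .show(false);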
Source: https://stackoverflow.com/questions/64979082/how-to-filter-in-spark-using-two-conditions