I would like to remove duplicate rows from my CSV file. The first column is the year and the second is the sentence. I would like to remove any duplicates of a sentence, regardless of the year information.
Is there a command I can insert into val text = { } to remove these duplicates?
My script is:
val source = CSVFile("science.csv");
val text = {
  source ~>
  Column(2) ~>
  TokenizeWith(tokenizer) ~>
  TermCounter() ~>
  TermMinimumDocumentCountFilter(30) ~>
  TermDynamicStopListFilter(10) ~>
  DocumentMinimumLengthFilter(5)
}
Thank you!
Essentially you want a version of distinct where you can specify what makes an object (row) unique (the second column).
Given the following (a modified version of SeqLike.distinct):
import scala.collection.mutable

type Row = (Int, String)

def distinct(rows: Seq[Row], f: Row => AnyRef): Seq[Row] = {
  val b = Seq.newBuilder[Row]
  val seen = mutable.HashSet[AnyRef]()
  for (x <- rows) {
    val key = f(x)  // compute the key once per row, inside the loop
    if (!seen(key)) {
      b += x
      seen += key
    }
  }
  b.result()
}
If you have a list of rows (where each row is a tuple), you can get the unique rows based on the second column with:
distinct(rows, _._2)
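For completeness, here is a self-contained, runnable sketch of that idea; the sample rows are invented for illustration:

```scala
import scala.collection.mutable

type Row = (Int, String)

// Keep only the first row for each key produced by f.
def distinct(rows: Seq[Row], f: Row => AnyRef): Seq[Row] = {
  val b = Seq.newBuilder[Row]
  val seen = mutable.HashSet[AnyRef]()
  for (x <- rows) {
    val key = f(x)
    if (!seen(key)) {
      b += x
      seen += key
    }
  }
  b.result()
}

val rows = Seq((2010, "the sky is blue"), (2011, "the sky is blue"), (2011, "water is wet"))
val unique = distinct(rows, _._2)
// Keeps the first occurrence of each sentence: (2010, "the sky is blue") and (2011, "water is wet")
unique foreach println
```

The key function is passed as a plain `Row => AnyRef`, so the same helper works for deduplicating on any column.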
Do you need your code to be reproducible? If not, then in Excel: click the "Data" tab, click the little box directly above "1" and to the left of "A" to select everything, click "Remove Duplicates", make sure "My data has headers" is checked if you have headers, and then uncheck the column with the years, keeping only the sentence column checked. This removes duplicate sentences but keeps the year from the first occurrence of each sentence.
As sets naturally eliminate duplicates, a simple approach is to fill the rows into a TreeSet, using a custom ordering that only takes the text part of each row into account.
Update
Here is a sample script to demonstrate the above:
import collection.immutable.TreeSet
import scala.io.Source
val lines = Source.fromFile("science.csv").getLines()
val uniques = lines.foldLeft(TreeSet.empty[String](Ordering.by((l: String) => l.split(',')(1)))) {
  (s, l) =>
    if (s contains l) s
    else s + l
}
uniques.toList.sorted foreach println
The script folds the sequence of lines into a TreeSet with a custom ordering based on the second part of each comma-separated line. The simplest fold function would be (s, l) => s + l; however, that would let lines with a later year overwrite lines carrying the same text from an earlier year, which is why I test for containment first.
Now we are almost done; we just need to re-sort the collection by year before printing (this assumes each line starts with the year, so plain lexicographic sorting restores year order).
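To show the fold's behavior concretely, here is a self-contained sketch using made-up sample lines in place of the file:

```scala
import scala.collection.immutable.TreeSet

// Made-up sample lines in "year,sentence" form, ordered by year.
val sampleLines = Seq("2010,alpha", "2011,alpha", "2011,beta")

// Order elements by the sentence part only, so equal sentences collide.
val bySentence: Ordering[String] = Ordering.by((l: String) => l.split(',')(1))

val uniqueLines = sampleLines.foldLeft(TreeSet.empty[String](bySentence)) {
  (s, l) => if (s contains l) s else s + l
}

// Lexicographic sort restores year order, since each line starts with the year.
// Prints "2010,alpha" then "2011,beta" — the 2011 duplicate of "alpha" is dropped.
uniqueLines.toList.sorted foreach println
```

Note that `contains` here uses the custom ordering, so two lines with the same sentence but different years count as equal.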
Source: https://stackoverflow.com/questions/14260103/removing-duplicates