Question
I have a text file over 50 GB in size and I want to delete the duplicate words in it. I have heard that I would need a lot of RAM to load every word from the text file into a HashSet. Can you tell me a good way to delete every duplicate word from the text file? The words are separated by whitespace, like this:
word1 word2 word3 ... ...
Answer 1:
The H2 answer is good, but maybe overkill. All the words in the English language add up to no more than a few MB, so just use a Set. You could use this in RAnders00's program.
public static void read50Gigs(String fileLocation, String newFileLocation) {
    Set<String> words = new HashSet<>();
    try (FileInputStream fileInputStream = new FileInputStream(fileLocation);
         Scanner scanner = new Scanner(fileInputStream)) {
        while (scanner.hasNext()) {
            String nextWord = scanner.next();
            words.add(nextWord);
        }
        System.out.println("words size " + words.size());
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
As an estimate of common words, I added this for war and peace (from gutenberg)
public static void read50Gigs(String fileLocation, String newFileLocation) {
    try {
        Set<String> words = Files.lines(Paths.get("war and peace.txt"))
                .map(s -> s.replaceAll("[^a-zA-Z\\s]", ""))
                .flatMap(Pattern.compile("\\s+")::splitAsStream)
                .collect(Collectors.toSet());
        System.out.println("words size " + words.size()); // 22100
        Files.write(Paths.get("out.txt"), words,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
It completed in under a second. Note that you can only use Files.lines if your huge source file actually contains line breaks; when it does, the file is processed line by line, so it won't use too much memory.
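If the file really has no line breaks, a similar streaming approach is still possible with Scanner (a minimal sketch, assuming Java 9+ for Scanner.tokens(); the method name readWithoutLineBreaks is mine, not from the answer). The set of distinct words still has to fit in memory:

// Sketch: stream whitespace-separated words from a file that may lack line breaks.
// Assumes Java 9+ for Scanner.tokens(); the Set of distinct words still lives in memory.
public static void readWithoutLineBreaks(String fileLocation, String newFileLocation) {
    try (Scanner scanner = new Scanner(new FileInputStream(fileLocation))) {
        Set<String> words = scanner.tokens()   // Scanner's default delimiter is whitespace
                .collect(Collectors.toCollection(LinkedHashSet::new)); // keep first-seen order
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

The LinkedHashSet is only there to preserve the first-seen order of the words; a plain HashSet works just as well if order doesn't matter.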
Answer 2:
This approach uses a database to buffer the words found.
It also assumes that words are equal regardless of case.
The H2 documentation states that a database on a non-FAT filesystem has a maximum size of 4 TB (using the default page size of 2KB), which is more than enough for this purpose.
package com.stackoverflow;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.sql.*;
import java.util.Scanner;

public class H2WordReading {

    public static void main(String[] args) {
        // read50Gigs("50gigfile.txt", "cleaned50gigfile.txt");
        read50Gigs("./testSmallFile", "./cleaned");
    }

    public static void read50Gigs(String fileLocation, String newFileLocation) {
        try (Connection connection = DriverManager.getConnection("jdbc:h2:./words");
             FileInputStream fileInputStream = new FileInputStream(fileLocation);
             Scanner scanner = new Scanner(fileInputStream);
             FileOutputStream fileOutputStream = new FileOutputStream(newFileLocation);
             OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileOutputStream)) {

            connection.createStatement().execute("DROP TABLE IF EXISTS WORDS;");
            connection.createStatement().execute("CREATE TABLE WORDS(WORD VARCHAR NOT NULL);");

            PreparedStatement insertStatement = connection.prepareStatement("INSERT INTO WORDS VALUES (?);");
            PreparedStatement queryStatement = connection.prepareStatement("SELECT * FROM WORDS WHERE UPPER(WORD) = UPPER(?);");

            while (scanner.hasNext()) {
                String nextWord = scanner.next();

                queryStatement.setString(1, nextWord);
                ResultSet resultSet = queryStatement.executeQuery();

                if (!resultSet.next()) // word not found, ok
                {
                    outputStreamWriter.write(scanner.hasNext() ? (nextWord + ' ') : nextWord);

                    insertStatement.setString(1, nextWord);
                    insertStatement.execute();
                } // word found, just don't write anything
            }
        } catch (IOException | SQLException e) {
            throw new RuntimeException(e);
        }
    }
}
You need to have the H2 driver jar on the classpath.
Note that I only tested this with a small file of about 10 words. You should try this approach with your 50 gigabyte file and report back any errors.
Please be aware that this approach
- normalizes all whitespace and newlines down to a single space character
- always keeps the first occurrence of a word and drops all subsequent occurrences
Without an index on WORD, every lookup scans the whole table, so the time this approach takes grows roughly quadratically with the number of words in the file.
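If that cost becomes a problem, one possible mitigation (my sketch, not part of the original answer) is to let H2 do the duplicate check through an index: declare the column as the case-insensitive VARCHAR_IGNORECASE type with a PRIMARY KEY, then simply attempt the insert and treat a duplicate-key failure as "already seen". The connection, scanner and outputStreamWriter below are the same resources opened in read50Gigs above:

// Sketch (assumption, not from the original answer): index-backed duplicate check.
// connection, scanner and out are the resources already opened in read50Gigs.
static void writeUniqueWords(Connection connection, Scanner scanner, Writer out)
        throws SQLException, IOException {
    // VARCHAR_IGNORECASE compares case-insensitively; PRIMARY KEY backs it with a unique index.
    connection.createStatement().execute(
            "CREATE TABLE WORDS(WORD VARCHAR_IGNORECASE NOT NULL PRIMARY KEY);");
    PreparedStatement insert = connection.prepareStatement("INSERT INTO WORDS VALUES (?);");
    while (scanner.hasNext()) {
        String nextWord = scanner.next();
        insert.setString(1, nextWord);
        try {
            insert.execute();           // succeeds only the first time a word is seen
            out.write(nextWord + ' ');  // keep the first occurrence (writes a trailing space)
        } catch (SQLException duplicateKey) {
            // the unique index rejected the insert: word already written, skip it
        }
    }
}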
Answer 3:
// Remove duplicate words from a string of whitespace-separated text
public String removeDupsFromFile(String str) {
    String[] words = str.split(" ");
    // LinkedHashMap keeps the first-seen order of the words
    LinkedHashMap<String, Integer> map = new LinkedHashMap<String, Integer>();
    for (int i = 0; i < words.length; i++) {
        if (map.containsKey(words[i])) {
            int count = map.get(words[i]) + 1;
            map.put(words[i], count);
        } else {
            map.put(words[i], 1);
        }
    }
    StringBuilder result = new StringBuilder("");
    Iterator<String> itr = map.keySet().iterator();
    while (itr.hasNext()) {
        result.append(itr.next()).append(' ');
    }
    return result.toString();
}
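For illustration, a hypothetical call (not part of the answer) shows the behavior; note that this approach needs the entire text in memory as a single String, which does not fit the 50 GB case:

// Hypothetical usage of removeDupsFromFile; the whole input must fit in memory.
String cleaned = removeDupsFromFile("word1 word2 word1 word3 word2");
System.out.println(cleaned); // prints "word1 word2 word3 " (first occurrences, trailing space kept)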
Source: https://stackoverflow.com/questions/33321275/delete-duplicate-words-in-a-big-text-file-java