Question
I have an array that contains file paths. I want to build a list of those files that are duplicates, based on their MD5 hashes. I calculate the MD5 hashes like this:
private void calcMD5(Array files) //Array contains a path of all files
{
    int i = 0;
    string[] md5_val = new string[files.Length];
    foreach (string file_name in files)
    {
        using (var md5 = MD5.Create())
        {
            using (var stream = File.OpenRead(file_name))
            {
                md5_val[i] = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
                i += 1;
            }
        }
    }
}
With the code above I am able to calculate the MD5 hashes, but how do I get a list of only those files that are duplicates? If there is another way to do the same thing, please let me know. I am also new to LINQ.
Answer 1:
1. Rewrite your calcMD5 function to take in a single file path and return the MD5.
2. Store your file names in a string[] or List<string>, not an untyped array, if possible.
3. Use the following LINQ to get groups of files with the same hash:
var groupsOfFilesWithSameHash = files // or files.Cast<string>() if you're stuck with an Array
    .GroupBy(f => calcMD5(f))
    .Where(g => g.Count() > 1);
4. You can get to the groups with nested foreach loops, for example:
foreach (var group in groupsOfFilesWithSameHash)
{
    // The loop variable is "group", so the key is group.Key
    Console.WriteLine("Shared MD5: " + group.Key);
    foreach (var file in group)
        Console.WriteLine(" " + file);
}
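Step 1 is not spelled out above. A minimal sketch of what a single-file calcMD5 might look like, assuming the same lowercase hex formatting as in the question (the class name FileHashing is only illustrative, not something from the original code):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class FileHashing
{
    // Hypothetical single-file version of calcMD5: takes one path and
    // returns that file's MD5 as a lowercase hex string without dashes.
    public static string calcMD5(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
        }
    }
}
```

With a helper shaped like this, the GroupBy in step 3 can call it once per file path.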
Answer 2:
static void Main(string[] args)
{
    // returns a list of file names, which have duplicate MD5 hashes
    var duplicates = CalcDuplicates(new[] {"Hello.txt", "World.txt"});
}

private static IEnumerable<string> CalcDuplicates(IEnumerable<string> fileNames)
{
    return fileNames.GroupBy(CalcMd5OfFile)
                    .Where(g => g.Count() > 1)
                    // skip SelectMany() if you'd like the duplicates grouped by their hashes as group key
                    .SelectMany(g => g);
}

private static string CalcMd5OfFile(string path)
{
    // I took your implementation - I don't know if there are better ones
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
        }
    }
}
Answer 3:
Rather than returning an array of all the files' MD5 hashes, do it this way:
- Have a single 'calculateFileHash()' method.
- Create an array of filenames to test for.
Do this:
var dupes = Filenames.GroupBy(fn => calculateFileHash(fn)).Where(gr => gr.Count() > 1);
This returns a lazily evaluated sequence of groups (not an array). Each group is an enumerable containing the file names whose content is identical, judged by their matching hashes.
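The two steps above can be sketched as follows. This is only an illustration: the class name DuplicateFinder and the method FindDuplicateGroups are made up here, and calculateFileHash is assumed to produce the same lowercase hex MD5 as the code in the question. Because GroupBy is deferred, the sketch calls ToList() so the files are hashed once rather than every time the result is enumerated.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateFinder
{
    // Hypothetical helper: MD5 of one file as a lowercase hex string.
    public static string calculateFileHash(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
        }
    }

    // Groups file names by hash and keeps only groups with more than one file.
    // ToList() materializes the query so each file is hashed only once.
    public static List<IGrouping<string, string>> FindDuplicateGroups(IEnumerable<string> filenames)
    {
        return filenames
            .GroupBy(fn => calculateFileHash(fn))
            .Where(gr => gr.Count() > 1)
            .ToList();
    }
}
```

A call such as DuplicateFinder.FindDuplicateGroups(Filenames) would then give one group per shared hash, with the hash available as the group's Key.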
Answer 4:
var duplicates = md5_val.GroupBy(x => x).Where(x => x.Count() > 1).Select(x => x.Key);
That will give you a list of hashes that are duplicated within the array.
To get the file names instead of the hashes:
// Assumes files is indexable (e.g. string[]) and in the same order as md5_val
var duplicates = md5_val.Select((x, i) => new Tuple<string, int>(x, i))
    .GroupBy(x => x.Item1)
    .Where(g => g.Count() > 1)
    .SelectMany(g => g.Select(x => files[x.Item2]));
Answer 5:
private void calcMD5(String[] filePathes) //Array contains a path of all files
{
    // A list of (hash, path) pairs is used rather than a Dictionary, because
    // Dictionary.Add throws an ArgumentException on the second file with the same hash.
    var hashToFilePathes = new List<KeyValuePair<String, String>>();
    foreach (string file_name in filePathes)
    {
        using (var md5 = MD5.Create())
        {
            using (var stream = File.OpenRead(file_name))
            {
                // Each pair has the MD5 hash as key and the file path as value
                string hash = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
                hashToFilePathes.Add(new KeyValuePair<String, String>(hash, file_name));
            }
        }
    }
    // Here will be all duplicates
    List<String> listOfDuplicates = hashToFilePathes.GroupBy(e => e.Key).Where(e => e.Count() > 1).SelectMany(e => e).Select(e => e.Value).ToList();
}
Source: https://stackoverflow.com/questions/15133970/get-duplicate-file-list-by-computing-their-md5