Question
I have text files larger than 20 MB in which some lines contain a * at some position. Each such line acts as a prefix pattern, and every other position in the file that matches it should be removed (e.g. 700670* should remove all positions from 70067000000 to 70067099999). First I build a list of the positions to remove. The code is:
Parallel.ForEach(List, (pos) =>
{
    if (pos.IndexOf("*") != -1)
    {
        var lineWithStar = pos.Substring(0, pos.IndexOf("*"));
        var result = from single in List
                     where single.Substring(0, lineWithStar.Length) == lineWithStar
                     select single;
        listWithPositionsToDel.AddRange(result.Skip(1).ToList());
    }
});
It takes ages to get the result.
I need to remove the line "123456" from the input file, i.e. everything that matches 123*:
123*
123456
1245
E.g. the source is:
700204*
700205100614136*
7002041232323234332
700205100662305*
7002051006141362332
7002051006623443904
700205100667271*
700205120015472
The result should look like:
700204*
700205100614136*
700205100662305*
7002051006623443904
700205100667271*
700205120015472*
Answer 1:
You have a nested loop, which is what hurts your performance. You are also doing a lot of extra string and list allocations.
I would do it this way: go through the file once to find all the patterns whose matching lines need to be removed. Then iterate a second time and, for every line, decide immediately whether to remove it or keep it. You can then either build a new list of the lines to keep, write them directly to a new file, or collect the items to be removed in a separate collection. Something like this:
// Requires: using System; using System.Collections.Generic;
//           using System.Collections.Concurrent; using System.Threading.Tasks;
var linePatternsToRemove = new List<String>();
var resultList = new ConcurrentBag<String>();

// First pass: collect the prefix in front of each asterisk.
foreach (var line in List)
{
    var asteriskIndex = line.IndexOf('*');
    if (asteriskIndex != -1)
    {
        linePatternsToRemove.Add(line.Substring(0, asteriskIndex));
    }
}

// Second pass: decide for every line whether to keep it.
Parallel.ForEach(List, currentLine =>
{
    Boolean needDeleteLine = false;
    foreach (var pattern in linePatternsToRemove)
    {
        // Ordinal comparison avoids the cost of culture-sensitive matching.
        if (currentLine.StartsWith(pattern, StringComparison.Ordinal))
        {
            // If the line starts with a pattern like "700204", it may be the pattern line itself ("700204*"), which we don't need to delete,
            // or it may be a regular line like "70020412", which we do need to delete.
            if (currentLine.Length > pattern.Length && currentLine[pattern.Length] != '*')
            {
                needDeleteLine = true;
                break;
            }
        }
    }
    if (!needDeleteLine)
        resultList.Add(currentLine);
});
Update: You probably won't need Parallel.ForEach; a plain, simple loop will likely be fast enough. But if you do need parallelism, you should use a thread-safe collection for the results.
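As a rough illustration of the non-parallel variant, here is a minimal sketch that keeps the original line order. The class and method names (SequentialFilter, Filter, IsCovered) are hypothetical, and the sketch assumes that every line containing * is itself kept, which is slightly more lenient than the check in the code above.

```csharp
using System;
using System.Collections.Generic;

public static class SequentialFilter
{
    // Returns the lines to keep, in their original order.
    public static List<string> Filter(IReadOnlyList<string> lines)
    {
        // Pass 1: collect the prefix in front of each '*'.
        var prefixes = new List<string>();
        foreach (var line in lines)
        {
            int star = line.IndexOf('*');
            if (star != -1)
                prefixes.Add(line.Substring(0, star));
        }

        // Pass 2: keep pattern lines (assumption) and every line
        // that no collected prefix covers.
        var kept = new List<string>();
        foreach (var line in lines)
        {
            if (line.IndexOf('*') != -1 || !IsCovered(line, prefixes))
                kept.Add(line);
        }
        return kept;
    }

    // True when some prefix is a proper prefix of the line.
    private static bool IsCovered(string line, List<string> prefixes)
    {
        foreach (var prefix in prefixes)
        {
            if (line.Length > prefix.Length &&
                line.StartsWith(prefix, StringComparison.Ordinal))
                return true;
        }
        return false;
    }
}
```

Calling SequentialFilter.Filter on the eight source lines from the question drops 7002041232323234332 and 7002051006141362332 and returns the remaining six lines in file order. The cost is still proportional to (number of lines) x (number of patterns), so this only helps when the pattern count is small.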
Update 2: I changed the code to reflect the new information. Be aware that with a parallel loop the output collection will be out of order. Performance will also depend heavily on the number of patterns in the file. If you have a large number of patterns, a more sophisticated solution is needed to test every line against all of them. A tree (e.g. a prefix tree) would probably be a good option in that case.
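To make the tree idea concrete, here is a minimal sketch of a prefix tree (trie), assuming the patterns are the substrings in front of each *. The PrefixTrie class and its Add and Covers members are hypothetical names, not part of any library. Each lookup walks at most one node per character of the line, so its cost no longer grows with the number of patterns.

```csharp
using System;
using System.Collections.Generic;

public sealed class PrefixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsPatternEnd; // a stored prefix ends at this node
    }

    private readonly Node _root = new Node();

    // Stores one prefix, e.g. "700204" for the pattern line "700204*".
    public void Add(string prefix)
    {
        var node = _root;
        foreach (char c in prefix)
        {
            if (!node.Children.TryGetValue(c, out var next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsPatternEnd = true;
    }

    // True when some stored prefix is a proper prefix of the line,
    // i.e. the line is strictly longer than the matching prefix.
    public bool Covers(string line)
    {
        var node = _root;
        for (int i = 0; i < line.Length; i++)
        {
            // A prefix of length i matched and characters remain.
            if (node.IsPatternEnd) return true;
            if (!node.Children.TryGetValue(line[i], out node)) return false;
        }
        return false;
    }
}
```

In use, the first pass would call Add for every prefix, and the second pass would keep a line when it contains * or when Covers returns false for it. Whether pattern lines that fall under a shorter pattern should also be dropped is left open here, as it is in the question.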
Source: https://stackoverflow.com/questions/49916696/increasing-speed-of-matching-strings-in-list