Parallel tasks performance in C#

Submitted by 房东的猫 on 2021-02-11 17:25:42

Question


I need to make my tasks run faster. I have tried a semaphore, the Parallel library, and raw threads (one thread per work item, which I know is the dumbest thing to do), but none of them gives the performance I need. I'm not familiar with threading, and I need some help to find the right approach and to understand how tasks and threads work.

Here is the function:

 public class Test
    {
        public void openThreads()
        {
            int maxConcurrency = 500;
            var someWork = get_data_from_database();
            using (SemaphoreSlim concurrencySemaphore = new SemaphoreSlim(maxConcurrency))
            {
                List<Task> tasks = new List<Task>();
                foreach (var work in someWork)
                {
                    concurrencySemaphore.Wait();

                    var t = Task.Factory.StartNew(() =>
                    {
                        try
                        {
                            ScrapThings(work);
                        }
                        finally
                        {
                            concurrencySemaphore.Release();
                        }
                    });

                    tasks.Add(t);
                }

                Task.WaitAll(tasks.ToArray());
            }
        }

        public async Task ScrapThings(Object work)
        {
            HttpClient client = new HttpClient();
            Encoding utf8 = Encoding.UTF8;
            var response = client.GetAsync(work.url).Result;
            var buffer = response.Content.ReadAsByteArrayAsync().Result;
            string content = utf8.GetString(buffer);
            /*
             Do some parse operations, load html document, get xpath, split things, etc 
             */

            while(true) // this loop runs from 1~15 times
            {
                response = client.GetAsync(work.anotherUrl).Result;
                buffer = response.Content.ReadAsByteArrayAsync().Result;
                content = utf8.GetString(buffer);
                if (content == "OK")
                    break;

                await Task.Delay(10000); //I need some throttle here before it tries again
            }
            /*
                Do some parse operations, load html document, get xpath, split things, etc 
                */
            update_things_in_database();
        }
    }

I want to run this task 500 times in parallel. The whole operation takes 18 hours to complete and I need to reduce that. I'm using a Xeon with 32 cores/64 threads. I tried opening 500 threads (which performed better than the semaphore and the Parallel library), but it doesn't feel like the right way to do it.


Answer 1:


I would say the performance problem is not in how you run your threads but in how the individual threads perform. Depending on the version of .NET and the libraries you are using, there are a few possible issues.

  1. You should reuse HttpClient instances, for reasons explained here, for example.
  2. If work.url and work.anotherUrl point to the same subset of domains, you should look into the connection limit per endpoint (and the total limit as well). Depending on the version, this is either HttpClientHandler.MaxConnectionsPerServer, or ServicePoint.ConnectionLimit together with ServicePointManager.DefaultConnectionLimit. The former is for .NET Core and the latter for the full .NET Framework, where the default for a non-ASP.NET application is 2 connections per endpoint, so most of 500 concurrent requests to one host would wait in a queue.

The recommended approach to solving the first issue is to use IHttpClientFactory; there is some more info here.
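As a minimal sketch of point 1 for an application that does not use dependency injection, a single shared HttpClient can be held in a static field. The class name Scraper and the helper FetchAsync are hypothetical names chosen for illustration, and the sketch awaits the calls instead of blocking on .Result, which is a change from the code in the question.

```csharp
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class Scraper
{
    // One HttpClient for the whole process. GetAsync is safe to call
    // concurrently, and the shared instance reuses pooled connections
    // instead of opening new sockets for every request.
    public static readonly HttpClient Client = new HttpClient();

    // Hypothetical helper: downloads a page and decodes it as UTF-8,
    // like the first lines of ScrapThings, but through the shared client.
    public static async Task<string> FetchAsync(string url)
    {
        using (HttpResponseMessage response =
            await Client.GetAsync(url).ConfigureAwait(false))
        {
            byte[] buffer = await response.Content
                .ReadAsByteArrayAsync().ConfigureAwait(false);
            return Encoding.UTF8.GetString(buffer);
        }
    }
}
```

Inside ScrapThings this would be called as `string content = await Scraper.FetchAsync(work.url);`, assuming work.url is a string.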

Update

You mentioned in the comments that you are using .NET 4.7.2, so I would suggest starting by adding the following lines to your application (somewhere at startup):

ServicePointManager.DefaultConnectionLimit = 500;
// if you can get a collection of the most scraped domains:
var domains = new [] { "http://slowwly.robertomurray.co.uk" };
foreach(var d in domains)
{
    var delayServicePoint = ServicePointManager.FindServicePoint(new Uri(d));
    delayServicePoint.ConnectionLimit = 10; // or bigger
}
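
For readers on .NET Core, where point 2 names HttpClientHandler.MaxConnectionsPerServer instead, a rough counterpart of the snippet above might look like this. HttpSetup and CreateHandler are hypothetical names, and the value 50 in the usage line is an arbitrary assumption. As far as I know the .NET Core default for this property is int.MaxValue, so there it is mostly useful for capping the load on a single host.

```csharp
using System.Net.Http;

public static class HttpSetup
{
    // Hypothetical factory: builds a handler that allows at most
    // maxConnectionsPerServer simultaneous connections to any one host.
    public static HttpClientHandler CreateHandler(int maxConnectionsPerServer)
    {
        return new HttpClientHandler
        {
            MaxConnectionsPerServer = maxConnectionsPerServer
        };
    }
}
```

The handler is then passed to the one shared client, for example `var client = new HttpClient(HttpSetup.CreateHandler(50));`.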



Answer 2:


This sounds like a job for the TPL Dataflow library. You probably need different concurrency levels for the I/O-bound operations (web requests, database updates) and the CPU-bound operations (parsing the data). TPL Dataflow lets you build a pipeline in which each block is responsible for a single operation and the data flows from one block to the next. It even allows cyclic graphs, so you can, for example, send a failed data element back into a block to be processed again.

For some examples of using this library, look here, here or here.

The TPL Dataflow library is built into .NET Core and is available as a package (System.Threading.Tasks.Dataflow) for .NET Framework.
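
A minimal sketch of such a pipeline, under stated assumptions, could look like the following. ScrapePipeline and RunAsync are hypothetical names; the three delegates stand in for the question's web requests, parsing, and update_things_in_database(); and the concurrency numbers (500 and 20) are placeholders to be tuned, not recommendations.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ScrapePipeline
{
    // Sends every URL through download -> parse -> save and
    // completes when the last item has been saved.
    public static async Task RunAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> download, // I/O-bound (assumed)
        Func<string, string> parse,          // CPU-bound (assumed)
        Func<string, Task> save)             // I/O-bound (assumed)
    {
        // I/O-bound stage: high concurrency, since awaiting a request
        // does not hold a thread.
        var downloadBlock = new TransformBlock<string, string>(download,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 500 });

        // CPU-bound stage: roughly one worker per logical core.
        var parseBlock = new TransformBlock<string, string>(parse,
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = Environment.ProcessorCount
            });

        // Database stage: a smaller limit so the database is not flooded.
        var saveBlock = new ActionBlock<string>(save,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 });

        // Propagate completion so that finishing the first block
        // eventually finishes the last one.
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        downloadBlock.LinkTo(parseBlock, link);
        parseBlock.LinkTo(saveBlock, link);

        foreach (string url in urls)
            await downloadBlock.SendAsync(url);

        downloadBlock.Complete();
        await saveBlock.Completion;
    }
}
```

Each stage gets its own MaxDegreeOfParallelism, which is the point made above: the web requests can run far wider than the number of cores, while parsing stays close to the core count.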



Source: https://stackoverflow.com/questions/61957315/parallel-tasks-performance-in-c-sharp
