Deserializing large files using Json.NET

二次信任 提交于 2019-12-12 03:39:16

问题


I am trying to process a very large amount of data (~1000 seperate files, each of them ~30 MB) in order to use as input to the training phase of a machine learning algorithm. Raw data files formatted with JSON and I deserialize them using JsonSerializer class of Json.NET. Towards the end of the program, Newtonsoft.Json.dll throwing 'OutOfMemoryException' error. Is there a way to reduce the data in memory, or do I have to change all of my approach (such as switching to a big data framework like Spark) to handle this problem?

public static List<T> DeserializeJsonFiles<T>(string path)
{
    if (string.IsNullOrWhiteSpace(path))
        return null;

    var jsonObjects = new List<T>();
    //var sw = new Stopwatch();
    try
    {
        //sw.Start();
        foreach (var filename in Directory.GetFiles(path))
        {
            using (var streamReader = new StreamReader(filename))
            using (var jsonReader = new JsonTextReader(streamReader))
            {
                jsonReader.SupportMultipleContent = true;
                var serializer = new JsonSerializer();

                while (jsonReader.Read())
                {
                    if (jsonReader.TokenType != JsonToken.StartObject)
                        continue;

                    var jsonObject = serializer.Deserialize<dynamic>(jsonReader);

                    var reducedObject = ApplyFiltering(jsonObject) //return null if the filtering conditions are not met 
                    if (reducedObject == null)
                        continue;

                    jsonObject = reducedObject;
                    jsonObjects.Add(jsonObject);
                }
            }
        }    
        //sw.Stop();
        //Console.WriteLine($"Elapsed time: {sw.Elapsed}, Elapsed mili: {sw.ElapsedMilliseconds}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error: {ex}")
        return null;
    }

    return jsonObjects;
}

Thanks.


回答1:


It's not really a problem with Newtonsoft. You are reading all of these objects into one big list in memory. It gets to a point where you ask the JsonSerializer to create another object and it fails.

You need to return IEnumerable<T> from your method, yield return each object, and deal with them in the calling code without storing them in memory. That means iterating the IEnumerable<T>, processing each item, and writing to disk or wherever they need to end up.



来源:https://stackoverflow.com/questions/42899111/deserializing-large-files-using-json-net

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!