Length of an XML file

☆樱花仙子☆ 提交于 2019-12-11 03:19:50

问题


I have an XML file of size 31 GB. I need to find the total number of lines in that file. I know the command wc -l will give me the same. However it's taking too long to perform this operation. Is there any faster mechanism to find the number of lines in a large file?


回答1:


31 gigs is a really big text file. I bet it would compress down to about 1.5 gigs. I would create these files in a compressed format to begin with then you can stream a decompressed version of the file through wc. This will greatly reduce the amount of i/o and memory used to process this file. gzip can read and write compressed streams.

But I would also make the following comments:

  • Line numbers are not really that informative for XML as whitespace between elements is ignored (except for mixed content). What do you really want to know about the dataset? I bet counting elements would be more useful.
  • Make sure your xml file is not unnecessarily redunant, for example are you repeating the same namespace declarations all over the document?
  • Perhaps XML is not the best way to represent this document, if it is try looking into something like Fast Infoset



回答2:


if all you need is the line count, wc -l will be as fast as anything else.

The problem is the 31GB text file.




回答3:


If accuracy isn't an issue, find the average line length and divide the file size by that. That way you can get a really fast approximation. (make sure to consider the character encoding used)




回答4:


This falls beyond the point where the code should be refactored to avoid your problem entirely. One way to do this is to place all of the data in the file into a tuple store database instead. Apache couchDB and Intersystems Cache are two systems that you could use for this, and will be far better optimized for the type of data you're dealing with.

If you're really stuck with the xml file, then another option is to count all the lines ahead of time and cache this value. Each time a line is added or removed from the file, you can add or subtract one from the file. Also, make sure to use a 64 bit integer since there may be more than 2^32 lines.




回答5:


No, not really. wc is going to be pretty well optimized. 31GB is a lot of data, and reading it in to count lines is going to take a while no matter what program you use.

Also, this question isn't really appropriate for Stack Overflow, as it's not about programming at all.




回答6:


Isn't counting lines pretty uncertain since in XML newline is basically just a cosmetic thing? It would probably be better to count the number of occurrences of a specific tag.



来源:https://stackoverflow.com/questions/754899/length-of-an-xml-file

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!