Updating a line in a large text file using Scala

Posted by 拟墨画扇 on 2020-01-05 12:32:36

Question


I have a large text file (around 43 GB) in .ttl format that contains triples of the form:

<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://la.dbpedia.org/resource/Mahatma_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .

I want to find the fastest way to update a specific line in the file without rewriting all of the text that follows it, either by updating the line in place or by deleting it and appending the new version to the end of the file.

To access a specific line, I use this code:

import scala.io.Source
val lines = Source.fromFile("text.txt").getLines()
// Skip the first 10,000,000 lines, then take the next one.
val targetLine = lines.drop(10000000).next()

Answer 1:


If you want to stay with text files, consider giving every line (record) a fixed length.

This way you can use a RandomAccessFile to seek to the exact position of any line by its number: seek to the byte offset lineNumber * recordSize, then overwrite the record there.

This does not really help if you have to insert a new line in the middle. Other limitations: the file will grow (because every record is padded to the fixed length), and sooner or later there will be a record that is too long for the chosen size.
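The seek-and-overwrite idea above can be sketched roughly as follows. This is only an illustration under some assumptions: the object name FixedRecordFile, the method names and the recordSize parameter are made up for this example, records are assumed to be UTF-8, padded with spaces and terminated by a single \n, and line numbers are zero-based.

```scala
import java.io.{File, RandomAccessFile}
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of any library.
object FixedRecordFile {

  /** Reads the record at zero-based `index` and strips the newline and padding. */
  def readRecord(file: File, index: Long, recordSize: Int): String = {
    val raf = new RandomAccessFile(file, "r")
    try {
      val buf = new Array[Byte](recordSize)
      raf.seek(index * recordSize) // jump straight to the record
      raf.readFully(buf)
      new String(buf, StandardCharsets.UTF_8).stripLineEnd.replaceAll(" +$", "")
    } finally raf.close()
  }

  /** Overwrites the record at zero-based `index` without touching any other record. */
  def writeRecord(file: File, index: Long, recordSize: Int, content: String): Unit = {
    val bytes = content.getBytes(StandardCharsets.UTF_8)
    // One byte of every record is reserved for the terminating '\n'.
    require(bytes.length <= recordSize - 1, s"content does not fit into $recordSize bytes")
    val record = Array.fill[Byte](recordSize)(' '.toByte) // space padding
    System.arraycopy(bytes, 0, record, 0, bytes.length)
    record(recordSize - 1) = '\n'.toByte
    val raf = new RandomAccessFile(file, "rw")
    try {
      raf.seek(index * recordSize)
      raf.write(record)
    } finally raf.close()
  }
}
```

The length check uses the encoded byte count rather than the character count, because seek works on byte offsets and a non-ASCII character takes more than one byte in UTF-8.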

As for the initial conversion:

  • Find the maximum line length in the current file, then add some headroom, for example 10%.
  • Then convert the file once: read each line from the text file and write it out as a fixed-size record.
  • You could use a special character like | to separate the fields. If possible, use something like ;, so you get a .csv file.
  • I suggest padding the remaining space with spaces, so the file still looks like a text file that you can parse with shell utilities.
  • You could use a \n to terminate each record.

For example

http://x.com|http://x.com|http://x.com|...\n

or

http://x.com;http://x.com;http://x.com;...\n

Here each . at the end stands for a space character, so the file is still largely compatible with a "normal" text file.
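The one-time conversion described above might look like the sketch below. It rests on a few assumptions: the names FixedRecordConverter, chooseRecordSize and convert are invented for this example, the input is assumed to be UTF-8 and non-empty, and each line is copied unchanged (the fields are not re-separated with | or ;) to keep the example short. The file is read twice, once to measure the longest line and once to write the padded records.

```scala
import java.io.{BufferedOutputStream, File, FileOutputStream}
import java.nio.charset.StandardCharsets
import scala.io.Source

// Hypothetical helper, not part of any library.
object FixedRecordConverter {

  /** First pass: longest line in bytes, plus about 10% headroom, plus 1 byte for '\n'. */
  def chooseRecordSize(input: File): Int = {
    val src = Source.fromFile(input, "UTF-8")
    try {
      val maxBytes = src.getLines().map(_.getBytes(StandardCharsets.UTF_8).length).max
      maxBytes + (maxBytes + 9) / 10 + 1 // integer arithmetic, headroom rounded up
    } finally src.close()
  }

  /** Second pass: writes every line as a space-padded, '\n'-terminated record. */
  def convert(input: File, output: File, recordSize: Int): Unit = {
    val src = Source.fromFile(input, "UTF-8")
    val out = new BufferedOutputStream(new FileOutputStream(output))
    try {
      for (line <- src.getLines()) {
        val bytes = line.getBytes(StandardCharsets.UTF_8)
        require(bytes.length < recordSize, s"line too long for record size $recordSize")
        val record = Array.fill[Byte](recordSize)(' '.toByte)
        System.arraycopy(bytes, 0, record, 0, bytes.length)
        record(recordSize - 1) = '\n'.toByte
        out.write(record)
      }
    } finally {
      out.close()
      src.close()
    }
  }
}
```

Both passes stream the input line by line, so memory use stays small even for a file of tens of gigabytes. The output will be larger than the input, since every record takes the full recordSize bytes.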


On the other hand, looking at your data, consider using a key-value data store such as Redis: you could use the line number or the first URL as the key.



Source: https://stackoverflow.com/questions/17739973/updating-line-in-large-text-file-using-scala
