Ruby: Length of a line of a file in bytes?

Submitted by 流过昼夜 on 2019-12-01 13:49:53

IO#gets does return the line terminator as part of each line (that is what #chomp is for), whether it is reading from the command line, a File, or any other subclass of IO. What it returns is not necessarily what is on disk, though: on Windows, a file opened in text mode has each \r\n translated to a single \n, so the numbers are not going to match up there.

See the relevant Pickaxe section

May I enquire why you're so concerned about the line lengths summing to the file size? You may be solving a harder problem than is necessary...

Aha. I think I get it now.

Lacking a handy iPod (or any other sort, for that matter), I don't know if you want exactly 4K chunks, in which case IO#read(4000) would be your friend (4000 or 4096?) or if you're happier to break by line, in which case something like this ought to work:

class Chunkifier
  def self.to_chunks(path)
    chunks = [""]
    File.readlines(path).each do |line|
      line.chomp! # strips off \n, \r or \r\n depending on OS
      # bytesize counts bytes; size would count characters on Ruby 1.9+
      if chunks.last.bytesize + line.bytesize >= 4_000 # 4096?
        chunks.last.chomp! # remove last line terminator
        chunks << ""
      end
      chunks.last << line + "\n" # or whatever terminator you need
    end
    chunks
  end
end

if __FILE__ == $0
  require 'test/unit'
  class TestFile < Test::Unit::TestCase
    def test_chunking
      chs = Chunkifier.to_chunks(PATH) # PATH: set this to the file you want to chunk
      chs.each do |chunk|
        assert 4_000 >= chunk.bytesize, "chunk is #{chunk.bytesize} bytes long"
      end
    end
  end
end

Note the use of IO#readlines to get all the text in one slurp: #each or #each_line would do as well. I used String#chomp! to ensure that, whatever the OS is doing, the bytes at the end of each line are removed, so that \n (or whatever terminator you need) can be forced into the output.

I would suggest using File#write, rather than #print or #puts, for the output, as the latter have a tendency to deliver OS-specific newline sequences. Opening the output file in binary mode ("wb") also stops Ruby translating \n on Windows.
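If exact fixed-size chunks are what you want instead, a minimal sketch with IO#read might look like this. The helper name fixed_size_chunks and the 4096-byte default are my own assumptions, not something from the question; binary mode is used so that no newline translation happens on Windows.

```ruby
require "tempfile"

# Hypothetical helper: split a file into chunks of at most chunk_size bytes.
# "rb" opens the file in binary mode, so \r\n is never translated.
def fixed_size_chunks(path, chunk_size = 4096)
  chunks = []
  File.open(path, "rb") do |file|
    # IO#read(n) returns nil at end of file, which ends the loop.
    while (chunk = file.read(chunk_size))
      chunks << chunk
    end
  end
  chunks
end

# Usage with a throwaway 10,000-byte file.
Tempfile.create("chunks") do |tmp|
  tmp.binmode
  tmp.write("x" * 10_000)
  tmp.flush
  puts fixed_size_chunks(tmp.path).map(&:bytesize).inspect
end
```

Bear in mind that cutting at a fixed byte offset can split a multi-byte character across two chunks, which is one reason to prefer breaking by line.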

If you're really concerned about multi-byte characters, consider taking the each_byte or unpack("C*") options and monkey-patching String, something like this:

class String
  def size_in_bytes
    self.unpack("C*").size
  end
end

The unpack version is about 8 times faster than the each_byte one on my machine, btw.
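As a side note, and assuming Ruby 1.9 or later, String#bytesize gives the same count without any monkey-patching. A small sketch comparing the three numbers (the sample string is just an illustration):

```ruby
# "\u00EF" is a single character that takes two bytes in UTF-8.
text = "na\u00EFve\n"

chars    = text.length              # number of characters
bytes    = text.bytesize            # number of bytes
unpacked = text.unpack("C*").size   # same byte count via unpack

puts "chars=#{chars} bytes=#{bytes} unpacked=#{unpacked}"
```

The character count and the byte count differ by one here because of the single two-byte character.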

You might try IO#each_byte, e.g.

total_bytes = 0
file_name = "test_this.txt"
File.open(file_name, "r") do |file|
  file.each_byte {|b| total_bytes += 1}
end
puts "Original size #{File.size(file_name)}"
puts "Total bytes #{total_bytes}"

That, of course, doesn't give you a line at a time. Your best option for that is probably to go through the file via each_byte until you encounter \r\n. The IO class provides a bunch of pretty low-level read methods that might be helpful.
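One way to get a per-line byte count without walking the file byte by byte is to open it in binary mode and ask each line for its bytesize. This is only a sketch: the helper name line_byte_lengths is made up for illustration, and it assumes Ruby 1.9+ (for bytesize) and lines terminated by \n or \r\n.

```ruby
require "tempfile"

# Hypothetical helper: byte length of every line, terminator included.
def line_byte_lengths(path)
  lengths = []
  File.open(path, "rb") do |file|   # "rb": no newline translation
    file.each_line { |line| lengths << line.bytesize }
  end
  lengths
end

# Usage: a file mixing Windows and Unix terminators, no trailing newline.
Tempfile.create("lines") do |tmp|
  tmp.binmode
  tmp.write("one\r\ntwo\nthree")
  tmp.flush
  lengths = line_byte_lengths(tmp.path)
  puts lengths.inspect
  puts lengths.sum == File.size(tmp.path)
end
```

Because nothing is translated or chomped, the per-line lengths add up to File.size.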

You potentially have several overlapping issues here:

  1. Linefeed characters \r\n vs. \n (as per your previous post). Also the end-of-file character (^Z)?

  2. Definition of "size" in your problem statement: do you mean "how many characters" (taking into account multi-byte character encodings) or do you mean "how many bytes"?

  3. Interaction of the $KCODE global variable (deprecated in Ruby 1.9; see String#encoding and friends if you're running under 1.9). Are there, for example, accented characters in your file?

  4. Your format string for #unpack. I think you want C* here if you really want to count bytes.

Note also the existence of IO#each_line (just so you can throw away the while and be a little more ruby-idiomatic ;-)).

The issue is that when you save a text file on Windows, each line break is two characters (13 and 10, i.e. \r\n) and therefore 2 bytes, whereas on Linux it is only one (character 10). However, on Windows Ruby reports both of these as a single character \n, i.e. character 10. What's worse, if you're on Linux with a Windows file, Ruby will give you both characters.

So, if you know that your files always come from Windows and are always processed on Windows, you can add 1 to your count every time you get a newline character. Otherwise it's a couple of conditionals and a little state machine.
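The "couple of conditionals" could look roughly like the sketch below. It assumes the lines were read in binary mode, so that any \r is still present; terminator_bytes is a hypothetical name of my own.

```ruby
# Hypothetical helper: how many bytes of line terminator does this line end with?
def terminator_bytes(line)
  if line.end_with?("\r\n")
    2   # Windows-style terminator
  elsif line.end_with?("\n", "\r")
    1   # Unix-style (or a lone carriage return)
  else
    0   # e.g. the last line of a file with no trailing newline
  end
end

lines = ["windows\r\n", "unix\n", "no terminator"]
lines.each do |line|
  term = terminator_bytes(line)
  puts "#{line.inspect}: #{line.bytesize - term} content bytes, #{term} terminator bytes"
end
```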

BTW there's no EOF 'character'.

f = File.new("log.txt")
begin
    while (line = f.readline)
        line.chomp!   # plain chomp returns a copy and leaves line unchanged
        puts line.length
    end
rescue EOFError   # readline raises EOFError at end of file
    f.close
end

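Here is a simple solution, presuming that the current file pointer is set to the start of a line in the file being read. The difference between the file position before and after #gets is the length of that line in bytes, terminator included; the seek then puts the pointer back where it started:

    last_pos = file.pos
    next_line = file.gets
    current_pos = file.pos
    backup_dist = last_pos - current_pos   # negative: -(line length in bytes)
    file.seek(backup_dist, IO::SEEK_CUR)   # rewind to the start of the line

In this example "file" is the file from which you are reading. To do this in a loop: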

    last_pos = file.pos
    while (next_line = file.gets)
        current_pos = file.pos
        line_bytes = current_pos - last_pos   # length of next_line in bytes
        last_pos = current_pos
        # no seek back here, or each line would be read twice
    end
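The position-difference idea can also be wrapped in a small helper that measures the next line and then restores the pointer. This is a sketch only: next_line_bytes is a name I made up, and the file is opened in binary mode on the assumption that you want the on-disk byte count.

```ruby
require "tempfile"

# Hypothetical helper: byte length of the next line, terminator included,
# leaving the file pointer where it was.
def next_line_bytes(file)
  start = file.pos
  line = file.gets
  return 0 if line.nil?            # already at end of file
  length = file.pos - start
  file.seek(start, IO::SEEK_SET)   # put the pointer back
  length
end

Tempfile.create("peek") do |tmp|
  tmp.binmode
  tmp.write("first\r\nsecond\n")
  tmp.flush
  File.open(tmp.path, "rb") do |file|
    puts next_line_bytes(file)     # measures the first line
    puts file.gets.inspect         # the same line is still there to be read
  end
end
```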