How do I know if a file is tab or space delimited in Perl?

Submitted by 限于喜欢 on 2019-12-24 00:33:26

Question


I am uploading a file to a Perl program from an HTML page. After the file has been uploaded I want to determine whether the file is either space or tab delimited and whether all the values are integers. If this is not the case then I want to output some message.

I was thinking of reading every character of the file and checking if it's an integer. If it fails then I'll show the output message. Is there a better way to do this?

I checked a few examples and can read the whole file line by line, but how can I read each character in that line? Should I be splitting on space or tab, as the file can be either?


Answer 1:


It's easy enough to split on both spaces and tabs:

my @fields = split /[ \t]/, $line;

but if it has to be only one or the other, and you don't know which ahead of time, that's a little trickier. If you know how many columns there should be in the input, you can try counting the number of spaces and the number of tabs on each line and seeing if there are the right number of separators. E.g. if there are supposed to be 5 columns and you see 4 tabs on each line, it's a good bet that the user is using tabs as separators. If neither one matches up, return an error.
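The counting idea above could be sketched roughly as follows. The sub name `guess_delimiter` and its `$expected_columns` parameter are hypothetical, and the sketch assumes exactly one delimiter character between adjacent fields and that the other delimiter does not appear on the line at all.

```perl
use strict;
use warnings;

# Hypothetical helper: guess whether every line uses tabs or spaces,
# given the number of columns each line is expected to have.
# Assumption: exactly one delimiter character between adjacent fields.
sub guess_delimiter {
    my ($expected_columns, @lines) = @_;
    my $wanted = $expected_columns - 1;    # separators expected per line

    my ($all_tabs, $all_spaces) = (1, 1);
    for my $line (@lines) {
        my $tabs   = ($line =~ tr/\t//);   # tr with an empty replacement only counts
        my $spaces = ($line =~ tr/ //);
        $all_tabs   = 0 unless $tabs   == $wanted && $spaces == 0;
        $all_spaces = 0 unless $spaces == $wanted && $tabs   == 0;
    }

    return "\t" if $all_tabs;
    return " "  if $all_spaces;
    return;                                # neither matched: caller reports an error
}

my @sample_lines = ("1\t2\t3", "4\t5\t6");
my $delim = guess_delimiter(3, @sample_lines);
print defined $delim ? ($delim eq "\t" ? "tab\n" : "space\n") : "unknown\n";
```

With a single expected column there are no separators to count, so the result says nothing about the delimiter in that case.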

Checking for integer values is straightforward:

for my $val ( @fields ) {
    die "'$val' is not an integer!" if $val !~ /^-?\d+$/;
}



Answer 2:


It sounds like it doesn't matter whether the file is delimited by spaces or tabs. At some point you will have to read every character of the file to validate and parse it, so why make these two separate steps? Consume integers from the file until you run into something that isn't whitespace or a valid integer, then complain (and possibly roll back).
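One possible way to do this in a single pass is to walk the text with `\G` and the `/gc` modifiers, which continue each match from where the previous one stopped. This is only a sketch: the sub name `consume_integers` and its return convention are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: consume integers from a string in one pass.
# Returns (\@integers, undef) on success, or (\@integers_so_far, $offset),
# where $offset is the position of the first token that is not an integer.
sub consume_integers {
    my ($text) = @_;
    my @ints;

    pos($text) = 0;
    while (1) {
        $text =~ /\G\s+/gc;                      # skip a run of whitespace, if any
        last if pos($text) == length $text;      # clean end of input
        if ($text =~ /\G(-?\d+)(?=\s|\z)/gc) {   # integer followed by whitespace or end
            push @ints, $1;
        } else {
            return (\@ints, pos($text));         # stop at the first bad token
        }
    }
    return (\@ints, undef);
}

my ($found, $bad_offset) = consume_integers("1\t-2 3\n4 oops 5\n");
print "parsed: @$found\n";
print "stopped at offset $bad_offset\n" if defined $bad_offset;
```

The integers collected before the failure are still returned, so the caller can decide whether to keep them or discard everything.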




Answer 3:


I am uploading a file to a Perl program from an HTML page. After the file has been uploaded I want to determine whether the file is either (space or tab delimited) and all the values are integers. If this is not the case then I want to output some message.

This condition means that your data should consist only of digits, spaces, and tab characters (basically it should be digits and spaces only, or digits and tabs only).

For this, just load the data into a variable and check if it matches. Note that the pattern below does not allow newlines, so apply it to one chomped line at a time:

$data =~ /\A[0-9 \t]+\z/;

If it matches, you have a set of integers delimited by spaces or tabs (it's not really relevant which character was used to delimit the integers).

If your next step is to extract these integers (which sounds logical), you can do it easily by:

@integers = split /[ \t]+/, $data;

or

@integers = $data =~ /(\d+)/g;
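The two extraction approaches above are not quite equivalent when the data starts with whitespace: `split` with a regex pattern keeps a leading empty field, while the global match returns only the digit runs. A small sketch with a made-up sample line:

```perl
use strict;
use warnings;

my $sample_line = "  10\t20  30";   # made-up sample with leading whitespace

my @by_split = split /[ \t]+/, $sample_line;   # ("", "10", "20", "30")
my @by_match = $sample_line =~ /(\d+)/g;       # ("10", "20", "30")

printf "split: %d fields, match: %d fields\n",
    scalar @by_split, scalar @by_match;
```

If leading whitespace is possible, either strip it first or prefer the match form.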



Answer 4:


To add to the answer, I will write a clear and simple one. This version:

  1. uses only the most basic Perl functions and constructs, so anyone who knows even a little Perl should get it quite quickly. Not to offend or anything, and there's no shame in being a newbie - I'm just trying to write something that you'll be able to understand no matter what your skill level is.
  2. accepts either tabs or spaces as a delimiter, allowing them to be mixed freely. Commented-out code will detail a trivial way to enforce an either-or throughout the entire document.
  3. prints nice error messages when it encounters bad values. Should show the illegal value and the line it appeared on.
  4. allows you to process the data however you like. I'm not going to store it in an array or anything, just put a ... at one point, and there you will add in a bit of code to do whatever processing of the data on a given line you want to perform.

So here goes:

use strict;
use warnings;

# define $filename before this, or get it from the user
open(my $data, "<", $filename)
  or die "Cannot open '$filename': $!\n";

my $whitespace = "\t ";

chomp(my @data = <$data>);

# check first line for whitespace to enforce...
#if($data[0] =~ /\t/ and $data[0] !~ / /) {
#  $whitespace = "\t";
#} elsif($data[0] =~ / / and $data[0] !~ /\t/) {
#  $whitespace = " ";
#} else {
#  warn "Warning: mixed whitespace on line 1 - ignoring whitespace.\n";
#}

foreach my $n (0 .. $#data) {
  my $line_no = $n + 1; # report 1-based line numbers
  my @fields = split(/[$whitespace]+/, $data[$n]);
  foreach my $f (@fields) {
    if($f !~ /^-?\d+$/) { # anchored, so "12-4" and "10a" are rejected
      if($f =~ /\s/) {
        warn "Warning: invalid whitespace use at line $line_no - ignoring.\n";
      } else {
        warn "Warning: invalid value '$f' at line $line_no - ignoring.\n";
      }
    } else {
      ... # do something with $f, or...
    }
  }
  ... # do something with @fields if you want to process the whole list
}

There are better, faster, more compact, and perhaps even more readable (depending on who you ask) ways to do it, but this one uses the most basic constructs, and any Perl programmer should be able to read this, regardless of skill level (okay, if you're just starting with Perl as a first language, you may not know any of it, but then you shouldn't be trying to do something like this quite yet).

EDIT: fixed my regex for matching integers. It was lazy before, and allowed "12-4", which is obviously not an integer (though it evaluates to one - but that's much more complicated (well, not really, but it's not what the OP wants (or is it? It would be a fun feature (INSERT LISP JOKE HERE)))). Thanks wisnij - I'm glad I re-read your post, since you wrote a better regex than I did.




Answer 5:


Your question isn't very clear. It sounds like you expect the data to be in this format:

123 456 789
234 567 890

In other words, each line contains one or more groups of digits, separated by whitespace. Assuming you're processing the file one line at a time as you said in the original question, I would use this regex:

/^\d+(\s+\d+)*$/

If there can be negative numbers, use this instead:

/^-?\d+(\s+-?\d+)*$/

Your regex won't match a blank line, and this one won't either. That's probably as it should be; I would expect blank lines (including lines containing nothing but whitespace) to be prohibited in a case like this. However, there could be one or more empty lines at the end of the file. That means, once you find a line that doesn't match the regex above, you should verify that each of the remaining lines has a length of zero.
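The rule described above (every line must match, except that empty lines are tolerated at the end of the file) might be sketched like this. The sub name `first_bad_line` and its return convention are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 0 if the input is valid, otherwise the
# 1-based number of the first line that breaks the rules. Empty lines
# are tolerated only at the end of the input.
sub first_bad_line {
    my (@lines) = @_;
    my $seen_blank = 0;

    for my $i (0 .. $#lines) {
        my $line = $lines[$i];
        $line =~ s/\r?\n\z//;                  # drop the line ending, if any

        if ($line eq '') {
            $seen_blank = 1;                   # fine only if no data follows
            next;
        }
        return $i + 1 if $seen_blank;          # data found after an empty line
        return $i + 1 if $line !~ /^-?\d+(\s+-?\d+)*$/;
    }
    return 0;
}

my @sample_rows = ("123 456 789\n", "234\t567\t890\n", "\n");
my $bad_row = first_bad_line(@sample_rows);
print $bad_row ? "line $bad_row is invalid\n" : "all lines valid\n";
```

Lines containing only whitespace are rejected here, because they have a nonzero length and do not match the regex.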

But I'm making a lot of assumptions here. If this isn't what you're trying to do, you'll need to give us more detailed requirements. Also, all this accomplishes is a rough validation of the format of the data. That's fine if you're merely storing the data, but if you also want to extract information, you probably should do the validation as part of that process.




Answer 6:


You could just use a regular expression. That's what Perl is famous for ;-).

Simple example:

perl -ne 'if ($_=~/^(\d+\s+)+$/){print "yep\n";}'

will only accept lines that contain only digits and whitespace. That should get you going.




Answer 7:


I assume several things about your format and desired results.

  • consecutive delimiters collapse.
  • numbers may not wrap around lines, i.e. newlines are effectively delimiters.
  • tabs and spaces in one file are ok. Either delimiter is acceptable.
  • files are small enough that processing a whole file at once will not be an issue.

Further, my code accepts any whitespace as a delimiter.

use strict;
use warnings;

# Slurp whole file into a scalar.
my $file_contents;
{   local $/;
    $/ = undef;
    $file_contents = <DATA>;
}

# Extract and validate numbers
my @ints = grep validate_integer($_), 
                split( /\s+/, $file_contents ); 
print "@ints\n";


sub validate_integer {
    my $value = shift;

    # is it an integer?
    # add additional validation here.
    if( $value =~ /^-?\d+$/ ) {
        return 1;
    }

    # die here if you want a fatal exception.
    warn "Illegal value '$value'\n";
    return;
}

__DATA__
1 -2 3 4
5 8.8
-6
    10a b c10 -99-
    8   9 98- 9-8
10 -11  12  13

This results in:

Illegal value '8.8'
Illegal value '10a'
Illegal value 'b'
Illegal value 'c10'
Illegal value '-99-'
Illegal value '98-'
Illegal value '9-8'
1 -2 3 4 5 -6 8 9 10 -11 12 13

Updates:

  • Fixed handling of negative numbers.
  • Replaced validation map with grep.
  • Switched to split instead of non-whitespace capture from re.

If you want to process the file line by line, you can wrap the grep in a loop that reads the file.
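A line-by-line variant could look roughly like the sketch below. The names `validate_integer_at` and `read_integers` are made up, the in-memory filehandle stands in for the uploaded file, and `split ' '` is swapped in for `/\s+/` so that leading whitespace on a line does not produce an empty field.

```perl
use strict;
use warnings;

# Same check as validate_integer, plus the line number in the warning.
sub validate_integer_at {
    my ($value, $line_no) = @_;
    return 1 if $value =~ /^-?\d+$/;
    warn "Illegal value '$value' on line $line_no\n";
    return;
}

# Hypothetical helper: read a filehandle line by line, keeping valid integers.
sub read_integers {
    my ($fh) = @_;
    my @kept;
    while ( my $line = <$fh> ) {
        my $line_no = $.;                        # current input line number
        push @kept, grep { validate_integer_at($_, $line_no) }
                    split ' ', $line;            # ' ' skips leading whitespace
    }
    return @kept;
}

# In-memory filehandle standing in for the uploaded file.
my $sample_text = "1 -2 3\n  4 8.8\n5\n";
open my $in, '<', \$sample_text or die "Cannot open sample: $!";
my @valid_ints = read_integers($in);
close $in;
print "@valid_ints\n";
```

Because only one line is held in memory at a time, this form also copes with files too large to slurp.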



Source: https://stackoverflow.com/questions/699253/how-do-i-know-if-a-file-is-tab-or-space-delimited-in-perl
