Reading CSV file with unescaped enclosures

99封情书 提交于 2019-12-08 13:58:12

问题


I am reading a CSV file but some of the values are not escaped so PHP is reading it wrong. Here is an example of a line that is bad:

" 635"," ","AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in a steep sided river valley, possibly North Wales, signed and dated 2000, framed, 66cm by 48cm. another of a rural landscape, titled verso "Harvest Time, Somerset" signed and dated '87, framed, 69cm by 49cm. (2) NB - Aubrey Phillips is a Worcestershire artist who studied at the Stourbridge School of Art.","40","60","WAT","Paintings, prints and watercolours",

You can see Harvest Time, Somerset has quotes around it, causing PHP to think its a new value.

When i do print_r() on each line, the broken lines end up looking like this:

Array
(
    [0] =>  635
    [1] =>  
    [2] => AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in a steep sided river valley, possibly North Wales, signed and dated 2000, framed, 66cm by 48cm. another of a rural landscape, titled verso Harvest Time
    [3] => Somerset" signed and dated '87
    [4] => framed
    [5] => 69cm by 49cm. (2)  NB - Aubrey Phillips is a Worcestershire artist who studied at the Stourbridge School of Art."
    [6] => 40
    [7] => 60
    [8] => WAT
    [9] => Paintings, prints and watercolours
    [10] => 
)

Which is obviously wrong, as it now contains many more array elements than other correct rows.

Here is the PHP i am using:

$i = 1;
if (($file = fopen($this->request->data['file']['tmp_name'], "r")) !== FALSE) {
    while (($row = fgetcsv($file, 0, ',', '"')) !== FALSE) {
        if ($i == 1){
            $header = $row;
        }else{
            if (count($header) == count($row)){
                $lots[] = array_combine($header, $row);
            }else{
                $error_rows[] = $row;
            }

        }
        $i++;
    }
    fclose($file);
}

Rows with the wrong amount of values get put into $error_rows and the rest get put into a big $lots array.

What can I do to get around this? Thanks.


回答1:


If you know that you'll always get entries 0 and 1, and that the last 5 entries in the array are always correct, so it's just the descriptive entry that's "corrupted" because of unescaped enclosure characters, then you could extract the first 2 and last 5 using array_slice(), implode() the remainder back into a single string (restoring the lost quotes), and rebuild the array correctly.

$testData = '" 635"," ","AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in a steep sided river valley, possibly North Wales, signed and dated 2000, framed, 66cm by 48cm. another of a rural landscape, titled verso "Harvest Time, Somerset" signed and dated \'87, framed, 69cm by 49cm. (2) NB - Aubrey Phillips is a Worcestershire artist who studied at the Stourbridge School of Art.","40","60","WAT","Paintings, prints and watercolours",';

$result = str_getcsv($testData, ',', '"');

$hdr = array_slice($result,0,2);
$bdy = array_slice($result,2,-5);
$bdy = trim(implode('"',$bdy),'"');
$ftr = array_slice($result,-5);

$fixedResult = array_merge($hdr,array($bdy),$ftr);
var_dump($fixedResult);

result is:

array
  0 => string ' 635' (length=4)
  1 => string ' ' (length=1)
  2 => string 'AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in a steep sided river valley, possibly North Wales, signed and dated 2000, framed, 66cm by 48cm. another of a rural landscape, titled verso Harvest Time" Somerset" signed and dated '87" framed" 69cm by 49cm. (2) NB - Aubrey Phillips is a Worcestershire artist who studied at the Stourbridge School of Art.' (length=362)
  3 => string '40' (length=2)
  4 => string '60' (length=2)
  5 => string 'WAT' (length=3)
  6 => string 'Paintings, prints and watercolours' (length=34)
  7 => string '' (length=0)

Not perfect, but possibly good enough

The alternative is to get whoever is generating the csv to properly escape their enclosures




回答2:


If you can ecape the " in the text like this: \"

and the in fgetcsv use specify th escape char

fgetcsv($file, 0, ',', '"','\');



回答3:


This is a long shot so don't take i to seriously.

I saw a pattern in the text that all the ',' you want to ignore has a space after it. Search and replace ', ' with 'FUU' or something unique.

Now parse the csv file. It might get the correct format. You only need to replace 'FUU' back to ', '

:)




回答4:


You are probably reading the contents of the CSV file as an array of lines, then splitting each line on the comma. This fails since some of the fields also contain commas. One trick that could help you out is to look for ",", which would indicate a field separator which would be unlikely (but not impossible, unfortunately) to occur inside a field.

<?php
  $csv = file_get_contents("yourfile.csv");
  $lines = split("\r\n", $csv);
  echo "<pre>";
  foreach($lines as $line)
  {
    $line = str_replace("\",\"", "\"@@@\"", $line);
    $fields = split("@@@", $line);
    print_r($fields);
  }
  echo "</pre>";
?>



回答5:


$csv = explode(' ', $csv);
foreach ($csv as $k => $v) if($v[0] == '"' && substr($v, -1) == '"') {
    $csv[$k] = mb_convert_encoding('&ldquo;' . substr($v, 1, -1) . '&rdquo;', 'UTF-8', 'HTML-ENTITIES');
}
$csv = implode(' ', $csv);
$csv = str_getcsv($csv);


来源:https://stackoverflow.com/questions/9736497/reading-csv-file-with-unescaped-enclosures

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!