Scraping a plain text file with no HTML?

Question

I have the following data in a plain text file:

1.  Value
Location :  Value
Owner:  Value
Architect:  Value

2.  Value
Location :  Value
Owner:  Value
Architect:  Value

... up to 200+ ...

The number and the word Value change for each segment.

Now I need to insert this data into a MySQL database.

Do you have a suggestion for how I can traverse and scrape it, so that I get the text beside the number and the values of "Location", "Owner", and "Architect"?

It seems hard to do with a DOM scraping class, since there are no HTML tags present.

Answer 1:

That will work with a very simple stateful line-oriented parser. For every line, you accumulate parsed data into an array(). When something tells you that you're on a new record, you dump what you parsed and start again.

Line-oriented parsers have a great property: they require little memory and, most importantly, constant memory. They can process gigabytes of data without breaking a sweat. I manage a bunch of production servers, and there's nothing worse than scripts that slurp whole files into memory (and then stuff arrays with parsed content, which requires more than twice the original file size in memory).

This works and is mostly unbreakable:

<?php
$in_name = 'in.txt';
$in = fopen($in_name, 'r') or die();

function dump_record($r) {
    print_r($r);
}

$current = array();
while (($line = fgets($in)) !== false) {
    /* Skip empty lines (any amount of whitespace counts as 'empty') */
    if (preg_match('/^\s*$/', $line)) continue;

    /* Search for '123. <value> ' stanzas */
    if (preg_match('/^(\d+)\.\s+(.*?)\s*$/', $line, $start)) {
        /* If we already parsed a record, this is the time to dump it */
        if (!empty($current)) dump_record($current);

        /* Let's start the new record, keeping the text beside the number */
        $current = array( 'id' => $start[1], 'title' => $start[2] );
    }
    else if (preg_match('/^([^:]+?)\s*:\s*(.*?)\s*$/', $line, $keyval)) {
        /* Otherwise parse a plain 'key: value' stanza (key without trailing spaces) */
        $current[ $keyval[1] ] = $keyval[2];
    }
    else {
        error_log("parsing error: '$line'");
    }
}

/* Don't forget to dump the last parsed record, situation
 * we only detect at EOF (end of file) */
if (!empty($current)) dump_record($current);

fclose($in);
?>

Obviously you'll need something suited to your taste in the dump_record function, such as printing a correctly formatted INSERT SQL statement.
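
As one possible shape for that, the sketch below turns a parsed record into a parameterized INSERT rather than concatenating values into the SQL string. The helper name build_insert, the table name projects, and its column names are assumptions made up for illustration; none of them come from the answer above.

```php
<?php
// Hypothetical helper: build a parameterized INSERT for one parsed record.
// The table "projects" and its columns are assumptions for illustration.
function build_insert(array $record): array
{
    // Normalize keys so that "Location " and "location" are treated alike.
    $fields = array();
    foreach ($record as $key => $value) {
        $fields[strtolower(trim((string) $key))] = trim((string) $value);
    }

    $sql = 'INSERT INTO projects (id, location, owner, architect) '
         . 'VALUES (:id, :location, :owner, :architect)';

    // Missing fields become NULL instead of raising a notice.
    $params = array(
        ':id'        => (int) ($fields['id'] ?? 0),
        ':location'  => $fields['location'] ?? null,
        ':owner'     => $fields['owner'] ?? null,
        ':architect' => $fields['architect'] ?? null,
    );

    return array($sql, $params);
}

// Possible use inside dump_record(), given an open PDO connection $pdo:
//     list($sql, $params) = build_insert($r);
//     $pdo->prepare($sql)->execute($params);
```

Binding the values as parameters leaves quoting and escaping to the database driver, which matters when a value contains an apostrophe.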

Answer 2:

If the data is consistently structured, you can use fscanf to scan it from the file.

/* Notice the newlines at the end! */
$format = <<<FORMAT
%d. %s
Location :  %s
Owner:  %s
Architect:  %s


FORMAT;

$file = fopen('file.txt', 'r');
while ($data = fscanf($file, $format)) {
    list($number, $title, $location, $owner, $architect) = $data;
    // Insert the data to database here
}
fclose($file);

More about fscanf in the docs.
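
The PHP manual states that each fscanf() call reads a single line, and %s stops at the first whitespace, so a multi-line format may not capture whole values such as "New York City". A per-line variant using sscanf() with the %[^\n] scan set is sketched below; the function name parse_blocks and the output keys are assumptions for illustration only.

```php
<?php
// Hypothetical helper: parse the blocks one line at a time with sscanf().
// "%[^\n]" captures the rest of the line, spaces included.
function parse_blocks($handle): array
{
    $records = array();
    $current = null;

    while (($line = fgets($handle)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue; // blank separator between blocks
        }

        // A line such as "12.  Value" starts a new record.
        if (sscanf($line, "%d. %[^\n]", $number, $title) === 2) {
            if ($current !== null) {
                $records[] = $current;
            }
            $current = array('number' => $number, 'title' => $title);
            continue;
        }

        // Otherwise try each known label in turn.
        foreach (array('Location', 'Owner', 'Architect') as $label) {
            if ($current !== null
                && sscanf($line, $label . " : %[^\n]", $value) === 1) {
                $current[strtolower($label)] = $value;
                break;
            }
        }
    }

    // Flush the last record, which is only complete at end of file.
    if ($current !== null) {
        $records[] = $current;
    }
    return $records;
}
```

It could be called as parse_blocks(fopen('file.txt', 'r')), with the database insert done while looping over the returned array.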

Answer 3:

If every block has the same structure, you could do this with the file() function: http://nl.php.net/manual/en/function.file.php

$data = file('path/to/file.txt');

With this every row is an item in the array, and you could loop through it.

for ($i = 0; $i<count($data); $i+=5){
    $valuerow = $data[$i];
    $locationrow = $data[$i+1];
    $ownerrow = $data[$i+2];
    $architectrow = $data[$i+3];
    // strip the data you don't want here, and insert it into the database.
}
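
One way to do the stripping mentioned in the comment is to split each row at its first "." or ":" and keep the trimmed remainder. The sketch assumes every block is exactly four data lines followed by one blank line; the helper names strip_label and rows_to_records are made up for illustration.

```php
<?php
// Hypothetical helper: drop everything up to the first $separator
// (the "1." or "Location :" part) and return the trimmed value.
function strip_label(string $row, string $separator): string
{
    $parts = explode($separator, $row, 2);
    return trim(isset($parts[1]) ? $parts[1] : $parts[0]);
}

// Hypothetical helper: group rows (as returned by file()) into records,
// assuming four data lines plus one blank line per block.
function rows_to_records(array $rows): array
{
    $records = array();
    foreach (array_chunk($rows, 5) as $block) {
        if (count($block) < 4) {
            continue; // incomplete trailing block
        }
        $records[] = array(
            'number'    => (int) $block[0], // leading digits of "12.  Value"
            'title'     => strip_label($block[0], '.'),
            'location'  => strip_label($block[1], ':'),
            'owner'     => strip_label($block[2], ':'),
            'architect' => strip_label($block[3], ':'),
        );
    }
    return $records;
}
```

Limiting explode() to two parts keeps any later colon or period inside the value intact.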

Answer 4:

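This will give you what you want:

// $txt holds the whole file, e.g. $txt = file_get_contents('file.txt');
// $pdo is assumed to be an open PDO connection (mysql_query() was removed in PHP 7).
// your_table is a placeholder: change the query to match your schema.
$stmt = $pdo->prepare('INSERT INTO your_table (id, location, owner, architect) VALUES (?, ?, ?, ?)');

// Blocks are separated by blank lines; \R matches any line ending
$array = preg_split('/\R{2,}/', trim($txt));
foreach ($array as $value) {
    // The m flag lets ^ and $ match at line boundaries inside the block
    if (!preg_match('/^\d+\.[ \t]*(.*?)\s*$/m', $value, $id)) continue;
    preg_match('/^Location[ \t]*:[ \t]*(.*?)\s*$/m', $value, $location);
    preg_match('/^Owner[ \t]*:[ \t]*(.*?)\s*$/m', $value, $owner);
    preg_match('/^Architect[ \t]*:[ \t]*(.*?)\s*$/m', $value, $architect);

    // Bound parameters avoid quoting bugs and SQL injection
    $stmt->execute(array(
        $id[1],
        isset($location[1]) ? $location[1] : null,
        isset($owner[1]) ? $owner[1] : null,
        isset($architect[1]) ? $architect[1] : null,
    ));
}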

Answer 5:

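I agree with Topener's solution. Here's an example, assuming each block is 4 lines plus a blank line:

$data = file('path/to/file.txt', FILE_IGNORE_NEW_LINES);
$id = 0;
$parsedData = array();
foreach ($data as $n => $row) {
  if (trim($row) === '') continue; // blank separator line
  if (($n % 5) == 0) {
    // "12.  Value" -> id 12, title "Value"
    list($num, $title) = explode('.', $row, 2);
    $id = (int) $num;
    $parsedData[$id]['title'] = trim($title);
  } else {
    // "Location :  Value" -> key "Location", value "Value"
    list($key, $value) = explode(':', $row, 2);
    $parsedData[$id][trim($key)] = trim($value);
  }
}

The resulting structure is convenient to use, for MySQL or anything else. Splitting each row at the first colon also removes the colon from the key.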

Good luck!

Answer 6:

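// Groups: 1 = number, 2 = text beside the number, 3 = location, 4 = owner, 5 = architect
$pattern = '/^(\d+)\.\h*(.*?)\h*\RLocation\h*:\h*(.*?)\h*\ROwner\h*:\h*(.*?)\h*\RArchitect\h*:\h*(.*?)\h*$/mi';
preg_match_all($pattern, $txt, $m, PREG_SET_ORDER);

$matched = array();

// With PREG_SET_ORDER, each $row holds all groups of one matched block
foreach ($m as $row) {

    $matched[$row[1]] = array(
        "title" => trim($row[2]),
        "location" => trim($row[3]),
        "owner" => trim($row[4]),
        "architect" => trim($row[5])
    );

}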

Source: https://stackoverflow.com/questions/8432304/scraping-a-plain-text-file-with-no-html

Tags

php

screen-scraping