Read file and extract certain part only

Question

ifstream toOpen;
openFile.open("sample.html", ios::in);

if(toOpen.is_open()){
    while(!toOpen.eof()){
        getline(toOpen,line);
        if(line.find("href=") && !line.find(".pdf")){
            start_pos = line.find("href");
            tempString = line.substr(start_pos+1); // i dont want the quote
            stop_pos = tempString.find("\"");
            string testResult = tempString.substr(start_pos, stop_pos);
            cout << testResult << endl;
        }
    }

    toOpen.close();
}

What I am trying to do is extract the "href" value, but I can't get it to work.

EDIT:

Thanks to Tony's hint, I now use this:

if(line.find("href=") != std::string::npos ){   
    // Process
}

It works!

Answer 1:

I'd advise against trying to parse HTML like this. Unless you know a lot about the source and are quite certain about how it'll be formatted, chances are that anything you do will have problems. HTML is an ugly language with an (almost) self-contradictory specification that (for example) says particular things are not allowed -- but then goes on to tell you how you're required to interpret them anyway.

Worse, almost any character can (at least potentially) be encoded in any of at least three or four different ways, so unless you scan for (and carry out) the right conversions (in the right order) first, you can end up missing legitimate links and/or including "phantom" links.
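To make the point concrete, here is a minimal sketch of a substring test in the style of the question, assuming the link is written exactly as href="..." with a literal ".pdf". The names naive_has_pdf_link and count_recognised are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Exact, case-sensitive substring test (hypothetical helper).
// Assumes the attribute is spelled href="..." and the extension
// appears literally as ".pdf".
bool naive_has_pdf_link(const std::string& line)
{
    return line.find("href=\"") != std::string::npos &&
           line.find(".pdf") != std::string::npos;
}

// Count how many of the given lines the naive test recognises.
std::size_t count_recognised(const std::vector<std::string>& lines)
{
    std::size_t n = 0;
    for (const std::string& line : lines) {
        if (naive_has_pdf_link(line)) {
            ++n;
        }
    }
    return n;
}
```

A browser treats each of the following as a link to foo.pdf, yet the test above recognises only the first: <a href="foo.pdf">, <a HREF="foo.pdf">, <a href='foo.pdf'>, <a href = "foo.pdf"> and <a href="foo&#46;pdf">. Attribute names are case-insensitive, single quotes and spaces around the equals sign are allowed, and &#46; is a character reference for the dot.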

You might want to look at the answers to this previous question for suggestions about an HTML parser to use.

Answer 2:

As a start, you might want to take some shortcuts in the way you write the loop over lines in order to make it clearer. Here is the conventional "read line at a time" loop using C++ iostreams:

#include <cstdlib>   // EXIT_FAILURE
#include <fstream>
#include <iostream>
#include <string>

int main ( int, char ** )
{
    std::ifstream file("sample.html");
    if ( !file.is_open() ) {
        std::cerr << "Failed to open file." << std::endl;
        return (EXIT_FAILURE);
    }
    for ( std::string line; (std::getline(file,line)); )
    {
        // process line.
    }
}

As for the inner part that processes the line, there are several problems.

It doesn't compile. I suppose this is what you meant by "I can't get it to work". When asking a question, this is the kind of information you should provide in order to get good help.
There is confusion between variable names such as temp and tempString.
string::find() returns std::string::npos, a very large positive integer (size_type is unsigned), when there is no match. Used directly as a condition, the result is therefore true whenever nothing is found, and false only when a match starts at character position 0, which is exactly the case where you probably do want to enter the if block.
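The last point can be sketched as follows. The helper names contains and find_as_condition are hypothetical; the second one only mimics what writing if(line.find(...)) does.

```cpp
#include <string>

// True when `needle` occurs anywhere in `text`, including at position 0.
bool contains(const std::string& text, const std::string& needle)
{
    return text.find(needle) != std::string::npos;
}

// What `if (text.find(needle))` actually tests: the returned index
// converted to bool. npos is non-zero, so "not found" becomes true,
// and a match at index 0 becomes false.
bool find_as_condition(const std::string& text, const std::string& needle)
{
    return static_cast<bool>(text.find(needle));
}
```

For the line <p>plain text</p>, find_as_condition(line, "href=") is true even though there is no match, while contains(line, "href=") is false. For a line that begins with href="foo.pdf", the two results are swapped.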

Here is a simple test content for sample.html.

<html>
    <a href="foo.pdf"/>
</html>

Sticking the following inside the loop:

if ((line.find("href=") != std::string::npos) &&
    (line.find(".pdf" ) != std::string::npos))
{
    const std::size_t start_pos = line.find("href");
    std::string temp = line.substr(start_pos+6);
    const std::size_t stop_pos = temp.find("\"");
    std::string result = temp.substr(0, stop_pos);
    std::cout << "'" << result << "'" << std::endl;
}

I actually get the output

'foo.pdf'

However, as Jerry pointed out, you might not want to use this in a production environment. If this is a simple homework or exercise on how to use the <string>, <iostream> and <fstream> libraries, then go ahead with such a procedure.
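For such an exercise, the same idea can be wrapped in a small function that also copes with several links on one line. This is only a sketch under the assumption that every attribute is written exactly as href="..." with double quotes and no character references; extract_hrefs and ends_with are hypothetical names, not part of any library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collect every double-quoted href value found on one line.
// Assumption: the attribute is spelled exactly href="...".
std::vector<std::string> extract_hrefs(const std::string& line)
{
    static const std::string key = "href=\"";
    std::vector<std::string> values;
    std::size_t pos = 0;
    while ((pos = line.find(key, pos)) != std::string::npos) {
        const std::size_t start = pos + key.size();      // first char after the quote
        const std::size_t stop  = line.find('"', start); // closing quote
        if (stop == std::string::npos) {
            break;  // no closing quote on this line; stop searching
        }
        values.push_back(line.substr(start, stop - start));
        pos = stop + 1;  // continue after the closing quote
    }
    return values;
}

// True when `value` ends with `suffix`, e.g. ".pdf".
bool ends_with(const std::string& value, const std::string& suffix)
{
    return value.size() >= suffix.size() &&
           value.compare(value.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

Inside the line loop, one would call extract_hrefs(line) and print only the values for which ends_with(value, ".pdf") is true. Checking the suffix of the extracted value, rather than searching the whole line for ".pdf", avoids matching a ".pdf" that appears elsewhere on the line.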

Source: https://stackoverflow.com/questions/4279200/read-file-and-extract-certain-part-only

Tags

c++

string

extract