Using libpqxx to store data in bulk, OR how to use the COPY statement in libpqxx

Submitted by 梦想与她 on 2019-12-14 02:15:20

Question


To insert bulk data / populate a database in PostgreSQL, the fastest way would be to use COPY. (Source)
I have to populate a database. Right now I am getting write speeds as low as 100-200 rows per second, sending many individual INSERTs through the C++ library libpqxx. The two reasons, I suppose, are:

  1. The data has many repeated records (I have raw logs, which I parse and send), which causes primary key violations.
  2. The INSERT statements are sent one by one.

The first is out of my hands. However, I have been reading about the second.
As far as I know, the tablewriter class was suited to this purpose, but it has apparently been deprecated. I have also read that it is possible to pass stdin as a parameter to COPY.
Beyond these clues I am lost. Can someone lead me to a solution?
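(For what it's worth, newer libpqxx releases (7.x) seem to expose the COPY-from-stdin path through pqxx::stream_to rather than tablewriter. A rough, untested sketch, assuming a logs table whose column order matches the four values used in the code further down:)

#include <pqxx/pqxx>
#include <tuple>
#include <vector>

// Log is the same class as in the code in the Edit below; column order is assumed.
void pushLogsWithCopy(const std::vector<Log> &logs, pqxx::connection &conn){
    pqxx::work tx(conn);
    // stream_to drives COPY ... FROM STDIN under the hood.
    auto stream = pqxx::stream_to::table(tx, {"logs"});
    for(const Log &log : logs)
        stream << std::make_tuple(log.getDevice(), log.getUser(),
                                  log.getDate(), log.getLabel());
    stream.complete();  // finish the COPY before committing
    tx.commit();        // note: like plain COPY, a duplicate key would still abort this
}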

Edit: Here is the code, with the function which executes the statement:

#include <pqxx/pqxx>
#include <sstream>
#include <iostream>

void pushLog(Log log, pqxx::connection *conn){
    pqxx::work w(*conn);
    std::stringstream stmt;
    // The values are concatenated into the statement as-is (not escaped).
    stmt << "INSERT INTO logs VALUES('" << log.getDevice() << "','" << log.getUser()
         << "','" << log.getDate() << "','" << log.getLabel() << "');";
    try{
        pqxx::result res = w.exec(stmt.str());
        w.commit();
    }
    catch(const std::exception &e){
        std::cerr << e.what() << std::endl;
        std::cout << "Exception on statement: [" << stmt.str() << "]\n";
        return;
    }
}

I establish the connection earlier and pass a pointer to it.

PS: The question might lack some details. If so, please comment, and I'll edit and add them.


Answer 1:


The pushLog function commits every insert separately, and commit is slow.

As explained in the documentation's Populating a Database:

If you allow each insertion to be committed separately, PostgreSQL is doing a lot of work for each row that is added

Also:

An additional benefit of doing all insertions in one transaction is that if the insertion of one row were to fail then the insertion of all rows inserted up to that point would be rolled back, so you won't be stuck with partially loaded data

In your case, however, that would be a problem rather than a benefit, because each INSERT may fail on a primary key violation, thus cancelling all the previous INSERTs since the last commit. Note that this would also be a problem with COPY, should you use that.

Since it's really necessary to group queries in transactions for performance, you need to deal with primary key violations in a way that doesn't abort the transaction.
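Concretely, the grouping alone could look something like the sketch below (a sketch only, assuming a pushLogs function that is handed the whole batch; the duplicate-key handling is what the two methods that follow add):

#include <pqxx/pqxx>
#include <vector>

void pushLogs(const std::vector<Log> &logs, pqxx::connection &conn){
    pqxx::work w(conn);
    for(const Log &log : logs){
        // quote() escapes and quotes each value, unlike the raw concatenation above.
        w.exec("INSERT INTO logs VALUES(" + w.quote(log.getDevice()) + "," +
               w.quote(log.getUser()) + "," + w.quote(log.getDate()) + "," +
               w.quote(log.getLabel()) + ")");
    }
    w.commit();  // a single commit for the whole batch
}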

Two methods are typically used:

  1. Avoid the error: INSERT INTO ... SELECT ... WHERE NOT EXISTS (SELECT 1 FROM table WHERE primary_key=...) (a sketch in libpqxx terms follows below).

  2. Trap the error by inserting inside a plpgsql function that has an EXCEPTION block ignoring it. The specific INSERT(s) causing a duplicate will be cancelled, but the transaction will not be aborted.

If you have concurrent inserts, these methods need to be refined with a locking strategy.
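As a sketch of the first method in libpqxx terms (the table and column names, and the choice of device + date as the key, are assumptions here; adjust them to the real schema):

#include <pqxx/pqxx>
#include <string>
#include <vector>

void pushLogsSkipDuplicates(const std::vector<Log> &logs, pqxx::connection &conn){
    pqxx::work w(conn);
    for(const Log &log : logs){
        const std::string device = w.quote(log.getDevice());
        const std::string user   = w.quote(log.getUser());
        const std::string date   = w.quote(log.getDate());
        const std::string label  = w.quote(log.getLabel());
        // INSERT ... SELECT ... WHERE NOT EXISTS inserts nothing (and raises no
        // error) when a row with the same key is already present, so the
        // surrounding transaction keeps going.
        w.exec("INSERT INTO logs SELECT " + device + "," + user + "," + date + "," + label +
               " WHERE NOT EXISTS (SELECT 1 FROM logs WHERE device=" + device +
               " AND date=" + date + ")");
    }
    w.commit();
}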



Source: https://stackoverflow.com/questions/19684168/using-libpqxx-for-to-store-data-in-bulk-or-how-to-use-copy-statement-in-libpqxx
