Question
I'm trying to run my code with a multiprocessing function, but mongo keeps returning
"MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking."
I really don't understand how I can adapt my code to this. Basically the structure is:
db = MongoClient().database
db.authenticate('user', 'password', mechanism='SCRAM-SHA-1')
collectionW = db['words']
collectionT = db['sinMemo']
collectionL = db['sinLogic']
def findW(word):
    rows = collectionW.find({"word": word})
    ind = 0
    for row in rows:
        ind += 1
        id = row["_id"]
    if ind == 0:
        a = ind
    else:
        a = id
    return a
def trainAI(stri):
...
      if findW(word) == 0:
                _id = db['words'].insert(
                    {"_id": getNextSequence(db.counters, "nodeid"), "word": word})
                story = _id
            else:
                story = findW(word)
...
def train(index):
    # searching progress
    progFile = "./train/progress{0}.txt".format(index)
    trainFile = "./train/small_file_{0}".format(index)
    if os.path.exists(progFile):
        f = open(progFile, "r")
        ind = f.read().strip()
        if ind != "":
            pprint(ind)
            i = int(ind)
        else:
            pprint("No progress saved or progress lost!")
            i = 0
        f.close()
    else:
        i = 0
    #get the number of line of the file    
    rangeC = rawbigcount(trainFile)
    #fix unicode
    non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
    files = io.open(trainFile, "r", encoding="utf8")
    str1 = ""
    str2 = ""
    filex = open(progFile, "w")
    with progressbar.ProgressBar(max_value=rangeC) as bar:
        for line in files:
            line = line.replace("\n", "")
            if i % 2 == 0:
                str1 = line.translate(non_bmp_map)
            else:
                str2 = line.translate(non_bmp_map)
            bar.update(i)
            trainAI(str1 + " " + str2)
            filex.seek(0)
            filex.truncate()
            filex.write(str(i))
            i += 1
#multiprocessing function
maxProcess = 3
def f(l, i):
    l.acquire()
    train(i + 1)
    l.release()
if __name__ == '__main__':
    lock = Lock()
    for num in range(maxProcess):
        pprint("start " + str(num))
        Process(target=f, args=(lock, num)).start()
This code is meant to read 4 different files in 4 different processes and insert the data into the database at the same time. I copied only part of the code to show its structure.
I've tried adding connect=False to this code, but it made no difference:
  db = MongoClient(connect=False).database
  db.authenticate('user', 'password', mechanism='SCRAM-SHA-1')
  collectionW = db['words']
  collectionT = db['sinMemo']
  collectionL = db['sinLogic']
Then I tried moving it into the f function (right before train()), but then the program can't find collectionW, collectionT and collectionL.
I'm not very experienced with Python or MongoDB, so I hope this is not a silly question.
The code is running under Ubuntu 16.04.2 with Python 2.7.12.
Answer 1:
db.authenticate has to talk to the MongoDB server, so it opens a connection. Even though connect=False is used, db.authenticate still requires an open connection. Why don't you create the MongoClient instance after the fork? That looks like the easiest solution.
Answer 2:
Since db.authenticate must open the MongoClient and connect to the server, it creates connections which won't work in the forked subprocess. Hence, the error message. Try this instead:
db = MongoClient('mongodb://user:password@localhost', connect=False).database
Also, delete the Lock l. Assuming it is a multiprocessing.Lock, every worker holds it for the whole of train(), so the subprocesses would just run one after another.
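If the real user name or password contains characters such as "@" or ":", they need to be percent-encoded before going into the URI. A small sketch, assuming Python 3 (on Python 2 the import would be `from urllib import quote_plus`). build_mongo_uri is a hypothetical helper, not part of PyMongo:

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host='localhost'):
    # Percent-encode the credentials so reserved characters
    # do not break URI parsing.
    return 'mongodb://{0}:{1}@{2}'.format(
        quote_plus(user), quote_plus(password), host)
```

The result can be passed as the first argument to MongoClient together with connect=False, as in the line above.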
Answer 3:
Here is how I did it for my problem:
import pathos.pools as pp
import time
import db_access
class MultiprocessingTest(object):
    def __init__(self):
        pass
    def test_mp(self):
        data = [[form,'form_number','client_id'] for form in range(5000)]
        pool = pp.ProcessPool(4)
        pool.map(db_access.insertData, data)
if __name__ == '__main__':
    time_i = time.time()
    mp = MultiprocessingTest()
    mp.test_mp()
    time_f = time.time()
    print 'Time Taken: ', time_f - time_i
Here is db_access.py:
from pymongo import MongoClient
def insertData(form):
    client = MongoClient()
    db = client['TEST_001']
    db.initialization.insert({
        "form": form[0],
        "form_number": form[1],
        "client_id": form[2]
    })
This happens in your code because you create MongoClient() once for all the subprocesses. MongoClient is not fork-safe, so creating it inside each function works. Let me know if there are other solutions.
Source: https://stackoverflow.com/questions/45530741/manage-python-multiprocessing-with-mongodb