Looking for a little Python machine learning advice

Question

I'm interested in having a dabble with Python and machine learning/automatic data entry. However, as my research has progressed, I've realised there are so many different techniques, each with its own strengths.

I've decided I might get further if I learn in the opposite direction, i.e. pick a problem/task and learn by solving/completing it.

I occasionally have to process invoices that are faxed, and I'm hoping to make a program that can enter the data for me once I've scanned them in.

The faxes basically consist of 2 identical tables. Each row denotes a separate worker. The 1st column is for the worker's name (a choice of 6), the 2nd is an address, and the rest of the columns are tick boxes which denote different jobs. There is also an invoice ID in a box at the top of the page.

I'm hoping someone can briefly explain how they would go about this. Would they use an SVM for text recognition, or another technique? And how could I make a program understand that a tick in the 5th box along means 'cleaned=yes' and that the number in the top left box is the ID? I've done a bit of research but can't get my head around how to start. How is it possible to isolate parts of a fax, e.g. the top table and its cells, from the rest of the page when you can't guarantee absolute placement/size due to the fax/scans? Or do I have to get hundreds of faxes plus the typed-up data for them, compare them, and let the program slowly learn for itself that the difference between fax A and fax B is a tick here, and that the ID number is usually there?

Any advice welcomed!

Answer 1:

Broadly speaking you can divide this process into 2 phases:

Determining the location of the text. This sits at the intersection of ML and computer vision, because before any text recognition you need to find where the text is located. It's not an easy task: you have to find lines, boxes, etc. Look at the OpenCV library, for example; it may be useful for CV-related tasks. If all of your documents have exactly the same form (the location of the fields relative to the scanned sheet itself) and you can scan them perfectly, without distortions (rotations, offsets), you can try looking for the text in fixed areas, where the fields are.
Recognizing the text. Once you have found the text, you have to break the contents of each field into words, then the words into characters, and then feed these characters to your recognizer (the ML part) to get a label for each character. This is close to impossible at present for handwritten text, so recognizing handwriting is hard in the general case. Even if the fields contain only printed text, I recommend that you avoid writing this step yourself and use a dedicated OCR library, like Tesseract.
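As a rough illustration of the "fixed areas" idea applied to the tick boxes from the question, here is a minimal sketch. It assumes the scan has already been straightened and binarized into a grid of 0/1 pixels (for instance with OpenCV), and that the tick-box coordinates are known in advance. The names `TICK_COLUMNS`, `ink_ratio` and `read_ticks`, the column labels, the coordinates and the 0.15 threshold are all made up for the example, not taken from any library. No learning is involved: a box counts as ticked when enough of its pixels are dark.

```python
from typing import Dict, List, Tuple

# Binarized page fragment: 1 = dark (ink) pixel, 0 = light (paper) pixel.
Image = List[List[int]]
# A region given as (top, left, height, width) in pixels.
Box = Tuple[int, int, int, int]

# Hypothetical layout for one table row: field name -> region of its tick box.
# Real coordinates would have to be measured from a sample scan, and each
# region should sit just inside the printed cell border so that the border
# itself is not counted as ink.
TICK_COLUMNS: Dict[str, Box] = {
    "cleaned": (0, 0, 4, 4),
    "painted": (0, 4, 4, 4),
}


def ink_ratio(image: Image, box: Box) -> float:
    """Return the fraction of dark pixels inside the given region."""
    top, left, height, width = box
    dark = sum(sum(row[left:left + width]) for row in image[top:top + height])
    return dark / (height * width)


def read_ticks(image: Image, columns: Dict[str, Box],
               threshold: float = 0.15) -> Dict[str, bool]:
    """Map each field name to True when its box holds enough ink to be a tick."""
    return {name: ink_ratio(image, box) >= threshold
            for name, box in columns.items()}


# Tiny synthetic row: a diagonal stroke in the first box, second box empty.
row_image: Image = [[0] * 8 for _ in range(4)]
for i in range(4):
    row_image[i][i] = 1

result = read_ticks(row_image, TICK_COLUMNS)
print(result)  # {'cleaned': True, 'painted': False}
```

The threshold would need tuning on real faxes, since fax noise and stray marks also add dark pixels. If the scans are rotated or shifted, the fixed coordinates stop lining up, and you are back to the harder problem of locating the table first.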

Source: https://stackoverflow.com/questions/32089023/looking-for-a-little-python-machine-learning-advice

Tags

python

machine-learning

image-recognition

text-recognition