26.11.2021 — OCR the doc or use the text in the pdf?

Day 8216「五」

Written on 28 Nov 2021

It must be funny to read these back in the future, especially the thought of the day.

✨ Discovery

Currently I am working on extracting data from text-based documents, in this case PDFs. For previous documents, what I normally did was convert them into image files and then OCR them to get the text and coordinates.

With this data, we then extract the information based on the keys we need to find and their respective values, according to the coordinates. For example, key A at coordinates top left = (200, 300), bottom right = (300, 350).

With this coordinate, we then go through the rest of the extracted text, using their coordinates to find which words sit next to the key, and apply some algorithm to extract the value word by word and concatenate the pieces together; the same goes for finding the key we want in the first place. This is necessary when using the Google Vision API, because the text it returns comes back one word at a time, which makes it much harder I would say, unlike other OCR engines where some extract a whole horizontal line of a sentence when the words are close to each other. I don't know, maybe I just have not found a simple algorithm for concatenating the words yet.
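As a rough sketch of what I mean by concatenating word-by-word results back into lines: group word boxes by their vertical centres, then sort each group left to right and join the text. The box format `(text, x0, y0, x1, y1)` and the tolerance value here are my own assumptions for illustration, not the actual Google Vision response shape.

```python
def group_into_lines(words, y_tol=10):
    """Group word boxes whose vertical centres are within y_tol pixels,
    then join each group's words left to right into one line of text.
    Each word box is a (text, x0, y0, x1, y1) tuple (assumed format)."""
    lines = []  # each entry: [line_centre_y, [word boxes on that line]]
    for word in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        centre = (word[2] + word[4]) / 2
        if lines and abs(lines[-1][0] - centre) <= y_tol:
            lines[-1][1].append(word)   # same visual line as the previous word
        else:
            lines.append([centre, [word]])
    # Within each line, order words by their left edge and join the text.
    return [" ".join(w[0] for w in sorted(ws, key=lambda w: w[1]))
            for _, ws in lines]

# Invented example boxes, loosely in the coordinate style mentioned above.
words = [
    ("Invoice", 200, 300, 280, 320),
    ("No:", 290, 302, 320, 320),
    ("12345", 330, 301, 390, 321),
    ("Date:", 200, 340, 250, 360),
    ("28/11/2021", 260, 341, 360, 361),
]
print(group_into_lines(words))  # → ['Invoice No: 12345', 'Date: 28/11/2021']
```

This only handles the simple case of roughly horizontal text; rotated or multi-column pages would need something smarter.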

Now, when we have a document whose text can be extracted directly without OCR, it saves us one step in the pipeline. But the problem is that we no longer get the coordinates of the words.

These are my current ideas:

  1. pdfminer.six: It can convert the PDF into various file types: txt, tag, html and xml. The good thing is that with txt, we can extract some simple information from the document with the help of NLTK. With the tag file, we can tell which page a piece of text is on. And in the html file, we can see the word coordinates (the top attribute). I am thinking of using this to match keys with values, so we do not need a table detection model to find borderless tables, which is pretty hard atm since I have not trained any model to detect them yet, and the available models out there lean more towards bordered tables.
  2. python-docx: Another method would be converting the PDF, RTF or any text-based file into docx, since python-docx is pretty good at finding tables in a document, and with a little help from pandas we can convert a table into a DataFrame to extract the keys and values. But, but, but the problem is converting the PDF into docx in the first place: the current free PDF converters out there, like pdf2doc, are pretty bad with borderless tables. Thinking of trying the Adobe free trial API. Still, even if we extract the table perfectly, there will be times when a sentence word-wraps and spills into two rows, which will make extracting the key hard; the same goes for the table detection model.
  3. tabula-py: Extracts tables straight from the PDF file, without any docx conversion, and converts them into pandas DataFrames. But sometimes, when the table is borderless, you cannot get a good table out of it.
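To show how the coordinate idea from option 1 could work: once we have text fragments with their positions (for example, the top attribute from pdfminer.six's HTML output), we can pair a key with whatever is printed to its right on the same row, no table model needed. The `(text, left, top)` triples and the tolerance below are invented for illustration, not actual pdfminer output.

```python
def value_right_of(key, fragments, top_tol=5):
    """Return the text printed on the same row as `key`, to its right.
    Fragments are (text, left, top) triples (assumed format); rows are
    matched by allowing the tops to differ by up to top_tol units."""
    key_frag = next(f for f in fragments if f[0] == key)
    _, key_left, key_top = key_frag
    # Keep fragments on roughly the same row, strictly to the key's right.
    same_row = [f for f in fragments
                if f is not key_frag
                and abs(f[2] - key_top) <= top_tol
                and f[1] > key_left]
    # Order left to right and join, mirroring reading order.
    return " ".join(f[0] for f in sorted(same_row, key=lambda f: f[1]))

# Made-up fragments for a borderless "Total   RM 1,250.00" row.
fragments = [
    ("Total", 50, 400),
    ("RM", 300, 402),
    ("1,250.00", 330, 400),
    ("Subtotal", 50, 430),
]
print(value_right_of("Total", fragments))  # → RM 1,250.00
```

The same row-matching trick works for multi-word keys too, if you first concatenate the key's words using their coordinates.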
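And for the word-wrap problem in options 2 and 3: one hedged way to handle it, once python-docx or tabula-py has given us the rows, is to treat any row whose key cell is empty as a continuation of the row above and merge the wrapped text back up. The sample rows here are invented, and this heuristic assumes a wrapped row always leaves its first cell blank, which real tables will not always do.

```python
def merge_wrapped_rows(rows):
    """Merge rows whose first (key) cell is empty into the row above,
    undoing the two-row split caused by word-wrapped cells."""
    merged = []
    for row in rows:
        if merged and not row[0].strip():
            # Continuation row: glue each non-empty cell onto the cell above.
            merged[-1] = [
                (above + " " + below).strip() if below.strip() else above
                for above, below in zip(merged[-1], row)
            ]
        else:
            merged.append(list(row))
    return merged

# Invented rows where one description wrapped onto a second line.
rows = [
    ["Description", "Qty"],
    ["Borderless table", "2"],
    ["detection model", ""],
]
print(merge_wrapped_rows(rows))
# → [['Description', 'Qty'], ['Borderless table detection model', '2']]
```

Wait, the last sample row starts with text, so for the heuristic to fire the wrapped text would need to land under an empty key cell; in practice you might instead flag rows with empty value cells, depending on which column wraps.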

LMK if you are better at extracting the data in the future 😌

☁️ Weather

Foggy mountains make the best mornings; a little bit of sun rays and the warmth of the sunlight would be nice.

Have to prepare for so much stuff 😣


