https://github.com/smarbal/ocrai-grader
Automatic grader web application using AI OCR built with Flask, Tailwind CSS + Flowbite, PaddleOCR and pyspellchecker.
Run

```sh
sudo docker-compose up --build
```

in the root directory.
The web page will be available at http://localhost:3000/.
In order to recognize text across any kind of document, I use PaddleOCR. I selected this toolkit for a few reasons:
When the Docker container boots, the latest versions of their model (PP-OCRv3), in both English and French, are automatically downloaded by the Flask server.
The core framework of PP-OCR contains three modules: text detection, detection box rectification, and text recognition.
Since recognition is the most important part here, I'll go into a bit more detail: a Convolutional Recurrent Neural Network (CRNN) uses a combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to process the images.
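The recognition head's per-timestep predictions are turned into text with CTC decoding: repeated character ids are collapsed, then blank ids are dropped. A minimal greedy-decoding sketch of that last step (the charset and id sequence below are hypothetical, with index 0 assumed to be the CTC blank):

```python
def ctc_greedy_decode(timestep_ids, charset, blank=0):
    """Collapse consecutive repeats, then drop blanks (CTC greedy decoding)."""
    out = []
    prev = None
    for i in timestep_ids:
        if i != prev and i != blank:
            out.append(charset[i])
        prev = i
    return "".join(out)

# Index 0 is reserved for the CTC blank (hypothetical toy charset).
charset = ["<blank>", "h", "e", "l", "o"]
# Per-timestep argmax ids from a recognition head (made-up example).
ids = [1, 1, 2, 0, 3, 3, 0, 3, 4, 4]
print(ctc_greedy_decode(ids, charset))  # -> hello
```

The blank between the two `l` runs is what lets CTC emit a doubled letter instead of collapsing it to one.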
It is a Chinese project; development and documentation seem geared towards the Chinese community: the chat platform is WeChat (a popular Chinese communication app) and a few pre-trained models are Chinese-only (a handwritten text recognition model exists for Chinese but not for English).
I finetuned the latest, best-performing model (PP-OCRv3) to specialize in handwritten text recognition, following the official documentation to train it on the IAM dataset. The project offers a simple API to train or re-train your models. The main step was writing a `yml` configuration file, starting from an available template: I mostly had to set my pre-trained model name, my data file names and the learning rate. The configuration files can be found in `./train`.
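To give an idea of the shape of such a file, here is a trimmed sketch of a finetuning configuration. The field names follow PaddleOCR's recognition config templates, but the paths and values below are illustrative, not the exact ones used in this repository:

```yaml
Global:
  # Path to the downloaded PP-OCRv3 English recognition weights (illustrative)
  pretrained_model: ./pretrain_models/en_PP-OCRv3_rec_train/best_accuracy
Optimizer:
  lr:
    learning_rate: 0.0005  # lowered for finetuning (illustrative value)
Train:
  dataset:
    data_dir: ./train/data/
    label_file_list:
      - ./train/rec_gt_train.txt  # hypothetical IAM label file
```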
That model is ultimately not included: the initial one generalised poorly, and unforeseen GPU driver problems made it impossible to train it again.
Results are excellent for printed characters in any kind of context. Handwritten text, on the other hand, is harder to get right: the context has to be very clear and the writing must not be too messy, cursive or unusual. Speed is also good in general, though processing slows down on PDFs of three or more pages with lots of content.
Since OCR output will often contain insertions, deletions or badly recognized characters within a word, I added a spellchecker to correct those small mistakes.
I use pyspellchecker
. I chose it because it is one of the fastest Python libraries for the task and it supports multiple languages (and even custom dictionaries of words, which can be useful in the context of automatic grading).
It uses the Levenshtein distance to find words within a distance of 2. This means it tries all the different insertions, deletions and substitutions (with a maximum of 2 operations) and compares the results to a dictionary. Words that exist in the dictionary become candidates for the correction; among the candidates at the smallest distance, it then selects the most frequent one in the selected language.
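A minimal Norvig-style sketch of that idea, not pyspellchecker's actual code (the toy frequency dictionary and its counts are made up; pyspellchecker additionally considers transpositions, included here):

```python
import string

def edits1(word):
    """All strings one edit away: deletions, transpositions,
    substitutions and insertions over a lowercase alphabet."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

def correction(word, freq):
    """Return the most frequent known word within distance 2,
    preferring smaller distances."""
    if word in freq:
        return word
    d1 = edits1(word) & freq.keys()
    if d1:
        return max(d1, key=freq.get)
    d2 = {e2 for e1 in edits1(word) for e2 in edits1(e1)} & freq.keys()
    if d2:
        return max(d2, key=freq.get)
    return word  # no candidate found: leave the word unchanged

# Toy frequency dictionary (hypothetical counts).
freq = {"the": 100, "then": 30, "than": 20, "answer": 15}
print(correction("teh", freq))  # -> the
```

Preferring distance-1 candidates before even looking at distance 2 is what keeps "teh" from being mapped to a rarer but equally close word.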
Other libraries are available, such as TextBlob (which targets a wider range of applications) and AI-based tools, but I had mixed results with them. Solutions such as ChatGPT are really good at correcting texts, even ones with many errors, but I wanted to use open-source tools and keep everything necessary contained in this repository.