To work on the recognition algorithms, I realised that I didn’t have enough (read: I had no) real-world image data to train/test the character recognition code. So I built a test bed that grabs a character from the database I downloaded (I think it was the NIST alphabet database), superimposes it on a coloured shape (restricted to squares and circles, the latter cut out using a negative (mask) image), rotates this target by an arbitrary angle, scales it down to a set ratio, and places the ‘target’ randomly on a generic aerial field picture. The test bed works pretty well, and I reused the same (coarse) code to extract the letter image and/or the contours to be fed to the classifier.
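For the curious, the compositing step can be sketched in a few lines of NumPy. This is a simplified stand-in for my actual code: the real pipeline used OpenCV to rotate by arbitrary angles and to cut circles with the negative mask, whereas this sketch only rotates in 90° steps and pastes a square target (all names and colours here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_target(letter_mask, shape_colour, letter_colour, quarter_turns=0, scale=1):
    """Build a coloured square 'target' with a letter on it.
    letter_mask: 2-D boolean array (True where the letter is).
    The real pipeline rotated by arbitrary angles via OpenCV; np.rot90 and
    integer downsampling stand in here for brevity."""
    m = np.rot90(letter_mask, k=quarter_turns)
    m = m[::scale, ::scale]                    # crude integer-factor downscale
    h, w = m.shape
    target = np.empty((h, w, 3), dtype=np.uint8)
    target[:] = shape_colour                   # coloured square background
    target[m] = letter_colour                  # letter painted on top
    return target

def place_target(field, target):
    """Paste the target at a random position on an aerial 'field' image."""
    H, W, _ = field.shape
    h, w, _ = target.shape
    y = int(rng.integers(0, H - h))
    x = int(rng.integers(0, W - w))
    out = field.copy()
    out[y:y+h, x:x+w] = target
    return out, (y, x)
```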
Having built the test bed, I moved on to the recognition tasks, which I naively assumed would take only a little time to get their act together. I had read a lot of papers, and the most straightforward and promising approach appeared to be Hu moments. Hu moments are essentially seven numbers derived algebraically from the simple geometric moments of an image, with the special property that they describe the image’s shape invariantly to scale and rotation. To make things even better, OpenCV has built-in methods to compute the Hu moments of an image and to compare two sets of Hu moments using four distance measures (statistical formulas that quantify the ‘distance’ between the compared values).
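To make the ‘seven numbers’ bit concrete, here is a small NumPy sketch of how the Hu invariants fall out of the geometric moments. OpenCV does all of this for you via cv2.moments and cv2.HuMoments; this is just the underlying arithmetic spelled out:

```python
import numpy as np

def hu_moments(img):
    """Compute the seven Hu invariants of a 2-D grayscale/binary image."""
    img = np.asarray(img, dtype=np.float64)
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    cx = (x * img).sum() / m00          # centroid
    cy = (y * img).sum() / m00

    def eta(p, q):
        """Scale-normalised central moment eta_pq."""
        mu = ((x - cx) ** p * (y - cy) ** q * img).sum()
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    return np.array([
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2,
        (n30 + n12) ** 2 + (n21 + n03) ** 2,
        (n30 - 3 * n12) * (n30 + n12) * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          + (3 * n21 - n03) * (n21 + n03) * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2),
        (n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
          + 4 * n11 * (n30 + n12) * (n21 + n03),
        (3 * n21 - n03) * (n30 + n12) * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          - (n30 - 3 * n12) * (n21 + n03) * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2),
    ])
```

Rotating an image and recomputing gives (to floating-point precision) the same seven numbers, which is exactly why they looked so attractive for this job.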
So I set up a folder of neatly normalized reference alphabets, compared the Hu moments of the extracted letter with the Hu moments of each of these reference letters, and assumed that the letter with the lowest distance should be the match. Pretty straightforward, right? Unfortunately for me, that was definitely not the case: letters hardly ever matched correctly, and for slimmer letters the results went haywire, with L and Y taking away almost all the matches. Only R behaved well. Here’s a snapshot of the test bed in action:
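The comparison step itself looked roughly like this. The sketch below implements one of the distance measures OpenCV’s matchShapes offers (compare the moments on a log scale, reciprocally, and take the nearest reference letter); the reference dictionary here is made up for illustration:

```python
import numpy as np

def hu_distance(hu_a, hu_b):
    """Distance between two Hu-moment vectors, matchShapes-style:
    compare sign(h) * log10|h| on a reciprocal scale."""
    def m(h):
        h = np.asarray(h, dtype=np.float64)
        return np.sign(h) * np.log10(np.abs(h) + 1e-30)  # guard against log(0)
    return np.abs(1.0 / m(hu_a) - 1.0 / m(hu_b)).sum()

def classify(hu_query, references):
    """references: dict mapping letter -> Hu vector. Returns nearest letter."""
    return min(references, key=lambda k: hu_distance(hu_query, references[k]))
```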
Turns out that Hu moments aren’t complete descriptors: for example, they can’t distinguish between a normal pan and a pan with two diametrically opposite handles.
Disappointed, I moved on to the very exciting field of neural networks and the promises they held in store. After studiously banging my head against the OpenCV documentation (somebody *really* needs to work on it. It is harrowing! –update: I sincerely hope the GSoC intern does a good job!) I finally managed to get something working – the code didn’t throw up errors. I saved the network to a file and got thousands of lines of coefficients – which seemed like what a trained network should look like.
My approach to the neural network was this: I selected one image for each letter from the database and generated 36 images from it by rotating the letter progressively by 10 degrees through 360 degrees (so I had 26×36 images in all). All these images were already normalized and centred, and I scaled them down to 32×32 px. Hence my design for the network was this – 32×32 = 1024 input nodes, a hidden layer of 100 nodes, and an output layer of 26 nodes, with the value of each output node representing the probability of the input being the letter corresponding to that node. I used OpenCV’s implementation of the multi-layered feed-forward perceptron (ain’t that a fancy ((and intimidating)) name?) and trained the network using backpropagation, again implemented in the CvANN_MLP class.
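For anyone who wants the gist without wrestling CvANN_MLP, here is a minimal NumPy sketch of the same idea – a 1024–100–26 feed-forward net trained by backpropagation with sigmoid units and a mean-squared-error loss. The ‘training set’ here is random noise, purely to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(42)

# Architecture from the post: 32x32 = 1024 inputs, 100 hidden, 26 outputs.
N_IN, N_HID, N_OUT = 1024, 100, 26

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training set: random "letter images" with one-hot letter labels.
X = rng.random((20, N_IN))
Y = np.eye(N_OUT)[rng.integers(0, N_OUT, 20)]

W1 = rng.normal(0, 0.05, (N_IN, N_HID)); b1 = np.zeros(N_HID)
W2 = rng.normal(0, 0.05, (N_HID, N_OUT)); b2 = np.zeros(N_OUT)

lr = 0.5
losses = []
for _ in range(200):
    # forward pass
    H = sigmoid(X @ W1 + b1)
    O = sigmoid(H @ W2 + b2)
    losses.append(((O - Y) ** 2).mean())
    # backward pass: propagate the error through the sigmoid derivatives
    dO = (O - Y) * O * (1 - O)
    dH = (dO @ W2.T) * H * (1 - H)
    # gradient-descent updates
    W2 -= lr * H.T @ dO / len(X); b2 -= lr * dO.mean(0)
    W1 -= lr * X.T @ dH / len(X); b1 -= lr * dH.mean(0)
```

On real data you would of course feed in the 26×36 rotated letter images instead of noise, and read the prediction off as the argmax of the 26 output nodes.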
I had no idea whether my implementation was working correctly or not, but the ‘results’ sure were way off. I contemplated using Hu moments as inputs, but then again, the descriptors themselves were not trustworthy in the first place. Seeing no way through the impasse, and with growing desperation to get recognition working and meet the already-missed deadlines, I started looking into other methods. Zernike moments became the new ‘it’ thing, but for paucity of time I had to abandon pursuing them and tried a couple of other techniques in the meanwhile. (In case you’re wondering, k-NN was not even considered this time around.)
I tried a clever approach by Torres, Suarez, Sucar, et al. which involved drawing concentric circles from the centre of mass of each image and counting the number of white-to-black transitions along each circle – a count that does not vary with scale or rotation. However, the small size of the letters and the resulting inevitable sampling artifacts made this method very unreliable, and led to its consequent shelving. Here’s a screenshot of that in operation (I made it pretty nice and colourful. Notice the circular code recognizing it as a T. Nods disapprovingly):
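The circle-counting idea is simple enough to sketch in NumPy – walk circles of increasing radius around the centre of mass and count white-to-black transitions along each one. The function name and sampling density below are my own choices, not the paper’s:

```python
import numpy as np

def circle_transitions(img, radii, n_samples=360):
    """For each radius, sample a circle around the image's centre of mass
    and count white-to-black transitions along it (a rotation-invariant
    signature, after Torres et al.)."""
    img = np.asarray(img) > 0
    ys, xs = np.nonzero(img)
    cy, cx = ys.mean(), xs.mean()              # centre of mass
    theta = np.linspace(0, 2 * np.pi, n_samples, endpoint=False)
    counts = []
    for r in radii:
        ry = np.clip(np.rint(cy + r * np.sin(theta)).astype(int), 0, img.shape[0] - 1)
        rx = np.clip(np.rint(cx + r * np.cos(theta)).astype(int), 0, img.shape[1] - 1)
        s = img[ry, rx]                        # nearest-pixel samples on the circle
        counts.append(int((s & ~np.roll(s, -1)).sum()))  # white followed by black
    return counts
```

At 32 px letter sizes, the nearest-pixel sampling above is exactly where the artifacts crept in: a circle only a few pixels across gives very few, very jittery samples.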
Another promising method came from an ICDAR paper that used rotated images of the characters (just as I did for the neural network training) to build up an eigenspace and come up with a set of eigenfaces, of sorts, which, when compared with a given image vector, would recognize BOTH the letter and its orientation. However, the mathematics seemed a bit dense, and the recent decision to stop working on recognition led to its abandonment.
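For completeness, the gist of the eigenspace idea can be sketched with a plain SVD: stack the vectorised rotated templates, keep the top principal axes, and a nearest neighbour in that reduced space returns both the letter and the angle. This is my own simplified reading, not the paper’s exact method, and the data below is random stand-in, not real letters:

```python
import numpy as np

rng = np.random.default_rng(1)

def build_eigenspace(templates, k=10):
    """templates: (n, d) matrix of vectorised rotated letter images.
    Returns the mean image and the top-k principal axes."""
    mean = templates.mean(0)
    _, _, vt = np.linalg.svd(templates - mean, full_matrices=False)
    return mean, vt[:k]

def project(v, mean, axes):
    """Project image vector(s) into the eigenspace."""
    return (v - mean) @ axes.T

def recognise(query, templates, labels, mean, axes):
    """Nearest template in eigenspace -> its (letter, angle) label."""
    coords = project(templates, mean, axes)
    q = project(query, mean, axes)
    return labels[int(np.argmin(((coords - q) ** 2).sum(1)))]

# Hypothetical data: 2 letters x 6 orientations as random 8x8 "images".
templates = rng.random((12, 64))
labels = [(chr(65 + i // 6), 60 * (i % 6)) for i in range(12)]
mean, axes = build_eigenspace(templates)
query = templates[5] + 0.001 * rng.standard_normal(64)
```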
So, as I just mentioned, we discussed and decided that the team’s focus should be on presenting the required ‘actionable intelligence’ to the competition judges, which means we need to complete our GUI, segmentation and acquisition tasks to get a minimum working model ready.
It’s been quite an arduous task, but in all fairness, recognition itself is not required to be autonomous. Sigh.