Last week I gave an overview of my experience at MHacks and the Computer Vision Twitter Helmet two friends and I built. In this post, I’ll go more in-depth into the pupil tracking system of the helmet.
We had originally envisioned the helmet running on a Raspberry Pi, and because of this we decided to delegate the optical character recognition portion of the system to Amazon AWS. This meant that we couldn’t just naively grab any and all frames from the outward-facing camera and extract text from them, because we wouldn’t be able to send the frames to AWS quickly or cheaply enough. Instead, we had to develop a heuristic to increase the probability that a grabbed frame really did have text in it, and only when we were fairly confident of success would we send the frame to AWS for text extraction and processing.
The essence of the heuristic we developed was that sustained horizontal movement of the pupil meant the user was reading text, and thus the front-facing frame was a good candidate for text extraction. If you recall from my previous post, our helmet had two cameras attached to it – an eye-tracking camera and a front-facing one. With the eye-tracking camera fixed to the helmet, we could simply track the pupil’s location in the camera’s view over time; that would act as a proxy for gaze analysis and allow us to discern a reading motion from ordinary pseudo-random eye motion.
With this framework, the main functional loop of the eye tracker essentially became:
```python
eye_cam = <feed from the eye camera>
front_cam = <feed from the front camera>
pupil_locations = []

while True:
    eye_frame = eye_cam.get_frame()
    pupil_locations.append(get_pupil_location(eye_frame))
    if is_horizontal(pupil_locations):
        send_to_aws(front_cam.get_frame())
        # Give our tracking a fresh slate
        pupil_locations[:] = []
    else:
        # Only keep track of pupil locations from the past few seconds
        trim_stale_locations(pupil_locations)
```
To find the location of the pupil, we used OpenCV to apply a series of filters to the eye frame, converting it to a more refined black and white frame. From that, and a few assumptions about what form the pupil now took in this new representation, we calculated the center of the pupil.
To get the black and white version of the frame, we performed four operations on it:
- Converted the frame to greyscale while ignoring the red channel. By removing the red channel from the greyscale version, much of the distracting effects of patches of slightly-too-dark skin were removed.
- Smoothed the frame to further gloss over noise and increased the contrast using OpenCV’s histogram equalization. This helped harden the edge between the pupil/iris and the sclera/eyelids.
- Applied a threshold filter to floor/ceiling the pixel data to be either white or black.
- Applied OpenCV’s dilate filter. At this point we had the black and white frame, but the eyebrow and eyelashes sometimes remained as patchy structures that could dominate the pupil within the frame. To remove them, we used OpenCV’s dilate functionality to erode away much of their dark-pixel mass (dilating the white regions shrinks the dark ones) and then smoothed the frame once more to fully remove them.
The image below progressively shows each of the filters with the final step being the centroid calculation.
Once we had the black and white version of the frame, we did a two-pass centroid calculation to find the center of the pupil. First we found the centroid of all the black pixels in the frame. This worked pretty well, but despite our dilation efforts, eyelashes and other features around the eye sometimes crept into the black regions of the image and threw off the centroid. To alleviate that error, we then found the distribution of distances from the black pixels to that first center we’d just calculated. From that, we trimmed away all of the black pixels which were more than a standard deviation away from the first center and recalculated a new center. This reliably gave us the location of the pupil.
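A sketch of that two-pass calculation with NumPy, reading “more than a standard deviation away” as beyond one standard deviation above the mean distance:

```python
import numpy as np

def pupil_center(bw):
    """Two-pass centroid of the black pixels in a binary image.
    Pass 1: centroid of every black pixel. Pass 2: recompute after
    trimming pixels whose distance from the first centroid is an outlier."""
    ys, xs = np.where(bw == 0)          # coordinates of the black pixels
    if len(xs) == 0:
        return None
    cx, cy = xs.mean(), ys.mean()       # first-pass centroid

    # Distance of each black pixel from the first centroid.
    dists = np.hypot(xs - cx, ys - cy)
    # Keep only pixels within one standard deviation above the mean distance.
    keep = dists <= dists.mean() + dists.std()
    return float(xs[keep].mean()), float(ys[keep].mean())
```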
With a method to reliably locate the pupil, we were then able to track its motion and discern if the user was reading text. To do that, we remembered the pupil locations over the previous second (one location every 0.05 seconds, for a total of 20). On that set of locations we calculated the Pearson correlation coefficient to determine the strength of their linearity. If it fell into certain bounds, then we concluded that the user was indeed reading.
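One way to sketch that test: compute the Pearson coefficient between the pupil’s x coordinate and the sample index, so a sustained horizontal sweep scores near ±1 while jittery back-and-forth motion scores near 0. The cutoff value is illustrative; the exact bounds are a tuning choice:

```python
import numpy as np

def is_reading(pupil_locations, min_points=20, r_cutoff=0.9):
    """Return True when the recent pupil x positions trend linearly
    over time -- our proxy for a reading sweep."""
    if len(pupil_locations) < min_points:
        return False
    xs = np.array([p[0] for p in pupil_locations], dtype=float)
    if xs.std() == 0:
        return False                     # no horizontal movement at all
    t = np.arange(len(xs), dtype=float)  # sample index stands in for time
    r = np.corrcoef(t, xs)[0, 1]         # Pearson correlation of x vs. time
    return abs(r) > r_cutoff
```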
The image above shows the whole system in action. On the left is the raw eye camera frame and on the right is the processed version of it with the red dot being the current location of the pupil, the purple circles being the past 1-second’s locations, and the large red circle estimating the entire extent of the pupil.
The whole system tracked the pupil really well and reliably fired off frames to AWS when we were reading. It wasn’t perfect though, as during our demo I was frequently looking between multiple people’s eyes while explaining the functionality and that incorrectly fired off quite a few reading events. To alleviate that issue, we introduced a cooldown which limited the rate at which reading events fired, thus bringing the false events under control.
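Such a cooldown can be as simple as a wrapper that drops events arriving inside the cooldown window; the window length here is illustrative, not the value from our demo:

```python
import time

def make_rate_limited(fire, cooldown_s=3.0):
    """Wrap an event handler so it fires at most once per cooldown window."""
    last_fired = [0.0]
    def wrapper(*args, **kwargs):
        now = time.monotonic()
        if now - last_fired[0] < cooldown_s:
            return False               # still cooling down; drop the event
        last_fired[0] = now
        fire(*args, **kwargs)
        return True
    return wrapper
```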
Once again, the code can be found on GitHub.
In my next post I’ll go more in depth into the AWS framework for extracting the text and tweeting the result.