A foundation on sinking sand

This experience has been unlike any other, to be quite honest.

One thing that is important to remember -- plans can change at a moment's notice.

About Story Squad

Story Squad is the brainchild of a former educator, Graig Peterson. Seeing the scourge of technology and its dire effects on young, developing children, Graig sought a way to encourage creative writing and reading while keeping screen time to a minimum.

Here's how it works: Users of the website/app are primarily children aged 8-12. Each week, they have an opportunity to read, write, draw and play a competitive game with a partner to be placed on a weekly leaderboard.

First, the children are given a story to read -- usually 2-5 pages of an ongoing series. After reading the story, they are instructed to write, by hand, a spinoff of the story -- sometimes related, other times not. The prompt also includes a drawing. Giving children the space to create their OWN stories and pictures encourages introspection, empathy, and creativity, among other great soft skills and behaviors.

After their drawings and writing are complete, they upload them to Story Squad. The stories are transcribed and put through a custom complexity analysis algorithm that assigns a score to each story. The drawings are screened for inappropriate content and then sent to a moderator for further review. Once all submissions have been checked, the users' scores are sent to a clustering algorithm, which groups users into teams of two based on their complexity scores.
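We didn't work on the matchmaking step ourselves, but the idea can be sketched simply: sort users by complexity score and pair neighbors, so teammates are evenly matched. Everything below (the function name, the users, the scores) is hypothetical and only for illustration -- it is not Story Squad's actual clustering code.

```python
# Hypothetical sketch of the matchmaking idea: pair users with similar
# complexity scores by sorting them and grouping adjacent neighbors.
# Names and score values are made up for illustration.

def pair_by_complexity(scores):
    """scores: dict of user -> complexity score. Returns a list of 2-person teams."""
    ranked = sorted(scores, key=scores.get)  # lowest complexity score first
    # Take adjacent users two at a time; an odd user out is left unpaired.
    return [tuple(ranked[i:i + 2]) for i in range(0, len(ranked) - 1, 2)]

teams = pair_by_complexity({"ana": 42.0, "ben": 87.5, "cam": 44.1, "dia": 90.2})
print(teams)  # neighbors in score order end up on the same team
```

A real system would likely use a proper clustering algorithm rather than simple sorting, but the goal is the same: children compete against peers writing at a similar level.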

Each user is given 100 points to divvy up between each drawing and story (for a total of 200 points per team). Then, they get paired head-to-head against another team in the cluster and they vote on the best story and drawing.

Then, it starts all over again the following week. Scores get updated to the leaderboard, and the users can keep track of their progress.

Here is a link to some more information about Story Squad.

Enter Data Science

My team's scope of work for this project was daunting at first. Nobody on our team had worked with Optical Character Recognition (OCR) before. It was all new.

Our biggest concern was learning how to utilize the framework of past cohorts that had implemented different methods before us. How would those processing pipelines and functions work? How does the new model compare to the older OCR model that was first used?

These lingering questions seemed to have no end. Everyone was scant on details, nobody had answers, and the questions kept growing.

As we dug deeper into the codebase and researched our scope of work, it became apparent that the first thing we would need to do was become quite familiar with Tesseract.

Tesseract OCR

For those who don't know what Tesseract is, it's an open-source Optical Character Recognition engine that was created by HP and later developed by Google. After a while, Google moved on to its own proprietary product, Cloud Vision. At the time of our onboarding with Story Squad, the app and its functionality were based on Google's Cloud Vision API. The problem for a fledgling startup like Story Squad is that Google charges a fee for each call to the API. Now imagine a day when hundreds or maybe thousands of children submit images to be transcribed: the costs could be staggering.

By using Tesseract, the company would not only save money, but also have a model that was built, tuned, and functioning exactly for its purposes. With that in mind, we set off to begin creating an OCR model with Tesseract. As we found out, it was no easy feat.

The Training Problem

Most of the issues we found with Tesseract stemmed from two places: installation and training -- fundamental processes required to even begin exploring Tesseract.

There are Python wrappers that expose Tesseract's functionality, but creating a custom model that we could tune requires a full installation. Between the hefty list of libraries required to build a trainable model and an aging operating system, I was unable to build a properly functioning model on my local machine. This turned out not to be a problem, as our scope of work was about to change.
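For basic transcription, the wrappers are enough -- a full source installation is only needed to train a custom model. As a sketch of what an ordinary invocation looks like, here's how the Tesseract command line can be assembled (this assumes the `tesseract` binary is installed separately; wrappers like pytesseract build essentially the same call):

```python
# Sketch of how Tesseract is typically invoked on a single line image.
# Assumes the `tesseract` binary is installed; pytesseract wraps this same CLI.

def tesseract_cmd(image_path, out_base, psm=7, lang="eng"):
    """Build the CLI invocation. --psm 7 treats the image as a single text line."""
    return ["tesseract", image_path, out_base, "-l", lang, "--psm", str(psm)]

cmd = tesseract_cmd("segment_001.png", "segment_001", psm=7)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
# Or via a wrapper: pytesseract.image_to_string(Image.open("segment_001.png"))
```

The page segmentation mode (`--psm`) matters for our use case: since the cleaned data is single lines of handwriting, mode 7 ("treat the image as a single text line") is the natural fit.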

The main issue, we discovered, is that Tesseract does well at recognizing digital text and fonts, but its recognition of handwriting is abysmal. Our entire premise is rooted in the detection of handwriting, so to get Tesseract to recognize it, we would have to train it on a large corpus of data.

Thankfully, we had such a dataset, but it required a great deal of manual editing and preparation before it could even be used.

Our team split into two tracks. One group would focus on procuring more data and possibly generating synthetic data. The other group would help prepare the existing data for training.

After two weeks of scrambling and finding brick walls behind every door we opened, we began to realize that we would not be training a model. Our first task was even more basic than that.

During a meeting with the stakeholders and their data science team, we refined our project scope to begin where the need was greatest -- data preparation.

The Process

The data cleaning is an iterative two-step process in which we edit page "segments" and an accompanying transcription file so the model can work on single lines and train its recognition of characters.

Most of our files ended up cleaning up like this (before on top, after on bottom):
[Image: before-and-after comparison of a cleaned segment]

Line by line, page by page, we began editing images of handwriting and creating "ground truth" text files that are a true transcription of what is in the image. That's the first part of training.
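Tesseract's LSTM training tooling (tesstrain) expects each line image to sit next to a `.gt.txt` file holding the exact transcription. A minimal sketch of writing those pairs -- the filenames and text here are made up, not from our actual dataset:

```python
import pathlib
import tempfile

# Sketch of the ground-truth convention used by Tesseract's tesstrain tooling:
# each line image (e.g. line_001.png) gets a sibling line_001.gt.txt file
# containing the exact transcription of the handwriting in that image.

def write_ground_truth(folder, transcriptions):
    """transcriptions: dict of image filename stem -> transcribed text."""
    folder = pathlib.Path(folder)
    for stem, text in transcriptions.items():
        (folder / f"{stem}.gt.txt").write_text(text + "\n", encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    write_ground_truth(tmp, {"line_001": "Once upon a time",
                             "line_002": "in a land far away"})
    print(sorted(p.name for p in pathlib.Path(tmp).iterdir()))
    # ['line_001.gt.txt', 'line_002.gt.txt']
```

The transcription has to match the image exactly -- a single wrong character teaches the model the wrong label -- which is why this step needed so much careful manual review.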

The second part, which we didn't even get to, was to edit "boxfiles" around each letter of the segments. After the images have been processed and edited, they get run through Tesseract again to create bounding boxes on each detected letter.

These boxfiles need to (within reason) select individual characters from a segment and label them with the correct character. Doing this gives the model a boundary and a label, so it can detect the variations of that labeled character and use them to classify new characters.
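Tesseract's box file format is plain text, one line per character: the character itself, four pixel coordinates (left, bottom, right, top, with the origin at the image's bottom-left), and a page number. A small parser to make the format concrete -- the sample coordinates are invented:

```python
# Parse Tesseract's box file format: one line per character, of the form
#   <char> <left> <bottom> <right> <top> <page>
# Coordinates are in pixels, measured from the bottom-left of the image.

def parse_boxfile(text):
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        ch, left, bottom, right, top, page = parts
        boxes.append((ch, int(left), int(bottom), int(right), int(top), int(page)))
    return boxes

sample = "T 12 30 25 52 0\nh 27 30 40 52 0"
print(parse_boxfile(sample))
```

Editing a box file is exactly the manual work described above: nudging those four coordinates until each box tightly fits one handwritten character, and correcting the label when Tesseract guessed wrong.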

Remember when you fill out a form and it has individual boxes for each letter? That's essentially what we are doing here, except we have to fit boxes over naturally written characters.

The Lesson

While I started the labs project with high hopes of having a meaningful impact with this cool idea, my expectations had to change.

Despite the large amount of data to be processed and the troubles I had in trying to implement a Tesseract model to begin training, I still had a great experience.

Thinking on your feet, being able to change pace or direction, and digging down to root causes are all great skills to have. Our team definitely flexed those skills over the last 8 weeks.

Although my expectations had to change, it was a necessary step toward the end goal: a functioning model that can detect handwriting. At the time of writing, the work is still not complete; there are quite a few more stories to edit and box before training. Still, with the limited training the Story Squad team was able to accomplish using a small batch of training data, the results seemed promising.

Consider the variation in handwriting: individual styling, image quality, lighting conditions, etc. This broad spectrum of variation can make a task that requires consistency very difficult. Overcoming that variation is the goal.

Future Considerations

Though our contribution to the overall project was small, it's still exciting to be part of something bigger. Seeing even small parts of what we did as vital to the project matters, because it takes a team. It takes incremental steps. It takes diligent focus to knock down barriers and move toward the goal.

Not only was this a great experience in being part of a team, but the lessons we learned as a group will be timeless. Teamwork, checking our egos at the door, and relentless drive are lasting experiences that will help further my career as I seek out employment.

I can't wait to see how the final product turns out. I know with my own children, screen time is a constant battle. Encouraging their growth and creativity through writing, drawing, and reading will be an amazing opportunity.
