In this series of guest blog posts, 99designs intern Daniel Williams takes us through how he has applied his knowledge of Machine Learning to the challenge of classifying Swiftly tasks based on what the customer requests.
Swiftly is an online service from 99designs that lets customers get small graphic design jobs done quickly and affordably. It’s powered by a global network of professional designers who tackle things like business card updates and photo retouching in 30 minutes or less – an amazing turnaround time for a service with real people in the loop!
With speed central to the service’s value, any moment wasted in allocating a task to a designer with experience in its specific requirements could have a detrimental impact on the customer’s experience.
With the ultimate aim of completely and accurately automating job-to-designer matching – the customer simply says in their own terms what they need – we decided to apply machine learning to further develop Swiftly’s “Intelligent Matching System”.
This is part two of a three-part blog series. In part one we tried to determine the types of tasks. In this post, we use machine learning to classify tasks into these task categories. A future post will discuss using our predictions for task allocation.
Categories to predict
To set up a machine learning problem, we need to first decide on what we want the answers to be. After the last post’s experimentation, I decided to split the classification into two parts: what type of document is to be edited or created, and what type of work is needed on the document.
This gives us 7 document types:

- Template work (ppt / pdf / word etc)
- Header / Banner / Ad / Poster / Flier
- Logo
- Business Card
- Icon
- Social Media
- Other Image
and 9 types of graphic design work appropriate for small tasks. For example, one task might be Vectorisation on a Logo; another might be Text Change on a Business Card. In total, 63 different combinations of document and work type exist (7 document types × 9 work types). This is what we’re trying to predict.
Obtaining training data
In my last post, I used unsupervised techniques that don’t need training data. Now that we have a specific outcome we’d like to predict, supervised methods are more appropriate. They use training data to find patterns associated with each category – patterns that might be hard for humans to spot. For us, that training data will be a bunch of historical tasks and the correct categories for them.
However, obtaining good training data is a large problem in itself, especially given how many combinations of categories there are!
Knowing how much work was involved, my first instinct was to outsource it to Amazon’s Mechanical Turk service. Mechanical Turk is named after an elaborate 18th century hoax that was exhibited across Europe, in which an automaton could play a strong game of chess against a human opponent. It was a hoax because it was not an automaton at all: there was a human chess player concealed inside the machine, secretly operating it.
Amazon calls its service Artificial Artificial Intelligence, and it is a form of ‘fake’ machine learning. We use software to submit tasks for classification, but real people all over the world get paid a little money to do the categorising for us.
Unfortunately, the results I achieved from Mechanical Turk were poor. Even humans incorrectly classified many tasks, and this data, if fed into my machine learning classifier, would lead it to poor conclusions and low accuracy. The Turkers may have lacked some specialised knowledge about graphic design, or I may not have set up the Mechanical Turk task sufficiently well. (I wish I had read this post before diving into Mechanical Turk!)
Ultimately, having an accurate training set is perhaps the most important part of developing a good classifier. I rolled up my sleeves, and manually inspected and classified approximately 1200 Swiftly design briefs myself. This was slow and monotonous, but it meant that I knew I had an excellent quality training set.
From briefs to features
Our classifier doesn’t accept raw text; instead, we must turn each design brief into features it can make decisions on. Human language is complicated, so there are many steps to go from text to features. Any good natural language system has such a pipeline. In ours, we:
- Tokenise: split the text up into individual ‘words’
- Remove punctuation and casing
- Remove stop words (common words with no predictive power such as ‘a’, ‘the’)
- Perform stemming (reducing words to their ‘stem’, e.g. “bounced”, “bounce”, “bouncing” and “bounces” all become “bounc”)
- Perform lemmatisation (see below)
- Convert from words (“unigrams”) to word pairs (“bigrams”)
We covered the first four steps in the last post; let’s go over steps 5 and 6 here.
Lemmatisation is similar to stemming. It’s the process of grouping related words together by replacing several variations with a common shared symbol. For example, Swiftly task descriptions often contain URLs. Lemmatisation of URLs would mean replacing every URL with a common placeholder (for example “$URL”). So the following brief:
On this business card, please change “www.coolguynumber1.com” to “www.greatestdude.org”
becomes:

On this business card, please change “$URL” to “$URL”
We do this because pre-processing generates a vocabulary of every word that appears in the training dataset, and words that occur only once or twice are removed because they add noise. Nearly every URL we see in a brief is unique, so without lemmatisation we lose all the information carried by the presence of URLs: our machine learner can only say something useful about features shared between different tasks, and all these one-off words and URLs are wasted. With lemmatisation, we instead see the symbol “$URL” many times. If the presence of a URL in a task description turns out to be a discriminating feature, this should increase classification accuracy.
Other lemmas that I used included: dimensions (e.g. 300px x 400px), email addresses, DPI measures and hexadecimal colour codes (e.g. #CC3399). With these, the following (entirely fictional) task description transforms from:
Please change the email on this business card from firstname.lastname@example.org to email@example.com. Can you also include a link to my website www.coolestguyuknow.net on the bottom? Please also change all the fonts to #CC3399 and the circle to #4C3F99. I want a few different business card sizes, namely: 400 x 400, 30 x 45 and 5600 by 3320. Thanks!
to:

Please change the email on this business card from $EMAIL to $EMAIL. Can you also include a link to my website $URL on the bottom? Please also change all the fonts to $CHEX and the circle to $CHEX. I want a few different business card sizes, namely: $DIM, $DIM and $DIM. Thanks!
Now URLs, email addresses, dimensions and so on can all take many different forms. The easiest way to match as many as possible is to use regular expressions. I used regex patterns (for Python’s re module) to perform my lemmatisation – you might find similar patterns useful too.
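As a stand-in for the original snippet, here is a minimal sketch of what such patterns might look like. The exact expressions and the lemmatise helper below are illustrative reconstructions, not the originals, but they handle the examples in this post:

```python
import re

# Illustrative lemmatisation patterns -- assumptions, not the originals.
LEMMAS = [
    (re.compile(r"\bhttps?://\S+|\bwww\.\S+", re.IGNORECASE), "$URL"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[A-Za-z]{2,})+"), "$EMAIL"),
    (re.compile(r"\b\d+\s*(?:px)?\s*(?:x|by)\s*\d+\s*(?:px)?\b", re.IGNORECASE), "$DIM"),
    (re.compile(r"\b\d+\s*dpi\b", re.IGNORECASE), "$DPI"),
    (re.compile(r"#[0-9A-Fa-f]{3,6}\b"), "$CHEX"),
]

def lemmatise(text):
    """Replace URLs, emails, dimensions, DPI values and hex colours
    with shared placeholder symbols."""
    for pattern, symbol in LEMMAS:
        text = pattern.sub(symbol, text)
    return text

print(lemmatise("Change #CC3399 on www.coolestguyuknow.net to 400 x 400 at 300dpi"))
# -> 'Change $CHEX on $URL to $DIM at $DPI'
```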
Previously I had worked with each word in the text individually (“unigrams”), but this often means words lose their context. So, for example, “business card” was broken into “business” and “card”, and the importance of those words appearing together was lost. Bigrams are simply pairs of words that appear next to each other. So, if we include both unigrams and bigrams, the text “business card” would provide us the features “business”, “card” and “business card”. This captures more of the context of certain phrases. We also looked at the most frequent bigrams in our data after stemming; generating them is nearly a one-liner, as the sketch below shows.
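A minimal sketch in plain Python (illustrative – the post doesn’t tie this step to any particular library):

```python
def unigrams_and_bigrams(tokens):
    """Return unigram features plus adjacent-pair (bigram) features."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(unigrams_and_bigrams(["busi", "card"]))
# -> ['busi', 'card', 'busi card']
```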
The pipeline in action
Let’s do a worked example using the sentence below:
Please change the email on this business card from firstname.lastname@example.org to email@example.com. Thanks!
Our pipeline first tokenises the sentence into words. Follow each word from left to right in the table below to see how it gets transformed by the pipeline.
| Step 1: Tokenisation | Step 2: Punctuation / case removal | Step 3: Stop word removal | Step 4: Stemming | Step 5: Lemmatisation |
| --- | --- | --- | --- | --- |
| Please | please | – | – | – |
| change | change | change | chang | chang |
| the | the | – | – | – |
| email | email | email | email | email |
| on | on | – | – | – |
| this | this | – | – | – |
| business | business | business | busi | busi |
| card | card | card | card | card |
| from | from | – | – | – |
| firstname.lastname@example.org | firstname.lastname@example.org | firstname.lastname@example.org | firstname.lastname@example.org | $EMAIL |
| to | to | – | – | – |
| email@example.com | email@example.com | email@example.com | email@example.com | $EMAIL |
| . | – | – | – | – |
| Thanks | thanks | – | – | – |
| ! | – | – | – | – |

(A dash means the token has been removed by that step.)
Finally we generate bigrams, which leaves us with the following list of features: “chang”, “email”, “busi”, “card”, “$EMAIL”, “chang email”, “email busi”, “busi card”, “card $EMAIL” and “$EMAIL $EMAIL”.
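For the curious, here is a compact, illustrative sketch of the whole pipeline in Python. The stopword list is a toy subset, NLTK’s PorterStemmer stands in for whichever stemmer was actually used, and email lemmatisation is applied up front (rather than last, as in the table) because substituting before tokenising is simpler. On the example brief it reproduces the feature list above:

```python
import re
from nltk.stem import PorterStemmer  # pip install nltk

# Toy stopword list -- a real one would be much longer.
STOPWORDS = {"a", "an", "the", "on", "this", "from", "to", "please", "thanks"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[A-Za-z]{2,})+")
stemmer = PorterStemmer()

def features(brief):
    text = EMAIL.sub("$EMAIL", brief.lower())      # lemmatise emails up front
    tokens = re.findall(r"\$?[A-Za-z]+", text)     # tokenise, dropping punctuation
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [t if t.startswith("$") else stemmer.stem(t) for t in tokens]
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(features("Please change the email on this business card "
               "from firstname.lastname@example.org to email@example.com. Thanks!"))
# -> ['chang', 'email', 'busi', 'card', '$EMAIL', '$EMAIL',
#     'chang email', 'email busi', 'busi card', 'card $EMAIL', '$EMAIL $EMAIL']
```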
As discussed in the last post, we need to convert text into a numerical format. I used a simple model known as the bag-of-words vector space model. This model represents each document as a vector: a count of how many times each distinct word occurred in it. The vector has n dimensions, where n is the total number of distinct terms across the whole collection of documents; in our training dataset, that is 9186 tokens. Each brief’s vector is sparse – the vast majority of terms have a count of 0.
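As a minimal sketch (illustrative; the post doesn’t say how this step was implemented), counting features against a fixed vocabulary looks like this:

```python
from collections import Counter

def to_vector(features, vocabulary):
    """Bag-of-words: count how often each vocabulary term appears."""
    counts = Counter(features)
    return [counts[term] for term in vocabulary]

# Tiny vocabulary for illustration; the real one had 9186 terms.
vocab = ["busi", "card", "chang", "email", "$EMAIL", "busi card"]
print(to_vector(["chang", "email", "busi", "card", "busi card"], vocab))
# -> [1, 1, 1, 1, 0, 1]
```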
Once the data set has been converted into vectors, it can be used to train a supervised learning algorithm.
Supervised Learning: Training the Classifier
Now that our data’s in the desired format, we can finally develop a system that learns to tell the difference between the various categories. This is called building a classifier model. Once the model has been built, new briefs can be fed into it and it will predict their category (called their label).
[Diagram: the supervised classification workflow – image credit: NLTK]
What we’ve discussed so far is getting labels and extracting features using our pipeline. But what algorithm should we use?
Multinomial Naive Bayes
I chose the Multinomial Naive Bayes (“MNB”) classifier for this task. The Naive Bayes Wikipedia page does a good job of explaining the mathematics behind the classifier in detail. Suffice it to say that it is simple, computationally efficient, and has been shown to work surprisingly well in the field of document classification.
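The post doesn’t say which library was used for training; as one hedged illustration, here is how an MNB classifier could be trained with scikit-learn (the toy briefs and labels below are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training examples; the real training set was ~1200 labelled briefs.
briefs = ["change the text on my business card",
          "please vectorise my logo",
          "resize this banner ad for my site"]
labels = ["Business Card", "Logo", "Header / Banner / Ad / Poster / Flier"]

# ngram_range=(1, 2) gives unigrams and bigrams; MultinomialNB applies
# Laplace (add-one) smoothing by default via alpha=1.0.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(briefs, labels)
print(model.predict(["update the logo on my business card"]))
```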
A (simplified) worked example
A simplified way of thinking about how the algorithm works in the context of document classification is:
- For each token in the total training dataset, what is the probability of that token being associated with each class?
- For a particular brief, add up those probabilities per class across its tokens
- Pick the class with the highest probability
So, say we have the following probabilities (after Laplace smoothing and normalisation) for the tokens from our earlier example occurring in each category type:

[Table: per-token probabilities for each document type – Other Image, Header / Banner / Ad / Poster / Flier, Logo, Business Card, Template work (ppt / pdf / word etc), Icon, Social Media]
Given the brief:
update the logo on my business card
We would match up each token with its probabilities in the table above, giving us the following table. Adding up each column then gives us a score for that class.

[Table: probabilities of “updat”, “logo”, “busi” and “card” under each of the seven document types, with a summed score per column]
Business card has the highest score, and so that is our prediction. Simple! The mathematics is a little more sophisticated than this, but the intuition behind it is the same.
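As a toy illustration of that scoring (all probabilities below are invented purely to show the mechanics):

```python
# Invented per-class token probabilities, purely to show the mechanics.
probs = {
    "updat": {"Logo": 0.10, "Business Card": 0.10},
    "logo":  {"Logo": 0.40, "Business Card": 0.10},
    "busi":  {"Logo": 0.05, "Business Card": 0.45},
    "card":  {"Logo": 0.05, "Business Card": 0.40},
}

def classify(tokens, classes=("Logo", "Business Card")):
    """Sum each class's token probabilities and pick the highest."""
    scores = {c: sum(probs[t][c] for t in tokens if t in probs) for c in classes}
    return max(scores, key=scores.get)

print(classify(["updat", "logo", "busi", "card"]))  # -> Business Card (1.05 vs 0.60)
```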
Now, we have two types of classes to predict: document type and task type. I decided to build the classifier structure to reflect this. A top-level classifier predicts the document type (business card, etc.), trained using the full dataset. Then we have a separate specialised classifier for each document type, which predicts the task category. So, we have a classifier just for working out the task type of business card cases, trained only on those cases.
The training and classification process is summarised in these handy diagrams.
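In lieu of the diagrams, here is a structural sketch of the scheme. The TwoTierClassifier class is hypothetical, assuming scikit-learn-style models with fit and predict:

```python
# Hypothetical structural sketch of the two-tier scheme; `make_model`
# is any factory returning a scikit-learn-style model with fit/predict.
class TwoTierClassifier:
    def __init__(self, make_model):
        self.make_model = make_model
        self.top = make_model()   # predicts document type
        self.subs = {}            # one task-type model per document type

    def fit(self, briefs, doc_labels, task_labels):
        self.top.fit(briefs, doc_labels)
        for doc in set(doc_labels):
            rows = [(b, t) for b, d, t in zip(briefs, doc_labels, task_labels)
                    if d == doc]
            xs, ys = zip(*rows)
            self.subs[doc] = self.make_model()
            self.subs[doc].fit(list(xs), list(ys))  # trained only on this doc type
        return self

    def predict(self, briefs):
        docs = self.top.predict(briefs)
        return [(d, self.subs[d].predict([b])[0]) for b, d in zip(briefs, docs)]
```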
Are we getting good predictions?
To see whether our algorithm is, in fact, learning with experience, we can plot a learning curve. This tells us both how the classifier is doing, and how helpful more data would be. To test this, I plotted the 10-fold cross-validated accuracy of the top-layer classifier as the training set size is increased:
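As a sketch of how such a curve can be computed, scikit-learn’s learning_curve helper does the 10-fold cross-validation directly (assuming model, briefs and labels hold the full ~1200-brief dataset – all assumptions carried over from the earlier sketches):

```python
import numpy as np
from sklearn.model_selection import learning_curve

# Mean 10-fold cross-validated accuracy at growing training-set sizes.
sizes, _, val_scores = learning_curve(
    model, briefs, labels, cv=10, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 10))
for size, acc in zip(sizes, val_scores.mean(axis=1)):
    print(f"{size:5d} training briefs -> {acc:.1%} accuracy")
```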
It looks like our machine is learning! The more data it sees, the better it gets at picking out the correct category. It looks as though accuracy may flatten off at about 80%. This suggests that to do better, we’d need to find new features instead of just collecting more cases. The sub-classifiers, as a result of the classifier structure, have less data to work with in the training set. However, they appeared to follow a similar learning curve.
Accuracy of various implementations
Over the course of my experiments, I tested the accuracy of a variety of implementations and algorithms. For those interested in the details, accuracy figures are below.
| Implementation | Classifier | MNB | NB | Baseline |
| --- | --- | --- | --- | --- |
| Specialised Sub-Classifier | Top Level Classifier | 78.62 % | 60.17 % | 36.33 % |
| | Sub-Classifier | 69.46 % | 61.54 % | 32.97 % |
| | Combined accuracy | 54.61 % | 37.03 % | 11.97 % |
| Generalised Sub-Classifier | Top Level Classifier | 78.62 % | 60.17 % | 36.33 % |
| | Sub-Classifier | 59.97 % | 50.95 % | 24.13 % |
| | Combined accuracy | 47.15 % | 30.66 % | 8.77 % |
| Single Classifier | Accuracy | 45.58 % | 39.12 % | 11.43 % |
The “Specialised Sub-Classifier” is the implementation we discussed above, whereas the “Generalised Sub-Classifier” used a single classifier for task type, rather than one per document type. The “Single Classifier” tries to hit both targets at once, classifying against the full set of 63 category combinations. I also compared Multinomial Naive Bayes against plain Naive Bayes (NB) and a simple Zero-R baseline.
The two-tier classifier approach worked the best, picking the document type correctly nearly 80% of the time, but getting both document and task type right only 55% of the time. The Multinomial Naive Bayes also did better than Naive Bayes on this task, as expected.
How might we improve our results? We could investigate:
- Other classification algorithms (research suggests Support Vector Machines are perhaps the most accurate methods)
- The TF-IDF vector model (as opposed to the bag of words vector model which I currently use)
- Additional metadata for a task beyond just its text
Next time, I will discuss how this system can be applied to assist with the next stage of the customer-to-designer matching process. How do we figure out which categories a particular designer may be good at? And how do we make sure that designer gets those tasks?
Daniel Williams is a Bachelor of Science (Computing and Software Science) student at the University of Melbourne and Research Assistant at the Centre for Neural Engineering where he applies Machine Learning techniques to the search for genetic indicators of Schizophrenia. He also serves as a tutor at the Department of Computing and Information Systems. Daniel was one of four students selected to take part in the inaugural round of Tin Alley Beta summer internships and he now works part-time at 99designs. Daniel is an avid eurogamer, follower of “the cricket”, and hearty enjoyer of the pub.