Amazon ML: The Perfect Marriage of AWS and Machine Learning

By: Asaf Yigal

In recent years, machine learning (ML) and artificial intelligence (AI) have become household terms—at least in techie households. But even experienced developers and product managers still struggle with mastering the basic knowledge and techniques required to successfully integrate ML models into their applications. These techniques often require a wide range of prior knowledge, from math to neuroscience, from probability to linguistics—plus, of course, the programming skills to implement them in code.

Luckily, many new services are now commodifying machine learning, making it much easier to get high-performance models from concept to production within a short space of time. Werner Vogels, Amazon’s CTO, recently wrote that “there has never been a better time than today to develop smart applications and use them. […] during the last 50 years, AI and ML were fields that had only been accessible to an exclusive circle of researchers and scientists. That is now changing, as packages of AI and ML services, frameworks and tools are today available to all sorts of companies and organizations, including those that don’t have dedicated research groups in this field.” This post will explain what machine learning is, and how Amazon’s ML solution helps get it to production quickly.

How Do Machines “Learn”?

Perhaps the clearest and most common analogy for machine learning is this: machine learning is to the human brain what planes are to birds. That is, machine learning is a set of techniques that try to model human learning and cognition in a mathematical and mechanical way. And, just as modern airplanes fly without feathers or beaks, ML models don’t try to mimic the human brain (about which we still know very little), but rather apply known principles of cognition. The number of ML techniques is huge, but they’re usually divided into two major groups: supervised and unsupervised learning.

Supervised models are trained under supervision: a human “teacher” feeds the model with samples, while telling it the correct answer for each. At the end of the process, the model is expected to respond correctly to samples it hasn’t seen during training. In the famous cat/dog image recognition example, the model is exposed to many images of either cats or dogs, and is told what the correct label for each one is. It then learns what features are relevant for each species, until it can successfully recognize images of cats and dogs it hasn’t seen before. (Another supervised model that has recently become famous, thanks to an episode of HBO’s Silicon Valley, is an app that classifies “Hot Dogs” and “Not Hot Dogs” by employing exactly the same technology used for classifying cats and dogs.) Supervised models account for most machine learning that users currently encounter, from image and voice recognition to text classification.

If a machine that can learn under supervision seems magical, unsupervised learning models seem even more so. An unsupervised model is fed data, and is expected to extract features from it without human interference. How can it do that? Well, since it doesn’t receive any external feedback, an unsupervised model needs to be designed in such a way that will guide it to make the required inferences—just like a heart-shaped baking pan shapes a cake in an “unsupervised” way.

Unsupervised models are less common in consumer-facing applications, but they’re still present all around us: the most common unsupervised models learn to divide a set of samples into smaller groups by learning the features that characterize different subsets of individuals—a process called clustering. A neat example from recent years is word2vec—an unsupervised model that automatically learns the semantic similarity of words based only on the context in which they appear in texts.
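To make clustering a little more concrete, here is a minimal sketch of k-means, one common clustering technique, applied to a handful of numbers. The function name kmeans_1d and the sample heights are made up for illustration, and real applications would normally rely on a library rather than hand-written code like this.

```python
import random

def kmeans_1d(values, k, iterations=20, seed=0):
    """Cluster numbers into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)  # start from k of the data points
    assignments = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = sum(members) / len(members)
    return centroids, assignments

# Hypothetical data: nobody tells the model which heights belong together.
heights_cm = [150, 152, 155, 180, 183, 185]
centroids, assignments = kmeans_1d(heights_cm, k=2)
print(sorted(centroids))
print(assignments)
```

Note that the only guidance the model receives is the number of groups, k, and the rule "stay close to your group's average". That design plays the role of the baking pan.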

Machine Learning Outside the Cloud

Given the current state of things, ML researchers and developers are often left with a double challenge: first, to design ML models that will be used in real-world products; second, to implement them in code. Individually, these are extremely challenging, since both require mastery of ever-evolving fields: new ML models and techniques are being published on a daily basis, and new technologies are developed constantly for deployment. Keeping up with both of them is a cruel task. For almost any company that is not a Google/Amazon/Apple-size “BigCo,” building even the simplest ML model would require developers and researchers with a broad range of skills, and could take months before being shippable to production.

Hand-Crafted Models and Hyper-Parameters

Say, for example, you’re building an application for user-generated movie reviews, and would like to know whether a review that a user has just typed is positive or negative. Nowadays, this kind of functionality has become so common that some product managers may naturally assume that any developer can easily implement it in a few workdays.

But even this seemingly simple model can require a lot of work before writing the first line of code. First, we need to know that this kind of model is called a text classifier (this specific one is actually called a sentiment analysis model). A text classifier is a supervised model that takes as input a sentence or paragraph, and outputs a predicted label (in this case, “positive” or “negative”).
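As a rough illustration of what such a classifier does, the sketch below trains a tiny naive Bayes model on four made-up reviews using only the Python standard library. Naive Bayes is just one possible choice of algorithm, and the function names and sample reviews are assumptions for this example. A production model would need far more data and care.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (text, label). Returns per-label word counts and label counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Return the label with the highest naive Bayes score for the text."""
    word_counts, label_counts = model
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_samples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihood, with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_samples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled reviews: the "supervision" is the second element.
reviews = [
    ("a wonderful moving film", "positive"),
    ("great acting and a great story", "positive"),
    ("boring plot and terrible acting", "negative"),
    ("a dull terrible mess", "negative"),
]
model = train(reviews)
print(predict(model, "great story"))          # positive
print(predict(model, "terrible and boring"))  # negative
```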

Then, say we naturally decide to implement our classifier in Python. After Googling the best way to implement such a model, we may opt for scikit-learn, a wonderful package that offers many ML models for different use cases. But scikit-learn’s vast offering is of little use if we don’t know what we’re selecting from: when picking a model, we need to consider a lot of requirements, including, to name just a few, the number of training samples we have, the language the text is written in, and the number of different labels we want to assign to texts. That’s just two in our case, but for many other applications the number of labels may be huge, which will require a more complex model.

After deciding on the model itself comes the tedious part of tweaking the settings that control how it learns and behaves. These are called hyper-parameters, to distinguish them from the parameters the model learns from the data. Hyper-parameters such as the “learning rate” (how large a correction the model makes after each mistake) need to be chosen, using either trial and error or more systematic techniques (like grid search).
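Here is a small sketch of what a grid search over the learning rate can look like. The model is deliberately trivial (a single slope fitted by gradient descent), and the data points and candidate rates are invented for the example. The idea is simply to train once per candidate value and keep the one that does best on data held out from training.

```python
def train_slope(points, learning_rate, steps=50):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(steps):
        gradient = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= learning_rate * gradient
    return w

def mean_squared_error(w, points):
    return sum((w * x - y) ** 2 for x, y in points) / len(points)

# Hypothetical data, split into a training part and a held-out validation part.
train_points = [(1, 2.1), (2, 3.9), (3, 6.2)]
validation_points = [(4, 8.1), (5, 9.8)]

# The "grid": every candidate value of the hyper-parameter is tried in turn.
results = {}
for learning_rate in [0.0001, 0.001, 0.01, 0.1]:
    w = train_slope(train_points, learning_rate)
    results[learning_rate] = mean_squared_error(w, validation_points)

best = min(results, key=results.get)
print(best, results[best])
```

With several hyper-parameters, the grid becomes every combination of candidate values, which is why the number of training runs grows so quickly.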

As you may have noticed, even having considered all these parameters, we still haven’t resolved a major part of the problem: the data. No matter how clever and well-tweaked our model is, it will undoubtedly produce garbage if we don’t have good data to train it on. However, since the purpose of this blog post is to review a solution for ML models, we won’t be covering this issue, because there’s no easy shortcut—apart from “simply” getting better data.

Infrastructure for Training and Serving

Now that we’ve finally designed and implemented our model, we’re faced with a double-headed infrastructure problem: where to train the model, and where to serve it from. For training, we usually need a dedicated machine with large memory and processing capacities to support both the large data sets and computation required (deep learning models, which we don’t cover here, require even more robust and costly machines).

We also need to set up infrastructure to easily train our models, inspect their performance, and then deploy and sync them between environments. To serve a model in production through an API endpoint, we need to set up more machines to load the model, pass requests through it, and respond with predictions.

All in all, we’re dealing with quite a setup, just to train and serve a simple model. Not quite what our product manager expected when they asked for a simple text classifier, is it?

Enter Amazon ML

Thanks to Amazon ML, we can skip almost this entire setup, and go straight to more important stuff, like collecting our data. Amazon ML is Machine Learning as a Service, in the spirit of existing Amazon solutions, that saves you most of the problems mentioned above: first, it lets you choose from a list of industry-standard ML models suitable for many basic and advanced use cases, such as behavioral classification and predictions. On top of that, it lets you train and serve your models in the cloud, without having to set up the infrastructure.

Automatic Data Processing and Ready-Made Models

When setting up a new model in Amazon ML, we first need to upload our data. Data needs to be CSV-formatted, with the first row containing the name of each data field, and each following row containing the data samples (called observations in Amazon ML). Training data sets can be huge, so they need to be uploaded from either Amazon S3 or Redshift storage.
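For the movie reviews example, a file in this layout could be produced with Python's csv module, as in the sketch below. The column names review and sentiment and the two sample rows are assumptions for illustration. The resulting text would still need to be uploaded to S3 or Redshift separately.

```python
import csv
import io

# Hypothetical observations for the movie reviews example.
observations = [
    {"review": "A wonderful, moving film", "sentiment": "positive"},
    {"review": "Boring plot and terrible acting", "sentiment": "negative"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["review", "sentiment"])
writer.writeheader()            # first row: the name of each data field
writer.writerows(observations)  # following rows: one observation each
csv_text = buffer.getvalue()
print(csv_text)
```

The csv module takes care of quoting, so the comma inside the first review does not break the row into extra fields.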

When you upload your data, Amazon ML lets you define a schema to describe the different types of input it contains (numerical, textual, etc.). This helps Amazon ML tweak the learning algorithm. For example, if the input is textual, Amazon ML will split it into a vocabulary representing all the words seen in the text, which will later be transformed into numerical values that the model can use for prediction. You can also choose to let Amazon ML infer the data types by itself.
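The following sketch shows the general idea of turning text into numbers through a vocabulary. It illustrates the technique only, not Amazon ML's actual processing, and the function names are made up for the example.

```python
def build_vocabulary(texts):
    """Map each distinct lowercase word to an integer index."""
    vocabulary = {}
    for text in texts:
        for word in text.lower().split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary

def to_counts(text, vocabulary):
    """Turn a text into a vector of word counts; unknown words are ignored."""
    vector = [0] * len(vocabulary)
    for word in text.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

texts = ["great film", "terrible film"]
vocabulary = build_vocabulary(texts)
print(vocabulary)                                 # {'great': 0, 'film': 1, 'terrible': 2}
print(to_counts("great great film", vocabulary))  # [2, 1, 0]
```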

After uploading your data, Amazon ML will ask you to select which part of the data is the target variable. In our movie reviews example, the target variable will be the “sentiment” column and will hold two possible values: “positive” and “negative.” Based on this selection, Amazon ML will automatically infer the recommended model as one of three possibilities:

Binary classification model (logistic regression): for classifying data into two categories, like in our positive/negative movie reviews app.

Multiclass classification model (multinomial logistic regression): for classifying data into more than two categories; for example, classifying a product description into categories such as “watch,” “handbag,” or “washing machine.”

Linear regression model: for predicting a numeric value from the other fields in the data; for example, predicting the price a house will sell for.
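The mapping from target variable to model type can be pictured with a simple heuristic like the one below. This is a hypothetical sketch of the decision and not Amazon ML's real logic, which works from the data type declared for the target in the schema.

```python
def recommend_model(target_values):
    """Guess a model family from the values in the target column (illustrative only)."""
    distinct = set(target_values)
    if len(distinct) == 2:
        return "binary classification"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct):
        return "regression"
    return "multiclass classification"

print(recommend_model(["positive", "negative", "positive"]))   # binary classification
print(recommend_model(["watch", "handbag", "washing machine"]))  # multiclass classification
print(recommend_model([250000.0, 310000.0, 199000.0]))         # regression
```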

These models may seem basic (see “What’s Next?” below), but as mentioned, a surprising number of ML use cases boil down to classification or regression problems.

Once the data has been uploaded and the proper model has been selected, Amazon ML offers a lot of features for handling the training process. You can select how the training data should be split (randomly shuffled, sequentially split, or split manually by you), apply common transformations to the data (for example, text can be split into n-grams, i.e., tuples of adjacent words instead of single words, which can improve performance), and then train the model and obtain detailed evaluation statistics (accuracy, etc.).
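Two of these steps are easy to picture in plain Python: building n-grams and splitting the data. The sketch below is illustrative only. The function names and the 70 percent training share are assumptions, not Amazon ML settings.

```python
import random

def ngrams(text, n):
    """Return all tuples of n adjacent words in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def split_rows(rows, train_percent=70, shuffle=False, seed=0):
    """Split rows into (training, evaluation), sequentially or after shuffling."""
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]

print(ngrams("not a good movie", 2))
# [('not', 'a'), ('a', 'good'), ('good', 'movie')]

train_rows, eval_rows = split_rows(range(10))  # sequential split
print(train_rows, eval_rows)
# [0, 1, 2, 3, 4, 5, 6] [7, 8, 9]
```

A sequential split keeps the original order, which matters when later rows should not leak into training (for example, time-ordered data). A shuffled split is the safer choice when the file happens to be sorted by label.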

Deployment and Prediction

Perhaps the biggest advantage of building and training your models with Amazon ML is the simplicity of deploying them right after training. Instead of a costly dedicated server and setup, Amazon ML can immediately open an API endpoint for your model and plug it into your application. The default is a real-time synchronous endpoint for one-by-one predictions. If you need batch predictions, for example if your application processes inputs in bulk, an asynchronous batch prediction service is also available.

Summary

The granular control over models offered by Amazon ML is impressive. It can’t be covered entirely within the limited scope of this article, but it offers almost as much control over your models as if you were building them yourself from scratch.