Intro

Last year, while attending the Microsoft BUILD conference, I got to see the debut of ML.NET in person. After the intro session, I was amazed at the potential applications both at work and in my own personal projects (like the image scaling project I work on from time to time). Since the initial 0.1 release, the team has shipped a new version every month, each adding tons of new features. With a 1.0 release rapidly approaching, I figured it was time to do another deep dive.

As fate would have it, a recent discussion among co-workers turned to employee retention and predicting when a colleague would leave. My previous deep dive into ML.NET covered Binary Classification only, which flips the question around to: given a set of attributes, is the person going to leave or not? Wanting an answer in months rather than a yes/no, and using this as an opportunity to grow my ML skillset, I started my deep dive into SDCA (Stochastic Dual Coordinate Ascent) with a Regression task.

Since my last deep dive into ML.NET, the API has changed considerably (mostly for the better). Fortunately, the team has moved deprecated calls into a Legacy namespace to avoid forcing major refactoring on anyone who wishes to use the latest version (0.8 at the time of this writing).

The Problem

When thinking about factors that could be treated as features in our ML model, I reflected on the various people I have worked with over my career and the data that could have been snapshotted at the time they left or were fired.

A few features at first thought:
  1. Position Name - In case there is a correlation between position and duration
  2. Married or not - Figuring longevity might be longer with the increased financial responsibilities
  3. BS Degree or not - Figuring there may be a correlation, especially for more junior folks
  4. MS Degree or not - Figuring this would be true for more senior level folks
  5. Years Experience - Pretty obvious here
  6. Age at Hire - More youthful hires might be antsy to move for more money/new title
  7. Duration at Job - The label, i.e. the value the model is trained to predict
Knowing this was not a unique thought process, and that I was surely not the first to apply ML to this problem, I went looking and came across a dataset on Kaggle. This dataset, while fictional, was created by IBM Data Scientists and provided exactly what I was looking for: another set of minds thinking about features. Their dataset offered quite a few more features than I had come up with, but included all of mine as well.

Figuring this was validation enough for my little deep dive, I then proceeded to Visual Studio.

ML.NET Implementation

First off, all of the code discussed here is checked into my ML.NET Deep Dive repo. Feel free to clone it, improve it, or give feedback.

To begin I defined my data structure:

public class EmploymentHistory
{
    public string PositionName { get; set; }

    [Label(0, 150)]
    public float DurationInMonths { get; set; }

    public float IsMarried { get; set; }

    public float BSDegree { get; set; }

    public float MSDegree { get; set; }

    public float YearsExperience { get; set; }

    public float AgeAtHire { get; set; }
}
The Label Attribute in this case is a custom attribute I use to filter out anomalous data (anyone there for 0 months or more than 150 months, i.e. 12+ years).
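
The actual definition lives in the repo; a minimal sketch of what such an attribute might look like is below (the Min/Max property names are my assumption, not necessarily what the repo uses):

using System;

// Hypothetical sketch of the custom Label attribute: it marks the label
// property and carries the valid (min, max) range used to filter out
// anomalous rows before training.
[AttributeUsage(AttributeTargets.Property)]
public class LabelAttribute : Attribute
{
    public float Min { get; }
    public float Max { get; }

    public LabelAttribute(float min, float max)
    {
        Min = min;
        Max = max;
    }
}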

As with the older ML.NET API, when you predict data you need both a data structure to train and predict against and an object to hold the prediction itself. In this case we want to predict on the DurationInMonths property, so I defined my EmploymentHistoryPrediction object (the ColumnName attribute maps the field to the Score column, which is where ML.NET's regression trainers write their output):

public class EmploymentHistoryPrediction
{
    [ColumnName("Score")]
    public float DurationInMonths;
}
To keep the code somewhat generic, I wrote a couple of Extension Methods so I can use C# Generics (thinking longer term, I could re-use as much of this code as possible for other applications). These are found in the mldeepdivelib\Common\ExtenionMethods.cs file.
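
Those are worth a peek in the repo; purely as a hypothetical sketch of the kind of helper I mean (not the repo's actual implementation), a reflection-based extension method can pull column names off any POCO, which is what lets the training and prediction code stay generic:

using System.Linq;
using System.Reflection;

public static class ExtensionMethods
{
    // Hypothetical helper: return the names of all properties on T except
    // the one marked with the custom Label attribute, so calls like
    // Concatenate("Features", ...) can be built for any model type.
    public static string[] ToFeatureColumnNames<T>(this T obj) where T : class =>
        typeof(T).GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => p.GetCustomAttribute<LabelAttribute>() == null)
            .Select(p => p.Name)
            .ToArray();
}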

Skipping to the actual ML.NET code, overall the structure of any ML.NET application to train a model is the following:
  1. Create an ML Context
  2. Create your Data Reader (in this case a CSV – TextReader is built in)
  3. Transform and Normalize data (in particular string data)
  4. Choose your Trainer Algorithm
  5. Train the model
  6. Save the model
Thankfully most of these steps are extremely easy, especially compared to TensorFlow (where you have to drop back to Python).
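
Steps 1 and 2 aren't shown in the snippet below, but against the 0.8 API they look roughly like this (trainingDataPath and the column indices here are my assumptions based on the trimmed-down CSV):

// Step 1 - create the ML Context (seed fixed for repeatable results)
var mlContext = new MLContext(seed: 0);

// Step 2 - create the data reader for the CSV file
var textLoader = mlContext.Data.TextReader(new TextLoader.Arguments
{
    Separator = ",",
    HasHeader = true,
    Column = new[]
    {
        new TextLoader.Column("PositionName", DataKind.Text, 0),
        new TextLoader.Column("DurationInMonths", DataKind.R4, 1),
        new TextLoader.Column("IsMarried", DataKind.R4, 2),
        new TextLoader.Column("BSDegree", DataKind.R4, 3),
        new TextLoader.Column("MSDegree", DataKind.R4, 4),
        new TextLoader.Column("YearsExperience", DataKind.R4, 5),
        new TextLoader.Column("AgeAtHire", DataKind.R4, 6)
    }
});

var trainingDataView = textLoader.Read(trainingDataPath);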

For my case it was just a handful of lines to do steps 3 through 6:

// Step 3 - transform and normalize the data: one-hot encode the string column,
// mean-variance normalize the numeric columns, then concatenate into "Features"
var dataProcessPipeline = mlContext.Transforms.CopyColumns(label.Name, "Label")
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("PositionName", "PositionNameEncoded"))
    .Append(mlContext.Transforms.Normalize("IsMarried", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
    .Append(mlContext.Transforms.Normalize("BSDegree", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
    .Append(mlContext.Transforms.Normalize("MSDegree", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
    .Append(mlContext.Transforms.Normalize("YearsExperience", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
    .Append(mlContext.Transforms.Normalize("AgeAtHire", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
    .Append(mlContext.Transforms.Concatenate("Features", "PositionNameEncoded", "IsMarried", "BSDegree", "MSDegree", "YearsExperience", "AgeAtHire"));

// Step 4 - choose the trainer algorithm (SDCA regression)
var trainer = mlContext.Regression.Trainers.StochasticDualCoordinateAscent();
var trainingPipeline = dataProcessPipeline.Append(trainer);

// Step 5 - train the model
var trainedModel = trainingPipeline.Fit(trainingDataView);

// Step 6 - save the model to disk
using (var fs = File.Create(modelPath))
{
    trainedModel.SaveTo(mlContext, fs);
}
Once the model has been saved in the last line, it is trivial to call MakePrediction. Fortunately, I was able to make this method 100% generic:


private static TK Predict<T, TK>(MLContext mlContext, string modelPath, string predictionFilePath) where T : class where TK : class, new()
{
    // Read the prediction input file and deserialize the JSON into T
    var predictionData = Newtonsoft.Json.JsonConvert.DeserializeObject<T>(File.ReadAllText(predictionFilePath));

    ITransformer trainedModel;

    // Load the previously trained model from disk
    using (var stream = new FileStream(modelPath, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        trainedModel = mlContext.Model.Load(stream);
    }

    // Create a prediction function and run the single prediction
    var predFunction = trainedModel.MakePredictionFunction<T, TK>(mlContext);

    return predFunction.Predict(predictionData);
}
The first thing this method does is read in the file, parse the JSON, and deserialize it into type T to feed into the model. It then loads the saved model from disk, creates a prediction function, and returns the prediction.
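
Wired up for this project, a call would look something like the following (a hypothetical call site, not the repo's actual entry point):

// Hypothetical call site: predict a duration for the employee described
// in testdata.json using the model trained earlier
var prediction = Predict<EmploymentHistory, EmploymentHistoryPrediction>(
    mlContext, "ibmmodel.mdl", "testdata.json");

Console.WriteLine($"Predicted duration: {prediction.DurationInMonths:F1} months");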

Findings

With everything compiled, it was time to train the model. I took the Kaggle dataset above, trimmed it down to the features I thought were important, and then called the app like so:


.\mlregression.exe build .\ibmclean.csv ibmmodel.mdl
On my Razer Blade Pro, training took less than 2 seconds. Afterwards, I had my model.

Subsequently, I created some test data to run the model against and get a prediction, like so:


.\mlregression.exe predict ibmmodel.mdl testdata.json 
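
The prediction input is just the EmploymentHistory structure serialized as JSON; a hypothetical testdata.json (with made-up values) might look like this:

{
  "PositionName": "Software Engineer",
  "IsMarried": 1,
  "BSDegree": 1,
  "MSDegree": 0,
  "YearsExperience": 5,
  "AgeAtHire": 27
}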

Closing Thoughts

Given the sample set, this problem is far from anything close to solved. However, it did give me a chance to deep dive into ML.NET's 0.8 API and SDCA, and to dig around for sample data. I'm looking forward to continuing my research into the other Trainers ML.NET offers – stay tuned.