latest posts

This is part two of my deep dive into ML.NET. My last post, on using SDCA to predict turnover, ended up being a pretty good gateway into ML.NET with a relatively easy problem to “solve” and evaluate. Wanting to dive into a different algorithm, I chose k-means for clustering. In addition, Microsoft released version 0.9 of ML.NET. The API has changed slightly, so I have updated the source code for the SDCA regression and commonized the library code to take advantage of the new API/syntax.

Having been interested in security for years and working in it daily, I wanted to switch gears to a security focus. A common problem in the ML world is: once a threat is found, how do you classify it? Labeling it “Abnormal” or “Unsafe,” as some of the industry does, is pretty uninformative in my opinion. This is where clustering comes into play.

The idea behind k-means clustering is to take a group of data points and, based on their features, partition them into k clusters, effectively producing a scatter plot where each point is assigned to its nearest cluster centroid. In my case, each cluster is a threat category such as Trojan, PUA or, generically, Virus. In a production environment you would probably want to break it out further into Worms, Rootkits, Backdoors, etc., but to keep it easy I decided to stick to just those three.

The next piece, which I didn’t need for my last deep dive, is actual feature extraction. Again, to keep things easy, I restricted myself to PE32/PE32+ files and utilized the PeNet NuGet package to extract two features:

  1. Size of Raw Data (from the Image Section Header)
  2. Number of Imports (from the Image Import Directory)
In a production model this would need considerably more features, especially when doing more granular classification. A rough sketch of the extraction is shown below.
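
For reference, a rough sketch of that extraction, assuming PeNet's PeFile type exposes the image section headers and imported functions; the property names here are from memory and may differ by PeNet version, and ExtractFeatures/classification are placeholder names of mine rather than the repo's actual FeatureExtractFile method:

private static ThreatInformation ExtractFeatures(string filePath, string classification)
{
    var peFile = new PeNet.PeFile(filePath);

    return new ThreatInformation
    {
        // Size of Raw Data taken from the first image section header
        DataSizeInBytes = peFile.ImageSectionHeaders[0].SizeOfRawData,

        // Count of functions resolved from the import tables
        NumberImports = peFile.ImportedFunctions.Length,

        Classification = classification
    };
}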

Some Code Cleanup for 0.9

One of the big things I did the other night after updating to 0.9 was commonizing the code further, and luckily the new APIs allow for that. One of the biggest achievements was getting the Predict function 100% generic:


public static TK Predict<T, TK>(MLContext mlContext, string modelPath, T predictionData) where T : class where TK : class, new()
{
    ITransformer trainedModel;

    using (var stream = new FileStream(modelPath, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        trainedModel = mlContext.Model.Load(stream);
    }

    var predFunction = trainedModel.CreatePredictionEngine<T, TK>(mlContext);

    return predFunction.Predict(predictionData);
}

public static TK Predict<T, TK>(MLContext mlContext, string modelPath, string predictionFilePath) where T : class where TK : class, new()
{
    var data = File.ReadAllText(predictionFilePath);

    var predictionData = Newtonsoft.Json.JsonConvert.DeserializeObject<T>(data);

    return Predict<T, TK>(mlContext, modelPath, predictionData);
}
For the clustering I took it a step further, allowing either the type T itself or the file path to a JSON representation of T to be passed in. The reason for this is that typical threat-classification tools like ClamAV or VirusTotal provide the ability to simply upload a file or scan it from a command line.
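
As a quick illustration, calling either overload looks like the following; the model path, JSON file name and extractedThreatInformation variable are placeholders for this example:

var fromObject = Predictor.Predict<ThreatInformation, ThreatPredictor>(mlContext, "clustering.mdl", extractedThreatInformation);

var fromFile = Predictor.Predict<ThreatInformation, ThreatPredictor>(mlContext, "clustering.mdl", "threat.json");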

Another area of improvement was standardizing the command line arguments, especially with future experiments on the horizon. An improved, though not perfect, change was to use an enum:


public enum MLOperations
{
    predict,
    train,
    featureextraction
}
And then in the Program.cs:


if (!Enum.TryParse(typeof(MLOperations), args[0], out var mlOperation))
{
    Console.WriteLine($"{args[0]} is an invalid argument");

    Console.WriteLine("Available Options:");

    Console.WriteLine(string.Join(", ", Enum.GetNames(typeof(MLOperations))));

    return; 
}

switch (mlOperation)
{
    case MLOperations.train:
        TrainModel<ThreatInformation>(mlContext, args[1], args[2]);
        break;
    case MLOperations.predict:
        var extraction = FeatureExtractFile(args[2], true);

        if (extraction == null)
        {
            return;
        }

        Console.WriteLine($"Predicting on {args[2]}:");

        var prediction = Predictor.Predict<ThreatInformation, ThreatPredictor>(mlContext, args[1], extraction);

        PrettyPrintResult(prediction);
        break;
    case MLOperations.featureextraction:
        FeatureExtraction(args[1], args[2]);
        break;
}
Utilizing the enum allowed a quick sanity check of the first argument and then a switch/case for each of the operations. For the next deep dive I will clean this up a bit more, probably with an interface or abstract class to implement for each experiment; a rough sketch of that idea is below.
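
Something along these lines is what I have in mind - purely a sketch of the idea, not code from the repo:

public interface IMLExperiment
{
    void TrainModel(MLContext mlContext, string trainDataPath, string modelPath);

    void Predict(MLContext mlContext, string modelPath, string predictionFilePath);

    void FeatureExtraction(string rawDataPath, string outputDataPath);
}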

K-means Clustering

Very similar to SDCA, the code to train a model was straightforward:

private static void TrainModel<T>(MLContext mlContext, string trainDataPath, string modelPath) where T : class, new()
{
    var modelObject = Activator.CreateInstance<T>();

    var textReader = mlContext.Data.CreateTextReader(columns: modelObject.ToColumns(), hasHeader: false, separatorChar: ',');

    var dataView = textReader.Read(trainDataPath);
    
    var pipeline = mlContext.Transforms
        .Concatenate(Constants.FEATURE_COLUMN_NAME, modelObject.ToColumnNames())
        .Append(mlContext.Clustering.Trainers.KMeans(
            Constants.FEATURE_COLUMN_NAME,
            clustersCount: Enum.GetNames(typeof(ThreatTypes)).Length));

    var trainedModel = pipeline.Fit(dataView);

    using (var fs = File.Create(modelPath))
    {
        trainedModel.SaveTo(mlContext, fs);
    }

    Console.WriteLine($"Saved model to {modelPath}");
}
With the new 0.9 API the text reader has been cleaned up (in conjunction with the extension methods I created earlier). The critical piece to keep in mind is the clustersCount argument passed to the KMeans trainer. You want this number to equal the number of categories you have. To keep my code flexible, since I’m using an enum, I simply calculate its length; I strongly suggest following that path to avoid errors down the road. The rest of the code is generic (room for some refactoring in the next deep dive).
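
The ThreatTypes enum itself isn't shown in this post; based on the three categories mentioned earlier it presumably looks something like this:

public enum ThreatTypes
{
    Trojan,
    PUA,
    Virus
}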

My threat classification class looks like a pretty normal class:

public class ThreatInformation
{
    public float NumberImports { get; set; }

    public float DataSizeInBytes { get; set; }

    public string Classification { get; set; }

    public override string ToString() => $"{Classification},{NumberImports},{DataSizeInBytes}";
}
I overrode ToString() for the feature extraction output, but it is otherwise pretty normal.

My prediction class is a little different than with SDCA:

public class ThreatPredictor
{
    [ColumnName("PredictedLabel")]
    public uint ThreatClusterId { get; set; }

    [ColumnName("Score")]
    public float[] Distances { get; set; }
}
Whereas SDCA and other regression models return values, the k-means trainer returns the cluster it found to be the best fit. The Distances array contains the Euclidean distances from the data passed in for prediction to each cluster centroid. For my case, I added a translation from the cluster id to a human-readable string value (i.e. Trojan, PUA, etc.); a sketch of that mapping follows.
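
A minimal sketch of that translation, assuming the cluster ids line up with the ThreatTypes enum values (the repo's actual mapping may differ):

private static string ToThreatName(uint threatClusterId)
{
    // Cluster ids coming back in PredictedLabel are assumed to be 1-based here
    var clusterIndex = (int)threatClusterId - 1;

    return Enum.IsDefined(typeof(ThreatTypes), clusterIndex)
        ? ((ThreatTypes)clusterIndex).ToString()
        : "Unknown";
}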

Closing Thoughts

In training the data and running the model I was surprised at how quick both were. Digging into the code on GitHub, everything looks to be as parallelized as possible. Having used other technologies that aren’t multi-threaded, this was a refreshing sight. As for working with clustering further, I think the big thing I will work on is making the feature extraction and training scalable and efficient (right now it’s single-threaded and everything is loaded into memory at once).
TAGS
none on this post

Intro

Last year while attending the Microsoft BUILD conference I got to see the debut of ML.NET in person. After going to the intro session, I was amazed at the potential opportunities both at work and in my own personal projects (like the image scaling project I work on from time to time). In the time since the initial 0.1 release they have shipped a new version every month, adding tons of new features. As they rapidly approach a 1.0 release, I figured it was time to do another deep dive.

As fate would have it, a recent discussion among co-workers about employee retention and predicting when a co-worker would leave came up. My previous deep dive into ML.NET covered Binary Classification only, which would flip the question around to: given a set of attributes, is the person going to leave? Using this as an opportunity to grow my ML skillset, I started my deep dive into SDCA (Stochastic Dual Coordinate Ascent) with a regression task.

Since last deep diving into ML.NET the API has changed considerably (mostly for the better), and fortunately they have moved deprecated calls to a Legacy namespace to avoid forcing major refactoring on anyone who wishes to use the latest version (0.8 at the time of this writing).

The Problem

When thinking about factors that could be treated as features in our ML model, I reflected on the various people I have worked with in my career and the data that could have been snapshotted at the time they left or were let go.

A couple of features at first thought:
  1. Position Name - In case there is a correlation between position and duration
  2. Married or not - Figuring longevity might be longer with the increased financial responsibilities
  3. BS Degree or not - Figuring there may be a correlation, especially for more junior folks
  4. MS Degree or not - Figuring this would be true for more senior level folks
  5. Years Experience - Pretty obvious here
  6. Age at Hire - More youthful hires might be antsy to move for more money/new title
  7. Duration at Job - Used to help train the model
Knowing this was not a unique thought process, and that I was surely not the first to use ML for this problem, I came across a dataset on Kaggle. This dataset, while fictional, was created by IBM data scientists and provided what I was looking for: another set of minds thinking about features. Their dataset offered quite a few more features than I had come up with, while still including all of mine.

Figuring this was validation enough for my little deep dive, I then proceeded to Visual Studio.

ML.NET Implementation

First off, all of the code discussed here is checked into my ML.NET Deep Dive repo. Feel free to clone it, improve it, or give feedback.

To begin I defined my data structure:

public class EmploymentHistory
{
    public string PositionName { get; set; }

    [Label(0, 150)]
    public float DurationInMonths { get; set; }

    public float IsMarried { get; set; }

    public float BSDegree { get; set; }

    public float MSDegree { get; set; }

    public float YearsExperience { get; set; }

    public float AgeAtHire { get; set; }
}
The Label attribute in this case is a custom attribute I use to filter out anomalous data (someone there for 0 months or 12+ years). A sketch of what such an attribute might look like is below.
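
For illustration, a minimal sketch of such an attribute (the actual implementation lives in the repo and may differ):

[AttributeUsage(AttributeTargets.Property)]
public class LabelAttribute : Attribute
{
    public LabelAttribute(float minValue, float maxValue)
    {
        MinValue = minValue;

        MaxValue = maxValue;
    }

    public float MinValue { get; }

    public float MaxValue { get; }
}

Rows whose DurationInMonths falls outside that 0 to 150 month window are then skipped before training.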

As with the older ML.NET API, when you predict data you need both a data structure to train and predict against as well as a solution object. In this case we want to predict the DurationInMonths property, so I defined my EmploymentHistoryPrediction object:

public class EmploymentHistoryPrediction
{
    [ColumnName("Score")]
    public float DurationInMonths;
}
To keep the code somewhat generic I wrote a couple of extension methods so I can use C# generics (thinking longer term, I could reuse as much of this code as possible for other applications). These are found in the mldeepdivelib\Common\ExtenionMethods.cs file.

Skipping to the actual ML.NET code, the overall structure of any ML.NET application that trains a model is the following:
  1. Create an ML Context
  2. Create your Data Reader (in this case a CSV – TextReader is built in)
  3. Transform and Normalize data (in particular string data)
  4. Choose your Trainer Algorithm
  5. Train the model
  6. Save the model
Thankfully most of these steps are extremely easy, especially compared to TensorFlow (where you have to drop back to Python). A quick sketch of steps 1 and 2 follows.
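
For completeness, steps 1 and 2 look roughly like the sketch below. The reader call mirrors the 0.9-style CreateTextReader syntax used in the clustering post above, since the exact method names shift between the pre-1.0 releases; the column list is abbreviated and trainDataPath is a placeholder:

var mlContext = new MLContext();

var textReader = mlContext.Data.CreateTextReader(
    columns: new[]
    {
        new TextLoader.Column("PositionName", DataKind.Text, 0),
        new TextLoader.Column("DurationInMonths", DataKind.R4, 1)
        // remaining feature columns omitted for brevity
    },
    hasHeader: true,
    separatorChar: ',');

var trainingDataView = textReader.Read(trainDataPath);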

For my case it was just a handful of lines to do steps 3 through 6:

var dataProcessPipeline = mlContext.Transforms.CopyColumns(label.Name, "Label")
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("PositionName", "PositionNameEncoded"))
    .Append(mlContext.Transforms.Normalize("IsMarried", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
    .Append(mlContext.Transforms.Normalize("BSDegree", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
    .Append(mlContext.Transforms.Normalize(inputName: "MSDegree", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
    .Append(mlContext.Transforms.Normalize(inputName: "YearsExperience", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
    .Append(mlContext.Transforms.Normalize(inputName: "AgeAtHire", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
    .Append(mlContext.Transforms.Concatenate("Features", "PositionNameEncoded", "IsMarried", "BSDegree", "MSDegree", "YearsExperience", "AgeAtHire"));

var trainer = mlContext.Regression.Trainers.StochasticDualCoordinateAscent();
var trainingPipeline = dataProcessPipeline.Append(trainer);

var trainedModel = trainingPipeline.Fit(trainingDataView);

using (var fs = File.Create(modelPath))
{
    trainedModel.SaveTo(mlContext, fs);
}
Once the model is saved in the last line, it is trivial to make a prediction. Fortunately, I was able to make this method 100% generic:


private static TK Predict<T, TK>(MLContext mlContext, string modelPath, string predictionFilePath) where T : class where TK : class, new()
{
    var predictionData = Newtonsoft.Json.JsonConvert.DeserializeObject<T>(File.ReadAllText(predictionFilePath));

    ITransformer trainedModel;

    using (var stream = new FileStream(modelPath, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        trainedModel = mlContext.Model.Load(stream);
    }
  
    var predFunction = trainedModel.MakePredictionFunction<T, TK>(mlContext);

    return predFunction.Predict(predictionData);
}
The first thing I do here is read in a file, parse the JSON, and convert it to the type T to feed into the model. The method then returns the prediction.

Findings

Once compiled, we need to train the model. I took the Kaggle dataset above, trimmed it down to the features I thought were important and then called the app like so:


.\mlregression.exe build .\ibmclean.csv ibmmodel.mdl
On my Razer Blade Pro, it took less than 2 seconds to train. Afterwards I had my model.

Subsequently I created some test data to run the model against to get a prediction like so:


.\mlregression.exe predict ibmmodel.mdl testdata.json 

Closing Thoughts

Given the sample set, this is far from being even close to considered solved. However, it did give me a chance to deep dive into ML.NET’s 0.8 API, SDCA, and digging around for sample data. I am looking forward to continuing research into the other trainers ML.NET offers – stay tuned.
TAGS
none on this post

Introduction/Backstory

A long time ago, back in 2003, I had the amazing idea to use NVIDIA Cg on my GeForce 4 Ti4400 to accelerate image processing. I coined it imgFX at the time. While I thought I was doing something no one else had, I quickly learned I was not and eventually shelved the project. Several years later, in May 2008, I revived it with the HD revolution rapidly approaching, renaming it to texelFX.
2008 texelFX logo

People had Standard Definition content and wanted to quickly and cheaply release High Definition content. Using my Silicon Graphics Octane 2 (dual R12k 400MHz/V6 graphics) at the time, I was writing a C++ OpenGL application to handle the scaling using the exclusive Silicon Graphics OpenGL extensions. This was working pretty well, although my scaling techniques were not much more advanced than a nearest neighbor scaler - I was struggling with a bicubic scaler, and the results were sub-par, mostly due to my more mid-level programming abilities at the time. Fast forward to late summer 2017: I upgraded the GPU in my desktop to a GeForce 1080 Ti to take advantage of the numerous CUDA libraries that accelerate floating point operations like those I need for image scaling. At that time I created the GitHub repo; if I ever stop being ashamed of my 2008 code I will commit it to a separate repo.

The main reason for reviving the project was the news earlier in 2017 that Star Trek: Deep Space Nine would most likely never get a proper 1080p or better remastering. While you could argue that popping the 2002-2003 DVD releases into a UHD upscale-enabled Blu-ray player might make it look the best it possibly could, I would argue those players are not taking advantage of machine learning and are simply applying noise reduction along with a bicubic scale.

Where I am today

Over the weekend I ported over my .NET Core 2.0 App I did back in August 2017 to a more split architecture:
-.NET Core 2.0 library (Contains all of the actual scaler code)
-ASP .NET Core 2.0 MVC App (Providing a quick interface to demonstrate the effectiveness of the algorithms)
-ASP .NET Core 2.0 WebAPI App (Providing a backend to support larger batch processes/mobile uploads/etc)

Along with the port, I got a nearest neighbor implementation done using the System.Drawing .NET Core 2.0 NuGet package. This will serve as the baseline against which I will compare my approach, which will utilize the newly released Microsoft Cognitive Toolkit to create a deep convolutional neural network for my image scaling solution.
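
A minimal sketch of that baseline scaler; the method and parameter names here are mine, not necessarily what is in the repo:

public static Bitmap ScaleNearestNeighbor(Bitmap source, int targetWidth, int targetHeight)
{
    var destination = new Bitmap(targetWidth, targetHeight);

    using (var graphics = Graphics.FromImage(destination))
    {
        // Nearest neighbor interpolation with a half-pixel offset to avoid edge shifting
        graphics.InterpolationMode = InterpolationMode.NearestNeighbor;
        graphics.PixelOffsetMode = PixelOffsetMode.Half;

        graphics.DrawImage(source, new Rectangle(0, 0, targetWidth, targetHeight));
    }

    return destination;
}

To dive in, take the following screencap from Season 6 Episode 1 "A Time to Stand":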

DVD Screencap of A Time to Stand

Note the following:
-MPEG2 Compression Artifacts in the back left of the screencap where the 2 crew members are analyzing a screen and around the light
-Muted colors; granted, DS9 was intentionally muted, especially during the war seasons, but the color space of the DVD is vastly different from an HDR UHD disc of today

Scaling the image to HD (1920x1080):
Nearest Neighbor Upscale Screencap of A Time to Stand

Without doing a side by side comparison it is a little hard to see just how bad it is, so let's zoom in on the area mentioned above after upscaling:
Zoomed Upscale Screencap

The issues mentioned above in the DVD screencap are only exacerbated by the scaling, making the quality even worse when viewed at the new upscaled resolution.

What I hope to Achieve

Given the issues above, my main goals:
-Provide a web interface and REST Service to scale single images or videos
-Remove compression artifacts (specifically MPEG2)
-Apply Machine Learning to provide detail to objects where there are not enough pixels (such as a Standard Definition source)

And with any luck provide myself a true High Definition of Deep Space Nine.

Next Steps

With my goals outlined, the first step is to deep dive into the Microsoft Cognitive Toolkit documentation (https://docs.microsoft.com/en-us/cognitive-toolkit/) and begin training a model that provides a viable solution for goals 2 and 3.
TAGS
none on this post
Continuing my dive back into C++: libcurl on Windows does not come statically compiled by default, so I packaged together the latest release, compiled statically with Visual Studio 2017 in Release mode with IPv6 and SSL enabled. You can get the executable, header, and lib here.
TAGS
none on this post
Figuring most folks diving into FLTK might be developing on Windows and not want to pull down the source and compile it themselves, I compiled the source with Visual Studio 2017 in Release mode. You can get the headers and libs here.
TAGS
none on this post
Over the last week I have spent significant time fleshing out the UWP client for bbxp. In doing so I have run into several interesting problems. This blog post will cover the ones I have solved and those I uncovered that are much larger in scope than a simple UWP client.

Problem One - Displaying HTML

A simple approach to displaying the content that is returned from the WebAPI service would be to simply use a WebView control in XAML and call it a day.

The problem with this approach is that I am using two CSS files, one for Bootstrap and one for my custom styles. Without the CSS being included I would have to either accept inconsistent styling or come up with a way to inject the CSS into the response.

I chose the latter. In diving into this problem, one approach would be to pre-inject the CSS string into every item. Looking at the file sizes, even minified bootstrap.min.css is 119 KB and my own minified style file is 3 KB. That is an extra 122 KB per item; returning 10 posts, that is over a megabyte of needless data. Another approach would be to return the CSS once along with the content strings. This is problematic as it would require the other elements (content, searching, and archives) to also implement this approach. After some playing around with these approaches I came up with what I feel is the best one:

  1. Added a new Controller to the MVC application to return a single string with both CSS files
  2. For every request to display either content or a post, check whether the CSS string is stored locally; otherwise go out to the MVC app, download it and store it. Finally, inject the CSS string into the content string and display it in a WebView
A pretty simple and efficient approach, so let's dive into the code.

To start, I created a user control in the UWP app called HTMLRenderView to handle firing NavigateToString and to allow the XAML to bind the string via MVVM to a custom property called ContentBody, along the lines of the sketch below:
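
The original snippet isn't reproduced here, so the following is only a rough sketch of such a control; everything beyond the HTMLRenderView and ContentBody names (the ContentWebView element assumed in the XAML and the App.CachedCss helper holding the downloaded CSS) is hypothetical:

public sealed partial class HTMLRenderView : UserControl
{
    public static readonly DependencyProperty ContentBodyProperty =
        DependencyProperty.Register(nameof(ContentBody), typeof(string), typeof(HTMLRenderView),
            new PropertyMetadata(null, OnContentBodyChanged));

    public HTMLRenderView()
    {
        InitializeComponent();
    }

    public string ContentBody
    {
        get => (string)GetValue(ContentBodyProperty);
        set => SetValue(ContentBodyProperty, value);
    }

    private static void OnContentBodyChanged(DependencyObject d, DependencyPropertyChangedEventArgs e)
    {
        var control = (HTMLRenderView)d;

        // Inject the cached CSS ahead of the content before handing it to the WebView
        var html = $"<html><head><style>{App.CachedCss}</style></head><body>{e.NewValue}</body></html>";

        control.ContentWebView.NavigateToString(html);
    }
}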

This way, whether I am loading several WebView controls in a list or just a single instance, my code behind on the pages stays clean.

Problem Two - Matching Styles between Platforms

In doing this "port" to UWP I quickly realized maintaining CSS and XAML styles is going to be problematic at best, especially when adding in Xamarin Forms ports to iOS and Android down the road. I have run into this situation before at work on various projects, but in those scenarios the application styles were pretty static as opposed my blog where I update the syling fairly regularly. Also thinking about others using this platform or at least as a basis for their own platform.

My first inclination would be to do something akin to TypeScript, where a common style definition would compile down to the native syntax of each platform - CSS and XAML in my case. For the time being I will be adding this to my bucket list to investigate down the road.

Conclusion

As of this writing there are only two features left in the UWP Client: Archives and External Link handling. There is additional styling and optimization work to be done, but overall once those two features are added I will begin on the Xamarin Forms port for iOS and Android clients.

All of the code thus far is committed on GitHub.
TAGS
none on this post

Introduction

In case you missed the other days of this deep dive:
Day 1 Deep Dive
Day 2 Deep Dive with MongoDB
Day 3 Deep Dive with MongoDB
Day 4 Deep Dive with MongoDB
Day 5 Deep Dive with MongoDB
Day 6 Deep Dive with Mongoose
Day 7 Deep Dive with Clustering
Day 8 Deep Dive with PM2
Day 9 Deep Dive with Restify
Day 10 Deep Dive with Redis
Day 11 Deep Dive with Redis and ASP.NET Core

As mentioned on Friday, I wanted to spend some time turning my newly acquired Node.js and Redis knowledge into something more real-world. Learning when to use Node.js (and Redis) and when not to over the last two weeks has been really interesting, especially coming from an ASP.NET background where I had traditionally used it to solve every problem. While ASP.NET is great and can solve virtually any problem you throw at it, as noted in my deep dives it isn't always the best solution. In particular, there are huge performance differences as you scale an ASP.NET/Redis pairing versus Node.js/Redis. Wanting to get my bbxp blog platform, which powers this site, back into a service oriented architecture as it was a couple of years ago with a WCF service - and possibly a micro-service architecture - what better time than now to implement a caching layer using Node.js and Redis.

This blog post will detail some design decisions I made over the weekend and what I feel still needs polishing. For those curious, the code currently checked into the GitHub repo is far from production ready. Until it is, I won't be using it to power this site.

Initial Re-architecting

The first change I made to the solution was to add an ASP.NET Core WebAPI project, a .NET Standard business layer project and a .NET Standard data layer project for use with the WebAPI project. Having had the foresight earlier this year, when I redid the platform with ASP.NET Core, to break everything out, the effort wasn't as huge as it could have been had all of the code simply been placed in the controllers of the MVC app.

One new part of this restructuring was playing around with the new Portable Class Library targeting .NET Standard. From what I have gathered, this is the future replacement for the crazy number of profiles we have been using the last three years - Profile78 is a lot more confusing than .NET Standard 1.4 in my opinion. It took some time finding the most up to date table detailing which platforms are on which version, so for those also looking for a good reference, please bookmark the .NET Standard Library Roadmap page. As of this writing UWP does not support higher than 1.4, so I targeted that version for the PCL project. From what I have read, 1.6 support is coming in the next major UWP platform NuGet package update.

Redis Implementation

After deep diving into Redis with ASP.NET Core in Day 11 and Node.js in Day 10, it became pretty clear Node.js was a much better choice for speed as the number of requests increased. Designing this platform to truly be scalable, and getting experience designing a very scalable system with new technology I haven't messed with, are definitely part of this revamp. Caching, as seasoned developers know, is a tricky slope. One could simply turn on output caching for an ASP.NET MVC or WebForms app, but that wouldn't benefit the mobile, IoT or other clients of the platform. In a platform agnostic world this approach can still be used, but I shy away from turning it on and calling it a day. I would argue that native apps and other services hit a platform like bbxp more than the web app does in 2016.

So what are some other options? For bbxp, the largest part of the request time server side is pulling the post data from the SQL Server database. I had previously added some dynamically generated normalized tables, populated when post content is created, updated or deleted, but even then this puts more stress on the database and requires scaling vertically, since these tables aren't distributed. This is where a caching mechanism like Redis can really help, especially in the approach I took.

A more traditional approach to implementing Redis with ASP.NET Core might be to have the WebAPI service check whether the Redis database has the data cached (i.e. the key exists) and, if not, push it into Redis and return the result. I didn't agree with this approach as it needlessly hits the main WebAPI service when the data is already in the cache. A better approach in my mind is to implement it as a separate web service - in my case Node.js with restify - and have that communicate directly with Redis. Best case, you get the speed and scalability of Node.js and Redis without ever hitting the primary WebAPI service or SQL Server. Worst case, Node.js returns extremely quickly that the key was not found, and a second request goes to the WebAPI service, which not only queries the data from SQL Server but also fires a call to Redis to add the data to the cache; a sketch of that client-side flow is below.
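
A hedged sketch of that client-side flow as it might look inside the PCL; the endpoint URLs and the PostListingItem type are placeholders, with the Node.js route matching the /node/Posts route shown later in this post:

public async Task<List<PostListingItem>> GetPostListingAsync()
{
    using (var client = new HttpClient())
    {
        // 1. Ask the Node.js/Redis cache layer first
        var cached = await client.GetStringAsync("http://localhost:1338/node/Posts");

        if (!string.IsNullOrEmpty(cached))
        {
            return JsonConvert.DeserializeObject<List<PostListingItem>>(cached);
        }

        // 2. Cache miss - fall back to the WebAPI service, which queries SQL Server
        //    and fires a WriteJSON call so the next request hits Redis
        var live = await client.GetStringAsync("http://localhost:5000/api/Posts");

        return JsonConvert.DeserializeObject<List<PostListingItem>>(live);
    }
}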

An important thing to note here is the way I wrapped the REST service calls: each consumer of the service does not actually know or care which data source the data came from. In nearly seven years of doing Service Oriented Architectures (SOA), I have learned that even a small amount of business logic client side - something as simple as making a second call to a separate web service - is too much. The largest part of that is consistency and maintainability of your code. In a multi-platform world you might have ASP.NET, Xamarin, UWP and IoT code bases to maintain with a small team, or worse, a single person. Putting this code inside the PCL as I have done is the best approach I have found.

That being said, let's dive into some C# code. For my Redis wrapper I chose a pretty straightforward approach: take a string value for the key and accept a type of T, which the helper function automatically converts into JSON to be stored in Redis:

public async void WriteJSON<T>(string key, T objectValue)
{
    var value = JsonConvert.SerializeObject(objectValue, Formatting.None);

    value = JToken.Parse(value).ToString();

    await db.StringSetAsync(Uri.EscapeDataString(key), value, flags: CommandFlags.FireAndForget);
}

A key point here is the FireAndForget flag, so we aren't delaying the response back to the client while writing to Redis. A better approach for later this week might be to add in Azure Service Bus or a messaging system like RabbitMQ to handle the case where the key couldn't be added, for instance if the Redis server was down. In that scenario the system as written would still work, but scaling would be hampered, and depending on the number of users hitting the site and the server size, that could be disastrous.

Node.js Refactoring

With the addition of several more routes being handled by Node.js than in my testing samples, I decided it was time to refactor the code to cut down on the duplicate Redis client code and the handling of null values. At this point I am unsure if my Node.js code is as polished as it could be, but it does work and handles null checks properly.

My dbFactory.js with the refactored code, exposing a get method that handles null checking and returns the JSON data from Redis:

var redis = require("redis");
var settings = require("./config");

module.exports = function RedisFactory(key, response) {
    var client = redis.createClient(settings.REDIS_DATABASE_PORT, settings.REDIS_DATABASE_HOSTNAME);

    client.on("error", function (err) {
        console.log("Error " + err);
    });

    client.get(key, function (err, reply) {
        response.writeHead(200, { 'Content-Type': 'application/json' });

        if (reply == null) {
            response.end("");

            return response;
        }

        response.end(reply);

        return response;
    });
};

With this refactoring, my actual route files are pretty simple, at least at this point. Below is my posts-router.js with the cleaned up code utilizing the new RedisFactory object:

var Router = require('restify-router').Router;
var router = new Router();
var RedisFactoryClient = require("./dbFactory");

function getListing(request, response, next) {
    return RedisFactoryClient("PostListing", response);
}

function getSinglePost(request, response, next) {
    return RedisFactoryClient(request.params.urlArg, response);
}

router.get('/node/Posts', getListing);
router.get('/node/Posts/:urlArg', getSinglePost);

module.exports = router;

As one can see, the code is much simpler than the redundant, bloated code it would quickly have become had I kept the unfactored approach from my testing code.

Next up...

Tomorrow night I hope to start implementing automatic cache invalidation and polishing the cache entry code in the business layer interfacing with Redis. With those changes I will detail the approach along with its pros and cons. For those curious, the UWP client will become a fully supported client along with iOS and Android clients via Xamarin Forms. Those building the source code will see a very early look at the home screen posts pulling down with a UI that closely resembles the MVC look and feel.

All of the code for the platform is committed on GitHub. I hope to begin automated builds like I set up with Raptor and to create releases as I add new features and continue making the platform more generic.
TAGS
none on this post

Introduction

In case you missed the other days:
Day 1 Deep Dive
Day 2 Deep Dive with MongoDB
Day 3 Deep Dive with MongoDB
Day 4 Deep Dive with MongoDB
Day 5 Deep Dive with MongoDB
Day 6 Deep Dive with Mongoose
Day 7 Deep Dive with Clustering
Day 8 Deep Dive with PM2
Day 9 Deep Dive with Restify
Day 10 Deep Dive with Redis

I was originally going to deep dive into AWS tonight, but the excitement over Redis last night had me eager to switch gears a bit and get Redis up and running in ASP.NET Core.

Prerequisites

I will assume you have downloaded and installed the Redis server. If you're on Windows you can download the Windows port; if you're on Linux, grab it from your distribution's package manager.

Setting it up in ASP.NET

Unsure of the "approved" client, I searched around on redis's official site for some clients to try out. Based on that list and nuget, I chose to at least start with StackExchange.Redis. To get going simply issue a nuget command:

Install-Package StackExchange.Redis

Or search in NuGet for StackExchange.Redis. As of this writing I am using the latest version, 1.1.605.

Getting a Redis database up and running in ASP.NET was extremely painless: just a few connection lines and then an async call to set the key/value:

private static ConnectionMultiplexer redis;
private IDatabase db;

[HttpGet]
public async Task<string> Get(int id)
{
    if (redis == null)
    {
        redis = ConnectionMultiplexer.Connect("localhost");
    }

    if (db == null)
    {
        db = redis.GetDatabase();
    }

    await db.StringSetAsync(id.ToString(), 2, flags: CommandFlags.FireAndForget);

    return "OK";
}

Performance

I was interested to see how Node.js and ASP.NET Core performed with the same test. The numbers speak for themselves:


The performance results were interesting to say the least, after having near-identical MongoDB results. Wondering if maybe there was a Kestrel difference, I re-ran the test:


Better, but not as dramatic a difference as I would have assumed.

Next up...

Seeing as how the performance wasn't anywhere close to that of Node.js, I am wondering if utilizing the DI found in ASP.NET Core would alleviate the performance issues (a rough sketch of that idea is below). My original intent was to spend Saturday adding Redis into my blogging platform for caching, but as of right now I will hold off until I can figure out the reason for the huge delta. All of the code thus far is committed on GitHub.
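
For reference, the DI idea would look something along these lines; this is only a sketch of what I plan to try, not what is currently in the repo:

// In Startup.cs
public void ConfigureServices(IServiceCollection services)
{
    // Register a single shared ConnectionMultiplexer rather than lazily creating one in the controller
    services.AddSingleton(ConnectionMultiplexer.Connect("localhost"));

    services.AddMvc();
}

public class TestController : Controller
{
    private readonly IDatabase _db;

    public TestController(ConnectionMultiplexer redis)
    {
        _db = redis.GetDatabase();
    }

    [HttpGet]
    public async Task<string> Get(int id)
    {
        await _db.StringSetAsync(id.ToString(), 2, flags: CommandFlags.FireAndForget);

        return "OK";
    }
}
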
TAGS
none on this post

Introduction

In case you missed the other days:
Day 1 Deep Dive
Day 2 Deep Dive with MongoDB
Day 3 Deep Dive with MongoDB
Day 4 Deep Dive with MongoDB
Day 5 Deep Dive with MongoDB
Day 6 Deep Dive with Mongoose
Day 7 Deep Dive with Clustering
Day 8 Deep Dive with PM2
Day 9 Deep Dive with Restify

After playing around with MongoDB I wanted to try out Redis, not only to compare its performance to MongoDB but also for practical experience to use at work and in other projects. So tonight I will get it up and running on Windows, make connections from Node.js, and do some perf testing.

Prerequisites

You will need to install Redis. If you're on Windows you can download the Windows port; if you're on Linux, grab it from your distribution's package manager.

redis

To get started we need to install the npm package for redis:

npm install redis -g

Once installed, getting the Redis client connected to the server was pretty painless, especially after using MongoDB previously. For the sake of this testing app I simply replaced the MongoDB dbFactory.js like so:

var redis = require("redis"); var settings = require('./config'); var client = redis.createClient(settings.REDIS_DATABASE_POST, settings.REDIS_DATABASE_HOSTNAME); client.on("error", function (err) { console.log("Error " + err); }); module.exports = client;

Since Redis uses separate port and hostname values, I added two config options and removed the older database connection property:

module.exports = {
    REDIS_DATABASE_HOSTNAME: 'localhost',
    REDIS_DATABASE_PORT: 6379,
    HTTP_SERVER_PORT: 1338
};

In the actual route for the test it only required a few adjustments:

var Router = require('restify-router').Router;
var router = new Router();
var RedisClient = require('./dbFactory');

function respond(request, response, next) {
    var argId = request.params.id;

    RedisClient.set(argId.toString(), 2);

    return response.json({ message: true });
}

router.get('/api/Test', respond);

module.exports = router;

Performance

Interested in the performance differences, I was not surprised that a direct comparison was so dramatic. I should note these tests were done on my Razer Blade laptop, not my desktop.


When I add more functionality to provide full CRUD operations it will be interesting to really do an in-depth test comparing SQL Server, MongoDB and Redis.

Next up...

Tomorrow night I am planning on diving into Amazon Web Services (AWS) to get a good comparison to Rackspace and Azure. In particular for node.js development as I imagine node dev is more commonly done on AWS.

All of the code thus far is committed on GitHub.
TAGS
none on this post

Introduction

In case you missed the other days:
Day 1 Deep Dive
Day 2 Deep Dive with MongoDB
Day 3 Deep Dive with MongoDB
Day 4 Deep Dive with MongoDB
Day 5 Deep Dive with MongoDB
Day 6 Deep Dive with Mongoose
Day 7 Deep Dive with Clustering
Day 8 Deep Dive with PM2

On the flight back from California last night I queued up a few Node.js videos on YouTube, one of which was a presentation by Kim Trott of Netflix discussing Netflix's migration from a Java backend to Node.js. I really enjoyed this presentation because it went over not only what worked great but also what didn't - something that doesn't happen too often, especially from a larger company. Another highlight for me was reviewing some of the Node modules Netflix uses to serve millions of hours of content daily, one of which was restify. Restify provides a clean interface for routes without the templating and rendering that Express offers - which fits much better for my current testing.

Prerequisites

As mentioned previously, at this point I am going to assume MongoDB is up and running; if not, check my first day post for details on how to get it up and running.

restify

To get started with restify you will need to install via:

npm install restify -g

In diving into restify I found another module that goes hand in hand with it: restify-router. As detailed at the aforementioned link, this module offers the ability to separate out your routes (similar to what those of us coming from ASP.NET are used to, with specific *Controller.cs files per grouping of routes).
To get going with restify-router simply install it via:

npm install restify-router -g

I should mention I have removed the worker.js from the git folder for Day 9 as I am now utilizing pm2 as mentioned in yesterday's post.

Replacing the Express code with restify was pretty painless:

var restify = require('restify');
var settings = require('./config');
var testRouter = require('./test.router');

var server = restify.createServer();

testRouter.applyRoutes(server);

server.listen(settings.HTTP_SERVER_PORT);

For those following my deep dive, this should look very similar to the Express code, with the addition of the testRouter reference, which I will go over below.
As mentioned above, I chose to utilize restify-router to split out my routes into their own files. My test.router:

var Router = require('restify-router').Router;
var router = new Router();
var Post = require('./dbFactory');

function respond(request, response, next) {
    var argId = request.params.id;

    var newPost = new Post({ id: argId, likes: 2 });

    newPost.save(function (err) {
        if (err) {
            return response.json({ message: err });
        }

        return response.json({ message: true });
    });
}

router.get('/api/Test', respond);

module.exports = router;

Similarly, there is not much difference from the older route definitions; the only big change is the first line requiring the restify-router module.

Next up...

As mentioned last night, I am still investigating why on Windows only one worker process is getting hit instead of all of them being utilized. I tested this on my i7 desktop and had the same results as on my 2014 Razer Blade. I hope to further deep dive into restify tomorrow and hopefully resolve the weird scaling issue I'm noticing.

All of the code thus far is committed on GitHub.
TAGS
none on this post

Introduction

In case you missed the other days:
Day 1 Deep Dive
Day 2 Deep Dive with MongoDB
Day 3 Deep Dive with MongoDB
Day 4 Deep Dive with MongoDB
Day 5 Deep Dive with MongoDB
Day 6 Deep Dive with Mongoose
Day 7 Deep Dive with Clustering

Keeping with yesterday's clustering discussion, I started to look into load balancers and other options beyond the cluster module to solve the scaling issue that exists out of the box. In doing this research I came across a few solutions:

  1. http-proxy module
  2. nginx
  3. pm2
After some deep diving into other developers' comments, and given the overall plans for my deep dive, I chose pm2.

Prerequisites

As mentioned previously, at this point I am going to assume MongoDB is up and running; if not, check my first day post for details on how to get it up and running.

pm2

To get started with pm2 you will need to install via:

npm install pm2 -g

Since pm2 has a built-in clustering option, the code written last night has no purpose anymore, but thankfully I abstracted the actual "worker" code out into its own worker.js file. That being said, all you have to do to have pm2 run your Node.js code across all of your CPU cores is:

pm2 start worker.js -i 0

From there you should see pm2 kicking off a process for each of the CPU cores available. In my case, my 2014 Razer Blade laptop has 4 cores/8 threads, so it kicked off 8 processes. If you want to limit the number of processes you can specify a different number instead of 0.

One of the neat things about pm2 is the ability to monitor the processes with a simple:

pm2 monit

Which on my machine produced:

A handy command when you're done is to issue a

pm2 stop all

command to stop all of the processes.

Next up...

Hopefully this was interesting for those following along on my deep dive into Node.js. I am excited to keep deep diving into pm2 tomorrow. An issue I was running into on Windows 10 (I need to try it on Linux) is that only one process was being hit, at ~65% according to pm2. Whether this is a Windows-specific issue, a problem with pm2, or a problem with my code, I need to dive in further.

All of the code thus far is committed on GitHub.
TAGS
none on this post

Introduction

In case you missed the other days:
Day 1 Deep Dive
Day 2 Deep Dive with MongoDB
Day 3 Deep Dive with MongoDB
Day 4 Deep Dive with MongoDB
Day 5 Deep Dive with MongoDB
Day 6 Deep Dive with Mongoose

As mentioned yesterday, I wanted to abstract out the config settings and look into a way to better take advantage of the hardware at my disposal, like I can with ASP.NET. Node.js being single threaded puts it at a huge disadvantage when run on a multi-core server (as most are these days). Knowing there were workarounds, I was keen to figure out what they were and then compare the "out of the box" mode versus using multiple CPUs.

Prerequisites

As mentioned previously, at this point I am going to assume MongoDB is up and running; if not, check my first day post for details on how to get it up and running.

Config

I read about a couple of different approaches to configs in the Node.js world. Some prefer a JSON configuration file, as I have gotten used to with ASP.NET Core, while others prefer to just use a JavaScript file with module.exports to expose config values. For the sake of this test app I went with the latter:

module.exports = {
    DATABASE_CONNECTIONSTRING: 'localhost/dayinnode',
    HTTP_SERVER_PORT: 1338
};

And then in my dbFactory.js:

var settings = require('./config');

mongoose.connect('mongodb://' + settings.DATABASE_CONNECTIONSTRING);

This way I can keep my code free of magic strings, while still having all of my configuration settings in one file.

Cluster

As it turns out, Node.js has a built-in module called cluster that, as the name implies, adds support for creating child Node.js processes.
Getting it up and running was pretty painless: just a simple require of cluster and then a check to get the number of CPUs. I took it one step further and abstracted the actual "worker" code away into the worker.js file. The server.js file now looks like this:

var cluster = require("cluster"); cluster.setupMaster({ exec: 'worker.js', silent: true }); var numCPUs = require("os").cpus().length; if (cluster.isMaster) { for (var i = 0; i < numCPUs; i++) { cluster.fork(); } cluster.on("exit", function (worker, code, signal) { cluster.fork(); }); }

In doing comparisons between the single threaded approach and the new cluster approach there wasn't a distinguishable difference, which leads me to believe that, at least on my 2014 Razer Blade laptop, the bottleneck is the MongoDB database, not Node.js.

Next up...

When I get back home I hope to test this new code on my i7 desktop to see if there is any discernible difference between the cluster approach and the single threaded approach when using a MongoDB database. In addition, I want to ensure that MongoDB is configured properly with Mongoose, since the ASP.NET Core performance exceeded Node.js's. All of the code thus far is committed on GitHub.
TAGS
none on this post

Introduction

In case you missed the other days:
Day 1 Deep Dive
Day 2 Deep Dive with MongoDB
Day 3 Deep Dive with MongoDB
Day 4 Deep Dive with MongoDB
Day 5 Deep Dive with MongoDB

Today's post is an intro for myself into Mongoose, a popular object modeler for Node.js. This continues my deep dive into learning the Node.js equivalents of what I am used to in the ASP.NET world - in this post in particular, how Mongoose compares to Entity Framework.

Prerequisites

As mentioned previously, at this point I am going to assume MongoDB is up and running; if not, check my first day post for details on how to get it up and running.

Utilizing Mongoose and cleaning up the MongoDB code

As I mentioned in yesterday's post, I wanted to clean up the code further, in particular all of the MongoDB code. Knowing Mongoose would help in this regard, I replaced the MongoDB code I had with, interestingly enough, less code plus an object model for my "Post" object. The process was fairly similar to a code-first approach with Entity Framework, so it felt very comfortable; a rough EF equivalent is sketched below for comparison.
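
For those more used to Entity Framework, the same Post shape as an EF Core code-first model would look roughly like the following; this is my sketch for comparison, not code from the repo, and the connection string is a placeholder:

public class Post
{
    public int Id { get; set; }

    public int Likes { get; set; }
}

public class BlogContext : DbContext
{
    public DbSet<Post> Posts { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder options)
    {
        options.UseSqlServer("Server=localhost;Database=dayinnode;Trusted_Connection=True;");
    }
}

// Roughly the equivalent of newPost.save() in Mongoose
using (var context = new BlogContext())
{
    context.Posts.Add(new Post { Likes = 2 });

    context.SaveChanges();
}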

Migrating all of my MongoDB code was extremely painless; even after adding in the model creation code it was left with just the following:

var mongoose = require('mongoose');

mongoose.connect('mongodb://localhost/dayinnode');

var postSchema = new mongoose.Schema({ id: Number, likes: Number });

var Post = mongoose.model('Post', postSchema);

module.exports = Post;

What is neat is the actual usage of the Post object. In my routes.js file:

var express = require('express');
var Post = require('./dbFactory');

var router = express.Router();

router.get('/api/Test', function (request, response) {
    var argId = request.params.id;

    var newPost = new Post({ id: argId, likes: 2 });

    newPost.save(function (err) {
        if (err) {
            return response.json({ message: err });
        }

        return response.json({ message: true });
    });
});

module.exports = router;

If you have followed the posts so far, you will notice there are no database-specific connection objects or init calls cluttering up the business logic. In addition, Mongoose makes saving extremely easy by putting a save function on the object itself.

Next up...

Overall, simply adding in Mongoose has cleaned up my code. Next up is the configuration component, to remove the hard-coded port number and database connection information. All of the code thus far is committed on GitHub.
TAGS
none on this post