Intro - Analytics and Deployment: Why should Data Scientists care about Production?

This is a series of write-ups on a modern data pipeline implementation, with a case study of Tyco Cam security solutions, using data tooling such as Pachyderm, Kubernetes, Docker, TensorFlow, and storage infrastructure like AWS.

Table of Contents

  1. Intro - Analytics and Deployment: Why should Data Scientists care about Production?
  2. Cloud Locations: Case for Object Storage
  3. Model and Framework selection in Production: A Case of Object Detection with TensorFlow
  4. Building the full Data Pipeline - I
  5. Containers - boxing code dependencies with Docker
  6. Building the full Data Pipeline - II
  7. Update, Maintain and Scale your Data Science Pipeline
●●●

This is the first in a series of articles discussing the elements required to make a locally developed data science model production-ready.

Perfecting an analytical model for a business problem on your own computer (development) is one thing; packaging it as a solution and putting it in a reusable, scalable environment (production) is entirely different. As a data scientist, you may have balanced yourself perfectly between the art and science of analytics, with skills in cleansing, feature extraction, and modeling, but if all of that is done on one computer with no reproducibility and no visibility to other stakeholders, then your work is not going to be a real value-add to your organization.

This all ties into the expectations placed on any data science team. Delivery expectations are defined both explicitly and implicitly, and meeting them consistently is crucial for the team's continued existence and growth. Consider the example of Tyco, which offers home security solutions, including analytical solutions built on live video from the entry/exit points of customers' real estate. The explicit expectation here is accuracy, i.e. identifying intruders correctly. Even this can become complex when the application must differentiate between humans and animals, or humans in animal costumes. But there are also implicit expectations: that the solution is transparent (visibility into the promised premium services), secure, and available all the time. The production environment is where the real value of a solution is extracted, and without meeting these implicit expectations, all the effort spent on the explicit ones could end up as one big, futile laboratory experiment. The way you get the information out cannot be the same when you have 10 users, 1,000 users, or 100,000 users (an extreme case!).

Development vs. Production: Home vs. Away Game

As a team of data scientists, we must foresee how the developed models will behave in a production environment and be prepared for added complexity, the "unknown unknowns". In development, everything is familiar and favorable, but the turf can turn upside down quickly in production, where external factors come into play and affect the robustness of the solution.

The two can be differentiated in the following ways:

Development | Production
More for batch prediction | More towards near-real-time prediction
Static input datasets and results | A pipeline of up-to-date, dynamic data
Data stored on, or brought to, the local environment before processing | Data stored and processed in cloud object/compute environments such as S3 and EC2
All data processed at once in RAM, a huge infrastructure requirement for large datasets | Data fetched in chunks; infrastructure is more "plug-and-play"
Generally one version of training/test data | Data snapshots may need to be extensively versioned
Driven primarily by Data Scientists | Driven primarily by DevOps
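
To make the "data fetched in chunks" row above concrete, here is a minimal sketch, assuming the boto3 library; the bucket and key names are purely illustrative. It streams an object from S3 in fixed-size chunks instead of pulling the whole dataset into local RAM.

```python
import boto3

# Hypothetical bucket/key names, used only for illustration.
BUCKET = "tyco-cam-footage"
KEY = "entry-exit/2018-06-01/cam01.mp4"

s3 = boto3.client("s3")

def iter_chunks(bucket, key, chunk_size=8 * 1024 * 1024):
    """Yield an S3 object in 8 MB chunks instead of reading it all into RAM."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]  # a StreamingBody
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        yield chunk

total = 0
for chunk in iter_chunks(BUCKET, KEY):
    total += len(chunk)  # replace with real per-chunk processing
print(f"Processed {total} bytes without holding the full object in memory")
```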

[Image: Analytical Process]

To give more context on how the familiar analytics pipeline in a dev environment maps onto production, the differences can also be viewed stage by stage:

Analytical Step | Development | Production analogue | Reusability from Dev
Data flow between stages | Stored in the RAM of a single local machine | Intermediate tables/objects and MQ (messaging middleware) | No
Routine operations such as data ingestion | Manual, ad-hoc customization of ingestion approaches | Automated scripting plus multiple automated, configurable customer data sources, each with its own data snapshots | Yes
Data cleansing and transformation | Very lengthy process of trial and testing | Learnings from dev are used to standardize and automate | Partial
Data partitioning | Training and test sets | Training only; streaming data | Partial
Model | Think, test, and optimize; the model is the whole of the machine | Requirement-meeting models, translated for the prod environment; the model is a subset of the overall SaaS/CRM functionality | Yes
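
As an example of the last row, "translated for the prod environment": a model perfected in a notebook usually has to be exported as a serialized, serving-friendly artifact before it can sit behind a production API. The sketch below assumes TensorFlow 2.x and a Keras model; the layer sizes and export path are illustrative only.

```python
import tensorflow as tf

# Illustrative stand-in for the model developed in the dev environment.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g. intruder / no intruder
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Export as a SavedModel: a language-neutral format that serving
# infrastructure (e.g. TensorFlow Serving inside a container) can load.
tf.saved_model.save(model, "/tmp/intruder_model/1")

# The production side reloads the artifact without any of the training code.
reloaded = tf.saved_model.load("/tmp/intruder_model/1")
```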

Tooling to Manage Production

There are commercial options available to manage production effectively, but they come at a huge cost. Any good Chief Analytics Officer is likely to be more inclined towards tooling that is open source and cloud native. Such tools are already being developed and used by the tech giants, and they have good community support as well as enterprise versions. Sample tooling that can be considered, and that will be discussed in detail in future posts, is as follows:

Data Storage

Object storage providers: Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage. These can be coupled with the data versioning and access control layers of Pachyderm to mitigate risk and make data easily accessible.
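
A minimal sketch of the object-storage side, assuming boto3 and a hypothetical bucket name; here S3's own bucket versioning stands in for the commit-based versioning that Pachyderm layers on top, which a later post covers.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "tyco-cam-snapshots"  # hypothetical bucket name

# Turn on bucket-level versioning so every overwrite keeps the old object.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Each upload of the same key now creates a new, retrievable version.
s3.upload_file("daily_snapshot.parquet", BUCKET, "snapshots/daily.parquet")

# List the versions accumulated for that key.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix="snapshots/daily.parquet")
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"])
```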

Data pipelining

Container orchestration tools and platforms: Kubernetes, Docker, rkt, Apache Mesos. These schedule and manage containers: lightweight, scalable, isolated, VM-like environments that run applications faster and with far less storage.
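
As a small illustration of the orchestration side, the sketch below assumes the official kubernetes Python client and an already-configured cluster (you could equally use kubectl); the namespace name is hypothetical. It simply asks the orchestrator which pipeline-worker pods it is currently running.

```python
from kubernetes import client, config

# Assumes kubectl is already configured for the target cluster (~/.kube/config).
config.load_kube_config()
v1 = client.CoreV1Api()

# Inspect the containers the orchestrator has scheduled for the pipeline.
pods = v1.list_namespaced_pod(namespace="data-pipeline")  # hypothetical namespace
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```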

Analysis

Python/R libraries, the latest of which is TensorFlow. TensorFlow performs numerical computation based on data flow graphs.
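
A tiny sketch of what "numerical computation based on data flow graphs" means in practice, assuming TensorFlow 2.x, where tf.function traces a Python function into a graph of tensor operations:

```python
import tensorflow as tf

@tf.function  # traces this Python function into a TensorFlow data flow graph
def affine(x, w, b):
    return tf.matmul(x, w) + b

x = tf.constant([[1.0, 2.0]])
w = tf.constant([[3.0], [4.0]])
b = tf.constant([0.5])

print(affine(x, w, b))  # tf.Tensor([[11.5]], shape=(1, 1), dtype=float32)
```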