Taking Models from Data Scientists to Users (Part 1): Intro to Deployment
Intro - Analytics and Deployment: Why should Data Scientists care about Production?
This is a series of write-ups on a modern data pipeline implementation, with a case study of Tyco Cam security solutions, using data tooling such as Pachyderm, Kubernetes, Docker, and TensorFlow, and storage infrastructure such as AWS.
Table of Contents
- Intro - Analytics and Deployment: Why should Data Scientists care about Production?
- Cloud Locations: The Case for Object Storage
- Model and Framework selection in Production: A Case of Object Detection with TensorFlow
- Building the full Data Pipeline - I
- Containers: Boxing Code Dependencies with Docker
- Building the full Data Pipeline - II
- Update, Maintain and Scale your Data Science Pipeline
This is the first in a series of articles discussing the elements required to make a locally developed data science model production-ready.
Perfecting an analytical model for a business problem on a single computer (development) is one thing; packaging it as a solution and putting it in a reusable, scalable environment (production) is entirely different. As a data scientist, you may have balanced yourself perfectly between the art and science of analytics, with skills in cleansing, feature extraction, and modeling, but if all of that is done on one computer with no reproducibility and no visibility to other stakeholders, then your work is not going to be a real value-add to your organization.
This all ties into the expectations placed on any data science team. Delivery expectations are defined both explicitly and implicitly, and meeting them consistently is crucial for the team's continued existence and growth. Consider the example of Tyco, which offers home security solutions, including analytical services built on live video from the entry/exit points of a customer's property. The explicit expectation here is accuracy, i.e. identifying intruders correctly. This can become complex when the application must differentiate between humans and animals, or even humans in animal costumes. But there are also implicit expectations: that the solution is transparent (visibility into promised premium services), secure, and available all the time. The production environment is where the real value of a solution is extracted, and without meeting these implicit expectations, all efforts toward the explicit ones could amount to a big, futile laboratory experiment. The way you get information out cannot be the same when you have 10 users, 1,000 users, or 100,000 users (extreme case!).
Development vs Production: Home vs. Away Game
As a team of data scientists, we must foresee how the developed models will behave in a production environment and be prepared for added complexity, the "unknown-unknowns". In development, everything is familiar and favorable, but the turf can turn upside down quickly in production, where external factors come into play and affect the robustness of the solution.
The two can be differentiated in the following ways:
Development | Production |
---|---|
More for batch prediction | More towards near-real-time prediction |
Static input datasets and results | Pipeline of up-to-date, dynamic data |
Data stored on or brought to the local environment before processing | Data stored and processed in the cloud, e.g. S3 (object storage) and EC2 (compute) |
All data processed at once in RAM – huge infrastructure requirement for large datasets | Data fetched in chunks – infrastructure is more "plug-n-play" (see the sketch below) |
Generally, one version of training/test data | Data snapshots may need to be extensively versioned |
Driven primarily by Data Scientists | Driven primarily by DevOps |
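To make the "data fetched in chunks" row concrete, here is a minimal Python sketch contrasting the two habits. The file name and chunk size are purely illustrative; the point is that the production-style loop keeps memory bounded no matter how large the dataset grows.

```python
import pandas as pd

# Development habit: pull the whole dataset into RAM in one shot.
# df = pd.read_csv("events.csv")

# Production-style alternative: stream the same file in fixed-size chunks,
# so memory usage stays roughly constant as the data grows.
rows_processed = 0
for chunk in pd.read_csv("events.csv", chunksize=100_000):
    # ... apply cleansing / scoring to this chunk only ...
    rows_processed += len(chunk)

print(f"rows processed: {rows_processed}")
```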
To give more context from the lessons of a familiar analytics pipeline in a dev environment vis-à-vis production, the differences can also be viewed step by step:
Analytical Step | Development | Production Analogue | Reusable from Dev? |
---|---|---|---|
Data flow between stages | Stored in RAM of a single local machine | Intermediate tables/objects and messaging middleware (see the sketch after this table) | No |
Routine operations like data ingestion | Manual, ad-hoc customization of ingestion approaches | Automated scripting + multiple automated, configurable data sources from customers, each with its own data snapshots | Yes |
Data cleansing and transformation | Very lengthy process of trials and testing | Standardized and automated, using what was learned in dev | Partial |
Data partitioning | Training and test sets | Training only; streaming data | Partial |
Model | Think, test, and optimize; the whole component of the machine | Requirement-meeting models, translated for the production environment; a subset of the overall SaaS/CRM functionality | Yes |
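As a rough illustration of the "intermediate objects and messaging middleware" row, here is a sketch of two pipeline stages decoupled by a queue, using boto3 with Amazon SQS. The queue URL, S3 key, and message format are hypothetical; the same idea applies to any message broker.

```python
import json
import boto3

# Hypothetical queue URL. In dev, the "queue" is usually just a variable in
# RAM; in production, a managed broker decouples the two stages.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/frames-to-score"

sqs = boto3.client("sqs")

# Ingestion stage: publish a reference to newly arrived data.
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({"s3_key": "raw/camera-07/frame-0001.jpg"}),
)

# Scoring stage: pick up work whenever it arrives, independently of ingestion.
resp = sqs.receive_message(
    QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10
)
for msg in resp.get("Messages", []):
    payload = json.loads(msg["Body"])
    # ... run the model against payload["s3_key"] ...
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```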
Tooling to Manage Production
There are commercial options available to manage production effectively, but they come at a huge cost. Any good Chief Analytics Officer is likely to lean towards tooling that is open source and cloud native. Such tools are already being developed and used by the tech giants and have strong community support as well as enterprise versions. Sample tooling that can be considered, and will be discussed in detail in future posts, is as follows:
Data Storage
Object storage providers: Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage. These can be coupled with the data versioning and access control layers of Pachyderm to mitigate risk and keep data easily accessible.
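A minimal sketch of what "data in object storage" looks like in code, using boto3 against S3. The bucket name, object keys, and local paths are hypothetical; the Pachyderm versioning and access-control layer mentioned above sits on top of storage like this and will be covered in a later post.

```python
import boto3

# Hypothetical bucket and key names, purely for illustration.
BUCKET = "tyco-cam-training-data"

s3 = boto3.client("s3")

# Push a local training snapshot to object storage...
s3.upload_file("data/snapshot-2018-05-01.tar.gz", BUCKET,
               "snapshots/2018-05-01.tar.gz")

# ...and pull it back down on any worker that needs it.
s3.download_file(BUCKET, "snapshots/2018-05-01.tar.gz",
                 "/tmp/snapshot.tar.gz")
```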
Data Pipelining
Container orchestration tools and platforms: Kubernetes, Docker, rkt, Apache Mesos. These schedule and manage containers, which are lightweight, scalable, isolated VM-like environments that run applications faster and with far less storage overhead.
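As a small taste of working with containers programmatically, here is a sketch using the Docker SDK for Python to run a containerized scoring job. The image tag and command are illustrative only, and this assumes a local Docker daemon is available; orchestration with Kubernetes is covered later in the series.

```python
import docker

# Connect to the local Docker daemon (assumes Docker is installed and running).
client = docker.from_env()

# Run a containerized job; the image tag and command are hypothetical.
logs = client.containers.run(
    "tensorflow/tensorflow:latest",
    "python /app/score.py",
    remove=True,  # clean up the container once the job finishes
)
print(logs)
```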
Analysis
Python/R libraries, the latest of which is TensorFlow. TensorFlow performs numerical computation based on data flow graphs.
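A minimal sketch of that data-flow-graph idea, assuming the TensorFlow 1.x graph-mode API: the operations below only define a graph, and nothing is computed until a session runs it.

```python
import tensorflow as tf

# Define the data flow graph: nodes are operations, edges are tensors.
a = tf.constant(3.0, name="a")
b = tf.constant(4.0, name="b")
c = tf.add(a, b, name="sum")  # no computation has happened yet

# The graph is only executed when a session runs it.
with tf.Session() as sess:
    print(sess.run(c))  # -> 7.0
```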