Machine learning lies at the core of the solutions provided by data analytics organizations. The paradigm by which machine-learning capabilities and solutions are offered has undergone a fundamental shift in recent years. The days of monolithic, singular platforms and products in this space seem over, having given way to nimbler, more modular, and more scalable architectures based on services and cloud computing. In particular, the microservices architecture, first applied to software systems and applications in general, is now penetrating machine-learning applications. The concept of "machine learning as a microservice" has arrived, and data scientists are embracing it as the paradigm for productionizing analytics solutions.
Service-oriented architectures and Web services have been around for years, so a note on what distinguishes a microservice is in order. Microservices are Web services as well: they are accessed via APIs, and users are abstracted from what lies underneath. What distinguishes a microservice in particular is that it is small, lightweight, and focused on a single capability.
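To make the service abstraction concrete, below is a minimal sketch of a machine-learning microservice as a Web API, written against Python's standard WSGI interface. The scoring rule is a hypothetical stand-in for a real trained model, and the endpoint shape is an illustrative assumption, not a prescribed design.

```python
import json

# A minimal sketch of a machine-learning microservice: one prediction
# endpoint behind an HTTP API. The scoring rule below is a hypothetical
# stand-in for a real trained model.
def predict(features):
    # Hypothetical model: average the feature values and threshold.
    score = sum(features.values()) / max(len(features), 1)
    return {"label": "positive" if score > 0.5 else "negative",
            "score": score}

def app(environ, start_response):
    # Parse the JSON request body, score it, and return JSON.
    # Callers see only the API; the model underneath is abstracted.
    try:
        size = int(environ.get("CONTENT_LENGTH") or 0)
        features = json.loads(environ["wsgi.input"].read(size) or b"{}")
        body = json.dumps(predict(features)).encode()
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]
    except (ValueError, KeyError, AttributeError):
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"invalid request"]
```

Served under any WSGI server (for example, the standard library's wsgiref.simple_server), such a module can then be packaged in a container and published behind an API gateway, with consumers never touching the model internals.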
We call out some key categories of technologies that form the foundation for the next generation of machine-learning-as-a-service applications. These include technologies for:
Packaging of components of analytic solutions using packaging frameworks such as Docker
Support of dynamic and real-time analytic processing using scalable event-oriented frameworks such as Kafka
Modularity and interoperability of machine-learning services using machine-learning marketplaces such as Algorithmia
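As a concrete illustration of the Docker packaging item above, a minimal Dockerfile for an analytic module might look like the following. The base image, file names, and entry point are illustrative assumptions, not a prescribed layout.

```dockerfile
# Base image with the Python runtime (illustrative choice)
FROM python:3.10-slim

WORKDIR /app

# Install the module's dependencies, pinned in requirements.txt
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the analytic module's code and expose its service port
COPY . .
EXPOSE 8080

# Launch the module's API service (hypothetical entry point)
CMD ["python", "serve.py"]
```

Building and running this image yields a self-contained module whose internal dependencies are invisible to consumers, which is precisely the abstraction the packaging frameworks above provide.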
The next-generation analytics solution architecture must achieve three fundamental goals: reuse, scale, and seamless evolution. Before detailing what achieving these goals entails, we recall that any data analytics solution has two major components: data preparation, and machine-learning-based analytics.
Reuse is important for the data preparation as well as the analytics modules in solution development. Many data preparation tasks are common, at least to some extent, across applications, and especially within application domains (such as the TeraCrunch domains outlined below). Virtually every new application requires data preparation, which can include data cleaning (such as addressing incorrect and/or missing data), data conversion, and data normalization. Applications may further require data abstraction, such as coding certain fields to standard taxonomies and ontologies. Such data preparation capabilities can be reused across applications if they are properly abstracted and packaged. On the analytics side, many machine-learning capabilities, such as classification, clustering, and association rule mining, are used in multiple applications. While every new application has some element of custom needs, a well-abstracted packaging of core machine-learning operations makes them viable building blocks that can be leveraged in new applications. Packaging frameworks such as Docker help to abstract the internal implementation and dependencies of individual modules. API frameworks provide a service-oriented interface to the module. Marketplaces and frameworks such as Algorithmia provide interoperability and scale for modules developed in any language and stack of choice.
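The data-preparation reuse described above can be sketched as a small, self-contained Python module. The field names, defaults, and min-max normalization choice here are illustrative assumptions; a real reusable module would be parameterized the same way but with domain-specific cleaning rules.

```python
# A sketch of a reusable data-preparation module, assuming records
# arrive as dicts. Field names and defaults are illustrative.
def clean_record(record, defaults, numeric_fields):
    """Fill in missing (None) fields from defaults and coerce
    the named fields to floats."""
    out = dict(defaults)
    out.update({k: v for k, v in record.items() if v is not None})
    for field in numeric_fields:
        out[field] = float(out[field])
    return out

def normalize(values):
    """Min-max normalize a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Because the cleaning rules arrive as parameters rather than being hard-coded, the same module can be dropped into a new application with only configuration changes, which is the essence of the reuse goal.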
In the age of big data, analytics solutions must scale to big and/or complex data. However, despite the prevalence of cloud computing platforms and software frameworks supporting big data processing, developing true big data machine-learning solutions has been complex and time consuming. Scaling applications has required specialized big data engineers to architect and craft custom scalability solutions using paradigms such as MapReduce, sharding, large amounts of main memory, and elastic computing. With newer technologies, the scaling of most big data processing modules, for data preparation as well as core analytics, can be abstracted. Container orchestration solutions such as Kubernetes and Docker Swarm, real-time stream processing frameworks such as Kafka, and scalable data pipeline frameworks such as Pachyderm are mechanisms by which applications can be scaled while remaining abstracted from the underlying data engineering complexities of achieving such scale.
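The decoupling that event-oriented frameworks such as Kafka provide can be sketched, in miniature, with an in-process queue and worker threads. The queue stands in for a topic and the threads for independently scaled consumers; this is an illustration of the pattern only, not Kafka itself, and the doubling step is a placeholder for a real analytic operation.

```python
import queue
import threading

# A minimal sketch of event-oriented processing: a producer pushes
# events onto a queue and worker threads consume them independently,
# mirroring (in-process) the decoupling a framework like Kafka
# provides at scale via topics, partitions, and consumer groups.
def worker(events, results):
    while True:
        event = events.get()
        if event is None:          # sentinel: shut this worker down
            events.task_done()
            break
        results.append(event * 2)  # placeholder "analytic" step
        events.task_done()

def process_stream(data, n_workers=4):
    events, results = queue.Queue(), []
    threads = [threading.Thread(target=worker, args=(events, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in data:              # the "producer" side
        events.put(item)
    for _ in threads:              # one sentinel per worker
        events.put(None)
    for t in threads:
        t.join()
    return results
```

In a production deployment, the queue would be a durable topic and the worker threads would be consumer processes scaled across machines, but the application code stays abstracted from that engineering in the same way.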
Machine learning is dynamic. Algorithms, frameworks, and models evolve by the month, especially in fast-moving subareas such as deep learning. Pre-trained models for transfer learning also evolve as more training data becomes available and/or the underlying algorithms are improved. Solutions based on machine learning need to be dynamic as well if they are to truly leverage the state of the art. Analytic modules must be designed to seamlessly evaluate multiple machine-learning models for a given task, easily upgrade to more recent and better models, and seamlessly leverage pre-trained models for particular problems coming from transfer learning. Seamless evolution is not really achieved via frameworks; it is a data science design issue, where data scientists must build such configurability, and the ability to evolve seamlessly, into the design of the machine-learning-based analytic modules.
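The configurability described above can be sketched as a small model registry: candidate models are registered per task, evaluated against a caller-supplied metric, and the best one selected, so upgrading to a newer model becomes a registration rather than a rewrite. The class and its method names are illustrative assumptions, not a specific framework's API.

```python
# A sketch of designed-in model configurability: a registry mapping
# each task to its candidate models, with selection driven by a
# caller-supplied evaluation function. Names here are illustrative.
class ModelRegistry:
    def __init__(self):
        self._models = {}

    def register(self, task, name, model):
        """Add (or replace) a candidate model for a task."""
        self._models.setdefault(task, {})[name] = model

    def best(self, task, evaluate):
        """Return (name, model) scoring highest under `evaluate`,
        e.g. held-out accuracy."""
        candidates = self._models[task]
        scores = {name: evaluate(m) for name, m in candidates.items()}
        name = max(scores, key=scores.get)
        return name, candidates[name]
```

With this design, dropping in a newer pre-trained model or an improved algorithm is one `register` call, and the evaluation step decides seamlessly whether it displaces the incumbent.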
TeraCrunch Machine Learning Solutions
TeraCrunch is a turn-key predictive analytics & artificial intelligence solutions company. TeraCrunch’s solutions in this area leverage its proprietary machine learning algorithms/models, tools and methodologies, which are customized by its PhD Data Scientists for the following domains:
Marketing and retail analytics, where we analyze customer data from various touchpoints throughout their journey with the client, with goals such as enhancing marketing or sales outreach, increasing lifetime value of the customer, attribution analysis, media-mix analysis, or churn reduction and management.
Health analytics, where we uncover patterns to develop predictive solutions over health domain data such as from electronic medical records (EMR), insurance claims data, operational & logistics management data and HR data from hospitals.
Operations optimization analytics, where we provide solutions for risk management, predictive maintenance, predicting supply and demand (of products, workforce requirements, and projects), forecasting sales, and improving human resource/talent hiring, promotion, and management.
Dr. Naveen Ashish is Director and Head of Data Sciences at the Fred Hutchinson Cancer Research Center, and Data Science Advisor to TeraCrunch LLC.