Contrary to common belief, most of data science focuses on gathering and preparing data to be fed into analytic algorithms, and on interpreting the results, rather than on developing those algorithms. The data science process is often broken down into five steps: 1) identify a problem, 2) gather the data, 3) clean the data, 4) analyze the data, and 5) present the results. While these five steps appear clean and orderly, in reality data scientists often need to work on steps 3, 4, and 5 before they even have the data, or to revisit them when a new data source is discovered that could refine an existing analysis. In this more chaotic environment, code and processes developed to solve one identified problem frequently cannot be reused for other problems. This is often because steps 2 through 5 are combined into a monolithic process in which data is passed explicitly from data-gathering processes directly into data-cleaning processes, and so on.
A Modular Analytics Framework (MAF) separates the data-passing infrastructure from the data-manipulation processes. This separation allows the processes to be cleanly defined as modules that can be added, changed, and connected at will within a data infrastructure. In this paper, we discuss the requirements for a MAF and introduce a standardized definition for describing data, modules, and pipelines. Finally, we explore more complex example uses of a MAF, including competing Machine Learning (ML) configurations and techniques, and the continuous training and deployment of Artificial Intelligence (AI) models.
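To make this separation concrete, the following is a minimal sketch (in Python) of one possible shape such a framework could take. It is an illustration only, not the paper's actual specification: the names Module, Pipeline, and execute, and the dictionary-based data store, are hypothetical choices made for this example.

from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical sketch of a MAF-style separation: modules hold the data
# manipulation; the pipeline owns all data passing.

@dataclass
class Module:
    """A data-manipulation step that declares what it consumes and produces."""
    name: str
    inputs: List[str]                                # data items this module consumes
    outputs: List[str]                               # data items this module produces
    run: Callable[[Dict[str, Any]], Dict[str, Any]]  # the manipulation itself

class Pipeline:
    """Data-passing infrastructure: wires modules together by name, so a
    module can be added, swapped, or reconnected without touching the rest."""
    def __init__(self, modules: List[Module]):
        self.modules = modules

    def execute(self, data: Dict[str, Any]) -> Dict[str, Any]:
        store = dict(data)
        for module in self.modules:
            args = {key: store[key] for key in module.inputs}
            store.update(module.run(args))
        return store

# Example: the cleaning module can be replaced without touching analysis.
clean = Module("clean", ["raw"], ["clean"],
               lambda d: {"clean": [x for x in d["raw"] if x is not None]})
analyze = Module("analyze", ["clean"], ["mean"],
                 lambda d: {"mean": sum(d["clean"]) / len(d["clean"])})

result = Pipeline([clean, analyze]).execute({"raw": [1, None, 3]})
print(result["mean"])  # 2.0

Because each module declares named inputs and outputs rather than calling its neighbors directly, the pipeline alone decides how data moves, which is what allows modules to be reused across problems.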
Keywords
AI; DATA; FRAMEWORK; MACHINE LEARNING; MODULARITY