Value-Creation in Data Science

Cloud tools to help generate minimum viable products. A Machine Learning (ML) use case from the trenches.

 

By Joe Torres, MS in Data Analytics

The value of analytics is clear, right? Maybe not. Value-creation from data is perhaps the biggest challenge working data scientists face. After all, critics often highlight the inability of analytics to produce tangible business value. Plenty of articles pinpoint how and why projects go wrong, citing factors such as a lack of stakeholder buy-in, inadequate talent, poor data quality, and unclear direction. Given this environment, it’s hard to overstate the value of incorporating minimum viable products into the project life cycle as early as possible.

A minimum viable product is a fully operational product containing only the functionality necessary to demonstrate its core usefulness. The product could be an application, a report, or a predictive model. Minimum viable products often make the most persuasive case for the value of analytics to an organization. Building one also forces data scientists to reckon early with data-related challenges that could affect the project’s value proposition. Value, for all parties, should be the project’s true north. And where value exists, investment follows.

However, generating minimum viable products is a challenging venture. Organizational skepticism often translates into tight budgets, strict deadlines, and demands for clear communication of value. Under such working conditions, engineering creativity is paramount. It’s at this suspenseful stage of the process that I’ve found tremendous value in tools like Docker and cloud products like Azure Container Instances and Logic Apps. These tools help data scientists stay nimble while producing minimum viable products that can handle production environments. I’d like to share one generalizable use case.


The Use Case — Scheduling Scripts

A lot of my analyses start in either RStudio or a Jupyter Notebook and usually follow some variant of this pattern:

  1. Pull data from a warehouse or some other data store.
  2. Perform the analyses.
  3. Publish the results back into a data warehouse.
  4. Incorporate the new analytic dataset into the reporting tool in a way users can understand.
  5. Iterate.
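As a concrete sketch of steps 1–3, the pattern can be expressed in a few lines of Python. The table and column names below are invented for illustration, and the standard library’s sqlite3 stands in for the data warehouse:

```python
import sqlite3

# Stand-in for a data warehouse: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (client_id INTEGER, length_of_stay REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, 3.0), (1, 5.0), (2, 2.0), (2, 4.0), (2, 6.0)],
)

# 1. Pull data from the warehouse or other data store.
rows = conn.execute("SELECT client_id, length_of_stay FROM visits").fetchall()

# 2. Perform the analysis -- here, a simple per-client average.
totals = {}
for client_id, los in rows:
    total, count = totals.get(client_id, (0.0, 0))
    totals[client_id] = (total + los, count + 1)
averages = {cid: total / count for cid, (total, count) in totals.items()}

# 3. Publish the results back as a new analytic table
#    for the reporting tool to consume.
conn.execute("CREATE TABLE avg_stay (client_id INTEGER, avg_los REAL)")
conn.executemany("INSERT INTO avg_stay VALUES (?, ?)", averages.items())
```

In a real workflow, the connection would point at the production warehouse and the analysis step would be the actual model code; the shape of the script stays the same.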

The walk-through example in this article follows that pattern.


When the entire machine-learning workflow is housed in a single file, the goal is often to schedule that file to run on a recurring basis. While useful options exist in Azure for executing machine-learning workflows, many of the use cases shared by Microsoft and others assume large-scale infrastructure, relatively straightforward modeling, and a desire to integrate a full-blown CI/CD pipeline. These are all legitimate areas of development, but they are not conducive to a minimum viable product.

Analytics in a Behavioral Health Care Organization

The value proposition is straightforward: When a clinician needs to create a personalized service plan for an individual seeking services, they’d like to view what treatment options were recommended to individuals in the past with a similar clinical and demographic profile.

A clustering algorithm is employed to group clients with similar characteristics. The resulting model output and custom metrics are merged with other datasets and staged in a SQL database for use in Power BI. The value proposition, however, requires that the process run daily.
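The article doesn’t specify which clustering algorithm was used; as a minimal illustration of the idea, here is a hand-rolled k-means over toy two-feature client profiles. A production pipeline would use a library implementation and the actual clinical and demographic features:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: assign each point to its nearest centroid,
    then recompute centroids, for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return labels, centroids

# Toy client profiles: (age, symptom-severity score), two obvious groups.
profiles = [(22, 1.0), (25, 1.2), (24, 0.9), (60, 4.0), (63, 4.2), (58, 3.8)]
labels, centroids = kmeans(profiles, k=2)
```

Each client’s cluster label, plus any custom metrics derived from the centroids, is what gets merged with the other datasets and staged in SQL.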

Running model scripts daily

The following setup tackled the challenges of running scheduled jobs that need database credentials, custom compute, cost control, and highly customized analyses, all in a fast and cost-effective manner.

  1. Build the R/Python file to run the model and stage the data in SQL for use in PowerBI.
  2. Dockerize the R/Python file.
  3. Create an Azure Container Instance from your Docker image.
  4. Spin up and run the Azure Container Instance once a day using Azure Logic Apps.
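As an illustration of step 2, a minimal Dockerfile for a Python variant of the script might look like the following. The file names and base image are assumptions for the example, not the actual setup:

```dockerfile
# Lightweight base image with Python preinstalled.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model script itself.
COPY run_model.py .

# Credentials are NOT baked into the image; they arrive as
# environment variables injected at container launch (step 3).
CMD ["python", "run_model.py"]
```

The resulting image is pushed to a container registry and referenced when creating the Container Instance in step 3.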


A few notes on the architecture:

  • Azure Container Instances lets you inject secrets as environment variables at launch, so the script can access the required database credentials securely.
  • There are many cloud alternatives to Azure Container Instances, such as AWS Fargate and Google Cloud Run.
  • When it comes to Container Instances, you are limited in the number of CPUs and the amount of RAM you can request.
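To illustrate the first point: inside the container, the script simply reads the injected values from its environment. The variable names below are made up for the example, and the `setdefault` lines merely simulate, for local testing, what the Container Instance injects at launch:

```python
import os

# Simulate the secure environment variables that Azure Container
# Instances would inject at launch (illustrative names and values).
os.environ.setdefault("DB_USER", "svc_analytics")
os.environ.setdefault("DB_PASSWORD", "not-a-real-secret")

def get_db_credentials():
    """Read database credentials from the environment, failing loudly
    if they were not injected rather than falling back to defaults."""
    try:
        return os.environ["DB_USER"], os.environ["DB_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Missing credential: {missing}") from None

user, password = get_db_credentials()
```

Because the credentials never appear in the image or the source code, the same image can run against different databases by changing what the Container Instance injects.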

Conclusion

Value-creation is the goal toward which all work efforts should be aimed, and minimum viable products help achieve it. The workflow presented here generalizes to a wide range of use cases faced by working data scientists.

Sources


Gray, D. A. (2023, March 14). Why Data Science Projects Fail: Practical Principles for Data Science Success. INFORMS Analytics. https://pubsonline.informs.org/do/10.1287/LYTX.2023.02.04/full