Life cycle of a data science project

Shraddha · Published in Analytics Vidhya · 4 min read · Apr 20, 2021


Photo by Marat Gilyadzinov on Unsplash

This article is an insight into the typical flow of a data science project. It is aimed at both the data scientists and the stakeholders of a project involving data science/machine learning. It is important for everyone involved to understand the nature of such projects and be comfortable with their various activities, and with the idea of experiments. This article describes the way I think about and go about a data science project.

Created by author

A data science project has a cyclic lifecycle, as shown in the figure above. Each cycle can be considered an experiment (or a set of experiments).

Every data science project starts with defining the problem statement: having a clear understanding of what we are trying to solve. This also involves defining the success criteria.

Problem Definition

In some projects, defining the problem statement is itself one of the stages. This is typically when a client comes in and says, 'We have (for example) sales data collected. What useful insights or predictions can be drawn from it?' Defining the problem statement also means pinning down the uncertainties. For example, if we are trying to forecast demand, there must be clarity on specifics such as: how far into the future do we want to forecast? For the next week or the next month?

Research

The next stage involves research. This typically includes reading up on the domain, understanding the kind of problem to be solved, and estimating the kind of data we would require. If the problem statement is vague, then research and problem definition may overlap and help answer some of the questions that arise in the problem definition stage.

After this, three tracks or activities run in parallel:

  1. Data
  2. Literature
  3. Ideation

Data

This involves understanding the current data in the system: what kind of data is useful for solving the problem, followed by a gap analysis. Which gaps in the data must be filled before starting the project in earnest, and which gaps can be worked around?
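To make this concrete, here is a minimal sketch of such a gap analysis in Python. The file name sales.csv and the date column are illustrative assumptions, not part of any particular project:

```python
import pandas as pd

# Hypothetical example: the client's collected sales data
df = pd.read_csv("sales.csv")  # assumed file name for illustration

# How much of each column is missing?
missing = df.isna().mean().sort_values(ascending=False)
print(missing[missing > 0])

# Are there gaps in the expected date coverage?
dates = pd.to_datetime(df["date"])  # assumes a 'date' column
expected = pd.date_range(dates.min(), dates.max(), freq="D")
print(expected.difference(pd.DatetimeIndex(dates.unique())))
```

A summary like this helps decide which gaps are blockers and which can be worked around, before committing to an approach.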

Literature

While the research stage involves studying the domain, the literature stage is more specific. It is very important and useful to survey the existing literature. Only on rare occasions will one come across a problem statement that has never been looked at before. While each data science project is unique, due to the unique nature of the data even in similar projects, the approaches taken in prior work can still be studied. This study is usually done in two stages: breadth first, then depth. The idea is to get a surface-level understanding of the different ways and methodologies with which the problem has already been tackled, then identify the few most relevant to the current problem and study those in depth.

Ideation

This involves a lot of brainstorming, discussion, hypothesising and whiteboarding. It draws on information obtained from both the data track and the literature track, but also on discussions about new ways to approach and solve the problem. Even if the solution seems obvious, it is always a good idea to walk it through with someone, at least as a sounding board. It is the maximum flow of ideas that we want in this track. Once we have many ideas, we can combine, discard, dwell on and evolve some of them. This track is focused on generating many ideas and looking at the problem from many different angles. The ideation phase may also involve PoCs if required.

Combining the above three tracks, possible approaches are identified. These are then prioritised and the initial approach(es) finalised. For example, if we have a classification problem, one could outline the kinds of models to explore and prioritise them based on experience and the time-effort tradeoff. Or, if there are different features to explore, a good way is to list them all out, then discuss them with the stakeholders to identify which features are important and worth the effort of creating (not all features are readily available in a database). Take the top few for the first cycle; each cycle can also be thought of as a milestone. A minimal sketch of such a prioritised first cycle is shown below.
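Here is one way such a shortlist might look in Python, evaluating a few candidate classifiers with cross-validation. The dataset, the model choices and their ordering are illustrative assumptions, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative stand-in for the project's real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Candidate models for the first cycle, roughly ordered by
# experience and the time/effort tradeoff
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Starting with a cheap baseline such as logistic regression gives a reference point that the more expensive models in later cycles have to beat.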

These approaches and algorithms are then implemented and the results analysed. This means not just looking at the accuracy of the model, but understanding the input features and their impact on it (feature importance). It also involves comparison with the results of previous experiments.
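As a small sketch of that kind of analysis, one could inspect the feature importances of a tree-based model. The data and feature names here are again illustrative assumptions:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data; in a real project X and y come from the feature pipeline
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(random_state=42).fit(X, y)

# Which inputs drive the model? Useful for comparing experiments,
# not just tracking the headline accuracy.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```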

Based on the results, the understanding of the models, the features involved and the success criteria, one would typically go back to the data and ideation phases, and sometimes the literature phase too. We now have a better understanding of the data and of how our initial ideas/hypotheses have worked. Accordingly, we are in a better position to tweak our approach and methodology, and the whole cycle starts again.

Once we have good results and sufficient confidence in the model, it can be integrated into the tool/product/software.

That is not the end, though. Before the model is put into production, monitoring metrics need to be put in place. Over time the nature of the data may change, and the model will no longer perform as well as it once did; this is also called model drift (but that's for another time). Hence, the model will need to be revised from time to time. The monitoring metrics help track the model's performance and aid in deciding when, and how frequently, the model should be revised.
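As one simple sketch of such a monitoring check, a two-sample Kolmogorov-Smirnov test can compare a feature's distribution at training time against what the model sees in production. The data and the 0.01 threshold below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

# Illustrative data: a feature at training time vs. in production
rng = np.random.default_rng(42)
train_values = rng.normal(loc=0.0, scale=1.0, size=5000)
live_values = rng.normal(loc=0.3, scale=1.0, size=5000)  # simulated drift

# Has the distribution shifted enough to flag?
result = ks_2samp(train_values, live_values)
if result.pvalue < 0.01:  # threshold is an illustrative choice
    print(f"Possible drift detected (KS statistic = {result.statistic:.3f})")
```

Checks like this, run on a schedule over the model's inputs and outputs, are what turn "revise the model from time to time" into a concrete, triggerable decision.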

Thanks for reading. If something is not clear, or you agree or disagree with some of the points, please let me know in the comments. I love hearing different perspectives :)

Shraddha · Analytics Vidhya
A data scientist & researcher; enjoys painting, crafts, dancing and dreaming.