The Data Mining Process (CRISP-DM)
Dylan | Nov 16, 2019
Data Science is a craft with its own clearly defined process for framing problems. Just as engineers have their own design process to create functional products, data scientists need a structure to ensure consistent and repeatable conclusions free of bias. Thankfully for us, this process has already been well established and is known as the Cross-Industry Standard Process for Data Mining (CRISP-DM).
General Principles

CRISP-DM is not a linear process where you always begin at step A and, after passing through steps B, C, and D, finish the project after step E.
Repetition is a critical component of CRISP-DM. It’s important to bear in mind that a pass through the process without a solution generally isn’t a failure. With every pass through the process, the data science team learns more about the data and overarching project.
In general, CRISP-DM should not be mistaken for an engineering process; rather, it is a research and development process. It is exploration-focused, with far less certain outcomes than traditional engineering processes.
Although we’ll often return to an earlier stage, the general progression of CRISP-DM is:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
Let’s explore each of the stages in depth!
Stage One: Business Understanding

The success of all other stages in CRISP-DM depends on the fundamental groundwork established during the business understanding stage. In this stage, the data scientist must carefully consider the end objective of the data mining project. In short, the data science team must answer the following three questions:
- “What exactly do we want to do?”
- “How will we do it?”
- “How will we define success?”
Dedicating the proper time to fleshing out detailed and accurate answers to these questions can save a data science team countless iterations through CRISP-DM. Although this stage may not sound as interesting on the surface as the others, it often requires the most creativity.
Stage Two: Data Understanding

In the second stage, the data science team should dive into data exploration to assess what data currently exists and how noisy it is. Perhaps after exploring the available dataset, we realize that multiple accounts exist for the same user, which, if undiscovered, could result in inaccurate conclusions about our target population.
Another important thing to ensure in this stage is that we have trustworthy data available for our target variable. Before we can predict whether a customer will churn, our model will require sufficient data about customers who have and have not churned in the past. If our dataset doesn’t include information about which customers churned, it will be impossible to train our model down the road.
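Both checks above can be sketched in a few lines of pandas. The dataset below is purely illustrative (the `user_id` and `churned` columns are hypothetical stand-ins for whatever your data actually contains):

```python
import pandas as pd

# Hypothetical customer dataset (values and column names are illustrative)
df = pd.DataFrame({
    "user_id": [101, 102, 102, 103],
    "email":   ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
    "churned": [0, 1, 1, None],
})

# Check for multiple accounts belonging to the same user
dupes = df[df.duplicated(subset="user_id", keep=False)]
print(f"{len(dupes)} rows share a user_id with another row")

# Check that the target variable is actually populated
missing_target = df["churned"].isna().sum()
print(f"{missing_target} rows lack a churn label")
```

Findings like these feed directly back into the process: unlabeled rows may need to be dropped, and duplicate accounts merged, before moving on.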
Stage Three: Data Preparation

The first part of this stage is determining which data you need to train a predictive model. The data science team will have to determine which features are best suited to predict the target value and which should be left out of the model.
Unfortunately, data often isn’t stored in the exact format a model requires to learn from it. During this stage, the data science team will often have to convert the data to tabular format (a format with columns and rows), handle missing values, convert data to the proper type, and/or normalize the data.
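As a minimal sketch of those last three steps, assuming a toy pandas DataFrame standing in for raw customer data (the columns and values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for raw data: ages stored as strings, some values missing
raw = pd.DataFrame({
    "age":   ["34", "41", None, "29"],
    "spend": [120.0, 80.0, 200.0, None],
})

# Convert data to the proper type
raw["age"] = pd.to_numeric(raw["age"])

# Handle missing values, e.g. by imputing each column's median
prepared = raw.fillna(raw.median(numeric_only=True))

# Normalize each column to the [0, 1] range (min-max scaling)
normalized = (prepared - prepared.min()) / (prepared.max() - prepared.min())
print(normalized)
```

Which imputation and scaling strategies are appropriate depends on the data and the model; median imputation and min-max scaling here are just one common choice.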
Stage Four: Modeling

Although you probably already selected which class of technique (e.g. classification) you would use during the Business Understanding stage, here you will select an exact technique (e.g. random forest). After selecting a model, a team should prepare some means of model validation to evaluate the effectiveness of the model after training. This often requires the available data to be split into training and testing groups.
After dividing the data into groups, we’re ready to build and train the model. For each available model, there are a number of parameters to consider and tune. Parameters should be set during this stage, alongside the justification for their selection. After training the model, it’s ready to be assessed. What was the model’s accuracy? Could it be improved by tuning one or more of the parameters?
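The split-train-assess loop above might look something like the following with scikit-learn. The synthetic dataset and the particular parameter values (`n_estimators`, `max_depth`) are illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared, tabular dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split into training and testing groups for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Parameters like these are set and justified during the modeling stage
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Assess the trained model on the held-out testing group
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

If the accuracy falls short of the success criteria from Stage One, this is the point where you revisit the parameters, or an earlier stage entirely.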
Stage Five: Evaluation

Our data mining results require further exploration and scrutiny to ensure sample anomalies have not resulted in an inaccurate conclusion. Testing results in the field requires more time, capital, and effort than testing them in a controlled laboratory setting. A data science team must be confident in the model’s validity and reliability before considering deployment. If sufficient confidence has not been reached, or the initial standard of success laid out during the first stage remains unmet, the team should return to the first stage, provided their time and budget allow it.
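One common way to guard against a sample anomaly — a single lucky train/test split — is cross-validation, which scores the model on several different partitions of the data. A sketch, again using an illustrative synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the project's prepared dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Score on 5 different train/test partitions rather than a single split
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy per fold: {scores.round(2)}")
print(f"Mean: {scores.mean():.2f} +/- {scores.std():.2f}")
```

A large spread between folds is a warning sign that the single-split result from Stage Four may not hold up in the field.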
Stage Six: Deployment

Generally, the final model will be handed off to a development team who will deploy it in its final form in production. For this to take place, the model will likely need to be re-coded by the data science team before passing it off. Sometimes the data mining phase is extended into production to continue optimizing the model’s performance with unsupervised tasks. In this case, fail-safes must be built in.
Thanks for reading this post about CRISP-DM! I hope you took away something valuable that you’ll be able to immediately apply in your own projects! As always, if you have any questions or anything remains unclear please post in the comments below! I look forward to continuing this discussion! Until next time, happy coding from Nimble Coding!