Are you new to data science? Are you transitioning from a business analyst to a citizen data scientist? Or are you a seasoned data scientist with a relevant degree from university?
Whichever describes you, when you started out in machine learning, did you ever feel lost facing the long list of models, not knowing which one to choose for your problem? Or maybe you are familiar with Logistic Regression and Linear Regression, but have always wondered what the other algorithms can be used for?
I would like to share with you the Alteryx Predictive Flow Charts that we created at TrueCue.
The Predictive Flow Charts visualise some common considerations that analysts face when choosing a predictive algorithm. They offer guidance on which model to use based on, for instance, what kind of data you are trying to predict, what volume of data you have, or how important it is for the model to be interpretable.
The flow charts are designed to accompany TrueCue’s Predictive Analytics Alteryx training for novice data analysts and provide a starting point for learning the Predictive Analytics toolbox in Alteryx. A trained Data Scientist will spot some simplifications and generalisations.
The Data Investigation tool category includes tools for understanding the data to be used in a predictive analytics project, and tools for conducting specialised data sampling tasks for predictive analytics. Understanding what your data looks like is the first step of designing a machine learning solution.
Model selection plays a crucial role in a predictive project. When we get our data, we typically start with some basic descriptive analysis to investigate and understand the data we are dealing with. Then, based on the predictive goal, we determine whether we have a Classification problem (where we want to assign data to groups or categories, e.g. predicting whether a loan applicant will default), or a Prediction problem (where we want to predict a numeric value, often called regression, e.g. predicting how many software licences we are going to sell next quarter).
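As a rough illustration of that first decision, here is a minimal Python sketch that guesses the problem type from the values of the target field. The function name `problem_type` and the sample values are hypothetical, and the rule is a simplification for illustration only, not the logic of the flow charts or of any Alteryx tool.

```python
from numbers import Number

def problem_type(target_values):
    """Rough heuristic: numeric targets suggest Prediction, anything else Classification."""
    values = [v for v in target_values if v is not None]
    # Booleans count as numbers in Python, so exclude them explicitly.
    all_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    return "Prediction" if all_numeric else "Classification"

# Made-up target columns for the two examples in the text.
loan_default = ["yes", "no", "no", "yes"]
licences_sold = [120, 95, 143, 110]

print(problem_type(loan_default))
print(problem_type(licences_sold))
```

A heuristic like this would mislabel categories stored as numeric codes (for example 0 and 1 for default and no default), so the predictive goal should always have the final say.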
After deciding whether we have a Classification or Prediction problem, we can move on to model selection. You will see in the flow charts that some models can be used for both Classification and Prediction, while some can only be used for one of the two. There might be multiple models that are suitable in a given situation, and you often cannot know in advance which one will perform better.
This is why we design a Validation process: we split the data, train the selected models, and measure their performance on a “hold-out” dataset (which was hidden from all the models during the training stage) so that we can compare the models fairly. Sometimes we split the data into multiple folds so that each fold takes a turn as the hold-out set and performance is tested several times, which makes the comparison more robust. This is called Cross Validation.
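To make the idea concrete, here is a minimal sketch in plain Python of a hold-out comparison followed by 5-fold cross validation. Everything in it is an assumption made for illustration: the data is synthetic, the two candidate "models" (a mean baseline and a straight-line fit) are toy stand-ins, and the helper names are hypothetical. In Alteryx the same steps would be built with workflow tools rather than code.

```python
import random

def holdout_split(rows, holdout_fraction=0.3, seed=0):
    """Shuffle the rows and return (train, holdout)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def k_fold_splits(rows, k=5, seed=0):
    """Yield (train, validation) pairs; every row is validated exactly once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, folds[i]

def fit_mean(train):
    """Toy baseline: always predict the average target seen in training."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Toy model: least-squares straight line through (x, y) pairs."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_absolute_error(model, rows):
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)

# Synthetic data: a linear trend plus a little noise (made up for this sketch).
rng = random.Random(42)
data = [(x, 3.0 * x + 10.0 + rng.uniform(-2, 2)) for x in range(40)]
candidates = {"mean_baseline": fit_mean, "straight_line": fit_line}

# Hold-out validation: train on one part, compare on the hidden part.
train, holdout = holdout_split(data)
holdout_scores = {
    name: mean_absolute_error(fit(train), holdout)
    for name, fit in candidates.items()
}

# Cross validation: repeat the comparison once per fold and average the errors.
cv_scores = {}
for name, fit in candidates.items():
    errors = [mean_absolute_error(fit(tr), va) for tr, va in k_fold_splits(data)]
    cv_scores[name] = sum(errors) / len(errors)

print(holdout_scores)
print(cv_scores)
```

The model with the lowest error on data it did not see during training is the one we would carry forward.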
Once we have a winning model, we can then use it to create predictions on new data. This is called inference (or scoring).
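The scoring step can be sketched in a few lines of Python. The sketch assumes that validation has already produced one error figure per candidate model. The error values, model names and fitted models below are all hypothetical placeholders, not results from a real project.

```python
# Hypothetical validation errors from an earlier comparison (made-up numbers).
validation_error = {"mean_baseline": 31.2, "linear_trend": 1.1}

# Hypothetical fitted models: each maps a feature value to a prediction.
fitted_models = {
    "mean_baseline": lambda x: 68.5,
    "linear_trend": lambda x: 3.0 * x + 10.0,
}

# The winning model is the one with the lowest validation error.
winner = min(validation_error, key=validation_error.get)

# Scoring: apply the winner to records it has never seen.
new_records = [40, 41, 42]
predictions = [fitted_models[winner](x) for x in new_records]

print(winner, predictions)
```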
The flow charts were created by Katelyn Weber (analytics) and Jakub Szepietowski (design).
TrueCue were the 2019 Alteryx European Partner of the Year.
Bingqian believes in the power of Analytics and Data Science in uncovering insights and helping to better inform decision making. As a Senior Consultant and Data Science Lead at TrueCue, she enjoys finding solutions for challenges in data consolidation, modelling, visualisation and Advanced Analytics.
She leverages modern technology such as Alteryx, Tableau, DataRobot, and Microsoft Azure Machine Learning, and is one of the 17 Certified Alteryx Experts in the world. Outside of work, she enjoys a wide range of activities, from oil painting, poetry reading and scuba diving to boxing and krav maga.