In a previous article, I described the four development standards we developed at The Data Analysis Bureau (T-DAB) for the successful delivery of machine learning projects. It featured a table listing the components we use to develop a machine learning model. In this article, Machine Learning Procedure at The Data Analysis Bureau, I describe in more detail what each of these steps consists of.
Pre-processing consists of transforming the data into a table in which rows are observations and columns are variables. This can range from importing a well-presented CSV file to pulling raw data from a database cluster.
It is typically a very quick step in the PoC, as we receive the data in an offline format such as CSV files. For the Demonstrator, we usually introduce Data Engineering support if the data is held in a database, and we increase that support for the Prototype and MVP.
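As a minimal sketch of the target shape, assuming Python with pandas (the article does not prescribe a tool for this step) and a small made-up CSV held in memory, the result is one row per observation and one column per variable:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a client file (made-up values).
raw_csv = io.StringIO(
    "timestamp,sensor_id,temperature\n"
    "2021-01-01 00:00,A,20.5\n"
    "2021-01-01 01:00,A,21.0\n"
    "2021-01-01 00:00,B,19.8\n"
)

# Each row is one observation, each column one variable.
observations = pd.read_csv(raw_csv, parse_dates=["timestamp"])
print(observations.dtypes)
```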
Data cleaning notoriously takes up a large part of any data science project. Indeed, incorrect or missing data will impair the quality of the model we build from it. For the Prototype and MVP, we first check each variable. We also look for any correlations and trends between multiple variables. This informs us of the structure of the data and allows us to correct any mistakes (e.g. typos or inconsistent capitalisation) and deal with missing values. Once we are confident the dataset is valid and complete, we move on to the next steps.
However, for the PoC and Demonstrator, we use quick and iterative approaches. In these early stages, we start with a light cleaning so that we can move quickly to the next steps. If, or when, an issue in the data impairs the feature engineering or ML model building, we go back to the cleaning stage and fix the issue. This is because our goal in these early stages is not to create a perfect model, but one that is good enough to deliver insight back to the business. Therefore, while the usual ratio of data cleaning to model building is around 80/20, for the PoC and Demonstrator we aim for a ratio of 20/80.
However, reducing cleaning to 20% does not mean taking it lightly: each of the steps outlined below is approached rigorously.
Exploratory Data Analysis on the Raw Data
To start with data cleaning, we run an exploratory data analysis on the raw data. This gives us an opportunity to look at each variable, and understand what it contains, what it represents, how complete and meaningful it is. For each variable, we consider (where appropriate):
- Data type
- Number of missing values, zeros and infinite values
- Range of values
- Count of unique values
- Distribution (using histograms to visualise it)
We use a combination of several R packages that automate this process. This allows us to detect missing and invalid values quickly.
Our favourites are:
- funModeling: generates a simple yet useful table giving the proportion of zeros, infinite values and missing values in each variable
- autoEDA: generates a table showing the characteristics of each variable. We are particularly fond of the ConstantFeature and ZeroSpreadFeature categories, which show whether a variable is constant (very useful for categorical variables) or has very low variance (for numerical variables)
- DataExplorer: generates an HTML report. Very visual, it gives a graphical view of the amount of missing values in each variable and overall. It also plots a histogram and QQ plot of each numerical variable, and a correlation matrix
- dataMaid: autogenerates a data dictionary, useful for considering each variable one by one. It also shows the distribution of categorical variables, and flags possible outliers.
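The per-variable checks listed above (data type, missing values, zeros, infinite values, range, unique values) can also be sketched by hand. The following is an illustration in Python with pandas, not the output of any of the R packages named above; `profile_variables` and the example data are hypothetical.

```python
import numpy as np
import pandas as pd


def profile_variables(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise each column: type, missing/zero/infinite counts, range, unique values."""
    rows = []
    for name in df.columns:
        col = df[name]
        numeric = pd.api.types.is_numeric_dtype(col)
        rows.append({
            "variable": name,
            "dtype": str(col.dtype),
            "n_missing": int(col.isna().sum()),
            # Zeros, infinite values and range only make sense for numeric columns.
            "n_zero": int((col == 0).sum()) if numeric else 0,
            "n_infinite": int(np.isinf(col).sum()) if numeric else 0,
            "min": col.min() if numeric else None,
            "max": col.max() if numeric else None,
            "n_unique": int(col.nunique()),  # missing values are not counted
        })
    return pd.DataFrame(rows).set_index("variable")


# Made-up example data.
example = pd.DataFrame({
    "speed": [0.0, 12.5, np.nan, np.inf, 7.0],
    "site": ["north", "north", "south", None, "south"],
})
summary = profile_variables(example)
print(summary)
```

A table like this makes constant columns (a single unique value) and columns dominated by zeros or missing values easy to spot before any plotting.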
Dealing with missing values
Our approach to dealing with missing values depends on how much data is missing.
If there is more than 10% missing data, we talk to the client to understand what the missing values mean. For example, it could be that if a variable has not changed since the last entry, the person or sensor does not record it. In this case, the missing value would be replaced by the previous value. In other cases, however, a missing value might mean there was no observation, in which case the missing value should be replaced with a zero.
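The two interpretations above lead to two different fills. A minimal sketch in Python with pandas, using made-up column names and values:

```python
import numpy as np
import pandas as pd

# Made-up readings: NaN marks a missing entry.
readings = pd.DataFrame({
    "valve_position": [3.0, np.nan, np.nan, 5.0, np.nan],  # only recorded on change
    "defect_count": [2.0, np.nan, 1.0, np.nan, np.nan],    # blank means nothing observed
})

cleaned = readings.copy()
# Case 1: "no entry" means "unchanged", so carry the last recorded value forward.
cleaned["valve_position"] = cleaned["valve_position"].ffill()
# Case 2: "no entry" means "no observation", so the missing value becomes zero.
cleaned["defect_count"] = cleaned["defect_count"].fillna(0)
print(cleaned)
```

Which rule applies to which variable is exactly what the conversation with the client is meant to establish; the code cannot infer it.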
If there is less than 10% missing data and it seems to be missing at random, we use imputation methods to populate the missing data points:
- Nearest neighbours for a quick approach
- Multiple imputation by chained equations (MICE) for a more complete and elegant approach
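Both imputation approaches can be sketched in Python with scikit-learn, which is an assumption here rather than the tooling used at T-DAB. Note that scikit-learn's `IterativeImputer` is inspired by MICE but returns a single imputed dataset by default, so it is a stand-in for the chained-equations idea rather than full multiple imputation. The data is synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.default_rng(0)
# Made-up data: two correlated variables plus one independent variable.
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

# Remove 5% of the values at random (30 of 600), at most one per row.
rows = rng.choice(200, size=30, replace=False)
cols = rng.integers(0, 3, size=30)
incomplete = data.copy()
incomplete[rows, cols] = np.nan
mask = np.isnan(incomplete)

# Quick approach: fill each gap from the 5 most similar rows.
knn_filled = KNNImputer(n_neighbors=5).fit_transform(incomplete)

# Chained-equations approach: model each variable from the others, in rounds.
chained_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(incomplete)

print("values imputed:", int(mask.sum()))
```

To approximate multiple imputation, one option is to run `IterativeImputer` several times with `sample_posterior=True` and different seeds, then pool the downstream results.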
If more than 10% of the data is missing from only a few variables, we consider dropping these variables entirely. We then usually discuss with the client whether these variables seem important, and advise them on improving their data collection process.
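A short sketch of how the 10% rule of thumb could be applied per variable, in Python with pandas. The column names, values and the `MISSING_THRESHOLD` constant are made up for illustration.

```python
import numpy as np
import pandas as pd

MISSING_THRESHOLD = 0.10  # the 10% rule of thumb from the text

# Made-up dataset: 'humidity' is mostly missing, the others are nearly complete.
df = pd.DataFrame({
    "temperature": [20.1, 20.4, np.nan, 21.0, 20.8, 20.9, 21.2, 21.1, 20.7, 20.5, 20.3],
    "humidity": [np.nan, 0.4, np.nan, np.nan, 0.5, np.nan, np.nan, 0.45, np.nan, np.nan, np.nan],
    "pressure": [1.0] * 11,
})

# Fraction of missing values per variable.
missing_share = df.isna().mean()
to_review = missing_share[missing_share > MISSING_THRESHOLD].index.tolist()

# Candidates are listed for discussion with the client before anything is dropped.
print("Variables above threshold:", to_review)
reduced = df.drop(columns=to_review)
```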
We work collaboratively with clients, who are the domain experts, to brainstorm interesting features that could improve the performance of the ML model we are building. For a PoC and Demonstrator, we order all new features by perceived importance, then on each iteration we add new features from the top of the list to see how much they improve the performance. This allows us to potentially test many different features without over-investing our time in feature engineering.
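The iteration described above can be sketched as a simple loop: add the next feature from the ranked list, re-score, and record the change. This is an illustration in Python with scikit-learn on synthetic data; the feature names, the linear model and the R² metric are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 300
# Made-up candidate features, already ordered by perceived importance.
features = {
    "engine_load": rng.normal(size=n),
    "ambient_temp": rng.normal(size=n),
    "operator_shift": rng.normal(size=n),  # pure noise in this toy example
}
target = (
    3.0 * features["engine_load"]
    + 1.0 * features["ambient_temp"]
    + rng.normal(scale=0.5, size=n)
)

selected, history = [], []
for name in features:  # dicts preserve insertion order
    selected.append(name)
    X = np.column_stack([features[f] for f in selected])
    score = cross_val_score(LinearRegression(), X, target, cv=5, scoring="r2").mean()
    history.append((name, score))
    print(f"after adding {name}: mean CV R^2 = {score:.3f}")
```

When the score stops improving, the remaining features further down the list are unlikely to be worth the engineering effort at this stage.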
For Prototype and MVP, we go to great lengths to build all the features we think might have an impact. This might require finding other data sources or extra engineering efforts to compute them from the existing data.
After building features, we run a quick Exploratory Data Analysis (EDA) on them, during which we:
- Clean the new features (see data cleaning section)
- Consider the data type (count, proportion)
- Plot and name the distribution (normal, binomial, Poisson, etc.). We consider transforming the data (e.g. with a log transform) if it improves understanding or helps meet statistical assumptions
- Compute statistics necessary to check assumptions (see table on assumptions)
- Sense check the units and scales, to make sure they match other features in the dataset
- If we work with time series, we make the data stationary (remove the low-frequency components, i.e. those with periods longer than the time window we plan on predicting)
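Two of the checks above lend themselves to a short sketch in Python with pandas: a log transform on a skewed feature, and removing a trend from a time series. First-order differencing is used here as one common way to approach stationarity; it is an assumption and not necessarily the method used at T-DAB. All data is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Made-up right-skewed feature (log-normal), e.g. a duration or a cost.
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))
logged = np.log(skewed)  # valid because every value is strictly positive
print(f"skewness before: {skewed.skew():.2f}, after log: {logged.skew():.2f}")

# Made-up time series: linear trend plus noise.
t = np.arange(200)
series = pd.Series(0.5 * t + rng.normal(scale=1.0, size=200))

# First-order differencing removes the linear trend; what remains
# fluctuates around the slope of the original series.
differenced = series.diff().dropna()
print(f"mean of differenced series: {differenced.mean():.2f}")
```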
Machine learning model building and training
Once we have a clean dataset with features that we think are relevant for the problem, we try fitting some ML models to the data. Here is the checklist we use:
- Scale and centre the data
- Split into training and test sets
- Consider the problem type: classification, regression or clustering
- Choose an evaluation metric suitable for the problem (see table 2)
- Run the basic algorithms* using k-fold cross validation
  - Note: when working with time series, we usually make sure that the observations forming the training set occurred chronologically before the observations used for the test set.
- Check assumptions of the algorithms we have run*
- Hyper-parameter tuning
- Feature importance extraction
  - To select important features, we use embedded methods where they are available through the regularisation of the models (e.g. LASSO and ridge methods in regression, regularised random forests). Otherwise, we use the Recursive Feature Elimination wrapper method.
  - For clustering, we use dimensionality reduction such as PCA to select the most useful features
- Evaluate models on the training, cross-validation, and test sets.
- Compare models and select the best performing
* See Table 3 below for a description for the PoC and Demonstrator steps
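The checklist above can be sketched end to end for a classification problem. This is an illustration in Python with scikit-learn on synthetic data, not T-DAB's actual code. The two algorithms, the F1 metric and the tiny parameter grid are assumptions. One deliberate choice: scaling is placed inside a pipeline so that it is fitted on training folds only, rather than on the full dataset before the split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a cleaned, feature-engineered dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# Hold out a test set that is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling sits inside each pipeline so it is re-fitted on every training fold.
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0)),
}

# Basic algorithms, 5-fold cross-validation, one evaluation metric.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores, "-> best in cross-validation:", best_name)

# Hyper-parameter tuning for the random forest (tiny grid, for illustration only).
grid = GridSearchCV(
    candidates["random_forest"],
    param_grid={"randomforestclassifier__max_depth": [3, None]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)

# Feature importances from the tuned forest, then a final check on the test set.
forest = grid.best_estimator_.named_steps["randomforestclassifier"]
importances = forest.feature_importances_
test_f1 = grid.score(X_test, y_test)
print(f"test F1 of tuned forest: {test_f1:.3f}")
```

Comparing the cross-validation score with the test score gives a first indication of overfitting before a model is selected.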
Table 2: Evaluation metrics
Table 3: Common algorithms
Table 3 shows the algorithms we usually test in the PoC/Demonstrator. For the Prototype and MVP, we might use less common algorithms, and work more on a case-by-case basis.
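Two items from the model-building checklist benefit from a separate illustration: keeping training observations chronologically before test observations, and wrapper-based feature selection. The sketch below assumes Python with scikit-learn, where the wrapper method is called recursive feature elimination (`RFE`). The data and the linear model are made up.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
# Made-up, time-ordered data: only the first two of five features drive the target.
X = rng.normal(size=(120, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Chronological folds: every training index precedes every test index.
folds = list(TimeSeriesSplit(n_splits=4).split(X))
for train_idx, test_idx in folds:
    print(f"train up to row {train_idx.max()}, test rows {test_idx.min()}-{test_idx.max()}")

# Wrapper-style selection: repeatedly fit, drop the weakest feature, refit.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
kept = np.flatnonzero(selector.support_).tolist()
print("features kept:", kept)
```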