You are feeling extremely hungry. You have bought a list of ingredients from the grocery store to prepare a delicious meal for the evening. You start cooking, adding different flavors to spice things up. But a problem awaits that will spoil the mood of the whole evening.
When the food is ready, you take the first bite, only to realize something is wrong. It tastes ‘bad’. Some of the ingredients used while preparing the food were of poor quality, and the whole meal has not turned out as you had hoped.
Let’s apply the same analogy to the landscape of data science. Here, the meal is the machine learning model that you are developing, and the ingredients are the data that you feed into the model.
If the data is of poor quality, so is the model and, consequently, its predictions. In the world of computers, this is called GARBAGE IN, GARBAGE OUT.
If you're still not sure how to get started with machine learning without any coding, this article's got you covered.
Wisdom would tell you to investigate the quality of your data before building any machine learning model. Before you take on the role of a data scientist, you might want to put on the hat of a detective and examine the quality of the data residing in your company’s internal systems. You should also be sure that your company is really ready to take on an AI/ML project.
We will go through some of the most common data quality issues you might encounter when working with real-world datasets. The aim here is not to shed light on why these issues arise and what could be done to resolve them, but merely to give an overview of the most common problems that you might come across and need to watch out for. Some of these issues can be fixed through your data cleaning or data wrangling process.
To illustrate our point, we have a table below that lists the names of employees working at a company, their age, department, and their contract start and end dates.
Before reading any further, we suggest that you have a look at the table and try to figure out if you can spot any data quality issues. (Did you spot them all?)
As you can see in the table above, there are a lot of NULL values, which means data is missing. A simplistic approach would be to drop every row with missing values, but the trade-off is a loss of information that could have helped the model learn the underlying trends and patterns in the data.
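A quick way to quantify the problem is to count the missing values per column before deciding between dropping and imputing. The sketch below uses pandas on a made-up DataFrame whose column names only mirror the article’s table; none of it is the actual data.

```python
import pandas as pd
import numpy as np

# Illustrative data only; column names mirror the article's table
df = pd.DataFrame({
    "name": ["John", "John", "Sara", None],
    "age": [34, 34, np.nan, 29],
    "department": ["Sales", "IT", None, "Finance"],
})

# Count missing values per column before deciding how to handle them
print(df.isna().sum())

# Option 1: drop rows with any missing value (simple, but loses information)
dropped = df.dropna()

# Option 2: impute instead, e.g. fill a numeric column with its median
df["age"] = df["age"].fillna(df["age"].median())
```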
Have a look at the last row in the table. Notice anything strange?
Ah37@ does not conform to what a valid name should look like. The Name column should contain only alphabetic characters, but Ah37@ mixes letters, digits, and a special character.
This is an example of invalid data, where an entry does not comply with the column’s expected format.
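One way to catch such entries, assuming the names are loaded into a pandas Series, is a pattern check against the characters a name is allowed to contain. The values below are hypothetical apart from the Ah37@ example.

```python
import pandas as pd

# Hypothetical Name column containing the invalid entry Ah37@
names = pd.Series(["John", "Alex", "Ah37@"])

# A name may contain letters, spaces, hyphens and apostrophes; anything else is flagged
valid = names.str.fullmatch(r"[A-Za-z][A-Za-z '\-]*", na=False)
print(names[~valid])  # -> Ah37@
```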
Inconsistency issues usually arise when dealing with date-time columns. Take a look at the contract_start and contract_end dates in the last row and compare them with the other values.
The dates in all the other rows follow the YYYY-MM-DD format, whereas the last row seems to use the MM-DD-YYYY format. All dates need to be in the same format.
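One way to surface such mismatches, assuming the dates arrive as strings, is to parse them strictly against the expected format and inspect whatever fails; the values below are illustrative and pandas is used only as an example tool.

```python
import pandas as pd

# Illustrative column mixing YYYY-MM-DD and MM-DD-YYYY strings
raw_dates = pd.Series(["2021-03-01", "2020-07-15", "03-01-2021"])

# Strict parsing against the expected format exposes non-conforming rows as NaT
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")
print(raw_dates[parsed.isna()])  # -> 03-01-2021

# The offending rows can then be re-parsed with their actual format and merged back
fixed = pd.to_datetime(raw_dates[parsed.isna()], format="%m-%d-%Y")
parsed = parsed.fillna(fixed)
```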
Another example of inconsistent data can be found in the first two rows of the table. Both rows are about the employee John, and the data in all the columns seems to be the same except for one: Department.
In the first row, we see that John works in Sales, but the second row says that John works in IT. So which one is correct? An employee cannot work in two different departments at once. Such contradictions often arise when data is compiled or aggregated from multiple systems working in silos.
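Contradictions like this can be flagged by checking, per employee, how many distinct departments are recorded, for example with a pandas groupby; the data below is a made-up illustration of the John case.

```python
import pandas as pd

# Made-up employee table with a conflicting department entry for John
df = pd.DataFrame({
    "name": ["John", "John", "Sara"],
    "department": ["Sales", "IT", "Finance"],
})

# Employees that appear with more than one department need manual resolution
departments_per_employee = df.groupby("name")["department"].nunique()
print(departments_per_employee[departments_per_employee > 1])  # -> John    2
```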
Have a look at rows six and seven.
They are both about the same employee, Alex, and contain exactly the same information. This is redundant data; the duplicates need to be identified and removed before proceeding.
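Exact duplicates are usually the easiest issue to handle. A minimal sketch with pandas, using made-up values for Alex’s rows, might look like this.

```python
import pandas as pd

# Made-up table where Alex appears twice with identical values
df = pd.DataFrame({
    "name": ["Alex", "Alex", "Sara"],
    "age": [41, 41, 29],
    "department": ["IT", "IT", "Finance"],
})

# Inspect the exact duplicates, then keep only the first occurrence of each row
print(df[df.duplicated(keep=False)])
deduped = df.drop_duplicates()
```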
Have a closer look at the first row in the table and pay close attention to the contract_start and contract_end dates. Do you find anything peculiar about them?
On the surface, both dates seem to be in the correct format (YYYY-MM-DD). But if you take a closer look, you will notice that the contract start date is later than the contract end date: John’s contract started in 2021 but ended in 2020. This defies business logic, as the contract end date must come after the contract start date.
Such checks need to be performed, especially when dates are involved: dates must be compared with each other to confirm that the differences are reasonable. This was a relatively simple example for illustration, but in a company there can be scenarios in which different stakeholders use different dates to measure a KPI. In such cases, it is extremely important that all stakeholders involved come to a consensus on business definitions.
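A basic sanity check of this kind, sketched below with pandas on illustrative values, simply flags rows where the contract end date precedes the start date.

```python
import pandas as pd

# Illustrative contract dates; the first row violates start < end
df = pd.DataFrame({
    "name": ["John", "Sara"],
    "contract_start": pd.to_datetime(["2021-05-01", "2019-02-01"]),
    "contract_end": pd.to_datetime(["2020-05-01", "2022-02-01"]),
})

# Flag rows where the contract ends before it starts for review
violations = df[df["contract_end"] < df["contract_start"]]
print(violations)  # -> John's row
```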
With the huge volumes of data being generated on a daily basis, organizations are faced with an ever-increasing challenge to keep their data quality in check, or else they won’t be able to unleash the true potential of their data to get meaningful insights for decision making. If businesses truly want to be a part of the data revolution that is taking place, then they need to take timely measures to ensure that their data is high-quality, consistent, accurate, and relevant.
Not sure where to start with machine learning? Reach out to us with your business problem, and we’ll get in touch with how the Engine can help you specifically.