This article is tutorial 2 of 4 and will help you use The AI & Analytics Engine for data preparation, a vital step in machine learning.
In the previous tutorial, Tutorial 1/4: Creating Your First Project, we covered how to create your first project in the AI & Analytics Engine. Refer to that article if you need help getting set up.
A familiar concept in machine learning is Garbage In, Garbage Out: the quality of a predictive model depends directly on the quality of the data it is trained on. Getting this step right, and making it repeatable and systematic, is key to running smooth projects. The AI & Analytics Engine simplifies the data preparation, cleaning, and wrangling process for you.
If you need a refresher on data preparation or the basics of machine learning, now is a good time.
For this tutorial, we will be using the Titanic dataset, which is composed of three .csv files; we will use train.csv for this article. The data provides various features for 891 passengers who boarded the Titanic.
Our task: To predict whether someone survived or not.
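Before uploading, it can be handy to sanity-check the file locally. Here is a minimal pandas sketch (assuming train.csv sits in your working directory); this is just for your own inspection and is not part of the Engine workflow:

```python
import pandas as pd

# Load the Kaggle Titanic training split (assumes train.csv is in the
# current working directory).
df = pd.read_csv("train.csv")

print(df.shape)                        # (891, 12): 891 passengers, 12 columns
print(df.dtypes)                       # the types pandas inferred per column
print(df["Survived"].value_counts())   # class balance of our prediction target
```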
Before you import your data into The AI & Analytics Engine, there are some best practices and things to note about your data. PI.EXCHANGE provides recommendations on preparing files for upload.
So, you have signed in. Head to the project page (or refer to the starting-a-project tutorial if you need to), hover over the floating action buttons at the bottom right, and choose "New Dataset".
When presented with an option to choose a method, you simply select where you are importing the data from.
In this example, I am importing from my local drive, so I selected that option, gave my dataset the name titanic_train, and dragged and dropped my file in. After the data is uploaded to the platform, the process of data preparation begins.
When a dataset is first imported, the option is given to create a new data wrangling recipe.
A recipe is a reproducible pipeline of data transformations; each transformation is called an action. Recipes are built one iteration at a time: you queue up actions and commit them, and each commit ends the current iteration. The Engine then begins a new iteration with further recommended actions. This cycle continues until The Engine can no longer find actionable insights (see step 3.2).
Here is where it gets cool.
If you've created recipes in the past, the reproducible nature of a recipe means that the recipe can be applied to a new dataset as long as it has a compliant schema (more on schema below!). This can radically improve the efficiency of your data-wrangling workflow.
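To make "reproducible pipeline" concrete: you can picture a recipe as an ordered list of transformation functions that can be replayed on any schema-compatible dataset. A hypothetical Python sketch (an analogy only, not the Engine's actual implementation):

```python
import pandas as pd

# A "recipe" viewed as an ordered list of actions, where each action is a
# function that takes a DataFrame and returns a transformed DataFrame.
def drop_cabin(df):
    return df.drop(columns=["Cabin"])   # a mostly-missing column

def fill_age(df):
    return df.assign(Age=df["Age"].fillna(df["Age"].median()))

recipe = [drop_cabin, fill_age]

def apply_recipe(df, recipe):
    for action in recipe:
        df = action(df)
    return df

train = apply_recipe(pd.read_csv("train.csv"), recipe)
# Because the recipe is independent of any one file, it can be replayed on
# any dataset with a compliant schema, e.g. the test split:
test = apply_recipe(pd.read_csv("test.csv"), recipe)
```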
For this tutorial, we are going to create a new data wrangling recipe by selecting "Create a new data wrangling recipe". Then, enter a new name or confirm the provided name of your wrangling recipe. Select "Create".
In the first iteration, the recommended action is to edit or confirm the dataset schema. The schema is meta-information about the dataset: the name and type (numeric, text, date, etc.) of each column. To understand how to proceed, it is best to understand the Data Wrangling Window.
The data wrangling window is divided into 3 sections:
Our Titanic training dataset is made up of 12 columns.
Columns Explained
The 12 columns in train.csv are: PassengerId, Survived (0 = No, 1 = Yes), Pclass (ticket class), Name, Sex, Age, SibSp (number of siblings or spouses aboard), Parch (number of parents or children aboard), Ticket, Fare, Cabin, and Embarked (port of embarkation).
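If you'd like to verify the schema yourself before confirming it in the Engine, a quick pandas sketch (using the standard Kaggle column names above) could look like this:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Expected schema: column name -> broad type, mirroring what the Engine asks
# you to confirm. Names follow the standard Kaggle Titanic data dictionary.
expected = {
    "PassengerId": "numeric", "Survived": "numeric", "Pclass": "numeric",
    "Name": "text", "Sex": "text", "Age": "numeric", "SibSp": "numeric",
    "Parch": "numeric", "Ticket": "text", "Fare": "numeric",
    "Cabin": "text", "Embarked": "text",
}

for col, kind in expected.items():
    assert col in df.columns, f"missing column: {col}"
    assert pd.api.types.is_numeric_dtype(df[col]) == (kind == "numeric"), \
        f"type mismatch: {col}"
```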
If you know your target column, select it from the drop-down to help the recommender suggest better actions. In this tutorial we select the column Survived, because that is what we are trying to predict.
*Note: It is common not to have a ready-made target column in your data. This is where you can use recipe actions to create one that suits your problem, as sketched below.
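For instance, if your raw data recorded the outcome as free text rather than a label, deriving a target is a one-line transformation. A hypothetical sketch (the outcome column is invented for illustration):

```python
import pandas as pd

# Hypothetical raw data where the outcome was recorded as text rather than
# as a ready-made label.
df = pd.DataFrame({"outcome": ["survived", "died", "survived"]})

# Derive a binary target column -- the kind of transformation a recipe
# action can perform when no target column exists yet.
df["Survived"] = (df["outcome"] == "survived").astype(int)
```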
Now you can determine what actions to commit for iteration 1.
Once we've added some actions to our recipe, they will be visible in the Data Viewport. If you want to now commit those actions, head to the Recipe tab.
We can now view the insights generated from the changes we made in iteration 1. To view them, navigate back to the "Suggestions" tab in the action dialogue box. When you view the data, the viewport displays a sample of the data (1,000 rows). This allows the platform to preview the results of your actions as quickly as possible. When actions are committed, however, they are carried out on the whole dataset. Each commit represents an iteration.
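This preview-on-sample, commit-on-full pattern is easy to picture in pandas terms (a sketch with an illustrative action, not the Engine's code):

```python
import pandas as pd

df = pd.read_csv("train.csv")

def standardize_fare(d):
    # An example action (chosen for illustration): standardize the Fare column.
    return d.assign(Fare=(d["Fare"] - d["Fare"].mean()) / d["Fare"].std())

preview = standardize_fare(df.head(1000))   # fast feedback on a 1,000-row sample
committed = standardize_fare(df)            # a commit runs on every row
```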
We may decide to proceed with the actions recommended by The Engine by selecting the "plus" icon adjacent to a suggestion (click "see analysis" to understand the reasoning behind the recommendation), or we may want to add actions manually. This process repeats until we are happy with the transformation and/or there are no more recommended actions to commit; it can take several iterations of committing actions in the recipe queue.
*This step is optional. If there are unique transformations you would like to make that aren't suggested by The Engine, or you want to work manually or experiment, simply go to the "Add Action" tab in the action dialogue box and type the name of the action you'd like to add.
When actions in the queue are removed, edited, or added, The Engine will, just as with the recommended actions, generate a preview of the transformed dataset in the Data Viewport, allowing you to see the results of your manually added actions before committing them to the whole dataset.
For more on the actions that could be added to a dataset, view the Actions Catalogue.
If you're satisfied with all the actions you have queued, selecting "Commit Action" applies the transformations to the entire dataset. It also enables the "Finalize & End" button, which saves the dataset and the recipe you used, so the recipe can be reused for future datasets.
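In spirit, finalizing is like persisting the whole pipeline so it can be replayed later. A rough sketch of that idea in plain Python (not how the Engine actually stores recipes):

```python
import pickle

def drop_cabin(df):
    return df.drop(columns=["Cabin"])

recipe = [drop_cabin]

# Persisting the recipe mirrors "Finalize & End": the pipeline is stored so
# it can be replayed later on datasets with a compliant schema.
with open("titanic_recipe.pkl", "wb") as f:
    pickle.dump(recipe, f)
```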
In the next article, we are going to take the prepared dataset from this tutorial and build a machine learning application using the PI.EXCHANGE AI & Analytics Engine. The Engine empowers expert data users to streamline end-to-end machine learning projects, and gives those of us who need a little data science guidance a helping hand through the process.
Go on to Part 3: Building an AI Application
For a recap of Creating your first ML project, read Part 1.
Ready to start your first machine learning project?