This article is tutorial 2 of 4 and will help you use The AI & Analytics Engine for data preparation, a vital step in machine learning.
In the previous tutorial, Tutorial 1/4: Creating Your First Project, we covered how to create your first project in the AI & Analytics Engine. Refer to that article if you need help getting set up.
A familiar concept in machine learning is Garbage In, Garbage Out: the quality of a predictive model depends directly on the quality of the data it is trained on. Getting this step right, and making it repeatable and systematic, is key to running smooth projects. The AI & Analytics Engine simplifies the data preparation, cleaning, and wrangling process for you.
If you need a refresher on data preparation or the basics of machine learning, now is a good time.
For this tutorial, we will be using the Titanic dataset, which is composed of three .csv files; we will use train.csv for this article. The data provides various features for 891 passengers who boarded the Titanic.
Our task: To predict whether someone survived or not.
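Before uploading, it can be handy to sanity-check the file locally. Here is a minimal pandas sketch (assuming train.csv sits in your working directory); this is just for your own inspection and is not part of the Engine workflow:

```python
import pandas as pd

# Load the Kaggle Titanic training split (assumes train.csv is in the
# current working directory).
df = pd.read_csv("train.csv")

print(df.shape)                        # (891, 12): 891 passengers, 12 columns
print(df.dtypes)                       # the types pandas inferred per column
print(df["Survived"].value_counts())   # class balance of our prediction target
```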
Before you import your data into The AI & Analytics Engine, there are some best practices and things to note about your data. PI.EXCHANGE provides recommendations on preparing files for upload.
So, you have signed in. Head to the project page (or refer to the starting-a-project tutorial if you need to), hover over the floating action buttons at the bottom right, and choose "New Dataset".
When presented with an option to choose a method, you simply select where you are importing the data from.
In this example, I am importing from my local drive, so I selected that option, gave my dataset the name titanic_train, and dragged and dropped my file in. After the data is uploaded to the platform, the process of data preparation begins.
When a dataset is first imported, the option is given to create a new data wrangling recipe.
A recipe is a reproducible pipeline of data transformations; each transformation is called an action. Recipes are built one iteration at a time: you queue up actions and commit them, and each commit ends the current iteration. The Engine then begins a new iteration with further recommended actions. This cycle continues until The Engine can no longer find actionable insights (see step 3.2).
Here is where it gets cool.
If you've created recipes in the past, the reproducible nature of a recipe means that the recipe can be applied to a new dataset as long as it has a compliant schema (more on schema below!). This can radically improve the efficiency of your data-wrangling workflow.
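To make "reproducible pipeline" concrete: you can picture a recipe as an ordered list of transformation functions that can be replayed on any schema-compatible dataset. A hypothetical Python sketch (an analogy only, not the Engine's actual implementation):

```python
import pandas as pd

# A "recipe" viewed as an ordered list of actions, where each action is a
# function that takes a DataFrame and returns a transformed DataFrame.
def drop_cabin(df):
    return df.drop(columns=["Cabin"])   # a mostly-missing column

def fill_age(df):
    return df.assign(Age=df["Age"].fillna(df["Age"].median()))

recipe = [drop_cabin, fill_age]

def apply_recipe(df, recipe):
    for action in recipe:
        df = action(df)
    return df

train = apply_recipe(pd.read_csv("train.csv"), recipe)
# Because the recipe is independent of any one file, it can be replayed on
# any dataset with a compliant schema, e.g. the test split:
test = apply_recipe(pd.read_csv("test.csv"), recipe)
```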
For this tutorial, we are going to create a new data wrangling recipe by selecting "Create a new data wrangling recipe". Then, enter a new name or confirm the provided name of your wrangling recipe. Select "Create".
In the first iteration, the recommended action is to edit or confirm the dataset schema. The schema is meta-information about the dataset: the name and type (numeric, text, date, etc.) of each column. To understand how to proceed, it is best to understand the Data Wrangling Window.
The data wrangling window is divided into 3 sections:
Our Titanic training dataset is made up of 12 columns.
Columns Explained
The 12 columns in train.csv are: PassengerId, Survived (0 = No, 1 = Yes), Pclass (ticket class), Name, Sex, Age, SibSp (number of siblings or spouses aboard), Parch (number of parents or children aboard), Ticket, Fare, Cabin, and Embarked (port of embarkation).
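If you'd like to verify the schema yourself before confirming it in the Engine, a quick pandas sketch (using the standard Kaggle column names above) could look like this:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Expected schema: column name -> broad type, mirroring what the Engine asks
# you to confirm. Names follow the standard Kaggle Titanic data dictionary.
expected = {
    "PassengerId": "numeric", "Survived": "numeric", "Pclass": "numeric",
    "Name": "text", "Sex": "text", "Age": "numeric", "SibSp": "numeric",
    "Parch": "numeric", "Ticket": "text", "Fare": "numeric",
    "Cabin": "text", "Embarked": "text",
}

for col, kind in expected.items():
    assert col in df.columns, f"missing column: {col}"
    assert pd.api.types.is_numeric_dtype(df[col]) == (kind == "numeric"), \
        f"type mismatch: {col}"
```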
If you know your target column, select it from the drop-down to help the recommender suggest better actions. In this tutorial we select the column Survived, because that is what we are trying to predict.
*Note: It is common not to have a ready-made target column in your data. This is where you can use recipe actions to create one that suits your problem, as sketched below.
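For instance, if your raw data recorded the outcome as free text rather than a label, deriving a target is a one-line transformation. A hypothetical sketch (the outcome column is invented for illustration):

```python
import pandas as pd

# Hypothetical raw data where the outcome was recorded as text rather than
# as a ready-made label.
df = pd.DataFrame({"outcome": ["survived", "died", "survived"]})

# Derive a binary target column -- the kind of transformation a recipe
# action can perform when no target column exists yet.
df["Survived"] = (df["outcome"] == "survived").astype(int)
```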
Now you can determine what actions to commit for iteration 1.
Once we've added some actions to our recipe, they will be visible in the Data Viewport. If you want to now commit those actions, head to the Recipe tab.
We can now view the insights generated from the changes we made in iteration 1. To view them, navigate back to the "Suggestions" tab in the action dialogue box. When you view the data, the viewport displays a sample of the data (1,000 rows). This allows the platform to preview the results of your actions as quickly as possible. When actions are committed, however, they are carried out on the whole dataset. Each commit represents an iteration.
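This preview-on-sample, commit-on-full pattern is easy to picture in pandas terms (a sketch with an illustrative action, not the Engine's code):

```python
import pandas as pd

df = pd.read_csv("train.csv")

def standardize_fare(d):
    # An example action (chosen for illustration): standardize the Fare column.
    return d.assign(Fare=(d["Fare"] - d["Fare"].mean()) / d["Fare"].std())

preview = standardize_fare(df.head(1000))   # fast feedback on a 1,000-row sample
committed = standardize_fare(df)            # a commit runs on every row
```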
We may decide to proceed with the actions recommended by The Engine by selecting the "plus" icon adjacent to a suggestion (click "see analysis" to understand the reasoning behind the recommendation), or we may want to add actions manually. This process repeats until we are happy with the transformation and/or there are no more recommended actions to commit; it can take several iterations of committing actions in the recipe queue.
*This step is optional. If there are unique transformations you would like to make that aren't suggested by The Engine, or you want to work manually or experiment, simply go to the "Add Action" tab in the action dialogue box and type the name of the action you'd like to add.
When actions in the queue are removed, edited, or added, The Engine will, just as with the recommended actions, generate a preview of the transformed dataset in the Data Viewport, allowing you to see the results of your manually added actions before committing them to the whole dataset.
For more on the actions that could be added to a dataset, view the Actions Catalogue.
If you're satisfied with all the actions you have queued, selecting "Commit Action" applies the transformations to the entire dataset. It also enables the "Finalize & End" button, which saves the dataset and the recipe you used, so the recipe can be reused for future datasets.
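In spirit, finalizing is like persisting the whole pipeline so it can be replayed later. A rough sketch of that idea in plain Python (not how the Engine actually stores recipes):

```python
import pickle

def drop_cabin(df):
    return df.drop(columns=["Cabin"])

recipe = [drop_cabin]

# Persisting the recipe mirrors "Finalize & End": the pipeline is stored so
# it can be replayed later on datasets with a compliant schema.
with open("titanic_recipe.pkl", "wb") as f:
    pickle.dump(recipe, f)
```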
In the next article, we are going to take the prepared dataset from this tutorial and build a machine learning application using the PI.EXCHANGE AI & Analytics Engine. The Engine empowers expert data users to streamline end-to-end machine learning projects, and gives those of us who need a little data science guidance a helping hand through the process.
Go on to Part 3: Building an AI Application
For a recap of Creating your first ML project, read Part 1.
Ready to start your first machine learning project?