PI.EXCHANGE | Blog

Easy Data Import from Multi-Sheet Excel Files

Written by Samuel Kolpinizki | Nov 25, 2021 2:00:00 AM

It is very common for businesses to store tabular data as excel files. This is often a result of the business' structure, legacy data storage methods, and the popularity and convenience of spreadsheet software which itself, makes the Excel format the preferred default way of storage for many businesses.

While there are clear advantages for storing data in Excel format, it is quite challenging to rely on it for advanced analytics applications and predictive models. It is tricky to import a particular table of interest from an Excel file for analysis in a third-party application since there can be multiple sheets within a file and multiple tables within a sheet. Also, the format is prone to messy data. For example, it is possible to store a date, a number, and a string in the same column.

These issues can be easily tackled by the AI & Analytics Engine. The latest release of Engine brings about enhanced data import capabilities by providing a user-friendly UI to handle different data formats, including Excel spreadsheets. It also offers advanced options to modify data import configurations for greater flexibility, control, and handle of many types of data formats, including Excel spreadsheets.

After the data import, if it is necessary to prepare the data further for modeling, the user can choose to enter a guided data wrangling process. In this process, a list of repeatable cleaning steps (recipe), which prepare the data for modeling, is created, thereby, tackling the main issues with Excel files.

In this blog, we will demonstrate how to import tables from Excel files using the new data import capabilities of the AI & Analytics Engine. 

Excel Data Import: The Traditional Way

To demonstrate, we took some data from Kaggle. It contains simulated data from a fictional retail company and includes information about the number of goods sold, the associated turnover, invested working hours, and the sales area.

Let’s view a small sample of our data:

We have an Excel file with 2 sheets and we’ll focus on the second one. In the second sheet, the header structure is rather complex: It contains multiple levels and a blank column (Column Q). Also, we have 2 tables within the same file.

The “traditional” method of importing a file would require a data scientist to manually specify the header rows, missing values, sheet names, rename duplicated column names and verify the data types of all columns, and correct if required (and we have quite a few columns).

This process requires writing code, manually renaming columns, validating data types, etc. This is a labor-intensive task prone to errors.

To get started with no code machine learning, check out this article!

Excel Data Import: Using the AI & Analytics Engine

The AI & Analytics Engine supplies an accessible and user-friendly environment where users can import data from various sources.

Once the source is selected, the user is guided through different steps in the data import process: data preview inspection, specifying the cell range, custom configuration, and schema adjustments according to the data type.

Using the data import UI helps resolve many of the issues mentioned above in general, and specifically with the data mentioned above. We will now demonstrate this for the same multi-sheet Excel file presented in the previous section.

The workflow of the platform requires the user to be a part of an organization. Within the organization, the user creates a project. Within the project, the user can import datasets by opening the data import wizard:

For our case, we choose the File Import option. Then, we can choose a local file to upload:

Once the file has been uploaded, the user proceeds to the Configure part where the Engine analyses a small portion of the data and shows a quick preview of the first few rows:

The user can inspect the data within the sheets, column names, column types and decide if the full data is needed, or only a subset of the existing sheets or tables within the sheets.

In order to demonstrate the usefulness of the data import wizard, we will focus on the opening_schemes sheet, which is “trickier” to handle. To do so, we change our selection to include only that sheet and select the Show Configurations button to allow for some filtering.

Doing so results in this:

Inspecting the data, we see that due to the irregular structure (See figure below), columns A to D are unnamed, and the first row is partially empty (until column E). The columns are unnamed since the headers are actually in the 3rd row of the raw spreadsheet, due to the arrangement of the Excel file table.

The yellow area is the actual table structure, which is irregular. This table has “Multi-index” column headers.

Scrolling a little to the right on the preview window of our platform, we notice that the columns headers we want are actually in the second row

However, starting from column E, these headers are the month numbers, and without the information in the first row, we can’t associate them to the corresponding year.

We can resolve that using the Engine’s UI in 2 steps:

1. In the range selection, we specify the area in the excel sheet containing the “actual” table using a cell range. In our case, from A3:AC53 and select apply:

We can see the updated preview showing that now the column headers are “normalized”. However, we still need to handle the column names and types.

2. Pressing “NEXT”, we proceed to the schema and column names menu:

In this menu, we see that the platform detected the column types automatically. Additionally, it renamed duplicated/”invalid” column names. (invalid - standard column names do not start with numeric literals, contain whitespaces, etc.).

We know that columns 10, 11, 12, 1, 2, 3, ... represent months of different years, so we rename them accordingly to get:

Finally, we create the dataset.

We can now inspect it and go over the analysis of the columns and start getting insights towards the next steps needed for the data preparation and predictive model generation, which will be described in a subsequent article.

Wrap-Up:

Using the AI & Analytics Engine import wizard, with just a few steps, we uploaded multi-sheet Excel datasets to the platform.

We have done so without writing any code and managed to easily resolve various data issues along the way such as:

  • Auto-schema analysis;

  • De-duplication of column names; 

  • Specifying required cell ranges of the target rows/columns.

 

Have any questions? We're happy to answer them. Get in touch with us!