PI.EXCHANGE | Blog

AWS Glue DataBrew, how does it compare to the AI & Analytics Engine’s Smart Data Preparation? 

Written by Dr. Ramanan Subramanian | Aug 13, 2021 3:05:49 AM
Data preparation. Love it or hate it, one thing is certain — it can chew up hours...

Blowing out project timelines and adding a little grey to your hair along the way. It is no surprise, then, that Amazon Web Services (AWS) recently announced the release of its new data preparation tool, AWS Glue DataBrew. We wanted to provide a comparative look at this new option in the market and identify its similarities to and differences from the AI & Analytics Engine’s Smart Data Preparation feature.

So if you are looking for an AutoML tool to help claw back time from data preparation or data wrangling, read on...

First, a little background on both options. 

 

DataBrew

AWS’s new data preparation tool belongs to the category of no-code, easy-to-use visual data-preparation engines. It is primarily intended for cleaning, normalizing, and profiling data, as well as automating recipe jobs. It can be used as a standalone tool for data preparation, but it can also be integrated with the AWS ecosystem of tools and services, such as S3 or other AWS data lakes and databases, for storage, import, and export of unprepared and prepared data.

We decided to try out the platform to develop this article. Below is a screenshot of the AWS Glue DataBrew graphical user interface (GUI).

 

Smart Data Preparation (The AI & Analytics Engine)

Within the AI & Analytics Engine, Smart Data Preparation is a fully interactive feature that lets users prepare their data at scale in a flexible manner. 

The premise is that the user is guided by smart recommendations during the recipe-creation process. An action catalogue offers a variety of “actions” (data transformation steps) and lets users fully customize and edit their data-preparation recipes. The feature covers the four stages of data preparation commonly required in analytics and machine learning tasks:

  1. Cleaning
  2. Structuring
  3. Enriching 
  4. Feature engineering

Rather than a standalone tool, the Smart Data Preparation feature is a tightly integrated functionality within the end-to-end user journey on the AI & Analytics Engine platform. It sits between the data import and app creation phases of the journey in a unified graphical user interface (GUI).

                                        

Walkthrough of Smart Data Preparation and model training

1. Import Data

2. Smart Data Preparation

  1. Examine recommended actions, analysis
  2. Customize and commit the actions
  3. Finalize recipe

3. Inspect the Statistical Profile of the Finalized Dataset

4. Create app (Target variable selection, train/test split)

5. Select features

6. Select models to train

 

7. Deploy trained models

Key Benefits: It's what you DON’T need to do

Smart Data Preparation on the AI & Analytics Engine provides ease of use through the above seamless interface. In particular, there is:

    • No need to manually generate access tokens and turn on/off secure access of data between different tools
    • No need to manually set up S3 buckets or migrate all of your data into a single ecosystem such as AWS
    • No need to write scripts to manually orchestrate different components of the end-to-end pipeline

 

Similarities

The main similarity between the two tools is that they both cater to the need for an easy-to-use no-code interactive data preparation tool. Let's take a closer look: 

Data Ingestion and Outcome

Both tools are targeted at tabular data files in the following formats: CSV, JSON (lines), and Parquet.

DataBrew allows the import of data mainly from data lakes on the AWS cloud, such as S3, Redshift, and RDS. Smart Data Preparation (AI & Analytics Engine) offers a more diverse set of options for importing data, such as HTTP (URLs) and SQL and NoSQL databases. Users can still upload data from cloud storage services such as S3 or GCS (Google Cloud Storage) by getting a pre-signed URL for their dataset and using the HTTP option.

The outcome of the recipe-building process is a re-runnable recipe that can be re-used to transform a larger dataset of the same input schema.
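
To make the idea of a re-runnable recipe concrete, here is a minimal Python sketch. It is not the API of either product: the step builders (`drop_missing`, `rename`) and the `run_recipe` helper are hypothetical names chosen for illustration, and rows are modelled as plain dictionaries.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]

def drop_missing(column: str) -> Step:
    """Hypothetical step: remove rows where `column` is missing."""
    def step(rows: List[Row]) -> List[Row]:
        return [r for r in rows if r.get(column) is not None]
    return step

def rename(old: str, new: str) -> Step:
    """Hypothetical step: rename column `old` to `new`."""
    def step(rows: List[Row]) -> List[Row]:
        return [{(new if k == old else k): v for k, v in r.items()} for r in rows]
    return step

def run_recipe(recipe: List[Step], rows: List[Row]) -> List[Row]:
    """Apply every step in order; the recipe itself is never modified."""
    for step in recipe:
        rows = step(rows)
    return rows

# The recipe is built once against a small sample...
recipe = [drop_missing("age"), rename("age", "age_years")]
sample = [{"id": 1, "age": 30}, {"id": 2, "age": None}]
# ...and re-run later on a larger batch with the same input schema.
larger_batch = sample + [{"id": 3, "age": 41}]

print(run_recipe(recipe, sample))
print(run_recipe(recipe, larger_batch))
```

Because the recipe is just an ordered list of steps, the same object can be applied to any dataset that shares the input schema.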

Interactive Recipe Building and Recommender

Both AWS Glue DataBrew and the Smart Data Preparation feature have an interactive recipe-building user interface. There are many similarities between the two:

  1. Quick preview of the dataset being prepared
  2. The list of actions (steps) selected so far in the recipe
  3. The ability to edit or delete a step in the recipe

Validation of Recipe

Whenever the list of actions in a recipe is modified, a validation check is applied to ensure that the recipe is legitimate. The checks include, among others:

    • Input columns are available in the schema and are of the correct data types
    • The combination of parameters is valid
    • Names of new columns output by an action are not colliding with existing column names
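
The checks listed above can be sketched in a few lines of Python. This is an illustrative assumption, not how either platform implements validation: the action format (a dictionary with `name`, `inputs`, and `outputs`) and the type labels are invented for the example, and the parameter-combination check is omitted because it depends on each action's definition.

```python
from typing import Dict, List

def validate_recipe(schema: Dict[str, str], recipe: List[dict]) -> List[str]:
    """Return a list of error messages; an empty list means the recipe passed.

    schema maps column name -> type label. Each action is a dict with
    'name', 'inputs' (column -> required type) and 'outputs'
    (new column -> type). This format is hypothetical.
    """
    errors = []
    current = dict(schema)  # the schema evolves as actions add columns
    for i, action in enumerate(recipe):
        label = f"step {i} ({action['name']})"
        # Check 1: input columns exist and have the required type.
        for col, required in action.get("inputs", {}).items():
            if col not in current:
                errors.append(f"{label}: missing input column '{col}'")
            elif current[col] != required:
                errors.append(f"{label}: '{col}' is {current[col]}, expected {required}")
        # Check 2: new output columns must not collide with existing names.
        for col, out_type in action.get("outputs", {}).items():
            if col in current:
                errors.append(f"{label}: output '{col}' collides with an existing column")
            else:
                current[col] = out_type
    return errors

schema = {"price": "numeric", "city": "text"}
recipe = [{"name": "log", "inputs": {"price": "numeric"}, "outputs": {"log_price": "numeric"}}]
print(validate_recipe(schema, recipe))
```

Re-running a function like this after every edit to the recipe mirrors the "validate on modification" behaviour described above.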

 

Differences

Diversity of the action catalog

DataBrew’s official “Recipe actions reference” documentation lists about 170 actions in its catalog. By comparison, our Smart Data Preparation feature supports 85 actions plus 81 formula functions.

The coverage areas of the actions also differ between the two platforms, as shown in the table below:

Column group as output to complex actions

Some actions, such as “pivot” and “Extract PCA components”, result in a “column group” rather than a single column. A column group serves as a placeholder in the schema wherein one or more component columns can be generated by a recipe action. This enables:

    • Better understandability, since the user is aware that the individual components within the group are generated by a single action.
    • Universal validity of the recipe for all future batches of data, since running actions of the aforementioned types on different batches of data can lead to a different number of individual components. For example, if a “pivot” action is run on a different batch of data, the number of columns produced in the output can differ.

The column groups themselves as a whole can be input to other actions if appropriate. Finally, the concept of a column group is flexible, as there are actions in the recipe that allow the user to take a particular component in a column group and treat it just like any other column if they desire.
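
The following Python sketch illustrates why a column group keeps the schema stable across batches. It is a simplified model under stated assumptions, not the Engine's implementation: the `pivot` and `extract_component` functions are hypothetical, and a column group is represented as a nested dictionary stored under a single name.

```python
from typing import Dict, List

def pivot(rows: List[dict], index: str, key: str, value: str, group: str) -> List[dict]:
    """Hypothetical pivot: one output row per `index` value, with the summed
    `value` per distinct `key` stored inside a single column group."""
    keys = sorted({r[key] for r in rows})
    table: Dict[object, dict] = {}
    for r in rows:
        cells = table.setdefault(r[index], dict.fromkeys(keys, 0))
        cells[r[key]] += r[value]
    return [{index: i, group: cells} for i, cells in table.items()]

def extract_component(rows: List[dict], group: str, component: str, new_name: str) -> List[dict]:
    """Promote one component of a column group to an ordinary column."""
    return [{**r, new_name: r[group].get(component)} for r in rows]

batch_a = [
    {"store": "A", "month": "jan", "sales": 10},
    {"store": "A", "month": "feb", "sales": 5},
]
batch_b = batch_a + [{"store": "A", "month": "mar", "sales": 7}]

out_a = pivot(batch_a, "store", "month", "sales", "sales_by_month")
out_b = pivot(batch_b, "store", "month", "sales", "sales_by_month")

# Top-level schema is identical for both batches; only the number of
# components inside the group differs.
print(sorted(out_a[0]), len(out_a[0]["sales_by_month"]))
print(sorted(out_b[0]), len(out_b[0]["sales_by_month"]))
```

Downstream steps that refer to `sales_by_month` as a whole stay valid for both batches, while `extract_component` shows how a single component can still be treated like any other column.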

 

Advanced column selectors

Our Smart Data Preparation feature lets users apply a transformation to multiple columns with a single action. To support this, our graphical user interface and API provide the following modes for including or excluding input columns, column groups, or both in the selection:

    • By Name: list the column names explicitly
    • By Type: select by schema type
    • By Pattern: select columns whose names match a regex pattern

The user can also combine multiple such criteria with and/or operators. This gives users the flexibility to specify complex selection criteria such as “columns matching the name pattern ‘x_.*’, excluding non-numeric columns.”
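
As a rough illustration of how such selectors compose, here is a small Python sketch. The function names (`by_name`, `by_type`, `by_pattern`, `all_of`, `any_of`, `exclude`) and the type labels are assumptions made for this example and are not the Engine's actual API.

```python
import re
from typing import Callable, Dict, List

Schema = Dict[str, str]                # column name -> type label (assumed labels)
Selector = Callable[[str, str], bool]  # (name, type) -> is the column selected?

def by_name(*names: str) -> Selector:
    return lambda name, typ: name in names

def by_type(*types: str) -> Selector:
    return lambda name, typ: typ in types

def by_pattern(pattern: str) -> Selector:
    rx = re.compile(pattern)
    return lambda name, typ: rx.fullmatch(name) is not None

def all_of(*selectors: Selector) -> Selector:   # the "and" operator
    return lambda name, typ: all(s(name, typ) for s in selectors)

def any_of(*selectors: Selector) -> Selector:   # the "or" operator
    return lambda name, typ: any(s(name, typ) for s in selectors)

def exclude(selector: Selector) -> Selector:
    return lambda name, typ: not selector(name, typ)

def select(schema: Schema, selector: Selector) -> List[str]:
    """Return the names of the selected columns, in schema order."""
    return [name for name, typ in schema.items() if selector(name, typ)]

schema = {"x_1": "numeric", "x_2": "text", "y": "numeric", "x_3": "numeric"}

# "Columns matching the name pattern 'x_.*', excluding non-numeric columns."
numeric_x = all_of(by_pattern(r"x_.*"), exclude(by_type("text", "datetime")))
print(select(schema, numeric_x))
```

Because every selector has the same shape, a single transformation can be pointed at the result of `select` regardless of how many criteria were combined to produce it.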

Queuing and Committing actions to a recipe

In the AI & Analytics Engine, whenever an action is added, it is “queued” to be committed to the recipe. As in DataBrew, the queued actions are run on a fixed-size sample (the first 5k rows) of the full dataset, and the result is displayed as a preview. This gives users visual feedback, allowing them to adjust the configuration of their action until the preview shows the result they want.

Our platform also lets the user “commit” the queued actions to a recipe before continuing to edit the recipe further. Committing actions signals the platform to run the recipe actions on the full dataset (rather than on the sample) and then show a renewed sample preview. This gives a more accurate data preview than one where the sample is first drawn from the raw data and the recipe actions are then applied to it.

Committing actions to a recipe also allows the platform to run intelligent algorithms on the fully processed data and provide users with good recommendations for the next set of actions in their recipe.
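
The difference between previewing queued actions and committing them can be sketched as follows. This is a toy model under stated assumptions: the `RecipeSession` class and its methods are hypothetical, the dataset is an in-memory list, and an action is any function that maps a list of rows to a list of rows.

```python
from typing import Callable, List

Action = Callable[[list], list]

class RecipeSession:
    """Hypothetical session object; not the Engine's actual API."""

    def __init__(self, dataset: list, sample_size: int = 5000):
        self.sample_size = sample_size
        self.committed: List[Action] = []
        self.queued: List[Action] = []
        self._processed = list(dataset)  # full data after committed actions

    def queue(self, action: Action) -> None:
        self.queued.append(action)

    def preview(self) -> list:
        """Run only the queued actions on a fixed-size sample of the data
        produced by the committed part of the recipe."""
        rows = self._processed[: self.sample_size]
        for action in self.queued:
            rows = action(rows)
        return rows

    def commit(self) -> None:
        """Run the queued actions on the full dataset, then clear the queue."""
        for action in self.queued:
            self._processed = action(self._processed)
        self.committed += self.queued
        self.queued = []

def keep_even(rows: list) -> list:
    return [r for r in rows if r % 2 == 0]

session = RecipeSession(list(range(10)), sample_size=4)
session.queue(keep_even)
before = session.preview()   # filter applied to the first 4 raw rows only
session.commit()
after = session.preview()    # first 4 rows of the fully filtered dataset
print(before, after)
```

In this toy run the pre-commit preview shrinks because the filter is applied to the raw sample, whereas the post-commit preview is a full-size sample of the processed data, which is the accuracy gain described above.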


Recommendations

In DataBrew, recommendations are generated on a per-column basis and are not available by default. The user needs to click on a particular column and request recommendations for that column.

The AI & Analytics Engine’s Smart Data Preparation also provides recommendations of the next set of actions likely to be helpful to the user. 

The key differences are that: 

  1. These recommendations are generated automatically the first time the user starts a recipe and after every commit. 

  2. The recommendations are based on a sample of the entire dataset rather than a single column. This makes them quite attractive for users who want to detect, for example, input columns that correlate too weakly with the target variable and need to be dropped.

  3. Every recommendation is accompanied by reasons why these actions are useful, showing charts and summary statistics to help the user understand and scrutinize their data. This is particularly useful for users without data science expertise who want to access the benefits of AI/ML.
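
As one example of a dataset-level recommendation with an attached reason, the sketch below flags columns whose correlation with the target is weak. It only illustrates the idea: the Engine's actual recommendation algorithms are not described in this article, the `recommend_drops` function and the 0.1 threshold are assumptions, and all columns are assumed to be numeric.

```python
from math import sqrt
from typing import Dict, List, Tuple

def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for constant inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def recommend_drops(sample: List[Dict[str, float]], target: str,
                    threshold: float = 0.1) -> List[Tuple[str, str]]:
    """Return (column, reason) pairs for columns weakly correlated with target.

    The threshold is an arbitrary value chosen for illustration.
    """
    ys = [row[target] for row in sample]
    recommendations = []
    for col in sample[0]:
        if col == target:
            continue
        r = pearson([row[col] for row in sample], ys)
        if abs(r) < threshold:
            reason = f"|correlation with {target}| = {abs(r):.2f} is below {threshold}"
            recommendations.append((col, reason))
    return recommendations

sample = [
    {"a": 1, "noise": 1, "y": 2},
    {"a": 2, "noise": -1, "y": 4},
    {"a": 3, "noise": -1, "y": 6},
    {"a": 4, "noise": 1, "y": 8},
]
for column, reason in recommend_drops(sample, target="y"):
    print(f"Consider dropping '{column}': {reason}")
```

Note that the check needs the target column and every candidate column together, which is why a recommendation like this cannot be produced by looking at one column in isolation.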
 

Pricing

The pricing structures of the two options are very different. Keep in mind that you purchase DataBrew as a single tool and the AI & Analytics Engine as an end-to-end toolchain.

AWS Glue DataBrew: Pricing is calculated at an hourly rate, billed per second, with additional costs based on tasks and region. The cost depends heavily on how you use the platform, so it is best to check the pricing page: https://aws.amazon.com/glue/pricing/. You can also use their calculator.

"For the AWS Glue Data Catalog, you pay a simple monthly fee for storing and accessing the metadata. The first million objects stored are free, and the first million accesses are free. If you provision a development endpoint to interactively develop your ETL code, you pay an hourly rate, billed per second." 

(https://aws.amazon.com/glue/pricing/)

The AI & Analytics Engine: There are four subscription tiers, catering to everyone from individual data users through to enterprises. A two-week free trial of the AI & Analytics Engine is currently available. For more information, you can check out PI.EXCHANGE’s AI & Analytics Engine pricing. Prices start from USD $129 per month.

 

Wrapping up

If you are after a tool to speed up the data preparation stage of the data science process, both options will help. However, there are differences to consider that may mean one option fits your needs better than the other. The key differences are:

  1. Utility of an integrated tool-chain: Smart Data Preparation is an integrated feature. This means that, unlike with DataBrew, you can seamlessly prepare, build, and deploy. Those who have prepared their data for downstream ML purposes can jump straight into the next step.

  2. Diverse options for data import: Smart Data Preparation has a diverse set of options for importing data, with DataBrew mainly allowing the import of data from data lakes on the AWS cloud such as S3, Redshift, and RDS.  If you store your data with AWS this is not an issue. 

  3. Similar number of actions, different coverage: DataBrew has slightly more actions at 171, whilst Smart Data Preparation supports 85 actions plus 81 formula functions (166 in total). So, understanding which types of actions are useful to you, given your data and the task at hand, is key.

  4. Advanced column selectors: Smart Data Preparation offers the ability to apply a transformation to multiple columns with a single action. DataBrew does not. This matters when working with large datasets that have many columns.

  5. Advanced recommendations: Whilst DataBrew requires users to click on a particular column and request recommended actions, Smart Data Preparation recommends the next set of actions likely to be helpful to the user. These recommendations are generated automatically when the user starts a recipe and after every commit. They are based on a sample of the entire dataset rather than a single column (as in DataBrew). Every recommendation is accompanied by reasons why the actions are useful, with charts and summary statistics. The benefit is a deeper understanding of your data and of why the platform has made its recommendations.

Not sure where to start with machine learning? Reach out to us with your business problem, and we’ll get in touch to discuss how the Engine can help you specifically.