This article outlines the concept of a feature type.
ML-ready datasets contain feature columns (plus a target column, if the problem type is classification or regression). Each row contains the feature values for one particular instance. For example, a real-estate dataset may contain information about suburbs. Such a table could look like this:
Suburb_id | Suburb_name | Suburb_established_date | Population | Most_common_zone | Number_of_primary_schools |
g4uuRG | Magic Unicorn | 01/01/1983 | 10000 | Residential | 10 |
npVdob | Gold Frog | 05/02/1995 | 5000 | Commercial | 0 |
When stored digitally, every column has a data type, which is the underlying data structure used to store the values of that column. When importing datasets, the AI & Analytics Engine automatically chooses the most suitable data type for each column.
🎓 For more information about data types, read What are column datatypes in the Engine.
However, apart from the storage type, each feature column in an ML-ready dataset also has a “meaning”.
For example, a column of post codes cannot be interpreted as a numerical quantity that can be added and multiplied, even though it is stored with a Numeric data type. Similarly, colors may sometimes be represented by numbers according to a coding convention.
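The gap between storage type and meaning can be illustrated with a minimal Python sketch. The post code values below are hypothetical and the snippet does not use the Engine itself; it only shows that arithmetic on such a column runs without error yet produces nothing meaningful, whereas treating the values as labels does.

```python
from collections import Counter
from statistics import mean

# Hypothetical post codes, stored as integers (a numeric data type).
post_codes = [3000, 3000, 3141, 2000, 3141, 3000]

# Arithmetic works on the stored integers, but the result is not a real
# post code and carries no meaning as a quantity.
average_code = mean(post_codes)

# Treating the values as labels is meaningful: count rows per post code.
counts = Counter(str(code) for code in post_codes)
most_common_code, occurrences = counts.most_common(1)[0]

print("Average of post codes (meaningless):", average_code)
print("Most frequent post code:", most_common_code, "seen", occurrences, "times")
```

Here the average does not correspond to any post code in the data, while the per-label counts answer a sensible question about the dataset.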
The feature type of a feature column in an ML-ready dataset refers to its meaning in the dataset and dictates how the column is to be treated and interpreted in an App (machine-learning task).
For example, it determines how the features are to be preprocessed to create the dataset ready for training a model.
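As a rough illustration of how a feature type can drive preprocessing, the sketch below maps each feature type to a different transformation. The function names and the chosen transformations (standardization, one-hot encoding, whitespace tokenization) are assumptions made for this example; they are common choices, not a description of what the Engine actually does internally.

```python
from statistics import mean, pstdev

def preprocess_numeric(values):
    """Standardize values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def preprocess_categorical(values):
    """One-hot encode: one 0/1 indicator per distinct category (sorted)."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

def preprocess_text(values):
    """Lowercase each document and split it into whitespace tokens."""
    return [v.lower().split() for v in values]

# Hypothetical dispatch table keyed by feature type name.
PREPROCESSORS = {
    "Numeric": preprocess_numeric,
    "Categorical": preprocess_categorical,
    "Text": preprocess_text,
}

def preprocess(values, feature_type):
    """Apply the transformation registered for the given feature type."""
    return PREPROCESSORS[feature_type](values)

population = preprocess([10000, 5000], "Numeric")
zones = preprocess(["Residential", "Commercial"], "Categorical")
names = preprocess(["Magic Unicorn", "Gold Frog"], "Text")
```

The same raw column would be transformed differently depending on the feature type assigned to it, which is why choosing the right feature type matters before training.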
Feature types in the AI & Analytics Engine
The AI & Analytics Engine supports 3 feature types:
- Numeric: Columns that represent numerical quantities, such as sale_price or student_height.
- Text: Columns that represent textual documents, usually containing many unique words that form sentences, such as tweet_content, article_title, or injury_description.
- Categorical: Columns that represent a value out of a limited set of possible values, such as Shirt size (['S', 'M', 'L', 'XL', 'XXL']), Transaction type (['credit', 'debit']), or Laptop RAM size (['8GB', '16GB', '32GB']).
The AI & Analytics Engine infers feature types automatically, and users can change them manually based on their domain knowledge. Returning to the example above:
- Most_common_zone has a data type of Text. However, the Engine will infer its feature type as Categorical, since it has only a small number of unique values.
- Assume that the Population column is “dirty” (example values could be: [1000, 500, 2000, E, 2500, 3281, 1;381]). The Engine might infer the column as a Text column. However, before training a model, users may manually change the feature type to Numeric.
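The two cases above can be mimicked with a simple rule-based sketch. This is not the Engine's inference method (the Engine uses Generative AI for its suggestions); the function names, the thresholds, and the choice to turn unparseable entries into None are all assumptions made for illustration.

```python
def to_number(raw):
    """Return raw as a float, or None if it cannot be parsed as a number."""
    try:
        return float(str(raw).strip())
    except ValueError:
        return None

def infer_feature_type(values, max_unique_share=0.5):
    """Guess a feature type using simple, hypothetical rules.

    - Numeric if every value parses as a number.
    - Categorical if at most max_unique_share of the rows are distinct.
    - Text otherwise.
    """
    if not values:
        raise ValueError("cannot infer a feature type from an empty column")
    if all(to_number(v) is not None for v in values):
        return "Numeric"
    unique_share = len(set(map(str, values))) / len(values)
    if unique_share <= max_unique_share:
        return "Categorical"
    return "Text"

# Hypothetical column contents based on the examples in this article.
zones = ["Residential", "Commercial", "Residential",
         "Residential", "Industrial", "Commercial"]
population = [1000, 500, 2000, "E", 2500, 3281, "1;381"]

zone_type = infer_feature_type(zones)            # few distinct values
population_type = infer_feature_type(population)  # dirty entries block Numeric

# What a manual override to Numeric could look like under this sketch:
# entries that cannot be parsed become None (missing).
population_as_numeric = [to_number(v) for v in population]

print(zone_type, population_type)
print(population_as_numeric)
```

Under these rules the two dirty entries (E and 1;381) are enough to stop the column from being recognized as numeric, which is the situation where a manual change of feature type based on domain knowledge is useful.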
💡 The Engine uses Generative AI to detect and suggest appropriate feature types for your datasets.