8 April 2018
Machine Learning is a field of computer science that uses statistical techniques to enable a computer system to learn from data and act on what it has learned. Machine Learning applications are not explicitly programmed; their behavior is learned from data. Such an application infers patterns and relationships between the variables in a dataset, then uses this knowledge to predict outcomes for the target dataset.
This section explains the most common terminology used in the context of Machine Learning.
A feature or an independent variable represents an attribute or a property of an observation. Features are also known as dimensions. In a tabular dataset, a row represents an observation and a column represents a feature.
For example, consider the dataset below, which includes fields such as age, profession, city and income.
age  profession  city     income
25   Accountant  Dallas   100000
30   Teacher     Atlanta  60000
35   Doctor      Houston  15000
Each field in this dataset is a feature in the context of Machine Learning, and each row is an observation. Since features are also called dimensions, a dataset with high dimensionality has a large number of features.
There are two types of features: categorical and numerical.
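As a small illustration, the table above can be held in memory as a list of rows. This is only a sketch using plain Python dictionaries; the variable names are chosen for this example and are not part of any library.

```python
# Toy dataset copied from the table above: each dict is one observation (row),
# and each key is one feature (column).
dataset = [
    {"age": 25, "profession": "Accountant", "city": "Dallas", "income": 100000},
    {"age": 30, "profession": "Teacher", "city": "Atlanta", "income": 60000},
    {"age": 35, "profession": "Doctor", "city": "Houston", "income": 15000},
]

feature_names = list(dataset[0].keys())
num_observations = len(dataset)   # number of rows
num_features = len(feature_names)  # number of columns, i.e. the dimensionality

print(f"{num_observations} observations, {num_features} features: {feature_names}")
```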
A categorical feature or variable is a descriptive feature. It can take on one of a fixed number of discrete values. It represents a qualitative value, which is a name or a label. The values of a categorical feature have no ordering.
Some examples are below.
A numerical feature is a quantitative variable that can take on any numerical value. It describes a measurable quantity as a number. The values in a numerical feature have mathematical ordering.
Numerical features can be further classified into discrete and continuous features. A discrete numerical feature can take on only certain values. A continuous numerical feature can take on any value within a finite or infinite interval.
Some examples are given below.
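One rough way to tell the two kinds of feature apart in code is to look at the type of the values. The helper below, `feature_kind`, is a hypothetical name used only for this sketch, and the rule it applies is a heuristic: a number used as a code (a postal code, for instance) is still categorical even though it is stored as a number.

```python
def feature_kind(values):
    """Guess whether a feature is numerical or categorical from its values.

    Heuristic only: treats a column as numerical when every value is an
    int or float (booleans excluded), and as categorical otherwise.
    """
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numerical"
    return "categorical"


# Columns taken from the small dataset shown earlier.
ages = [25, 30, 35]
cities = ["Dallas", "Atlanta", "Houston"]

print("age:", feature_kind(ages))
print("city:", feature_kind(cities))
```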
A label or a dependent variable is the variable that a machine learning algorithm learns to predict.
It can be classified into two categories: categorical and numerical.
A categorical label represents a class or category. If we are developing a Machine Learning application that classifies news articles, the categorical label can be politics, business, sports or any other news section.
A numerical label represents a numerical dependent variable. If we are developing an application for the housing market, one numerical dependent variable can be the house price.
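In code, preparing data usually means separating the label from the features. The sketch below reuses the small dataset from earlier and assumes, purely for illustration, that income is the label to be predicted; the text does not say which field plays that role.

```python
dataset = [
    {"age": 25, "profession": "Accountant", "city": "Dallas", "income": 100000},
    {"age": 30, "profession": "Teacher", "city": "Atlanta", "income": 60000},
    {"age": 35, "profession": "Doctor", "city": "Houston", "income": 15000},
]

LABEL = "income"  # assumption for this example: income is the dependent variable

# Independent variables: every field except the label.
features = [{k: v for k, v in row.items() if k != LABEL} for row in dataset]
# Dependent variable: one numerical label per observation.
labels = [row[LABEL] for row in dataset]

print(features[0])
print(labels)
```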
A model is a mathematical relationship between dependent and independent variables which is used for capturing patterns within a dataset. Once an ML model is developed, it can be used for prediction given some input parameters.
Given the values of the independent variables, it can calculate or predict the value for the dependent variable. An ML algorithm trains a model with data so that this model can predict the label for any new observation.
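To make this concrete, here is a minimal sketch of one very simple kind of model: a straight line fitted by ordinary least squares. The house sizes and prices are made-up numbers for illustration, and `fit_line` and `predict` are hypothetical helper names, not functions from a library.

```python
def fit_line(xs, ys):
    """Train a one-feature linear model y = slope * x + intercept
    using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict the label for a new observation with feature value x."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: house size (independent) and price (dependent).
sizes = [50, 80, 100, 120]
prices = [150000, 210000, 250000, 290000]

model = fit_line(sizes, prices)
print("predicted price for size 90:", predict(model, 90))
```

The training step produces the two numbers that define the line; prediction then only needs those numbers and the feature value of the new observation.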
Training data is the data used by an ML algorithm to train a mathematical model. It is either historical data or another dataset whose values are already known.
Training data can be classified into two categories: labeled and unlabeled.
A labeled dataset is a dataset which has a label for each observation.
An unlabeled dataset does not have a column that can be used as a label.
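The difference can be sketched with the news-article example mentioned earlier. The headlines below are invented for illustration, and `is_labeled` is a hypothetical helper written for this example.

```python
# Invented news-article observations; "section" plays the role of the label.
labeled = [
    {"headline": "Parliament passes budget", "section": "politics"},
    {"headline": "Local team wins final", "section": "sports"},
]

# The same observations with the label column dropped.
unlabeled = [{"headline": row["headline"]} for row in labeled]


def is_labeled(observations, label_name):
    """Return True if every observation carries a value for the label."""
    return all(row.get(label_name) is not None for row in observations)


print(is_labeled(labeled, "section"))
print(is_labeled(unlabeled, "section"))
```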
A test dataset is used for evaluating the predictive performance of a model.
An ML model should not be tested with the training dataset. Generally, 80% of the total dataset is used for training a model and the remaining 20% is used as the test dataset.
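An 80/20 split can be sketched in a few lines of standard-library Python. `split_train_test` is a hypothetical helper written for this example; the fixed seed is an assumption made so that the split is repeatable. Libraries such as scikit-learn provide a ready-made `train_test_split` function for the same purpose.

```python
import random


def split_train_test(observations, test_fraction=0.2, seed=42):
    """Shuffle the observations and split them into training and test sets.

    The original sequence is left untouched; a shuffled copy is split.
    """
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


observations = list(range(10))  # stand-in for ten rows of data
train, test = split_train_test(observations)

print(len(train), "training observations,", len(test), "test observations")
```

Shuffling before splitting avoids a test set made up only of the last rows, which matters when the data is sorted in some way.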
Machine Learning is used in many applications.