Day 12: Dataset In Machine Learning

In #ML, a common task is to study & build algorithms that can learn from data and make predictions on it.
These algorithms make data-driven predictions by building a mathematical model from input data.

So what is an ML Dataset?
#thread https://twitter.com/RealSaintSteven/status/1293275483568775170
A Machine Learning Dataset is the collection of data required to train the model and make predictions.
Such datasets are categorized as Structured or Unstructured, and are obtained through Data Collection, Data Wrangling & Data Exploration.

#Data #ML #DataAnalytics
In Machine Learning, there are three (3) main steps in Data Analysis:

*Data Acquisition.
*Data Wrangling or Data Pre-Processing.
*Data Exploration.

The output of this data analysis is a relevant dataset that can be used to train the model.
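
The three steps above can be sketched in a few lines of Python. This is only an illustration, not a full pipeline: the CSV text, the column names and the helper functions `acquire`, `wrangle` and `explore` are all made up for this example.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would come from a file, API or database.
RAW_CSV = """height_cm,weight_kg,label
170,65,fit
182,,fit
165,72,unfit
abc,80,unfit
175,68,fit
"""

def acquire(text):
    """Data Acquisition: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def wrangle(rows):
    """Data Wrangling: drop rows with missing or non-numeric values and cast types."""
    clean = []
    for row in rows:
        try:
            clean.append({
                "height_cm": float(row["height_cm"]),
                "weight_kg": float(row["weight_kg"]),
                "label": row["label"],
            })
        except ValueError:
            continue  # skip rows that cannot be parsed as numbers
    return clean

def explore(rows):
    """Data Exploration: summarise the cleaned data."""
    return {
        "rows": len(rows),
        "mean_height_cm": statistics.mean(r["height_cm"] for r in rows),
        "mean_weight_kg": statistics.mean(r["weight_kg"] for r in rows),
    }

raw_rows = acquire(RAW_CSV)
dataset = wrangle(raw_rows)
summary = explore(dataset)
print(summary)
```

Here two of the five raw rows are dropped during wrangling (one has a missing weight, one has a non-numeric height), so only three clean rows are left for training.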
Types of Datasets

In Machine Learning, when building a model (i.e. during the learning process), datasets are usually divided into three parts to help detect and limit over-fitting and under-fitting.

We need to split our dataset into:
*Training Dataset
*Validation Dataset
*Test Dataset
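
A minimal sketch of such a split, using only the Python standard library. The 70/15/15 ratio, the seed and the function name `split_dataset` are assumptions for this example, not fixed rules.

```python
import random

def split_dataset(examples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the examples and split them into train, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left goes to the test set
    return train, val, test

examples = list(range(100))  # stand-in for 100 real examples
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

Shuffling before splitting matters when the data is stored in some order (for example sorted by label), otherwise one of the three sets may end up with only one kind of example.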
a. Training Dataset:
The training set is the material from which the computer learns how to process information.

A training dataset is a dataset of examples used during the learning process to fit the parameters (e.g. weights) of the model.
b. Validation Dataset

*This type of dataset is used to detect overfitting and help reduce it
*A validation dataset is a dataset of examples used to tune the hyperparameters (e.g. the architecture) of a classifier.
*The validation set is used to check the output produced by your model while you are still developing it
#Dataset
c. Test Dataset:
*A set of examples used only to assess the performance of a fully-specified classifier.
*It is the sample of data used to provide an unbiased evaluation of a final model fit on the training dataset
*It is used to assess the performance of the final ML Model
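
To see how the three datasets play different roles, here is a small sketch under stated assumptions: the toy data, the one-weight ridge regression model and the list of candidate regularisation strengths are all invented for illustration. The training set fits the weight, the validation set picks the hyperparameter, and the test set is touched only once at the end.

```python
import random

def fit_weight(train, lam):
    """Fit the single weight w of y ~ w * x by ridge regression (closed form)."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)

def mse(w, examples):
    """Mean squared error of the model y = w * x on the given examples."""
    return sum((y - w * x) ** 2 for x, y in examples) / len(examples)

# Hypothetical toy data: y = 3x plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    data.append((x, 3.0 * x + rng.gauss(0, 0.5)))

train, val, test = data[:140], data[140:170], data[170:]

# Validation set: choose the hyperparameter (regularisation strength).
candidate_lams = [0.0, 0.1, 1.0, 10.0, 100.0]
val_errors = {lam: mse(fit_weight(train, lam), val) for lam in candidate_lams}
best_lam = min(val_errors, key=val_errors.get)

# Training set: fit the parameter (weight) with the chosen hyperparameter.
w = fit_weight(train, best_lam)

# Test set: used once, to assess the final model.
test_error = mse(w, test)
print(best_lam, w, test_error)
```

The key design choice is that the test set never influences `best_lam` or `w`; that is what keeps the final evaluation unbiased.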
Types of Data

Let us look at the form of data available in datasets from the point of view of machine learning.

a. Numerical Data
b. Categorical Data
c. Time Series Data
a. Numerical Data:
Any data points that are numbers are referred to as numerical data.

Types of #Numerical Data
*Continuous data are measurements, such as speed, volume & weight.
*Discrete/Count data are numerical data that can be counted, such as the number of units.
b. Categorical Data:
*It is used to represent "non-numerical" characteristics of the data, such as gender, yes/no answers etc

Categorical Data is divided into two:
a. Nominal Data: a type of data used to label variables without providing any quantitative value or order
b. Ordinal Data:
Ordinal values represent discrete and ordered units.
Unlike nominal data, ordinal data can be ordered, but the differences between its values cannot be measured, e.g. Education background.
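
The difference between nominal and ordinal data shows up when you turn them into numbers for a model. Below is a rough sketch: the example values and the helper names `one_hot` and `ordinal_encode` are hypothetical, and real projects would normally use a library encoder instead.

```python
def one_hot(values):
    """Encode nominal values as one-hot vectors (no order implied)."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories

def ordinal_encode(values, order):
    """Encode ordinal values as integer ranks following an explicit order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]

colours = ["red", "green", "blue", "green"]     # nominal: labels only
education = ["primary", "degree", "secondary"]  # ordinal: ordered levels

colour_vectors, colour_categories = one_hot(colours)
education_ranks = ordinal_encode(education, ["primary", "secondary", "degree"])
print(colour_categories, colour_vectors)
print(education_ranks)
```

One-hot encoding avoids suggesting that "red" is bigger than "blue", while the integer ranks keep the order of the education levels without claiming the gaps between them are equal.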

#DataType #MachineLearning #Statistics
#Data #ML #AI #Math #Python
c. Time Series Data
This is a sequence of values recorded over a period of time at regular intervals. It is very important in fields such as the stock market, where we need the price of a stock at regular intervals over time.
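
As a small illustration, a time series can be stored as (date, value) pairs. The prices below are invented for an imaginary stock, and `is_regular` and `daily_changes` are hypothetical helper names for this sketch.

```python
from datetime import date, timedelta

# Hypothetical daily closing prices for an imaginary stock.
start = date(2020, 8, 3)
prices = [100.0, 101.5, 99.0, 102.0, 103.5]
series = [(start + timedelta(days=i), p) for i, p in enumerate(prices)]

def is_regular(series):
    """Check that consecutive observations are separated by the same interval."""
    gaps = {b[0] - a[0] for a, b in zip(series, series[1:])}
    return len(gaps) <= 1

def daily_changes(series):
    """Difference between each price and the previous one."""
    return [b[1] - a[1] for a, b in zip(series, series[1:])]

print(is_regular(series))
print(daily_changes(series))
```

Unlike the other data types, the order of the rows carries information here, so a time series should not be shuffled the way an ordinary dataset is.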
Online Dataset Sources

Let's look at where you can get FREE datasets for your Machine Learning Project

*Google Dataset Search Engine
Link: https://datasetsearch.research.google.com/ 
Google's goal was to unify almost all the available repositories of datasets and make them discoverable.
*Microsoft Dataset
Link: https://msropendata.com 
This is a data repository that makes the datasets generated by Microsoft researchers accessible to data scientists

*Computer Vision Dataset
Link: https://www.visualdata.io 
You can use this if you want to work on Image Recognition, CV etc
*Kaggle Dataset
Link: https://www.kaggle.com/datasets 
It includes a variety of datasets of various shapes & sizes. I would rate Kaggle as the best dataset repository

*Amazon Dataset
Link: https://registry.opendata.aws/ 
This includes datasets in fields such as public transport, satellite imagery etc
*UCI Machine Learning Repository:
Link: http://mlr.cs.umass.edu/ml/ 
The Repository at UCI provides an up-to-date resource for open-source datasets.

*VisualData
Link: https://www.visualdata.io/ 
Discover computer vision datasets by category; it allows searchable queries.
In this post, we discussed datasets in machine learning and the significance of data analysis.
We have also seen the different types of datasets and data available from a machine learning perspective.

If you need help, just DM.
You can follow @RealSaintSteven.