Introduction to Data
Raj Kishore Agrawal
Data Analyst | SQL, Python, Power BI | Open to Data Analytics Opportunities
Data is a critical component of machine learning. It can be defined as information that has been converted into a form that can be processed and transferred efficiently. Data can be structured or unstructured, and it is often collected to be measured, reported, visualized, and analysed. In machine learning, data is a set of observations or measurements that can be used to train a model.
Types of Data
From a machine learning perspective, most data can be grouped into four basic types: numerical data, categorical data, time-series data, and text data.
Numerical data (quantitative data): This type of data consists of numbers and can include continuous values (such as prices or temperatures) or discrete values (such as counts or rankings). Numerical data can be used as input or output in machine learning algorithms.
Categorical data (qualitative data): This type of data consists of categories or labels, such as names, types, or classes. Categorical data can be used as input or output in machine learning algorithms, but it may need to be converted into a numerical form before certain algorithms can use it.
Time series data: This type of data consists of measurements taken at regular intervals over a period of time. Time series data is often used in machine learning algorithms for tasks such as forecasting or trend analysis.
Text data: This type of data consists of written or spoken words and can include things like emails, social media posts, or customer reviews. Text data is often used in machine learning algorithms for tasks such as natural language processing or sentiment analysis.
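The point above about converting categorical data into a numerical form can be illustrated with one-hot encoding, which is one common approach among several. The following is a minimal sketch in plain Python; the function name one_hot_encode and the sample colour values are illustrative assumptions, not something taken from this article.

```python
def one_hot_encode(values):
    """Turn each category into a row of 0s with a single 1 (one-hot encoding)."""
    # Sort the distinct categories so the column order is reproducible.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this category
        encoded.append(row)
    return categories, encoded


# Hypothetical categorical column (illustrative values only).
colours = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colours)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row has exactly one 1, so the encoding does not impose an artificial order on the categories, which a simple integer code (red = 0, green = 1, ...) would.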
Labelled Data and Unlabelled Data
Unlabelled data is data that lacks labels or categories describing its characteristics or properties. Examples include photos, videos, or text that have not been categorized or classified in any way. It is primarily used in unsupervised machine learning. Applications that work with unlabelled data include customer segmentation, fraud detection, and image and video recognition.
Labelled data refers to a dataset that includes both input features and corresponding output labels. In other words, each data point is tagged with a label that indicates the correct answer or category for that data. For example, in a dataset for image recognition, labelled data would include images along with labels that specify what each image represents (e.g., "cat," "dog," "car"). Labelled data is essential for supervised machine learning, where algorithms learn to make predictions based on this type of training data.
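The difference between the two kinds of data can be seen in how a dataset is represented. The sketch below is only an illustration: the feature names (weight_kg, height_cm) and the values are made up for the example.

```python
from collections import Counter

# Hypothetical labelled dataset: each record pairs input features with a label.
labelled_data = [
    ({"weight_kg": 4.0, "height_cm": 25.0}, "cat"),
    ({"weight_kg": 30.0, "height_cm": 60.0}, "dog"),
    ({"weight_kg": 3.5, "height_cm": 23.0}, "cat"),
]

# Supervised learning uses both parts: the input features and the labels.
features = [record for record, _ in labelled_data]
labels = [label for _, label in labelled_data]

# Without the labels, the same records are unlabelled data: features only.
unlabelled_data = features

# Count how many examples each label has.
label_counts = Counter(labels)
print(label_counts)
```

Counting the labels, as in the last step, is a quick way to check whether one class dominates the dataset.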
Data Labelling
Data labelling is the process of identifying raw data (such as text, PDFs, files, and images) and adding one or more labels to it so that machine learning models can learn from it.
Labelling helps the machine learning model identify which attributes of the data to analyse when making predictions. With enough labelled examples, the model learns to recognize similar data and can make more accurate predictions.
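As a rough illustration of the labelling step, the sketch below attaches a sentiment label to raw text. The keyword rule in label_review is a toy assumption made for this example; in practice labels are often assigned by human annotators or dedicated labelling tools.

```python
def label_review(text):
    """Assign a sentiment label using a toy keyword rule (illustrative only)."""
    positive_words = {"great", "good", "love"}
    negative_words = {"bad", "poor", "hate"}
    words = set(text.lower().split())
    if words & positive_words:
        return "positive"
    if words & negative_words:
        return "negative"
    return "unknown"  # no rule matched; leave for a human annotator


# Hypothetical raw, unlabelled reviews.
raw_reviews = [
    "I love this phone",
    "Battery life is poor",
    "Arrived on Tuesday",
]

# Labelling turns each raw item into a (data, label) pair.
labelled_reviews = [(text, label_review(text)) for text in raw_reviews]
for text, label in labelled_reviews:
    print(f"{label:>8}: {text}")
```

The "unknown" case shows why labelling usually needs review: simple rules cannot decide every item.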
Types of Variables
A variable is an object, event, idea, feeling, time period, or any other type of category you are trying to measure. There are two types of variables: independent and dependent.
Dependent Variable
The dependent variable is the variable whose value depends on the value of another variable in an equation. That is, the value of the dependent variable is always determined by the independent variable in the equation.
For instance, consider the equation y = 4x + 3. In this equation, the value of the variable ‘y’ changes according to the value of ‘x’. Therefore, the variable ‘y’ is said to be a dependent variable.
Independent Variable
An independent variable is a variable whose value does not depend on the other variables in the equation. If x and y are two variables in an algebraic equation and every value of x is associated with a value of y, then y is said to be a function of x. Here ‘x’ is known as the independent variable and ‘y’ is known as the dependent variable.
Example: In the expression y = x², x is the independent variable and y is the dependent variable.
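The two equations used in this section, y = 4x + 3 and y equal to x squared, can be tried out directly. In the sketch below the function names linear and square and the chosen x values are illustrative assumptions; x is picked freely (independent) and y is computed from it (dependent).

```python
def linear(x):
    """Dependent variable for the equation y = 4x + 3."""
    return 4 * x + 3


def square(x):
    """Dependent variable for the equation y = x squared."""
    return x ** 2


# The independent variable: values we are free to choose.
xs = [0, 1, 2, 3]

# The dependent variable: its values follow from the chosen x values.
linear_ys = [linear(x) for x in xs]
square_ys = [square(x) for x in xs]

print(linear_ys)  # [3, 7, 11, 15]
print(square_ys)  # [0, 1, 4, 9]
```

In machine learning terms, the independent variables correspond to the input features and the dependent variable to the output the model tries to predict.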
Training Data and Testing Data
Two key types of data are used in machine learning: training data and testing data. Each has a specific function when building and evaluating machine learning models. Machine learning algorithms learn from the data in datasets: they discover patterns, gain knowledge, make decisions, and evaluate those decisions.
Training Data
Training data is the fuel that powers a machine learning model. It is usually larger than the testing data, because more data generally helps produce more effective predictive models. When a machine learning algorithm receives training data, it recognizes patterns and builds a decision-making model. The quality of the training data largely determines the model's accuracy and predictive ability: the better the quality of the training data, the better the model performs. Training data usually makes up about 70% or more of the total data for an ML project. It involves two processes:
a) Training the model
b) Validation
Testing Data
Once the model has been trained on the training dataset, it is time to test it on the test dataset. This dataset is used to evaluate the model's performance and to check that the model generalizes well to new or unseen data. The test dataset has similar features and a similar class probability distribution to the training dataset, and it serves as a benchmark for model evaluation once training is complete. A good test dataset is well organized and contains data for each type of scenario that the model would face when used in the real world. Usually, the test dataset is approximately 25-30% of the total original data for an ML project.
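The split described in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not a fixed recipe: the helper name split_dataset, the seeds, and the 20% validation share are assumptions made for the example, while the 70/30 train/test ratio follows the proportions mentioned above.

```python
import random


def split_dataset(records, holdout_fraction=0.3, seed=42):
    """Shuffle a copy of the records, then split off a held-out portion."""
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    holdout_size = round(len(shuffled) * holdout_fraction)
    # Return (kept portion, held-out portion).
    return shuffled[holdout_size:], shuffled[:holdout_size]


data = list(range(100))  # stand-in for 100 records

# About 70% for training and 30% for testing.
train, test = split_dataset(data, holdout_fraction=0.3)

# Validation data is carved out of the training portion (20% is an assumption).
train_only, validation = split_dataset(train, holdout_fraction=0.2, seed=7)

print(len(train), len(test))             # 70 30
print(len(train_only), len(validation))  # 56 14
```

Shuffling before splitting helps keep the class distribution similar across the portions, and keeping the test records out of training is what makes the test score a fair estimate of performance on unseen data.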