K-means clustering Algorithm and version control systems

Dnyaneshwari Kondhalkar

Intern @IMD | Former MongoDB Intern @Data sentinel | AISSMS IOIT'2025 | GPP'2022

Published: July 25, 2023

In the previous article we saw the integration of machine learning, DevOps and MLOps. Now we move on to machine learning algorithms and version control systems.

Machine learning algorithms are programs that can learn from data and improve from experience, without human intervention. Learning tasks may include learning the function that maps the input to the output, learning the hidden structure in unlabelled data; or ‘instance-based learning’, where a class label is produced for a new instance by comparing the new instance (row) to instances from the training data, which were stored in memory. ‘Instance-based learning’ does not create an abstraction from specific instances.

In this article we cover the k-means clustering algorithm.

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes. A cluster refers to a collection of data points aggregated together because of certain similarities.

You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the centre of the cluster. Every data point is allocated to one of the clusters in a way that reduces the in-cluster sum of squares.

In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest cluster, while keeping the clusters as compact as possible (that is, keeping the in-cluster sum of squares small).

The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.
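The in-cluster sum of squares and the averaging step can be written out explicitly. The symbols here are notation chosen for this sketch, not taken from the session: C_j is the set of points in cluster j, mu_j is its centroid, and J is the quantity K-means tries to make small.

```latex
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i,
\qquad
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

Each centroid is the mean of the points assigned to it, and J adds up the squared distance from every point to its own centroid.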

How the K-means algorithm works

To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids. It halts creating and optimizing clusters when either:

The centroids have stabilized — there is no change in their values because the clustering has been successful.

The defined number of iterations has been achieved.
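The loop described above can be sketched in plain Python. This is a minimal illustration, not the Scikit-learn implementation; the function name kmeans, its parameters and the sample points are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Start from k data points chosen at random as the first centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)

    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])

        # Stop early once the centroids have stabilized.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return centroids, labels


# Hypothetical sample: two well-separated groups of 2-D points.
sample_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
                 (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(sample_points, k=2)
print(centroids)
print(labels)
```

The two stopping conditions map directly onto the code: the break handles stabilized centroids, and max_iters caps the number of iterations.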

K-means algorithm example:

Let’s walk through how the K-means machine learning algorithm works using the Python programming language.

We’ll use the Scikit-learn library and a small sample dataset to give a simple illustration of K-means clustering.

Step 1: Import libraries and load data (data handling).

In the code for this step, we import the Pandas library as pd and then load the data using read_csv().

[Image: Data Handling]

Here the name of the CSV file is data.csv, and we can see that the data has loaded successfully. The file has two attributes: the service provider ID (Service_providers ID) and the rating given by service consumers.
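Since the original screenshot is not available, here is a minimal sketch of what this step might look like. The column names and the four rows are hypothetical stand-ins for data.csv, and the text is read from memory so the snippet runs without the file.

```python
import io

import pandas as pd

# Hypothetical contents standing in for data.csv (the real file is not shown).
csv_text = """Service_providers_ID,rating
1,5
2,1
3,2
4,1
"""

# read_csv accepts a file path or a file-like object; with the real file
# this would be pd.read_csv("data.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Quick checks that the data loaded as expected.
print(df.head())
print(df.shape)
```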

Step 2: Model creation and prediction.

Here we create a model with the number of clusters set to 2 and train it using fit(). Prediction is done using predict().

Here the data is grouped on the basis of ratings: the service provider with ID 1 has the highest rating, while IDs 2, 3 and 4 have low ratings. Ratings are divided into two categories, low and highest. A rating is considered low if it is between 0 and 2, and highest if it is between 3 and 5.
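A minimal sketch of this step with Scikit-learn follows. The DataFrame values are hypothetical (chosen only to match the description above), and clustering on the rating column alone, as well as the n_init and random_state settings, are assumptions made so the example is reproducible.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the contents of data.csv.
df = pd.DataFrame(
    {
        "Service_providers_ID": [1, 2, 3, 4],
        "rating": [5, 1, 2, 1],
    }
)

# Cluster on the rating column only, since the grouping is by rating.
X = df[["rating"]]

# Two clusters; n_init and random_state are fixed so repeated runs agree.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X)

# predict() returns the index of the nearest centroid for each row.
df["cluster"] = model.predict(X)

# Cluster numbers are arbitrary, so name each cluster by its centroid:
# the cluster with the larger centre is the "highest" rating group.
high_cluster = model.cluster_centers_.ravel().argmax()
df["rating_group"] = df["cluster"].map(
    lambda c: "highest" if c == high_cluster else "low"
)
print(df)
```

Note that K-means finds the two groups from the data itself; the 0-2 and 3-5 ranges describe the result on this dataset rather than thresholds passed to the model.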

Version control systems

Version control, also known as source control, is the practice of tracking and managing changes to software code. Version control systems are software tools that help software teams manage changes to source code over time. As development environments have accelerated, version control systems help software teams work faster and smarter. They are especially useful for DevOps teams since they help them to reduce development time and increase successful deployments.

Version control software keeps track of every modification to the code in a special kind of database. If a mistake is made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while minimizing disruption to all team members.

For almost all software projects, the source code is like the crown jewels: a precious asset whose value must be protected. For most software teams, the source code is a repository of the invaluable knowledge and understanding about the problem domain that the developers have collected and refined through careful effort. Version control protects source code from both catastrophe and the casual degradation of human error and unintended consequences. Software developers working in teams are continually writing new source code and changing existing source code. The code for a project, app or software component is typically organized in a folder structure or "file tree". One developer on the team may be working on a new feature while another developer fixes an unrelated bug by changing code, and each developer may make their changes in several parts of the file tree.

Version control helps teams solve these kinds of problems, tracking every individual change by each contributor and helping prevent concurrent work from conflicting. Changes made in one part of the software can be incompatible with those made by another developer working at the same time. This problem should be discovered and solved in an orderly manner without blocking the work of the rest of the team. Further, in all software development, any change can introduce new bugs on its own and new software can't be trusted until it's tested. So testing and development proceed together until a new version is ready.

Git is a version control tool, so what is Git?

Git:

Git stores and thinks about information in a very different way from other version control systems, and understanding these differences will help you avoid becoming confused while using it.

Most operations in Git need only local files and resources to operate; generally no information is needed from another computer on your network. If you’re used to a centralized version control system (CVCS), where most operations have that network latency overhead, this aspect of Git will make you think that the gods of speed have blessed Git with unworldly powers. Because you have the entire history of the project right there on your local disk, most operations seem almost instantaneous.

For example, to browse the history of the project, Git doesn’t need to go out to the server to get the history and display it for you — it simply reads it directly from your local database. This means you see the project history almost instantly. If you want to see the changes introduced between the current version of a file and the file a month ago, Git can look up the file a month ago and do a local difference calculation, instead of having to either ask a remote server to do it or pull an older version of the file from the remote server to do it locally. This also means that there is very little you can’t do if you’re offline or off VPN. If you get on an airplane or a train and want to do a little work, you can commit happily (to your local copy, remember?) until you get to a network connection to upload. If you go home and can’t get your VPN client working properly, you can still work. In many other systems, doing so is either impossible or painful. In Perforce, for example, you can’t do much when you aren’t connected to the server; in Subversion and CVS, you can edit files, but you can’t commit changes to your database (because your database is offline). This may not seem like a huge deal, but you may be surprised what a big difference it can make.

Git Generally Only Adds Data

When you do actions in Git, nearly all of them only add data to the Git database. It is hard to get the system to do anything that is not undoable or to make it erase data in any way. As with any VCS, you can lose or mess up changes you haven’t committed yet, but after you commit a snapshot into Git, it is very difficult to lose, especially if you regularly push your database to another repository.

This makes using Git a joy because we know we can experiment without the danger of severely screwing things up. For a more in-depth look at how Git stores its data and how you can recover data that seems lost, see Undoing Things.

Three States

Git has three main states that your files can reside in: modified, staged, and committed:

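Modified means that you have changed the file but have not committed it to your database yet. Staged means that you have marked a modified file in its current version to go into your next commit snapshot. Committed means that the data is safely stored in your local database.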

This leads us to the three main sections of a Git project: the working tree, the staging area, and the Git directory.

The basic Git workflow goes something like this:

- You modify files in your working tree.

- You selectively stage just those changes you want to be part of your next commit, which adds only those changes to the staging area.

- You do a commit, which takes the files as they are in the staging area and stores that snapshot permanently to your Git directory.
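The workflow above can be mimicked with a toy model in Python. This is only a conceptual sketch and not how Git stores data internally; the ToyRepo class, its methods and the file name notes.txt are all invented for illustration.

```python
class ToyRepo:
    """Toy model of the working tree, staging area and Git directory."""

    def __init__(self):
        self.working_tree = {}  # file name -> content currently on disk
        self.index = {}         # staging area: content marked for the next commit
        self.commits = []       # "Git directory": list of snapshots, oldest first

    def write(self, name, content):
        # Editing a file only touches the working tree.
        self.working_tree[name] = content

    def stage(self, name):
        # Staging copies the current version of the file into the index.
        self.index[name] = self.working_tree[name]

    def commit(self):
        # A commit stores the staged files as a new snapshot; nothing is erased.
        self.commits.append(dict(self.index))

    def state(self, name):
        head = self.commits[-1] if self.commits else {}
        current = self.working_tree[name]
        if self.index.get(name) != current:
            return "modified"   # changed on disk but not staged
        if head.get(name) != current:
            return "staged"     # marked to go into the next commit
        return "committed"      # safely stored in the latest snapshot


repo = ToyRepo()
repo.write("notes.txt", "first draft")
state_after_edit = repo.state("notes.txt")
repo.stage("notes.txt")
state_after_stage = repo.state("notes.txt")
repo.commit()
state_after_commit = repo.state("notes.txt")
print(state_after_edit, state_after_stage, state_after_commit)
```

One simplification to keep in mind: real Git reports a brand-new file as untracked rather than modified, which this toy model does not distinguish.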

After downloading and installing Git, we can run commands in Git Bash.

The ls command lists files, folders and directories.

Here I have created a directory called demo1.

This directory is empty, which is why the ls command shows nothing.

I then created a text document in it and ran ls again; this time it shows the name of that document.

The git init command creates a new Git repository.

Thank you Kushal Sharma Sir for such a wonderful session. Thank you AISSMS Institute of Information Technology for organizing this value added course.

#day2 #learning #machinelearning #clustering

