Understanding GitHub Essentials in Machine Learning
Debi Prasad Rath
@AmazeDataAI- Technical Architect | Machine Learning | Deep Learning | NLP | Gen AI | Azure | AWS | Databricks
When I started learning data science, I spent a lot of time talking to other aspirants in the field. One thing I noticed was that all of them were working through use cases one after another, so I started doing the same. Even so, I struggled to clear interviews and land a job. I thought it through but could not pin down the reason, so I talked to my mentor about what was going wrong. He gave me what may be the best recommendation I have ever received: start publishing my work and code on a public platform. I was confused and asked why that was necessary. He explained that sharing your work, code included, on a public platform makes it appealing to prospective employers.
The platform most commonly used for this is GitHub. If you do not know what GitHub is, that is fine. GitHub is a platform where developers, testers and other individuals can publish, review and maintain their code under version control. We can also fork a project developed by someone else and use it for our own development without disturbing the original source code. It is an excellent tool and a must-have skill for every data scientist. However, many of us find it hard to set up a GitHub repository and keep it updated as the code base changes. That is why we are here: this post walks through the setup so that the repository on GitHub works seamlessly in sync with its local copy.
With that, let us get started. If you do not have a GitHub account, head to https://github.com/ , sign up and create your account by following the instructions there. The steps in this post are written for a Windows machine, since that is what I use, so please keep that in mind. I am also assuming that you have Git installed on your machine. If not, go ahead and install Git, which includes Git Bash on Windows, from https://git-scm.com/downloads . For the sake of simplicity, you can accept all the default options in the installation wizard. That covers the prerequisites, so without further ado, let us get into the steps.
1- The most important aspect of using GitHub is publishing code as it changes. To do so, the first thing is to connect our machine to GitHub with an SSH key. GitHub uses this key to authenticate us every time we change our repository, which saves us from the painstaking task of entering a user ID and password each time.
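To create an SSH key, follow the steps below.
a. Launch Git Bash and type the command below, replacing the placeholder address with the email tied to your GitHub account.
ssh-keygen -t rsa -b 4096 -C "your_email@example.com"
When prompted for a file location, press Enter to accept the default (~/.ssh/id_rsa), which the later commands assume. You can then set a passphrase or leave it empty. I hope this is clear so far.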
b. Generating the SSH key does not complete the authentication setup by itself. We also need to hand the key (id_rsa) to the ssh-agent. Run the two commands below one at a time from Git Bash: the first starts the agent and the second adds your key to it. When both succeed, you will see a message saying that your identity has been added.
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
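Before generating a key, it may be worth checking whether a pair already exists, because replacing a key means it stops working wherever the old one was registered. The snippet below is only a minimal sketch of that check, assuming the default file names id_rsa and id_rsa.pub. The helper name has_rsa_key and the scratch directory are made up for illustration.

```shell
# Hypothetical helper: succeeds only if both halves of an RSA key pair
# exist in the given directory (default file names assumed).
has_rsa_key() {
    [ -f "$1/id_rsa" ] && [ -f "$1/id_rsa.pub" ]
}

# Demo on a scratch directory so the real ~/.ssh is never touched.
scratch="$(mktemp -d)"

if has_rsa_key "$scratch"; then before="found"; else before="missing"; fi

# Empty stand-in files; a real pair would come from ssh-keygen.
touch "$scratch/id_rsa" "$scratch/id_rsa.pub"

if has_rsa_key "$scratch"; then after="found"; else after="missing"; fi

echo "before: $before, after: $after"
rm -rf "$scratch"
```

On your own machine the call would be has_rsa_key "$HOME/.ssh", run before ssh-keygen.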
2- Now that the SSH key is added to the agent, we must add it to GitHub as well. First, run the command below in Git Bash. There is nothing to worry about here: it simply copies the public key to the clipboard. Go ahead and press Enter.
cat ~/.ssh/id_rsa.pub | clip
Then head back to GitHub and click your profile picture at the top right of the page to open a drop-down list. Select Settings from the list, then select “SSH and GPG keys” to proceed.
On the next page, click the “New SSH key” button. Give the key a name and paste the copied key from your clipboard (Ctrl+V) into the Key section, as in the image below. The copied key ends with your email address. I have already added my key to GitHub, so I am not doing it again here. Finally, click the “Add SSH key” button to complete the process. You are all set.
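If you are unsure whether the right text landed on the clipboard, it helps to know what a public key line looks like. The sketch below splits a made-up, shortened key line into its parts. The key data and the address are placeholders, not a real key, and the check at the end is only a rough sanity test that I am assuming is good enough for a demo.

```shell
# Made-up, shortened public key line (not a real key).
pub_line='ssh-rsa AAAAB3NzaC1yc2EXAMPLEONLY you@example.com'

# A public key line has three space-separated fields:
# algorithm, key data, and a comment (the email passed with -C).
key_type=$(printf '%s\n' "$pub_line" | awk '{print $1}')
key_comment=$(printf '%s\n' "$pub_line" | awk '{print $3}')

echo "type:    $key_type"
echo "comment: $key_comment"

# A private key file starts with a "-----BEGIN" line instead, so this
# check catches pasting the wrong file.
case "$pub_line" in
    ssh-rsa\ *) looks_public="yes" ;;
    *)          looks_public="no"  ;;
esac
echo "looks like a public RSA key: $looks_public"
```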
3- Your machine is now linked to your GitHub account. This step matters because GitHub can now verify that changes to your code and repositories come from your machine, and therefore from you. The next task is to create a GitHub repository. Let us do that.
Head back to the GitHub home page. If you already have repositories, they are listed on the left side under Repositories. You can create a repository by clicking the green New button, highlighted in yellow in the image below.
Another option is to click the “+” sign next to your profile picture and select “New Repository” from the drop-down menu, as shown below. Whichever option you choose, both lead to the same page, where you fill out a form to create your repository.
I have already created a repository in my GitHub account, but do not worry, I will walk you through creating one. In the image above, the first field is the repository name. You can use any name, for instance “GitHub-essentials”. The next field is the description of the repository. It gives other users an idea of what the repository is about, and it makes the repository look more professional. Next, you choose whether the repository is public or private, which decides whether other users can see the code. Private repositories used to require a paid plan, but GitHub now offers them on free accounts as well. If your goal is to showcase your work to employers and share the code base with other data scientists, make your repositories public. One more thing: make sure you create the repository with a README file. The description we added earlier goes into this README file by default. With all of this in place, go ahead and click the “Create repository” button.
Since the README file is the first thing users see when they open your repository, it is good practice to fill it with useful information about the use case and the problem statement you are working on. Use the edit (pencil) button on the right side to add more detail about your repository. You might include examples, prerequisites and instructions for using the repository. That makes sense, does it not? Fine, let us move on to the next step.
4- Now that your repository is created, let us create a local copy of it. This process is known as cloning. Click the “Clone or download” button (labelled “Code” on newer versions of GitHub) and click the copy icon to copy the repository URL to the clipboard. Make sure you copy the SSH URL, not the HTTPS one. You can also download the repository to a folder of your choice, but I personally prefer cloning because it is automated and keeps the local copy linked to the repository on GitHub.
Head back to Git Bash and navigate to the directory where you want the local copy of the repository to live. For example, you can create a folder called “My_Projects” that holds all your work in one place. Run the command below:
mkdir ~/Desktop/My_Projects
This creates a folder called “My_Projects” on the desktop. Then navigate to that directory from Git Bash with this command:
cd ~/Desktop/My_Projects
Now that you are in the “My_Projects” folder, clone your repository by putting the URL from your clipboard in place of the one shown here. Run the command below:
git clone git@github.com:rathdebi/github_essential.git
If you now run ls in Git Bash, you should see a folder with your repository name under the “My_Projects” directory. In my case, that folder is “github_essential”.
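Since the SSH and HTTPS clone URLs are easy to mix up, here is a small sketch that tells the two forms apart by their shape. The helper name clone_protocol is hypothetical, and your-username/your-repo is a placeholder rather than a real repository.

```shell
# Hypothetical helper: classify a clone URL by its form.
clone_protocol() {
    case "$1" in
        git@*:*)    echo "ssh"     ;;
        https://*)  echo "https"   ;;
        *)          echo "unknown" ;;
    esac
}

# Placeholder URLs; your-username/your-repo is not a real repository.
ssh_url='git@github.com:your-username/your-repo.git'
https_url='https://github.com/your-username/your-repo.git'

ssh_kind=$(clone_protocol "$ssh_url")
https_kind=$(clone_protocol "$https_url")

echo "$ssh_url -> $ssh_kind"
echo "$https_url -> $https_kind"
```

Inside an already cloned repository, git remote get-url origin prints the URL that was used, so the same check can be applied after the fact.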
5- Now that we have cloned the repository, let us add a Python script to our project. Navigate to the cloned repository, github_essential, by running the following command from Git Bash:
cd ~/Desktop/My_Projects/github_essential
If you chose Vim as your editor while installing Git Bash, you can create the Python file with Vim. If not, you have to create the Python script in another tool, such as Jupyter from Anaconda, and save it in the “github_essential” folder. To create the file with Vim, type the following command:
vim python_github.py
Vim opens a new, empty file named python_github.py. Press i to switch to insert mode and add a print statement to the file. To save, press Esc, type :wq and press Enter, which writes the file and returns you to Git Bash.
Note: This is just a few lines of code for the demo. For real-world use cases, use a proper editor or IDE, such as the ones that ship with Anaconda, or Sublime Text.
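If Vim feels unfamiliar, a one-line script like this can also be written straight from the shell. This is only a sketch of that alternative, not the method used above. The message text is made up, and the scratch directory stands in for the cloned repository folder.

```shell
# Work in a scratch directory; in practice you would already be inside
# the cloned repository folder.
scratch="$(mktemp -d)"
cd "$scratch" || exit 1

# Write a one-line Python script without opening an editor.
# The message text is made up for this demo.
printf '%s\n' 'print("Hello from my first repository")' > python_github.py

# Read the file back to confirm what was written.
file_content="$(cat python_github.py)"
echo "$file_content"

# Return to the previous directory and clean up the scratch folder.
cd - >/dev/null && rm -rf "$scratch"
```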
Now run git status from your Git Bash terminal. The new file is listed as untracked.
Updating the remote repository with the local version takes three steps:
a- Staging changes
b- Commit
c- Publish
To stage changes, run git add with the name of the Python script. Run the following command from your Git Bash terminal:
git add python_github.py
If you run git status now, the staged changes are shown in green.
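The colours are hard to show in text, so here is a rough sketch of the same before-and-after using the compact git status --short output. It runs in a throwaway repository created with git init rather than in your clone, and the file content is a stand-in.

```shell
# Throwaway repository so nothing real is affected.
scratch="$(mktemp -d)"
cd "$scratch" || exit 1
git init -q .

printf '%s\n' 'print("demo")' > python_github.py

# Untracked: the short status marks new, unstaged files with "??".
status_before="$(git status --short)"

git add python_github.py

# Staged: "A" in the first column means "added to the index".
status_after="$(git status --short)"

echo "before add: $status_before"
echo "after add:  $status_after"

cd - >/dev/null && rm -rf "$scratch"
```

The first line should read ?? python_github.py and the second A  python_github.py, which matches the untracked and staged states described above.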
The next step is to commit the changes with the git commit command. For my python_github.py file, I run the command below:
git commit -m "adding a python script which prints a message"
The text after -m is the commit message, which describes what the change does.
Once this runs, Git prints a summary of the commit, as shown below. For python_github.py, the summary shows one file changed with one insertion, which is the line we added when creating the file. Had there been more files with insertions and deletions, all of them would appear in the summary.
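The summary can also be read back after the fact. The sketch below repeats the commit in a throwaway repository and then asks Git for the message and the change summary. The user name and email passed with -c are placeholders that apply to that one command only, and the file content is a stand-in.

```shell
# Throwaway repository so nothing real is affected.
scratch="$(mktemp -d)"
cd "$scratch" || exit 1
git init -q .
printf '%s\n' 'print("demo")' > python_github.py
git add python_github.py

# -c sets a demo identity for this one command; both values are placeholders.
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q -m "adding a python script which prints a message"

# Subject line of the latest commit.
subject="$(git log -1 --pretty=format:%s)"
# Change summary for that commit, without the diff itself.
summary="$(git show --shortstat --pretty=format: HEAD)"

echo "subject: $subject"
echo "summary: $summary"

cd - >/dev/null && rm -rf "$scratch"
```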
We are almost at the end, which sounds great. Now we just need to push the changes to the remote repository. To do so, run this command:
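git push origin master
Plain git push does the same thing here, because the cloned master branch already tracks the branch on GitHub. If the default branch of your repository is named main, as on newer GitHub repositories, use that name instead of master.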
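To try the whole round trip without touching GitHub, the sketch below uses a bare repository on disk as a stand-in for the remote. That stand-in is my assumption for the demo and not part of the setup above. It also asks Git for the current branch name instead of assuming master or main. The identity values are placeholders.

```shell
scratch="$(mktemp -d)"

# A bare repository on disk plays the role of the GitHub remote here.
git init -q --bare "$scratch/remote.git"
# Cloning an empty repository may print a warning, which is expected.
git clone -q "$scratch/remote.git" "$scratch/work"
cd "$scratch/work" || exit 1

printf '%s\n' 'print("demo")' > python_github.py
git add python_github.py
# -c sets a demo identity for this one command; both values are placeholders.
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q -m "adding a python script which prints a message"

# Ask Git for the current branch name instead of assuming master or main.
branch="$(git symbolic-ref --short HEAD)"
git push -q origin "$branch"

# Read the history straight from the stand-in remote to confirm the push.
remote_log="$(git --git-dir="$scratch/remote.git" log --all --pretty=format:%s)"

echo "pushed branch: $branch"
echo "remote log:    $remote_log"

cd - >/dev/null && rm -rf "$scratch"
```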
If I now check my “github_essential” repository on GitHub, the python_github.py script is there along with its commit message. Just refer to the image below.
Congratulations on successfully creating your GitHub repository and setting it up so that you can work iteratively on your code. Take this as a successful build and keep adding files to your project.
I hope this makes your life a bit easier in understanding GitHub essentials for machine learning. You should now have a fair idea of how to create a repository, commit changes and push them. This will help you keep everything up to date and save your work as you go. Thanks for reading. Have a nice day.
Yours,
Debi Prasad Rath
Data Scientist