Cloud Tech - Under The Hood (Part-I)

Every day we use software products and services like Google, Facebook, Amazon, and Uber for our various needs like web search, socializing, shopping, taxi bookings, banking, travel reservations, and so on. Ever wondered where the major part of this kind of software is running? Yes, behind the scenes it is running on the cloud. It is the various cloud technologies that power up the raw compute, providing the cloud infrastructure to run these applications on devices like desktops, phones, tablets, etc. But "why" is cloud needed, "what" is cloud, and "how" does it work? Let us try to briefly understand that and then delve a little into the under-the-hood concepts that drive the cloud.

Why Cloud?

In earlier days there used to be big computing machines called mainframes, with terminals connected to them. Users would work on these terminals, submit a job to be run, and wait for the output. It was a client-server model where the terminals were the clients and the mainframe was the server. Mainframes in one region could communicate with mainframes in other regions, and the implementation of these systems was based on the concepts of distributed computing.

As technology advanced, we got the personal computer (PC), which could do things we were earlier doing on such big systems. This was a major shift, as software usage now moved from enterprise/organization users to personal users. Business opportunities boomed and the market filled with PC software like operating systems (Windows), office applications (Lotus Notes, HPG, MS Office, etc.), and internet browsers (Netscape Navigator, Internet Explorer). Applications which were earlier designed with mainframe tech in mind were now being designed for PCs. As more people got PCs and grew used to them, software usage started to increase rapidly, and so did the demand for more products and features. As the usage and complexity of this software increased, more powerful PCs were needed in terms of faster processors, more RAM, and more disk space (scale up).

But not everything ran, or needed to run, on PCs; there was processing happening on backend servers as well, for applications like banking and search. So we had software which ran completely on the PC (say, MS Word) or partially on the PC and partially on backend machines (say, a banking application used either by customers or by employees serving customers in retail branches). As internet speeds increased, the trend shifted to backend-powered applications, more generally termed online applications, and browsers became the de facto clients. The nature of the applications demanded that. A search engine like Google had to do a lot of background processing of webpages from across the world, and as search queries came in it had to quickly give back the results. Another example is an e-commerce site like Amazon, which had to process long lists of catalogues, merchants, payment methods, and shipment methods, and cater to millions and millions of users with transactions running into billions.

This led to a big economic shift, which was at the time termed The New Internet Economy. A lot of things in the physical world, like shopping and banking, started to move to the online world. This put a huge demand on backend infrastructure. To start with, this infrastructure needed to be up all the time (24x7) and give good performance. Organizations had to invest in physical space and put in place things like racks for server machines, electricity cabling, air conditioning, network cabling, compute machines, and staff. This gave rise to a business opportunity: someone could provide all of this as a service, and organizations could get things on demand, as if pulling them magically off a cloud. Early movers like Amazon realized this and came up with a service called AWS. Now an organization can get, at the click of a button, all the compute it needs (scale in and scale out) without having to bother about all the things mentioned above to get the infra in place. Phew! What a relief it was.

Now organizations just have to focus on building their applications and rely on a backend infra provider (like AWS) to take care of all their infra needs. This is the kind of model many organizations now like to follow, like Uber, Netflix, etc. So, I guess this gives a quick perspective of the changing needs and why cloud is needed. Now, let's take a step further, try to look under the hood, and find out "what" is cloud and "how" some of the cloud computing concepts make it work.

What is cloud?

In order to understand what cloud is, let us first understand what we need to run an application at the backend. At the very basic level we need compute, data storage, and networking. To provide this, we need to put together a bunch of machines and have them communicate with each other over a network. So, that's simple. Well, there is more to it.

Let's say you purchase one PC and assume the likelihood of it failing (crashing, hanging, faulty RAM, etc.) is once in 3 years. So its mean time to failure (MTTF) would be 3 years, or 3*12 = 36 months. However, with 3 such machines the cluster's MTTF would be 1 year, or 12 months, and with 1200 machines it would be 0.03 months (a small sketch of this arithmetic follows the video links below). So, when we have hundreds and thousands of machines running, machines are likely to be failing every day, hour, and minute. What is the impact of that on the applications running on these machines? You do not want to see things like a money transfer where the amount gets deducted from your account but the server machine crashes before the other account can be credited. Another example: you were shopping online, put something in the cart, and as you were about to check out, a server machine failed at the backend and your cart became empty. With millions of users on the system, there is a huge number of machines running at the backend doing various transactions, and it is very likely that machines will be failing. So, what we need from our backend infra (the cloud) are ways and mechanisms to handle these and many other kinds of scenarios.

Let's try to list a few things that are needed. We need a backend system which can provide us compute (nodes with different kinds of processing capabilities and memory), data storage, network capabilities, job handling, and various services like monitoring. A physical setup to accommodate these machines is needed, where all the machines are housed in racks and a network topology is set up. Storage nodes are connected to this network. On top sits a software layer through which all of this can be accessed, allocated (auto-scaled to meet heavy demand), and monitored as needed; this is the layer via which we can acquire all the infra we need at the click of a mouse. This complete computing setup at a single location is what is more generally called a datacentre (DC). There can be multiple DCs geographically distributed in different regions, and a cloud can be offered on a single DC or on multiple DCs in different regions across the world. Organizations may set up their own cloud (private) or use commercially available clouds (public) like AWS, Azure, Google, Oracle, etc. Check out the videos given below to get some idea about DCs and also the kind of information (data) they process:

Google DC walkthrough: https://www.youtube.com/watch?v=avP5d16wEp0 , https://www.youtube.com/watch?v=XZmGGAbHqa0

Tour Oracle's State of the Art Data Centers: https://www.youtube.com/watch?v=f4RBs43G17g

Inside Facebook's Oregon data center (CNET News): https://www.youtube.com/watch?v=4A_A-CmrqpQ , https://www.youtube.com/watch?v=Ypn1zgsNgbo
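
As promised above, here is a minimal back-of-the-envelope sketch (in Python) of the failure arithmetic. It assumes machines fail independently, so the cluster's mean time to first failure is simply the single-machine MTTF divided by the number of machines:

    # Cluster MTTF under the simplifying assumption of independent failures.
    SINGLE_MACHINE_MTTF_MONTHS = 36  # one failure every 3 years per machine

    def cluster_mttf_months(num_machines):
        return SINGLE_MACHINE_MTTF_MONTHS / num_machines

    for n in (1, 3, 1200):
        mttf = cluster_mttf_months(n)
        print(f"{n:>5} machines -> MTTF ~ {mttf:.2f} months (~{mttf * 30:.1f} days)")

With 1200 machines this comes out to 0.03 months, i.e., a failure roughly every day, which is exactly why the cloud must treat failure as the normal case rather than the exception.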


How Cloud Works – Cloud Tech Challenges

This is a vast area with many aspects, and hence many topics, to study. Let's look at some of them now, and maybe more later. Broadly, the cloud works on the concepts of distributed computing systems. We have hundreds and thousands of servers and clients, with immense communication happening client-to-server, server-to-server, and client-to-client (P2P). This is accompanied by huge data flows and, of course, supported by huge data storage.

Designing and running distributed systems has thrown up interesting challenges and solutions at various stages of their evolution. P2P systems saw interesting designs emerge and work for various solutions: Gnutella, Napster (remember Sean Parker and the movie The Social Network :-). Ever wondered why it is called Napster? Check it out.), Kazaa, BitTorrent, FastTrack, eDonkey/Overnet, etc. The huge scale of data being handled, and the different use cases to be catered to, led to the development of new cloud storage concepts and solutions like key-value stores, NoSQL, Cassandra, Hadoop, Redis, Riak, etc. Given the huge number of clients and servers, solutions were also needed for problems of concurrency, replication, and coordination (Paxos, leader election, snapshots, etc.). This is just a starter list and many more items can be added to it. Let's look at a few of them now.

Given a cluster of N nodes, how can a node (A) with a piece of information communicate it to every other node (the remaining N-1 nodes)? That is, node A needs to multicast this information. One simple approach is for node A to establish one-to-one communication with the rest of the nodes (N-1), which costs O(N). Can this be done better? And what if nodes crash and packets are dropped? Another approach is tree-based: a spanning tree is built amongst the processes of the cluster, and each receiver node sends an acknowledgement to the sender node on receiving a packet. Approaches like SRM (Scalable Reliable Multicast) and RMTP (Reliable Multicast Transport Protocol) are used for this purpose. However, given that each node sends an acknowledgement, this approach is still O(N), and we need some faster mechanism. This is where the "epidemic" multicast ("gossip") approach comes in. Ever wondered how infectious diseases spread fast and turn into epidemics, or how rumours spread so fast? Gossip multicast works on the same principle: any node receiving a packet acts as a sender and multicasts it further. Packets can either be pushed out by nodes (push gossip), pulled by nodes (pull gossip), or both. This way the information spreads at a very rapid rate, in O(log N) rounds, which grows so slowly that for all practical purposes it is almost constant. This means that as more nodes are added, the impact on performance is small, and we get consistent, almost constant performance. The gossip approach is possibly used by many cloud infra providers, and is used in solutions like Cassandra, given that it is fast, reliable, scalable, topology aware, and fault tolerant.
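
Here is a minimal simulation of push gossip in Python (the function names are illustrative, not from any real library): in each round, every node that already has the packet pushes it to a randomly chosen peer. The number of rounds needed to reach everyone grows roughly as log(N), not N:

    import math
    import random

    def push_gossip_rounds(num_nodes, fanout=1, seed=42):
        """Rounds until all nodes have the packet, with each infected
        node pushing to `fanout` randomly chosen peers per round."""
        random.seed(seed)
        infected = {0}  # node 0 starts with the packet
        rounds = 0
        while len(infected) < num_nodes:
            new_targets = {random.randrange(num_nodes)
                           for _ in infected for _ in range(fanout)}
            infected |= new_targets
            rounds += 1
        return rounds

    for n in (100, 1000, 10000):
        print(f"N={n:>6}: ~{push_gossip_rounds(n)} rounds "
              f"(log2 N ~ {math.log2(n):.1f})")

Try increasing N tenfold and notice that the round count barely moves; that is the O(log N) behaviour described above.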

Time synchronization is another challenge. On a desktop PC all the processes share the same common clock; in the cloud, however, each node has its own clock. Is that a problem? Consider an e-commerce system where Server A receives an order for the last copy of the Harry Potter book "Goblet of Fire" in the inventory and timestamps it using its local clock as 5h:15m:30.15s. Server A sends a message to Server B that the inventory for this book is "empty", and B records this status at its end with its own local timestamp, say 5h:10m:15.15s. Now when Server C queries A and B, it sees the confusing picture that a user was able to buy the book at A even after the inventory had already become empty at an earlier time at B. This can lead to further problems. So how do we handle this? In this asynchronous system model there are no bounds on message delays and processing delays. The clocks on different nodes may have a relative difference in clock values (clock skew) and a relative difference in clock frequencies (clock drift). A non-zero clock skew implies the clocks are not synchronized, and a non-zero clock drift causes the skew to (eventually) increase. How do we go about synchronizing these clocks? One way is to define an absolute Maximum Drift Rate (MDR) relative to Coordinated Universal Time (UTC). Given this, the maximum drift rate between two clocks with similar MDR is 2*MDR, so if we set the maximum acceptable skew between them as M, they need to synchronize at least once every M/(2*MDR) time units. Synchronization can be categorized as external (Cristian's algorithm and NTP) and internal (the Berkeley algorithm).
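
For a flavour of external synchronization, here is a minimal sketch of Cristian's algorithm in Python (the time server is simulated here; real systems would use something like NTP). The client records when it sent the request and when the reply arrived, and assumes the server's timestamp was taken roughly half a round trip ago:

    import random
    import time

    def query_time_server():
        """Simulated time server: returns its clock reading after
        random one-way network delays in each direction."""
        time.sleep(random.uniform(0.01, 0.05))  # request in flight
        server_time = time.time()               # server reads its clock
        time.sleep(random.uniform(0.01, 0.05))  # reply in flight
        return server_time

    def cristian_sync():
        t0 = time.time()                 # client clock: request sent
        server_time = query_time_server()
        t1 = time.time()                 # client clock: reply received
        rtt = t1 - t0
        # Assume the reply took about half the round trip to arrive.
        return server_time + rtt / 2     # client's new clock value

    print(f"synchronized clock value: {cristian_sync():.6f}")

The error of this estimate is bounded by half the round-trip time, which is why the approach works best over low-latency links.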

Trying to sync the clocks is one approach. But could the timestamps attached to events capture not "absolute time" but "causality"? That is, if an event A happens causally before an event B, then timestamp(A) < timestamp(B). This would achieve a useful form of synchronization without the need for clock-based timestamps. E.g., Bob will listen (event B) to what Amy said only after Amy has spoken (event A). This concept was proposed by Leslie Lamport and is used in almost all distributed systems (a minimal sketch of the rules follows the links below). His paper on this can be found at:

https://lamport.azurewebsites.net/pubs/time-clocks.pdf

Also, a video on this:

https://www.youtube.com/watch?v=gY9VwiPTa60

Lamport was widely recognised for this and received the prestigious Turing Award for this work:

https://amturing.acm.org/award_winners/lamport_1205376.cfm

https://www.microsoft.com/en-us/research/blog/leslie-lamport-receives-turing-award/
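
As referenced above, here is a minimal sketch of the Lamport clock rules in Python (the class and method names are illustrative): a process increments its counter on every local event, attaches the counter to outgoing messages, and on receive jumps its counter past the received timestamp, so causally ordered events always get increasing timestamps:

    class LamportClock:
        def __init__(self):
            self.counter = 0

        def local_event(self):
            self.counter += 1
            return self.counter

        def send(self):
            self.counter += 1    # sending is itself an event
            return self.counter  # this timestamp travels with the message

        def receive(self, msg_timestamp):
            # Jump past the sender's timestamp to preserve causality.
            self.counter = max(self.counter, msg_timestamp) + 1
            return self.counter

    # Amy speaks (event A, a send); Bob listens (event B, a receive).
    amy, bob = LamportClock(), LamportClock()
    ts_a = amy.send()
    ts_b = bob.receive(ts_a)
    assert ts_a < ts_b  # causality: timestamp(A) < timestamp(B)
    print(f"timestamp(A)={ts_a}, timestamp(B)={ts_b}")

Note that the converse does not hold: concurrent events can also end up with ordered timestamps, so timestamp order does not by itself imply causality.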


That is all for Part-I; more will be covered in the remaining parts. There are lots of interesting under-the-hood cloud tech things to find while looking at OpenStack, Globus, Membership & Grids, MapReduce, Napster, Gnutella, FastTrack, BitTorrent, Pastry, Kelips, and various concepts related to key-value stores. Later we will look into interesting concepts for building applications using this cloud tech, including defining cloud app architecture, deployment topology, performance and scalability (scale in, scale out, auto-scaling), security, storage, monitoring (metrics and charts), backup and disaster recovery, the costing of cloud infra, and evolving new roles like SRE. Stay tuned!

Cloud tech is a really exciting area, as exciting as the song "Cloud No. 9" by Bryan Adams :-) ... Enjoy as we learn more about cloud tech:

https://www.youtube.com/watch?v=QhO-4cCQSUU

Feel free to write to me for further queries and discussions.

Ganesh Sahai

(Techno Business and Social Explorer)


