What happens when you type www.google.com in your browser?

Google Search is the most successful product from Alphabet and holds by far the largest share of the search engine market. But have you ever wondered what happens when you type "www.google.com" into your browser and hit enter?

From a layman's perspective, you type some letters ending in a domain extension, and you are taken to a webpage. Under the hood, however, far more happens than a simple page load.

Before we go further, here are some terms you'll need to know to better understand the concepts behind this seemingly magic action.

TERMINOLOGIES

These terms will help us throughout this article to better understand how information is organized and transmitted over the World Wide Web.

  • Client: A client is an essential part of the web's structure: it sends requests to another program or piece of hardware in order to access a service or functionality hosted there. Clients are usually our laptops, desktop computers, and smartphones, which let us access services over the internet. In the context of this article, "client" refers to our web browser.
  • Server: A piece of hardware or software that provides functionality or services to other programs or devices, called "clients". Servers share and distribute packets of data to multiple clients. In the context of this article, the term "server(s)" refers to the computer system(s) hosting www.google.com.
  • Protocol: More specifically, a communication protocol: a system of rules, or methods, for transmitting data between two devices. The Open Systems Interconnection (OSI) model, the conceptual model used to describe telecommunications between computers, groups these protocols into seven layers.

Understanding these three terms lets us dig into the key steps by which a URL we type is able to fetch data from the web.

1. URL PARSING

Web browsers operate on Internet Protocol (IP) addresses (think of them as the addresses of servers), which underpin everything we do on the web. A browser's core job is to take a string of text and return the web page that corresponds to it, nothing more, nothing less. Of course, a web browser can only do this if it knows what the string of text it receives represents.

IP addresses are hard for people to remember, so we instead use domain names, organized under top-level domains such as .com, .net, .gov, and many others we're commonly familiar with. The Domain Name System (DNS), which maps names to addresses, makes web browsing vastly more convenient; yet domain names in and of themselves are useless to a web browser, which must decode, or resolve, a given one into its corresponding IP address.

What do you do when you are given something you’ve never seen before and asked to describe what it is? You probably examine the given item according to smaller, identifiable features, such as individual details about its looks, feel, or functionalities. Web browsers are no different.

Upon receiving the input string https://www.google.com, the web browser (Chrome, Firefox, Safari, or any other) breaks it up according to this general format:

protocol://hostname:port/path-and-file-name

Our example URL is parsed according to the format above as follows:

  • Protocol: https. The data transfer method to be used between the client and server. In this case, the protocol is HTTPS (HyperText Transfer Protocol Secure).
  • Hostname: www.google.com. The domain name corresponding to the IP address of the server.
  • Port: Think of this as the server's mailbox, where our request will be sent. It is empty in our example URL, but is implied by the web browser based on the protocol; HTTPS uses port 443.
  • Path-and-file-name: The name of the file requested and its location in the server's directory. Also left empty in our example URL, thus implying that we are querying the server at the root, /.
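
To make the breakdown above concrete, here is a minimal sketch using Python's standard urllib.parse module. The helper name parse_url and the DEFAULT_PORTS table are my own illustrative choices, not how any particular browser implements parsing.

```python
from urllib.parse import urlsplit

# Ports implied by the protocol when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_url(url: str) -> dict:
    """Split a URL into the four parts described above (illustrative helper)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        # Fall back to the protocol's default port when none is given.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        # An empty path means we are asking for the root.
        "path": parts.path or "/",
    }


result = parse_url("https://www.google.com")
print(result)
```

Note that both the port and the path are filled in by convention rather than read from the URL, which mirrors what the browser does with our example.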

After parsing the URL, the web browser does some double-checking on the hostname, www.google.com. First, it scans it for any characters outside the ASCII set a-z, A-Z, 0-9, . or -. Our hostname is clean, but if it contained any other character, the browser would use Punycode to encode the hostname into an ASCII-only string.
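
As a rough illustration of that encoding step, Python ships an "idna" codec that applies Punycode to non-ASCII hostname labels. The helper name to_ascii_hostname and the sample hostname are assumptions for this sketch; browsers follow their own, more elaborate, internationalized-domain rules.

```python
def to_ascii_hostname(hostname: str) -> str:
    """Encode a hostname to its ASCII form with Python's built-in IDNA codec."""
    return hostname.encode("idna").decode("ascii")


# A plain ASCII hostname passes through unchanged.
print(to_ascii_hostname("www.google.com"))

# "\u00fc" is the letter u with an umlaut; the label containing it is
# rewritten into an ASCII-only label that starts with "xn--".
print(to_ascii_hostname("b\u00fccher.example"))
```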

Finally, the web browser checks its caches. Recall that web browsers are just like us: humans don't like repeating work they've already done, and browsers are the same way. In its cache, your browser keeps a running store of recent hostnames that it has already resolved.

Let's assume Chrome is our browser. Anytime it matches a repeated hostname, it pulls out its IP directly. Upon failing there, it will search in one last place — the operating system’s cache. In the case that it fails again, and as we’ll assume for our example, the browser must undergo the DNS resolution process.
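
The two-level lookup described above can be sketched as follows. This is a toy model, not Chrome's actual cache: the HostCache class, the lookup function, the expiry (time-to-live) behaviour, and the 192.0.2.10 documentation address are all assumptions made for illustration.

```python
import time


class HostCache:
    """Toy hostname cache whose entries expire after a time-to-live."""

    def __init__(self):
        self._entries = {}  # hostname -> (ip, expires_at)

    def put(self, hostname, ip, ttl_seconds, now=None):
        now = time.monotonic() if now is None else now
        self._entries[hostname] = (ip, now + ttl_seconds)

    def get(self, hostname, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(hostname)
        if entry is None or entry[1] <= now:
            # Missing or expired: forget it and report a miss.
            self._entries.pop(hostname, None)
            return None
        return entry[0]


def lookup(hostname, browser_cache, os_cache, now=None):
    """Check the browser cache, then the OS cache; None means 'ask DNS'."""
    for cache in (browser_cache, os_cache):
        ip = cache.get(hostname, now)
        if ip is not None:
            return ip
    return None


browser_cache = HostCache()
os_cache = HostCache()
os_cache.put("www.example.com", "192.0.2.10", ttl_seconds=60, now=0.0)
print(lookup("www.example.com", browser_cache, os_cache, now=1.0))
```

A None result corresponds to the case we assume in our example: both caches miss, and the browser has to fall back to DNS resolution.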

2. DNS LOOKUP

When the browser fails to match the hostname in either its own or the operating system's cache, it sends the hostname off to a resolver server (typically run by your Internet Service Provider) to be resolved into its IP address through the Domain Name System.

For a fantastic, approachable, and fun resource to learn how DNS works, I highly recommend dnsimple’s freely-available comic “How DNS Works.”

For the purposes of this article, simply know that the resolver contacts a root name server, then the top-level domain server (.com, in our case), and finally the domain's authoritative name server before successfully resolving the hostname into its corresponding IP address. When it's all said and done, the browser knows the IP address corresponding to www.google.com. We'll call it 8.8.8.8 (a placeholder for this walkthrough; 8.8.8.8 is in fact Google's public DNS resolver, not its web server).
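
To give a feel for how a resolver walks the DNS hierarchy (root server, then top-level domain server, then the domain's own authoritative name server), here is a toy model built from plain dictionaries. Every server name, domain, and address below is made up for the sketch; real resolution involves network queries, caching, and many record types.

```python
from typing import Optional

# Hypothetical data: each level only knows who to ask next.
ROOT = {"com": "tld-com"}
TLD_SERVERS = {"tld-com": {"example.com": "ns-example"}}
AUTHORITATIVE = {"ns-example": {"www.example.com": "192.0.2.10"}}


def resolve(hostname: str) -> Optional[str]:
    """Resolve a hostname by walking root -> TLD -> authoritative server."""
    name = hostname.rstrip(".")
    labels = name.split(".")
    if len(labels) < 2:
        return None

    # Step 1: the root tells us which server handles the top-level domain.
    tld_server = ROOT.get(labels[-1])
    if tld_server is None:
        return None

    # Step 2: the TLD server tells us which server is authoritative
    # for the registered domain (the last two labels).
    domain = ".".join(labels[-2:])
    auth_server = TLD_SERVERS[tld_server].get(domain)
    if auth_server is None:
        return None

    # Step 3: the authoritative server holds the actual address record.
    return AUTHORITATIVE[auth_server].get(name)


print(resolve("www.example.com"))
```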

3. TCP/IP

Finally, our web browser is ready to go. Having resolved the IP address associated with www.google.com, the browser begins communicating with the corresponding server.

The communication between the browser and server occurs over what is referred to as Transmission Control Protocol/Internet Protocol (TCP/IP). Other protocol suites could in principle be used, but TCP/IP is the de facto standard for web infrastructure.

TCP, the transport-layer protocol, is responsible for establishing the connection between the client and server. TCP is defined by its reliability: lost packets (i.e., pieces of request/response data) are detected and retransmitted, so data arrives complete and in order even if it takes more time. An alternative transport-layer protocol, User Datagram Protocol (UDP), is faster but less reliable because packet delivery is not double-checked. UDP is typical of streaming services where instant content takes priority; TCP is used almost everywhere else.
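
Here is a small, self-contained sketch of a TCP exchange using Python's standard socket module. To stay runnable offline it talks to a throwaway echo server on the loopback address rather than to Google; the function names are my own. Swapping SOCK_STREAM for SOCK_DGRAM is what would select UDP instead.

```python
import socket
import threading


def run_echo_server(server_sock: socket.socket) -> None:
    """Accept one TCP connection and send back whatever arrives."""
    conn, _addr = server_sock.accept()
    with conn:
        conn.sendall(conn.recv(1024))


def tcp_round_trip(message: bytes) -> bytes:
    """Send a message over a loopback TCP connection and return the reply."""
    # SOCK_STREAM selects TCP.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]

        worker = threading.Thread(target=run_echo_server, args=(server,))
        worker.start()

        # create_connection performs the TCP handshake before returning.
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(message)
            reply = client.recv(1024)

        worker.join()
    return reply


print(tcp_round_trip(b"hello"))
```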

4. SSL

Once the TCP connection is open, the first thing the browser sends to the resolved IP address of www.google.com is a message containing the Transport Layer Security (TLS) versions it supports along with a list of supported cipher algorithms and compression methods. TLS is a cryptographic protocol that uses asymmetric cryptography to agree on a shared key and symmetric cryptography to encrypt the data itself, keeping communication private, authenticated, and reliable. It is the successor to what was originally Secure Sockets Layer (SSL). TLS is the standard web cryptography protocol today; the last SSL version was deprecated in 2015, although the name "SSL" is still widely used informally. The "S" in HTTPS simply stands for "Secure".

Upon receiving this initial communication, the server chooses a TLS version and cipher algorithm from those offered and responds with its certificate, which includes the server's public key. Back on the client side, the browser checks the certificate and uses this public key to encrypt a pre-master key that is sent back to the server. (This is the classic RSA key exchange; newer TLS versions derive the shared key through a Diffie-Hellman exchange instead.)

If the public key sent to our browser was authentic, then the server is able to decrypt the pre-master key with its TLS private key. Upon proof of successful decryption, the browser and server have effectively established a trusted connection and symmetric method of sending messages back and forth.

This entire security process is referred to as the TLS handshake and is responsible for that cool green lock displayed in your browser whenever you connect to a website through HTTPS.
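
For a sense of what the client side of this looks like in code, here is a minimal sketch using Python's standard ssl module. The function names are my own, and setting a TLS 1.2 floor is an assumption for the example rather than something this article prescribes. The function that actually connects is defined but not called, since it needs network access.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that verifies server certificates."""
    context = ssl.create_default_context()
    # Assumption for this sketch: refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def negotiated_tls_version(host: str, port: int = 443):
    """Connect, run the TLS handshake, and report the agreed TLS version."""
    context = make_client_context()
    with socket.create_connection((host, port), timeout=5) as raw_sock:
        # wrap_socket performs the handshake and certificate checks.
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            return tls_sock.version()


context = make_client_context()
print(context.check_hostname, context.verify_mode)
```

With network access, calling negotiated_tls_version("www.google.com") would carry out the handshake described above and return the version both sides settled on.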

5. HTTPS

Recall that HTTPS initially came up as the first block parsed from our URL, and represents the protocol of our website request. HTTP stands for HyperText Transfer Protocol, a stateless, asymmetric request-response client-server protocol that runs on top of TCP/IP. Originally written by the inventor of the World Wide Web himself, Tim Berners-Lee, HTTP has persisted as the standard-bearer protocol for web communication. HTTP/1.1 remains widely used alongside the newer HTTP/2 and HTTP/3, and the specifications are maintained by the Internet Engineering Task Force (IETF).

Where TCP/IP defines how data is transported, HTTP defines the format and meaning of the messages computers exchange. For instance, after completing the TLS handshake, the browser sends the server an HTTP request message made up of a request line, headers, and an optional body.

The first line, the request line, defines the type of request the browser makes to the server. There are many types of request messages, such as POST, to submit data to a server, and DELETE, to delete data from one. Our entry of https://www.google.com qualifies as a GET message, which acquires a web resource (web page) for a client from a server.

In the header section of HTTP request messages, the browser can specify details of the request, such as whether the connection to the server should be closed immediately afterwards, or which cookies (persisted session information for a given client) to send along.

The request body is optional and is typically empty for GET requests.

That was a lot at once, and HTTP will come up again later. For now, take away that at this point in our example, the browser has sent a TLS-encrypted HTTP GET request to the resolved IP of www.google.com to retrieve the web page configured at the root of the host server.
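
To show what such a message looks like on the wire, here is a minimal sketch that assembles an HTTP/1.1 GET request by hand. The helper name build_get_request is hypothetical, and a real browser sends many more headers than the two shown.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Assemble a minimal HTTP/1.1 GET request (illustrative helper)."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, version
        f"Host: {host}",          # header: which site we want
        "Connection: close",      # header: close the connection afterwards
        "",                       # blank line ends the header section
        "",                       # empty body
    ]
    # HTTP/1.1 separates lines with a carriage return plus a line feed.
    return "\r\n".join(lines).encode("ascii")


request = build_get_request("www.google.com")
print(request.decode("ascii"))
```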

6. LOAD BALANCER

Throughout these first five steps, I've repeatedly referenced our browser as communicating with the server hosting www.google.com. Now, I was not [intentionally] misleading you, I promise, but the truth is that, up to this point, we haven't quite yet interacted with the server hosting our desired web page. Instead, we've been interacting with an intermediary: the load balancer.

To introduce the importance of load balancing, realize that over four billion people use the internet today. Think about the number of HTTP GET requests sent to a particular website every day, every hour, every minute, and every second, especially for behemoth websites such as Google or Amazon.

Then think about how your personal computer begins to slow down after running just a handful of processes, let alone if it had to manage over 63,000 requests per second.

If you're thinking that there's no possible way a single computer could effectively handle that much traffic, you're absolutely correct. In reality, almost all established websites split up traffic across a multitude of servers. Each is configured to serve requests identically, and by splitting requests among them, traffic is handled much more efficiently.

A load balancer is an intermediary responsible for handling this traffic-splitting work. A load balancer is software that can be configured either on the same server as the one hosting web content or on a server all its own. One common and free load balancer is HAProxy.

HTTP request traffic is split up by a program such as HAProxy according to a load-balancing algorithm. There are various types of load-balancing algorithms, each with its own advantages and disadvantages. One example is round-robin load balancing, which sends requests to servers in turn according to a queue. Another is least connections, which sends a new request to the server currently handling the fewest connections. You can read about more load-balancing algorithms here.
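
The two algorithms just mentioned can be sketched in a few lines each. These classes are illustrative only and are not how HAProxy is implemented; the class names and the server labels "a", "b", "c" are made up for the example.

```python
import itertools


class RoundRobinBalancer:
    """Hand out servers in a fixed rotation, one request at a time."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Send each new request to the server with the fewest open connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        # Ties go to whichever server was listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a connection to the server finishes."""
        self.active[server] -= 1


round_robin = RoundRobinBalancer(["a", "b", "c"])
print([round_robin.pick() for _ in range(4)])

least_conn = LeastConnectionsBalancer(["a", "b", "c"])
print(least_conn.pick(), least_conn.pick())
```

Round robin needs no knowledge of what the servers are doing, while least connections has to track open connections, which is the kind of trade-off that separates these algorithms in practice.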

Backtracking in our example, the resolved IP address of www.google.com was truly the IP address of the load balancer server. The browser completed the TLS handshake with this load balancer server, thus making it the TLS termination proxy. Almost like a post office, this server, which we'll imagine is configured with a round-robin algorithm on HAProxy, was the receiver of our HTTP GET request. HAProxy took the request, pulled up the IP address of the next web server in its queue, and sent it off that way.

So far I've described a setup with just a single load balancer. However, if a website is configured with just one load balancer, and that load balancer server fails, the entire site would be inaccessible. The load balancer would be referred to as a single point of failure.

Ideally, a stable website will be configured with multiple load balancers set up as a transparent cluster. Each load balancer in the cluster always knows the status of its companions, and any one of them can take on a greater share of requests if another goes down.

7. FIREWALL

Now, we are very close to retrieving our web page. But before our GET request is officially, and finally, received by the host server, the message goes through one last security check: a firewall.

Through the TLS handshake, our browser came to an agreement with the load balancer server as to how to encrypt messages as they are passed back and forth. TLS achieves three crucial security purposes — privacy, integrity, and identification — yet it fails to account for a fourth — honesty.

Firewalls are hardware, software, or a combination of both that filter all traffic coming into and out of a server. TLS is effective at preventing data from being intercepted mid-transmission, yet it assumes that the received data is coming from a trusted source. Firewalls make no such assumption and use a combination of packet filters, application gateways, circuit-level gateways, and proxy servers to check that a packet does not carry viruses or other malicious software.

Firewalls are relatively straightforward to install, and are typically configured anywhere data is received, including both load balancer and host servers. One freely available and fantastically named option on Linux is the Uncomplicated Firewall (UFW).
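
Of the techniques listed, the packet filter is the simplest to sketch: walk an ordered list of rules and apply the first one that matches. The Rule class, the decide function, and the rule set below are assumptions for illustration (the addresses come from ranges reserved for documentation); real firewalls such as UFW have far richer rule languages.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    action: str           # "allow" or "deny"
    source: str           # source network in CIDR notation
    port: Optional[int]   # destination port, or None for any port


def decide(rules, source_ip: str, dest_port: int, default: str = "deny") -> str:
    """Return the action of the first matching rule, else the default."""
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        in_network = address in ipaddress.ip_network(rule.source)
        port_matches = rule.port is None or rule.port == dest_port
        if in_network and port_matches:
            return rule.action
    return default


# Hypothetical policy: block one network entirely, allow HTTPS from anyone else.
RULES = [
    Rule("deny", "203.0.113.0/24", None),
    Rule("allow", "0.0.0.0/0", 443),
]

print(decide(RULES, "198.51.100.7", 443))
print(decide(RULES, "203.0.113.9", 443))
```

Rule order matters here: the deny rule sits first so that the blocked network never reaches the broader allow rule.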

Contextualizing firewalls in our example, at this point, our GET request has already passed one firewall, installed on the load balancer. It will next pass another installed on whichever host server it is distributed to.

8. WEB SERVER (HOST/APPLICATION)

Finally, our website is back!!!

Resolved into its IP, transferred over TCP/IP, encrypted by TLS, formatted as HTTP, passed by a firewall, (*huff, huff*) distributed by a load balancer, and passed through another firewall, our initial URL has been received as an HTTP GET request by a server hosting our desired web page.

TL;DR

  • The browser receives the URL https://www.google.com and parses it into its protocol (https), hostname (www.google.com), port (implicitly, 443), and location (implicitly, root /).
  • The browser checks if the hostname has already been resolved in its own or the OS’s cache. If so, the corresponding IP is retrieved right there and then.
  • Otherwise, the hostname is resolved through the Domain Name System.
  • The browser completes a TLS handshake with the load balancer specified at the resolved IP. This communication occurs over TCP/IP.
  • Having established an encrypted connection method, the browser sends the load balancer a GET request for the file located at the root of www.google.com.
  • The GET request is passed through a firewall on the load balancer.
  • The load balancer distributes the GET request to the next available host server, as determined by its configured load balancing algorithm.
  • The GET request is passed through a firewall on the host server.
  • The host server retrieves the file located at its root directory and returns its content, served dynamically by the application and database servers.
  • The browser receives the HTTP response message containing the file content and renders the HTML page to the user.
