Blockchain and Hashing
Lawrence Cummins
Founder & CEO at Black Cactus Holding Pty Ltd. Artificial Intelligence, Quantum Simulation, Blockchain, Data Science and Cloud Computing, Cryptography.
What is Blockchain Hashing
It is important to know how blockchain Hashing works. To do that, however, we need to first understand one of the core principles that go into blockchain creation. Blockchain technology is one of the most innovative and era-defining discoveries of the past century.
To understand how various cryptocurrencies like Ethereum and Bitcoin function. In simple terms, hashing means taking an input string of any length and giving out an output of a fixed length.
In the context of cryptocurrencies like Bitcoin, the transactions are taken as an input and run through a hashing algorithm (Bitcoin uses SHA-256) which gives an output of a fixed length.
Blockchain Hashing the bits, and bytes
You may have heard that all data in computers, is either 0 or 1. The smallest piece of data is a bit, and is either a 0 or 1. Imagine a computer as having many light bulbs, and the bulbs are either on (1) or off (0).
Different pieces of data are represented by the pattern displayed by the bulbs. Large data, such as videos, use many light bulbs. A short email, would use fewer light bulbs. A single light bulb is a bit.
Another term you may have heard of is a byte, which is simply a group of 8 bulbs. A megabyte of data is 1 million bytes, which would be 8 million bulbs.
Today’s home computers have billions and even trillions of light bulbs. But with the power of doubling, notice that even just a group of 256 bulbs is almost enough to represent any of the observable particles in the universe.
Imagine all the patterns that a group of 256 bulbs can produce, and it is an astronomical number: mathematically it’s 2 to the power of 256.
Cryptographic hash functions
A hash function, takes any input, and produces an output of a specific size. The process of applying a hash function to some data, is called hashing.
The output of a hash function is called a hash. The basic feature of a particular hash function is the size of the output it produces.
For the examples, we’ll use a hash function that produces an output of 256 bits (32 bytes). There are hash functions to produce smaller output, and others that produce larger output.
There are also many hash functions that produce an output of 256 bits, but the examples don’t care about which specific one is used.
Using the example hash function, when a video of many megabytes is hashed, the output will be 256 light bulbs with some of the bulbs lighted.
When a short email is hashed, the output will be 256 light bulbs with a different pattern. In some ways, hashing looks like compression.
To briefly explain a difference between the two, hashing always produces the same number of light bulbs, whereas compressing a video of many megabytes will still produce an output of millions of light bulbs. A compressed video, can be decompressed to obtain the original video.
When a video is hashed to only 256 light bulbs, reconstructing the many megabytes of the original video from the hash is unlikely. It may not sound desired, but this behavior is actually a powerful feature of hash functions.
One critical characteristic of a secure cryptographic hash function is that it is one-way. This means that from the output, it is virtually impossible, or mathematically and computationally improbable, to determine what the input is. That is, given a hash, it should be infeasible to learn about or find the input data that was provided to the hash function. The technical term for this is pre-image resistance.
A consequence is that a hash function should consume approximately the same amount of time, whether it is hashing a large or small input. Another desirable outcome is that hashes, patterns of light bulbs produced by hash functions, should appear to be random.
Hashing the data “password1” should produce a very different pattern of light bulbs, than from hashing the data “password2”. Otherwise, if the patterns were similar, an adversary could infer that the inputs are similar, and if either or related words (like “pass”, “word”) are discovered, the passwords could be found easily.
Secure hash functions produce drastically different outputs, even if inputs differ by only a single bit.
The ideal behavior for security is that given a hash, the only way to find the input data is by hashing all combinations of inputs, until the correct input is hashed. If the input is random, finding it would take an indeterminate amount of time, and thus effectively impossible.
While finding the input for a hash should be very difficult and take a very long time, computing a hash should be fast to compute. One can provide a hash function, with a very large input, and still receive the output in less than a second. Given that today’s smartphones perform billions of computations per second, 1 second is a long time computing-wise.
Cryptographic hash functions should also be collision resistant. A collision is when a hash function produces the same output, for more than one input. If hashing data1, which maybe a spreadsheet, and hashing data2, which maybe a picture, produce the same output, then a collision has occurred.The importance of the properties of secure cryptographic hash functions will be more evident as we describe blockchains and hashing.
Blockchains and Hashing
Hashing is extensively used with blockchains and here are some examples. Addresses on a blockchain are derived by a process of hashing public keys. An Ethereum account is computed by hashing a public key with Keccak-256. A Bitcoin address is computed by hashing a public key with SHA2–256 and RIPEMD160.
Collision resistance of the hash functions are important because if 2 people generate the same address (a collision) then either could spend the money sent to that address.
Signatures are a fundamental part of blockchains. Similar to signing a check, cryptographic signatures determine which transactions are valid.
Signatures are generated from a hash of data to be signed, and a private key.Transaction hashes are highly visible in a blockchain. Instead of describing a transaction as “the transaction where Alice sent Bob X units of currency at date D and time T”, transactions are referred to by their hash.
For example:
5c404ed432cb51128bcf09aa5e8a510dd4a1e104ef84bfed1be16dfba1b22070 is a transaction in the Ethereum blockchain. Transaction hashes are also more direct to use, as compared to a description like the “1024th transaction in block 1337”. Just copy the hash, and paste it into a blockchain explorer, to see details of the transaction.
Metaphysically, blocks in a blockchain are identified by their hash, which serves the dual purpose of identification as well as integrity verification. An identification string that also provides its own integrity is called a self-certifying identifier.
For blockchains that use mining, the Proof-of-Work is a number, called a nonce, that when combined with other data and hashed, produces a value smaller than a specified target. Mining makes full use of the properties that hashing is fast one-way, and not reversible.
Finding a valid nonce takes time because there are no clues available that will lead to a sufficiently small hash, and the only approach to find one that is smaller than the target, is to compute many hashes: in Bitcoin, currently that’s over 10 septillion hashes.
When a valid nonce is found, verifying it is done within a second, and then the new block propagates across the network, forming the latest consensus and blockchain.
Since storage in blockchains is permanent, and storing large amounts of data on a blockchain is not economical, the practical way to store data on a blockchain is to store a fixed (and normally smaller) size representation of the data called the “hash of the data.”
One use for a blockchain is as a timestamping service. Suppose there is a picture that you wanted to prove currently exists, and is not fabricated in the future. You could store the picture in the blockchain now, and a year later, if a judge asks if the picture was really taken a year ago, you could show it on the blockchain.
But, since you know about hashes, you hash the picture and store the hash on the blockchain instead. When the judge asks for proof, you provide the picture, then the judge can hash the picture and compare it against the hash that you stored on the blockchain.
There are also more advanced examples where hashing is involved, for example in Merkle trees which are at the root of innovation for blockchains, scalability, and mobile and light wallets.
Hashes for identifying anything securely
Secure cryptographic hash functions are one-way, fast to compute, and collision resistant. Combined with the property that they process any type of input to produce an output of fixed-size, called a hash, hashes are very useful as an identifier for any data.
Hashes of length 256 bits represent an astronomical number of combinations, that they are more than enough to be a globally unique identifier for the Internet of Things, even at the scale of nanotechnology and beyond.
And these hashes can be written as 64 characters (hexadecimal), which make them practical enough to use as identifiers. In blockchains, hashes are used as identifiers for blocks, transactions, and addresses.
Hashes enjoy advantages of security and privacy. If a song is recorded in a digital format, and the hash of the song is stored on a blockchain, there is no way for someone else to claim that they were first to create the song that produced the hash, without knowing the song itself: someone can’t write a song and tamper with its hash.
Similarly, unless the song or other digitized property or data is revealed, it remains private with only the hash displayed on the blockchain. Ownership registries can be stored on the blockchain, with the details of what’s owned remaining private among the parties involved. As a simple example, a vehicle registry could store hashes of car data (pictures, VIN, license plate) and only the owners, insurance company, and government would know the actual details of the vehicle.
So what is hashing
In simple terms, hashing means taking an input string of any length and giving out an output of a fixed length. In the context of cryptocurrencies like Bitcoin, the transactions are taken as an input and run through a hashing algorithm (Bitcoin uses SHA-256) which gives an output of a fixed length.
Thus is how hashing process works. We are going put in certain inputs. For this exercise, we are going to use the SHA-256 (Secure Hashing Algorithm 256. In the case of SHA-256, no matter how big or small your input is, the output will always have a fixed 256-bits length.
This becomes critical when you are dealing with a huge amount of data and transactions. So basically, instead of remembering the input data which could be huge, you can just remember the hash and keep track. Before we go any further we need to first see the various properties of hashing functions and how they get implemented in the blockchain.
Cryptographic hash functions
A cryptographic hash function is a special class of hash functions which has various properties making it ideal for cryptography. There are certain properties that a cryptographic hash function needs to have in order to be considered secure. Let’s run through them one by one.
Property 1: Deterministic
This means that no matter how many times you parse through a particular input through a hash function you will always get the same result. This is critical because if you get different hashes every single time it will be impossible to keep track of the input.
Property 2: Quick
ComputationThe hash function should be capable of returning the hash of an input quickly. If the process isn’t fast enough then the system simply won’t be efficient.
Property 3: Pre-Image Resistance
What pre-image resistance states is that given H(A) it is infeasible to determine A, where A is the input and H(A) is the output hash. Notice the use of the word “infeasible” instead of “impossible”. We already know that it is not impossible to determine the original input from its hash value. Let’s take an example.
Suppose you are rolling a dice and the output is the hash of the number that comes up from the dice. How will you be able to determine what the original number was? It’s simple all that you have to do is to find out the hashes of all numbers from 1-6 and compare.
Since hash functions are deterministic, the hash of a particular input will always be the same, so you can simply compare the hashes and find out the original input.But this only works when the given amount of data is very less.
What happens when you have a huge amount of data? Suppose you are dealing with a 128-bit hash. The only method that you have to find the original input is by using the “brute-force method”. Brute-force method basically means that you have to pick up a random input, hash it and then compare the output with the target hash and repeat until you find a match.
So, what will happen if you use this method
Best case scenario: You get your answer on the first try itself. You will seriously have to be the luckiest person in the world for this to happen. The odds of this happening are astronomical.
Worst case scenario: You get your answer after 2^128 – 1 times. Basically, it means that you will find your answer at the end of all the data.
Average scenario: You will find it somewhere in the middle so basically after 2^128/2 = 2^127 times. To put that into perspective, 2^127 = 1.7 X 10^38. In other words, it is a huge number.
So, while it is possible to break pre-image resistance via brute force method, it takes so long that it doesn’t matter.
Property 4: Small Changes In The Input Changes the Hash.
Even if you make a small change in your input, the changes that will be reflected in the hash will be huge. Let’s test it out using SHA-256:
You see that? Even though you just changed the case of the first alphabet of the input, look at how much that has affected the output hash. This is a critical function because this property of hashing leads to one of the greatest qualities of the blockchain, its immutability (more on that later.)
Property 5: Collision Resistant
Given two different inputs A and B where H(A) and H(B) are their respective hashes, it is infeasible for H(A) to be equal to H(B). What that means is that for the most part, each input will have its own unique hash. Why did we say “for the most part”? Let’s talk about an interesting concept called “The Birthday Paradox”.
What is the Birthday Paradox
If you meet any random stranger out on the streets the chances are very low for both of you to have the same birthday. In fact, assuming that all days of the year have the same likelihood of having a birthday, the chances of another person sharing your birthday is 1/365 which is a 0.27%. In other words, it is really low.
However, having said that, if you gather up 20-30 people in one room, the odds of two people sharing the exact same birthday rises up astronomically. In fact, there is a 50-50 chance for 2 people of sharing the same birthday in this scenario!
Why does that happen? It is because of a simple rule in probability which goes as follows. Suppose you have N different possibilities of an even happening, then you need square root of N random items for them to have a 50% chance of a collision.
So applying this theory for birthdays, you have 365 different possibilities of birthdays, so you just need Sqrt(365), which is ~23~, randomly chosen people for 50% chance of two people sharing birthdays.
What is the application of this in hashing
Suppose you have a 128-bit hash which has 2^128 different possibilities. By using the birthday paradox, you have a 50% chance to break the collision resistance at the sqrt(2^128) = 2^64th instance.
As you can see, it is much easier to break collision resistance than it is to break preimage resistance. No hash function is collision free, but it usually takes so long to find a collision. So, if you are using a function like SHA-256, it is safe to assume that if H(A) = H(B) then A = B.
Property 6: Puzzle Friendly
Now, this is a fascinating property, and the application and impact that this one property has had on cryptocurrency are huge (more on that later when we cover mining and crypto puzzles). First let’s define the property, after that we will go over each term in detail.
For every output “Y”, if k is chosen from a distribution with high min-entropy it is infeasible to find an input x such that H(k|x) = Y. That probably went all over your head! But it’s ok, let’s now understand what that definition means.
What is the meaning of “high min-entropy”.
It means that the distribution from which the value is chosen is hugely distributed so much so that us choosing a random value has negligible probability. Basically, if you were told to chose a number between 1 and 5, that’s a low min-entropy distribution. However, if you were to choose a number between 1 and a gazillion, that is a high min-entropy distribution.
What does “k|x” mean
The “|” denotes concatenation. Concatenation means adding two strings together. Eg. If I were to concatenate “BLUE” and “SKY” together, then the result will be “BLUESKY”.
Suppose you have an output value “Y”. If you choose a random value “k” from a wide distribution, it is infeasible to find a value X such that the hash of the concatenation of k and x will give the output Y.
Once again, notice the word “infeasible”, it is not impossible because people do this all the time. In fact, the whole process of mining works upon this.
Examples of cryptographic hash functions
- MD 5: It produces a 128-bit hash. Collision resistance was broken after ~2^21 hashes.
- SHA 1: Produces a 160-bit hash. Collision resistance broke after ~2^61 hashes.
- SHA 256: Produces a 256-bit hash. This is currently being used by Bitcoin.
- Keccak-256: Produces a 256-bit hash and is currently used by Ethereum.
Hashing and data structures
A data structure is a specialized way of storing data. There are two data structure properties that are critical if you want to understand how a blockchain works.
Pointers
Pointers are variables in programming which stores the address of another variable. Usually normal variables in any programming language stores data.
Eg. int a = 10, means that there is a variable “a” which stores integer values. In this case, it is storing an integer value which is 10. This is a normal variable.
Pointers, however, instead of storing values will store addresses of other variables. Which is why they are called pointers, because they are literally pointing towards the location of other variables.
Linked Lists
A linked list is one of the most important items in data structures. This is what a linked list looks like:
It is a sequence of blocks, each containing data which is linked to the next block via a pointer. The pointer variable, in this case, contains the address of the next node in it and hence the connection is made. The last node, as you can see, has a null pointer which means that it has no value.
One important thing to note here, the pointer inside each block contains the address of the next block. That is how the pointing is achieved. Now you might be asking what does that mean for the first block in the list? Where does the pointer of the first block stay.
The first block is called the “genesis block” and its pointer lies out in the system itself. It sort of looks like this:
This is what the structure of the blockchain is based on. A block chain is basically a linked list. Let’s see what the blockchain structure looks like:
The blockchain is a linked list which contains data and a hash pointer which points to its previous block, hence creating the chain.
What is a hash pointer.
A hash pointer is similar to a pointer, but instead of just containing the address of the previous block it also contains the hash of the data inside the previous block. This one small tweak is what makes blockchains so amazingly reliable and trailblazing.
A hacker attacks block 3 and tries to change the data. Because of the properties of hash functions, a slight change in data will change the hash drastically.
This means that any slight changes made in block 3, will change the hash which is stored in block 2, now that in turn will change the data and the hash of block 2 which will result in changes in block 1 and so on and so forth. This will completely change the chain, which is impossible. This is exactly how blockchains attain immutability.
So what does a block header look like?
A block header contains:
- Version: The block version number.
- Time: the current timestamp.
- The current difficult target. (More on this later).
- Hash of the previous block.
- Nonce (more on this later).
- Hash of the Merkle Root.
Right now, let’s focus on the Hash of the Merkle Root. But before that, we need to understand what a Merkle Tree is.
What is a Merkle Tree
The above diagram shows what a Merkle tree looks like. In a Merkle tree, each non-leaf node is the hash of the values of their child nodes.
Leaf Node
The leaf nodes are the nodes in the lowest tier of the tree. So wrt the diagram above, the leaf nodes will be L1, L2, L3 and L4.
Child Nodes
For a node, the nodes below its tier which are feeding into it are its child nodes. Wrt the diagram, the nodes labeled “Hash 0-0” and “Hash 0-1” are the child nodes of the node labeled “Hash 0”.
Root Node
The single node on the highest tier labeled “Top Hash” is the root node.
So what does a Merkle Tree have to do with blockchains.
Each block contains thousands and thousands of transactions. It will be very time inefficient to store all the data inside each block as a series. Doing so will make finding any particular transaction extremely cumbersome and time-consuming. Now suppose I want to find out whether this particular data belongs in the block or not::
Instead of going through the cumbersome process of looking at each individual hash and seeing whether it belongs to the data or not, I can simply track it down by following the trail of hashes leading up to the data:
Doing this significantly reduces the time taken.
Hashing in mining: The crypto puzzles.
When we say “mining”, it basically means searching for a new block to be added in the blockchain.
Miners from around the world are constantly working to make sure that the chain keeps on growing. Earlier it used to be easy for people to mine using just their laptops, but over time, people started forming mining pools to pool in their computer powers and mine more efficiently.
This, however, could have been a problem. There is a cap for each cryptocurrency, eg. for bitcoin, it is just 21 million. There are only 21 million bitcoins out there. If the miners are allowed to carry on, at this rate, they will fish out all the bitcoins in existence. On top of that, there needs to be a specific time limit in between the creation of each blocks.
For bitcoin, the time limit in between block creation is 10 mins. If the blocks were allowed to be created faster, it would result in:
More collisions: More hash functions will be generated which will inevitably cause more collisions.
More orphaned blocks: If a lot of miners are over mining they will come up with new blocks simultaneously. This will result in or more blocks not getting to be part of the main chain and becoming orphan blocks.
So, in order to restrict block creation, a specific difficulty level is set. Mining is like a game, you solve the puzzle and you get rewards. Setting difficulty makes that puzzle much harder to solve and hence more time-consuming.
WRT bitcoins the difficulty target is a 64-character string (which is the same as a SHA-256 output) which begins with a bunch of zeroes. A number of zeroes increases as the difficulty level increases. The difficulty level changes after every 2016th block.
If you use a Merkle tree, however, you will greatly cut down the time required to find out whether a particular transaction belongs in that block or not.
An example of a Merkle tree:
Now suppose I want to find out whether this particular data belongs in the block or not:
Instead of going through the cumbersome process of looking at each individual hash and seeing whether it belongs to the data or not, I can simply track it down by following the trail of hashes leading up to the data:Doing this significantly reduces the time taken.
The mining process
Note: We will primarily be talking about Bitcoin mining here. When the Bitcoin mining software wants to add a new block to the blockchain, this is the procedure it follows. Whenever a new block arrives, all the contents of the blocks are first hashed. If the hash is lesser than the difficulty target, then it is added to the blockchain and everyone in the community acknowledges the new block.
However, it is not as simple as that. You will have to be extremely lucky to get a new block just like that. This is where the nonce comes in. The nonce is an arbitrary string which is concatenated with the hash of the block. After that this concatenated string is hashed again and compared to the difficulty level. If it is not less than the difficulty level, then the nonce is changed and this keeps on repeating a million times until finally, the requirements are met. When that happens the block is added to the block chain.
So, when it comes to bitcoin mining:
- K = Nonce
- x= the hash of the block
- Y = the difficulty target
The entire process is completely random, there is no thought process behind the selection of the nonces. It is just pure brute-force where the software keep on randomly generating strings till they reach their goal. The entire process follows the Proof Of Work protocol which basically means:
The puzzle solving should be difficult.
Checking the answer should, however, be easy for everyone. This is done to make sure that no underhanded methods were used to solve the problem.
What is hash rate
Hash rate basically means how fast these hashing operations are taking place while mining. A high hash rate means more people and software machines are taking part in the mining process and as a result, the system is running smoothly. If the hash rate is too fast the difficulty level is increased. If the hash rate becomes too slow then the difficulty level is decreased.
Conclusion
Hashing has truly been fundamental in the creation of blockchain technology. If one wants to understand what the blockchain is all about, they should definitely understand what hashing means.
Deeply theoretical but widely practical
Designing cryptographic hash functions require a combination of art and science. To prove their security requires advanced mathematics and computer science. Blockchains are the first user interfaces filled with hashes, available to a broad demographic.
Good user experience will hide many hashes behind the scenes, but just as we see various ID's and serial numbers today, sometimes a hash is the best identifier for something instead of a long-winded description. As technology such as cryptography and the Internet of Things become widespread, expect to see 64 character hashes more in the future!