AAAW #6 - Cloud Storage
Phew! It's been a really busy H1, and the Tipping Point was a great opportunity to take a break from work-work and revisit other soft skills important to personal development. It ran a bit longer than expected - I caught myself falling down the Wikipedia rabbit hole a few times, something I haven't really had the time to do ever since I left school a couple of years back.
That being said, I'm ready to get back to cloud computing! *Spoiler alert - this eventually leads to an AWS Certified Cloud Practitioner exam :)*
___________________
Cloud Storage Basics
Storage has become increasingly necessary over the years, especially with the big boom in data in recent decades. Immersive applications built on technologies like VR and AR (which are close to my heart) tend by nature to be extremely large, thanks to the complex interactions and high-definition textures that are part and parcel of the experience. As we progress, our need for larger, faster, and more secure storage solutions is only going to keep growing.
In the past, this would have been solved by purchasing disk packs or bare-metal systems from the likes of IBM or EMC to expand on-premise storage - but while our demand for storage has grown, the value we derive from each additional gigabyte has fallen. As such, third-party data centres, which can leverage economies of scale to make that storage more economical, are growing in popularity. These range from consumer services like Dropbox and Google Drive to companies providing enterprise-grade storage to the tune of petabytes, with strong security and tight integration with other cloud systems.
There are a few types of storage offerings that we'll be going through - each has its pros and cons:
> Block Storage > Object Storage > File Storage
First up: block storage. We first provision a volume, and then push data onto it for storage. This is essentially how magnetic hard drives work: data is broken into blocks and stored as separate pieces, each with a unique identifier. Storage Area Networks (SANs) place these blocks wherever it is most efficient, decoupled from the user's environment. When the data is requested, the underlying storage system reassembles the blocks and presents them to the user or application. This makes block storage ideal for workloads that need fast, efficient, and reliable data access, such as databases and email services. It also works well as the underlying storage for containers, letting you define and launch them quickly.
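To make this concrete, here's a minimal sketch of provisioning a block volume on AWS EBS and attaching it to an instance, using boto3 (the AWS SDK for Python). The region, availability zone, size, and instance ID are placeholder assumptions, not values from the course.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Provision a 20 GiB gp3 volume - this is the "block device" itself.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=20,
    VolumeType="gp3",
)

# Wait for the volume to become available, then attach it to an existing instance,
# where it shows up as a raw block device (e.g. /dev/sdf) to be formatted and mounted.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # hypothetical instance ID
    Device="/dev/sdf",
)
```

Notice that the volume itself knows nothing about files or folders - the operating system on the instance imposes whatever structure it likes on top of the raw blocks.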
Object storage, on the other hand, is better suited to storing high volumes of many different kinds of data. It's often more popular because it's more flexible and handles unstructured information and varied data types better - in fact, some databases have migrated from block storage to object storage for exactly this flexibility. Instead of a file system, information is stored as objects within an object storage repository. Unlike traditional storage, where data has to be arranged into files and those files organised in a particular structure, object storage imposes no such limitation. Another key difference is that an object's metadata can be customised to carry additional, detailed information about the data it holds, versus the basic attributes available in block or file storage. This makes it a natural fit for big data systems, which typically deal with both structured information (text files, tabular data) and unstructured information (videos, applications, etc.).

Compared to file systems, performance is also better because there are no folders, directories, or hierarchies to manage - especially important with massive quantities of data (though block storage remains the most efficient of the three, since it carries no file or folder structure at all). Do note that objects ideally shouldn't change often, as any modification results in a new object being written. And rather than backing containers directly, object storage tends to work best alongside container management platforms like Kubernetes.
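As an illustration of the flat key space and customisable metadata described above, here's a minimal sketch of writing an object to S3 with boto3. The bucket name, key, and metadata fields are made up for the example.

```python
import boto3

s3 = boto3.client("s3")

# Each object lives against a flat key (no real folders), together with
# user-defined metadata describing it.
with open("session-01.mp4", "rb") as video:
    s3.put_object(
        Bucket="example-training-data",          # hypothetical bucket
        Key="videos/onboarding/session-01.mp4",  # the "/" is just part of the key, not a directory
        Body=video,
        Metadata={"department": "hr", "retention": "3y", "camera": "rig-b"},
    )

# Objects are effectively immutable: "changing" one means writing a whole new
# object (or a new version, if bucket versioning is enabled).
obj = s3.head_object(Bucket="example-training-data", Key="videos/onboarding/session-01.mp4")
print(obj["Metadata"])
```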
Lastly, file storage arranges data in hierarchical file systems, much like what we see on our Windows or Mac machines. Data is accessed via directory trees, folders, and individual files - something most computer users are already intimately familiar with. One benefit is that multiple users can read and write simultaneously without worrying about overwriting each other's data. Security is also much easier, as roles and rights can be assigned to each part of the hierarchy. For easily organised, structured data, this works fine. However, a major drawback is that access to any piece of data is constrained to a single path, making file retrieval difficult to scale. Additionally, only common file-level protocols such as SMB/CIFS (Windows) or NFS (Linux) may be supported, limiting usability across different systems. At scale, these inefficiencies can add up to significant cost over the years.
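And for comparison, a small local sketch of the file-storage model - data sits at a single hierarchical path, and rights can be assigned per directory. The mount point below is a hypothetical network share, not anything from the course.

```python
import os
import stat
from pathlib import Path

root = Path("/mnt/shared")            # hypothetical mounted file share (NFS, SMB, etc.)
finance = root / "finance" / "2021"
finance.mkdir(parents=True, exist_ok=True)

# Data is written to, and later retrieved from, exactly one path in the tree.
report = finance / "q1-report.csv"
report.write_text("region,revenue\napac,120000\n")

# Rights can be applied per directory - here the finance folder is restricted
# to its owner only.
os.chmod(finance, stat.S_IRWXU)

print(report.read_text())
```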
___________________
Cloud Storage Planning
When planning for cloud storage, we can consider this sequence of steps:
> Obtain business case and funding (conception) > Breadth Analysis (defining) > Modernization (developing) > Migration (testing) > Operate & improve (operating)
Conception: working out how the storage system will bring value to the business, and analysing whether there is more value in cloud systems or on-premise systems (hint: more often than not, it's the former)
Pricing is often a huge factor in the shift: with on-premise systems, storage requirements have to be forecast ahead of demand to ensure the hardware is ready when the time comes. With cloud computing, resources are elastic and can easily be scaled up and down. On top of this, cloud providers are often able to leverage economies of scale, pricing storage extremely competitively as their data centres consolidate maintenance, governance, security requirements, and so on. That being said, users must know how to leverage their cloud systems effectively to extract the most value.
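A back-of-the-envelope sketch of that sizing-for-peak versus pay-for-what-you-use difference is below. Every number in it is an invented, illustrative assumption rather than a quoted price, but it shows the shape of the comparison.

```python
# Forecast storage use over four years, in TB.
forecast_tb = [40, 55, 75, 100]

# On-premise: buy hardware up front, sized for the year-4 peak, plus yearly running costs.
onprem_capex = 120_000                 # assumed hardware spend
onprem_opex_per_year = 15_000          # assumed power, space, maintenance, admin time
onprem_total = onprem_capex + onprem_opex_per_year * len(forecast_tb)

# Cloud: pay only for what is actually stored, at an assumed blended rate.
cloud_price_per_gb_month = 0.02
cloud_total = sum(tb * 1_000 * cloud_price_per_gb_month * 12 for tb in forecast_tb)

print(f"On-premise (sized for peak): ${onprem_total:,.0f}")
print(f"Cloud (pay for actual use):  ${cloud_total:,.0f}")
```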
Defining: determining the scope of the projects, value-add, and applications used
The key is understanding the business - what problems are being solved, what's on the roadmap, and so on. This will shape the capacity roadmap, system types, the kinds of applications being used, and more.
It's also important to analyse the data you're storing. As covered in the previous chapter, different file formats will affect which type of storage solution is best used. Beyond formats, different types of files may carry different legal or compliance requirements that need to be considered during deployment. Different departments will also probably have vastly different storage requirements, with different patterns of use (number of users, applications, memory consumption, etc.). Hence, when considering the full deployment of the project, try to leverage monitoring tools to better understand existing consumption patterns, which will help you forecast how those patterns will grow.
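For instance, S3 publishes daily bucket-size metrics to CloudWatch, which can be pulled with boto3 to see how consumption has been trending - the raw input for a capacity forecast. The bucket name here is a placeholder.

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="BucketSizeBytes",
    Dimensions=[
        {"Name": "BucketName", "Value": "example-training-data"},  # hypothetical bucket
        {"Name": "StorageType", "Value": "StandardStorage"},
    ],
    StartTime=datetime.utcnow() - timedelta(days=90),
    EndTime=datetime.utcnow(),
    Period=86_400,           # one data point per day
    Statistics=["Average"],
)

# Print the last 90 days of bucket size, oldest first, to eyeball the growth trend.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), round(point["Average"] / 1e9, 1), "GB")
```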
Developing: what platforms and what capabilities to leverage to get the best price and value
Different cloud providers have different advantages and disadvantages - there are differing native solutions, different rates, different tools at your disposal. It may be necessary to create a scorecard to analyse which cloud provider is better after considering the various lifecycle costs.
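A toy version of such a scorecard might look like the sketch below - the criteria, weights, and scores are all invented for illustration, not an assessment of any real provider.

```python
# Weight each criterion by how much it matters to the business (weights sum to 1).
weights = {"storage_cost": 0.35, "native_tooling": 0.25, "migration_effort": 0.20, "compliance": 0.20}

# Score each provider from 1 (poor) to 5 (excellent) on every criterion.
scores = {
    "Provider A": {"storage_cost": 4, "native_tooling": 5, "migration_effort": 3, "compliance": 4},
    "Provider B": {"storage_cost": 5, "native_tooling": 3, "migration_effort": 4, "compliance": 4},
}

for provider, s in scores.items():
    total = sum(weights[criterion] * s[criterion] for criterion in weights)
    print(f"{provider}: {total:.2f} / 5")
```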
Testing: migrating the data (and of course, testing it)
Operating: Implementing management, monitoring, security, governance, etc - and ensuring that the system is constantly improving
The second chapter mainly walks through the procedures for setting up and removing a storage instance on AWS (highlighting S3 and EBS, as well as good hygiene like detaching volumes you no longer use), importing your data onto that instance (via the Internet, cloud gateways, or appliance-based services like AWS Snowball and GCP's Transfer Appliance), setting up security (via security groups), and cost management (who's using and paying for what).
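As a rough sketch of that hygiene and cost-visibility piece with boto3 - the volume ID and tag values are placeholders - here's tagging a volume for cost attribution, then detaching and deleting it once it's no longer needed:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
volume_id = "vol-0123456789abcdef0"    # hypothetical volume that is no longer needed

# Tags like this make it easy to see which team is using (and paying for) what.
ec2.create_tags(Resources=[volume_id], Tags=[{"Key": "cost-centre", "Value": "immersive-apps"}])

# Detach the volume, wait for it to become available, then delete it so it
# stops incurring charges.
ec2.detach_volume(VolumeId=volume_id)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
ec2.delete_volume(VolumeId=volume_id)
```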
The third chapter goes into greater detail for planning, and I have combined it into the material above.
The last chapter covers use-cases, and similarly has been integrated into the content above.
For additional reading, I recommend IBM's writeups here (for object storage, but the others are all linked from within): https://www.ibm.com/cloud/learn/object-storage