Data Nugget April 2024

29 April 2024

Welcome to the April edition of our Data Nugget, where we explore the dynamic and crucial world of data management. From breakthroughs in data integration and storage solutions to advancements in data security and privacy, our stories this month provide a panoramic view of how these developments are reshaping industries.

First, we have a brief review of the demand for data. Second, we have an insightful nugget on how to navigate the web of cloud computing. And then, we have the next podcast on scientific data management.

Enjoy the read!

Let's grow Data Nugget together. Forward it to a friend. They can sign up here to get a fresh version of Data Nugget on the last day of every month.


Demand for data?

Nugget by Isa Oxenaar

AI promises to accelerate the demand for data. “AI systems and the large language models that power them need lots of data for training and operations and that’s fueling demand for platforms and tools to collect and manage all that data.”

While data centers have to meet storage needs on one end, companies have to meet the need for data on the other. The volume of data created is predicted to reach between 175 and 180 zettabytes in 2025, and an equal amount of storage capacity has to be provided by data centers around the world. Beyond the sheer volume of data, its dispersion – structured, semi-structured and unstructured – across many locations in cloud and on-premises systems adds to the data management challenges. The data center market has grown rapidly, by about 240% in London between 2016 and 2023. “Advanced graphics processing units (GPUs), which have the capability to process the large amounts of data needed for training and applying AI models, depend upon high-performance, secure, and stable data centre environments”.

The Big Data 100 and the AI 100, provided by CRN, are annual lists of recently released big data products from leading vendors. The lists can give some useful tips to operators of big data practices, and the newer AI 100 list might also give some insights into streamlining the demand for data. There appears to be only one natural boundary to the number of zettabytes: the point at which electricity sources can no longer meet the data centers' needs.


Navigating the Web of Cloud Computing

Nugget by Gaurav Sood

We are in the third decade of the 21st century, and cloud computing is already the norm. Across every industry and every corner of the globe, people are either on the cloud with their platforms or working towards it. Every day we see new applications being developed on the vast infrastructure provided by the cloud. It has opened tremendous opportunities for everyone, from small-scale businesses to large enterprises. But this has come at the cost of too much development in too short a time: a vast variety of cloud tools and infrastructure is scattered across the market, making it difficult to navigate to the right tool.

Cloud computing refers to the delivery of computing services — both software and hardware, including servers, storage, databases, networking, and more — over the Internet (the cloud). Instead of relying on physical servers or personal devices to store and process data, users can access resources and services hosted remotely by third-party providers for a fee.

Service Models

Users can get started with cloud services through whichever of the following service models suits their needs.

  • Infrastructure as a Service (IaaS): IaaS provides virtualized computing resources over the Internet, allowing users to rent servers, storage, and networking infrastructure on a pay-as-you-go basis. This means that instead of purchasing and maintaining physical hardware, users can leverage virtualized resources provided by the cloud service provider. Examples include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). A minimal provisioning sketch follows this list.
  • Platform as a Service (PaaS): PaaS offers a platform allowing customers to develop, run, and manage applications without dealing with the underlying infrastructure. With PaaS, developers can focus on building and deploying applications without worrying about hardware provisioning, software updates, or scaling. Popular PaaS providers include Heroku, Google App Engine, and Microsoft Azure App Service.
  • Software as a Service (SaaS): SaaS delivers software applications over the Internet on a subscription basis, eliminating the need for users to install, maintain, or update the software locally. This model allows users to access applications from any device with an internet connection, making it convenient and flexible. Common examples of SaaS include Gmail, Microsoft Office 365, and Salesforce.
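
To make the IaaS model concrete, here is a minimal sketch of renting and releasing a virtual server programmatically, assuming AWS and its boto3 Python SDK with credentials already configured locally; the region and AMI ID are placeholders, not recommendations.

```python
import boto3

# Connect to the EC2 (virtual server) service; the region is a placeholder.
ec2 = boto3.client("ec2", region_name="eu-north-1")

# Rent one small virtual machine; the ImageId is a hypothetical AMI ID
# you would replace with a real machine image in your account's region.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched instance {instance_id}")

# Pay-as-you-go: terminating the instance stops the billing.
ec2.terminate_instances(InstanceIds=[instance_id])
```

The point of the sketch is the economics, not the syntax: capacity that once required buying and racking hardware becomes a single, reversible API call.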

Deployment Models

With multiple players now in the cloud market, and with privacy laws to consider, cloud infrastructure can be used in whichever of the following deployment models suits your needs.

  • Public Cloud: Services provided by third-party vendors (e.g., Amazon Web Services, Google Cloud, Microsoft Azure) accessible to the public.
  • Private Cloud: Infrastructure dedicated to a single organization, often hosted on-premises or in a data center.
  • Hybrid Cloud: Combines elements of both public and private clouds for seamless data and application movement.

Getting Started with Cloud Computing

Here are some steps to get you started in the field of cloud computing:

  • Set clear objectives: Determine your goals and requirements for adopting cloud computing.
  • Choose the right cloud service provider: Research different cloud providers and their offerings to find the best fit for your needs. Consider factors such as pricing, performance, security, and customer support when making your decision.
  • Start small: Begin by migrating non-critical workloads or experimenting with cloud-based services on a small scale (see the sketch after this list). This will allow you to gain hands-on experience with the cloud while minimizing risk.
  • Educate yourself: Take advantage of online resources, tutorials, and training courses to deepen your understanding of cloud computing concepts and best practices.
  • Stay updated: Cloud computing is a rapidly evolving field, with new technologies and features being introduced regularly. Stay informed about the latest trends and developments to ensure you’re making the most of the cloud’s capabilities.
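
As a concrete example of starting small, here is a minimal sketch of a first low-risk experiment, assuming AWS's boto3 Python SDK with credentials configured locally; the bucket name, region, and file are all hypothetical.

```python
import boto3

# Connect to S3 (object storage); the region is a placeholder.
s3 = boto3.client("s3", region_name="eu-north-1")
bucket = "my-first-cloud-bucket"  # hypothetical; bucket names must be globally unique

# Create a bucket and upload one small, non-critical file.
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "eu-north-1"},
)
s3.upload_file("report.csv", bucket, "experiments/report.csv")

# List the stored objects to confirm the upload worked.
for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", []):
    print(obj["Key"], obj["Size"])
```

An experiment like this costs close to nothing, touches no production systems, and exercises the core loop of any migration: move data in, verify it, and know how to clean it up.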

Benefits of Cloud Computing for Beginners

Now that we have a foundational understanding of cloud computing, let’s explore some of the key benefits it offers to beginners:

  • Scalability and Flexibility
  • Cost-Efficiency
  • Accessibility and Collaboration
  • Security and Reliability

Challenges of Cloud Computing for Beginners

While cloud computing offers numerous benefits, beginners may encounter some challenges along the way. These challenges include:

  • Security Concerns
  • Compliance and Regulatory Issues
  • Vendor Lock-In
  • Performance and Reliability

Conclusion

Cloud computing holds immense potential for beginners looking to streamline operations, enhance collaboration, and drive innovation in today’s digital world. By understanding the basics of cloud computing, leveraging its benefits, and following best practices, beginners can navigate the cloud with confidence and unlock new opportunities for growth and success.


MetaDAMA 2#18: Scientific Data Management

Nugget by Winfried Adalbert Etzel

How can we consolidate data and describe it in a standardized way?

Scientific data management has some unique challenges but also provides multiple learnings for other sectors. We focused on Data Storage and Operations as a knowledge area in DMBoK - a topic that is often viewed as basic and kept out of focus, yet is a fundamental part of data operations.

I talked to Nicolai Hermann Jørgensen at NMBU - Norwegian University of Life Sciences. Nicolai has a diverse background; his journey in data started in 1983. In his free time, Nicolai enjoys photography and AI for text-to-image generation.

Here are my key takeaways:

Scientific Data Management

  • To describe data in a unified way, we need standards, like Dublin Core or Darwin Core for scientific data (a minimal metadata sketch follows this list).
  • Data is an embedded part of Science and Research - you cannot have those without data.
  • You need to make sure you collect the right data, the right amount of data, valid data, and so on.
  • You need to optimize your amount of time, energy and expenses when collecting and validating data.
  • You need to standardize the way you collect data, to ensure that it can be verified.
  • There needs to be an audit trail (lineage) between the data you have collected and the result presented in a publication.
  • Data needs to be freely available for research and testing hypotheses.
  • Data needs to be findable, accessible and interoperable, but also reusable.
  • ML algorithms can help extract and find changes in scientific data that is internationally available.
  • Describing data is key to tapping into knowledge - for that you need metadata.
  • In times of AI and ML, metadata is still the key to uncovering data.
  • The development of AI models is a race - maybe we need to pause and get a better picture of cause and effect, and most of all risk.
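
To make the first takeaway concrete, here is a minimal sketch of describing a dataset with Dublin Core elements, built with Python's standard library. Dublin Core and its namespace URI are real; the dataset values are invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# The Dublin Core element set namespace; register it so tags serialize as dc:...
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("metadata")
# A few of the fifteen Dublin Core elements; the values are made up.
for element, value in [
    ("title", "Soil samples, field station X, 2023"),
    ("creator", "Example Research Group"),
    ("date", "2023-08-14"),
    ("format", "text/csv"),
    ("identifier", "https://doi.org/10.0000/example"),
]:
    child = ET.SubElement(record, f"{{{DC}}}{element}")
    child.text = value

# XML like this is readable by both machines and humans, as noted above.
print(ET.tostring(record, encoding="unicode"))
```

Because every record uses the same element names, a harvester can index thousands of datasets from different institutions without knowing anything about their contents.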

Standardizing Infrastructure

  • How can we standardize the infrastructure for research projects? Minimize or get rid of volatile data storage and infrastructure; standardize data storage solutions; secure what needs to be secured; split out sensitive or classified data and store it separately (e.g., personal data); train your end users and educate data stewards.
  • Have good guidelines for researchers on how to store, use and manipulate data.
  • There is a direct correlation between disk-space use and sustainability.
  • “Storage is cheap” is a correct saying if you look at it in isolation - but in the bigger picture the cost is just moved.
  • Just adding more storage doesn’t solve your problems; it might just increase them.

Long-term Preservation & Integrity

  • To preserve data for the long term you need to: encapsulate data at a certain level; standardize the way you describe the data; upload the data package to a common, governed platform; check whether there is a government body that can take responsibility for preserving your data for the time necessary; and ensure that metadata is machine-readable. Formats like XML make the data readable by both machines and humans.
  • Research integrity means conducting research in a way that allows others to have trust and confidence in the methods used and the findings that result.
  • Ensure lineage and audit trails for your scientific data (a minimal checksum sketch follows this list).
  • Fake data and data fabrication are serious issues in research - keeping data integrity at the highest possible level is not getting easier, but it is increasingly important.
  • Changes to data (change logs, change data capture, etc.) can be studied as well; you can build models to build scenarios around data changes.
  • You can fetch data from other sources to enrich the quality of your data.
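
One simple building block for lineage and audit trails is a cryptographic checksum: record a fingerprint of each dataset when it is collected and verify it again before publication. A minimal sketch in Python, where the data file name is hypothetical:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record this fingerprint in the audit trail at collection time; any later
# change to the file, accidental or fabricated, produces a different digest.
print(sha256_of("measurements.csv"))
```

A digest does not prove the data is correct, only that it has not changed since it was fingerprinted - which is exactly the property an audit trail needs.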

You can listen to the podcast here or on any of the common streaming services (Apple Podcasts, Google Podcasts, Spotify, etc.). Note: The podcasts in our monthly newsletters are behind the actual airtime of the MetaDAMA podcast series.


Thank you for reading this edition of Data Nugget. We hope you liked it.

Data Nugget is delivered with vision, zeal and courage by the editors and collaborators.

You can visit our website here, or write us at [email protected].

I would love to hear your feedback and ideas.

Nazia Qureshi

Data Nugget Head Editor


