The Ten Big Data Commandments

I. Thou Shalt Mind (Know) Thy Business

  • Data is the key driver for insights. Ingest all data raw and apply strong data wrangling practices to retain all data in a physical data model that closely resembles the sources.
  • Understand the business domain. In order to obtain meaningful insights from data, all related data must be joined/correlated, aggregated/consolidated. Build a robust logical model of data.
  • Extract business KPIs. Different business applications see the same data differently; perception matters. Model the data to bring out varying flavours in order to obtain actionable business insights (a minimal layered-modelling sketch follows this list).
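
As a minimal sketch of that layering, assuming a BigQuery warehouse with hypothetical `raw`, `logical` and `kpi` datasets and made-up source tables, the physical layer stays close to the sources while views build the logical model and the KPIs on top:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

# Logical layer: join/correlate raw sources into a business-shaped model.
client.query("""
CREATE OR REPLACE VIEW `my_project.logical.customer_orders` AS
SELECT o.order_id, o.customer_id, c.segment, o.amount, o.order_ts
FROM `my_project.raw.orders` o
JOIN `my_project.raw.customers` c USING (customer_id)
""").result()

# KPI layer: aggregate the logical model into actionable business measures.
client.query("""
CREATE OR REPLACE VIEW `my_project.kpi.daily_revenue_by_segment` AS
SELECT DATE(order_ts) AS day, segment, SUM(amount) AS revenue
FROM `my_project.logical.customer_orders`
GROUP BY day, segment
""").result()
```

Because each layer is just a view over the one below, the raw data is retained untouched while every consumer reads a model shaped for their question.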

II. Thou Shalt Feel Thy User’s Pain

  • Understand the relevance of consumers. Not all data is applicable to every consumer or business case. Curate the data to specific applications and consumers in a consumption layer, enrich and expand data through derived & calculated attributes.
  • Study and playback user behaviour. Model the data to tell a user story. Call out key business insights by quantifying KPIs through data thus modelled.
  • Tell the cream from the sludge. Not all data identifiers add equal value to every business story. Apply blocking and segmentation to prioritise insights for maximum business value. Leverage the power of DB views to control exposure and row-level access (see the curated-view sketch after this list).
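
A minimal sketch of a consumption-layer view with a derived attribute and a row-level filter, assuming hypothetical `logical` and `consumption` datasets and a `region` column; a row access policy on the underlying table could achieve the same restriction without a view:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Curated consumption view: expose only the columns this consumer needs,
# add a derived attribute, and restrict rows to the consumer's segment.
client.query("""
CREATE OR REPLACE VIEW `my_project.consumption.emea_marketing_orders` AS
SELECT
  order_id,
  customer_id,
  amount,
  amount * 0.1 AS estimated_margin        -- derived/calculated attribute
FROM `my_project.logical.customer_orders`
WHERE region = 'EMEA'                      -- row-level exposure control
""").result()
```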

III. Thou Shalt Speak Thy User’s Language

  • Identify the target audience. The way an application backend consumes data differs from how data analysts and scientists consume it. Tailor the consumable data model (interface) to the target audience (consumer).
  • Harness the power of DSL. Humans use language to communicate effectively and so do high-performing teams. Align with the team’s lingo to describe data and build a terminology that resonates with the consumers, their culture and language. Adopt a domain-specific language.
  • Communicate like the consumers do. The way a marketing executive would present insights differs from how a data scientist would visualise them. Tailor the presentation to the target audience/team. Human needs, such as readable and memorisable key attributes, differ from those of application backends. Think human-centred design.

IV. Thou Shalt Embrace Evolution

  • Flexible design for plug-n-play analytics. Data is just the beginning. As insights speak for themselves, more sources and consumers will come thundering down to the platform. Design for expansion (data pipelines, model layers, consumers, etc.) through simple configuration (a configuration-driven sketch follows this list). Plan for stackable pipelines, model layers and transformations.
  • Isolation of storage, compute and analytics. Segment the platform so that its stages/components can securely scale independently. A consumer of data doesn’t need to (and shouldn’t) access the raw data and/or computational modules. The less you expose, the easier it is to manage and scale. Decouple for resilience.
  • Scalability & optimisation through serverless. Big Data grows exponentially, and so do its sources and consumers. A data platform is only as useful as its ability to support that evolution. Embrace fully managed products & services to leverage the scalability, resilience and cost effectiveness of the underlying cloud platform. Design everything to be blown up quickly, on demand, and then shrunk back.
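
A minimal sketch of configuration-driven ingestion, assuming hypothetical bucket paths and `raw` tables; onboarding a new source becomes a new config entry rather than a new pipeline:

```python
from google.cloud import bigquery

# Adding a new source means adding a config entry, not writing new code.
# Bucket paths, table names and formats below are illustrative only.
PIPELINES = [
    {"uri": "gs://my-bucket/orders/*.json", "table": "raw.orders",
     "format": "NEWLINE_DELIMITED_JSON"},
    {"uri": "gs://my-bucket/customers/*.csv", "table": "raw.customers",
     "format": "CSV"},
]

client = bigquery.Client()
for p in PIPELINES:
    job_config = bigquery.LoadJobConfig(
        source_format=p["format"],
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )
    # Each load job appends the source as-is into its raw landing table.
    client.load_table_from_uri(p["uri"], p["table"], job_config=job_config).result()
```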

V. Thou Shalt Hear The Unsaid

  • Plan for data freshness & latency. It is not enough to just get the right data. While data analysts and scientists need to be able to query quickly and easily, application backends expect the APIs to scale and cope with a barrage of requests. An end user expects data to be fresh (how fresh is fresh enough?) while a dashboard expects APIs to be super-low latency (how fast is fast enough?). Plan for super-demanding non-functional requirements.
  • Plan for explosion of data volume & throughput. Speaking of data, plan for relentless, compounding, never-ending growth. As more and more data gets ingested, design for queries and access to scale in a linear fashion. Consumers onboard more data expecting to access all of that extra data just as quickly and efficiently. Watch out for the non-functional requirements hidden in plain sight (a partitioning and clustering sketch follows this list).
  • Plan for operational excellence. As the data platform grows, its data pipelines, model layers and consumer interfaces grow in number, size and complexity. A robust operational model built on strong SRE principles is indispensable in order to support sustainable growth and the ability to honour agreed SLAs.
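
One way to keep query cost and latency roughly proportional to the data actually read, rather than to total volume, is partitioning and clustering; a minimal sketch with a hypothetical events table:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Partition by day and cluster by the most selective filter column so that
# typical queries scan only the partitions and blocks they need, even as the
# total volume keeps compounding (table and columns are illustrative).
client.query("""
CREATE TABLE IF NOT EXISTS `my_project.raw.events`
(
  event_id STRING,
  customer_id STRING,
  event_ts TIMESTAMP,
  payload STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id
""").result()
```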

VI. Thou Shalt Be Generous

  • Generalise platform capability. As a data platform excels and begins to hit the headlines, more and more consumers flock to take advantage of the invaluable insights available. Plan for generic interfaces to be re-shared. Embrace an Open Data mentality to democratise data through open sourced interfaces and common standards.
  • Democratise Data. As more and more consumers line up to consume the outcomes from the data platform, the need for a common language becomes critical. Embrace popular languages and formats such as SQL and JSON to make the data available to the masses (a minimal JSON-over-HTTP sketch follows this list).
  • Promote data-market and prosumerism - proliferation & convergence of producers & consumers. Consumers do weird and beautiful things with data that the data platform product team can’t imagine upfront. Almost always, consumer behaviour shapes the future of a product or service beyond the first iteration. Outcomes produced by one user are often exactly what the other user wants, and the effort of development can be greatly optimised by adopting a marketplace mentality. Consider a data mesh ecosystem where data can be traded.
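
A minimal sketch of a consumer-facing JSON interface in the style of an HTTP Cloud Function, reusing the hypothetical KPI view from earlier; a real deployment would typically sit behind a gateway such as Cloud Endpoints or Apigee:

```python
import json
from google.cloud import bigquery

client = bigquery.Client()

def daily_revenue(request):
    """HTTP handler returning a KPI view as JSON (view name is illustrative)."""
    segment = request.args.get("segment", "ALL")
    rows = client.query(
        "SELECT day, segment, revenue "
        "FROM `my_project.kpi.daily_revenue_by_segment` "
        "WHERE segment = @segment OR @segment = 'ALL'",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("segment", "STRING", segment)
            ]
        ),
    ).result()
    payload = [
        {"day": str(r["day"]), "segment": r["segment"], "revenue": float(r["revenue"])}
        for r in rows
    ]
    return json.dumps(payload), 200, {"Content-Type": "application/json"}
```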

VII. Thou Shalt Ensure Thy Data Integrity

  • Ingest data raw and maintain history. Nothing goes as planned - systems fail, networks collapse, components are buggy, data gets corrupted. Everything that is built by humans is prone to errors and slippages. Ensure strong data lineage to identify every bit of information and trace its path back to source for audit.
  • Track, scan and sanitise data. The most important consideration when dealing with data is to never trust data that has been handed out. Incorporate strong validations in the backend and in every component that processes data. Speaking of data, there’s no such thing as too many validations. Reconcile all data against the source.
  • Support rectification, bootstrapping and replay. In the software world, everything that has been developed by humans can and does fail eventually. Data pipelines crash, events/messages get missed, notifications get lost and signals get overwritten. There are frequent needs to rerun all or part of a data processing activity, bootstrap a new system from existing data, or replay past events & messages in order to fix inconsistencies. Plan for idempotency and repeatability (an idempotent-merge sketch follows this list).
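
A minimal sketch of idempotent replay using a MERGE, assuming hypothetical `curated` and `staging` tables keyed by `order_id`; rerunning the same batch converges to the same final state instead of inserting duplicates:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Idempotent upsert: replaying a batch of events (after a pipeline crash or a
# bootstrap from history) updates existing rows and inserts only missing ones.
client.query("""
MERGE `my_project.curated.orders` t
USING `my_project.staging.orders_replay` s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET amount = s.amount, status = s.status, updated_ts = s.event_ts
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, status, updated_ts)
  VALUES (s.order_id, s.amount, s.status, s.event_ts)
""").result()
```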

VIII. Thou Shalt Not Expose Thy Data

  • Encryption, Segregation & handling of PII. It is needless to mention that one man’s data is another man’s leverage. All sorts of sensitive information exist in data and must be secured and protected against unauthorised access at all times. Apply strong data security practices of segmentation, cryptography and de-identification.
  • Access groups, roles and personas. While not everyone needs access to all the data, it is also not scalable to map each consumer to the correct data individually. One-to-one mapping between consumers and insights or sources does not scale and leaves behind a web of dependencies. Streamline all access controls through robust IAM practices that can be automated and audited. Group consumers by their roles and purpose (a group-based access sketch follows this list).
  • Segmentation, Isolation and access limitation. The data that can’t be seen can’t be exploited either. Segment data by business lines, consumers and their duties. Put hurdles in place to isolate caches of data to discourage accidental breaches. Provide consumers with exactly one path to data and govern it through automation. For your eyes only - take this literally & seriously.
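
A minimal sketch of group-based access plus an authorised view, with hypothetical dataset, view and group names; the analyst group reads only the curated layer, and the curated view is authorised to read the raw data on their behalf:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Group-based access: analysts read the curated dataset, never the raw one.
curated = client.get_dataset("my_project.curated")
entries = list(curated.access_entries)
entries.append(bigquery.AccessEntry("READER", "groupByEmail", "analysts@example.com"))
curated.access_entries = entries
client.update_dataset(curated, ["access_entries"])

# Authorised view: the curated view may read raw data on the group's behalf
# without the group being granted any access to the raw dataset itself.
raw = client.get_dataset("my_project.raw")
view = client.get_table("my_project.curated.customer_orders")
raw_entries = list(raw.access_entries)
raw_entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
raw.access_entries = raw_entries
client.update_dataset(raw, ["access_entries"])
```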

IX. Thou Shalt Be The All Seeing Eye

  • Tag and track all data packets, messages and records. While the platform serves all, it must also keep an eye on all happenings. Mishaps can be averted only if they can be sensed and studied. Intercept and record all events - incoming data, outgoing data, all data manipulations, schema changes, version upgrades, everything. Plan to be able to explain the entire timeline of any piece of data in the system (a structured audit-logging sketch follows this list).
  • Monitor all data operations, and errors. A robust data platform is not only able to identify issues with data, but it is also able to trace the genesis of the problem down to the faulty component/process/source. All data discrepancies must be traceable to the exact version of the component that generated the discrepancy.
  • Log all access requests. A data platform must be able to explain all data access requests. Unauthorised access requests are often disguised as legitimate and can happen at any time. It must be possible to attribute all data access requests to individuals/systems and build a timeline of activities.
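
A minimal sketch of structured audit logging with Cloud Logging, with made-up logger and field names; every data manipulation is recorded with enough metadata to rebuild its timeline and trace discrepancies to an exact pipeline version:

```python
from datetime import datetime, timezone
from google.cloud import logging as cloud_logging

log_client = cloud_logging.Client()
audit_log = log_client.logger("data-platform-audit")  # illustrative log name

def record_event(operation, table, actor, pipeline, version, row_count):
    """Record one data manipulation so its full timeline can be reconstructed."""
    audit_log.log_struct({
        "operation": operation,          # e.g. "LOAD", "TRANSFORM", "EXPORT"
        "table": table,
        "actor": actor,                  # user or service account
        "pipeline": pipeline,
        "pipeline_version": version,     # trace discrepancies to the exact build
        "row_count": row_count,
        "event_ts": datetime.now(timezone.utc).isoformat(),
    })

record_event("TRANSFORM", "curated.orders",
             "etl-sa@my_project.iam.gserviceaccount.com",
             "orders_daily", "1.4.2", 120_000)
```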

X. Thou Shalt Share and Care

  • Enable data classification, cataloguing, and discovery. Make all the awesome data easy to discover and consume. If a consumer has to wade through a zoo of data to understand which sets are useful, the platform has lost an opportunity to shine. Plan for a data catalogue coupled with a self-service model to make data discoverable intuitively and effortlessly.
  • Plan for Internationalisation. Consumers come in all forms, and as data is democratised, it is only as useful as it is easily understood. Plan for standardisation through internationalisation practices such as locales, languages, dialects and geographical attribution. Make it easier for users to customise data for their use.
  • Plan for Compliance. There’s always a need to record, archive and audit when dealing with large amounts of data and a large user base at the same time. Plan for staged data backup, archival and audit requirements (a lifecycle-management sketch follows this list). Consider securing logs for extended periods of time and making them securely available on demand to interested auditors. Plan for regulatory governance and compliance.
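
A minimal sketch of staged archival with GCS object lifecycle management, assuming a hypothetical archive bucket and retention periods that would need to match your actual regulatory obligations:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-data-platform-archive")  # illustrative bucket

# Stage colder objects down through storage classes, then expire them once the
# assumed retention window has passed (the periods here are purely illustrative).
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)
bucket.add_lifecycle_delete_rule(age=7 * 365)
bucket.patch()
```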

0. Thou Shalt Build on The Google Cloud Platform

Wait a second, what? - an 11th of The Ten Commandments???

Why not; speaking of Cloud and Big Data, even The Commandments should scale with the times!

Commandment Zero: Implement the 10 Big Data Commandments on the most mature Cloud Platform out there for Big Data - GCP.

  1. Ingest all the data raw in BigQuery. Use Cloud Data Fusion for interactive pipelines or use GCS tools and Dataflow Templates. Leverage Dataprep to wrangle your data on its way in. Transform your data with Dataflow or Dataproc and BigQuery Views.
  2. Perform in-place transformations within BigQuery with Dataform or dbt. Create physical, logical, curated and BI data models with Dataform/dbt and BigQuery Views to enrich your data by extracting business KPIs.
  3. Leverage Looker to build deeply nested models that truly represent your business. Build interactive dashboards and visualisations to drive business reporting & process. Take advantage of Data Studio for simple and efficient story-telling with data. Let humans and machines access data from BigQuery through their respective models and views.
  4. Go serverless - fully managed products & services on GCP such as Cloud Functions, Pub/Sub, Firestore, Bigtable, Cloud SQL and much more. Make use of Google Cloud Storage to port over HDFS with Dataproc. Build pluggable pipelines with Dataflow & Data Fusion, and provide stackable modelling & analytics with BigQuery and Looker. Define infrastructure as code with Google Deployment Manager and/or Terraform and bring your favourite configuration management tools such as Puppet and Chef. Apply strong version control with the world’s favourite tool, Git, via Cloud Source Repositories, and set up a world-class DevOps doctrine with Cloud Build.
  5. Leverage the massive scalability and ultra-low latency of Pub/Sub, Cloud Functions and Bigtable to build real-time solutions with streaming and live data (a minimal streaming-pipeline sketch follows this list). Take advantage of the linear scalability of BigQuery, Dataflow and Cloud Storage to grow & scale predictably. Empower your teams with the debugging and troubleshooting capabilities of Cloud Debugger. Implement strong SRE and NoOps capabilities with Cloud Monitoring to drive thresholds, quotas, alerts, notifications and Ops workflows.
  6. Democratise big data with open source tools & technologies using BigQuery. Share your data through APIs by making use of Cloud Endpoints / Apigee. Make your business insights available to the masses through Data Studio and Looker. Enable a data marketplace by opening up a safe sandbox of your business data models through Looker. Let your consumers enjoy free trade of data with Dataplex and Datashare in a data mesh architecture.
  7. Implement data lineage, data quality monitoring, and validations with Dataflow, Cloud Functions and Pub/Sub. Build unit, integration and regression test suites with your favourite open source tools & libraries with all GCP products & services. Build resilient solutions on GCP by automating maintenance with Cloud Composer, Cloud Scheduler, Cloud Logging and Cloud Monitoring.
  8. Put strong security practices in place with Cloud IAM across all GCP products & services. Apply strong encryption with Cloud KMS, scan & secure PII with Google DLP API. Manage segmentation with BigQuery datasets and GCS policies. Apply isolation through BigQuery Authorised Views. Administer access through Google Groups, Service Accounts, Tokens & Roles.
  9. Build world-class traceability and observability with Cloud Logging (formerly Stackdriver) and leverage Cloud Monitoring to build a nerve centre, or a single pane monitoring capability across environments for quotas, processes, health checks, metrics, access and much more.
  10. Take advantage of Data Catalog to drive all data intelligence capabilities - scan, classify, tag/flag/label your data. Build a data dictionary and glossary. Describe your data and provide a self-service interface to consumers. Make use of GCS storage classes and object lifecycle management policies to phase & archive your data for compliance and cost optimisation. Enable auditing features within Cloud Logging and govern secure access to auditors through Cloud IAM and Cloud KMS. Drive regulatory reporting, governance and compliance through automation at planet scale.
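
As a minimal streaming-pipeline sketch for points 4 and 5, assuming hypothetical topic and table names, an Apache Beam job reads from Pub/Sub, parses JSON and appends to a raw BigQuery table; it can be submitted to Dataflow with the DataflowRunner:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming ingestion: Pub/Sub -> light transform -> BigQuery.
# Topic, project and table names below are illustrative only.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my_project/topics/orders")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteRaw" >> beam.io.WriteToBigQuery(
            "my_project:raw.orders",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```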
