Data Mesh
What is Data Mesh
Data mesh is a set of principles for building a modern data architecture, in the same way that microservices are a set of principles for building modern software.
We can think of a data mesh as a network of nodes and connections that exchanges data about the business. The nodes in the mesh are data products, which are grouped into domains. The nodes produce and consume high-quality data within the mesh. Interoperability between nodes is ensured by centralized governance and standardization, and enabled by a shared, centralized, self-serve data infrastructure.
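The network view above can be made concrete with a small sketch. This is only an illustration, not a reference implementation: the class names `DataProduct` and `DataMesh`, and the example products and domains, are invented here. It models data products as nodes, groups them by domain, and records which product consumes which.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """A node in the mesh: one data product owned by one domain."""
    name: str
    domain: str


@dataclass
class DataMesh:
    """A directed graph; an edge (producer, consumer) means data flows that way."""
    products: dict[str, DataProduct] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)

    def add_product(self, product: DataProduct) -> None:
        self.products[product.name] = product

    def connect(self, producer: str, consumer: str) -> None:
        # Both ends must already be nodes in the mesh.
        for name in (producer, consumer):
            if name not in self.products:
                raise KeyError(f"unknown data product: {name}")
        self.edges.add((producer, consumer))

    def by_domain(self) -> dict[str, list[str]]:
        """Group product names by their owning domain."""
        grouped: dict[str, list[str]] = {}
        for product in self.products.values():
            grouped.setdefault(product.domain, []).append(product.name)
        return grouped

    def consumers_of(self, producer: str) -> list[str]:
        """Names of the products that read from the given producer."""
        return sorted(c for p, c in self.edges if p == producer)


# Hypothetical example data: three products in three domains.
mesh = DataMesh()
mesh.add_product(DataProduct("orders", domain="sales"))
mesh.add_product(DataProduct("invoices", domain="billing"))
mesh.add_product(DataProduct("revenue_report", domain="finance"))
mesh.connect("orders", "invoices")
mesh.connect("orders", "revenue_report")
mesh.connect("invoices", "revenue_report")
```

Here `mesh.by_domain()` groups each product under its owning domain, and `mesh.consumers_of("orders")` lists the downstream products that depend on it.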
Why we need data mesh
An enterprise typically has a centralized data platform, with a centralized team that owns and curates the data from all domains. However, this approach has some disadvantages:
- The centralized data team understands the product data less well than the product domain team does.
- The centralized data platform team must ingest an ever-growing number of sources.
- The centralized data platform team must satisfy an ever-growing set of consumer needs.
Ingestion, processing and serving are coupled, so the monolithic platform is the smallest unit that must change to cater for new functionality, such as unlocking a new dataset and making it available for new or existing consumption. This limits our ability to achieve higher velocity and scale in response to new consumers or sources of data.
The motivation behind breaking a system down into its architectural quanta is to create independent teams that can each build and operate one architectural quantum, that is, a unit that can be deployed and changed independently. Work can then be parallelized across these teams to reach higher operational scalability.
Data mesh principles:
In short:
- Decentralized Ownership: Data is owned by domain. Data in a data mesh is broken down around specific domains and stays closely tied to the microservices that produce it.
- Data as a product: Each domain team that publishes data treats it as a product and has to apply product thinking to it.
- Centralized Infrastructure: Data is available and self-serve anywhere in the company through the centralized infrastructure.
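- Centralized Governance: Data products are discoverable through a centralized catalog and interoperable through global standards.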
Decentralized Ownership
Data ownership by domain
Data products are distributed and owned by independent domain teams, which have embedded data engineers and data product owners. These teams use the common data infrastructure as a platform to host, prepare and serve their data assets.
Data as a product
In operational domains, domain teams provide their capabilities as APIs to the rest of the developers in the organization, as building blocks for creating higher-order value and functionality. The teams strive to create the best developer experience for their domain APIs, including discoverable and understandable API documentation, API test sandboxes, and closely tracked quality and adoption KPIs. For a distributed data platform to be successful, domain data teams must apply product thinking with similar rigor to the datasets they provide, treating their data assets as their products and the rest of the organization’s data scientists, ML engineers and data engineers as their customers.
Centralized Governance
The model shifts from a single platform team extracting and owning the data for its own use to each domain team providing its data as a product in a discoverable fashion.
- Discoverability: Each domain data product must register itself with a centralized data catalog so that it is easy to discover.
- Interoperable and governed by global standards: For example, to correlate data about a tenant across different domain data products, we need to treat ‘tenant’ as a federated entity with a unique global identifier.
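The two governance points above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `DataCatalog`, `CatalogEntry`, `join_on_tenant`, the `tenant_id` field name and the sample rows are all hypothetical, and the join assumes at most one row per tenant on the right-hand side. It shows data products registering with a central catalog, and two products from different domains being correlated through a shared global tenant identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    domain: str
    owner: str
    description: str


class DataCatalog:
    """Hypothetical central catalog; every data product registers here."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"{entry.name} is already registered")
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Case-insensitive match on product name or description."""
        keyword = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if keyword in e.name.lower() or keyword in e.description.lower()
        )


def join_on_tenant(left: list[dict], right: list[dict]) -> list[dict]:
    """Correlate rows from two data products via the shared global tenant_id.

    Assumes the right-hand product has at most one row per tenant.
    """
    right_by_tenant = {row["tenant_id"]: row for row in right}
    return [
        {**row, **right_by_tenant[row["tenant_id"]]}
        for row in left
        if row["tenant_id"] in right_by_tenant
    ]


# Hypothetical products from two different domains.
catalog = DataCatalog()
catalog.register(CatalogEntry("tenant_usage", "product", "product-team", "Daily usage per tenant"))
catalog.register(CatalogEntry("tenant_invoices", "billing", "billing-team", "Invoices issued per tenant"))

usage = [{"tenant_id": "t-001", "active_users": 42}, {"tenant_id": "t-002", "active_users": 7}]
invoices = [{"tenant_id": "t-001", "amount_due": 1200}]
joined = join_on_tenant(usage, invoices)
```

The join only works because both domains agreed on the same `tenant_id` values. Without a global identifier, each product would need its own mapping table to every other product.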
Centralized Infrastructure
A data infrastructure team can own and provide the necessary technology that the domains need to capture, process, store and serve their data products.
The keys to building the data infrastructure as a platform are:
- Do not include any domain-specific concepts or business logic.
- Make sure the platform hides all the underlying complexity and provides the data infrastructure components in a self-service manner.
A success criterion for self-serve data infrastructure is how much it lowers the time needed to create a new data product on the infrastructure.
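One way to picture a domain-agnostic, self-serve platform is the sketch below. Everything in it is an assumption made for illustration: the `SelfServePlatform` and `ProductSpec` names, the spec fields, and the `mesh://` address format are invented, and storage is just an in-memory dictionary. The point it tries to show is the separation of concerns: the platform exposes generic operations, while the business logic is supplied by the domain team.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Row = dict
Transform = Callable[[Iterable[Row]], list[Row]]


@dataclass(frozen=True)
class ProductSpec:
    """What a domain team declares; every field is generic, none is business-specific."""
    name: str
    owner: str
    storage_format: str = "parquet"
    retention_days: int = 90


class SelfServePlatform:
    """Hypothetical platform API: knows about storing and serving, not about orders or tenants."""

    def __init__(self) -> None:
        self._specs: dict[str, ProductSpec] = {}
        self._data: dict[str, list[Row]] = {}

    def create_product(self, spec: ProductSpec) -> str:
        # A real platform would provision storage, pipelines and access control here.
        self._specs[spec.name] = spec
        self._data[spec.name] = []
        return f"mesh://{spec.owner}/{spec.name}"  # made-up address format

    def publish(self, name: str, rows: Iterable[Row], transform: Transform) -> int:
        # The domain team passes in its own logic; the platform only runs and stores the result.
        result = transform(rows)
        self._data[name] = result
        return len(result)

    def read(self, name: str) -> list[Row]:
        return list(self._data[name])


def paid_orders_only(rows: Iterable[Row]) -> list[Row]:
    """Domain-owned logic: lives with the sales team, not in the platform."""
    return [row for row in rows if row["status"] == "paid"]


platform = SelfServePlatform()
address = platform.create_product(ProductSpec(name="paid_orders", owner="sales"))
count = platform.publish(
    "paid_orders",
    [{"id": 1, "status": "paid"}, {"id": 2, "status": "cancelled"}],
    transform=paid_orders_only,
)
```

With this shape, creating a new data product takes one declarative spec plus one domain-owned function, which is the kind of lead time the success criterion above is meant to measure.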