Centralize or decentralize: what cycle are we in?
When I was an industry analyst at Gartner, we would often discuss whether organizations were in a centralized or decentralized cycle. In business, it’s normal to alternate between investigating options for creating innovation and moving quickly, and focusing on reducing cost and optimizing teams and technologies.
The current cycle has lasted roughly 10 years, beginning with the DevOps culture change, when we started breaking up IT Operations teams and building more efficient ways of managing the entire lifecycle of a service. These autonomous teams choose their own tools, languages, and technologies. The downside is that, with all the diversity of available technologies, there is also a lot of waste. Work is frequently duplicated across teams, and the resulting inconsistency burdens teams when onboarding new employees, tackling compliance requirements, or moving engineers between teams. These issues can end up slowing progress toward overall business objectives.
Adopting a Team-based Approach
To solve some of these challenges, we’ve seen the rise, or re-emphasis, of platform teams that build foundational components for other teams to use. These capabilities do not deliver business-facing functionality, but they are required for business-critical applications to be built, operated, and managed. One or more platform teams are often responsible for components like Kubernetes, orchestration, CI/CD, observability, monitoring, databases, and data platforms. These technologies regularly require additional code to be built and maintained.
The most important deliverables from these teams are helping other teams implement the technology, training them on its usage, and operating and maintaining the platform services. These teams also typically drive decisions about whether an organization buys many tools from a single vendor or purchases best-of-breed tools; typically a mix is used. Whether open-source tools should be used, which of them are sufficiently community supported, whether to use commercially supported open-source tools, and where buying proprietary tools is best suited are all relevant considerations. Usually, a mix of these is selected as well.
At the end of the day, the platform is a product, but it’s an internal product, and thus it must have a roadmap and strategy, like any other product. Platform technologies should be built so that they can be rolled out in a multi-tenant manner and yet still provide complete visibility across the organization.
When it comes to these platform tools, many of the leading technologies today lack the ability to support shared service delivery. One of the biggest challenges is not only creating different views of the data, but also allowing data to be analyzed across teams. Additionally, controlling costs is a major challenge: quotas and usage must be allocated to specific teams to ensure organizational budgets are not overrun.
This means the following capabilities are required for platform teams to be successful:
- Multiple accounts or sub-tenant capabilities allowing for centralized management
- Management role which can control the sub-tenants and see data across tenants
- Access control to disable platform features and prevent teams from implementing new capabilities
- Ability to allocate quota to sub-tenants and manage internal billing or chargeback
Specific to observability:
- Search, view, and analyze data across the tenants
- Multiple API keys per sub-tenant, allowing for granular control of data ingested
- Organizing the data per the sub-tenants
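To make these requirements more concrete, here is a minimal Python sketch of how sub-tenants, multiple API keys per sub-tenant, quotas, and chargeback could fit together. It is not modeled on any vendor's API: the `SubTenant` and `Platform` classes, their method names, the flat `price_per_gb` rate, and the use of gigabytes as the quota unit are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
import secrets


@dataclass
class SubTenant:
    """One team's slice of the shared platform (hypothetical model)."""
    name: str
    quota_gb: float                                    # ingest quota set by the platform team
    used_gb: float = 0.0                               # ingest consumed so far
    api_keys: dict = field(default_factory=dict)       # label -> API key
    usage_by_key: dict = field(default_factory=dict)   # label -> GB ingested


class Platform:
    """Central management role: owns the sub-tenants and sees data across them."""

    def __init__(self, price_per_gb):
        self.price_per_gb = price_per_gb   # assumed flat internal billing rate
        self.tenants = {}                  # tenant name -> SubTenant
        self._key_owner = {}               # API key -> (tenant name, key label)

    def add_tenant(self, name, quota_gb):
        tenant = SubTenant(name=name, quota_gb=quota_gb)
        self.tenants[name] = tenant
        return tenant

    def issue_api_key(self, tenant_name, label):
        """Issue an additional labeled key so ingest can be tracked per source."""
        tenant = self.tenants[tenant_name]
        key = secrets.token_hex(16)
        tenant.api_keys[label] = key
        self._key_owner[key] = (tenant_name, label)
        return key

    def ingest(self, api_key, size_gb):
        """Accept data only for a known key whose tenant still has quota left."""
        owner = self._key_owner.get(api_key)
        if owner is None:
            return False
        tenant_name, label = owner
        tenant = self.tenants[tenant_name]
        if tenant.used_gb + size_gb > tenant.quota_gb:
            return False
        tenant.used_gb += size_gb
        tenant.usage_by_key[label] = tenant.usage_by_key.get(label, 0.0) + size_gb
        return True

    def usage_report(self):
        """Cross-tenant view: (name, used, quota) for every sub-tenant."""
        return [(t.name, t.used_gb, t.quota_gb) for t in self.tenants.values()]

    def chargeback(self):
        """Internal bill per sub-tenant, based on actual usage."""
        return {t.name: t.used_gb * self.price_per_gb for t in self.tenants.values()}
```

In this sketch a platform team would call `add_tenant` once per team, hand out one key per environment or service with `issue_api_key`, and run `chargeback` at the end of a billing period. A real platform would also need authentication, persistence, and quota resets, all of which are left out here.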
AWS, Azure, and other leading cloud providers check these boxes and allow platform teams to implement this type of model. More code is typically required to handle additional requirements, since the cloud provider platforms are pretty minimalistic.
Unfortunately, most leading observability tools do not check these boxes, including Datadog, Elastic, New Relic, and Splunk. Grafana Cloud and Enterprise support a subset of these capabilities. Other tools that have all of these capabilities include AppDynamics, Dynatrace, and Logz.io.
In the open-source world, most platforms do not support these models. That includes the ELK Stack, Prometheus, Jaeger, and open-source Grafana. Interestingly enough, Cortex, Thanos, and OpenSearch have multi-tenant support, but lack quota controls.
In my experience, rolling out tools which do not support a scalable deployment model that matches an organization’s unique makeup will create major challenges and cause cost overruns and additional work for the platform team. Make sure you understand your organizational requirements before you settle on an observability platform or any shared services platform.