Duda is the leading web design platform for companies that offer web design services to small businesses.
The Observability Journey
With multiple design and development programs in production, the Duda engineering team needed a way to search and visualize their cloud data whenever performance issues threatened the end-user experience. As an open source-centric organization, Duda first turned to Logz.io for the company’s ELK-based Log Management solution to improve their log monitoring process. Over time, as the company’s development process grew more complex with a microservices architecture, a need emerged for a more end-to-end method of monitoring and profiling events and problems in their service architecture. Once again, Duda turned to an open source-based solution, but quickly saw the need for the support and scale of a managed service.
Step 1: Accelerating the Development Cycle and Optimizing Production with Log Management
To proactively monitor the health of their production environment, and more quickly and easily diagnose and troubleshoot issues after finding them, Duda’s development team uses Logz.io to centralize and monitor their logs.
The team has created a number of alerts to help engineers more quickly identify and triage potential issues, and has also built a number of custom dashboards to increase visibility into their logging pipeline.
Beyond the development team, Duda also uses Logz.io for troubleshooting and alerting by aggregating logs from many sources, including CloudWatch, AWS Lambda, servlet containers, application data, and a general funnel for alerts. Operationally, all of the teams appreciate that Logz.io makes logging more cost-efficient through data filtering, archiving, and restore capabilities.
Step 2: Leveraging Traces (and Open Source) to Improve Analysis and Debugging Across Microservices
As the engineering team continued to deploy an architecture built on microservices, there emerged a need for a new strategy for monitoring the environment and improving visibility across all services.
According to David Barda, Backend Architect for Duda, the engineering team opted to adopt distributed tracing.
“We wanted to get more information about our environment. Distributed Tracing allows our team to trace incoming request flow through our application. This gives us more information about the latency of the services along the request path so that we can understand the root cause of bottlenecks and failures and collect data for future debugging and analysis.”
David and the team deployed a distributed tracing program built around the OpenTracing project, with Jaeger as the open source tool that receives, processes, and visualizes tracing telemetry data. For Duda, Jaeger offered “the best UI when compared to other open source projects,” but it also required serious maintenance and upkeep.
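To make the idea of tracing a request through an application more concrete, here is a minimal Python sketch of the underlying data model: every operation becomes a span that shares a trace ID with its parent and records its own duration. This is an illustration only. `ToyTracer`, `Span`, and `slowest_child` are hypothetical names invented for this example; they are not the OpenTracing or Jaeger API, and they are not Duda's instrumentation.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One timed operation inside a trace (simplified, hypothetical model)."""
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique per operation
    parent_id: Optional[str]  # None for the root span
    operation: str
    start: float = 0.0
    duration: float = 0.0


class ToyTracer:
    """In-memory tracer that keeps finished spans in a list."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []  # currently active spans, innermost last

    @contextmanager
    def start_span(self, operation: str):
        parent = self._stack[-1] if self._stack else None
        span = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            operation=operation,
            start=time.perf_counter(),
        )
        self._stack.append(span)
        try:
            yield span
        finally:
            # Record latency even if the traced operation raised.
            span.duration = time.perf_counter() - span.start
            self._stack.pop()
            self.finished.append(span)


def slowest_child(spans: List[Span]) -> Optional[Span]:
    """Return the non-root span with the highest latency, if any."""
    children = [s for s in spans if s.parent_id is not None]
    return max(children, key=lambda s: s.duration) if children else None


if __name__ == "__main__":
    tracer = ToyTracer()
    with tracer.start_span("handle_request"):
        with tracer.start_span("auth"):
            time.sleep(0.001)
        with tracer.start_span("render"):
            time.sleep(0.03)
    bottleneck = slowest_child(tracer.finished)
    print(bottleneck.operation, round(bottleneck.duration, 3))
```

A real tracer such as Jaeger also propagates the trace context across process boundaries and ships spans to a backend; this sketch keeps everything in one process to show only how parent/child spans and latencies relate.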
Duda set out to deploy the same robust tracing visualization, monitoring and analysis capabilities of Jaeger, but without the maintenance challenges of running and scaling an open source project.
Step 3: Distributed Tracing from Logz.io to Streamline Jaeger Maintenance, Correlate Logs and Traces and Optimize Root Cause Analysis
Fortunately, Duda was able to leverage its existing operational deployment with Logz.io to unify logs and traces. The engineering team set up a Docker integration to connect the traces to Logz.io’s new Jaeger-based Distributed Tracing service, providing an even more seamless integration than what would have been available using siloed log and tracing solutions.
“With the new Logz.io Distributed Tracing offering, we did not need to manage the database and could begin storing our traces, along with our logs, in a central platform. This provides us with strong log correlation functionality so that we can access our traces directly from a suspected or interesting log. Moving forward, we have better visibility into application requests and events and can monitor all activity dispatched from microservices to improve analysis and troubleshooting.”
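The log correlation described in the quote usually rests on one simple idea: each log line carries the ID of the trace that was active when it was written, so a platform can jump from the log to the trace. The sketch below shows that idea with Python's standard `logging` and `contextvars` modules. It is an assumption-laden illustration, not Logz.io's or Duda's implementation: the `trace_id` field name, the `current_trace_id` variable, and the `build_logger` helper are all hypothetical.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the ID of the trace active in the current context.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so the trace ID is searchable."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes trace-correlated JSON lines to `stream`."""
    logger = logging.getLogger("correlation-demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    out = io.StringIO()
    log = build_logger(out)
    token = current_trace_id.set("4bf92f3577b34da6")  # made-up trace ID
    log.error("payment service timed out")
    current_trace_id.reset(token)
    print(out.getvalue().strip())
```

With the trace ID present on the log line, finding the matching trace becomes a lookup by that ID rather than a manual search through logs.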
For Duda, managing Jaeger to this degree would have been a daunting challenge that required significant knowledge and investment from the DevOps team.
“By instrumenting our traces directly into the Logz.io Distributed Tracing service, we have seen an improvement in our overall user experience. Even better, because Logz.io removes the burden of managing Jaeger, our DevOps team can spend less time maintaining the cluster or upgrading the service, and more time on more critical production-oriented tasks.”
In addition, David’s team instrumented the solution to use the baggage of the trace to retain data about feature flag evaluations. This helps Duda understand not only the high-level flow of a trace, but also the more detailed internal state of the underlying service.
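In OpenTracing terms, baggage is a set of key-value pairs that travels with the trace context from service to service. The following Python sketch shows how feature flag evaluations could ride along as baggage between two services. It is a simplified assumption, not Duda's code: the `demo-baggage-` header prefix, the `flag.` key convention, and the helper names are hypothetical, and a real tracer handles the injection and extraction itself.

```python
from typing import Dict

# Hypothetical header prefix; real tracers define their own wire format.
BAGGAGE_PREFIX = "demo-baggage-"


def record_flag(baggage: Dict[str, str], flag: str, enabled: bool) -> Dict[str, str]:
    """Return a copy of the baggage with one feature flag evaluation added."""
    updated = dict(baggage)
    updated[f"flag.{flag}"] = "true" if enabled else "false"
    return updated


def inject(baggage: Dict[str, str], headers: Dict[str, str]) -> Dict[str, str]:
    """Copy baggage items into outgoing request headers."""
    outgoing = dict(headers)
    for key, value in baggage.items():
        outgoing[BAGGAGE_PREFIX + key] = value
    return outgoing


def extract(headers: Dict[str, str]) -> Dict[str, str]:
    """Rebuild the baggage from incoming request headers."""
    return {
        key[len(BAGGAGE_PREFIX):]: value
        for key, value in headers.items()
        if key.startswith(BAGGAGE_PREFIX)
    }


if __name__ == "__main__":
    # Service A evaluates a flag and forwards it with the request.
    baggage = record_flag({}, "new-editor", True)
    headers = inject(baggage, {"content-type": "application/json"})

    # Service B reads the same evaluation without re-evaluating the flag.
    received = extract(headers)
    print(received)
```

Because the flag evaluation is attached to the trace, anyone inspecting that trace later can see which code path each service took, which is the kind of internal state the paragraph above refers to.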
Finally, with Distributed Tracing, Duda has a better awareness of their ecosystem of microservices. The team is able to trace the path of a request across microservices, recognize points of failure in the request, and improve debugging and analysis for future events.
“Before we instrumented tracing, I may have woken up at 2 AM with a critical production issue and would have had to search the logs for hours without finding what I was looking for related to the request. But with Distributed Tracing, I can find all the dependencies we have within our microservices environment to get closer to that root cause of the issue. I can then find the specific link between a log and a trace, and use Logz.io to visualize the trace, uncover the issue, flag it and solve the issue. This is very valuable for our team.”