SpareRoom is the UK’s leading flat and house share site with over 7 million registered users. Founded in the UK in 2004, the company expanded into the US market in 2011 and has since helped over half a million Americans find a room or flatmate.
With teams in both Manchester and London, using a wide range of technologies including Perl, NodeJS, MySQL, AWS, Google Cloud Platform and more, the SpareRoom Technology Team support millions of customers to ensure they get the best experience, as well as driving the platform forward in new directions using the latest technologies.
SpareRoom’s engineering teams rely on centralized logging to gain visibility into application performance and to support monitoring and troubleshooting. Before Logz.io, an internal ELK Stack was deployed on premises, but as the business grew it became difficult to maintain. Upgrading to newer versions of the stack was challenging, and the team encountered performance issues, with heavy queries failing because memory usage was often pushed over the limit.
SpareRoom strongly believed in the aggregation and analysis capabilities that the ELK Stack has to offer, but did not want to keep investing an ever-growing amount of resources in maintaining it. The team considered hosting its own instance of the stack in the cloud but decided the effort involved was still too substantial.
Other hosted ELK solutions were also explored, but none were as robust and feature-rich as the solution offered by Logz.io. Ease-of-use and the overall service-oriented approach provided by Logz.io were key considerations in selecting this option over others.
SpareRoom’s application is based primarily on a monolithic architecture, but a growing number of core services are being developed as microservices.
Moving to a newer architecture allowed the team to strategize logging. With a clearer understanding of the value of consistent, well-formatted log messages for analysis, SpareRoom implemented structured logging in the new services from the get-go. The generated log data is used by developers to monitor the performance of these specific services and quickly identify and troubleshoot events when they occur.
Below is an example of a JSON log generated by one of the services, structured to allow easier analysis:
{ "user_identifier": "12345", "transaction_amount": "10.99", "description": "E2E test purchase", "source_system_country": "GBR", "serverName": "848699df6d-hznhs", "pid": 1, "type": "order-service", "timestamp_completed": "2018-06-27T11:40:01.000Z", "transaction_identifier": "ch_fsdfwd22f32ewqe", "host": "10.154.0.9", "transaction_currency": "gbp", "level": "info", "card_type": "credit", "message": "Transaction recorded", "tags": [ "_logz_json_tcp_5050" ], "transaction_classification": "Sale", "@timestamp": "2018-06-27T11:40:01.995Z", "request_id": "0cefcde5-967f-43e5-a95e-a9879b273295", "territory": "GBR" }
Steve Elliott, Head of Engineering at SpareRoom, explains: “The goal was to have fields extracted at origin, without having to use string grokking. So when logging, we include the message, which allows us to split based on that, but also a hash of properties, such as: description, user_identifier, and others. This allows us to be able to aggregate and filter much easier than if the log data was just contained within message.”
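The approach described above, a human-readable message plus a hash of properties emitted as top-level fields, can be illustrated with a minimal Python sketch. This is not SpareRoom's actual code: the function name `build_log_entry` and its parameters are hypothetical, and only the field names are taken from the sample log above.

```python
import json
from datetime import datetime, timezone


def build_log_entry(message, level="info", service="order-service", **properties):
    """Return one JSON log line: a readable message plus top-level fields.

    Hypothetical helper for illustration only; the field names follow the
    sample log in the text, everything else is an assumption.
    """
    entry = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "type": service,
        "message": message,
    }
    # Properties become top-level fields, so they can be filtered and
    # aggregated directly without applying a grok pattern to the message.
    entry.update(properties)
    return json.dumps(entry)


line = build_log_entry(
    "Transaction recorded",
    user_identifier="12345",
    transaction_currency="gbp",
    description="E2E test purchase",
)
print(line)
```

Because each property is its own field, a query such as "all entries where transaction_currency is gbp" needs no string parsing of the message.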
SpareRoom also ships Apache server logs which are used by members of the platform team to monitor inbound traffic to the website and the general health of the application. The generated log data has been useful on a number of occasions in identifying malicious behavior, such as phishing attacks, penetration tests and bots.
SpareRoom uses Logz.io not only to identify and troubleshoot production events in real time but also to implement a more proactive monitoring approach. Based on experience and lessons from previous incidents, a series of dashboards and alerts has been defined to notify the team should the same event occur again.
One such event involved an abnormal number of user registrations. Looking into the application logs, SpareRoom was able to identify the origin of the traffic and block it quickly and effectively. Logz.io was then used in post-incident reviews, allowing the team to quickly build a timeline of what happened based on data rather than recollection. Using this data, SpareRoom created alerts for rapidly identifying issues in the future, which have since proven to be invaluable in catching subsequent occurrences.
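The logic behind an alert of this kind can be sketched as a count of matching log entries per time window compared against a threshold. The Python below is a standalone illustration, not the Logz.io alerting configuration SpareRoom uses; the "User registered" message, the function names, and the threshold value are all assumptions.

```python
from collections import Counter
from datetime import datetime


def registrations_per_minute(events):
    """Count events per one-minute bucket, keyed by the truncated timestamp."""
    counts = Counter()
    for event in events:
        # Replace a trailing "Z" so fromisoformat also works before Python 3.11.
        ts = datetime.fromisoformat(event["@timestamp"].replace("Z", "+00:00"))
        counts[ts.replace(second=0, microsecond=0)] += 1
    return counts


def minutes_over_threshold(counts, threshold):
    """Return the minute buckets whose event count exceeds the threshold."""
    return sorted(minute for minute, n in counts.items() if n > threshold)


# Hypothetical structured log entries; only "@timestamp" mirrors the sample log.
events = [
    {"@timestamp": "2018-06-27T11:40:01Z", "message": "User registered"},
    {"@timestamp": "2018-06-27T11:40:15Z", "message": "User registered"},
    {"@timestamp": "2018-06-27T11:40:42Z", "message": "User registered"},
    {"@timestamp": "2018-06-27T11:41:05Z", "message": "User registered"},
]

counts = registrations_per_minute(events)
flagged = minutes_over_threshold(counts, threshold=2)
print(flagged)
```

In practice the threshold would be derived from normal registration volume rather than fixed by hand.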
Image: Dashboard used for monitoring SpareRoom’s microservices. Displays response time for each class of request handled and gauges of thresholds to indicate expected/warning/critical timings.
During a recent Visa outage in the UK, the dashboards and alerts built by SpareRoom for monitoring microservices proactively informed the team of an issue causing failed transactions and poor response times. Since Kibana dashboards are displayed on screens throughout the engineering department, the team was able to pick up on the issue even before an alert was fired. The root cause was then quickly identified using the detailed error messages made available by the structured logging.
Image: Dashboard used for monitoring the general health of SpareRoom’s microservices. Used to verify expected transactions are being processed and response times are within expected limits.
The alerts configured by SpareRoom in Logz.io send notifications to the relevant team members using endpoint integrations. Alerts have been defined to notify via Slack and OpsGenie on a low number of payments, abnormal logging volume, healthcheck URL calls and suspicious website traffic.
Logz.io is used by developers, platform engineers and QA engineers to monitor over 10 million requests made to SpareRoom’s services per day. Using Logz.io has enabled SpareRoom’s engineering teams to focus on improving the platform’s performance and stability instead of maintaining a logging architecture.
The ability to use a single pane of glass to monitor all the log data generated by transactions has had a dramatic effect on the speed with which SpareRoom resolves critical issues, contributing in turn to the stability of the application and the overall productivity of the engineering team:
Logz.io has been invaluable in reducing our time to resolution and fixing live issues, without having to log into servers to find logs.
Steve Elliott, Head of Engineering at SpareRoom