How Photobox keeps site reliability in the picture

Photobox’s site reliability head discusses how the photo book and personalised gifts site manages a complex microservices architecture

Cliff Saran, Managing Editor

Published: 23 Nov 2022 11:22

Over the past few years, Photobox has been on a journey to unify its e-commerce platform. At the start of 2022, the company merged with Albelli and, according to Alex Hibbitt, director of site reliability engineering at Photobox, it hopes to build a solid base for the different brands in the group.

Photobox’s IT is based on a microservices architecture, running on the Amazon Web Services (AWS) public cloud. Over the Black Friday and Cyber Monday weekend each year, the company’s absolute peak of trading is five to six times its normal activity.

Peak shopping events run over an extended period due to the nature of Photobox’s business. Customers wishing to buy personalised photo-based products, such as books, calendars, prints and gifts, upload digital images to the website, take their time customising the layout of their chosen product, then proceed to the checkout.

This puts significantly more strain on the back-end platforms that run Photobox’s business, compared with other retailers where the customer journey from product selection to checkout occurs in a matter of minutes.

Pulling together puzzle pieces

Monitoring every aspect of the platform is key, but when Hibbitt joined Photobox four years ago, each developer team used its own monitoring tools. “When I joined, we had 10 separate monitoring tools in place,” he says.

In terms of getting an overall view of the reliability of the platform, he says each tool covered an individual part of the full picture, which is one of the challenges of a microservices architecture. “You want to give teams the freedom to pick their tools, but this often can lead to tool proliferation across the organisation, which is what happened within Photobox,” he says.

According to Hibbitt, in isolation, an observability tool that is wrapped around a specific microservice can work perfectly well. “The challenge,” he says, “is when you cross boundaries between different microservices.” For instance, the customer experience journey at Photobox touches at least three different front-end services. It also requires another dozen or so back-end services.

Often in site reliability engineering, the team looks at the end-to-end customer experience. But, as Hibbitt points out, a customer’s journey on Photobox occurs over a protracted period of time.

“If you need to build a photo book, you dedicate your time to creating it,” he says. “You could do this within a couple of hours, but if you really want to create something special, where you’re putting a lot of love and effort into producing a photo book, it may take a week of working a couple of hours each night.”

This is the challenge Photobox faces when it comes to observability with teams using different tools. “It becomes impossible to track a customer journey like this, that runs over a long period of time across 10 different tools,” he says.

This was what Hibbitt faced when he experienced his first Black Friday at Photobox four years ago. “I was practically pulling my hair out because I couldn’t have enough windows open across our different tools,” he says.

Whenever he needed to investigate a particular problem, such as when a customer raised an issue with the site, Hibbitt found he had to use the monitoring tools each developer team had originally deployed to observe its own microservices. This manual tracing of the customer journey would be impossible to scale, and is a problem that cannot be solved simply by hiring more site reliability engineers.

“You couldn’t expect a relatively new engineer to understand a customer journey when it’s so challenging to instrument across our stack,” he says. “You might have data coming in from one tool that is different to another tool, and you have no way of comparing this data. It’s an apples and oranges problem.”

Looking at the big picture

Photobox has now introduced Dynatrace to provide standardisation for observability of its microservices. Hibbitt says the tool enables Photobox to have a common approach to looking at different microservices.

The company is also using the artificial intelligence (AI) in Dynatrace to automate alerts when a site reliability threshold is breached.

“We do not have to build out custom alerts and custom thresholds,” says Hibbitt. “Davis, the AI in Dynatrace, is very good at automatically understanding what our baseline for particular services looks like. It assesses error rates and the number of calls passing through different services to create a picture of the overall state of the Photobox platform.”
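To illustrate the general idea of alerting against a learned baseline rather than a hand-set threshold, here is a minimal Python sketch. It is not how Davis works internally, which the article does not describe; the function name `is_anomalous`, the `k` multiplier and the sample error rates are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, current, k=3.0):
    """Flag `current` if it sits more than k standard deviations above
    the baseline learned from `history` (hypothetical rule, not Dynatrace's)."""
    if len(history) < 2:
        # Not enough data to learn a baseline, so never alert.
        return False
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + k * spread


# Hypothetical per-minute error rates for one service (fraction of failed calls).
recent_error_rates = [0.010, 0.012, 0.011, 0.009, 0.010, 0.013]

print(is_anomalous(recent_error_rates, 0.011))  # within the usual range
print(is_anomalous(recent_error_rates, 0.050))  # well above the usual range
```

The point of the sketch is that the alert level is derived from each service's own recent behaviour, so no one has to maintain a separate threshold per service.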

One of the challenges a site reliability engineer faces when dealing with multiple alerts is deciding which areas of performance degradation to prioritise. “Our approach is to try to make decisions based on data,” says Hibbitt.

When preparing for the peak in e-commerce activity during Black Friday and Cyber Monday, he says Photobox runs a load test at 150% of the volume of activity it expects. “We ramp up our site and see what happens. We do this on the live side, so it has the potential to impact customers, but we’re very careful in terms of making sure we protect the customer experience,” says Hibbitt.
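The arithmetic behind such a test can be sketched in a few lines of Python. The 150% headroom and the five-to-six-times peak multiplier come from the article; the baseline of 100 requests per second, the five ramp steps and the function names are hypothetical values chosen only to make the example concrete.

```python
def load_test_target(expected_peak_rps, headroom=1.5):
    """Target load for the test: 150% of the expected peak by default."""
    return expected_peak_rps * headroom


def ramp_stages(target_rps, steps=5):
    """Evenly spaced load levels that ramp up to the target."""
    return [target_rps * i / steps for i in range(1, steps + 1)]


normal_rps = 100                    # hypothetical everyday traffic
expected_peak_rps = normal_rps * 6  # upper end of the five-to-six-times peak
target = load_test_target(expected_peak_rps)

print(target)
print(ramp_stages(target))
```

Ramping in stages, rather than jumping straight to the target, is one way to stop a test on the live platform early if customer-facing metrics start to degrade.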

Dynatrace provides Photobox with the ability to measure in real time what is happening for customers as they upload images and create photo books and other photo gifts. “The peak helps us really target where we want to be optimising things,” says Hibbitt. “So, in the case of this peak, we found that our shop service was beginning to slow down, which is obviously quite impactful to a customer.”

By using the observability data from Dynatrace, Photobox was able to understand how much of an impact this slowdown was having. Given that the team responsible for the shop service had a full backlog of work, Dynatrace enabled the site reliability engineering team to demonstrate the impact of this particular problem. The team could then estimate how many customers would be affected, allowing the business to assess the commercial impact and decision-makers to prioritise the work required.
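A back-of-the-envelope version of that kind of estimate might look like the following Python sketch. The article does not give Photobox's figures or method, so every number and name here (`estimate_affected`, the session counts and the shares) is a made-up assumption for illustration.

```python
def estimate_affected(sessions_per_hour, share_using_service, share_slow, hours):
    """Rough count of customer sessions that hit a slow response.

    All inputs are hypothetical; real values would come from observability data.
    """
    return round(sessions_per_hour * share_using_service * share_slow * hours)


affected = estimate_affected(
    sessions_per_hour=10_000,  # assumed traffic during the peak
    share_using_service=0.6,   # assumed share of sessions reaching the shop service
    share_slow=0.25,           # assumed share of those seeing a slow response
    hours=8,                   # assumed duration of the slowdown
)
print(affected)
```

Multiplying an estimate like this by an average order value or an assumed abandonment rate would turn it into the kind of commercial figure decision-makers can weigh against other backlog items.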