Rally: from benchmarking to continuous improvement

To maintain high quality while continuously improving our offers, we need to be able to define and measure quality, detect variations and investigate any deterioration.

To achieve this, we have identified two aspects of OpenStack (the solution behind our Public Cloud) that we consider essential to the customer experience:

  • using the OpenStack API via OpenStack clients, libraries or the OVH API v6;
  • performance guarantees for instances (processor, RAM, disk, network).

This article focuses on the first point: how we measure the performance of Public Cloud APIs at OVH. First, I will set out the solution we have implemented and show you how it fits into the OVH ecosystem. I will finish by giving you a concrete case study showing how we have been able to improve the response time of some API queries by up to 75 percent.

Rally: OpenStack's customer-oriented benchmarking tool

Rally is a building block in the OpenStack project. It is defined as a "Benchmarking as a Service" solution. Its role is to test an OpenStack platform from a customer perspective and measure execution times.

The project is written in Python and was launched in 2013. Version 1.0.0 was released in July 2018. The decision to use it at OVH was relatively easy, as it is part of the OpenStack ecosystem and provides features that meet our needs.

Rally can run scenarios, i.e. sets of sequential tests that can be configured with varying degrees of complexity. For example, a simple scenario can create an authentication token and confirm that it works. A more complex one can, in a single run, authenticate, create several instances and attach volumes to them. This flexibility makes it easy to design very specific tests. Rally natively provides a comprehensive set of scenarios categorised by functional component, such as Nova, Neutron, Keystone and Glance.
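As an illustration, the simplest scenario mentioned above (creating an authentication token) could be described with a task definition along these lines. This is a minimal sketch, not our production configuration: the layout follows Rally's classic task-file format and its built-in Authenticate.keystone scenario as I understand them, and the iteration count and concurrency are arbitrary assumptions. Check the Rally documentation for your version before relying on it.

```python
import json

# Assumed values for illustration: 10 iterations, run one at a time.
task = {
    "Authenticate.keystone": [
        {
            "runner": {"type": "constant", "times": 10, "concurrency": 1},
        }
    ]
}


def write_task(path, definition):
    """Write a task definition to a JSON file that Rally can load."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(definition, handle, indent=2)


task_json = json.dumps(task, indent=2)
print(task_json)
```

A file produced this way would typically be handed to the "rally task start" command.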

Rally measures response times at each stage of the scenario, as well as the total time. The data is saved in a database and can be exported as reports in HTML or JSON. The tool can iterate the same scenario a number of times and calculate the average, as well as other statistics (median, 90th percentile, 95th percentile, minimum, maximum), for each stage and for the scenario as a whole.
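To make these statistics concrete, here is a minimal sketch that computes them from a list of iteration durations. It is not Rally's code: the percentile uses simple linear interpolation, which may differ slightly from Rally's own method, and the function names and sample durations are made up for the example.

```python
import statistics


def percentile(sorted_values, pct):
    """Percentile by linear interpolation over an already sorted list."""
    if not sorted_values:
        raise ValueError("at least one value is required")
    rank = (len(sorted_values) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def summarise(durations):
    """Summarise the durations (in seconds) of all iterations of one stage."""
    ordered = sorted(durations)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
        "max": ordered[-1],
        "avg": sum(ordered) / len(ordered),
    }


# Hypothetical durations: four fast iterations and one slow one.
summary = summarise([0.5, 0.6, 0.7, 0.8, 3.0])
print(summary)
```

The single slow iteration barely moves the median but pulls the average and the high percentiles up, which is why it helps to look at several statistics rather than the average alone.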

Rally test report generated in HTML

Rally also supports the concept of a Service Level Agreement (SLA): we can define an acceptable error rate across the iterations, which determines whether the overall test is considered a success.
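The logic of such a criterion can be sketched in a few lines. In a Rally task file, this kind of check is declared in the scenario's "sla" section (a maximum failure rate, as far as I know). The function below only mimics the idea and is not Rally's implementation; its name, arguments and sample data are assumptions.

```python
def sla_passed(iteration_errors, max_failure_rate_pct):
    """Return True if the share of failed iterations stays within the limit.

    iteration_errors holds one entry per iteration: None (or an empty value)
    for a success, anything else (for example an error message) for a failure.
    """
    total = len(iteration_errors)
    if total == 0:
        # No iteration ran, so nothing proves the service works.
        return False
    failed = sum(1 for error in iteration_errors if error)
    failure_rate_pct = 100.0 * failed / total
    return failure_rate_pct <= max_failure_rate_pct


# Hypothetical run: one failure out of four iterations (25 percent).
errors = [None, None, "timeout", None]
print(sla_passed(errors, 25.0))
print(sla_passed(errors, 0.0))
```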

Another point that attracted us to this project is the ability to run the tests as an end customer, without the administrator role. We can therefore put ourselves squarely in the shoes of our Public Cloud customers.

Our use cases

Performance measurement

Our first need is to qualify the API of an existing platform. We therefore run, several times an hour, a number of iterations of the Rally tests for each of OpenStack’s functional components, for all regions.

Software qualification

Another use case is when we need to patch code or apply security or software updates. In each case, it is difficult to measure the impact of these changes without tools to help us. For example, take the kernel updates relating to the latest security vulnerabilities (Spectre and Meltdown) that were said to lead to a decrease in performance. Rally now allows us to easily assess the potential impact.

Hardware qualification

We may also want to test a new range of physical servers to be used on OpenStack's “control plane”. Rally allows us to check if there is a variation in performance.

Measuring is all well and good, but...

Let's not forget that we want to visualise the change in response times over time. Rally can provide an HTML report on the execution of one scenario, in other words over a very short period of time. However, it is unable to aggregate the reports of all its executions.

We therefore needed a way to extract the data from the execution reports and summarise it in a graph. This is where our internal metrics platform comes in, based on Warp10 for storage and Grafana for dashboards.

We used the JSON export function built into Rally to extract the values measured during the tests and push them to the metrics platform. We then created a dashboard showing these response times for each test and by region, which lets us follow how they change over time and compare regions with one another. In regions close to one another (in France, for example: GRA, RBX and SBG), we should get approximately the same response times. If we don’t, we can look for the source of the difference and correct the problem.
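The conversion step can be sketched as follows. This is not our actual exporter: the structure of the parsed report is a simplified, hypothetical one (the real Rally JSON layout depends on the Rally version), and the class name "rally.duration" and the label names are assumptions. The output lines follow the Warp10 text input format, where each line holds a timestamp, a class name with labels, and a value.

```python
from urllib.parse import quote


def to_warp10_lines(report, region, timestamp_us):
    """Turn a simplified report into Warp10 input-format lines.

    report: list of {"scenario": str, "stats": {name: seconds}} entries
            (hypothetical structure, not the raw Rally export).
    timestamp_us: timestamp in microseconds, assumed here to be the
                  time unit configured on the platform.
    """
    lines = []
    for entry in report:
        scenario = quote(entry["scenario"], safe="")
        for stat, value in sorted(entry["stats"].items()):
            labels = "scenario={},region={},stat={}".format(
                scenario, quote(region, safe=""), quote(stat, safe="")
            )
            # The empty "//" section means no location and no elevation.
            lines.append("{}// rally.duration{{{}}} {}".format(timestamp_us, labels, value))
    return lines


# Hypothetical values for one test in one region.
sample_report = [
    {"scenario": "NovaServers.list_servers", "stats": {"median": 0.6, "p95": 2.9}},
]
lines = to_warp10_lines(sample_report, "GRA", 1530000000000000)
print("\n".join(lines))
```

Putting the region and the scenario in labels is what allows a dashboard to plot one curve per region for the same test.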

Internal dashboard aggregating Rally test results

Concrete case study

After setting up all the components, we compared the change in response times between the different regions. We noticed that, over time and in some regions, performance was deteriorating for certain tests in our project. For example, one test consists of listing all the instances of the Rally project. The average time was 600 ms, but in some regions it took 3 seconds.

We started by checking whether this problem affected only our project rather than all customers. Thankfully, it did.

After investigating further, we found that the bottleneck was in the database, in regions running the Juno version of OpenStack. It’s helpful to know that OpenStack performs a soft delete when deleting data. This means that it marks the data as deleted, but does not actually delete it from the table. In our case, the "instances" table includes the columns "project_id" and "deleted". When Rally lists the project servers, the query is of the type:

SELECT * FROM instances WHERE project_id = 'xxx' AND deleted = 0;

Unfortunately, on Juno versions of OpenStack, there is no index on ("project_id", "deleted") for this table, unlike in the Newton version of OpenStack. For the Rally project, the tests launch about 3,000 new instances every day in each region. After three months, we had about 270,000 soft-deleted instances in our databases. This large amount of data, combined with the missing index on the table, explains the latencies we saw in some regions (only those with Juno).
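The effect of the missing index can be reproduced on a small scale. The sketch below uses an in-memory SQLite database rather than the database behind OpenStack, a heavily simplified "instances" table, and an index name chosen for the example. It only shows how the query plan changes from a full table scan to an index lookup once a composite index exists.

```python
import sqlite3

LIST_SQL = "SELECT * FROM instances WHERE project_id = ? AND deleted = 0"


def query_plan(conn, sql, params):
    """Return the textual plan SQLite would use for the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances ("
    "id INTEGER PRIMARY KEY, project_id TEXT, display_name TEXT, deleted INTEGER)"
)
# 5,000 soft-deleted rows (deleted != 0) and 10 live rows for the same project.
conn.executemany(
    "INSERT INTO instances (project_id, display_name, deleted) VALUES (?, ?, ?)",
    [("rally", "old-%d" % i, 1) for i in range(5000)]
    + [("rally", "live-%d" % i, 0) for i in range(10)],
)

plan_before = query_plan(conn, LIST_SQL, ("rally",))
conn.execute(
    "CREATE INDEX instances_project_id_deleted_idx ON instances (project_id, deleted)"
)
plan_after = query_plan(conn, LIST_SQL, ("rally",))
live_rows = conn.execute(LIST_SQL, ("rally",)).fetchall()

print("before:", plan_before)
print("after: ", plan_after)
print("live instances:", len(live_rows))
```

Without the index, every soft-deleted row has to be read and discarded to find the few live ones, so the cost of listing servers grows with the number of deleted instances.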

We therefore took corrective measures, setting up a mechanism on our internal projects to permanently delete the data marked as soft-deleted. The result was immediate, with a four-fold reduction in the response time on the test that consists of listing the servers of the Rally project.

Significant improvement in response times in a Juno region for the Rally project

In this specific case, we will set up automatic archiving of soft-deleted data into the OpenStack shadow tables provided for this purpose. This will help any of our customers who could be affected by the same issue.
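The principle of archiving into a shadow table can be illustrated as follows. Nova provides its own tooling for this (the "nova-manage db archive_deleted_rows" command), so the sketch below is not what runs in production: it is a simplified model on SQLite, with a two-column-wide schema and a batch size picked for the example.

```python
import sqlite3


def archive_soft_deleted(conn, batch_size=500):
    """Move up to batch_size soft-deleted rows into shadow_instances.

    Returns the number of rows moved. Copy and delete happen in one
    transaction, so a row is never lost or duplicated.
    """
    with conn:
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM instances WHERE deleted != 0 ORDER BY id LIMIT ?",
                (batch_size,),
            )
        ]
        if not ids:
            return 0
        marks = ",".join("?" * len(ids))
        conn.execute(
            "INSERT INTO shadow_instances SELECT * FROM instances WHERE id IN (%s)" % marks,
            ids,
        )
        conn.execute("DELETE FROM instances WHERE id IN (%s)" % marks, ids)
    return len(ids)


conn = sqlite3.connect(":memory:")
schema = "(id INTEGER PRIMARY KEY, project_id TEXT, deleted INTEGER)"
conn.execute("CREATE TABLE instances " + schema)
conn.execute("CREATE TABLE shadow_instances " + schema)
conn.executemany(
    "INSERT INTO instances (project_id, deleted) VALUES (?, ?)",
    [("rally", 1)] * 1200 + [("rally", 0)] * 5,
)

moved = 0
while True:
    count = archive_soft_deleted(conn)
    if count == 0:
        break
    moved += count

remaining = conn.execute("SELECT COUNT(*) FROM instances").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM shadow_instances").fetchone()[0]
print(moved, remaining, archived)
```

Working in small batches keeps each transaction short, which limits the time the main table is locked while the archive runs.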

Thanks to this benchmarking tool, we are now able to spot any anomalies that may exist between regions and that result in different user experiences. We can then implement the necessary solutions to eliminate these differences, in order to achieve the best possible experience for all users in regions close to one another. With this tool, we naturally enter into a continuous improvement process to maintain and increase the quality of the experience of using our OpenStack APIs.