Behind the Scenes in an OVH Datacentre
Roubaix Valley is still covered in darkness. A distant voice echoes in the silence of the early morning. It's a robotic voice that OVH staff know well: MARCEL (1), the robot that monitors the hosting provider's infrastructures and uses loudspeakers and visual alarms to announce the slightest abnormality that might become a matter of urgency. Grégory's ears prick up as he gets out of his car, with the curiosity that comes with the job. After passing through three sets of security doors, with video, biometric and magnetic controls respectively, he pulls on his waistcoat and enters the control room of RBX1, one of five OVH datacentres in Roubaix. He greets his colleagues, who have been on duty since 22:00, before the night debriefing begins. This hand-over period lasts about an hour and is often interrupted by the first alarms of the day.
On this particular day, a server in room 08 suddenly livens things up. Having been alerted by the monitoring system, Grégory quickly makes his way over. When he enters the room, fluorescent lights illuminate the dark sea of servers, shimmering in their thousands with red and green diodes. He rushes to aisle H3, where he identifies the faulty server, seizes the VGA and USB cables hanging on the rail above and slides them along to reach the front of the machine. He plugs in the cables so that he can access the server from a computer located in the middle of the aisle. Diagnosis: the disk is showing the first signs of weakness. Using the management interface, Grégory opens a ticket for the customer concerned, so that they can schedule a disk replacement with the support team. This enables the customer to carry out the necessary backups and to choose a convenient time slot for the intervention, which will require a machine reboot (as hot-swapping and active backplanes are only available with the HG servers).
Back to the control room. Ten workstations face a wall covered with control screens. Some of them continuously display images from the video surveillance cameras at the building perimeters, which can detect even the slightest movement. The other screens display the temperature curves recorded in the various rooms, and figures that reflect the status of the servers in real time. A little higher, above the many monitors, the red figures of the digital counters indicate the electrical consumption of the 17 rooms of the datacentre. Lastly, a flashing blue emergency light is on standby, ready to flag up any damage to the various electrical inlets of RBX1, which on average occurs four times a year, and which OVH is, of course, very well prepared for. If necessary, UPS devices are also ready to supply servers with the required power during switchover to the backup power line. In the unfortunate event of that failing too, five electrical generators would then power up to stand in for EDF until power is restored.
Grégory's terminal shows that five servers have ceased to ping. He races through the noisy aisles of the datacentre once again. Knowing the place like the back of his hand, he groups the interventions by room, in order to make the most of each visit. The first two servers only need a soft reboot (a system reboot) to start pinging again. After a hard reboot (cutting the power supply for 15 seconds), the third server no longer shows any sign of weakness. Grégory notes the purpose of each intervention and the actions carried out on the server in the management interface, which immediately triggers an email report to the customer. Most of the time, he doesn't even have time to record the brief service interruption. The fourth and fifth servers require hard disk replacements, which the support team will quickly arrange with the affected customers.
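The escalation path described above (soft reboot, then hard reboot, then a hardware ticket) can be sketched in a few lines. This is a minimal illustration, not OVH's actual tooling: the `Server` class, the `escalate` function and the way recovery is simulated are all hypothetical.

```python
# Hypothetical sketch of the intervention escalation described above:
# a server that stops pinging gets a soft reboot first, then a hard
# reboot (power cut for ~15 s), and only then a support ticket for a
# hardware swap. Real monitoring would ping over the network; here a
# simulated server stands in so the logic is self-contained.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Server:
    name: str
    # How many reboots this simulated server needs before it pings
    # again; None means reboots won't help (failed hardware).
    recovers_after: Optional[int]
    attempts: int = 0

    def pings(self) -> bool:
        return (self.recovers_after is not None
                and self.attempts >= self.recovers_after)

    def reboot(self) -> None:
        self.attempts += 1

def escalate(server: Server) -> str:
    """Return the action that restored ping, or 'open ticket' if none did."""
    if server.pings():
        return "ok"
    server.reboot()               # soft reboot: restart the OS
    if server.pings():
        return "soft reboot"
    server.reboot()               # hard reboot: cut power for ~15 seconds
    if server.pings():
        return "hard reboot"
    return "open ticket"          # hand over to support for a disk swap
```

With this sketch, a server that recovers on the first restart reports "soft reboot", one that needs the power cut reports "hard reboot", and one that never recovers ends in "open ticket", mirroring the three outcomes in Grégory's round.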
Grégory goes back to the control room to find his barely-touched coffee is now stone cold. One consolation: the monitoring system is all green. Is it time to breathe a sigh of relief? Not quite: the technician now consults the operations scheduled for the day. On the agenda, there are curative measures, mainly hard disk changes, and preventive measures, such as replacing the cartridges that filter the water in the server cooling system. This weekly operation preserves the quality of the demineralised liquid circulating in a closed loop, which is also subjected to regular chemical screening. While the old filter is being replaced by the new one, a diversion is set up so that the water keeps circulating and evacuating the heat released by the machines. Maintenance work is totally transparent for the customers.
Reinforcements arrive! Not all technicians work 3 x 8 shift patterns. During the day, additional technicians who move from one datacentre to another provide support as needed. These permanent members of staff are mainly responsible for heavy maintenance tasks, such as testing and routine maintenance of the generators, the UPS devices and the heat exchangers of the water cooling system. They can also be assigned to constructing and fitting out new rooms, or to installing new servers and wiring, and they sometimes have to support the mobile teams when issues arise.
The customer who was advised at 06:45 that their hard disk was on its last legs has just agreed to an intervention, which will take place within the next two hours. Before attending to the machine, Grégory makes a detour for spare parts, picks up a replacement disk and records it in the management interface inventory. Once on site, he cuts off the power supply to the server and slides it out of the compartment so that he can operate on it. Once the new disk is connected, he starts to reinstall the operating system on the machine, then he writes his report and returns to the storeroom. He packs up the faulty hard drive and attaches a label to it noting the server number and the date, before storing it away in a secure cabinet. Defective disks are kept there for two months, in case customers want to retrieve them so that specialised companies can extract the data. If they are not sent out to the customer, they are sent to the retrofitting department for processing, like all faulty parts extracted from the servers. OVH can then make use of the manufacturers' warranties, or recycle defective components that are out of warranty.
A maintenance engineer will cover a distance of 5-10 km every day across the datacentre walkways. So when it's time to grab a bite to eat at OVH Kanteen, Grégory can allow himself a starter, main course and dessert without worrying about his waistline. While a colleague stands in for him, he makes the most of this break to chat to his counterparts from the four other Roubaix datacentres. This is also an opportunity to mention the demands of the job: “Our work requires us to be in good physical condition, to have precision in handling the servers and a real sense of organisation. You also have to adapt your way of life to irregular working hours. Loving IT and action is one thing but spending Christmas Eve in a datacentre is another!” However, Grégory is not lacking motivation and, given his past experience of military life in the barracks, where he maintained the telecommunications infrastructure, he actually finds that he is better off with his current schedule: “We work afternoon shifts for five consecutive days, followed by three days' rest. Then five morning shifts and two days off, and finally five nights followed by four days off. It actually leaves you with some free time!” As the job is demanding and working shifts can be difficult to harmonise with family life, some engineers change position within the company after a few years. OVH facilitates this type of internal career move as often as possible.
Grégory logs into his terminal and catches up on any actions taken in his absence. While his colleague is busy dealing with alerts on servers that have ceased to ping, he can get on with the interventions scheduled by the support team: installing a KVM switch on one server [a device that enables direct access to the server via the graphical user interface, as though the machine were right in front of you], and setting up a private firewall on another. Typical routine! The technicians are usually able to perform interventions very quickly, as everything has been designed to optimise maintenance work and installations, from the layout of the datacentre to the servers themselves. All connections are gathered on one side of the server. The servers are also placed in simple frames rather than in closed racks, as opening racks would slow down access to the internal components. The cable ducts that are usually hidden behind dropped ceilings can also be accessed directly here. These sacrifices to the aesthetics of the datacentres all increase the quality of service delivered to OVH customers.
It's time for Grégory to dedicate himself to a background task that technicians spend several hours on every day: server updates. The task consists of unracking servers that customers have finished with, usually because they are moving to a more recent machine or a higher range. The “obsolete” servers are brought back to the datacentre workshop to be refurbished, by way of partial or total replacement of the components. The parts stripped away are sent off to retrofitting for recycling. Counting both the refurbished servers and the new ones fresh off the Roubaix production line, several hundred servers are set up in OVH datacentres every day.
Grégory's replacement has arrived. It's time to report to the team leader on the eight hours since his shift began. During these discussions, technicians regularly put forward ideas on how to optimise their work. These details often save precious time, such as the decision to shorten the casing of 0.5 U servers, after technicians pointed out that they slid awkwardly during interventions. Or the pipework of the liquid cooling system, which now passes in front of the servers, simplifying detachment from the machine. All of these optimisations can be observed by wandering around the various rooms of the datacentres, a reminder that R&D is practised on a daily basis in OVH datacentres.
Grégory's working day is over but his colleagues on the afternoon shift have another busy day ahead of them. On average, 750 interventions are carried out every day on the 150,000 servers hosted in the 5 OVH datacentres in Roubaix, within just half an hour of the alert being triggered. This performance far outstrips OVH's contractual obligations and is only made possible by an army of devoted technicians that ceaselessly keep watch over the net.
(1) MARCEL is the French acronym for "Monitoring Audio des Réseaux Composants Équipements et Locaux" (Audio Monitoring of Networks, Components, Devices and Premises).