PaaS Logs: an inside look at our log management infrastructure

Businesses from the industrial sector and startups are at a crossroads: as the demand for cloud infrastructures and services continues to grow (IDC estimates the market will reach 113 billion dollars by 2019), the need for log management solutions is evolving and becoming increasingly industrialised.

At OVH, these needs emerged quickly, and with good reason: with more than 240,000 physical servers across 2 continents, more than 1 million customers around the world, and a network and infrastructure covering all the major points of presence on the planet, the ability to follow, store and analyse relevant information about the state and performance of the infrastructure has become a crucial challenge.

Out of these needs was born project Thoth, an integrated, scalable and massively distributed log management solution which gathers, processes and visualises data from the logs of machines, services or products. In March 2015, at the start of the project, a dedicated team was created to oversee the development of Thoth. Learn what Pierre De Paepe, the team’s manager, has to say about the project.

What is a log?

A log is, to use the most commonly accepted definition, a computer representation of an event. It must be human readable but also sufficiently structured to be parsed by a machine. A log can be created by various entities: a connected device with sensors, a jet engine (to check the wear rate of critical parts) or a tool measuring the click rate on a product page. If there’s an event, there is potentially a log of it somewhere.
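
As a minimal sketch of that dual requirement, the Python snippet below writes one event as a single JSON line, which a person can read and a program can parse back. The field names (timestamp, host, sensor, value) and the host name are hypothetical; they are not Thoth's actual schema.

```python
import json
from datetime import datetime, timezone


def make_log(host: str, sensor: str, value: float, when: datetime) -> str:
    """Serialise one event as a single JSON line (hypothetical field names)."""
    event = {
        "timestamp": when.isoformat(),  # ISO 8601 string
        "host": host,
        "sensor": sensor,
        "value": value,
    }
    return json.dumps(event, sort_keys=True)


def parse_log(line: str) -> dict:
    """Parse the JSON line back into a dictionary for machine processing."""
    return json.loads(line)


when = datetime(2015, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
line = make_log("server-42", "cpu_temperature_celsius", 61.5, when)
event = parse_log(line)
```

JSON is only one possible choice here; any format that stays legible and has an unambiguous structure would illustrate the same point.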

At OVH, what we call a “log” is a computer event where only relevant data necessary for the maintenance of the infrastructure are taken into account: we do not log any personal or sensitive data in our service.

The logs managed by the Thoth solution are strictly limited to the technical aspects of our infrastructure: bandwidth, hard drive health, CPU temperatures, etc. The solution does not cover the logs concerning our legal obligations in compliance with CNIL (Commission Nationale de l'Informatique et des Libertés).

What are some of the challenges encountered when dealing with huge volumes of logs on a large-scale infrastructure such as OVH’s?

Log management is indeed a question of scale. Not all companies necessarily need it: extracting the log files of 10 machines in a small advertising agency is not technically complex. Everything can be stored on one hard drive and a human can easily analyse the logs. That is not the case when you wish to keep an eye on the health of hundreds of thousands of physical servers running a fluctuating number of virtual machines. At this scale, log analysis makes it possible for us to quickly identify issues before they propagate, to anticipate failures through a predictable or previously encountered pattern, or to centralise the reporting of remote machines. The possibilities are endless.

Did you have any specifications prior to developing the solution?

When discussing the issues of logs and internal processing, especially with the departments which generate extremely high volumes of logs, we realised that our colleagues had to perform query operations manually and that there was no way to automate tasks at scale for over a billion files per day. Naturally, we looked into the commercial solutions available and found nothing appealing. Not only were the offers expensive to set up, but they were also proprietary, so we could not consider adapting any of them to our specific needs.

The requirement for security was also a major reason that convinced us that developing our platform internally was the way to go. Several hundred thousand logs per second pass behind our firewall, and our Anti-DDoS protection undeniably contributes to the team's peace of mind: we isolate and analyse only the data relevant to maintaining the infrastructure.

"We noticed that our colleagues had to manually perform queries and that there was no way of automating certain tasks on a scale which goes beyond 1 billion files per day."

What are the technologies behind Thoth?

Simply put, we quickly realised that we needed three main elements that we’ll call “the cornerstone”: a log retrieving solution (INPUT), a log parsing and indexing service and a data visualisation, alerting and reporting tool to leverage the data from those logs (OUTPUT).

To go into more detail, some departments use Syslog-ng to retrieve logs. Docker containers host the services responsible for validating the logs coming from Syslog-ng, making sure they are formatted correctly and enriching them with additional fields if needed. Those services are also responsible for the scalability of the infrastructure, and their self-healing properties ensure the high availability of the solution, thus future-proofing it for our long-term needs.
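
To give an idea of what such a validation and enrichment step could look like, here is a minimal Python sketch. It is not the code running in our containers: the required fields, the extra "datacentre" field and the function name are all assumptions made for illustration.

```python
from typing import Optional

# Assumed minimal schema, for illustration only.
REQUIRED_FIELDS = ("timestamp", "host", "message")


def validate_and_enrich(record: dict, datacentre: str) -> Optional[dict]:
    """Return an enriched copy of the record, or None if it is malformed."""
    # Validation: every required field must be present and non-empty.
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return None
    enriched = dict(record)  # leave the caller's record untouched
    # Formatting: normalise the message text.
    enriched["message"] = str(record["message"]).strip()
    # Enrichment: add a field that the original log did not carry.
    enriched["datacentre"] = datacentre
    return enriched


good = validate_and_enrich(
    {"timestamp": "2015-03-01T12:00:00+00:00", "host": "server-42", "message": " disk ok \n"},
    "example-dc",
)
bad = validate_and_enrich({"host": "server-42"}, "example-dc")
```

In this sketch a malformed record is simply dropped by returning None; a real pipeline might instead route it to a separate queue for inspection.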

Last but not least, Graylog is the most widely used option for creating real-time data visualisation dashboards, accessible through a web browser via HTTPS. It should be noted that in OUTPUT, we chose Elasticsearch in order to provide our authorised personnel with an advanced real-time log search solution: they are able to query the system either via a command line interface or through several visualisation tools at their disposal (Grafana, Kibana, etc.).

Thoth architecture

Which departments are the biggest Thoth “clients”?

Without a doubt, the CDN and shared hosting teams lead the pack when it comes to volume, with more than 200,000 logs per second, every day. Their usage is exactly what one would expect from an industrial structure, where such volumes allow for new workflow dynamics. Within the CDN department, all they need to do now is sort their logs by region to determine where their biggest data users are located. With that information, the teams are then able to focus their decision making on ways to react to these “hotspots” on the infrastructure. Within the shared hosting department, the ability to monitor the health of hosts through hundreds of thousands of logs gives the authorised members of the team a strategic vision of their clusters. By analysing trends and identifying failure points before they become an issue, collaborators are no longer reacting: they become proactive and the entire department benefits from this increased flexibility.
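
The CDN use case boils down to grouping logs by region and ranking the totals. The Python sketch below shows the idea on a handful of made-up records; the field names ("region", "bytes_sent") and the region labels are hypothetical, and at real volumes this kind of aggregation would be delegated to the search cluster rather than done in a script.

```python
from collections import Counter


def top_regions(logs, n=3):
    """Sum the traffic per region and return the n busiest regions."""
    totals = Counter()
    for log in logs:
        totals[log["region"]] += log["bytes_sent"]
    return totals.most_common(n)


# Made-up sample records, for illustration only.
sample = [
    {"region": "eu-west", "bytes_sent": 500},
    {"region": "us-east", "bytes_sent": 200},
    {"region": "eu-west", "bytes_sent": 300},
    {"region": "ap-south", "bytes_sent": 100},
]
hotspots = top_regions(sample, n=2)
```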

Last but not least, the Over The Box (OTB) department manages the logs of the hardware that was either sold by us or comes from our clients' DIY kits plugged into our infrastructure. From a single dashboard that centralises all of the hardware’s health data from our solution, our authorised engineers are able to identify recurring issues and review the history of the hardware involved, so that they can identify the root causes and develop corresponding corrective actions.

The end game is ambitious: make Thoth available across the board for every department that has identified a need for data management, event prediction or trend projection on complex issues.

"The ability to monitor the logs of hundreds of thousands of servers provides the team with a strategic vision of the system."

When we talk about queries or questions, what are they exactly?

Building a log management infrastructure is nice, but it is basically useless if you don’t know what to look for and how to structure the question for a specific answer. This is why it is important to set up the meaningful fields properly (date, time, CPU cycles, hard drive temperatures) and separate the valuable logs from those of lesser interest.

Then, best practices allow us to come up with queries accurate enough for the system to provide the best possible answer to help us solve an issue: standardise date and time conventions in line with the ISO 8601 standard, limit the maximum number of fields, and prevent the logging of passwords and sensitive data. The more thorough our best practices are, the easier it will be down the road to manage and query our logs.
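
These three practices can be sketched in a few lines of Python. The field limit of 20 and the list of sensitive key names are assumptions chosen for the example, not the values used in Thoth.

```python
from datetime import datetime, timedelta, timezone

MAX_FIELDS = 20  # assumed limit, chosen for illustration only
SENSITIVE_KEYS = {"password", "passwd", "secret", "token"}  # assumed list


def sanitise(record: dict, when: datetime) -> dict:
    """Apply the three practices to one record; `when` must be timezone-aware."""
    # Drop any field whose name suggests a credential.
    clean = {k: v for k, v in record.items() if k.lower() not in SENSITIVE_KEYS}
    # Store the event time as an ISO 8601 string in UTC.
    clean["timestamp"] = when.astimezone(timezone.utc).isoformat()
    # Refuse records that carry too many fields.
    if len(clean) > MAX_FIELDS:
        raise ValueError(f"too many fields: {len(clean)} > {MAX_FIELDS}")
    return clean


# 13:00 at UTC+1 is stored as 12:00 UTC.
local_time = datetime(2015, 3, 1, 13, 0, tzinfo=timezone(timedelta(hours=1)))
clean = sanitise({"host": "server-42", "password": "hunter2", "hdd_temp": 41}, local_time)
```

Filtering on key names is a simple safeguard and will not catch a password embedded in a free-text message, which is one more reason to keep fields well defined.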

An example Grafana dashboard

Since logs are managed and available in real time, is dashboard monitoring a second job?

Hopefully not! We don’t need to keep our eyes on the billions of logs that transit through the infrastructure every day because our solution acts like a sentinel: Thoth and Graylog allow us to set up alerts based on specific indicators that we designed. The system can then inform us when specific criteria are matched by some logs, which once again confirms the importance of log indexing best practices.
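
The logic behind such an alert can be summed up as "notify when enough logs cross a threshold". The Python sketch below illustrates that logic only; it does not use Graylog's alerting configuration, and the field name "hdd_temp" and the threshold values are hypothetical.

```python
def matching_logs(logs, field, threshold):
    """Return the logs whose numeric field is strictly above the threshold."""
    return [
        log for log in logs
        if log.get(field) is not None and log[field] > threshold
    ]


def should_alert(logs, field, threshold, min_matches=1):
    """Trigger when at least min_matches logs exceed the threshold."""
    return len(matching_logs(logs, field, threshold)) >= min_matches


# Made-up sample of recent logs; the last one lacks the monitored field.
recent = [
    {"host": "a", "hdd_temp": 38},
    {"host": "b", "hdd_temp": 57},
    {"host": "c", "hdd_temp": 61},
    {"host": "d"},
]
alert = should_alert(recent, "hdd_temp", threshold=55, min_matches=2)
```

Logs that lack the monitored field are ignored rather than treated as matches, which is why consistent field naming matters for reliable alerts.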

Is a full-fledged log management solution needed for small businesses?

To answer this question, we first need to define the context and identify what we mean by “needs”. Depending on its activity, size and goals, a small business may find the ability to aggregate and analyse data crucial.

For instance, let’s take an online shopping company. We’ve already established that a log is more or less a human readable, machine parsable representation of an event. There’s virtually no limit to what can be logged: it can be time spent on the page (thanks to dedicated fields in the naming framework of the log files), mouse actions (mouse over, click...), or queries in the internal search engine. All of those inputs are valuable for a traffic acquisition manager or a user experience expert who can then analyse the data and determine the best course of action for a problem that was previously only hinted at by declining sales figures.

"All of those inputs are valuable for a traffic acquisition manager or a user experience expert who can then analyse the data and determine the best course of action."

Thanks to a real-time analysis tool like Thoth, a traffic acquisition expert could query the system to find out which customers failed to complete their order at the checkout page. After further investigation, they would find out that this page is loaded with so many images that it takes 12 seconds on average to load, probably because the database cannot keep up (information provided by the customers’ waiting time logs). A 50% improvement in database performance alongside some adjustments to the size of the pictures would allow the teams to lower page loading time to just 3 seconds, effectively reducing customers' frustration.

To conclude, in a real-world scenario, the data provided by Thoth can help identify the root causes of a problem: is it the database that is responsible for the slow load time, or the image size? Thanks to an advanced log management tool, decision makers have the means to act efficiently upon such issues with the appropriate responses.

PaaS Logs available in open Beta on RunAbove

A log management solution can become a decision making tool: whether it is called “Dashboard Analytics” or “Predictive Data Mastermind Infrastructure”, the bottom line is the same: to display, in a comprehensive and visual layout, the right information on which to base informed decisions. So, to answer the question: the size of a business doesn’t necessarily reflect its needs. Its goals and usages are the primary elements to take into account: IoT startups are amongst the first to have realised this.

The Thoth infrastructure in numbers

164 nodes within an Elasticsearch cluster
180 connected machines
507 TB of logs stored within the infrastructure
Between 100,000 and 300,000 logs processed per second
12 billion logs handled through our solution each day
211 billion stored documents
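
As a quick sanity check, these figures can be cross-checked with a little arithmetic. The short Python sketch below does so under two loudly stated assumptions: "TB" is read as decimal terabytes (10^12 bytes), and the stored volume is divided evenly across documents, ignoring replication and index overhead. The per-document size is therefore only a rough order of magnitude.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

logs_per_day = 12_000_000_000          # 12 billion logs handled each day
stored_bytes = 507 * 10**12            # 507 TB, assuming decimal terabytes
stored_documents = 211_000_000_000     # 211 billion stored documents

# Average ingestion rate implied by the daily total.
average_logs_per_second = logs_per_day / SECONDS_PER_DAY

# Rough average footprint of one stored document (overheads ignored).
average_bytes_per_document = stored_bytes / stored_documents

print(f"{average_logs_per_second:,.0f} logs per second on average")
print(f"{average_bytes_per_document:,.0f} bytes per stored document on average")
```

The daily total works out to roughly 139,000 logs per second on average, which sits inside the 100,000 to 300,000 range quoted above, and the storage figures imply roughly 2.4 KB per stored document.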

Since March, we have offered the PaaS Logs project on our RunAbove page. It is the fully integrated, managed and public version of our internal tool, Thoth. Whether you are a DevOps engineer, System Administrator, Data Analyst or Traffic Acquisition expert, PaaS Logs is a versatile tool that can support your day-to-day work. Our teams, which are currently improving the system, are developing new features on an open source basis. The community is very active: on our page, for instance, you can browse databases of user-made and in-house scripts, APIs and widget templates to get started with your own implementation of PaaS Logs.

Our engineers and public beta testers exchange feedback and experiences daily. To join the community, head to our PaaS Logs page: RunAbove

Editor’s Note: In Egyptian mythology, Thoth is the god of time and the scribe of the gods. Consequently, he is the one who possesses knowledge, and he is also responsible for sharing that knowledge. Thoth’s role in Egyptian mythology was crucial: he is said to be the inventor of language and writing.

source: IDC