Our real-time log delivery system

When I started at Transparent Edge, one of my first tasks was to implement a real-time log delivery system. It needed to be capable of distributing billions of logs or records daily, with very low latency and high performance. We all agreed that Apache Kafka should be at its core.

The best thing about working for a company that offers a live, constantly evolving product is that you can watch it grow firsthand, often as part of the process. Add to that a team that involves you in every stage, and you gain a global vision that lets you learn a wide range of new technologies, as well as how and when to apply them.

HOW TO IMPLEMENT A REAL-TIME LOG SYSTEM, STEP BY STEP

DOCUMENTATION AND FIRST TOUCHPOINT WITH APACHE KAFKA

If anything characterizes the world of technology, and software in particular, it's that it is constantly changing. Every day there's a new app, a new framework that promises to be better than the existing ones, or even a new programming language.

This is why it is essential to know how to consult the existing documentation in all its forms, whether on a project's website, in a Git repository's README, or in a man page. Kafka was no exception.

Soon, questions started to arise: what are the requirements? How many nodes do we need to achieve high availability? How do I authenticate and authorize users? Thanks to the documentation I was able to answer these questions fairly quickly, and we soon had an operational cluster, prepared to authenticate and authorize not only our clients, but also the future "producers" that would ingest real-time logs from the different CDN servers.
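As an illustration of what the authentication side can look like from a client's point of view, here is a minimal sketch of the connection settings an authorized producer or consumer might use, assuming authentication with TLS client certificates; the broker hostnames and certificate paths are placeholders, not our actual setup:

secure_connection_config = {
    # Hypothetical mutual-TLS settings for a librdkafka-based client.
    "bootstrap.servers": "kafka-1.example.com:9093,kafka-2.example.com:9093,kafka-3.example.com:9093",
    "security.protocol": "SSL",
    "ssl.ca.location": "/etc/kafka/certs/ca.pem",
    "ssl.certificate.location": "/etc/kafka/certs/client.pem",
    "ssl.key.location": "/etc/kafka/certs/client.key",
}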

INGESTING LOGS INTO KAFKA

The next phase was to implement a system that would ingest the logs generated on the platform in the lightest way possible, and this posed a couple of challenges.

Our platform has various types of servers depending on their purpose, and not all of them run the same software or generate the same type of log. A server on our media platform, optimized for live streaming, is not the same as one on our delivery platform, optimized for web caching and low latency.

The system had to be flexible enough to adapt easily to potential new kinds of servers or log formats, and also very light, since it would run at the edge, where every CPU cycle counts.

Regarding performance, luckily there's librdkafka, a library that implements the Kafka protocol in C, and confluent-kafka-python, which uses that library to achieve optimal performance even from Python, a very flexible language.
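As a taste of how thin the wrapper is, a minimal confluent-kafka-python producer can be sketched in a handful of lines (the broker address and topic are placeholders):

from confluent_kafka import Producer

# Minimal sketch: produce() hands the message to librdkafka,
# which batches and sends it from a background C thread.
producer = Producer({"bootstrap.servers": "kafka-1.example.com:9093"})
producer.produce("example-topic", value=b"one log line")
producer.flush()  # wait until the message has actually been delivered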

Again we turned to the documentation, in this case for librdkafka and confluent-kafka-python, and our producer slowly started to take shape. Its name is a nod to the producer and consumer concepts in Kafka.

To retrieve the logs, the process detects the type of platform it's running on and uses the appropriate method. For example, on our media platform it uses named pipes, while on our delivery platform it directly executes a logging utility that dumps the Varnish logs straight from memory.
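A rough sketch of that idea follows; the detection logic, the pipe path, and the bare varnishncsa invocation are simplified assumptions rather than our exact code:

import subprocess

def open_log_source(platform):
    # Media servers write their logs to a named pipe (FIFO) that we simply read.
    if platform == "media":
        return open("/var/run/producer/logs.pipe", "r")
    # Delivery servers: varnishncsa reads the Varnish logs from shared memory
    # and writes them line by line to stdout.
    process = subprocess.Popen(["varnishncsa"], stdout=subprocess.PIPE,
                               text=True, bufsize=1)
    return process.stdout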

IDENTIFYING THE CLIENT FOR REAL-TIME LOG DELIVERY

All logs on our platform have a field that represents the client's identifier. When the producer receives one of these logs, it must determine which client it belongs to, so that it can send it to the correct queue (in Kafka this is called a "topic").

The process of determining the client for each log runs in real time for each and every one of the millions of logs that our platform generates daily. And we're using Python, an interpreted high-level language that doesn't offer high performance (although it does compile certain parts to bytecode).

To avoid bottlenecks and minimize CPU usage, we prepend the client's identifier to each log the producer receives, and extract it in this simple way:

for data in self._varnishncsa.stdout:
    clientid, logstring = data.split(" ", 1)
    self._produce_to_kafka(clientid, logstring)

In other words, a single split gives us both the client's identifier and the log itself, already separated.
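The _produce_to_kafka method itself isn't shown here; a plausible sketch, assuming one topic per client (the "logs-" prefix is an invented convention) and a confluent-kafka Producer instance, could look like this:

def _produce_to_kafka(self, clientid, logstring):
    # One exclusive topic per client; the naming convention is illustrative.
    topic = "logs-" + clientid
    self._producer.produce(topic, value=logstring.encode("utf-8"))
    # Serve delivery callbacks without blocking the read loop.
    self._producer.poll(0)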

The rest of the work is delegated to the Kafka client library, which takes care of collecting, compressing, and sending the logs to Kafka, with just a little tweaking of the configuration:

"compression.codec": "gzip",

"retry.backoff.ms": 3000,

"queue.buffering.max.messages": 250000,

"queue.buffering.max.ms": 1000,  # alias of linger.ms

"batch.num.messages": 10000,

"topic.metadata.refresh.interval.ms": 150000,

"socket.timeout.ms": 45000,

"socket.keepalive.enable": True,

This is what ultimately allows us to ingest the logs into the Kafka cluster without stealing precious CPU cycles from the edge nodes.

AUTOMATION AND FINAL STEPS FOR REAL-TIME LOG DELIVERY

Lastly, all that remained was to join these components together and automate the onboarding process for clients who want to consume their logs in real time.

For this, we implemented a new section in our panel within the log delivery service: alongside the existing daily delivery service, there is now the streaming service.

When one of our clients subscribes (in a fully transparent and cost-free way) to our log-streaming service, a request is sent to our API that adds them to the list of clients with the service active. This triggers a series of processes:

  • An exclusive topic is created for said client.
  • ACLs are established in the Kafka cluster that allow them to consume from that topic (a sketch of these two steps follows this list).
  • On each edge server, the producer updates its list of clients with the service enabled, so that when it receives a log belonging to one of them, it sends it to the corresponding topic.
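A minimal sketch of those first two steps using confluent-kafka-python's admin API, where the topic name, client principal, and partition and replication counts are illustrative assumptions:

from confluent_kafka.admin import (AdminClient, NewTopic, AclBinding, ResourceType,
                                   ResourcePatternType, AclOperation, AclPermissionType)

admin = AdminClient({"bootstrap.servers": "kafka-1.example.com:9093"})

# 1. An exclusive topic for the client (name, partitions and replication are placeholders).
admin.create_topics([NewTopic("logs-client123", num_partitions=3, replication_factor=3)])

# 2. An ACL that allows that client, and only that client, to read from its topic.
admin.create_acls([AclBinding(ResourceType.TOPIC, "logs-client123",
                              ResourcePatternType.LITERAL, "User:client123", "*",
                              AclOperation.READ, AclPermissionType.ALLOW)])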

The whole process takes no more than five minutes. Once the service is active, the panel displays the following to our client:

Here they can download an automatically generated zip file that includes everything needed to start consuming their logs in real time: preconfigured examples for different Kafka-compatible applications, such as Filebeat, Logstash, or a custom Python script, plus all the certificates required to authenticate. And that's how we got everything ready: Transparent Edge finally had its real-time log delivery service.
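For the curious, that custom Python script could look roughly like the following minimal consumer sketch, assuming TLS certificates and a per-client topic; every name, path, and address here is a placeholder:

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka-1.example.com:9093",
    "group.id": "client123-logs",
    "security.protocol": "SSL",
    "ssl.ca.location": "ca.pem",
    "ssl.certificate.location": "client.pem",
    "ssl.key.location": "client.key",
    "auto.offset.reset": "latest",  # start from new logs only
})
consumer.subscribe(["logs-client123"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    print(msg.value().decode("utf-8"))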

Manu Sánchez Pinar is a Linux SysAdmin – DevOps at Transparent Edge. 

If Manu had been sent to chase the Terminator instead of the T-1000, the movie would have lasted five minutes, and he would have been running NixOS. Analytical, methodical, and an open source enthusiast, he administers and grows Transparent Edge's infrastructure and services, keeping them as clean as a summer sky.