Introduction
Starting with this article, we’re going to talk about Apache Kafka. The purpose of this series is to help you understand Kafka and use it in real projects with any programming language, especially on the .NET platform.
Before diving into the details, let’s try to understand what Kafka is and why you might need it.
Everything starts with the definition of Source and Target
In its simplest form, writing a program in most cases means building something that takes in what the business requires, processes it (create, modify, save), and stores the processed data in some storage. This does not depend on the type of application you’re developing. In the end, you have some kind of Source (an interface or input canvas) and some kind of storage (the Target) for the data.
To make this more concrete, let’s say you have a simple web API. In most cases, the API accepts input, validates and transforms it, and stores the data in some storage (a database, a file, another web service, etc.).
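The accept, validate, transform, and store flow can be sketched in a few lines. This is only an illustration: `handle_request`, `storage`, and the `name` field are hypothetical names chosen for the example, and a plain list stands in for the real storage.

```python
# Hypothetical names throughout: handle_request, storage, and the "name"
# field are illustrative and not taken from any real framework.

storage = []  # the Target: stands in for a DB, a file, or another service


def handle_request(payload: dict) -> dict:
    """Act as the Source: accept input, validate, transform, then store."""
    # Validation: reject input that lacks a usable "name" field.
    if "name" not in payload or not str(payload["name"]).strip():
        raise ValueError("name is required")

    # Transformation: normalize the input into the shape we want to store.
    record = {"name": str(payload["name"]).strip().title()}

    # Storage: hand the processed data over to the Target.
    storage.append(record)
    return record


saved = handle_request({"name": "  ada lovelace "})
```

Here the function plays the role of the Source and the list plays the role of the Target; swapping the list for a database client would not change the shape of the flow.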
In simple terms, the programs we use in daily life (Photoshop, Notepad++, and similar apps) act as Sources, and their storage is simply the Target.
The thing to understand is that Source and Target depend on your perspective. From the backend perspective, the API is a Source and the database is a Target. From the frontend perspective, the UI is a Source and the API can be a Target.
Viewed from 10,000 feet, the skeleton of any application consists of a Source and a Target. If it were possible to take an X-ray of any application, you would see that its main building blocks are Source and Target.
But as business requirements evolve, these processes can no longer be handled between one Source and one Target.
The problem moves into a larger context, one with multiple Sources and Targets, and complexity increases.
Now suppose there is one Source and several Targets. Or, in a microservice environment, you have multiple Sources (every API acts as a Source) and multiple Targets (every API can have multiple Targets to write to).
Scaling Sources and Targets always brings a communication problem with it. The problem is that we, as programmers, have to deal with the complexities created by Sources and Targets rather than focus on implementing business requirements.
So, what exactly are these complexities?
- Communication complexity - With multiple Sources and Targets, the following issues can arise:
  - Each Target can require a different communication protocol
  - Each Target has its own data format
  - Each Target requires its own maintenance and support
In simple terms, say you have a microservice application in which every service has its own Target. Besides that, every service can have multiple Sources, and services can share common Sources.
- Communication complexity duplication - Whenever similar systems are developed, we have to rewrite the same communication logic again and again. Imagine that we are working on several different projects. Although their domains differ and they solve different problems at an abstract level, what these projects have in common is communication complexity. So we keep repeating ourselves, trying to resolve the same issue every time.
- Fault tolerance - The system should keep functioning, with reliable data processing and message delivery, even in the presence of failures such as hardware failures, network issues, or software crashes.
- High performance - In most cases, this kind of communication problem (Sources - Targets) causes application performance to drop. Regardless of how the number of Targets and Sources changes, the program should always sustain high performance.
- Scalability - It should be possible to scale Sources and Targets horizontally.
- Real-time communication - One possible requirement for communication between Source and Target is that it happens in real time. Depending on the use case, the system should allow real-time data exchange between Source and Target.
- Log and data aggregation - The ability to combine and process logs and data into aggregates.
- Data transformation and processing - Communication between Source and Target is not just an exchange of data; it should also be possible to transform the data along the way.
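The first item in the list, communication complexity, can be made concrete by counting connections. The sketch below assumes the worst case, in which every Source talks to every Target directly, and compares it with a setup where each side talks only to one shared middle layer. The function names are hypothetical and exist only for this illustration.

```python
def point_to_point_links(sources: int, targets: int) -> int:
    """Worst case: every Source is wired directly to every Target."""
    return sources * targets


def mediated_links(sources: int, targets: int) -> int:
    """Every Source and every Target is wired only to a shared middle layer."""
    return sources + targets


# Compare the two setups for a few sizes of system.
for n_sources, n_targets in [(1, 1), (3, 4), (10, 10)]:
    print(
        f"{n_sources} sources, {n_targets} targets: "
        f"direct={point_to_point_links(n_sources, n_targets)}, "
        f"mediated={mediated_links(n_sources, n_targets)}"
    )
```

With one Source and one Target the middle layer adds a connection instead of saving one, which matches the earlier point that the problem only appears once Sources and Targets multiply. At 10 Sources and 10 Targets, the direct setup needs 100 connections and the mediated one needs 20.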
Think about how many people in the world are trying to solve these same problems, each of us separately. Although we implement different business processes, we all end up dealing with the same complexities listed above.
Since the IT field does not like to solve the same issue over and over again, ready-made solutions to the problems above already exist. One of these off-the-shelf solutions is Kafka.
Kafka is an event streaming platform initially created by LinkedIn, and it is maintained as an open-source project under the Apache Software Foundation. Confluent, a company founded by the original creators of Kafka, provides commercial products and services related to Kafka while also contributing to its development and the Kafka ecosystem.
As you can see from the above image, Kafka encapsulates the communication between Sources and Targets and makes it fairly easy.
Long story short, Kafka addresses all the issues mentioned above.
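To get a feel for what "encapsulating the communication" means, here is a toy in-memory model. It is not Kafka's API and uses no Kafka client library: `ToyBroker`, `publish`, and `poll` are hypothetical names, and the sketch leaves out what makes Kafka production-grade (persistence, partitioning, replication). It only shows the idea that Sources write to a named topic and each Target reads from it at its own pace.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker. Not Kafka's API; names are hypothetical."""

    def __init__(self):
        # topic name -> append-only list of messages
        self._topics = defaultdict(list)
        # (consumer name, topic name) -> index of the next message to read
        self._offsets = defaultdict(int)

    def publish(self, topic: str, message: dict) -> None:
        """Called by a Source: append a message to the end of the topic."""
        self._topics[topic].append(message)

    def poll(self, consumer: str, topic: str) -> list:
        """Called by a Target: return the messages this consumer has not read yet."""
        start = self._offsets[(consumer, topic)]
        messages = self._topics[topic][start:]
        # Remember how far this consumer has read.
        self._offsets[(consumer, topic)] = start + len(messages)
        return messages


broker = ToyBroker()
broker.publish("orders", {"id": 1})
broker.publish("orders", {"id": 2})

# Two independent Targets read the same messages without knowing the Source.
billing = broker.poll("billing", "orders")
shipping = broker.poll("shipping", "orders")
```

The Source never references `billing` or `shipping`, and neither Target references the Source. Each side only knows the topic name, which is the decoupling the article describes.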
Attributes of Kafka
Here are the main attributes of Kafka:
- Easy scaling
- High performance
- Real-time communication
- Data aggregation
- Data transformation
- Fault tolerance
- Decoupling of Source and Target (loose coupling)
- Mediating communication complexity
Kafka has several elements, such as topics, brokers, producers, consumers, consumer groups, metadata, and offsets. We will cover them one by one in upcoming posts, with an easy and practical approach.