Notes on Kafka: Kafka on EC2 (Single-node)

Before we head into the central concepts of Kafka, we first need to install Kafka on our system. These are just a few of the prerequisites:

  • machine running Linux operating system
  • Java 8 JDK installed
  • Scala installed - this is optional (the script below uses 2.13.3)

For the rest of the course, I'll be doing the labs in an AWS EC2 instance.

You can also skip ahead to the sections you're interested in.

Installation

I installed the Java JDK and Apache Kafka on my instance months before I actually started this documentation. I've also created a script which does all the installation steps - it just needs to be pasted into the EC2 User data field during instance creation.

If you're using Linux on a virtual machine, that's also fine. You can simply take the commands from the script below and run them in your terminal.

I've included some comments as well - I did a lot of troubleshooting before I was able to make the script work. Ha! 😆

Also, looking back at this script I made 4 months ago, I feel I've gained a new perspective over those four months, so I've modified and optimized the script below.

I went back and updated this in Aug 2021. This is the script for a single-node Kafka cluster.

#!/bin/bash
#----------------------------------------------------------------------------------------------------------------------------------------#
# 02-Startup_script-Kafka_Install
# Maintainer: Jose Eden ([email protected])
# 2021-01-05 04:41:20
#-------------------------------------------------START OF SCRIPT------------------------------------------------------#
# Update instance
yum update -y

# Install java
yum install -y java-1.8.0-openjdk.x86_64

# Install wget, in case it is not installed.
yum install -y wget

# We'll keep all installs in /usr/local/bin
cd /usr/local/bin

# Download Scala 2.13.3 and untar file
# You can check the other scala binaries at 
# https://www.scala-lang.org/files/archive/
wget  https://www.scala-lang.org/files/archive/scala-2.13.3.tgz
tar -xvzf scala-2.13.3.tgz

# Download kafka and untar file
wget https://downloads.apache.org/kafka/2.7.0/kafka_2.13-2.7.0.tgz -v 2> ./wget_output.log
tar -xvf kafka_2.13-2.7.0.tgz 

# Remove tgz files
rm -f kafka_2.13-2.7.0.tgz scala-2.13.3.tgz 

# Renames the kafka folder
sudo mv kafka_2.13-2.7.0/ kafka

# Make 2 folders for data - one for kafka, one for zookeeper
cd kafka
mkdir -p data/kafka
mkdir -p data/zookeeper

# Set JAVA_HOME and JRE_HOME in /etc/profile, then add the scala and kafka paths to root's $PATH via .bashrc
echo "export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk" >> /etc/profile
echo "export JRE_HOME=/usr/lib/jvm/jre" >> /etc/profile
echo "export PATH=/usr/local/bin/scala-2.13.3/bin:/usr/local/bin/kafka/bin:$PATH" >> /root/.bashrc
# sudo su -
# source /etc/profile

# Update zookeeper properties and kafka properties
cd /usr/local/bin/kafka
sudo sed -i -e "s/dataDir=\/tmp\/zookeeper/dataDir=\/usr\/local\/bin\/kafka\/data\/zookeeper/g" config/zookeeper.properties 
sudo sed -i -e "s/log\.dirs=\/tmp\/kafka-logs/log\.dirs=\/usr\/local\/bin\/kafka\/data\/kafka/g" config/server.properties 

# Checks if .bashrc and profile is edited, forwards to log file
# Checks if properties files are edited, forwards to log file
tail -5 /etc/profile > /usr/local/bin/edit-properties.log
tail -5 /root/.bashrc >> /usr/local/bin/edit-properties.log
grep dataDir /usr/local/bin/kafka/config/zookeeper.properties >> /usr/local/bin/edit-properties.log
grep log.dirs /usr/local/bin/kafka/config/server.properties >> /usr/local/bin/edit-properties.log
# exit

# OPTIONAL: 
# Create my user and add it to the root group
sudo useradd -m -G root eden
#
# Changes the hostname to hcptstkafka1
sudo sed -i "s/.*/hcptstkafka1/" /etc/hostname
sudo sed -i "s/localhost/hcptstkafka1/g" /etc/hosts
sudo hostname hcptstkafka1
#
# Updates db for the locate command to immediately work
sudo updatedb

Once you see that the instance is in RUNNING status, log in to the EC2 instance. Make sure to replace my-key.pem with your own private key file and 1.2.3.4 with the IP of your instance.

ssh -i "my-key.pem" [email protected]

To test if the script worked, run the following commands. Note that since we placed the Kafka path in the root's .bashrc file, we can only run the following commands as root. Switch to root by running sudo su.

# Returns the version of the installed Java.
java -version

# Starts the Scala REPL. Type :quit or press Ctrl-D to exit
scala    

# Running the command without arguments prints its usage/options.
# You should be able to run this from any directory.
kafka-topics.sh
# it should return something like this:
# OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N
# Create, delete, describe, or change a topic.

# Run command to return the modified profile, .bashrc, and properties file
cat /usr/local/bin/edit-properties.log
# it should return something like this:
#   export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
#   export JRE_HOME=/usr/lib/jvm/jre
#   export PATH=/usr/local/bin/scala-2.13.3/bin:/usr/local/bin/kafka/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
#   dataDir=/usr/local/bin/kafka/data/zookeeper
#   log.dirs=/usr/local/bin/kafka/data/kafka
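If any of those checks don't match, the first place to look is the user-data output. On Amazon Linux, cloud-init captures everything the startup script printed, and we also redirected the wget output earlier (paths follow the setup above):

```shell
# cloud-init logs the user-data script's stdout/stderr here on Amazon Linux
sudo tail -50 /var/log/cloud-init-output.log

# The wget output we redirected during the Kafka download
cat /usr/local/bin/wget_output.log
```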

From the script above, we created a data folder in the Kafka directory, and inside data we created two more folders. These two folders are currently empty, but they'll soon be filled with files once we start Zookeeper and Kafka.

ec2-user:data $ pwd
/usr/local/bin/kafka/data
ec2-user:data $
ec2-user:data $ ll
total 0
drwxr-xr-x 2 root root 187 Jun 27 18:21 kafka    
drwxr-xr-x 3 root root  23 Jun 27 18:09 zookeeper
ec2-user:data $
ec2-user:data $ ll kafka 
total 0
ec2-user:data $ ll zookeeper
total 0

We've also changed the dataDir in the Zookeeper properties file to our data/zookeeper folder, so anything Zookeeper stores will go there. We did the same thing with the Kafka properties and pointed log.dirs to our data/kafka folder.

ec2-user:config $ pwd
/usr/local/bin/kafka/config
ec2-user:config $ 
ec2-user:config $ grep dataDir zookeeper.properties 
dataDir=/usr/local/bin/kafka/data/zookeeper
ec2-user:config $
ec2-user:config $ grep log.dirs server.properties 
log.dirs=/usr/local/bin/kafka/data/kafka

Now that everything is configured, we can try running Zookeeper and a broker. To run Zookeeper, issue the command below. Note that you should be in /usr/local/bin/kafka/config when referencing the properties files by name, as in the prompts below.

If all goes well, you should see an INFO binding to port 0.0.0.0/0.0.0.0:2181 line near the bottom of the output.

root:config $ zookeeper-server-start.sh zookeeper.properties

# Some parts of the output are omitted
[2021-06-27 18:31:52,028] INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2021-06-27 18:31:52,042] INFO zookeeper.snapshotSizeFactor = 0.33 (org.apache.zookeeper.server.ZKDatabase)
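Before starting the broker, you can confirm from a second terminal that Zookeeper is really listening on its client port. A quick sanity check with ss (from iproute2, available on most modern Linux distributions):

```shell
# Look for a TCP listener on Zookeeper's default client port 2181
ss -ltn | grep ':2181' && echo "Zookeeper is listening"
```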

Open a second terminal, proceed to the same directory, and switch to root. Run the command below to start Kafka. If successful, you should see (at the bottom of the output) a KafkaServer id and confirmation that it started.

root:config $ kafka-server-start.sh server.properties

# Some parts of the output are omitted
[2021-06-27 18:36:27,975] INFO [KafkaServer id=0] started (kafka.server.KafkaServer)
[2021-06-27 18:36:28,004] INFO [broker-0-to-controller-send-thread]: Recorded new controller, from now on will use broker 0 (kafka.server.BrokerToControllerRequestThread)
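Both start scripts also accept a -daemon flag, so if you'd rather not dedicate a terminal to each process, you can run them in the background instead. A sketch, assuming you're in /usr/local/bin/kafka as root:

```shell
# Start Zookeeper and the broker as background daemons
zookeeper-server-start.sh -daemon config/zookeeper.properties

# Give Zookeeper a moment to bind to port 2181 before the broker connects
sleep 5
kafka-server-start.sh -daemon config/server.properties

# To stop them later, do it in reverse order:
# kafka-server-stop.sh
# zookeeper-server-stop.sh
```

In daemon mode nothing is printed to the terminal, so check the files under logs/ to confirm that both processes came up.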

Open a third terminal and go to the data directory. We'll see that both kafka and zookeeper now have files inside.

root:data $ pwd
/usr/local/bin/kafka/data
root:data $ 
root:data $ ll kafka/
total 12
-rw-r--r-- 1 root root  0 Jun 27 18:11 cleaner-offset-checkpoint
-rw-r--r-- 1 root root  4 Jun 27 18:40 log-start-offset-checkpoint
-rw-r--r-- 1 root root 88 Jun 27 18:36 meta.properties
-rw-r--r-- 1 root root  4 Jun 27 18:40 recovery-point-offset-checkpoint
-rw-r--r-- 1 root root  0 Jun 27 18:11 replication-offset-checkpoint
root:data $
root:data $ ll zookeeper/
total 0
drwxr-xr-x 2 root root 70 Jun 27 18:31 version-2

We can see the metadata of the broker by reading the meta.properties file inside the data/kafka/ directory.

[root@hcptstkafka1 data]# cat kafka/meta.properties
#
#Thu Aug 12 13:21:00 UTC 2021
cluster.id=UqNaEvX8RW2Wn9A4fwuS8g
version=0
broker.id=0
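If you only need a single field from that file, e.g. the cluster id, a grep/cut pair pulls it out. A small sketch using the paths from our setup:

```shell
# Extract just the value of cluster.id from the broker metadata
grep '^cluster.id' /usr/local/bin/kafka/data/kafka/meta.properties | cut -d= -f2
```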

Important Note

Zookeeper is mandatory in a Kafka cluster, and a broker will not start if Zookeeper is unavailable. This is because the broker tries to connect to Zookeeper when you run kafka-server-start.sh.

There's been an update just this year (2021) on deploying Kafka clusters without Zookeeper, known as KRaft, where the external Zookeeper is replaced by a quorum controller within the Kafka cluster itself. It's still an early access release and is not recommended for critical or production workloads.

As such, we won't be using a Zookeeper-less environment for the rest of this series, but it may get its own series in the upcoming months - once I get to play with it, of course!

You may check out the ongoing development in their GitHub repository.

Oh, if you get any errors

As a rule of thumb, the first thing to do when you get errors is to look at the logs. They will almost always show the error and possibly point you to the code or function that's throwing the exception.

To check the Kafka logs, look for server.log inside the kafka/logs directory. If you're not able to find it, you can simply run locate.

[root@hcptstkafka1 ~]# locate server.log
/usr/local/bin/kafka/logs/server.log

You can view the entire log, or simply view the last few lines using the tail command, specifying the number of lines after it.

[root@hcptstkafka1 ~]# cd /usr/local/bin/kafka/logs
[root@hcptstkafka1 logs]# tail -10 server.log
[2021-08-12 18:40:38,180] INFO Socket error occurred: localhost/127.0.0.1:2181: Network is unreachable (org.apache.zookeeper.ClientCnxn)
[2021-08-12 18:40:39,605] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-08-12 18:40:39,605] ERROR Unable to open socket to localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxnSocketNIO)
[2021-08-12 18:40:39,605] INFO Socket error occurred: localhost/127.0.0.1:2181: Network is unreachable (org.apache.zookeeper.ClientCnxn)
[2021-08-12 18:40:40,714] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-08-12 18:40:40,714] ERROR Unable to open socket to localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxnSocketNIO)
[2021-08-12 18:40:40,714] INFO Socket error occurred: localhost/127.0.0.1:2181: Network is unreachable (org.apache.zookeeper.ClientCnxn)
[2021-08-12 18:40:42,615] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-08-12 18:40:42,615] ERROR Unable to open socket to localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxnSocketNIO)
[2021-08-12 18:40:42,616] INFO Socket error occurred: localhost/127.0.0.1:2181: Network is unreachable (org.apache.zookeeper.ClientCnxn)

You can also filter the output to show only the errors or exceptions by using the grep command.

[root@hcptstkafka1 logs]# tail -10 server.log | grep ERROR
[2021-08-12 18:40:39,605] ERROR Unable to open socket to localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxnSocketNIO)
[2021-08-12 18:40:40,714] ERROR Unable to open socket to localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxnSocketNIO)
[2021-08-12 18:40:42,615] ERROR Unable to open socket to localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxnSocketNIO)
[root@hcptstkafka1 logs]# 
[root@hcptstkafka1 logs]# tail -10 server.log | grep Exception
[root@hcptstkafka1 logs]#
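Both filters can also be combined into a single pass with an extended regular expression:

```shell
# Match lines containing either ERROR or Exception in one grep
tail -200 server.log | grep -E 'ERROR|Exception'
```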

In this case, I had temporarily stopped Zookeeper, which is why the broker's log shows that it's unable to connect to port 2181.

Your machine is now set up, and you're good to proceed to the next chapter! 😃👍 To finish the Kafka theory, proceed to the next two articles in this series. You could also skip ahead to the Kafka CLI section.

If you find this write-up interesting, I'll be glad to talk and connect with you on Twitter! 😃
