Tuesday 12 September 2017

NTP synchronization


After an unexpected power shutdown, my system date/time was wrong after reboot. Here is what I did to correct the situation:

My ntpsync.sh script is as follows:

# ntpdate is deprecated (see http://linux.die.net/man/8/ntpd), so use ntpd -gq as below.
# On older systems, you can still use ntpdate like this:
# $ sudo service ntp stop
# $ sudo ntpdate -s www.nist.gov
# $ sudo service ntp start

$ sudo service ntp stop
$ sudo ntpd -gq
$ sudo service ntp start

# The -gq flags tell the NTP daemon to correct the time regardless of the offset
# (-g allows the first adjustment to exceed the normal panic threshold) and to
# exit immediately after setting the clock once (-q).
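
# To confirm the clock is back in sync afterwards, you can query the daemon's
# peer list (assuming the ntpq utility that ships with ntpd is installed):
# $ ntpq -p
# A peer prefixed with '*' is the currently selected sync source, and a small
# value in the 'offset' column (milliseconds) means you are synchronized.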







Monday 11 September 2017

Installation of Kafka on CentOS 7


Apache Kafka is an open-source stream-processing platform originally developed at LinkedIn and now maintained by the Apache Software Foundation; it is written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. One of Kafka's strongest points is its massively scalable pub/sub message queue, built as a distributed transaction log, which makes it well suited to handling streaming data.

It is possible to deploy Kafka on a single server or to build out a distributed Kafka cluster for greater performance.

### Update system
$ sudo yum update -y && sudo reboot

### Install OpenJDK runtime
$ sudo yum install java-1.8.0-openjdk.x86_64

Check the Java version
$ java -version
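
The output should look something like this (the exact build number will vary):

openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)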

### Add JAVA_HOME and JRE_HOME in /etc/profile
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export JRE_HOME=/usr/lib/jvm/jre

Apply the modified profile (source is a shell builtin, so it cannot be run via sudo)
$ source /etc/profile
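
Verify that the variables are now set in the current shell:

$ echo $JAVA_HOME
/usr/lib/jvm/jre-1.8.0-openjdk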

### Download the latest version of Apache Kafka
$ cd ~
$ wget -c https://archive.apache.org/dist/kafka/0.11.0.0/kafka_2.12-0.11.0.0.tgz

Extract the archive and move it to a preferred location such as /opt
$ tar -xvf kafka_2.12-0.11.0.0.tgz
$ sudo mv kafka_2.12-0.11.0.0 /opt

### Start and test Apache Kafka
Go to the Kafka directory
$ cd /opt/kafka_2.12-0.11.0.0

#### Start Zookeeper server
$ bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
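
Before starting Kafka, you can check that ZooKeeper is listening on its default port 2181 (assuming the ss utility from the iproute package is available):

$ ss -ltn | grep 2181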

#### Modify configuration of kafka server
$ vim bin/kafka-server-start.sh

Adjust the memory usage according to your specific system parameters.

By default, the script sets:
export KAFKA_HEAP_OPTS="-Xmx1G -Xms1G"

Replace it with:
export KAFKA_HEAP_OPTS="-Xmx512M -Xms256M"

### Start kafka server
$ bin/kafka-server-start.sh config/server.properties

If everything goes well, you will see several messages about the Kafka server's status, the last of which will read:

INFO [Kafka Server 0], started (kafka.server.KafkaServer)

Congratulations! You have started the Kafka server. Press CTRL + C to stop it.

Now run Kafka in daemon mode:
$ bin/kafka-server-start.sh -daemon config/server.properties
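
Because daemon mode detaches from the console, you can confirm the broker came up by tailing its log file (written under the logs directory of the Kafka installation):

$ tail -f logs/server.log

Look for the same "started (kafka.server.KafkaServer)" line as before.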

### Create a topic "test" on Kafka server
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

If you wish to view the topics, you can list them like this:
$ bin/kafka-topics.sh --list --zookeeper localhost:2181

In this case, the output will be:
test
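
You can also inspect a topic's partition and replica assignment with the --describe flag:

$ bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test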

### Produce messages using topic "test"

$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Now, at the console prompt, you can type as many messages as you wish, for example:
Welcome Joshi
Enjoy Kafka journey!

Use CTRL + C to stop the producer.

If you receive an error similar to "WARN Error while fetching metadata with correlation id" while inputting a message, update the server.properties file with the following settings:

port=9092
advertised.host.name=localhost
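
After editing config/server.properties, restart the broker so the new settings take effect:

$ bin/kafka-server-stop.sh
$ bin/kafka-server-start.sh -daemon config/server.properties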

### Consume messages
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

Hola! Whatever you typed into the producer earlier will now be visible on the console. Effectively, you have consumed the messages.
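
Kafka 0.11 also ships a newer console consumer that connects to the broker directly instead of going through ZooKeeper; the equivalent command is:

$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning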

### Role of Zookeeper

ZooKeeper coordinates and synchronizes configuration information across distributed nodes. A Kafka cluster depends on ZooKeeper for operations such as electing leaders and detecting failed nodes.
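
You can browse this coordination data yourself with the ZooKeeper shell bundled with Kafka; for example, listing the broker IDs registered in the cluster (a single-node setup like this one should show [0]):

$ bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids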

### Testing ZooKeeper

Type 'ruok' at the telnet prompt and the response will be 'imok':

$ telnet localhost 2181
Connected to localhost
Escape character is '^]'.
ruok
imok
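
If telnet is not installed, nc works just as well for sending the four-letter command:

$ echo ruok | nc localhost 2181
imok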

### Counting the number of messages stored in a Kafka topic
$ bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic test --time -1

This prints the latest offset of each partition; summing the counts across all partitions gives the total number of messages in the topic.
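
GetOffsetShell prints one topic:partition:offset line per partition, so for a multi-partition topic you can total the counts with awk (a quick sketch assuming the default colon-delimited output):

$ bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic test --time -1 | awk -F ':' '{sum += $3} END {print sum}'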

Tuesday 5 September 2017

tldextract python package Error - "ERROR:tldextract:Exception reading Public Suffix List url"

After installing the "tldextract" package on an intranet machine to extract domain/subdomain information, I encountered an error:

ERROR:tldextract:Exception reading Public Suffix List url https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat - HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /publicsuffix/list/master/public_suffix_list.dat (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd627bfd690>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)).
ERROR:tldextract:No Public Suffix List found. Consider using a mirror or constructing your TLDExtract with `suffix_list_urls=None`.

After looking through the "tldextract" GitHub page on advanced usage (https://github.com/john-kurkowski/tldextract#advanced-usage),
I realized that on an intranet machine I had to download the Public Suffix List manually and point to the local copy during TLDExtract instance initialization. Basically, you have to supply your own copy of the suffix list.

So, I downloaded the public suffix list file "public_suffix_list.dat" from https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat
and passed it as an argument to suffix_list_urls.



After setting suffix_list_urls to a file:// URL, it worked without any issue.

Here is the sample script I wrote for my intranet machine testing.

#!/usr/bin/env python
import sys
import logging

import tldextract

# Set up logging; without it you will see the warning:
# No handlers could be found for logger "tldextract"
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logger = logging.getLogger(__name__)

# public_suffix_list.dat was downloaded beforehand from
# https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat
# suffix_list_urls points at the local copy, so no network fetch is attempted.
no_fetch_extract = tldextract.TLDExtract(
    suffix_list_urls=["file:///home/psj/Development/public_suffix_list.dat"],
    cache_file='/tmp/.tld_set')

print no_fetch_extract('http://www.google.com')
sys.exit(0)
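
When run against the local suffix list, the script should print the parsed result (the namedtuple below is what the 1.x releases of tldextract return):

ExtractResult(subdomain='www', domain='google', suffix='com')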

Ref urls:
https://github.com/john-kurkowski/tldextract#advanced-usage
https://github.com/john-kurkowski/tldextract/tree/1.3.1#specifying-your-own-url-or-file-for-the-suffix-list-data