Tuesday 3 September 2019

Installation of nltk data in offline mode

NLTK is a popular library for natural language processing in Python and is used to carry out many
text processing tasks such as classification, stemming, tagging, and parsing.
I wanted to gain first-hand experience with this powerful library for my in-house application.

Installation of the basic nltk Python package is easy:
```
$ sudo pip install nltk
```
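
Before fetching any data, you can quickly confirm that the package imports cleanly (`nltk.__version__` is a standard attribute):

```
import nltk
print(nltk.__version__)  # prints the installed version, e.g. 3.x
```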
The next step was to download and install the nltk data. If your system is connected to the Internet, you can do:

```
$ python
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Download which package (l=list; x=cancel)?
Downloader> l
Packages:
[*] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
[*] abc................. Australian Broadcasting Commission 2006
[*] alpino.............. Alpino Dutch Treebank
[*] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
Extraction Systems in Biology)
[*] brown............... Brown Corpus
[*] brown_tei........... Brown Corpus (TEI XML Version)
[*] cess_cat............ CESS-CAT Treebank
[*] cess_esp............ CESS-ESP Treebank
[*] chat80.............. Chat-80 Data Files
[*] city_database....... City Database
[*] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
[*] comtrans............ ComTrans Corpus Sample
[*] conll2000........... CONLL 2000 Chunking Corpus
[*] conll2002........... CONLL 2002 Named Entity Recognition Corpus
[*] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
and Basque Subset)
[*] dependency_treebank. Dependency Parsed Treebank
[*] europarl_raw........ Sample European Parliament Proceedings Parallel
Corpus
Hit Enter to continue:
....
Downloader> d

Download which package (l=list; x=cancel)?
Identifier> all

```

Once you have downloaded everything (corpora, models, grammars) that NLTK needs, you can test it by running:

```
Downloader> u
Nothing to update.
```
If the dialog shows “Nothing to update”, you can assume that everything is fine. However, things are not so smooth in the real world. Since the data to be downloaded is large (about 600 MB), the connection may time out or the download may not finish. Secondly, I came across a situation where I had to install the nltk data on a completely offline system.
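
If the machine does have connectivity and the problem is just timeouts, one workaround is to fetch only the packages you need with the non-interactive form of `nltk.download()`, which is far less likely to time out than downloading `all`. A minimal sketch; the package identifiers below are only examples:

```
import nltk

# Download selected packages into an explicit directory instead of 'all'.
# 'punkt' and 'brown' are example identifiers; substitute the ones you need.
for pkg in ('punkt', 'brown'):
    nltk.download(pkg, download_dir='/root/nltk_data')
```

For a truly offline system, the following steps spell out how to install the nltk data: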

* Please visit the nltk_data git repository - https://github.com/nltk/nltk_data
* You will be greeted with the note - NLTK data lives in the gh-pages branch of this repository.
* Visit the branch url - https://github.com/nltk/nltk_data/tree/master
* Download the zip file of this repository from GitHub and unzip it, then copy the sub-directories of the packages folder into your nltk_data directory, say, /root/nltk_data:

```
$ sudo mkdir -p /root/nltk_data
$ unzip nltk_data-gh-pages.zip
$ sudo cp -R nltk_data-gh-pages/packages/* /root/nltk_data
```

Test whether the nltk data works as desired:

```
$ python
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.data.path.append('/root/nltk_data')
>>> from nltk.corpus import brown
>>> brown.words()[0:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
```
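
Rather than appending to `nltk.data.path` in every script, you can also set the `NLTK_DATA` environment variable before starting Python; nltk adds it to its search path automatically. Either way, `nltk.data.find()` gives a quick sanity check that a resource is resolvable:

```
import nltk

# find() raises LookupError if the resource is missing, so it doubles
# as a sanity check for the offline install.
print(nltk.data.find('corpora/brown'))
```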

Friday 2 February 2018

Install graphite on CentOS 7


1) Install and enable EPEL
sudo wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install epel-release-latest-7.noarch.rpm

Tip:
To list all available packages under a repo called epel, enter:
$ sudo yum --disablerepo="*" --enablerepo="epel" list available

2) Install graphite related packages
sudo yum install -y graphite-web python-carbon

3) Modify the storage schema file. Each retention is written as precision:duration, and each archive's precision must evenly divide every coarser archive's precision, otherwise whisper will reject the schema.
sudo nano /etc/carbon/storage-schemas.conf

[default]
pattern = .*
retentions = 10s:4h, 1m:3d, 5m:8d, 15m:32d, 1h:1y

Now, restart carbon:
sudo service carbon-cache restart

or, with systemd (enabling it at boot as well):
sudo systemctl enable carbon-cache
sudo systemctl start carbon-cache

4) Modify some settings in django graphite web application file:
sudo nano /etc/graphite-web/local_settings.py

# line 13: uncomment and specify any secret-key you like
SECRET_KEY = 'my_secret_key'

# line 23: uncomment and change to your timezone

TIME_ZONE = 'Asia/Calcutta'

5) Now, create the database and a Django superuser:
/bin/graphite-manage syncdb

Enter the required username/password for the superuser in the interactive session.

6) Configure apache for graphite:
Remove the default index page from apache:

  echo > /etc/httpd/conf.d/welcome.conf

Next, edit /etc/httpd/conf.d/graphite-web.conf and replace everything in the 'Directory "/usr/share/graphite/"' block with:

    Require all granted
    Order allow,deny
    Allow from all


Check that the Graphite database file has the proper permissions:

sudo chown apache:apache /var/lib/graphite-web/graphite.db

The permissions should be -rw-r--r-- and the owner apache:apache.

Also, work around a bug related to building the search index with:

  touch /var/lib/graphite-web/index

The permissions should be -rw-r--r-- and the owner root:root.


Start Apache and enable auto-start:

sudo systemctl start httpd 
sudo systemctl enable httpd 


7) If the firewall is enabled, allow access on port 80.

sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --reload
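
Once everything is up, you can verify the whole pipeline by pushing a test metric into carbon's plaintext listener (TCP port 2003 by default) and then searching for it in the graphite web UI. A minimal sketch; the metric name is made up:

#!/usr/bin/env python
import socket
import time

CARBON_HOST = 'localhost'   # assumes carbon-cache runs on this host
CARBON_PORT = 2003          # carbon's default plaintext protocol port

# plaintext protocol: "<metric path> <value> <unix timestamp>\n"
message = 'test.graphite.heartbeat 1 %d\n' % int(time.time())

sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
sock.sendall(message)
sock.close()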

Useful links:
https://www.vultr.com/docs/how-to-install-and-configure-graphite-on-centos-7

Sunday 24 December 2017

Process dump the attacker using logstash

Use the Linux "auditd" daemon to monitor the /etc/passwd file and tag matching events with the key "passwd_access".

$ sudo apt install auditd
$ sudo auditctl -w /etc/passwd -p wa -k passwd_access

Force logstash to core-dump any process that causes auditd to write the "passwd_access" key.

Install gdb (gcore):
sudo apt install gdb

Modify the output section of your /etc/logstash/conf.d/00-output.conf:

output {
  if [key] == "passwd_access" {
    stdout { codec => json }
    exec { command => "gcore -o /tmp/dump-%{@timestamp} %{pid}" }
  }
}


Tuesday 19 December 2017

Why you should not run multiple anti-virus products

I liked the advice from Emsisoft about why you should not run multiple anti-virus products.

* Potential conflicts/incompatibility issues

Modern anti-virus/anti-malware software acts like an extra protection layer that sits between the base of the operating system and the apps/programs that run on it. Developing this type of software is always challenging and requires many years of experience. Protection programs are built in different ways, and running several at once can cause unexpected crashes or freezes that are very difficult to resolve.

* Who will quarantine first

Since anti-virus products have real-time scanning enabled, multiple anti-virus programs end up racing to quarantine a potentially malicious download or scanned file, and this can give unexpected results or errors.

* High resource usage

Since the number of viruses and other malware is growing exponentially, the size and complexity of anti-virus programs is growing correspondingly, and they now consume a lot of CPU and storage resources. The situation is further compounded if you use multiple anti-virus products.

So, in a nutshell, avoid installing multiple anti-virus/anti-malware products as it’s not worth it. If you are happy with your existing anti-virus software, stick with it. If you are unhappy with it, uninstall it and then install a new one.

Ref -
* Don't run multiple anti-virus products - https://blog.emsisoft.com/2017/12/18/do-not-run-multiple-antivirus
* Latest independent tests of anti-virus products - https://www.av-comparatives.org/

Sunday 10 December 2017

Fix "this webpage has a redirect loop" error in browser

On a daily basis, the Internet as a whole suffers from many problems, and one that you may sometimes encounter is:

"The webpage has a redirect loop"

At other times, the redirect error manifests itself with the following details:

Error 310 (net::ERR_TOO_MANY_REDIRECTS)

As a result, every time you visit the Gmail page in Firefox/Chrome, the redirection loop will prevent access to it.

When Chrome or Firefox starts complaining about redirect loops, just do three things:

1) Check the clock and set the correct date/time for your timezone
2) Clear the browser cache
3) Reset the browser settings to their default state

By the way, the number of redirects allowed varies by browser, as listed below:

  • Chrome 64-bit, version 49 ↷ 62.0.3202.52 (Official Build) beta: 21 redirects
  • Chrome Canary 64-bit, version 49 ↷ 63.0.3239.6 (Official Build) canary: 21 redirects
  • Firefox 32-bit, version 43 ↷ 56.0: 20 redirects
  • Firefox 64-bit, version 43 ↷ 56.0: 20 redirects
  • IE version 8: 11 redirects (via webpagetest.org)
  • IE version 9: 121 redirects (via webpagetest.org)
  • IE version 10: 121 redirects (via webpagetest.org)
  • IE version 11.0.9600.18792: 110 redirects
  • Opera version 28 ↷ 48.0.2685.35 (PGO): 21 redirects
  • Safari version 5.1.7: 16 redirects
  • Google Nexus 5, Samsung Galaxy S4..S8, Galaxy Tab 4: 21 redirects

The latest Firefox version, Quantum, supports up to 40 redirects.

Ref -
https://stackoverflow.com/questions/9384474/in-chrome-how-many-redirects-are-too-many

Sunday 19 November 2017

Save bash command history to syslog


# Increase history size
export HISTSIZE=5000

# With the command given below, every time a new prompt is issued, bash appends the session history to the file, clears it from the current shell's memory, and reloads the history from the file.

$ export PROMPT_COMMAND="history -a; history -c; history -r; ${PROMPT_COMMAND}"

Another option is to export bash commands to syslog, where the logs can be centralized and analyzed on demand. The snippet below uses bash's DEBUG trap to send every executed command, along with the user and working directory, to syslog facility local1.

Add the following snippet to bashrc.

[root@psj]# vim /etc/bashrc

PROMPT_COMMAND='history -a'
typeset -r PROMPT_COMMAND

function log2syslog
{
   declare command
   command=$BASH_COMMAND
   logger -p local1.notice -t bash -i -- "$USER : $PWD : $command"

}
trap log2syslog DEBUG



Friday 10 November 2017

Download gz file using python request module

Here is a quick script I wrote to download a gz file using the Python requests module:

#!/usr/bin/env python
import requests
import gzip
import logging
import sys
import StringIO

# setup logging
logging.basicConfig(stream=sys.stdout, level=logging.ERROR)
log = logging.getLogger('threat-feeds-logger')

# proxy configuration
proxy_host = '10.1.1.11'
proxy_port = 3128
proxy_user = 'xxxx'
proxy_password = 'xxxx'

feed_url = 'https://foo.org/foo.gz'
proxy_dict = {
                'http':'http://%s:%s@%s:%s' % (proxy_user, proxy_password, proxy_host, proxy_port),
                'https':'http://%s:%s@%s:%s' % (proxy_user, proxy_password, proxy_host, proxy_port)
            }
try:
    response = requests.get(feed_url, proxies=proxy_dict)
except Exception as e:
    log.error("Error while getting data from url - %s: %s" % (feed_url, e))
    sys.exit(1)

if response.status_code == 200:
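    # response.content holds the raw .gz payload; unpack it from an in-memory buffer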
    buf_data = StringIO.StringIO(response.content)
    f = gzip.GzipFile(fileobj=buf_data)
    for row in f.readlines():
        print row
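
For larger feeds, it may be safer to stream the download to disk instead of holding the whole payload in memory. A sketch using standard requests features (stream=True and iter_content); the local filename is arbitrary:

response = requests.get(feed_url, proxies=proxy_dict, stream=True)
response.raise_for_status()

# write the compressed payload to disk in chunks
with open('/tmp/feed.gz', 'wb') as out:
    for chunk in response.iter_content(chunk_size=64 * 1024):
        out.write(chunk)

# then read the archive back line by line without loading it all at once
f = gzip.open('/tmp/feed.gz', 'rb')
for row in f:
    print row
f.close()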