Tuesday, 3 September 2019

Installation of nltk data in Offline mode

NLTK is a popular Python library for natural language processing, used for many
text processing tasks such as classification, stemming, tagging, and parsing.
I wanted to gain first-hand experience with this powerful library for an in-house application.

Installing the basic nltk Python package is easy:
```
$ sudo pip install nltk
```
The next step is to download and install the NLTK data. If your system is connected to the Internet, you can do:

```
$ python
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Download which package (l=list; x=cancel)?
Downloader> l
Packages:
[*] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
[*] abc................. Australian Broadcasting Commission 2006
[*] alpino.............. Alpino Dutch Treebank
[*] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                         Extraction Systems in Biology)
[*] brown............... Brown Corpus
[*] brown_tei........... Brown Corpus (TEI XML Version)
[*] cess_cat............ CESS-CAT Treebank
[*] cess_esp............ CESS-ESP Treebank
[*] chat80.............. Chat-80 Data Files
[*] city_database....... City Database
[*] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
[*] comtrans............ ComTrans Corpus Sample
[*] conll2000........... CONLL 2000 Chunking Corpus
[*] conll2002........... CONLL 2002 Named Entity Recognition Corpus
[*] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
                         and Basque Subset)
[*] dependency_treebank. Dependency Parsed Treebank
[*] europarl_raw........ Sample European Parliament Proceedings Parallel
                         Corpus
Hit Enter to continue:
....
Downloader> d

Download which package (l=list; x=cancel)?
Identifier> all

```
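The interactive session above can also be scripted. The following is a minimal sketch, assuming Python 3 and assuming that `nltk.download` accepts a package identifier and returns `True` on success; the helper `download_with_retry` and its parameters are hypothetical names introduced here for illustration. It retries a few times, which may help when a large download is interrupted.

```python
import time


def download_with_retry(package_id, download_fn=None, attempts=3, delay_seconds=5.0):
    """Call download_fn(package_id) until it reports success or attempts run out.

    download_fn defaults to nltk.download (assumed to return True on success).
    """
    if download_fn is None:
        import nltk  # imported lazily so the helper also works with a stand-in function
        download_fn = nltk.download

    for attempt in range(1, attempts + 1):
        try:
            if download_fn(package_id):
                return True
        except OSError as error:  # network errors are usually OSError subclasses
            print("attempt %d failed: %s" % (attempt, error))
        if attempt < attempts:
            time.sleep(delay_seconds)  # wait before the next try
    return False
```

A call such as `download_with_retry("brown")` would fetch a single corpus, and `download_with_retry("all")` would correspond to the `all` identifier used in the session above.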

Once you have downloaded all the NLTK data you need (corpora, models, grammars), you can check it by running:

```
Downloader> u
Nothing to update.
```
If the downloader reports “Nothing to update”, you can assume that everything is in place. However, things are not always so smooth in practice. The data to be downloaded is large (600 MB), so the connection may time out or the download may not finish. I also ran into a situation where I had to install the NLTK data on an offline system. The following steps spell out how to install the NLTK data offline.

* Visit the nltk_data Git repository: https://github.com/nltk/nltk_data
* You will be greeted with the notice: NLTK data lives in the gh-pages branch of this repository.
* Visit the branch URL: https://github.com/nltk/nltk_data/tree/gh-pages
* Download the zip file of this branch from GitHub on a machine with Internet access, transfer it to the offline system and unzip it, then copy the sub-directories of the packages folder into your nltk_data directory, say, /root/nltk_data

```
$ sudo mkdir -p /root/nltk_data
$ unzip nltk_data-gh-pages.zip
$ cd nltk_data-gh-pages
$ sudo cp -R packages/* /root/nltk_data
```
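The same unzip-and-copy step can be done from Python, which may be handy where `unzip` is not available. This is a rough sketch, assuming Python 3.8 or later and assuming the archive contains a single top-level folder with a `packages` directory inside it, as in the commands above. The function name `install_offline` is hypothetical.

```python
import shutil
import zipfile
from pathlib import Path


def install_offline(archive_path, data_dir):
    """Unpack the repository archive and copy packages/* into data_dir."""
    archive_path = Path(archive_path)
    data_dir = Path(data_dir)
    extract_root = archive_path.parent / (archive_path.stem + "_extracted")

    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(extract_root)

    # Assumption: one top-level folder (e.g. nltk_data-gh-pages) holds "packages".
    matches = sorted(extract_root.glob("*/packages"))
    if not matches:
        raise FileNotFoundError("no packages folder found in %s" % archive_path)

    data_dir.mkdir(parents=True, exist_ok=True)
    for entry in sorted(matches[0].iterdir()):
        target = data_dir / entry.name
        if entry.is_dir():
            # dirs_exist_ok (Python 3.8+) lets the copy merge into an existing folder
            shutil.copytree(entry, target, dirs_exist_ok=True)
        else:
            shutil.copy2(entry, target)
    return sorted(path.name for path in data_dir.iterdir())
```

For the layout used in this post the call would be `install_offline("nltk_data-gh-pages.zip", "/root/nltk_data")`, run with sufficient permissions to write to the target directory.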

Test that the NLTK data works as expected:

```
$ python
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.data.path.append('/root/nltk_data')
>>> from nltk.corpus import brown
>>> brown.words()[0:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
```
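Before importing individual corpora, it can be useful to check which packages actually landed in the data directory. The sketch below uses only the standard library and assumes, as in the copy step above, that a package may be present either as an unpacked folder or as a `.zip` archive. The function name `missing_packages` and the resource names in the usage note are illustrative.

```python
from pathlib import Path


def missing_packages(data_dir, wanted):
    """Return the entries of wanted that are absent from data_dir.

    Each entry looks like "corpora/brown"; a package counts as present if
    either the unpacked folder or the matching .zip archive exists.
    """
    data_dir = Path(data_dir)
    missing = []
    for resource in wanted:
        unpacked = data_dir / resource
        zipped = data_dir / (resource + ".zip")
        if not (unpacked.exists() or zipped.exists()):
            missing.append(resource)
    return missing
```

For example, `missing_packages("/root/nltk_data", ["corpora/brown", "corpora/abc"])` returns an empty list when both corpora were copied.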