Tuesday, 5 September 2017

tldextract Python package error: "ERROR:tldextract:Exception reading Public Suffix List url"

After installing the "tldextract" package on an intranet machine to get domain/subdomain information, I encountered this error:

ERROR:tldextract:Exception reading Public Suffix List url https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat - HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /publicsuffix/list/master/public_suffix_list.dat (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd627bfd690>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)).
ERROR:tldextract:No Public Suffix List found. Consider using a mirror or constructing your TLDExtract with `suffix_list_urls=None`.

After looking through the "tldextract" GitHub repository's advanced usage section (https://github.com/john-kurkowski/tldextract#advanced-usage),
I realized that on an intranet machine I have to supply the public suffix list myself when initializing the tldextract instance, since the library cannot fetch it over the network. Basically, you have to provide your own copy of the public suffix list.

So, I downloaded the public suffix list file "public_suffix_list.dat" from https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat
and passed its location as an argument to suffix_list_urls.
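If you are unsure how to turn the local path into a file:// URL for suffix_list_urls, the standard library can build it for you (a small sketch, assuming Python 3; the path below is just the one from my setup):

```python
from pathlib import Path

# Convert a local absolute path into a file:// URL
dat_path = Path('/home/psj/Development/public_suffix_list.dat')
file_url = dat_path.as_uri()
print(file_url)  # file:///home/psj/Development/public_suffix_list.dat
```

On Python 2 you can simply prefix the absolute path with "file://" by hand.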



After setting suffix_list_urls to a file-based URL scheme (file://), it worked without any issue.

Here is the sample script I wrote for my intranet machine testing.

#!/usr/bin/env python
import logging
import sys

import tldextract

# Set up logging. If you do not, you will see the warning:
# "No handlers could be found for logger "tldextract""
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

# Download public_suffix_list.dat beforehand from
# https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat
# and place it at the path below.

no_fetch_extract = tldextract.TLDExtract(
    suffix_list_urls=["file:///home/psj/Development/public_suffix_list.dat"],
    cache_file='/tmp/.tld_set')

print(no_fetch_extract('http://www.google.com'))
sys.exit(0)
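For context, public_suffix_list.dat is a plain text file: one suffix rule per line, // comment lines, plus wildcard (*) and exception (!) rules. A minimal sketch of reading the rules out of such a file (the sample below is a made-up excerpt, not the real list):

```python
# Sample data in the Public Suffix List format (not the real file)
sample = """\
// ===BEGIN ICANN DOMAINS===
com
// comment line
co.uk
*.ck
!www.ck
// ===END ICANN DOMAINS===
"""

# Keep non-empty lines that are not comments; these are the suffix rules
suffixes = [line for line in sample.splitlines()
            if line and not line.startswith('//')]
print(suffixes)  # ['com', 'co.uk', '*.ck', '!www.ck']
```

This is roughly what tldextract does internally when it loads the file you point it at.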

Reference URLs:
https://github.com/john-kurkowski/tldextract#advanced-usage
https://github.com/john-kurkowski/tldextract/tree/1.3.1#specifying-your-own-url-or-file-for-the-suffix-list-data
