After installing the "tldextract" package on an intranet machine to get domain/subdomain information, I encountered this error:
ERROR:tldextract:Exception reading Public Suffix List url https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat - HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /publicsuffix/list/master/public_suffix_list.dat (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd627bfd690>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)).
ERROR:tldextract:No Public Suffix List found. Consider using a mirror or constructing your TLDExtract with `suffix_list_urls=None`.
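By default, tldextract tries to fetch the live Public Suffix List over HTTPS on first use, which is exactly what fails on a machine with no outbound internet access. The quickest workaround, as the error message itself suggests, is to disable the live fetch so the extractor falls back to the list snapshot bundled with the package. A minimal sketch of that fallback (the exact constructor behavior depends on the installed tldextract version):

import tldextract

# Fallback suggested by the error message: skip the live fetch entirely and
# rely on the Public Suffix List snapshot bundled with the package.
no_live_fetch = tldextract.TLDExtract(suffix_list_urls=None)
print(no_live_fetch('http://www.google.com'))

The bundled snapshot can lag behind the live list, though, so I wanted to point tldextract at a current copy instead.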
After looking through the "tldextract" GitHub repository page on
advanced usage (https://github.com/john-kurkowski/tldextract#advanced-usage),
I realized that on an intranet machine I have to download the Public Suffix List myself and point tldextract at it during instance initialization. In other words, you have to supply your own copy of the public suffix list.
So I downloaded the Public Suffix List file "public_suffix_list.dat" from
https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat and passed it as the suffix_list_urls argument.
After setting suffix_list_urls to a file:// URL, it worked without any issue.
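For completeness, here is one way to fetch the file on a machine that does have internet access (a minimal sketch; wget, curl, or a browser work just as well), after which you copy it over to the intranet host:

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve          # Python 2

# Run this on an internet-connected machine, then copy the file to the
# intranet host (e.g. to /home/psj/Development/public_suffix_list.dat).
urlretrieve('https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat',
            'public_suffix_list.dat')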
Here is the sample script I wrote for testing on my intranet machine.
#!/usr/bin/env python
import sys
import logging

import tldextract

# Set up logging so tldextract's log output is visible; without a handler
# configured you get the warning: No handlers could be found for logger "tldextract"
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

# Download public_suffix_list.dat beforehand from
# https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat
# and point suffix_list_urls at the local copy via a file:// URL.
no_fetch_extract = tldextract.TLDExtract(
    suffix_list_urls=["file:///home/psj/Development/public_suffix_list.dat"],
    cache_file='/tmp/.tld_set')
print(no_fetch_extract('http://www.google.com'))
sys.exit(0)
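With the file-based suffix list in place, the script should print something like the following; the individual pieces are also available as the subdomain, domain, and suffix attributes of the result:

ExtractResult(subdomain='www', domain='google', suffix='com')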
Reference URLs:
https://github.com/john-kurkowski/tldextract#advanced-usage
https://github.com/john-kurkowski/tldextract/tree/1.3.1#specifying-your-own-url-or-file-for-the-suffix-list-data