CollecTor¶
Descriptor archives are available from CollecTor. If you need Tor's topology at a prior point in time, this is the place to go!
With CollecTor you can either read descriptors directly...
import datetime
import stem.descriptor.collector

yesterday = datetime.datetime.utcnow() - datetime.timedelta(days = 1)

# provide yesterday's exits

exits = {}

for desc in stem.descriptor.collector.get_server_descriptors(start = yesterday):
  if desc.exit_policy.is_exiting_allowed():
    exits[desc.fingerprint] = desc

print('%i relays published an exiting policy today...\n' % len(exits))

for fingerprint, desc in exits.items():
  print('  %s (%s)' % (desc.nickname, fingerprint))
... or download the descriptors to disk and read them later.
import datetime
import stem.descriptor
import stem.descriptor.collector

yesterday = datetime.datetime.utcnow() - datetime.timedelta(days = 1)
cache_dir = '~/descriptor_cache/server_desc_today'

collector = stem.descriptor.collector.CollecTor()

for f in collector.files('server-descriptor', start = yesterday):
  f.download(cache_dir)

# then later...

for f in collector.files('server-descriptor', start = yesterday):
  for desc in f.read(cache_dir):
    if desc.exit_policy.is_exiting_allowed():
      print('  %s (%s)' % (desc.nickname, desc.fingerprint))
get_instance - Provides a singleton CollecTor used for...
|- get_server_descriptors - published server descriptors
|- get_extrainfo_descriptors - published extrainfo descriptors
|- get_microdescriptors - published microdescriptors
|- get_consensus - published router status entries
|
|- get_key_certificates - authority key certificates
|- get_bandwidth_files - bandwidth authority heuristics
+- get_exit_lists - TorDNSEL exit list
File - Individual file residing within CollecTor
|- read - provides descriptors from this file
+- download - download this file to disk
CollecTor - Downloader for descriptors from CollecTor
|- get_server_descriptors - published server descriptors
|- get_extrainfo_descriptors - published extrainfo descriptors
|- get_microdescriptors - published microdescriptors
|- get_consensus - published router status entries
|
|- get_key_certificates - authority key certificates
|- get_bandwidth_files - bandwidth authority heuristics
|- get_exit_lists - TorDNSEL exit list
|
|- index - metadata for content available from CollecTor
+- files - files available from CollecTor
New in version 1.8.0.
- stem.descriptor.collector.get_instance()[source]¶
Provides the singleton CollecTor used for this module's shorthand functions.
Returns: singleton CollecTor instance
- stem.descriptor.collector.get_server_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)[source]¶
Shorthand for get_server_descriptors() on our singleton instance.
- stem.descriptor.collector.get_extrainfo_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)[source]¶
Shorthand for get_extrainfo_descriptors() on our singleton instance.
- stem.descriptor.collector.get_microdescriptors(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]¶
Shorthand for get_microdescriptors() on our singleton instance.
- stem.descriptor.collector.get_consensus(start=None, end=None, cache_to=None, document_handler='ENTRIES', version=3, microdescriptor=False, bridge=False, timeout=None, retries=3)[source]¶
Shorthand for get_consensus() on our singleton instance.
- stem.descriptor.collector.get_key_certificates(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]¶
Shorthand for get_key_certificates() on our singleton instance.
- stem.descriptor.collector.get_bandwidth_files(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]¶
Shorthand for get_bandwidth_files() on our singleton instance.
- stem.descriptor.collector.get_exit_lists(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]¶
Shorthand for get_exit_lists() on our singleton instance.
- class stem.descriptor.collector.File(path, types, size, sha256, first_published, last_published, last_modified)[source]¶
Bases: object
File within CollecTor.
Variables: - path (str) -- file path within collector
- types (tuple) -- descriptor types contained within this file
- compression (stem.descriptor.Compression) -- file compression, None if this cannot be determined
- size (int) -- size of the file
- sha256 (str) -- file's sha256 checksum
- start (datetime) -- first publication within the file, None if this cannot be determined
- end (datetime) -- last publication within the file, None if this cannot be determined
- last_modified (datetime) -- when the file was last modified
- read(directory=None, descriptor_type=None, start=None, end=None, document_handler='ENTRIES', timeout=None, retries=3)[source]¶
Provides descriptors from this archive. Descriptors are downloaded or read from disk as follows...
- If this file has already been downloaded through :func:`~stem.descriptor.collector.File.download` these descriptors are read from disk.
- If a directory argument is provided and the file is already present these descriptors are read from disk.
- If a directory argument is provided and the file is not present, the file is downloaded to this location and then read.
- If the file has not been downloaded and no directory argument is provided, the file is downloaded to a temporary directory that's deleted after it is read.
Parameters: - directory (str) -- destination to download into
- descriptor_type (str) -- descriptor type, this is guessed if not provided
- start (datetime.datetime) -- publication time to begin with
- end (datetime.datetime) -- publication time to end with
- document_handler (stem.descriptor.__init__.DocumentHandler) -- method in which to parse a NetworkStatusDocument
- timeout (int) -- timeout when connection becomes idle, no timeout applied if None
- retries (int) -- maximum attempts to impose
Returns: iterator for Descriptor instances in the file
Raises : - ValueError if unable to determine the descriptor type
- TypeError if we cannot parse this descriptor type
- DownloadFailed if the download fails
- download(directory, decompress=True, timeout=None, retries=3, overwrite=False)[source]¶
Downloads this file to the given location. If a file already exists this is a no-op.
Parameters: - directory (str) -- destination to download into
- decompress (bool) -- decompress written file
- timeout (int) -- timeout when connection becomes idle, no timeout applied if None
- retries (int) -- maximum attempts to impose
- overwrite (bool) -- if this file exists but mismatches CollecTor's checksum, overwrite it if True, otherwise raise an exception
Returns: str with the path we downloaded to
Raises : - DownloadFailed if the download fails
- IOError if a mismatching file exists and overwrite is False
- class stem.descriptor.collector.CollecTor(retries=2, timeout=None)[source]¶
Bases: object
Downloader for descriptors from CollecTor. The contents of CollecTor are provided in an index that's fetched as required.
Variables: - retries (int) -- number of times to attempt the request if downloading it fails
- timeout (float) -- duration before we'll time out our request
- get_server_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)[source]¶
Provides server descriptors published during the given time range, sorted oldest to newest.
Parameters: - start (datetime.datetime) -- publication time to begin with
- end (datetime.datetime) -- publication time to end with
- cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
- bridge (bool) -- standard descriptors if False, bridge if True
- timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) -- maximum attempts to impose on a per-archive basis
Returns: iterator of ServerDescriptor for the given time range
Raises : DownloadFailed if the download fails
- get_extrainfo_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)[source]¶
Provides extrainfo descriptors published during the given time range, sorted oldest to newest.
Parameters: - start (datetime.datetime) -- publication time to begin with
- end (datetime.datetime) -- publication time to end with
- cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
- bridge (bool) -- standard descriptors if False, bridge if True
- timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) -- maximum attempts to impose on a per-archive basis
Returns: iterator of RelayExtraInfoDescriptor for the given time range
Raises : DownloadFailed if the download fails
- get_microdescriptors(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]¶
Provides microdescriptors estimated to be published during the given time range, sorted oldest to newest. Unlike server/extrainfo descriptors, microdescriptors change very infrequently...
"Microdescriptors are expected to be relatively static and only change about once per week." -dir-spec section 3.3
CollecTor archives only contain microdescriptors that change, so hourly tarballs often contain very few. Microdescriptors also do not contain their publication timestamp, so this is estimated.
Parameters: - start (datetime.datetime) -- publication time to begin with
- end (datetime.datetime) -- publication time to end with
- cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
- timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) -- maximum attempts to impose on a per-archive basis
Returns: iterator of Microdescriptor for the given time range
Raises : DownloadFailed if the download fails
- get_consensus(start=None, end=None, cache_to=None, document_handler='ENTRIES', version=3, microdescriptor=False, bridge=False, timeout=None, retries=3)[source]¶
Provides consensus router status entries published during the given time range, sorted oldest to newest.
Parameters: - start (datetime.datetime) -- publication time to begin with
- end (datetime.datetime) -- publication time to end with
- cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
- document_handler (stem.descriptor.__init__.DocumentHandler) -- method in which to parse a NetworkStatusDocument
- version (int) -- consensus variant to retrieve (versions 2 or 3)
- microdescriptor (bool) -- provides the microdescriptor consensus if True, standard consensus otherwise
- bridge (bool) -- standard descriptors if False, bridge if True
- timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) -- maximum attempts to impose on a per-archive basis
Returns: iterator of RouterStatusEntry for the given time range
Raises : DownloadFailed if the download fails
- get_key_certificates(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]¶
Provides directory authority key certificates published during the given time range, sorted oldest to newest.
Parameters: - start (datetime.datetime) -- publication time to begin with
- end (datetime.datetime) -- publication time to end with
- cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
- timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) -- maximum attempts to impose on a per-archive basis
Returns: iterator of KeyCertificate for the given time range
Raises : DownloadFailed if the download fails
- get_bandwidth_files(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]¶
Provides bandwidth authority heuristics published during the given time range, sorted oldest to newest.
Parameters: - start (datetime.datetime) -- publication time to begin with
- end (datetime.datetime) -- publication time to end with
- cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
- timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) -- maximum attempts to impose on a per-archive basis
Returns: iterator of BandwidthFile for the given time range
Raises : DownloadFailed if the download fails
- get_exit_lists(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]¶
Provides TorDNSEL exit lists published during the given time range, sorted oldest to newest.
Parameters: - start (datetime.datetime) -- publication time to begin with
- end (datetime.datetime) -- publication time to end with
- cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
- timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) -- maximum attempts to impose on a per-archive basis
Returns: iterator of TorDNSEL for the given time range
Raises : DownloadFailed if the download fails
- index(compression='best')[source]¶
Provides the archives available in CollecTor.
Parameters: compression (descriptor.Compression) -- compression type to download with, if undefined we'll use the best compression available
Returns: dict with the archive contents
Raises : If unable to retrieve the index this provides...
- ValueError if json is malformed
- IOError if unable to decompress
- DownloadFailed if the download fails
- files(descriptor_type=None, start=None, end=None)[source]¶
Provides files CollecTor presently has, sorted oldest to newest.
Parameters: - descriptor_type (str) -- descriptor type or prefix to retrieve
- start (datetime.datetime) -- publication time to begin with
- end (datetime.datetime) -- publication time to end with
Returns: list of File
Raises : If unable to retrieve the index this provides...
- ValueError if json is malformed
- IOError if unable to decompress
- DownloadFailed if the download fails