Stem Docs

CollecTor

Descriptor archives are available from CollecTor. If you need Tor's topology at a prior point in time this is the place to go!

With CollecTor you can either read descriptors directly...

import datetime
import stem.descriptor.collector

yesterday = datetime.datetime.utcnow() - datetime.timedelta(days = 1)

# provide yesterday's exits

exits = {}

for desc in stem.descriptor.collector.get_server_descriptors(start = yesterday):
  if desc.exit_policy.is_exiting_allowed():
    exits[desc.fingerprint] = desc

print('%i relays published an exiting policy in the last day...\n' % len(exits))

for fingerprint, desc in exits.items():
  print('  %s (%s)' % (desc.nickname, fingerprint))

... or download the descriptors to disk and read them later.

import datetime
import stem.descriptor
import stem.descriptor.collector

yesterday = datetime.datetime.utcnow() - datetime.timedelta(days = 1)
cache_dir = '~/descriptor_cache/server_desc_today'

collector = stem.descriptor.collector.CollecTor()

for f in collector.files('server-descriptor', start = yesterday):
  f.download(cache_dir)

# then later...

for f in collector.files('server-descriptor', start = yesterday):
  for desc in f.read(cache_dir):
    if desc.exit_policy.is_exiting_allowed():
      print('  %s (%s)' % (desc.nickname, desc.fingerprint))
get_instance - Provides a singleton CollecTor used for...
  |- get_server_descriptors - published server descriptors
  |- get_extrainfo_descriptors - published extrainfo descriptors
  |- get_microdescriptors - published microdescriptors
  |- get_consensus - published router status entries
  |
  |- get_key_certificates - authority key certificates
  |- get_bandwidth_files - bandwidth authority heuristics
  +- get_exit_lists - TorDNSEL exit list

File - Individual file residing within CollecTor
  |- read - provides descriptors from this file
  +- download - download this file to disk

CollecTor - Downloader for descriptors from CollecTor
  |- get_server_descriptors - published server descriptors
  |- get_extrainfo_descriptors - published extrainfo descriptors
  |- get_microdescriptors - published microdescriptors
  |- get_consensus - published router status entries
  |
  |- get_key_certificates - authority key certificates
  |- get_bandwidth_files - bandwidth authority heuristics
  |- get_exit_lists - TorDNSEL exit list
  |
  |- index - metadata for content available from CollecTor
  +- files - files available from CollecTor

New in version 1.8.0.

stem.descriptor.collector.get_instance()[source]

Provides the singleton CollecTor used for this module's shorthand functions.

Returns:

singleton CollecTor instance
stem.descriptor.collector.get_server_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)[source]

Shorthand for get_server_descriptors() on our singleton instance.

stem.descriptor.collector.get_extrainfo_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)[source]

Shorthand for get_extrainfo_descriptors() on our singleton instance.

stem.descriptor.collector.get_microdescriptors(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]

Shorthand for get_microdescriptors() on our singleton instance.

stem.descriptor.collector.get_consensus(start=None, end=None, cache_to=None, document_handler='ENTRIES', version=3, microdescriptor=False, bridge=False, timeout=None, retries=3)[source]

Shorthand for get_consensus() on our singleton instance.

stem.descriptor.collector.get_key_certificates(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]

Shorthand for get_key_certificates() on our singleton instance.

stem.descriptor.collector.get_bandwidth_files(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]

Shorthand for get_bandwidth_files() on our singleton instance.

stem.descriptor.collector.get_exit_lists(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]

Shorthand for get_exit_lists() on our singleton instance.

class stem.descriptor.collector.File(path, types, size, sha256, first_published, last_published, last_modified)[source]

Bases: object

File within CollecTor.

Variables:
  • path (str) -- file path within collector
  • types (tuple) -- descriptor types contained within this file
  • compression (stem.descriptor.Compression) -- file compression, None if this cannot be determined
  • size (int) -- size of the file
  • sha256 (str) -- file's sha256 checksum
  • start (datetime) -- first publication within the file, None if this cannot be determined
  • end (datetime) -- last publication within the file, None if this cannot be determined
  • last_modified (datetime) -- when the file was last modified
read(directory=None, descriptor_type=None, start=None, end=None, document_handler='ENTRIES', timeout=None, retries=3)[source]

Provides descriptors from this archive. Descriptors are downloaded or read from disk as follows...

  • If this file has already been downloaded through download() these descriptors are read from disk.
  • If a directory argument is provided and the file is already present these descriptors are read from disk.
  • If a directory argument is provided and the file is not present, the file is downloaded to this location then read.
  • If the file has not been downloaded and no directory argument is provided, the file is downloaded to a temporary directory that's deleted after it is read.
Parameters:
  • directory (str) -- destination to download into
  • descriptor_type (str) -- descriptor type, this is guessed if not provided
  • start (datetime.datetime) -- publication time to begin with
  • end (datetime.datetime) -- publication time to end with
  • document_handler (stem.descriptor.__init__.DocumentHandler) -- method in which to parse a NetworkStatusDocument
  • timeout (int) -- timeout when connection becomes idle, no timeout applied if None
  • retries (int) -- maximum attempts to impose
Returns:

iterator for Descriptor instances in the file

Raises:
  • ValueError if unable to determine the descriptor type
  • TypeError if we cannot parse this descriptor type
  • DownloadFailed if the download fails
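The caching behavior above can be sketched as follows. The helper name and cache path here are illustrative, not part of stem's API; stem is imported lazily inside the generator so the snippet loads even where stem isn't installed.

```python
import datetime


def read_cached_descriptors(cache_dir='~/descriptor_cache'):
  # Sketch: yields yesterday's server descriptors through File.read(),
  # downloading each archive into cache_dir only when it isn't already
  # present there. Lazy import: nothing runs until iteration begins.
  import stem.descriptor.collector

  yesterday = datetime.datetime.utcnow() - datetime.timedelta(days = 1)
  collector = stem.descriptor.collector.CollecTor()

  for f in collector.files('server-descriptor', start = yesterday):
    for desc in f.read(cache_dir):
      yield desc
```

Because this is a generator, no index fetch or download happens until it is iterated.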
download(directory, decompress=True, timeout=None, retries=3, overwrite=False)[source]

Downloads this file to the given location. If a file already exists this is a no-op.

Parameters:
  • directory (str) -- destination to download into
  • decompress (bool) -- decompress written file
  • timeout (int) -- timeout when connection becomes idle, no timeout applied if None
  • retries (int) -- maximum attempts to impose
  • overwrite (bool) -- if this file exists but mismatches CollecTor's checksum then overwrites if True, otherwise raises an exception
Returns:

str with the path we downloaded to

Raises:
  • DownloadFailed if the download fails
  • IOError if a mismatching file exists and overwrite is False
class stem.descriptor.collector.CollecTor(retries=2, timeout=None)[source]

Bases: object

Downloader for descriptors from CollecTor. The contents of CollecTor are provided in an index that's fetched as required.

Variables:
  • retries (int) -- number of times to attempt the request if downloading it fails
  • timeout (float) -- duration before we'll time out our request
get_server_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)[source]

Provides server descriptors published during the given time range, sorted oldest to newest.

Parameters:
  • start (datetime.datetime) -- publication time to begin with
  • end (datetime.datetime) -- publication time to end with
  • cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
  • bridge (bool) -- standard descriptors if False, bridge if True
  • timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
  • retries (int) -- maximum attempts to impose on a per-archive basis
Returns:

iterator of ServerDescriptor for the given time range

Raises:

DownloadFailed if the download fails

get_extrainfo_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)[source]

Provides extrainfo descriptors published during the given time range, sorted oldest to newest.

Parameters:
  • start (datetime.datetime) -- publication time to begin with
  • end (datetime.datetime) -- publication time to end with
  • cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
  • bridge (bool) -- standard descriptors if False, bridge if True
  • timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
  • retries (int) -- maximum attempts to impose on a per-archive basis
Returns:

iterator of RelayExtraInfoDescriptor for the given time range

Raises:

DownloadFailed if the download fails

get_microdescriptors(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]

Provides microdescriptors estimated to be published during the given time range, sorted oldest to newest. Unlike server/extrainfo descriptors, microdescriptors change very infrequently...

"Microdescriptors are expected to be relatively static and only change
about once per week." -dir-spec section 3.3

CollecTor archives only contain microdescriptors that change, so hourly tarballs often contain very few. Microdescriptors also do not contain their publication timestamp, so this is estimated.

Parameters:
  • start (datetime.datetime) -- publication time to begin with
  • end (datetime.datetime) -- publication time to end with
  • cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
  • timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
  • retries (int) -- maximum attempts to impose on a per-archive basis
Returns:

iterator of Microdescriptor for the given time range

Raises:

DownloadFailed if the download fails
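Since microdescriptor publication times are estimated, a loose time window is usually enough. A minimal sketch (the helper name is illustrative; stem is imported lazily so the snippet loads without it installed):

```python
import datetime


def recent_microdescriptors(hours = 24):
  # Sketch: yields microdescriptors estimated to have been published
  # within the last `hours` hours. Expect few results per hourly
  # tarball, since only changed microdescriptors are archived.
  import stem.descriptor.collector

  start = datetime.datetime.utcnow() - datetime.timedelta(hours = hours)

  for desc in stem.descriptor.collector.get_microdescriptors(start = start):
    yield desc
```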

get_consensus(start=None, end=None, cache_to=None, document_handler='ENTRIES', version=3, microdescriptor=False, bridge=False, timeout=None, retries=3)[source]

Provides consensus router status entries published during the given time range, sorted oldest to newest.

Parameters:
  • start (datetime.datetime) -- publication time to begin with
  • end (datetime.datetime) -- publication time to end with
  • cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
  • document_handler (stem.descriptor.__init__.DocumentHandler) -- method in which to parse a NetworkStatusDocument
  • version (int) -- consensus variant to retrieve (versions 2 or 3)
  • microdescriptor (bool) -- provides the microdescriptor consensus if True, standard consensus otherwise
  • bridge (bool) -- standard descriptors if False, bridge if True
  • timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
  • retries (int) -- maximum attempts to impose on a per-archive basis
Returns:

iterator of RouterStatusEntry for the given time range

Raises:

DownloadFailed if the download fails
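With the default 'ENTRIES' document handler each consensus is flattened into individual router status entries, so filtering by relay flag is a simple membership check. A sketch under that assumption (the helper name is illustrative; stem is imported lazily so the snippet loads without it installed):

```python
import datetime


def exit_fingerprints_since(days = 1):
  # Sketch: yields fingerprints of Exit-flagged router status entries
  # from consensuses published in the last `days` days. 'Exit' is the
  # flag string stem exposes through entry.flags.
  import stem.descriptor.collector

  start = datetime.datetime.utcnow() - datetime.timedelta(days = days)

  for entry in stem.descriptor.collector.get_consensus(start = start):
    if 'Exit' in entry.flags:
      yield entry.fingerprint
```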

get_key_certificates(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]

Directory authority key certificates for the given time range, sorted oldest to newest.

Parameters:
  • start (datetime.datetime) -- publication time to begin with
  • end (datetime.datetime) -- publication time to end with
  • cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
  • timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
  • retries (int) -- maximum attempts to impose on a per-archive basis
Returns:

iterator of KeyCertificate for the given time range

Raises:

DownloadFailed if the download fails

get_bandwidth_files(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]

Bandwidth authority heuristics for the given time range, sorted oldest to newest.

Parameters:
  • start (datetime.datetime) -- publication time to begin with
  • end (datetime.datetime) -- publication time to end with
  • cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
  • timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
  • retries (int) -- maximum attempts to impose on a per-archive basis
Returns:

iterator of BandwidthFile for the given time range

Raises:

DownloadFailed if the download fails

get_exit_lists(start=None, end=None, cache_to=None, timeout=None, retries=3)[source]

TorDNSEL exit lists for the given time range, sorted oldest to newest.

Parameters:
  • start (datetime.datetime) -- publication time to begin with
  • end (datetime.datetime) -- publication time to end with
  • cache_to (str) -- directory to cache archives into, if an archive is available here it is not downloaded
  • timeout (int) -- timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
  • retries (int) -- maximum attempts to impose on a per-archive basis
Returns:

iterator of TorDNSEL for the given time range

Raises:

DownloadFailed if the download fails
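Each TorDNSEL descriptor carries its observed exit addresses as (address, timestamp) tuples in its exit_addresses attribute, so deduplicating across a time range is straightforward. A sketch (the helper name is illustrative; stem is imported lazily so the snippet loads without it installed):

```python
import datetime


def recent_exit_addresses(days = 1):
  # Sketch: yields each distinct exit address TorDNSEL observed over
  # the last `days` days, deduplicated across descriptors.
  import stem.descriptor.collector

  start = datetime.datetime.utcnow() - datetime.timedelta(days = days)
  seen = set()

  for desc in stem.descriptor.collector.get_exit_lists(start = start):
    for address, timestamp in desc.exit_addresses:
      if address not in seen:
        seen.add(address)
        yield address
```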

index(compression='best')[source]

Provides the archives available in CollecTor.

Parameters:

compression (descriptor.Compression) -- compression type to download with, if undefined we'll use the best decompression available

Returns:

dict with the archive contents

Raises:

If unable to retrieve the index this provides...

  • ValueError if json is malformed
  • IOError if unable to decompress
  • DownloadFailed if the download fails
files(descriptor_type=None, start=None, end=None)[source]

Provides files CollecTor presently has, sorted oldest to newest.

Parameters:
  • descriptor_type (str) -- descriptor type or prefix to retrieve
  • start (datetime.datetime) -- publication time to begin with
  • end (datetime.datetime) -- publication time to end with
Returns:

list of File

Raises:

If unable to retrieve the index this provides...

  • ValueError if json is malformed
  • IOError if unable to decompress
  • DownloadFailed if the download fails