Stem Docs



Package for parsing and processing descriptor data.

Module Overview:

parse_file - Parses the descriptors in a file.

Descriptor - Common parent for all descriptor file types.
  |- get_path - location of the descriptor on disk if it came from a file
  |- get_archive_path - location of the descriptor within the archive it came from
  |- get_bytes - similar to str(), but provides our original bytes content
  |- get_unrecognized_lines - unparsed descriptor content
  +- __str__ - string that the descriptor was made from

Ways in which we can parse a NetworkStatusDocument.

Both ENTRIES and BARE_DOCUMENT have a 'thin' document, which doesn't have a populated routers attribute. This allows for lower memory usage and upfront runtime. However, if read time and memory aren't a concern then DOCUMENT can provide you with a fully populated document.

DocumentHandler  Description
ENTRIES          Iterates over the contained RouterStatusEntry. Each has a reference to the bare document it came from (through its document attribute).
DOCUMENT         NetworkStatusDocument with the RouterStatusEntry it contains (through its routers attribute).
BARE_DOCUMENT    NetworkStatusDocument without a reference to its contents (the RouterStatusEntry are unread).
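
As a rough illustration of the DOCUMENT handler, the sketch below reads a consensus as a single, fully populated document and counts its router entries. The helper name count_routers and the consensus_path argument are made up for this example, and it assumes the file holds a v3 consensus.

```python
def count_routers(consensus_path):
    """Counts the router status entries in a v3 consensus file (illustrative helper)."""
    # Imported here so this sketch can be loaded even where stem isn't installed.
    from stem.descriptor import DocumentHandler, parse_file

    with open(consensus_path, 'rb') as consensus_file:
        # With DOCUMENT the iterator provides the document itself rather than
        # the individual router status entries.
        document = next(parse_file(
            consensus_file,
            'network-status-consensus-3 1.0',
            document_handler=DocumentHandler.DOCUMENT,
        ))

        return len(document.routers)
```

With the default ENTRIES handler the same call would instead iterate over the router status entries, each of which refers back to its bare document through its document attribute.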

stem.descriptor.__init__.parse_file(descriptor_file, descriptor_type=None, validate=False, document_handler='ENTRIES', normalize_newlines=None, **kwargs)

Simple function to read the descriptor contents from a file, providing an iterator for its Descriptor contents.

If you don't provide a descriptor_type argument then this automatically tries to determine the descriptor type based on the following...

  • The @type annotation on the first line. These are generally only found in the CollecTor archives.
  • The filename if it matches something from tor's data directory. For instance, tor's 'cached-descriptors' contains server descriptors.
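
To make the first rule concrete, here is a minimal sketch of reading an @type annotation such as '@type server-descriptor 1.0'. The function parse_type_annotation is a hypothetical helper written for this example, not the library's own detection code.

```python
def parse_type_annotation(first_line):
    """Returns (name, major, minor) for an '@type' line, or None if it isn't one."""
    if isinstance(first_line, bytes):
        first_line = first_line.decode('ascii', 'replace')

    parts = first_line.strip().split()

    if len(parts) != 3 or parts[0] != '@type':
        return None

    name, version = parts[1], parts[2]
    major, _, minor = version.partition('.')

    if not (major.isdigit() and minor.isdigit()):
        return None

    return name, int(major), int(minor)
```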

This is a handy function for simple usage, but if you're reading multiple descriptor files you might want to consider the DescriptorReader.

The supported descriptor types are listed below. Further minor versions are accepted too (i.e. if we support 1.1 then we also support everything from 1.0 and most things from 1.2, but not 2.0)...

Descriptor Type                            Class
server-descriptor 1.0                      RelayDescriptor
extra-info 1.0                             RelayExtraInfoDescriptor
microdescriptor 1.0                        Microdescriptor
directory 1.0                              unsupported
network-status-2 1.0                       RouterStatusEntryV2 (with a NetworkStatusDocumentV2)
dir-key-certificate-3 1.0                  KeyCertificate
network-status-consensus-3 1.0             RouterStatusEntryV3 (with a NetworkStatusDocumentV3)
network-status-vote-3 1.0                  RouterStatusEntryV3 (with a NetworkStatusDocumentV3)
network-status-microdesc-consensus-3 1.0   RouterStatusEntryMicroV3 (with a NetworkStatusDocumentV3)
bridge-network-status 1.0                  RouterStatusEntryV3 (with a BridgeNetworkStatusDocument)
bridge-server-descriptor 1.0               BridgeDescriptor
bridge-extra-info 1.1 or 1.2               BridgeExtraInfoDescriptor
torperf 1.0                                unsupported
bridge-pool-assignment 1.0                 unsupported
tordnsel 1.0                               TorDNSEL
hidden-service-descriptor 1.0              HiddenServiceDescriptor
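
The version rule above can be sketched as a lookup that accepts any minor version but rejects a different major version. This is only an approximation for illustration: KNOWN_TYPES holds a handful of rows from the table, and lookup_class is a hypothetical helper rather than anything the library provides.

```python
# A few rows from the table: descriptor name -> (major version, class name).
# None marks types that are listed as unsupported.
KNOWN_TYPES = {
    'server-descriptor': (1, 'RelayDescriptor'),
    'extra-info': (1, 'RelayExtraInfoDescriptor'),
    'microdescriptor': (1, 'Microdescriptor'),
    'directory': None,
    'torperf': None,
}


def lookup_class(name, major, minor):
    """Returns the class name for a descriptor type, or None if it isn't handled."""
    entry = KNOWN_TYPES.get(name)

    if entry is None:
        return None  # unknown or unsupported type

    supported_major, class_name = entry

    # The minor version is deliberately ignored, only the major one has to match.
    if major != supported_major:
        return None

    return class_name
```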

If you're using Python 3 then beware that the open() function defaults to text mode. Binary mode is strongly suggested because it's both faster (about 33x in my testing) and doesn't do universal newline translation, which can make us misparse the document.

my_descriptor_file = open(descriptor_path, 'rb')
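
The newline translation mentioned above can be seen with the standard library alone. The following sketch writes a couple of CRLF terminated lines (made-up content, not a real descriptor) to a temporary file, then reads them back in both modes.

```python
import os
import tempfile

raw = b'router example\r\nplatform Tor\r\n'

fd, path = tempfile.mkstemp()

try:
    with os.fdopen(fd, 'wb') as output_file:
        output_file.write(raw)

    # Binary mode hands back exactly the bytes that are on disk.
    with open(path, 'rb') as binary_file:
        binary_content = binary_file.read()

    # Text mode decodes to str and, by default, turns '\r\n' into '\n'.
    with open(path, 'r', encoding='ascii') as text_file:
        text_content = text_file.read()
finally:
    os.remove(path)

print(repr(binary_content))
print(repr(text_content))
```

The binary read matches the original bytes, while the text read has lost its carriage returns.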

Parameters:
  • descriptor_file (str,file,tarfile) -- path or opened file with the descriptor contents
  • descriptor_type (str) -- descriptor type, this is guessed if not provided
  • validate (bool) -- checks the validity of the descriptor's content if True, skips these checks otherwise
  • document_handler (stem.descriptor.__init__.DocumentHandler) -- method in which to parse the NetworkStatusDocument
  • normalize_newlines (bool) -- converts Windows newlines (CRLF) if True; this is the default when reading data directories on Windows
  • kwargs (dict) -- additional arguments for the descriptor constructor

Returns: iterator for Descriptor instances in the file

Raises:
  • ValueError if the contents are malformed and validate is True
  • TypeError if we can't match the contents of the file to a descriptor type
  • IOError if unable to read from the descriptor_file
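
Putting the parameters and exceptions together, a caller might wrap parse_file roughly as follows. This is a sketch under assumptions: load_descriptors is a hypothetical helper, and returning an empty list after printing the problem is just one way to handle the errors.

```python
def load_descriptors(path, descriptor_type=None):
    """Reads all descriptors from a file, returning an empty list on failure."""
    # Imported here so this sketch can be loaded even where stem isn't installed.
    from stem.descriptor import parse_file

    try:
        # Binary mode, as suggested above.
        with open(path, 'rb') as descriptor_file:
            # Consume the iterator before the file is closed.
            return list(parse_file(descriptor_file, descriptor_type, validate=True))
    except ValueError as exc:
        print('Malformed descriptor content in %s: %s' % (path, exc))
    except TypeError as exc:
        print('Unable to determine the descriptor type of %s: %s' % (path, exc))
    except IOError as exc:
        print('Unable to read %s: %s' % (path, exc))

    return []
```

Since the descriptors are provided through an iterator, the list() call has to happen while the file is still open.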

class stem.descriptor.__init__.Descriptor(contents, lazy_load=False)

Bases: object

Common parent for all types of descriptors.


get_path()

Provides the absolute path that we loaded this descriptor from.

Returns: str with the absolute path of the descriptor source

get_archive_path()

If this descriptor came from an archive then this provides its path within the archive. This is only set if the descriptor came from a DescriptorReader, and is None if the descriptor didn't come from an archive.

Returns: str with the descriptor's path within the archive

get_bytes()

Provides the ASCII bytes of the descriptor. This only differs from str() if you're running Python 3.x, in which case str() provides a unicode string.

Returns: bytes for the descriptor's contents

get_unrecognized_lines()

Provides a list of lines that were either ignored or had data that we did not know how to process. This is most commonly due to new descriptor fields that this library does not yet know how to process. Patches welcome!

Returns: list of lines of unrecognized content
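
The accessors above can be combined into a small summary of where a descriptor came from and how much of it was understood. The helper below, summarize, is a hypothetical name used only for illustration; it relies on nothing beyond the four methods listed in the module overview.

```python
def summarize(descriptor):
    """Collects basic information about a descriptor into a dict."""
    return {
        'path': descriptor.get_path(),
        'archive_path': descriptor.get_archive_path(),
        'size_in_bytes': len(descriptor.get_bytes()),
        'unrecognized_lines': list(descriptor.get_unrecognized_lines()),
    }
```

A non-empty unrecognized_lines entry usually just means the descriptor has fields that are newer than the library.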