A Compatible Way of Getting Package-Paths in Python

The Python community is always looking to improve consistency, and that’s both its best and worst feature, because from time to time one method isn’t finished before another method is begun.

For example, when it comes to packaging, you may need to account for the same requirement in several different ways, such as declaring non-Python files both in the “package data” clauses of your setup attributes and in a MANIFEST.in (one is considered only when building source distributions, and the other only when building binary distributions). To do this, you also have to embed the non-Python files within one of your actual source directories (the “package” directories), because package-data files are made to belong to particular packages. We won’t even talk about the complexities of source and binary packages when packaging into a wheel. Such divergences are the topic of many entire series of articles.
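
To make the duplication concrete, here is a minimal sketch; the “mypackage” name and data-file layout are hypothetical:

# setup.py (sketch). The data files live inside the "mypackage" source
# directory, because package-data files must belong to a package.
from setuptools import setup

setup(
    name='mypackage',
    version='0.1.0',
    packages=['mypackage'],
    package_data={'mypackage': ['data/*.json']},
)

# MANIFEST.in would then need a matching line for source distributions:
#
#   include mypackage/data/*.json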

A similar compatibility problem: in order to use one of the module loaders for the purposes of reflection, the recommended package varies depending on whether you’re running 2.x, 3.2/3.3, or 3.4+.

It’s a pain. For your convenience, here is one such flow, used to determine the path of a package:

import os

_MODULE_NAME = 'module name'
_APP_PATH = None

# Works in 3.4

try:
    import importlib.util
    _ORIGIN = importlib.util.find_spec(_MODULE_NAME).origin
    _APP_PATH = os.path.abspath(os.path.dirname(_ORIGIN))
except (ImportError, AttributeError):
    pass

# Works in 3.2

if _APP_PATH is None:
    try:
        import importlib
        _INITFILEPATH = importlib.find_loader(_MODULE_NAME).path
        _APP_PATH = os.path.abspath(os.path.dirname(_INITFILEPATH))
    except (ImportError, AttributeError):
        pass

# Works in 2.x

if _APP_PATH is None:
    import imp
    _APP_PATH = imp.find_module(_MODULE_NAME)[1]

Python Function Annotations

Python 3.0 introduced a relatively unknown feature called “function annotations”: a way to tag parameters and the return value with arbitrary information at the function-definition level.

You can annotate using strings or any other type that you’d like:

>>> def some_function(parm1: "Example parameter"):
...   pass
... 
>>> some_function.__annotations__
{'parm1': 'Example parameter'}
>>> x = 5
>>> def some_function_2(parm1: x * 20):
...   pass
... 
>>> some_function_2.__annotations__
{'parm1': 100}

You can also annotate the return:

>>> def some_function_3() -> 'return-value tag':
...   pass
... 
>>> some_function_3.__annotations__
{'return': 'return-value tag'}

It’s important to note that there are already strong conventions for documenting your parameters, thanks to Sphinx. Therefore, the utility of annotations will most likely lie in functionality rather than documentation. Note that annotation expressions are evaluated when the def statement executes, so, for example, you can annotate closures on the fly:

import random

c_list = []
for i in range(10):
    def closure_() -> random.random():
        pass

    c_list.append(closure_)

list(map(lambda x: print(x.__annotations__), c_list))

This is the output:

{'return': 0.9644971188983055}
{'return': 0.8639746158842893}
{'return': 0.18610468531065305}
{'return': 0.8528801446167985}
{'return': 0.3022338513329076}
{'return': 0.6455491244718428}
{'return': 0.09106740460937834}
{'return': 0.16987808849543917}
{'return': 0.9136478506241527}
{'return': 0.41691681086623544}

Absolutely nothing in the Python language or library is dependent on annotations, so they’re yours to play with or implement as you see fit.
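
For example, since nothing else claims them, you could repurpose annotations for simple runtime type-checking. The following is only a minimal sketch; the check_types decorator and scale function are hypothetical:

import functools

def check_types(fn):
    # Treat any annotation that happens to be a type as a runtime
    # constraint on the corresponding argument.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        annotations = fn.__annotations__
        names = fn.__code__.co_varnames[:fn.__code__.co_argcount]
        for name, value in list(zip(names, args)) + list(kwargs.items()):
            expected = annotations.get(name)
            if isinstance(expected, type) and \
               not isinstance(value, expected):
                raise TypeError("'%s' must be %s" %
                                (name, expected.__name__))
        return fn(*args, **kwargs)
    return wrapper

@check_types
def scale(x: int, factor: float):
    return x * factor

scale(2, 1.5)    # Fine.
scale('a', 1.5)  # Raises TypeError.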

Python 3: Opening for Write, but Failing if it Already Exists

Python 3.3 added a new file mode that allows you to create a new file and open it for write only if it does not already exist.

>>> with open('new_file', 'x') as f:
...   pass
... 
>>> with open('new_file', 'x') as f:
...   pass
... 
Traceback (most recent call last):
  File "", line 1, in 
FileExistsError: [Errno 17] File exists: 'new_file'
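
Prior to 3.3, you could approximate the same atomic create-or-fail behavior with os.open; a minimal sketch:

import os

# O_CREAT | O_EXCL makes the creation atomic: os.open() raises OSError
# (errno EEXIST) if the file already exists, rather than truncating it.
fd = os.open('new_file', os.O_WRONLY | os.O_CREAT | os.O_EXCL)
with os.fdopen(fd, 'w') as f:
    f.write('data')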

Easy and Loveable Cartesian Products in Python

Use more than one for in the same list comprehension:

[(i, j, k) 
 for i in (11, 22, 33) 
 for j in (44, 55, 66) 
 for k in (77, 88, 99)]

This results in all possible combinations (a Cartesian product), where the rightmost for-clause varies fastest:

[(11, 44, 77), (11, 44, 88), (11, 44, 99), 
 (11, 55, 77), (11, 55, 88), (11, 55, 99), 
 (11, 66, 77), (11, 66, 88), (11, 66, 99), 

 (22, 44, 77), (22, 44, 88), (22, 44, 99), 
 (22, 55, 77), (22, 55, 88), (22, 55, 99), 
 (22, 66, 77), (22, 66, 88), (22, 66, 99), 

 (33, 44, 77), (33, 44, 88), (33, 44, 99), 
 (33, 55, 77), (33, 55, 88), (33, 55, 99), 
 (33, 66, 77), (33, 66, 88), (33, 66, 99)]
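
The standard library’s itertools.product produces the same tuples, in the same order, and reads better as the number of axes grows:

import itertools

# Equivalent to the nested comprehension above.
list(itertools.product((11, 22, 33), (44, 55, 66), (77, 88, 99)))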

Using ZeroMQ With Coroutines (gevent) Under Python

ZeroMQ (0MQ) is a beautiful library that basically replaces the socket layer with a very thin, pattern-based wrapper. Aside from removing this overhead from your code, 0MQ also usually gives you the guarantee that one read will return one message (or one part of a multipart message).

gevent is a coroutine-based networking library for Python. Coroutines let you exploit the time that certain operations, like network requests, spend blocking in order to perform other work while you wait (this works best when you’re doing a number of similar operations in parallel). It’s a compromise that can speed up synchronous operations to the point of being comparable to multithreading (at least in the case of network operations).

There was a point at which ZeroMQ didn’t support this (and a package named gevent_zmq had to be used), but it has since become compatible with it.

For example, a server:

import gevent

import zmq.green as zmq

_BINDING = 'ipc:///tmp/test_server'

context = zmq.Context()

def server():
    server_socket = context.socket(zmq.REP)
    server_socket.bind(_BINDING)

    while 1:
        received = server_socket.recv()
        print("Received:\n[%s]" % (received))
        print('')

        server_socket.send(b'TestResponse')

server = gevent.spawn(server)
server.join()

The corresponding client:

import gevent

import zmq.green as zmq

_BINDING = 'ipc:///tmp/test_server'

context = zmq.Context()

def client():
    client_socket = context.socket(zmq.REQ)
    client_socket.connect(_BINDING)

    client_socket.send(b"TestMessage")

    response = client_socket.recv()
    print("Response:\n[%s]" % (response))
    print('')

client = gevent.spawn(client)
client.join()

Displaying the output here would be nearly redundant, given that the result is plainly obvious.

Python Modules Under OSX (10.9.2)

Recently, Apple released an update that broke the build of every C-based Python package, and probably more (the “-mno-fused-madd” compilation error). See here for more information.

To fix, add these to your environment:

export CFLAGS=-Qunused-arguments
export CPPFLAGS=-Qunused-arguments

Then, add this to your sudoers file (via visudo), which will allow sudo to have access to those variables:

Defaults        env_keep += "CFLAGS CPPFLAGS"

Adding Custom Data to X.509 SSL Certificates

Signed SSL certificates have a feature known as “extensions”. For extensions to appear in the signed certificate, they must be present in the CSR, so CSRs support them as well. Although X.509 certificates are not meant to carry much data and were never meant to act as databases (rather, an identity with associated information), they act as a great solution when you need to store secured information alongside your application at a client site. Though the data is viewable, you have the following guarantees:

  • The data (including the extensions) can not be tampered with without invalidating the certificate’s signature.
  • The certificate will expire at a set time (and can be renewed if need be).
  • A certificate-revocation list (CRL) can be implemented (using a CRL distribution point, or “CDP”) so that you can invalidate a certificate remotely.

As long as you don’t care about keeping the data secret, this makes extensions an ideal solution to a problem like on-site client licenses, where your software needs to regularly check whether the client still has permission to operate. You can also use a CRL to disable the licenses if the client stops paying their bill.

These extensions accommodate data that goes beyond the distinguished-name (DN) fields (locality, country, organization, common-name, etc.), the chain of trust, the key fingerprints, and the signatures that guarantee the trustworthiness of the certificate (the signature of the CA) and the integrity of its contents (the signature of the certificate contents). Extensions are relatively easy to add to certificates, whether you’re creating CSRs from code or from the command line. They’re just manageably-sized strings of human-readable text (though there technically seems to be no official length limit).

If you own the CA, then you might also create your own extensions. In this case, you’ll refer to your extension with a unique dotted identifier called an “OID” (we’ll go into this in the ASN.1 explanation below). Libraries might have trouble if you refer to your own extension without properly registering it with the library first. For example, OpenSSL has the ability to register and use custom extensions, but the M2Crypto SSL library doesn’t expose the registration call and, therefore, can’t use custom extensions.

Unsupported extensions might be skipped or omitted from the signed certificate by a CA that doesn’t recognize/support them, so beware that you’ll need to stick to the popular extensions if you can’t use your own CA. Extensions that are mandatory for your requirements can be marked as “critical”, so that signing won’t proceed if any of your extensions isn’t recognized.

The extension that we’re interested in here is “subjectAltName”, and it is recognized/supported by all CAs. This extension can describe the “alternative subjects” (using DNS-type entries) that you might need to specify if your X.509 certificate is to be used with more than one common-name (more than one hostname). It can also describe email addresses and other kinds of identity information. However, it can also store custom text.

This is an example of two “subjectAltName” extensions (multiple instances of the same extension can be present in a certificate):

DNS:server1.yourdomain.tld, DNS:server2.yourdomain.tld
otherName:1.3.6.1.4.1.99;UTF8:This is arbitrary data.

However, for reasons that will soon become clear, it’s very difficult to pull the extension text back out again. In order to go further, we have to take a quick diversion into certificate structure. This isn’t strictly required, but the information is obscure enough that, without it, you’ll have little to go on if you encounter issues.

Certificate Encoding

All of the standard, human-readable, SSL documents, such as the private-key, public-key, CSR, and X.509, are encoded in a format called PEM. This is base64-encoded data with anchors (e.g. "-----BEGIN DSA PRIVATE KEY-----") on the top and bottom.

To be of any use, a PEM-encoded document must be converted to a DER-encoded document. This just means that it’s stripped of the anchors and newlines, and then base64-decoded. DER is a stricter subset of “BER” encoding.
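
The conversion is simple enough to do by hand, although in practice a library will do it for you. A minimal sketch; pem_to_der is a hypothetical helper:

import base64

def pem_to_der(pem_text):
    # Drop the "-----BEGIN/END ...-----" anchor lines, join the
    # remaining base64 body, and decode it to raw DER bytes.
    body = [line.strip()
            for line in pem_text.splitlines()
            if line.strip() and not line.startswith('-----')]
    return base64.b64decode(''.join(body))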

ASN.1 Encoding

The DER-encoding describes an ASN.1 data structure. ASN.1 combines a tree of data with a tree of grammar specifications, and reduces down to hierarchical sets of DER-encoded data. All nodes (called “tags”) are represented by dot-separated identifiers called OIDs (mentioned above). Usually these are officially assigned OIDs, but you might have some custom ones if you don’t have to pass your certificates to a higher authority that might have a problem with them.

In order to decode the structure, you must walk it, applying the correct specs as required. There is nothing self-descriptive within the data. This makes it fast, but useless until you have enough pre-existing knowledge to descend to the information you require.

The specification for the common grammars (like RFC 2459 for X.509) in ASN.1 is so massive that you should expect to avoid getting involved in the mechanics at all costs, and to learn how to survive with the limited number of libraries already available. In all likelihood, a need for anything outside the realm of popular usage will require a non-trivial degree of debugging.

ASN.1 has been around… for a while (about thirty years, as of this year). It’s obtuse, daunting, and understood in great detail by very few individuals. However, it’s going to be here for a while.

Extension Decoding

The reason that extensions are tough to decode is that the encoding depends on the text that you put in the extension: specifically, the “otherName” and “UTF8” parts. OpenSSL can’t present these values when it dumps the certificate, because it just doesn’t have enough information to decode them. M2Crypto, since it wraps OpenSSL, has the same problem.

Now that we’ve introduced a little of the conceptual ASN.1 structure, let’s go back to the previous subjectAltName “otherName” example:

otherName:1.3.6.1.4.1.99;UTF8:This is arbitrary data.

The following is the breakdown:

  1. “otherName”: A classification of the subjectAltName extension that indicates custom data. This has an OID of its own in the RFC 2459 grammar.
  2. 1.3.6.1.4.1.99: The OID of your company. The first six parts (1.3.6.1.4.1) comprise the common prefix, combined with a “private enterprise number” (PEN). You can register for your own.
  3. Custom data, prefixed with a type. The “UTF8” prefix determines the encoding of the data, but is not itself included.

I used the following calls to M2Crypto to add these extensions to the X.509:

from M2Crypto import X509

ext = X509.new_extension(
        'subjectAltName',
        'otherName:1.3.6.1.4.1.99;UTF8:This is arbitrary data.'
    )

# Mark the extension as critical and attach it to the certificate.
ext.set_critical(1)
cert.add_ext(ext)

Aside from the extension information itself, I also indicate that it’s to be considered “critical”: signing will fail if the CA doesn’t recognize the extension, rather than simply omitting it. When this gets encoded, it’ll be encoded as three separate “components”:

  1. The OID for the “otherName” type.
  2. The “critical” flag.
  3. A DER-encoded sequence of the PEN and the UTF8-encoded string.

It turns out that it’s quicker to use a library that specializes in ASN.1 rather than trying to get the information from OpenSSL. After all, this data is out of OpenSSL’s scope: it’s colocated with cryptographic data while not being cryptographic itself.

I used pyasn1.

Decoding Our Extension

To decode the string from the previous extension:

  1. Enumerate the extensions.
  2. Decode the third component (mentioned above) using the RFC 2459 “subjectAltName” grammar.
  3. Descend to the first component of the “SubjectAltName” node: the “GeneralName” node.
  4. Descend to the first component of the “GeneralName” node: the “AnotherName” node.
  5. Match the OID against the OID we’re looking for.
  6. Decode the string using the RFC 2459 UTF8 specification.

This is a dump of the structure using pyasn1:

SubjectAltName().
   setComponentByPosition(
       0, 
       GeneralName().
           setComponentByPosition(
               0, 
               AnotherName().
                   setComponentByPosition(
                       0, 
                       ObjectIdentifier(1.3.6.1.5.5.7.1.99)
                   ).
                   setComponentByPosition(
                       1, 
                       Any(hexValue='0309006465616462656566')
                   )
           )
   )

The process might seem easy, but this took some work (and collaboration) to get right, with the primary difficulty coming from obscurity meeting unfamiliarity. However, the process should now be the same every time.

This is the corresponding code. “cert” is an M2Crypto X.509 certificate:

from pyasn1.codec.der.decoder import decode
from pyasn1_modules import rfc2459

cert, rest = decode(cert.as_der(), asn1Spec=rfc2459.Certificate())

extensions = cert['tbsCertificate']['extensions']
for extension in extensions:
    extension_oid = extension.getComponentByPosition(0)
    print("0 [%s]" % (repr(extension_oid)))

    critical_flag = extension.getComponentByPosition(1)
    print("1 [%s]" % (repr(critical_flag)))

    sal_raw = extension.getComponentByPosition(2)
    print("2 [%s]" % (repr(sal_raw)))

    (sal, r) = decode(sal_raw, rfc2459.SubjectAltName())
    
    gn = sal.getComponentByPosition(0)
    an = gn.getComponentByPosition(0)

    oid = an.getComponentByPosition(0)
    string = an.getComponentByPosition(1)

    print("[%s]" % (oid))

    # Decode the text.

    s, r = decode(string, rfc2459.UTF8String())

    print("Decoded: [%s]" % (s))
    print('')

Wrap Up

I wanted to provide an end-to-end tutorial on adding and retrieving “otherName”-type “subjectAltName” extensions, because none currently exists. It’s a good solution for keeping data safe on someone else’s assets (as long as you don’t overburden the certificate with extensions, which will slow verification).

Don’t forget to implement the CRL/CDP, or you won’t be able to invalidate the certificate (and its extensions) without waiting for it to expire.

Writing and Reading 7-Zip Archives From Python

I don’t often need to read or write archives from code. When I do, and I don’t want to call a tool via shell-commands, I’ll use zip-files. Obviously there are better formats out there, but when it comes to library compatibility, tar and zip are the easiest possible formats to manipulate. If you’re desperate, you can even write a quick tar archiver with relative simplicity (the headers are mostly ASCII).
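
Both of those formats ship in the standard library, which is most of the convenience; a quick sketch (the file names here are placeholders):

import tarfile

# Write, and then list, a gzipped tar archive with nothing but the stdlib.
with tarfile.open('example.tar.gz', 'w:gz') as t:
    t.add('some_file')

with tarfile.open('example.tar.gz', 'r:gz') as t:
    print(t.getnames())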

Obviously, the emphasis here has been on availability. My preferred format is 7-Zip (which uses LZMA compression). Though you don’t often see 7-Zip archives for download, I’ve been using this format for eight years and haven’t looked back. The compression is good and the tool is every bit as easy as zip.

Unfortunately, there’s limited support for 7-Zip in Python. To the best of my knowledge, only the libarchive Python package can read and write 7-Zip archives. The libarchive Python package is developed and supported separately from the C library that it implements.

Though the library is structured to support any format that the libarchive library can (all major formats, and probably all of the minor ones), the Python project is outrightly labeled as a work-in-progress. 7-Zip is the only format explicitly supported for both reading and writing. Fortunately, it also supports libarchive’s autodetection functionality, so you can read/expand any archive, as long as you can afford the extra couple of milliseconds that the detection will cost you.

The focus of this project is to provide elegant archiving routines. Most of the API functions are implemented as generators.

Example

To enumerate the entries in an archive:

import libarchive

with libarchive.reader('test.7z') as reader:
    for e in reader:
        # (The entry evaluates to a filename.)
        print("> %s" % (e))

To extract the entries from an archive to the current directory (like a normal, Unix-based extraction):

import libarchive

for state in libarchive.pour('test.7z'):
    if state.pathname == 'dont/write/me':
        state.set_selected(False)
        continue

    # (The state evaluates to a filename.)
    print("Writing: %s" % (state))

To build an archive from a collection of files (omit the target for stdout):

import libarchive

for entry in libarchive.create(
                '7z', 
                ['/aa/bb', '/cc/dd'], 
                'create.7z'):
    print("Adding: %s" % (entry))

Reading Keypresses Under Python

An elegant solution for reading individual keypresses under Python:

import termios
import sys
import os

def read_keys():
    fd = sys.stdin.fileno()
    old = termios.tcgetattr(fd)
    new = termios.tcgetattr(fd)

    # Disable canonical (line-buffered) mode and local echo.
    new[3] = new[3] & ~termios.ICANON & ~termios.ECHO

    # Return from read() as soon as one byte is available, with no timeout.
    new[6][termios.VMIN] = 1
    new[6][termios.VTIME] = 0

    termios.tcsetattr(fd, termios.TCSANOW, new)
    try:
        while 1:
            yield os.read(fd, 1)
    finally:
        # Always restore the original terminal settings.
        termios.tcsetattr(fd, termios.TCSAFLUSH, old)

Example:

>>> for key in read_keys():
...   print("KEY: %s" % (key))
... 
KEY: g
KEY: i
KEY: f
KEY: d
KEY: s
KEY: w
KEY: e

Inspired by this.