Simplified Protocol Buffers for Socket Communication

Protocol Buffers (“protobuf”) is a Google technology that lets you define messages declaratively and then generate library code for a myriad of programming languages. Serialization is efficient and effortless, and protobuf supports string fields (without predefining a length), repeated and optional values, and sub-messages.

The only tough part comes during implementation. As protobuf is only concerned with serialization/unserialization, it’s up to you to deal with the logistics of sending the message, and this means that, for socket communication, you often have to:

  1. Copy and paste the code to prepend a length.
  2. Copy/paste/adapt existing code that embeds a type-identifier on outgoing requests, and reads the type-identifier on incoming requests in order to automatically handle/route messages (if this is something that you want, which I often do). A sketch of this kind of framing follows.
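
For illustration, the manual framing usually ends up looking something like the sketch below. This is a hypothetical scheme (not protobufp’s actual wire format): a type-identifier and a length get packed in front of every serialized message.

import struct

# Hypothetical manual framing: 4-byte type-id + 4-byte length, then the payload.
def frame(msg, msg_types):
    type_id = msg_types.index(msg.__class__)
    payload = msg.SerializeToString()
    return struct.pack('!II', type_id, len(payload)) + payload

def unframe(buf, msg_types):
    (type_id, length) = struct.unpack('!II', buf[:8])
    msg = msg_types[type_id]()
    msg.ParseFromString(buf[8:8 + length])

    # Return the decoded message and whatever bytes are left over.
    return (msg, buf[8 + length:])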

This quickly becomes redundant and mundane, and it’s why we’re about to introduce protobufp (“Protocol Buffers Processor”).

We can’t improve on the explanation on the project page, so we’ll just walk through the example.

We’re going to build some messages, push them into a StringIO-based byte-stream (later, this could be whatever type of stream you wish), feed that into the protobufp “processor” object, and retrieve one fully-unserialized message at a time until the stream is depleted:

from random import randint
from StringIO import StringIO

from test_msg_pb2 import TestMsg

from protobufp.processor import Processor

def get_random_message():
    rand = lambda: randint(11111111, 99999999)

    t = TestMsg()
    t.left = rand()
    t.center = "abc"
    t.right = rand()

    return t

messages = [get_random_message() for i in xrange(5)]

Create an instance of the processor, and give it a list of valid message-types (the order of this list should never change, though you can append new types to the end):

msg_types = [TestMsg]
p = Processor(msg_types)

Use the processor to serialize each message and push them into the byte-stream:

s = StringIO()

for msg in messages:
    s.write(p.serializer.serialize(msg))

Feed the data from the byte stream into the processor (normally, this might be chunked-data from a socket):

p.push(s.getvalue())
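
In a real application, you’d normally push chunks into the processor as they arrive from the socket. A minimal sketch, assuming a connected socket named sock:

while True:
    data = sock.recv(4096)
    if not data:
        break

    # protobufp accumulates the data until whole messages can be decoded.
    p.push(data)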

Pop one decoded message at a time:

j = 0
while 1:
    in_msg = p.read_message()
    if in_msg is None:
        break

    assert messages[j].left == in_msg.left
    assert messages[j].center == in_msg.center
    assert messages[j].right == in_msg.right

    j += 1

Now there’s one less annoying task to distract you from your critical path.

Creating and Controlling OS Services from Python

An important part of deploying server software is not just installing and starting it, but making sure that the OS automatically starts and monitors it after future reboots. The most modern solution for this type of management is Upstart. You use Upstart every time you run “sudo service apache2 restart”, and the like. Upstart is sponsored by Ubuntu (more specifically, Canonical).

Upstart configs are located in /etc/init (we’re slowly, slowly approaching the point where we might one day be able to get rid of the System-V init scripts, in /etc/init.d). To create a job, you drop a “xyz.conf” file into /etc/init, and Upstart should automatically become aware of it via inotify. To query Upstart (including starting and stopping jobs), you emit a D-Bus message.
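
For reference, querying Upstart over D-Bus from Python might look something like this sketch (using the dbus package; the bus name, object path, interface, and method below are Upstart’s D-Bus API as I understand it, so treat the specifics as an assumption):

import dbus

# Assumption: Upstart's system-bus API (com.ubuntu.Upstart / Upstart0_6).
bus = dbus.SystemBus()
upstart = bus.get_object('com.ubuntu.Upstart', '/com/ubuntu/Upstart')
manager = dbus.Interface(upstart, 'com.ubuntu.Upstart0_6')

# Look a job up by name (returns the job's D-Bus object path).
job_path = manager.GetJobByName('my_daemon')
print(job_path)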

So, what about elegantly automating the creation of a job for your service from your Python deployment code? There is exactly one library for doing so, and it’s a Swiss Army knife for the task.

We’re going to use the Python upstart library to build a job and then write it (in fact, we’re just going to share one of their examples, for your convenience). The library also allows for listing the jobs on the system, getting statuses, and starting/stopping jobs, among other things, but we’ll leave it to you to experiment with this, when you’re ready.

Build a job that starts and stops on the normal run-levels, respawns when it terminates, and runs a single command (a non-forking process, otherwise we’d have to add the ‘expect’ stanza as well):

from upstart.job import JobBuilder

jb = JobBuilder()

# Build the job to start/stop with default runlevels to call a command.
jb.description('My test job.').\
   author('Dustin Oprea <dustin@nowhere.com>').\
   start_on_runlevel().\
   stop_on_runlevel().\
   respawn().\
   run('/usr/bin/my_daemon')

with open('/etc/init/my_daemon.conf', 'w') as f:
    f.write(str(jb))

Remember to run this as root (so that it can write into /etc/init). The generated job file looks like this:

description "My test job."
author "Dustin Oprea <dustin@nowhere.com>"
start on runlevel [2345]
stop on runlevel [016]
respawn 
exec /usr/bin/my_daemon

Parsing P12 Certificates from Python

When it comes to working with certificates in Python, no one package has all of the answers. Without considering more advanced schemes (ECC), most of the key and certificate functionality will be found in one of the following packages:

  • ssl (in the standard library)
  • M2Crypto
  • pyopenssl

In general, ssl can handle SSL sockets and HTTPS connections, M2Crypto can handle RSA/DSA keys and certificates, and pyopenssl can handle P12 certificates. There is some role overlap:

  • pyopenssl and M2Crypto both do X509 certificate deconstruction
  • ssl does PEM/DER conversions (sketched below)
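
For instance, the ssl module’s PEM/DER helpers (a quick sketch; 'cert.pem' is just a stand-in filename):

import ssl

with open('cert.pem') as f:
    cert_pem = f.read()

# Convert a PEM-encoded certificate to DER, and back again.
cert_der = ssl.PEM_cert_to_DER_cert(cert_pem)
cert_pem_again = ssl.DER_cert_to_PEM_cert(cert_der)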

Since the reason for this post is the obscurity of reading P12 certificates in Python, here’s an example of doing so:

from OpenSSL.crypto import load_pkcs12, dump_privatekey, dump_certificate, \
                           FILETYPE_PEM, FILETYPE_ASN1

with open('cert.p12', 'rb') as f:
    c = f.read()

p = load_pkcs12(c, 'passphrase')

certificate = p.get_certificate()
private_key = p.get_privatekey()

# Where type_ is FILETYPE_PEM or FILETYPE_ASN1 (for DER).
type_ = FILETYPE_PEM

print(dump_privatekey(type_, private_key))
print(dump_certificate(type_, certificate))

# Get the certificate's subject fields (as a list of 2-tuples).
fields = certificate.get_subject().get_components()
print(fields)

Using ssl.wrap_socket for Secure Sockets in Python

Ordinarily, the prospect of having to deal with SSL-encrypted sockets would be enough to make the best of us take a long weekend. However, Python provides some prepackaged functionality to accommodate this. It’s called “wrap_socket”. The only reason that I ever knew about this was from reverse engineering, as I’ve never come across it in a blog or article.

Here’s an example. Note that I steal the CA bundle from requests, for the purpose of this example. Use whichever bundle you happen to have available (they should all be relatively similar, but will generally be located in different places on your system, depending on your OS/distribution).

import ssl
import socket

s_ = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = ssl.wrap_socket(s_, 
                    ca_certs='/usr/local/lib/python2.7/dist-packages/requests/cacert.pem', 
                    cert_reqs=ssl.CERT_REQUIRED)

s.connect(('www.google.com', 443))

# s.cipher() - Returns a tuple: ('RC4-SHA', 'TLSv1/SSLv3', 128)
# s.getpeercert() - Returns a dictionary: 
#
#   {'notAfter': 'May 15 00:00:00 2014 GMT',
#    'subject': ((('countryName', u'US'),),
#                (('stateOrProvinceName', u'California'),),
#                (('localityName', u'Mountain View'),),
#                (('organizationName', u'Google Inc'),),
#                (('commonName', u'www.google.com'),)),
#    'subjectAltName': (('DNS', 'www.google.com'),)}

s.write("""GET / HTTP/1.1\r
Host: www.google.com\r\n\r\n""")

# Read the first part (might require multiple reads depending on size and 
# encoding).
d = s.read()
s.close()

Obviously, your data sits in d after this code runs.
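
One caveat: cert_reqs=ssl.CERT_REQUIRED validates the certificate chain against the CA bundle, but it does not check that the certificate matches the hostname. If your Python version has ssl.match_hostname (added in 3.2, and backported to later 2.7 releases), you can do that yourself. A minimal sketch:

# Immediately after s.connect(), before writing the request:
cert = s.getpeercert()

# Raises ssl.CertificateError if the certificate isn't for this hostname.
ssl.match_hostname(cert, 'www.google.com')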

Using etcd as a Clusterized, Immediately-Consistent Key-Value Storage

The etcd project was one of the first popular, public platforms built on the Raft algorithm (a relatively simple consensus algorithm, used to allow several nodes to remain in sync). Raft represents a shift away from its predecessor, Paxos, which is considerably more difficult to understand, and usually requires shortcuts to implement. As an added bonus, etcd is also implemented in Go.

etcd looks and smells like every other KV store, with three especially-notable differences:

  • You can maintain a hierarchy of keys.
  • You can long-poll for changes on keys.
  • Distributed locks are built in.

We’re going to use Python’s etcd package (project is here). This package presents a very intuitive interface that completely manages responses from the server, and is built in such a way that future API changes should be backward-compatible (within reason). These things are important, as other clients have historically allowed the application too much direct access to the actual server requests, and left too much of the interpretation of the responses to the application as well.

To connect the client (assuming the same machine with the default port):

from etcd import Client

c = Client()

To set a value:

c.node.set('/test/key', 5)

To get a value:

r = c.node.get('/test/key')
print(r.node.value)

Which outputs:

5

To wait on a value to change, run this from another terminal:

r = c.node.wait('/test/key')

Try setting the node to something else, using a call like the one before. The wait call will return the same response that the client instance that actually made the change received.
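
Since each wait() call returns after a single change, a long-poll loop is the usual pattern. A minimal sketch using only the calls shown above:

# Keep long-polling the key, and print each new value as it arrives.
while True:
    r = c.node.wait('/test/key')
    print(r.node.value)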

To work with distributed locks, just wrap the code that needs to be synchronized in a with statement:

with c.module.lock.get_lock('test_lock_1', ttl=10):
    print("In lock 1.")

It’s worth mentioning that the response objects have a consistent and informative interface, no matter what the operation. You can see a number of properties just by printing one. This is from the set operation above:

<RESPONSE: <NODE(ResponseV2AliveNode) [set] [/test/key] IS_HID=[False] IS_DEL=[False] IS_DIR=[False] IS_COLL=[False] TTL=[None] CI=(2) MI=(2)>>

This is from the get operation:

<RESPONSE: <NODE(ResponseV2AliveNode) [get] [/test/key] IS_HID=[False] IS_DEL=[False] IS_DIR=[False] IS_COLL=[False] TTL=[None] CI=(2) MI=(2)>>

I’ll omit the examples of working with hierarchical keys because the functionality is every bit as intuitive as it should be.

There’s a lot of functionality in the Python etcd package, but it’s built to be lightweight and obvious. The GitHub page is extremely thorough, and the API is also completely documented at ReadTheDocs.

Use TightOCR for Easy OCR from Python

When it comes to recognizing documents from images in Python, there are precious few options, and a couple of good reasons why.

Tesseract is the world’s best OCR solution, and is currently maintained by Google. Unlike other solutions, it comes prepackaged with knowledge for a bunch of languages, so the machine-learning aspects of OCR don’t necessarily have to be a concern of yours, unless you want to recognize for an unknown language, font, potential set of distortions, etc…

However, Tesseract comes as a C++ library, which basically takes it out of the running for use with Python’s ctypes. This isn’t a fault of ctypes, but rather of the lack of standardization in symbol-naming among C++ compilers (there’s no reliable way, from Python, to determine the mangled name of a symbol in the library).

There is an existing Python solution, which comes in the form of a very heavy Python wrapper called python-tesseract, which is built on SWIG. It also requires a couple of extra libraries, like OpenCV and numpy, even if you don’t seem to be using them.

Even if you decide to go the python-tesseract route, you will only have the ability to return the complete document as text, as their support for iteration through the parts of the document is broken (see the bug).

So, with all of that said, we accomplished lightweight access to Tesseract from Python by first building CTesseract (which produces a C wrapper for Tesseract; see here), and then writing TightOCR (for Python) around that.

This is the result:

from tightocr.adapters.api_adapter import TessApi
from tightocr.adapters.lept_adapter import pix_read
from tightocr.constants import RIL_PARA

t = TessApi(None, 'eng')
p = pix_read('receipt.png')
t.set_image_pix(p)
t.recognize()

if t.mean_text_confidence() < 85:
    raise Exception("Too much error.")

for block in t.iterate(RIL_PARA):
    print(block)

Of course, you can still recognize the document in one pass, too:

from tightocr.adapters.api_adapter import TessApi
from tightocr.adapters.lept_adapter import pix_read
from tightocr.constants import RIL_PARA

t = TessApi(None, 'eng')
p = pix_read('receipt.png')
t.set_image_pix(p)
t.recognize()

if t.mean_text_confidence() < 85:
    raise Exception("Too much error.")

print(t.get_utf8_text())

With the exception of renaming “mean_text_conf” to “mean_text_confidence”, the library keeps most of the names from the original Tesseract API. So, if you’re comfortable with that, you should have no problem with this (if you even have to do more than the above).

I should mention that the original Tesseract library, though a universal and popular OCR solution, is dismally documented. Therefore, there are many functions that I’ve left scaffolding for in the project, without being entirely sure how to use or test them, or having any need for them myself. So, I could use help in that area. Just submit issues or pull-requests if you want to contribute.

Using Bitly’s NSQ Job Queue

I’ve recently been impressed by Bitly’s NSQ server, written in Go. Aside from Go capturing my attention, the parts that most interested me were 1) the claim that it achieves 90,000 messages/second (which is decent), and 2) that it’s relatively easy to set up and is self-managing.

The topology for NSQ is straightforward: N queue servers (nsqd), 0+ lookup servers (nsqlookupd), and an optional admin (dashboard) server (nsqadmin). The lookup servers are optional, but they allow auto-discovery of which hosts are managing which topics. Bitly recommends that a cluster of three be used in production. To start multiple instances, just launch them. You’ll have to pass a list of nsqlookupd hosts to the consumer client, and a list of nsqd hosts to the producer client.

The message pipeline is intuitive: messages are pushed along with topics/classifiers, and consumers listen for topics and channels. A channel is a named grouping of consumers that work on similar tasks, where the “channel” is presented as a string to the consumer instance. NSQ uses the concepts of topics and channels to drive multicast and distributed delivery.
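
To make the topic/channel distinction concrete, here’s a sketch of two consumers on different channels of the same (hypothetical) “events” topic, reusing the Reader interface from the consumer example further down. Every message published to the topic is delivered to each channel; consumers that share a channel split that channel’s messages between them:

import nsq

def archive(message):
    print 'archiver got: %s' % (message.body,)
    return True  # Returning True marks the message as processed.

def index(message):
    print 'indexer got: %s' % (message.body,)
    return True

# Two channels on one topic: each channel receives every message.
nsq.Reader(message_handler=archive,
           lookupd_http_addresses=['http://127.0.0.1:4161'],
           topic='events', channel='archiver', max_in_flight=9)

nsq.Reader(message_handler=index,
           lookupd_http_addresses=['http://127.0.0.1:4161'],
           topic='events', channel='indexer', max_in_flight=9)

nsq.run()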

As far as optimization goes, there are about three dozen parameters for nsqd, but you need not concern yourself with most of them, here.

This example resembles the one from the NSQ website, plus some additional info. All four processes can be run from the same system.

Quick Start

Get and build the primary components. $GOPATH needs to either be set to your Go workspace (mine is ~/.go, below), or an empty directory that will be used for it. $GOPATH/bin needs to be in the path.

go get github.com/kr/godep

godep get github.com/bitly/nsq/nsqd
godep get github.com/bitly/nsq/nsqlookupd
godep get github.com/bitly/nsq/nsqadmin

To start, run each of the following services in a different terminal on the same system.

A lookup server instance:

nsqlookupd

A queue instance:

nsqd --lookupd-tcp-address=127.0.0.1:4160

An admin server instance:

nsqadmin --template-dir=~/.go/src/github.com/bitly/nsq/nsqadmin/templates --lookupd-http-address=127.0.0.1:4161

To push test-items:

curl -d 'hello world 1' 'http://127.0.0.1:4151/put?topic=test'
curl -d 'hello world 2' 'http://127.0.0.1:4151/put?topic=test'
curl -d 'hello world 3' 'http://127.0.0.1:4151/put?topic=test'

The “apps” apparently aren’t built by default. We’ll need them so that we can get a message-dumper, for testing:

cd ~/.go/src/github.com/bitly/nsq
make
cd build/apps

To dump data that’s already waiting in the queues:

./nsq_to_file --topic=test --output-dir=/tmp --lookupd-http-address=127.0.0.1:4161

Display queue data:

cat /tmp/test.*.log
hello world 1
hello world 2
hello world 3

Python Library

Matt Reiferson wrote pynsq, which is a Python client that employs Tornado for its message-loops. The gotcha is that the consumers -and- the producers both require you to use IOLoop, Tornado’s message-loop. This is because pynsq not only allows you to define a “receive” callback, but a post-send callback as well. Though you don’t have to define one, there is an obscure, but real, chance that a send will fail, per Matt, and this should always be checked for.

Because of this design, you should be prepared to put all of your core loop logic into the Tornado loop.

To install the client:

sudo pip install pynsq tornado

A producer example from the “pynsq” website:

import nsq
import tornado.ioloop
import time

def pub_message():
    writer.pub('test', time.strftime('%H:%M:%S'), finish_pub)

def finish_pub(conn, data):
    print data

writer = nsq.Writer(['127.0.0.1:4150'])
tornado.ioloop.PeriodicCallback(pub_message, 1000).start()
nsq.run()

An asynchronous consumer example from the “pynsq” website (doesn’t correspond to the producer example):

import nsq

buf = []

def process_message(message):
    global buf
    message.enable_async()
    # cache the message for later processing
    buf.append(message)
    if len(buf) >= 3:
        for msg in buf:
            print msg
            msg.finish()
        buf = []
    else:
        print 'deferring processing'

r = nsq.Reader(message_handler=process_message,
        lookupd_http_addresses=['http://127.0.0.1:4161'],
        topic='nsq_reader', channel='async', max_in_flight=9)
nsq.run()

Give it a try.

FAQ

(Courtesy of a dialogue with Matt Reiferson)

Q: Most job-queues allow you send messages without imposing a loop. Is the 
   IOLoop required for both receiving -and- sending in pynsq?
A: Yes. pynsq supports the notion of completion-callbacks to signal when a send 
   finishes. Even if you don't use it, it's accounted-for in the mechanics. If 
   you want to send synchronous messages without the loop, hit the HTTP 
   endpoint (see the snippet after this answer). However, facilitating both the 
   receive and send IOLoops allows for the fastest possible dialogue, especially 
   when the writers and readers are paired to the same hosts.
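
For reference, a synchronous HTTP publish (against the same nsqd /put endpoint as the curl examples above), sketched with the requests package:

import requests

# Publish one message to the "test" topic over nsqd's HTTP interface.
response = requests.post('http://127.0.0.1:4151/put?topic=test',
                         data='hello world 4')
print response.status_code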

Q: An IOLoop is even required for asynchronous sends?
A: Yes. If you want to simply send one-off asynchronous messages, 
   consider opening a worker process that manages delivery. It can apply its 
   own callback to catch failures, and transmit successes, failures, etc.. to 
   an IPC queue (if you need this info).

Q: Are there any delivery guarantees (like in ZeroMQ)?
A: No. It's considered good-practice by the NSQ guys to always check the 
   results of message-sends in any situation (in any kind of messaging, in 
   general). You'd do this from the callbacks, with pynsq.

    The reasons that a send would fail are the following:

    1: The topic name is not formatted correctly (to character/length 
       restrictions). There is no official documentation of this, however.
    2: The message is too large (this can be set via a parameter to nsqd).
    3: There is a breakdown related to a race-condition with a publish and a 
       delete happening on a specific topic. This is rare.
    4: Client connection-related failures.

Q: In scenario (3) of the potential reasons for a send-failure, can I mitigate 
   the publish/delete phenomena if I am either not deleting topics or have 
   orchestrated deletions such that writes eliciting topic creations will never 
   be done until a sufficient amount of time has elapsed since a deletion?
A: Largely. Though, if nowhere else, this can also happen internally to NSQ at 
   shutdown.

Q: How are new topics announced to the cluster?
A: The first writer or reader request for a topic will be applied on the 
   upstream nsqd host, and will then propagate to the nsqlookupd hosts. They will 
   eventually spread to the other readers from there. The same thing applies to 
   a new topic, as well as a previously-deleted one.

Manager Namespaces for IPC Between Python Process Pools

Arguably, one of the features that best shows why Python has so many multidisciplinary uses is its multiprocessing library. It allows Python to maintain pools of processes and to communicate between those processes, with most of the simplicity of a standard multithreaded application (asynchronously invoking a function, locking, and IPC). This is not to say that Python can’t do threads, too, but between how quickly you can run map/reduce operations or asynchronous tasks using a very simple set of functions, and the disadvantage of having to consider the GIL when doing multithreaded development, I believe the multiprocessing design to be more popular by a landslide.

There are mountains of examples for how to use multiprocessing, along with sufficient documentation for most of the IPC mechanisms that can be used to communicate between processes: queues, pipes, “manager”-based shares and proxy objects, shared ctypes types, multiprocessing-based “client” and “listener” sockets, etc..

There is a very subtle IPC mechanism called a “namespace” (which is actually part of Manager), and it gets only a couple of lines out of the thousands in the documentation. It’s easy, and worth special mention.

from multiprocessing import Pool, Manager
from os import getpid
from time import sleep

def _worker(ns):
    pid = getpid()
    print("%d: Worker started." % (pid))

    while ns.is_running is True:
        sleep(1)

    print("%d: Worker terminating." % (pid))

m = Manager()
ns = m.Namespace()
ns.is_running = True

num_workers = 5
p = Pool(num_workers)

for i in xrange(num_workers):
    p.apply_async(_worker, (ns,))

sleep(10)
print("Shutting down.")

ns.is_running = False
p.close()
p.join()

print("All workers joined.")

The output:

52893: Worker started.
52894: Worker started.
52895: Worker started.
52896: Worker started.
52897: Worker started.
Shutting down.
52894: Worker terminating.
52893: Worker terminating.
52895: Worker terminating.
52896: Worker terminating.
52897: Worker terminating.
All workers joined.

A namespace is very much like a bulletin board, where attributes can be assigned by one process and read by others. This works for immutable values like strings and primitive types. For mutable objects, though, in-place updates can’t be tracked properly, because the namespace only sees attribute assignments:

from multiprocessing import Manager

m = Manager()
ns = m.Namespace()

ns.test_value = 'original value'
ns.test_list = [5]

print("test_value (master, original): %s" % (ns.test_value))
print("test_list (master, original): %s" % (ns.test_list))

ns.test_value = 'new value'
ns.test_list.append(10)

print("test_value (master, updated): %s" % (ns.test_value))
print("test_list (master, updated): %s" % (ns.test_list))

Output:

test_value (master, original): original value
test_list (master, original): [5]
test_value (master, updated): new value
test_list (master, updated): [5]

Though they don’t work for mutable types, namespaces are a terrific mechanism for sharing counters and flags between processes.
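
If you do need to share a mutable collection, one workaround (a sketch, not the only option) is to reassign the attribute wholesale, or to use one of the Manager's dedicated proxy types:

from multiprocessing import Manager

m = Manager()
ns = m.Namespace()

# Reassigning the whole attribute is tracked, because the proxy sees the
# assignment itself.
ns.test_list = [5]
ns.test_list = ns.test_list + [10]
print(ns.test_list)  # [5, 10]

# Alternatively, a managed list proxy does track in-place mutations.
shared_list = m.list([5])
shared_list.append(10)
print(list(shared_list))  # [5, 10]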

The Highway Guide to Writing gedit Plugins

This is going to be a very quick run-through of writing a “gedit” plugin using Python. This method allows you to rapidly produce plugins that require a minimum of code. Depending on what you want to do, only a few lines might be required. Look at a simple plugin such as FocusAutoSave, for an example.

Overview

A gedit plugin is composed of extensions, where each extension represents functionality that you’re adding at the application-level, window-level, or view-level (where “view” often refers to a particular document).

A plugin has access to the full might of PyGTK. Regardless of which type of extension(s) you need to implement, each base class requires that a do_activate() and do_deactivate() method be implemented. It is from these methods that you either a) configure signals to be handled, or b) schedule a timer to invoke a callback.

To make gedit see your plugin, you have to store two files in ~/.local/share/gedit/plugins: abc.py and abc.plugin. The latter is an INI-type file that tells gedit about your plugin, and how to import it. Note that the plugins/ directory must have an __init__.py file in it (as all Python package directories must). Though the “Module” value in the plugin-info file must agree with the name of your Python module, the actual classes within it can have arbitrary names (they automatically wire themselves into GTK). To make an installer, just use “invoke”, make, etc..

Example

Plugin file:

[Plugin]
Loader=python
Module=dustin
IAge=3
Name=Dustin's Plugin
Description=A Python plugin example
Authors=Dustin Oprea 
Copyright=Copyright © 2013 Dustin Oprea 
Website=http://www.myplugin.com

Module file:

The practical purpose of this code is questionable. It’s really just provided as an example of a few different things. Note that the print() output will be displayed in the console, which means that you should start gedit from the console if you wish to see it.

from gi.repository import GObject, Gedit, Gio
 
SETTINGS_KEY = "org.gnome.gedit.preferences.editor"
gedit_settings = Gio.Settings(SETTINGS_KEY)
 
class DustinPluginWindowExtension(GObject.Object, Gedit.WindowActivatable):
    "Our extension to the window's behavior."
 
    __gtype_name__ = "DustinPluginWindowExtension"
    window = GObject.property(type=Gedit.Window)
 
    def __init__(self):
        GObject.Object.__init__(self)
  
    def do_activate(self):
        "Called when the when is loaded."
 
        # To get a list of all unsaved documents:
        # self.window.get_unsaved_documents()
 
        # To list all available config options configured in the "preferences" window:
        # gedit_settings.keys()
 
        # The get_boolean() call seems like it can be used to either get a boolean 
        # value, or determine if a configurable is even present (for backwards-
        # compatibility).
        if gedit_settings.get_boolean("auto-save") is True:
            print(gedit_settings.get_uint("auto-save-interval"))
 
        # Schedule a callback to trigger in five seconds.
        self.timer_id = GObject.timeout_add_seconds(5, self.__window_callback)
 
    def __window_callback(self):
 
        print("Trigger.")
         
        # Return True to automatically schedule again.
        return True
 
    def do_deactivate(self):
        "We're being unloaded. Clean-up."
 
        # Clean-up our timer.
        GObject.source_remove(self.timer_id)
        self.timer_id = None
 
class DustinPluginViewExtension(GObject.Object, Gedit.ViewActivatable):
    "Our extension to the document's behavior."
 
    __gtype_name__ = "DustinPluginViewExtension"
    view = GObject.property(type=Gedit.View)
 
    def __init__(self):
        GObject.Object.__init__(self)
 
    def do_activate(self):
        # Get the document.
        self.__doc = self.view.get_buffer()
 
        # To get the name of the document as shown in the tab:
        # self.__doc.get_short_name_for_display()
 
        # To insert something at the current cursor position.
        self.__doc.insert_at_cursor("Hello World.\n")
 
        # Get the text of the document. This works using start/stop iterators 
        # (pointers to the left and right sides of the content to grab).
        # text = self.__doc.get_text(self.__doc.get_start_iter(), 
        #                            self.__doc.get_end_iter(), True)
 
        # Wire a handler to the "saved" signal.
        self.__sig_saved = self.__doc.connect("saved", self.__on_saved)
 
    def do_deactivate(self):
        self.__doc.disconnect(self.__sig_saved)
        del self.__sig_saved
 
    def __on_saved(self, widget, *args, **kwargs):
        print("Saved.")

To enable debug logging, just set the log-level at the top of your Python module. Logging should be printed out to the console.
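
For example, with the standard logging module (a minimal sketch):

import logging

# Debug-level messages will show up on the console that gedit was started from.
logging.basicConfig(level=logging.DEBUG)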

When I was first looking at writing a gedit plugin, I had no direction for 1) how to retrieve the text, 2) how to properly schedule timers (which is a general GTK task), and 3) how to get values from gedit’s configuration. Hopefully this helps.

Additional resources that might be of some help:

Writing Plugins for gedit 3 with Python

Python Plugin How To for gedit 3

gedit Reference Manual (great reference for signals)

A Practitioner’s Overview to SSL, and Viewing the Certificate Chain from Python

The fundamental principle of SSL is this: a client connects to an SSL-enabled server, and the server returns enough information to a) encrypt the communication channel, and b) authenticate itself well enough that you can prove that it’s the intended system. (b) is performed by the server providing both its certificate information as well as the CA (certificate authority) that produced it. This latter item is the area in which you might occasionally encounter problems, when a client complains about not being able to verify a hostname. This is where some people recommend just passing a flag to skip the check, thus completely compromising the integrity of SSL.

Depending on how reputable your CA is, they might provide additional CA authorities (referred to as IAs, or “intermediate authorities”), such that these authorities form a “certificate chain” that sufficiently proves that all of the authorities that lend their credibility to your certificate can be traced back to one very well-known authority (the “root CA”) that any client (browser, tool, or OS) would know about.

In special or proprietary situations, you might have to physically go into the configuration for your browser/tool/OS, and add a new root CA that the client did not previously know about. Otherwise, the client might forbid access to your website on that system. Unless you’re dealing with some heightened-security situation regarding the intranet at your place of business, this is rarely necessary.

Sometimes, it’s necessary to physically inspect what CAs are being reported by the server, for as simple a reason as just verifying that you’ve configured it correctly. Until recently, and even, arguably, at the present time, Python has been unable to provide this information, as it comes prepackaged with a natively-compiled SSL module and the underlying mechanics simply don’t expose these calls. If you want this information, you’d be forced to just invoke OpenSSL’s “s_client” subcommand on the command-line.
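
For reference, the command-line approach looks like this:

openssl s_client -connect www.google.com:443 -showcerts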

Just recently, a patch was released that exposes this functionality. As a warning: since this hasn’t yet been merged into the source tree, its implementation might change by the time it is.

This is some cut-up and reassembled code, to show it in action:

import socket

from ssl import SSLError, CERT_REQUIRED, PROTOCOL_SSLv23
from ssl import SSLContext  # Modern SSL?
from ssl import HAS_SNI  # Has SNI?

from pprint import pprint

def ssl_wrap_socket(sock, keyfile=None, certfile=None, cert_reqs=None,
                    ca_certs=None, server_hostname=None,
                    ssl_version=None):
    context = SSLContext(ssl_version)
    context.verify_mode = cert_reqs

    if ca_certs:
        try:
            context.load_verify_locations(ca_certs)
        # Py32 raises IOError
        # Py33 raises FileNotFoundError
        except Exception as e:  # Reraise as SSLError
            raise SSLError(e)

    if certfile:
        # FIXME: This block needs a test.
        context.load_cert_chain(certfile, keyfile)

    if HAS_SNI:  # Platform-specific: OpenSSL with enabled SNI
        return (context, context.wrap_socket(sock, server_hostname=server_hostname))

    return (context, context.wrap_socket(sock))

hostname = 'www.google.com'

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((hostname, 443))

(context, ssl_socket) = ssl_wrap_socket(s,
                                        ssl_version=PROTOCOL_SSLv23,
                                        cert_reqs=CERT_REQUIRED,
                                        ca_certs='/usr/local/lib/python3.3/dist-packages/requests/cacert.pem',
                                        server_hostname=hostname)

pprint(ssl_socket.getpeercertchain())

s.close()

The output is a tuple of dictionaries:

({'issuer': ((('countryName', 'US'),),
             (('organizationName', 'Google Inc'),),
             (('commonName', 'Google Internet Authority G2'),)),
  'notAfter': 'Sep 11 11:04:38 2014 GMT',
  'notBefore': 'Sep 11 11:04:38 2013 GMT',
  'serialNumber': '50C71E48BCC50676',
  'subject': ((('countryName', 'US'),),
              (('stateOrProvinceName', 'California'),),
              (('localityName', 'Mountain View'),),
              (('organizationName', 'Google Inc'),),
              (('commonName', 'www.google.com'),)),
  'subjectAltName': (('DNS', 'www.google.com'),),
  'version': 3},
 {'issuer': ((('countryName', 'US'),),
             (('organizationName', 'GeoTrust Inc.'),),
             (('commonName', 'GeoTrust Global CA'),)),
  'notAfter': 'Apr  4 15:15:55 2015 GMT',
  'notBefore': 'Apr  5 15:15:55 2013 GMT',
  'serialNumber': '023A69',
  'subject': ((('countryName', 'US'),),
              (('organizationName', 'Google Inc'),),
              (('commonName', 'Google Internet Authority G2'),)),
  'version': 3},
 {'issuer': ((('countryName', 'US'),),
             (('organizationName', 'Equifax'),),
             (('organizationalUnitName',
               'Equifax Secure Certificate Authority'),)),
  'notAfter': 'Aug 21 04:00:00 2018 GMT',
  'notBefore': 'May 21 04:00:00 2002 GMT',
  'serialNumber': '12BBE6',
  'subject': ((('countryName', 'US'),),
              (('organizationName', 'GeoTrust Inc.'),),
              (('commonName', 'GeoTrust Global CA'),)),
  'version': 3},
 {'issuer': ((('countryName', 'US'),),
             (('organizationName', 'Equifax'),),
             (('organizationalUnitName',
               'Equifax Secure Certificate Authority'),)),
  'notAfter': 'Aug 22 16:41:51 2018 GMT',
  'notBefore': 'Aug 22 16:41:51 1998 GMT',
  'serialNumber': '35DEF4CF',
  'subject': ((('countryName', 'US'),),
              (('organizationName', 'Equifax'),),
              (('organizationalUnitName',
                'Equifax Secure Certificate Authority'),)),
  'version': 3})

The topmost item is the most specific, and describes the certificate for the domain itself, whereas the bottommost one is the least specific, and describes the highest, most well-known authority involved in the operation (in this case, Equifax).