Very Easy, Pleasant, Secure, and Python-Accessible Distributed Storage With Tahoe LAFS

Tahoe is a file-level distributed filesystem, and it’s a joy to use. “LAFS” stands for “Least Authority Filesystem”. According to the homepage:

Even if some of the servers fail or are taken over by an attacker, the 
entire filesystem continues to function correctly, preserving your privacy 
and security.

Tahoe comes built-in with a beautiful UI, and can be accessed via it’s CLI (using a syntax similar to SCP), via REST (that’s right), or from Python using pyFilesystem (an abstraction layer that also works with SFTP, S3, FTP, and many others). Tahoe It gives you very direct control over how files are sharded/replicated. The shards are referred to as shares.

Tahoe requires an “introducer” node that announces nodes. You can easily do a one-node cluster by installing the node in the default ~/.tahoe directory, the introducer in another directory, and dropping the “share” configurables down to 1.

Installing

Just install the package:

$ sudo apt-get install tahoe-lafs

You might also be able to install directly using pip (this is what the Apt version does):

$ sudo pip install allmydata-tahoe

Configuring as Client

  1. Provisioned client:
    $ tahoe create-client
    
  2. Update ~/.tahoe/tahoe.cfg:
    # Identify the local node.
    nickname = 
    
    # This is the furl for the public TestGrid.
    introducer.furl = pb://hckqqn4vq5ggzuukfztpuu4wykwefa6d@publictestgrid.e271.net:50213,198.186.193.74:50213/introducer
    
  3. Start node:
    $ bin/tahoe start
    

Web Interface (WUI):

The UI is available at http://127.0.0.1:3456.

To change the UI to bind on all ports, update web.port:

web.port = tcp:3456:interface=0.0.0.0

CLI Interface (CLI):

To start manipulating files with tahoe, we need an alias. Aliases are similar to anonymous buckets. When you create an alias, you create a bucket. If you misplace the alias (or the directory URI that it represents), you’re up the creek. It’s standard-operating-procedure to copy the private/aliases file (in your main Tahoe directory) between the various nodes of your cluster.

  1. Create an alias (bucket):
    $ tahoe create-alias tahoe
    

    We use “tahoe” since that’s the conventional default.

  2. Manipulate it:

    $ tahoe ls tahoe:
    $
    

The tahoe command is similar to scp, in that you pass the standard file management calls and use the standard “colon” syntax to interact with the remote resource.

If you’d like to view this alias/directory/bucket in the WUI, run “tahoe list-aliases” to dump your aliases:

# tahoe list-aliases
  tahoe: URI:DIR2:xyzxyzxyzxyzxyzxyzxyzxyz:abcabcabcabcabcabcabcabcabcabcabc

Then, take the whole URI string (“URI:DIR2:xyzxyzxyzxyzxyzxyzxyzxyz:abcabcabcabcabcabcabcabcabcabcabc”), plug it into the input field beneath “OPEN TAHOE-URI:”, and click “View file or Directory”.

Configuring as Peer (Client and Server)

First, an introducer has to be created to announce the nodes.

Creating the Introducer

$ mkdir tahoe_introducer
$ cd tahoe_introducer/
~/tahoe_introducer$ tahoe create-introducer .

Introducer created in '/home/dustin/tahoe_introducer'

$ ls -l
total 8
-rw-rw-r-- 1 dustin dustin 520 Sep 16 13:35 tahoe.cfg
-rw-rw-r-- 1 dustin dustin 311 Sep 16 13:35 tahoe-introducer.tac

# This is a introducer-specific tahoe.cfg . Set the nickname.
~/tahoe_introducer$ vim tahoe.cfg 

~/tahoe_introducer$ tahoe start .
STARTING '/home/dustin/tahoe_introducer'

~/tahoe_introducer$ cat private/introducer.furl 
pb://wa3mb3l72aj52zveokz3slunvmbjeyjl@192.168.10.108:58294,192.168.24.170:58294,127.0.0.1:58294/5orxjlz6e5x3rtzptselaovfs3c5rx4f

Configuring Client/Server Peer

  1. Create the node:
    $ tahoe create-node
    
  2. Update configuration (~/.tahoe/tahoe.cfg).
    • Set nickname and introducer.furl to the furl of the introducer, just above.
    • Set the shares config. We’ll only have one node for this example, so needed represents the number of pieces required to rebuild a file, happy represents the number of pieces/nodes required to perform a write, and total represents the number of pieces that get created:
      shares.needed = 1
      shares.happy = 1
      shares.total = 1
      

      You may also wish to set the web.port item as we did in the client section, above.

  3. Start the node:

    $ tahoe start
    STARTING '/home/dustin/.tahoe'
    
  4. Test a file-operation:
    $ tahoe create-alias tahoe
    Alias 'tahoe' created
    
    $ tahoe ls
    
    $ tahoe cp /etc/fstab tahoe:
    Success: files copied
    
    $ tahoe ls
    fstab
    

Accessing From Python

  1. Install the Python package:
    $ sudo pip install fs
    
  2. List the files:
    import fs.contrib.tahoelafs
    
    dir_uri = 'URI:DIR2:um3z3xblctnajmaskpxeqvf3my:fevj3z54toroth5eeh4koh5axktuplca6gfqvht26lb2232szjoq'
    webapi_url = 'http://yourserver:3456'
    
    t = fs.contrib.tahoelafs.TahoeLAFS(dir_uri, webapi=webapi_url)
    files = t.listdir()
    

    This will render a list of strings (filenames). If you don’t provide webapi, the local system and default port are assumed.

Troubleshooting

If the logo in the upper-lefthand corner of the UI doesn’t load, try doing the following, making whatever path adjustments are necessary in your environment:

$ cd /usr/lib/python2.7/dist-packages/allmydata/web/static
$ sudo mkdir img && cd img
$ sudo wget https://raw.githubusercontent.com/tahoe-lafs/tahoe-lafs/master/src/allmydata/web/static/img/logo.png
$ tahoe restart

This is a bug, where the image isn’t being included in the Python package:

logo.png is not found in allmydata-tahoe as installed via easy_install and pip

If you’re trying to do a copy and you get an AssertionError, this likely is a known bug in 1.10.0:

# tahoe cp tahoe:fake_data .
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/runner.py", line 156, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/runner.py", line 141, in runner
    rc = cli.dispatch[command](so)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/cli.py", line 551, in cp
    rc = tahoe_cp.copy(options)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 770, in copy
    return Copier().do_copy(options)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 451, in do_copy
    status = self.try_copy()
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 512, in try_copy
    return self.copy_to_directory(sources, target)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 672, in copy_to_directory
    self.copy_files_to_target(self.targetmap[target], target)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 703, in copy_files_to_target
    self.copy_file_into(source, name, target)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 748, in copy_file_into
    target.put_file(name, f)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 156, in put_file
    precondition(isinstance(name, unicode), name)
  File "/usr/lib/python2.7/dist-packages/allmydata/util/assertutil.py", line 39, in precondition
    raise AssertionError, "".join(msgbuf)
AssertionError: precondition: 'fake_data' <type 'str'>

Try using a destination filename/filepath rather than just a dot.

See Inconsistent ‘tahoe cp’ behavior for more information.

Convert Syslog Events to a JSON Stream

The syslog-ng project serves as a general replacement for rsyslog (the default syslog daemon incarnation on Ubuntu and other distributions). It allows you to simply defining syslog sources, defining filters, defining destinations, and mapping them together. It also provides the ability to apply a pattern-tree to messages for classification (see “Processing message content with a pattern database” in the “Administrator Guide” PDF, below), as well as translating log messages into different formats.

It’s the latter that we’re concerned with. We can take Syslog output and use the JSON template-plugin to send JSON into a pipe, network destination, etc..

For this example, we’ll simply translate the system syslog events into JSON.

Installing/Configuring

  1. Install packages:
    $ sudo apt-get install syslog-ng-core
    $ sudo apt-get install syslog-ng-mod-json
    
  2. Modify /etc/syslog-ng/syslog-ng.conf:
    destination d_json {
       file("/var/log/messages.json" template("$(format-json --scope selected_macros --scope nv_pairs)\n"));
    };
    
    log {
       source(s_src); destination(d_json);
    };
    
  3. Restart the service:
    $ sudo service syslog-ng restart
    

Now, you’ll see a /var/log/messages.json. Mine shows the following:

{"TAGS":".source.s_src","SOURCEIP":"127.0.0.1","PROGRAM":"sudo","PRIORITY":"notice","MESSAGE":"  dustin : TTY=pts/23 ; PWD=/home/dustin ; USER=root ; COMMAND=/usr/sbin/service syslog-ng restart","LEGACY_MSGHDR":"sudo: ","HOST_FROM":"dustinhub","HOST":"dustinhub","FACILITY":"authpriv","DATE":"Sep 16 04:51:41"}
{"TAGS":".source.s_src","SOURCEIP":"127.0.0.1","PROGRAM":"sudo","PRIORITY":"info","MESSAGE":"pam_unix(sudo:session): session opened for user root by dustin(uid=0)","LEGACY_MSGHDR":"sudo: ","HOST_FROM":"dustinhub","HOST":"dustinhub","FACILITY":"authpriv","DATE":"Sep 16 04:51:41"}
{"TAGS":".source.s_src","SOURCEIP":"127.0.0.1","SOURCE":"s_src","PROGRAM":"syslog-ng","PRIORITY":"notice","PID":"15800","MESSAGE":"syslog-ng shutting down; version='3.5.3'","HOST_FROM":"dustinhub","HOST":"dustinhub","FACILITY":"syslog","DATE":"Sep 16 04:51:41"}
{"TAGS":".source.s_src","SOURCEIP":"127.0.0.1","SOURCE":"s_src","PROGRAM":"syslog-ng","PRIORITY":"notice","PID":"15889","MESSAGE":"syslog-ng starting up; version='3.5.3'","HOST_FROM":"dustinhub","HOST":"dustinhub","FACILITY":"syslog","DATE":"Sep 16 04:51:41"}
{"TAGS":".source.s_src","SOURCEIP":"127.0.0.1","SOURCE":"s_src","PROGRAM":"syslog-ng","PRIORITY":"notice","PID":"15889","MESSAGE":"EOF on control channel, closing connection;","HOST_FROM":"dustinhub","HOST":"dustinhub","FACILITY":"syslog","DATE":"Sep 16 04:51:41"}
{"TAGS":".source.s_src","SOURCEIP":"127.0.0.1","PROGRAM":"sudo","PRIORITY":"info","MESSAGE":"pam_unix(sudo:session): session closed for user root","LEGACY_MSGHDR":"sudo: ","HOST_FROM":"dustinhub","HOST":"dustinhub","FACILITY":"authpriv","DATE":"Sep 16 04:51:41"}

Conclusion

This all enables you to build your own filter in your favorite programming language by using a socket-server and a set of rules. You don’t have to be concerned with parsing the Syslog protocol or the semantics of file-parsing and message formats, and you can avoid the age-old paradigm of parsing log-files, after the fact, by chunks of time, and start to process them in real-time.

Documentation

The syslog-ng website has some “Administrator Guide” PDFs, but the site has very little other usefulness, and, though everyone loves syslog-ng, there is little more than configurations snippets in forum posts. However, those PDFs are thorough, and the configuration file is easy to understand (essentially different incarnations of the commands above).

Dynamically Compiling and Implementing a Function

Python allows you to dynamically compile things at the module-level. That’s why the compile() builtin keyword accepts sourcecode and dictionaries of locals and globals, and doesn’t provide a direct way to call a fragment of dynamic (read: textual) sourcecode with arguments (“xyz(arg1, arg2)”). You also can’t directly invoke compiled-code as a generator (where a function that uses “yield” is interpreted as a generator rather than just a function).

However, there’s a loophole, and it’s very elegant and consistent to Python. You simply have to wrap your code in a function definition, and then pull the function from the local scope. You can then call it as desired:

import hashlib
import random

def _compile(arg_names, code):
    name = "(lambda compile)"
    # Needs to start with a letter.
    id_ = 'a' + str(random.random())
    code = "def " + id_ + "(" + ', '.join(arg_names) + "):n" + 
           'n'.join(('  ' + line) for line in code.replace('r', '').split('n')) + 'n'

    c = compile(code, name, 'exec')
    locals_ = {}
    exec(c, globals(), locals_)
    return locals_[id_]

code = """
return a * b * c
"""

c = _compile(['a', 'b', 'c'], code)
print(c(1, 2, 3))

Serialize a Generator in Python

Simply inherit from the list class, and override the __iter__ method. A great example, from SO:

import json

def gen():
    yield 20
    yield 30
    yield 40

class StreamArray(list):
    def __iter__(self):
        return gen()

a = [1,2,3]
b = StreamArray()

print(json.dumps([1,a,b]))

SSL for Python (M2Crypto) on Windows

M2Crypto is the most versatile and popular SSL library for Python. Naturally, it takes a predictable amount of burden getting it to work under Windows.

If you’re lucky, you can find a precompiled binary online, and circumvent the heartache. Though many pages have come and gone, here is one that works, courtesy of the grr project: M2Crypto.

Not only do they provide a [non-trivial] set of instructions on how to build the binaries yourself, but they present binaries, as well. Though the binaries are hosted on Google Code (and unlikely to go away), I’ve hosted them, too, for brevity:

M2CryptoWindows

Note that these binaries, as given, are not installable Python packages. I have produced and published two such packages to PyPI, for your convenience:

M2CryptoWin32
M2CryptoWin64

SQLAlchemy and MySQL Encoding

I recently ran into an issue with the encoding of data coming back from MySQL through sqlalchemy. This is the first time that I’ve encountered such issues since this project first came online, months ago.

I am using utf8 encoding on my database, tables, and columns. I just added a new column, and suddenly my pages and/or AJAX calls started failing with one of the following two messages, respectively:

  • UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0x96 in position 5: ordinal not in range(128)
  • UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0x96 in position 5: invalid start byte

When I tell the stored procedure to return an empty string for the new column instead of its data, it works. The other text columns have an identical encoding.

It turns out that SQLAlchemy defaults to the latin1 encoding. If you need something different, than you’re in for a surprise. The official solution is to pass the “encoding” parameter to create_engine. This is the example from the documentation:

engine = create_engine("mysql://scott:tiger@hostname/dbname", encoding='latin1', echo=True)

In my case, I tried utf8. However, it still didn’t work. I don’t know if that ever works. It wasn’t until I uncovered a StackOverflow entry that I found the answer. I had to append “?charset=utf8” to the DSN string:

mysql+mysqldb://username:password@hostname:port/database_name?charset=utf8

The following are the potential explanations:

  • Since I copy and pasted values that were set into these columns, I accidentally introduced a character that was out of range.
  • The two encodings have an overlapping set of codes, and I finally introduced a character that was supported by one but not the other.

Whatever the case, it’s fixed and I’m a few hours older.

Using ctypes to Read Binary Data from a Double-Pointer

This is a sticky and exotic use-case of ctypes. In the example below, we make a call to some library function that treats ptr like a double-pointer, and sets ptr to point to a buffer and sets count with the number of bytes that are available there. The data at the pointer may have one or more NULL bytes that should not be interpreted as terminators.

from ctypes import *

ptr = ctypes.c_char_p()
count = ctypes.c_size_t()

r = library.some_call(
        ctypes.cast(ctypes.byref(ptr), 
                    ctypes.POINTER(ctypes.c_void_p)), 
        ctypes.byref(count))

if r != 0:
    raise ValueError("Library call failed.")

data = ctypes.string_at(ptr, count.value)