PyCassa vs Lazyboy (updated)

Update

As Hans points out in the comment below, it appears pycassa natively supports authentication with org.apache.cassandra.auth.SimpleAuthenticator. Lazyboy, on the other hand, doesn’t support it by default.

It isn’t too hard to add, though. Roughly, we could do something like this.

NB: Untested code!! I might create a patch for this when I get the time, so this is just an outline.

# Add this to lazyboy's connection package
from cassandra.ttypes import AuthenticationRequest

Then, in lazyboy’s _connect() function, add a parameter called logins: a dict that maps each keyspace to its credentials, in the following format.

# logins format
{'Keyspace1' : {'username':'myuser', 'password':'mypass'}}

def _connect(self, logins=None):
    """Connect to Cassandra if not connected, logging in if credentials are given."""

    client = self._get_server()
    if client.transport.isOpen() and self._recycle:
        if (client.connect_time + self._recycle) > time.time():
            return client
        else:
            client.transport.close()

    elif client.transport.isOpen():
        return client

    try:
        client.transport.open()
        # Login code
        # Remember that client is an instance of Cassandra.Client(protocol)
        if logins is not None:
            for keyspace, credentials in logins.items():
                request = AuthenticationRequest(credentials=credentials)
                # Log in once per keyspace, so this call belongs inside the loop
                client.login(keyspace, request)

        client.connect_time = time.time()
    except thrift.transport.TTransport.TTransportException as ex:
        client.transport.close()
        raise exc.ErrorThriftMessage(
            ex.message, self._servers[self._current_server])

    # Hand the freshly opened client back, as the early-return branches do
    return client
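
To try the login step on its own without a running Cassandra node, here is a minimal sketch of just that loop. Everything in it is a stand-in: AuthenticationRequest below is a tiny substitute for the real class in cassandra.ttypes, and FakeClient and login_all are hypothetical names that are not part of lazyboy. It only checks that each keyspace in the logins dict gets its own login() call. Like the outline above, it says nothing about how a real server handles several logins on one connection.

```python
class AuthenticationRequest(object):
    """Stand-in for cassandra.ttypes.AuthenticationRequest (assumption)."""

    def __init__(self, credentials=None):
        self.credentials = credentials


class FakeClient(object):
    """Hypothetical stub that records login() calls instead of using Thrift."""

    def __init__(self):
        self.logins = []

    def login(self, keyspace, auth_request):
        self.logins.append((keyspace, auth_request.credentials))


def login_all(client, logins):
    """Call client.login() once per keyspace; return the number of logins."""
    if not logins:
        return 0
    count = 0
    # Sorted only to make the call order predictable in this sketch
    for keyspace, credentials in sorted(logins.items()):
        client.login(keyspace, AuthenticationRequest(credentials=credentials))
        count += 1
    return count


logins = {
    'Keyspace1': {'username': 'myuser', 'password': 'mypass'},
    'Keyspace2': {'username': 'other', 'password': 'secret'},
}
client = FakeClient()
login_all(client, logins)
```

Pulling the loop into a helper like this would also keep _connect() short if the patch ever gets written.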

Original Post
I’ve been trying to work out which Python library is currently the more fully featured option for talking to Cassandra.

From Reddit,

API-wise, both look like they are pretty much basic wrappers around the Cassandra Thrift bindings. I’d prefer lazyboy over pycassa though, given that firstly, it’s being used in production right now at Digg, and because it looks like lazyboy’s connection code is more featured than pycassa.

and

The connection code (Lazyboy) seems to be much more suited for use in production (use of auto pooling, auto load balancing, integrated failover/retry, etc.) (than PyCassa)

Thanks to GitHub, I was able to do some analysis of their traffic and commits,

Traffic Data

[Graphs: GitHub traffic for LazyBoy and Pycassa]

Commit Data

[Graphs: GitHub commit activity for LazyBoy and Pycassa]

Judging by the traffic, more people know about LazyBoy, but commits to it are currently at a standstill. Pycassa, on the other hand, seems to be growing at a pretty fast rate.

It looks like LazyBoy is probably a better library to start with, for now. I’ll talk about my experiences with both in another post.

Moving from MySQL to Cassandra – Pros and Cons

Moving on from the question of which NoSQL database you should choose, after reading these excellent posts from Digg and Twitter, I recently asked a question on StackOverflow regarding the pros and cons of moving from MySQL to Cassandra.

The StackOverflow question is here: http://stackoverflow.com/questions/2332113/switching-from-mysql-to-cassandra-pros-cons

I got some excellent insight and feedback, primarily from Jonathan Ellis, one of the maintainers of Cassandra, and a systems architect at Rackspace.

He’s also written a post on the Rackspace blog today as a follow up on the question.

I wanted to highlight a great tip he passes along from Ian Eure of Digg (who is also the creator of a Python Cassandra library called LazyBoy), given at PyCon ’10,

Ian Eure from Digg (also switching to Cassandra) gave a great rule of thumb last week at PyCon: “if you’re deploying memcache on top of your database, you’re inventing your own ad-hoc, difficult to maintain NoSQL database,” and you should seriously consider using something explicitly designed for that instead.

He also mentions a couple of general caveats about using NoSQL rather than relational databases,

The price of scaling is that Cassandra provides poor support for ad-hoc queries, emphasizing denormalization instead. For analytics, the upcoming 0.6 release (in beta now) offers Hadoop map/reduce integration, but for high volume, low-latency queries you will still need to design your app around denormalization.
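
To make "design your app around denormalization" a bit more concrete, here is a rough sketch using plain Python dicts in place of Cassandra column families. The names (posts_by_author, posts_by_tag, save_post) and the sample data are made up for illustration and don't come from any library. The idea is simply that each post is written once per query you plan to run, so every read becomes a single row lookup.

```python
from collections import defaultdict

# Two hypothetical "column families", modeled as dicts: one row per query.
posts_by_author = defaultdict(dict)  # row key: author -> {post_id: title}
posts_by_tag = defaultdict(dict)     # row key: tag    -> {post_id: title}


def save_post(post_id, author, title, tags):
    """Write the post once per read path instead of joining at query time."""
    posts_by_author[author][post_id] = title
    for tag in tags:
        posts_by_tag[tag][post_id] = title


save_post(1, 'alice', 'PyCassa vs Lazyboy', ['cassandra', 'python'])
save_post(2, 'alice', 'MySQL to Cassandra', ['cassandra', 'mysql'])

# Each query is one row lookup; no join or ad-hoc WHERE clause is involved.
print(sorted(posts_by_tag['cassandra']))  # [1, 2]
print(sorted(posts_by_author['alice']))   # [1, 2]
```

The trade-off is that writes get more expensive and a new kind of query means adding (and backfilling) another row layout, which is the poor ad-hoc query support the quote describes.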

Looks like the Cassandra 0.6 beta is coming out tomorrow, and can already be built from repositories in case anyone’s interested in doing so (and telling me about their experiences!).