First, we need to set up Cassandra on your local machine and start doing some basic operations using its Python driver.
You can download open-source Apache Cassandra from here. There are a few steps to check that you have the right read/write permissions for the directories that Cassandra might use. However, it worked out of the box for me.
You can start Cassandra in local mode, as with most Apache projects, by running the command `bin/cassandra -f`, where the -f option starts the process in the foreground. Hundreds of lines of logs will then be printed to the screen, telling you how much memory has been allocated, where the configuration is being loaded from, etc.
Congratulations! You now have your first Cassandra “cluster” (a single node) running. Let’s start by using cqlsh (the Cassandra Query Language Shell) to do some basic database operations. Leave the previous terminal session running and run the command `bin/cqlsh` in a new tab. This is the place where you can start writing your favorite SQL-ish commands to interact with Cassandra.
First, we might wonder what the equivalent of SQL’s ‘show databases’ is, so that we can create a database to get started if there is no default one. In Cassandra, as in much of the NoSQL world (Mongo, HBase), the name “database” tends to be avoided; Cassandra uses something called a keyspace instead. The query language for Cassandra is very similar to the SQL you know from MySQL. As you can see from the screenshot below, I checked that no keyspace existed yet, created a new keyspace named mykeyspace, and then created a dummy table with three columns, firstname, lastname and ssn, where ssn is the primary key.
As you can see, this is basically SQL, and you can create tables, insert, update and delete as you have done in SQL. Here is another screenshot of some CRUD (create, read, update and delete) operations I did.
To learn more about how to map your SQL to CQL, refer to the DataStax documentation here.
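Since the screenshots show the statements only as images, here is a rough sketch of what such a session might look like, written as Python strings so they can be pasted into cqlsh or reused later with the driver. The table name `users`, the column types, the replication settings and the sample values are all my assumptions for illustration; only the keyspace name and the three column names come from the text above.

```python
# Hypothetical reconstruction of the cqlsh session. The table name "users",
# the column types and the sample values are assumptions, not taken from
# the screenshots.
SCHEMA_STATEMENTS = [
    "CREATE KEYSPACE mykeyspace WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}",
    "CREATE TABLE mykeyspace.users "
    "(ssn text PRIMARY KEY, firstname text, lastname text)",
]

CRUD_STATEMENTS = [
    # create
    "INSERT INTO mykeyspace.users (ssn, firstname, lastname) "
    "VALUES ('123-45-6789', 'John', 'Doe')",
    # read
    "SELECT * FROM mykeyspace.users WHERE ssn = '123-45-6789'",
    # update
    "UPDATE mykeyspace.users SET lastname = 'Smith' WHERE ssn = '123-45-6789'",
    # delete
    "DELETE FROM mykeyspace.users WHERE ssn = '123-45-6789'",
]


def as_cqlsh_script(statements):
    """Join statements into text that can be pasted into cqlsh."""
    return "\n".join(statement + ";" for statement in statements)
```

Calling `print(as_cqlsh_script(SCHEMA_STATEMENTS + CRUD_STATEMENTS))` prints the whole session, one statement per line.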
Now let’s write a Python application to generate 1 million records automatically and insert them into Cassandra.
I am using the Cassandra Python driver from DataStax; you can access the source code on GitHub. It is also published on PyPI, so you can install it with pip directly by running `sudo pip install cassandra-driver` (make sure you also have libev installed). You can refer to this tutorial from DataStax to get started quickly.
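As a minimal sketch of what connecting with the driver can look like, assuming the single local node from earlier is listening on 127.0.0.1 and that the dummy table is called `users` (a name I made up for illustration):

```python
def connect(hosts=("127.0.0.1",), keyspace="mykeyspace"):
    """Open a session against the local single-node cluster."""
    # Imported lazily so this file still loads where the driver is absent.
    from cassandra.cluster import Cluster

    cluster = Cluster(list(hosts))
    session = cluster.connect(keyspace)
    return cluster, session


def fetch_all(session, table="users"):
    """Return every row of the (hypothetical) table as a list."""
    return list(session.execute("SELECT * FROM " + table))


def main():
    cluster, session = connect()
    try:
        for row in fetch_all(session):
            print(row)
    finally:
        cluster.shutdown()
```

Running `main()` needs Cassandra to be up in the other terminal tab; `fetch_all` only needs an object with an `execute` method, which makes it easy to try out without a cluster.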
I was totally spoiled by the flexibility of Python, where you can enclose single quotes in double quotes and vice versa. I kept wondering what was wrong with my CQL command, and in the end all I needed was to enclose the inserted values in single quotes.
It took me about half an hour to insert 1 million rows into Cassandra. You can try it yourself using this Python script.
One interesting thing is that I clearly inserted 1 million records, yet count(*) returns only 10K records, which is just 1% of what is actually there…
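To make the quoting issue concrete: in CQL, string values go in single quotes (an embedded single quote is written twice), while double quotes are reserved for identifiers. The sketch below shows a small helper that builds such a literal by hand, plus the alternative of letting the driver fill in `%s` placeholders. The helper names and the `users` table are hypothetical and only serve as illustration.

```python
def cql_string_literal(value):
    """Render a Python string as a CQL string literal.

    CQL wants single quotes around the value; a single quote inside
    the value is escaped by doubling it.
    """
    return "'" + value.replace("'", "''") + "'"


def insert_statement(ssn, firstname, lastname, table="users"):
    """Build an INSERT with the values quoted by hand."""
    values = ", ".join(cql_string_literal(v) for v in (ssn, firstname, lastname))
    return "INSERT INTO %s (ssn, firstname, lastname) VALUES (%s)" % (table, values)


def insert_with_parameters(session, ssn, firstname, lastname):
    """Let the driver do the quoting through %s placeholders."""
    session.execute(
        "INSERT INTO users (ssn, firstname, lastname) VALUES (%s, %s, %s)",
        (ssn, firstname, lastname),
    )
```

With placeholders there is no need to think about which quote character to use, since the driver encodes each value itself.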
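For reference, here is a sketch of one way such a loader could be written. It is not the script linked above: it assumes a `users` table with text columns, invents random names, and uses a prepared statement together with the driver's `execute_concurrent_with_args` helper so that many inserts are in flight at once, which may cut the running time compared with inserting row by row.

```python
import itertools
import random
import string


def generate_records(n, seed=0):
    """Yield n (ssn, firstname, lastname) tuples with unique ssn values."""
    rng = random.Random(seed)

    def random_name(length=8):
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

    for i in range(n):
        # Zero-padded counter as a stand-in for an ssn (an assumption).
        yield ("%09d" % i, random_name(), random_name())


def chunks(iterable, size):
    """Split an iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_insert(session, total=1000000, chunk_size=10000, concurrency=100):
    """Insert `total` generated rows through an existing driver session."""
    # Imported lazily so the helpers above work without the driver.
    from cassandra.concurrent import execute_concurrent_with_args

    insert = session.prepare(
        "INSERT INTO users (ssn, firstname, lastname) VALUES (?, ?, ?)"
    )
    for batch in chunks(generate_records(total), chunk_size):
        execute_concurrent_with_args(session, insert, batch, concurrency=concurrency)
```

Feeding the rows in chunks keeps memory use flat, since only `chunk_size` records exist at any time; the chunk size and concurrency values here are guesses, not tuned numbers.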
It is confusing me so much.
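One possible explanation, which I have not confirmed for this setup: in older Cassandra releases, SELECT had a default LIMIT of 10,000 rows, and that default also applied to count(*). If that is what is happening here, asking for a larger limit explicitly should give the real number. A small sketch, again assuming the hypothetical `users` table:

```python
def build_count_query(table, limit):
    """Build a count(*) query with an explicit LIMIT above the row count."""
    return "SELECT count(*) FROM %s LIMIT %d" % (table, limit)


def count_rows(session, table="users", limit=2000000):
    """Run the count through an existing driver session."""
    rows = session.execute(build_count_query(table, limit))
    # count(*) comes back as a single row with a single column.
    return list(rows)[0][0]
```

The same statement, `SELECT count(*) FROM users LIMIT 2000000;`, can be typed into cqlsh directly.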