Cassandra – Beyond SQL

I just wrote a tutorial on how to set up Cassandra locally and do some basic CRUD operation to gain the confidence that Cassandra can do some basic staff we can do in MYSQL, but why need Cassandra?

Here are a few pros and cons I heard might keep in mind when comparing with other big data technologies.

Pros:

1. high scalability (performance and capacity increase linearly as the number of the size of the cluster)

2. high availability (you can sync data centers across different regions and there is no master node)

3. “stupidly fast” write and “NoSQL-ly fast” read. (The hashring method hashes records to nodes using a hashfunctions)

Cons:

1. cannot do JOIN (you technically can do two rounds of select instead of join but there no cmd called join in C*)

2. not designed for analytics (you will be surprised by the lack of analytic functions offered by C*, you can use Spark and Tableau on top of Cassandra and it has been proved to be helpful.)

Now, let’s take a look at how Cassandra stores the data.

Cassandra – Setup, CQLSH and Python

First, we need to set up Cassandra in your local machine and start doing some basic operations using its Python driver.

You can download open-source Apache Cassandra from here. There are a few steps to check that you have the right read/write permission for the directories that Cassandra might use. However, it looks like it worked out of box for me.

You can start Cassandra running in local mode like most Apache project does by running command

bin/cassandra -f

where option -f will start the project in foreground. Then there will be hundreds of lines of logs printed to the screen, giving the user some information about how much memory it has been allocated, where it is loading the configuration from, etc.

Now congratulations! You have your first Cassandra “cluster” (single node) running. Now lets start by using the cqlsh(Cassandra Query Language Shell) to do some basic database operations. Let’s leave the previous terminal session running and run command `bin/cqlsh` in a new tab. So this is place where you can start writing your favorite SQL-ish commands to interact with Cassandra.

First, we might wonder what is the equivalent of ‘show databases’ in SQL so we can create a database to get started if there is no default database. In Cassandra, or even in NOSQL world, they tend to avoid using the name database, like Mongo, HBase, so does Cassandra, they use something called keyspace. The query language for Cassandra is extremely similar like SQL from MySQL. As you can see from the screenshot below, I checked there was no keyspace preexisted before and I just created a new keyspace named mykeyspace and then I created a dummy table with three columns firstname, lastname and ssn where ssn is the primary key.

As you can see, this is basically SQL and you can do create table, insert, update and delete as you have done in SQL… Here is another screen shot of some CRUD (create, read, update and delete.) operations I did.

To learn more about how to map your SQL to CQL, refer to the documentation at Datastax from here.

Now let’s write a Python application to auto generate 1 million records and insert it into Cassandra.

I am using the Cassandra Python driver from Datastax, you can access the source code from github. They have also pushed it to pypi so you can install from pip directly by running `sudo pip install cassandra-driver` (make sure you also have libev installed). You can refer to this tutorial from datastax to get quickly started.

I was totally spoiled by the flexibility in Python that you can enclose single quotes by doubles and vice versa. I was wondering there was something wrong with my CQL command and in the end, I need to enclose the inserted values by single quotes and that is all I need.

It took me about half an hour to insert 1 million rows into Cassandra, you can try it yourself using this Python script.

One interesting is that I clearly have inserted 1 million records in there, however, looks like the count(*) only returns 10K records which is only 1% of what is actually there…

It is confusing me so much.

Cassandra –

Cassandra:
http://cassandra.apache.org/download/
CQL:
https://pypi.python.org/pypi/cql/1.0.4
iPython Notebook – Jon
https://github.com/rustyrazorblade/python-presentation
SQOOP:
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaSqpDemo.html
Spark:
Shark, dramatically reduce IO.
Spark+Cassandra:

When you have to do JOIN in Cassandra, you need to put that logic into your code… so you lookup a few records, and use the result to look up in another table since the performance is really good.

Python – Requests

I probably will never use urllib2 any more because of `requests` Usually I use urllib2 to make simple http requests and parse useful content out of plain html using BeautifulSoup. Since I have done that many times and have pieces of code here and there when dealing with user agent, url encode, cookies…etc. Today I took at look at the requests library (sudo pip install requests) and it is so amazing awesome!

import requests

myheaders = {‘User -Agent’:’Mozilla 5.0′}

myparams = {‘key’:’value’, ‘key1’:’value1’…}

baseurl = “http://example.com/search?”

r = requests.get(baseurl, params=myparams, headers=myheaders)

You can see the encoded url by calling r.url and the response html by r.text

It will encode the query string properly and make the http call in that one line!

WordPress – Change URL and Home IP

My dear girlfriend is hosting her website on AWS. She got the domain name from name.com ages ago and we are trying to set up wordpress on a Ubuntu machine. The whole process was pretty easy and it took only a few minutes.

However, the most confusing and time consuming thing in the end, AWS changes its IP in so many circumstances… however, there are so much information where the IP is included in there… for example, there is a table called wp_options in the database wordpress. We saved the Ubuntu as AMI image and killed the original micro one and brought up a medium one using the image. However, it failed to try to connect to the database at the beginning, and after tinkering around for a while, I got the home page to work, however, all the links are still directing to the killed micro instance.

I went to the wordpress folder and try to find all the posts that my GF has done. And in the end, it turned out only the uploaded images are outside the database, all the posts and write ups are store in the mysql database. And I need to do several mysql update to make sure the they all point to the new IP.

Here is the stackoverflow my GF found which let me go to bed before mid night 🙂

R – Decompose Time Series with more than one seasonality using Bats and Tbasts

Origin

Elastic Search –

http://www.elasticsearch.org/blog/count-elasticsearch/

R – Writing My First R Package

I think you don’t really understand what programming is until you really understand how to package your code. There is a great tutorial from Hilary Parker and another one from Hadley Wickham.

I tried to redo the getout package in R using XML and RCurl package. You can see the source code here.