Python – Profiling cProfile

Recently I have been working on a project with lots of raw time series data, but the output needs to be higher level statistics, which requires aggregating the raw data. So I have two choices: for every request, pull the raw data and calculate the statistics on the fly, or preprocess all the raw data and store the stats somewhere else, so that when a user needs the data it is simply a lookup.

Since the data is pretty big and updated daily, batch preprocessing all of it is like boiling the ocean, and what is even worse, we need to reboil the ocean every day. It is also possible that the users of this service won’t query most of that data very often, which would make the preprocessing a huge waste of computing power. On the other hand, calculating on the fly has its own challenges: the logic needs to be generic and robust enough to succeed for every part of the data, and most importantly, it needs to be fast enough to run as a service.

In the past, I judged my Python code along the lines of “hmm, it is fast” or “hmm, it is taking a long time”, with nothing more than a Linux “time python script.py”. Now I face the challenge of turning the calculation into something that is fast at a service level (<100ms). That requires a quantitative understanding of how much time each step takes and where the bottleneck is, so that we can strategically improve specific parts without switching to another programming language (C, Java).

Then I learned that this type of analysis is called profiling:

“A profile is a set of statistics that describes how often and for how long various parts of the program executed.” – Python Documentation

cProfile is the de facto tool for profiling Python code. It is not the most user-friendly tool, but once you spend some time getting familiar with its interface, you will have something like the Linux top command, but for your Python code.

import cProfile 
cProfile.run('range(10)')
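The one-liner above prints a flat table for a trivial statement. As a slightly more realistic sketch, the snippet below profiles a made-up request handler and sorts the report by cumulative time with the standard pstats module. The functions slow_sum, fast_sum and handle_request are hypothetical stand-ins for real aggregation steps, not code from my project.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical aggregation step written as an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum(n):
    """The same aggregation using the built-in sum."""
    return sum(i * i for i in range(n))


def handle_request():
    """Hypothetical stand-in for one service request."""
    return slow_sum(200_000) + fast_sum(200_000)


# Profile a single call and keep the collected statistics in memory.
profiler = cProfile.Profile()
result = profiler.runcall(handle_request)

# Render the ten most expensive entries, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

The cumtime column shows how much of the request each function is responsible for, including the functions it calls, which is usually the first place to look for a bottleneck.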

Either the Python documentation or PyMOTW can help you get started quickly. Then I came across a blog post from Julien Danjou, “Profiling Python using cProfile: a concrete case”, which introduced me to KCacheGrind.

If you are a Mac user like me, follow these instructions to brew install everything you need.

pip install pyprof2calltree, and then you will be good to go.

pyprof2calltree -k -i file_profile
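The -i argument expects a stats file written by the profiler. Here is a minimal sketch of one way to produce such a file from Python, assuming a hypothetical aggregate function as the code under test and a temporary directory as the output location.

```python
import cProfile
import os
import pstats
import tempfile


def aggregate(n):
    """Hypothetical stand-in for the real aggregation step."""
    return sum(i * i for i in range(n))


def profile_to_file(path):
    """Profile one call to aggregate() and dump the raw stats to path."""
    profiler = cProfile.Profile()
    profiler.enable()
    aggregate(100_000)
    profiler.disable()
    profiler.dump_stats(path)
    return path


# Write the stats file, then load it back to confirm it is readable.
stats_path = profile_to_file(os.path.join(tempfile.mkdtemp(), "file_profile"))
stats = pstats.Stats(stats_path)
stats.sort_stats("cumulative").print_stats(5)
```

For a whole script, running “python -m cProfile -o file_profile script.py” writes the same kind of file without touching the code.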

In the end, you will have a beautiful visualization of how long each step takes.

Screen Shot 2016-07-23 at 6.59.56 PM.png

 

Shell – Exclamation Mark !

The exclamation mark will definitely speed up your history lookups. People usually look up history by hitting the up arrow to step through previous commands. I have literally seen someone hit the arrow key more than 20 times and still not find the command he was looking for.

Then there are users who use the ‘history’ command to look up history. You can either copy and paste the command, or find its line number and use “!<number>” to execute it.

Some people also use commands like ‘history | grep <keyword>’. However, if you happen to know that the command you are looking for starts with a certain prefix, or contains a certain keyword, you can use “!<prefix>” or “!?<substring>?” to rerun the most recent command that starts with that prefix or contains that substring.

(note: !xxx itself does not show up in the history; the command it expands to does)

 

$ echo 'hello'
hello
$ cd ~
$ !ec
echo 'hello'
hello
$ !hell
-bash: !hell: event not found
$ !?hell?
echo 'hello'
hello
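To make the lookup rules concrete, here is a toy Python model of how the three forms are resolved against a history list. It is only a sketch of the behaviour shown in the session above, not how bash is implemented; the function name expand is made up, and treating history numbers as 1-based positions in the list is a simplifying assumption.

```python
def expand(event, history):
    """Resolve a bash-style history event against past commands (oldest first)."""
    if not event.startswith("!"):
        return event
    body = event[1:]

    if body.isdigit():
        # "!<number>": pick the command at that (1-based) position.
        index = int(body) - 1
        if 0 <= index < len(history):
            return history[index]
        raise LookupError(f"{event}: event not found")

    if body.startswith("?"):
        # "!?<substring>?": most recent command containing the substring.
        needle = body[1:-1] if body.endswith("?") else body[1:]

        def matches(command):
            return needle in command
    else:
        # "!<prefix>": most recent command starting with the prefix.
        def matches(command):
            return command.startswith(body)

    for command in reversed(history):
        if matches(command):
            return command
    raise LookupError(f"{event}: event not found")


history = ["echo 'hello'", "cd ~"]
print(expand("!ec", history))
print(expand("!?hell?", history))
```

With this history, “!hell” raises a LookupError, which mirrors the “event not found” message: no command starts with “hell”, even though one contains it.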

Here is an answer on stackexchange that contains a more detailed explanation of use cases for exclamation!

Spring-boot: actuator default endpoints

Actuator is a sub-project of Spring Boot that provides production-ready features for Spring Boot applications. It adds a number of features to monitor and manage the application once it is pushed to production.

You can try it out by cloning the spring-boot GitHub repo and navigating to the spring-boot-samples directory, which contains plenty of built-in samples. Find the one called spring-boot-sample-actuator-log4j2 and run the command `mvn spring-boot:run` to bring up the Spring Boot application.

Screen Shot 2016-07-03 at 10.41.36 AM

As you can see, there is not really any code in the project that defines this autoconfig endpoint. “autoconfig” might not be the most interesting or straightforward endpoint, but it is actually a really important and sophisticated one: it displays a report of all the auto-configuration candidates, showing which ones were applied, which were not, and why.

You can refer to the spring-boot-actuator documentation for a complete list of the available endpoints. Here are a few that I tried out myself, along with some descriptions and screenshots to help you understand how they work in real life:

1. configprops – configuration properties

Screen Shot 2016-07-03 at 10.53.20 AM.png

2. health – health status

application health information

Screen Shot 2016-07-03 at 11.05.00 AM

3. metrics – metrics of current application

If you are in a production environment, I think you should care about every number in the response.

Screen Shot 2016-07-03 at 11.08.48 AM.png

4. mappings – display a collated list of all paths

Screen Shot 2016-07-03 at 11.14.37 AM

I pasted the response into a site called jsonlint to put it in a format that is easier for humans to read.

Screen Shot 2016-07-03 at 11.14.17 AM

5. shutdown – make a POST to the server to shut it down

This is quite a dangerous endpoint: a POST request to it will shut the server down.

Screen Shot 2016-07-03 at 11.19.02 AM.png
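If you prefer a script over a browser, the endpoints above can also be queried from Python. This is only a sketch under stated assumptions: the sample app is listening on localhost:8080 and exposes the endpoints at the root path (for example /health), which is my assumption about this setup rather than something the screenshots prove. The helper names are made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: default port of the sample app


def endpoint_url(name, base_url=BASE_URL):
    """Build the URL of an actuator endpoint such as 'health' or 'metrics'."""
    return f"{base_url.rstrip('/')}/{name}"


def get_endpoint(name, base_url=BASE_URL, timeout=5):
    """GET an endpoint and parse the JSON body (needs the app to be running)."""
    with urllib.request.urlopen(endpoint_url(name, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def shutdown_request(base_url=BASE_URL):
    """Build, but do not send, the POST request that would stop the app."""
    return urllib.request.Request(endpoint_url("shutdown", base_url), data=b"", method="POST")


print(endpoint_url("health"))
print(shutdown_request().get_method())
```

With the app running, get_endpoint("health") returns the same JSON as the health screenshot, and passing shutdown_request() to urllib.request.urlopen sends the shutdown POST. As far as I know the shutdown endpoint has to be enabled explicitly in the application configuration before it will respond.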

Well… here they are, enjoy the awesome work done and appreciate it.

Serialization and Deserialization

I am really curious how an object is serialized at the byte level. I borrowed the example from tutorialspoint, modified it a bit, and here is what I have at the moment.

Here is my Employee class and here is my main function.

serial_employeeserial_main

Clearly, the main function writes a file to my desktop, and you can use your favorite editor to take a look at the serialized, i.e. binary, file. There is a tool called hexedit which might come in handy. Here is a screenshot of how the binary file looks in the text editor.

hexedit

As you can tell, the binary file is a bit messy, but most of the contents are almost human readable. For example, we have 4 attributes and all the string fields are easy to spot. However, the goal of this post is to decode 100% of the bytes and understand how Java really serializes an object.

This really has nothing to do with intelligence; it is a matter of reading the Java serialization protocol. Here is where the protocol is, and of course, it is the only source I have to decipher this binary file.

At the time of writing this post, I have not fully deciphered every byte yet, but I would say I am almost 80% there, and here is my progress.

# raw value
aced 0005 7372 0019 636f 6d2e 6461 7461
6669 7265 6261 6c6c 2e45 6d70 6c6f 7965
65da 231e 1f8f 8a0e 4402 0003 4900 066e
756d 6265 724c 0007 6164 6472 6573 7374
0012 4c6a 6176 612f 6c61 6e67 2f53 7472
696e 673b 4c00 046e 616d 6571 007e 0001
7870 0001 0932 7400 0864 697a 6869 e590
8d74 0006 6d69 6e67 7a69 
------
# decipher
aced: (stream magic) 
0005: (stream version)
73: (object)
72: (class description) 
0019: (25 bytes)
636f 6d2e 6461 7461 6669 7265 6261 6c6c 2e45 6d70 6c6f 7965 65: com.datafireball.Employee
da 231e 1f8f 8a0e 4402 0003 
49: (I)
00 06: (6 bytes)
6e 756d 6265 72: number
4c: (L)
0007: (7 bytes) 
6164 6472 6573 73: address
74: (string marker)
0012: (18 bytes) 
4c6a 6176 612f 6c61 6e67 2f53 7472 696e 673b: Ljava/lang/String; 
4c: (L)
00 04: (4 bytes)
6e 616d 65: name
71 007e 0001 7870 
0001 0932: 67890
74: (string marker)
00 08: (8 bytes)
64 697a 6869: dizhi 
e590 8d: 名
74: (string marker)
0006: (6 bytes) 
6d69 6e67 7a69: mingzi
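To check the annotations mechanically, here is a minimal Python sketch that walks the same hex dump following the stream grammar in the Java Object Serialization Specification. It only handles what this particular stream contains (one object with an int field and two String fields), and the names Reader and decode_employee are made up for illustration. It reads the bytes left unlabelled above as the 8-byte serialVersionUID, the class flags and the field count, a back-reference to the earlier “Ljava/lang/String;” string, and the end-of-annotation and null-superclass markers.

```python
import struct

HEX_DUMP = """
aced 0005 7372 0019 636f 6d2e 6461 7461
6669 7265 6261 6c6c 2e45 6d70 6c6f 7965
65da 231e 1f8f 8a0e 4402 0003 4900 066e
756d 6265 724c 0007 6164 6472 6573 7374
0012 4c6a 6176 612f 6c61 6e67 2f53 7472
696e 673b 4c00 046e 616d 6571 007e 0001
7870 0001 0932 7400 0864 697a 6869 e590
8d74 0006 6d69 6e67 7a69
"""

# Type codes and the first handle value, as defined by the stream grammar.
TC_NULL, TC_REFERENCE, TC_CLASSDESC = 0x70, 0x71, 0x72
TC_OBJECT, TC_STRING, TC_ENDBLOCKDATA = 0x73, 0x74, 0x78
BASE_WIRE_HANDLE = 0x7E0000


class Reader:
    """Cursor over the raw bytes; all integers are big-endian."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def unpack(self, fmt):
        (value,) = struct.unpack_from(fmt, self.data, self.pos)
        self.pos += struct.calcsize(fmt)
        return value

    def expect(self, code):
        found = self.unpack(">B")
        if found != code:
            raise ValueError(f"expected 0x{code:02x}, found 0x{found:02x}")

    def utf(self):
        # 2-byte length, then that many bytes. Java writes "modified UTF-8",
        # which matches plain UTF-8 for the characters in this dump.
        length = self.unpack(">H")
        raw = self.data[self.pos:self.pos + length]
        self.pos += length
        return raw.decode("utf-8")


def decode_employee(data):
    r = Reader(data)
    result = {"magic": r.unpack(">H"), "version": r.unpack(">H")}
    r.expect(TC_OBJECT)
    r.expect(TC_CLASSDESC)
    result["class_name"] = r.utf()
    result["serial_version_uid"] = r.unpack(">q")
    handles = [result["class_name"]]   # handle 0x7e0000: the class description
    result["flags"] = r.unpack(">B")   # 0x02 means SC_SERIALIZABLE
    fields = []
    for _ in range(r.unpack(">H")):    # 2-byte field count
        type_code = chr(r.unpack(">B"))
        name = r.utf()
        type_name = None
        if type_code == "L":           # object fields carry a type string
            tag = r.unpack(">B")
            if tag == TC_STRING:
                type_name = r.utf()
                handles.append(type_name)
            elif tag == TC_REFERENCE:  # 4-byte handle to an earlier string
                type_name = handles[r.unpack(">I") - BASE_WIRE_HANDLE]
            else:
                raise ValueError(f"unexpected tag 0x{tag:02x}")
        fields.append((type_code, name, type_name))
    result["fields"] = fields
    r.expect(TC_ENDBLOCKDATA)          # end of class annotations
    r.expect(TC_NULL)                  # no serializable superclass
    values = {}
    for type_code, name, _ in fields:  # field values, in declared order
        if type_code == "I":
            values[name] = r.unpack(">i")
        else:
            r.expect(TC_STRING)
            values[name] = r.utf()
    result["values"] = values
    result["bytes_left"] = len(data) - r.pos
    return result


decoded = decode_employee(bytes.fromhex("".join(HEX_DUMP.split())))
print(decoded["class_name"], decoded["fields"], decoded["values"])
```

The stream lists only three fields. My reading is that the fourth attribute of the class is declared transient in the tutorialspoint example and is therefore skipped, but that is an inference from the bytes rather than something the dump states.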

 

 

 

ToStringHelper(guava) – SimpleResponse

I came across a handy Java class from Google’s Guava library (com.google.common) named ToStringHelper. I found it in use while reading the openscoring source code.

Villu wrote this tiny SimpleResponse class in openscoring.common. It is serializable and has only one attribute and three methods: the getter, the setter, and the toString method, which is implemented using MoreObjects.toStringHelper.

First, we need to cover something basic about “toString“. It is a method that comes with the class Object, which means that every class in Java inherits this default toString method unless it is overridden. The default output is just the class name followed by a hash code, but who needs a hash code, right?! Users tend to need something more informative and concise: you might want the first name and last name of a person object, or the title and author of a book object, and so on. You can override toString in whatever way you prefer, but toStringHelper makes this part easy and consistent.

toStringHelper

using tostringhelper

toStringHelper_default

default tostring
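For readers who think in Python, here is a rough analogue of the same idea: a tiny builder that collects name/value pairs and renders them as ClassName{name=value}. This is not Guava’s code, only a sketch of the pattern. The ToStringHelper class below is my own stand-in, and the message attribute is a hypothetical name for the single attribute of SimpleResponse.

```python
class ToStringHelper:
    """Collects name/value pairs and renders them as ClassName{a=1, b=2}."""

    def __init__(self, obj):
        self._class_name = type(obj).__name__
        self._pairs = []

    def add(self, name, value):
        self._pairs.append((name, value))
        return self  # returning self allows chained .add(...) calls

    def __str__(self):
        body = ", ".join(f"{name}={value}" for name, value in self._pairs)
        return f"{self._class_name}{{{body}}}"


class SimpleResponse:
    """Hypothetical one-attribute class, mirroring the shape described above."""

    def __init__(self, message=None):
        self.message = message

    def __str__(self):
        return str(ToStringHelper(self).add("message", self.message))


print(SimpleResponse("model deployed"))
print(object())  # the default: a class name plus a memory address
```

The second print shows the uninformative default, which plays the same role as Java’s class name plus hash code.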

Keep this little trick in your toolbox and hopefully it is helpful sometime.

 

Restful Java with Jax-rs 2.0 – How to run Shop App

I am reading the O’Reilly book RESTful Java with JAX-RS 2.0 by Bill Burke. I am trying to follow the example in Chapter 3 on how to deploy a naive web API that can create, update and get customer information. The book comes with sample code, which you can find in this Github repo. (If you cannot locate the right project, it should reside in “restful_java_jax-rs_2_0-master/resteasy-jaxrs-3.0.5.Final/examples/oreilly-jaxrs-2.0-workbook/ex03_1”). I tried running the code in several ways, from the command line and from Eclipse, and tested it with JUnit and with tools like Postman. I want to list my experience here so others can save some time and get it up and running fast.

There is a README file under the project directory telling you that “mvn clean install” is the way to go. Clearly, the author put a lot of thought into the pom.xml: it builds the war file, deploys it using the Jetty Maven plugin, and runs a few unit tests as the client to ensure that all the CRUD operations work as they are supposed to. Here is what the final JUnit test looks like:

ex3_1junit

As you can see, it “looks like” the application is working fine. However, after the Maven build no server is left running, so you cannot really get hands-on experience playing around with the app. For that, we need to deploy the application and keep it running as long as we want, and potentially test it with our own tool set instead of the JUnit test.

1. Jetty-runner.jar

A few words about Jetty if you are new to the whole Java web app thing, like me. In one sentence, Jetty is a web server and servlet container. It was first created in 1995, was open sourced, and has been hosted on SourceForge, Codehaus, Eclipse and now Github. Instead of diving into how to install and integrate a full Jetty server, you can use jetty-runner, an easy version that packages everything you need into a single jar file that can run Java web applications. Here is a more detailed tutorial on how to use jetty-runner.

I first went to the target folder of the shop app, where the ex03_1.war file was generated by the mvn clean install in the previous section. I downloaded jetty-runner to the target folder so that the jar and the war file are in the same directory. Then you simply need to run the command

java -jar jetty-runner*.jar ex03_1.war

Screen Shot 2016-06-12 at 10.35.14 AM

And you should have the web application running on your localhost at port 8080. Now let’s test it without using the JUnit test.

2. POSTMAN

In the ideal scenario, I would use a browser to show you this. However, AFAIK there is no easy way to make a POST request like this from a vanilla browser. There are tons of browser extensions you can use, though, and the one that I am going to use today is called Postman. First, let’s take a quick look at how the JUnit test created a user.

ex3_1juniteclipse

The logic is pretty straightforward: first create a customer in XML format, then make a POST request to the highlighted URL, and you should get a 201 status for successfully creating a user. Now, let’s try to do it in Postman.

postman_createuser.png

As you can see, jetty-runner’s default behavior is pretty good, and Postman confirms that creating a customer through our API works properly. Here is another screenshot of a GET for the newly created user.

postman_get
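The same two calls can be scripted without Postman. The sketch below uses only the Python standard library. The URL and the XML element names are assumptions for illustration, so replace them with the highlighted URL and the payload from the JUnit test in your checkout. The function names are also made up.

```python
import urllib.request

# Assumed values: substitute the URL and payload used by the JUnit test.
CUSTOMERS_URL = "http://localhost:8080/services/customers"
CUSTOMER_XML = """<customer>
  <first-name>Jane</first-name>
  <last-name>Doe</last-name>
</customer>"""


def build_create_request(url, xml_body):
    """Build the POST request that submits a customer as XML."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


def create_customer(url, xml_body):
    """Send the POST; returns the status code and the Location header."""
    with urllib.request.urlopen(build_create_request(url, xml_body)) as response:
        return response.status, response.headers.get("Location")


def get_customer(location):
    """GET the customer document at the given URL."""
    with urllib.request.urlopen(location) as response:
        return response.read().decode("utf-8")


request = build_create_request(CUSTOMERS_URL, CUSTOMER_XML)
print(request.get_method(), request.full_url)
```

With the server from the previous section running, create_customer should report a 201 status. If the response carries a Location header, it can be passed straight to get_customer to fetch the new record.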

Jetty runner also has some extra arguments for customization. Here are two screenshots of how I changed the port and the default root path, and I am even running it against the folder instead of the war file.

jettyextrafeaturesnewpostman

3. Eclipse

An IDE like Eclipse or IntelliJ is always good to have. It not only gives you a heavy-duty text editor, but also provides all the development features that a plain text editor lacks. Also, everything that we did above to deploy with Jetty can be configured as a one-button click, which makes quite a difference when you need to do the same thing 100 times!

In this case, I am planning to learn more about Tomcat, so I downloaded Tomcat and uncompressed it into a folder. Since all the examples in the Github repo are Maven projects, you can easily import them as existing Maven projects. Then you simply need to right click the jaxrs-2.0-workbook-ex03_1 project and choose “run on server”. Find the right Tomcat version, point to the installation folder, and you are good to go.

eclipse_tomcat_ex03_01.png

The only thing I did not fully understand is why the URL root turned out to be the project folder name, which is “ex03_01”. If you happen to know the answer, please leave a comment below.

Kerberos – Create New User

I followed this tutorial and managed to install Kerberos on an AWS Ubuntu box. I did not notice anything different until I needed to create users.

In a plain Ubuntu environment, creating a new user takes nothing but two commands, useradd and passwd, which create the user and set the password, provided you are root or a sudoer. However, in a Kerberized environment, you not only need to create the user at the Linux system level, but also need to create a principal in the Kerberos database and set up the password there.

Here is a screenshot of how to create a new user; the extra steps are simply logging in to the Kerberos admin server and adding the principal there. You might be wondering what those extra steps bring us. You are right that on a single machine they add zero benefit. However, consider a network of hosts/servers: when you change your password, do you really want to go to every machine and change it one by one? Having centralized third-party authentication software like Kerberos will totally save that time.

keberosadduser

Jenkins – Continuous Integration for Python Flask

Jenkins is the leading automation server; it automates certain parts of the software development lifecycle. For example, most people use Github to store their code, and after the initial setup, Jenkins can “automatically” pull the code from Github, build it, test it, and deploy it. None of this is rocket science, and automating it saves lots of time and avoids inconsistency.

Here, I used a fresh Jenkins installation to show you how Jenkins does all of that, pulling the code from bitbucket.

The Python application that we are pulling is pretty short, but it uses Flask, Pandas and an Anaconda (conda) virtual environment, so I think it is a good experience to share.

First of all, you need to go to the deployment server and set up a virtual environment in your home folder. I think this part could also be integrated into the job, but in this case I did it manually. Jenkins will then take care of the rest.


Bash – process scheduling sleep & at

There is a really interesting page in Introduction to Linux that discusses a few ways to schedule processes. Of course, the most commonly used one is crontab, the go-to scheduler, which is like what oozie is for hadoop. Besides that, the page talks about a command called “at” and a few fun use cases for “sleep”.

Let’s start with the sleep command, which does nothing but sleep. Combined with other commands and run in the background, it makes a naive timer. Say “in 5 minutes, I need to head to the other building”. Then you can type the following command

(sleep 5m; echo "you need to go now!") &

which will first sleep for 5 minutes and then print a message to stdout as your reminder. I know you have your iPhone but.. this is kind of cool, right? And do you think your iPhone can kick off a map reduce job two hours from now? This sure can🙂
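If you want the same background reminder inside a Python program, the standard library’s threading.Timer does roughly what the subshell does: wait in the background, then run something. This is only an analogue of the shell one-liner. The function name remind_in is made up, and the delay is shortened to a fraction of a second so the demo finishes quickly.

```python
import threading


def remind_in(seconds, message, sink=print):
    """Call sink(message) after the delay, without blocking the caller."""
    timer = threading.Timer(seconds, sink, args=(message,))
    timer.start()
    return timer


# Use 5 * 60 for the five-minute reminder; 0.1 keeps this demo short.
timer = remind_in(0.1, "you need to go now!")
print("the prompt is free while the timer waits")
timer.join()  # only needed here so the demo waits for the reminder
```

Swapping print for a function that launches a job gives you the “kick off something two hours later” variant.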

Second, there is a command called “at”, which is not installed by default on Redhat; you can easily install it with “sudo yum install at”. Before you do anything, you first need to run “sudo atd” so that the daemon is up and listening. Otherwise, the at command won’t work. Once that is done, you can simply run a command like “at HH:MM”. Here is a screenshot of how it works:

at

A few notes regarding the small test:

  1. atq: list all the existing at commands waiting in the queue, empty in this case
  2. after the second echo command, you need to hit Ctrl+D to exit
  3. after 00:29:00, a file named output.txt got generated!

After learning this command, I can totally imagine how many pranks I can play on April Fool’s Day next year, ahaha!

Bash – tee and “Here Document”

I was reading some documentation and came across this block of bash code:

$ sudo tee /etc/yum.repos.d/docker.repo <<-EOF 
[dockerrepo] 
name=Docker Repository 
baseurl=https://yum.dockerproject.org/repo/main/centos/7 
enabled=1 
gpgcheck=1 
gpgkey=https://yum.dockerproject.org/gpg 
EOF

It is so intriguing: first, it uses the tee command, which I don’t use in my daily life; second, it has that weird “<<” that I have never seen!

tee:

read from standard input and write to standard output and files

tee

As you can tell, after I entered the command “tee output.txt”, it started waiting for my input. When I typed in “line1” and hit enter, it “tee”d line1 back out to standard output, and did the same for line2. Then I hit Ctrl+C to stop the input. All my previous input had been captured and “tee”d out to output.txt.
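The behaviour is simple enough to sketch in a few lines of Python: read the input line by line and write each line to every destination. This is a toy model of the idea, not the real implementation, and it uses in-memory streams in place of the terminal and the file so that it runs anywhere.

```python
import io


def tee(source, *sinks):
    """Copy every line from source to each of the sinks."""
    for line in source:
        for sink in sinks:
            sink.write(line)


# In-memory stand-ins: typed input, the screen, and output.txt.
typed = io.StringIO("line1\nline2\n")
screen = io.StringIO()
captured = io.StringIO()

tee(typed, screen, captured)
print(screen.getvalue(), end="")
```

Passing sys.stdin, sys.stdout and an open file instead of the three StringIO objects gives the interactive behaviour from the screenshot.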

<< Here document

Someone pointed me to this documentation, and I realized that there is a proper name for this double less-than symbol: “Here document”. In one sentence, it uses IO redirection to feed a block of text (often a list of commands) to a command’s stdin, including interactive commands.

heredocument

EOF is simply a conventional delimiter marking the beginning and end of the block; you can use any string you want.
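The closest Python equivalent I know of is passing a multi-line string as the input of a subprocess. The sketch below feeds a block of text to a child process’s stdin, which is what the here document does for tee. The child here is a tiny Python one-liner that echoes stdin, used only so the example does not depend on any particular shell command, and feed_stdin is a made-up helper name.

```python
import subprocess
import sys

# The block of text that would sit between the two EOF markers.
REPO_TEXT = """[dockerrepo]
name=Docker Repository
enabled=1
"""


def feed_stdin(command, text):
    """Run command with text as its standard input and return its stdout."""
    completed = subprocess.run(
        command, input=text, capture_output=True, text=True, check=True
    )
    return completed.stdout


# Stand-in for a real command: a child process that copies stdin to stdout.
echo_stdin = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
output = feed_stdin(echo_stdin, REPO_TEXT)
print(output, end="")
```

Replacing echo_stdin with a real command list sends the same block to that command instead.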