Python Remove Comment – Tokenize

Today, while doing some code review, I wanted to gauge the effort involved by estimating how many lines of code there were. For example, if you are in the root folder of a Python library such as flask, you can easily count the number of lines in each file:

(python37) $ wc -l flask/*
      60 flask/__init__.py
      15 flask/__main__.py
     145 flask/_compat.py
    2450 flask/app.py
     569 flask/blueprints.py
     ...
      65 flask/signals.py
     137 flask/wrappers.py

    7703 total

However, when you open one of the files, you realize that the vast majority of the content is either docstrings or comments, and the code review isn’t quite as intimidating as it looks at first glance.

Then you ask yourself: how do you strip out the comments and docstrings and count the effective lines of code? I didn’t find a satisfying answer on Stack Overflow, but I came across a small gist on GitHub by BroHui.

At first, I was thinking of basic string manipulation with regular expressions, but the author instead leverages the built-in libraries to take advantage of lexical analysis. I had never used these two libraries, token and tokenize, before, so it turned out to be a great learning experience.

First, let’s take a look at what a token is.

TokenInfo(type=1 (NAME), string='import', start=(16, 0), end=(16, 6), line='import requests\n')
TokenInfo(type=1 (NAME), string='requests', start=(16, 7), end=(16, 15), line='import requests\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(16, 15), end=(16, 16), line='import requests\n')

For example, one line of Python import code gets parsed and broken down into separate tokens. Each TokenInfo contains not only the token type and string, but also the physical location where the token starts and ends, as (row, column) pairs.
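To reproduce output like this yourself, here is a minimal sketch using tokenize.generate_tokens on an in-memory string. The one-line source is my own stand-in, so the row number is 1 rather than 16 as in the output above.

```python
import io
import tokenize

# Stand-in source: a single import line held in memory instead of a file.
source = "import requests\n"

# generate_tokens expects a readline-style callable that returns str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    # tok_name maps the numeric token type to its readable name.
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

Besides the two NAME tokens and the NEWLINE, the tokenizer also emits a final ENDMARKER token with an empty string.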

Once you understand tokenization, it isn’t hard to see how to identify comments and docstrings and how to deal with them. Comments are straightforward: they have the token type COMMENT (55 in the Python 3.7 environment used here; the number can differ between Python versions, so it is safer to compare against tokenize.COMMENT). A docstring is a string that sits on its own line or lines, with nothing else around it other than indentation.
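The two rules can be sketched roughly as follows. This is my own heuristic, not the code from the gist: the sample source and the function name find_comments_and_docstrings are made up for illustration, and the docstring test simply treats any string that forms a whole statement as a docstring.

```python
import io
import tokenize

# Hypothetical sample source, used only for this illustration.
SAMPLE = '''"""Module docstring."""
import os  # trailing comment

def greet(name):
    """Return a greeting."""
    # a full-line comment
    return "hello " + name
'''

# Token types after which a new statement begins (None = start of file).
STATEMENT_START = {None, tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}

def find_comments_and_docstrings(source):
    """Return (comments, docstrings) as lists of token strings."""
    comments, docstrings = [], []
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    prev_type = None  # type of the last token that was not a comment or blank line
    for i, tok in enumerate(tokens):
        if tok.type == tokenize.COMMENT:
            comments.append(tok.string)
            continue
        if tok.type == tokenize.NL:
            continue
        # Heuristic: a STRING that starts a statement and is immediately
        # followed by NEWLINE has nothing else on its line(s).
        if (tok.type == tokenize.STRING
                and prev_type in STATEMENT_START
                and tokens[i + 1].type == tokenize.NEWLINE):
            docstrings.append(tok.string)
        prev_type = tok.type
    return comments, docstrings

comments, docstrings = find_comments_and_docstrings(SAMPLE)
print(comments)
print(docstrings)
```

The string "hello " in the return statement is not picked up, because the token before it is a NAME rather than the start of a statement.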

Keep in mind that we are walking through the tokens one by one, so you need to reproduce the rest of the original content, including the spacing between tokens, as you go, and drop only the pieces you want removed.
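One way to keep the layout intact is to use each token’s start and end positions to restore the gaps between tokens. The sketch below is my own simplified take on that idea, not the gist itself; rebuild and its drop parameter are hypothetical names. It assumes the source is indented with spaces and has no backslash line continuations, since neither survives this kind of rebuild.

```python
import io
import tokenize

# Hypothetical sample source, used only for this illustration.
SAMPLE = '''import os  # trailing comment

def greet(name):
    # a full-line comment
    return "hello " + name
'''

def rebuild(source, drop=lambda tok: False):
    """Re-emit source token by token, skipping tokens for which drop(tok) is true."""
    out = []
    last_line, last_col = 1, 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        start_line, start_col = tok.start
        if start_line > last_line:
            last_col = 0  # a new physical line starts at column 0
        if start_col > last_col:
            # Restore the whitespace that separated this token from the previous one.
            out.append(" " * (start_col - last_col))
        if not drop(tok):
            out.append(tok.string)
        last_line, last_col = tok.end
    return "".join(out)

# With nothing dropped, the sample comes back unchanged.
print(rebuild(SAMPLE) == SAMPLE)

# Dropping COMMENT tokens leaves the code (and some trailing spaces) in place.
stripped = rebuild(SAMPLE, drop=lambda tok: tok.type == tokenize.COMMENT)
print(stripped)
```

Because only the comment text is dropped, the padding in front of each comment remains as trailing whitespace, and a line that held nothing but a comment becomes a whitespace-only line.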

Frankly speaking, I cannot wrap my head around the flags that the author uses to keep track of the previous token, or the first two if statements. However, I don’t think that matters much, so let’s make a note of it and focus on the application.

Here I created a small code sample with test docstrings and comments, shown in blue.

[Screenshot: the sample code]

This is the output of the tokenization, and I have highlighted the lines that interest us.

[Screenshot: tokenization output with the relevant lines highlighted]

This is the final output after the parsing. However, you might want to remove the comments completely, or even make the result more compact by removing blank lines. We can do that by modifying the code above: replace the relevant mod.write calls with pass, and also identify the “NL” tokens and skip them.

[Screenshot: final output after the parsing]
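If all you need is the number of effective lines, rather than a stripped copy of the file, you can skip the rewriting step entirely and just collect the line numbers that hold real code. This is a sketch of my own under the same docstring heuristic as before (a string that forms a whole statement); count_code_lines and the sample are hypothetical names for illustration.

```python
import io
import tokenize

# Hypothetical sample source, used only for this illustration.
SAMPLE = '''"""Module docstring."""
import os  # trailing comment

def greet(name):
    """Return a greeting."""
    # a full-line comment
    return "hello " + name
'''

# Tokens that never count as code on their own.
IGNORED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
           tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
# Token types after which a new statement begins (None = start of file).
STATEMENT_START = {None, tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}

def count_code_lines(source):
    """Count physical lines that contain at least one code token."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    code_lines = set()
    prev_type = None  # type of the last token that was not a comment or blank line
    for i, tok in enumerate(tokens):
        if tok.type in (tokenize.COMMENT, tokenize.NL):
            continue
        is_docstring = (tok.type == tokenize.STRING
                        and prev_type in STATEMENT_START
                        and tokens[i + 1].type == tokenize.NEWLINE)
        if tok.type not in IGNORED and not is_docstring:
            # A token may span several lines (e.g. a multi-line string in code).
            code_lines.update(range(tok.start[0], tok.end[0] + 1))
        prev_type = tok.type
    return len(code_lines)

print(count_code_lines(SAMPLE))  # 3: the import, the def and the return line
```

To apply it to a real file, read the file’s text and pass it in; running it over every file in a package would give a comment-free counterpart to the wc -l totals at the top of the post.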
