Small Chat Room using Kafka and Flask

I have always been interested in learning more about consumer applications like chat rooms. This weekend, I managed to build a small chat room using Kafka and Python Flask in just a few hours.

The technology stack is very simple: primarily Python Flask, with very limited HTML and JavaScript. I had to use Redis to get Flask SSE working, and Kafka as the message queue system; both of those components require close to minimal setup.

The architecture is very straightforward too. When the client enters a message and clicks submit, a JavaScript function is triggered that passes the username and message to a Python Flask endpoint. There are probably JavaScript libraries that connect to Kafka directly, but I feel it is cleaner to keep the JavaScript separated from backend components like the message queue and have it hit a middleware API endpoint instead. That keeps the clients from having any visibility into the backend structure, which is good from a security perspective, and it also gives some flexibility to modify the backend without impacting any front-end work. The Flask sendmsg endpoint then uses the Python Kafka library to produce a message into Kafka.
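
As a rough sketch of the front-end side of that flow (the /sendmsg path, element ids and JSON field names here are my assumptions, not necessarily what the project uses):

// Hypothetical submit handler: hand the message to the Flask middleware;
// only the backend talks to Kafka.
function sendMessage() {
  const username = document.getElementById("username").value;
  const message = document.getElementById("message").value;

  fetch("/sendmsg", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ username, message }),
  }).then(() => {
    document.getElementById("message").value = ""; // clear the input box
  });
}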

As a consumer subscribed to the topic can confirm, all the messages are now in the queue. The idea is that whenever there is a new message, there is a mechanism to detect it and then "broadcast" or "publish" it to all the "subscribers", i.e., all the clients who happen to be in the same chat room.

At this moment, I don't know how to create a long-running process within Flask, so I created a small Python consumer running in another script. It listens to the topic and pushes any new message to another Flask endpoint called "broadcast".

The broadcast endpoint does nothing more than take in the new message and publish it via server-sent events (SSE). There are plenty of benefits to using SSE compared with polling, or sometimes even an open WebSocket.

Now all the clients have an event listener on the SSE stream. Any new message received from the stream is appended to the chat history.
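
A minimal sketch of that listener, assuming flask_sse's default /stream URL, a "message" event type and a "chat-history" element (all assumptions on my part):

const source = new EventSource("/stream");

source.addEventListener("message", (event) => {
  const { username, message } = JSON.parse(event.data);
  const line = document.createElement("div");
  line.textContent = `${username}: ${message}`;
  document.getElementById("chat-history").appendChild(line); // append to the chat history
});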

More work can be done around the Kafka part: the user's chat history could be retrieved from a database, or we could use Kafka as the data store directly, with the offset determining how far back in the history we go.

Anyway, it is a pretty fun project overall. 🙂

Excel – Workday

Today I learned a few tricks in Excel which are pretty fascinating.

WORKDAY

Today my coworker needed to add some delay to an existing date (more precisely, adding the number of days needed to process an order on top of the order-received date). It could work as a simple addition of a number to a date cell. However, my coworker insisted on using a function called WORKDAY, which I had never seen before. For example, =WORKDAY(A2, 5) returns the date five working days after the date in A2. It turned out to be such a simple concept while modeling a more realistic scenario – most people don't work during weekends 🙂

Format Weekday

Another thing I learned in Excel is the various ways of formatting dates. One way to display the day of the week is simply to use a custom format of ddd (Wed) or dddd (Wednesday).

Observable

Today I came across Observable and really liked it at first glance. It gave me the impression of being an online notebook IDE like Jupyter, but for JavaScript.

JavaScript tends not to be the first programming language that schools teach. Most engineering-related majors start with strongly typed languages like C++, Java or C, and now maybe more schools start with Python too. However, JavaScript has gained huge adoption in the community as the most popular scripting language for the Internet, and the first question for all beginners is "where should I write code?". I frowned when my friend opened up the Chrome browser and started writing code in the console. It was powerful and all, but somehow it just did not feel right to me: there was no canvas like a Jupyter notebook, you have to attach graphs to an existing DOM, etc.

In an Observable notebook, everything becomes so much easier to test and visualize. The code is reactive: you can execute a block of code and see the output immediately.

You can not only write JavaScript code straight into a cell, but also use Markdown to format the notebook however you like.

I am still learning it, but I highly recommend checking it out. You might have heard of D3.js; they even use Observable to host their gallery examples, which further demonstrates its popularity in the JavaScript visualization community.

Windows WSL Vim CopyPaste and Auto indent

Recently I changed my work laptop from an old 2015 MacBook Pro to a new DELL XPS 15 7590. The overall experience has been positive despite a few glitches here and there, mostly related to changing habits rather than subpar functionality. One of them is copy and paste in Vim under WSL.

In Windows, there is no Command key like on a Mac. What used to be Command+C/Command+V becomes Ctrl+C/Ctrl+V in Windows. However, Ctrl key combinations in the terminal, especially in WSL (Ubuntu Linux), mean something very different. So I had to get used to right-click for copy and paste, and to change the default properties of the WSL terminal to use Ctrl+Shift+C and Ctrl+Shift+V as the copy/paste key combos.

As a Python developer, I copy and paste code into Vim quite a lot. However, in the new workflow, the pasted code came out with a tilted-ladder indentation quite a few times.

At the beginning, I thought some invisible whitespace from the web was messing things up, but the same text pasted fine into Notepad.

By looking through the content in Vim character by character, I realized that there were eight spaces at the first indent. My first reaction was "not tabs/spaces again!"

Silicon Valley Girl Friend Prefers Spaces over Tabs Scene

However, the tabs-versus-spaces debate is mostly about aesthetics; either convention, as long as it is consistent, should work rather than lead to this tilted-ladder shape. Then I took another look and quickly realized that an extra four spaces were being added at every level of indentation. This turned out to be a great default Vim provides for Python (it auto-inserts four spaces on behalf of Python developers when needed), but a nightmare default for pasting, since the pasted text already includes its own indentation.

This Stack Overflow post touched on several solutions that change the editing mode:

:set noautoindent
:set noai
:set paste

I tried all of these and they all seem to work, but :set noai / :set noautoindent somehow still introduced some unwanted whitespace, while :set paste always worked as expected.

Happy copy and paste!

Server Side Event – MDN example

One of the best ways for the server side to communicate with the front end is server-sent events (SSE), relative to other options like a WebSocket or short/long polling. In this example provided on MDN (source code here), they implement a small backend by hosting a PHP application on an nginx server.

I have very limited experience with either technology, but by following this great tutorial I was able to get PHP working with an nginx server. After that, if you drop both the backend (sse.php) and frontend (index.html) scripts into the same folder, you should be able to recreate the example below.

Server Side

In the PHP code, it is surprising how easy it is to read through, as the basic syntax is extremely similar to other programming languages. The echo ping statement is executed every second, and the echo data statement is executed every counter seconds, where counter is a random number between 1 and 10.

The only places to watch out for are:

header("Content-Type: text/event-stream")

PHP makes it pretty clear that header() must be called before any output is sent. From the SSE documentation:

The server-side script that sends events needs to respond using the MIME type text/event-stream. Each notification is sent as a block of text terminated by a pair of newlines ("\n\n"). For details on the format of the event stream, see Event stream format.

ob_get_level()

Return the nesting level of output buffering which is 0 only when it is inactive.

ob_end_flush()

This function sends the content of the topmost output buffer and turns that buffer off. The flush() call afterwards flushes the system output buffer.

Client Side

The code on the client side is even simpler. There are two ways to receive events that demonstrate the power of server-sent events: one is the onmessage handler and the other is addEventListener. The difference is that onmessage is called for messages that arrive without an explicit event field (the default "message" events), while addEventListener(eventName, ...) is only called when a message arrives with the matching event field.

addEventListener makes it easy to build several different listeners for different event types. Otherwise, we would need to send everything as plain messages and put more logic inside onmessage to tell one event from another, which adds overhead.

Long story short, onmessage is essentially shorthand for addEventListener("message", ...); addEventListener is the more general mechanism.
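
A minimal sketch of the two styles, loosely following the MDN example (the "ping" event name and JSON payload come from that example):

const evtSource = new EventSource("sse.php");

// Called for messages that carry no explicit "event:" field.
evtSource.onmessage = (event) => {
  console.log("message:", event.data);
};

// Called only for messages sent with "event: ping".
evtSource.addEventListener("ping", (event) => {
  const time = JSON.parse(event.data).time;
  console.log("ping at", time);
});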

XPath Builder – a small snippet

As far as I can tell, there is no built-in function in JavaScript to easily get the XPath of an element. There are plenty of code snippets out there, but this one from Stack Overflow totally caught my eye!

xpath_builder

This small piece of code is like a feast for people new to JavaScript like me: not only the functional programming style the author mentions, but also the concise and compact writing style, the anonymous functions, some of the newer operators, and recursive functions written almost as one-liners!

In this blog post, we will patiently walk through it line by line, character by character and appreciate the beauty together. So bring some popcorn and let’s get started.
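
Since the snippet itself only appears as a screenshot above, here is my reconstruction of it based on the walkthrough below; the details may differ slightly from the original Stack Overflow answer:

function getXPathForElement(el) {
  // idx: 1-based position of an element among preceding siblings that share its tag name
  const idx = (sib, name) => sib
    ? idx(sib.previousElementSibling, name || sib.localName) + (sib.localName == name)
    : 1;
  // segs: path segments from the root (or from an ancestor with a usable id) down to elm
  const segs = elm => !elm || elm.nodeType !== 1
    ? ['']
    : elm.id && document.getElementById(elm.id) === elm
      ? [`id(${elm.id})`]
      : [...segs(elm.parentNode), `${elm.localName}[${idx(elm)}]`];
  return segs(el).join('/');
}

// Illustrative usage: yields something like "/html[1]/body[1]/div[2]/p[1]", or "id(content)/p[1]"
// if an ancestor has an id.
console.log(getXPathForElement(document.querySelector('p')));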

 

const func = () => {}

Functions in JavaScript can be defined in many ways. For example, getXPathForElement is created with a traditional function declaration, as below:

function func() {}

However, using the arrow syntax, you can create an anonymous function and assign it to func via an expression. Here is a great Stack Overflow question that highlights the difference between the two approaches. In my personal view, inside a function it is probably better to use const with an arrow function, but if you are writing a function that will be called somewhere else, it is best to use a function declaration so it is hoisted and available in the scope of the whole script.

= condition ? then : else

This is the conditional (ternary) operator, a concise way to write an if/else statement. Once you wrap your head around the syntax and become comfortable with what it means, just like any other language construct (Python list comprehensions come to mind), it can make your code more concise: instead of 10 lines, it becomes a one-liner that is not necessarily more difficult to read.
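
For example (an illustrative snippet, not from the original code):

const count = 3;

// if/else version
let label;
if (count > 1) {
  label = "many";
} else {
  label = "one";
}

// ternary version: the same logic in one line
const label2 = count > 1 ? "many" : "one";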

idx logic

idx is a recursive function that computes the index for a given element. If the element has no previous sibling, it is the first (or only) child, and idx returns 1. If there is a previous sibling, it computes the index of that sibling and adds the value of sib.localName == name on top of it.

name || sib.localName keeps passing the tag name being counted down the recursion: if name has been provided it is reused, otherwise it falls back to the sibling's tag name. It is truthy in most cases.

sib.localName == name checks whether the sibling's localName matches the name variable; for a regular DOM element, localName is simply the (lowercase) tag name.

Let's jump ahead a bit and see how idx gets called in the end: idx(elm). In this case, only one argument is passed, which leaves the name parameter undefined.

function_undefined, undefined_or

By testing, we can find that undefined || anyDefinedValue results in the defined value. And the equality check against undefined will not be true, so on the first call the code effectively reduces to

idx = sib ? idx(sib.previousElementSibling, sib.localName) : 1

And the next round will be

idx = sib.previousElementSibling
  ? idx(sib.previousElementSibling.previousElementSibling, sib.localName) + (sib.previousElementSibling.localName == sib.localName)
  : 1

Whenever the previous sibling shares the same tag name, this reduces to

idx = ... ? idx(...) + 1 : 1

This is a very intelligent way of constructing the index, given how XPath defines positional indices.

xpath_tag_index

I know the logic might sound a bit complex, but in short it counts how many preceding siblings share the same tag name (plus one). Much easier to understand, right?

!elm || elm.nodeType !== 1

This is the stop condition for the segs function. It stops when elm doesn't exist or when its node type is no longer an element (for example, a text node). When this condition is met, it returns an empty string segment.

id(${elm.id})

The next condition checks whether the given element has an id; in XPath, an id is a unique way to locate an element. If the id exists, and the element found using that id matches the element itself (I assume there could be duplicated ids, which would be invalid), then it returns id(${elm.id}). This text is enclosed in backticks (a template literal), where ${} is evaluated and replaced with the value of the variable. For example, if the element's id is abc, the returned string will be id(abc).

… rest, spread

The triple dot is special syntax in JavaScript that can either capture the rest of the input arguments (the rest parameter) or spread an iterable such as an array into individual elements (the spread operator).
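
A quick illustrative example of both uses (not from the original snippet):

// Rest: collect the remaining arguments into an array.
const tail = (first, ...rest) => rest;
console.log(tail(1, 2, 3)); // [2, 3]

// Spread: expand an array into the elements of a new array.
const parentSegs = ["html[1]", "body[1]"];
console.log([...parentSegs, "p[1]"]); // ["html[1]", "body[1]", "p[1]"]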

spread_operator

Now we know that [...first_part, second_part] does the job, with first_part recursively computing the XPath segments of the parent and second_part being the tag name plus the index among its siblings.

Anatomy of Selection inside a Browser

One becomes humbled when one stops taking everything for granted. In this post, we will look into a basic yet fundamental browser operation – selection.

Selection is probably an operation that most people learned as a child, literally. You were taught to use the mouse to select an area of content by holding down the left button, moving to cover the area you want to select, and releasing the left button.

devmoz_selection

Your selected text is represented by an object called Selection. There are several ways to access it; one is by calling window.getSelection().

By reading the definition of Selection, we also learn that it not only covers the range selected by the pointer, but can also reflect text selected by moving the caret.

devmoz_getselection

Based on the definition, there are two key concepts that constitute a Selection: one is called the anchor (the start) and the other is called the focus (the end). Here, we are going to disassemble a Selection object and learn that way.

selection_anchor_focus

Here I highlighted some text on Wikipedia (shown as a grey area since the focus is now in the developer console), from "terial may.." to "November 2".

anchor

anchor

At first glance, it might seem like an overwhelming amount of information, but just like any class, once you have seen it a few times things become easier and easier. Actually, anchorNode is a simple object of the most common class – Node.

Also, anchorOffset stores the value 14. Because our anchorNode is a text node, 14 is simply the number of characters into the text, marking the exact position. As previously mentioned, our selected text starts at "terial..", and the character before it, "a", is the 14th character in the string, so that makes sense.
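
You can poke at these values yourself in the developer console (a tiny illustrative snippet):

const sel = window.getSelection();
console.log(sel.anchorNode);   // the text node where the selection starts
console.log(sel.anchorOffset); // e.g. 14 characters into that text node
console.log(sel.focusNode, sel.focusOffset); // where the selection ends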

anchor_offset

focus

selection_focus

As for baseNode and extendNode, they appear to be Chrome-only properties and are just aliases for anchor and focus [source].

HTML

Here I took a screenshot of the relevant HTML snippet. The anchor node and focus node are both highlighted in yellow. Now it is even easier to put the previous concepts into context: why the anchor node is a text node, how its previous and next siblings are both hyperlinks, how its parent is a span, etc.

selection_html

Dynamic

One interesting observation when playing with the Selection is that the Selection object is dynamic: it always refers to the user's current selection.

For example, in the following exercise I create a variable called s by calling window.getSelection(). All of its attributes are either null or 0, the default values, because I have not selected any content. Then I start clicking around on the web page, and the Selection object gets updated to the position where I clicked.

As mentioned previously, a caret movement is also a type of selection, and a click moves the caret to a certain position; you can think of it as a selection where the anchor is the same as the focus.

Of course, when I select an area, all the attributes of the Selection object get updated accordingly.

selection_dynamic

So do all roads lead to Rome? I created two variables, s and s1, and by using the equality and strict-equality operators, it is clear they reference the same object, since both conditions are true.

“If both operands are objects, return¬†true only if they refer to the same object.”

Mozilla

selection_reference
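
In code (illustrative):

const s = window.getSelection();
const s1 = window.getSelection();
console.log(s == s1);  // true
console.log(s === s1); // true: both variables reference the same live Selection object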

Selection Constructor

At first glance, it looks like a Selection only needs an anchor, a focus, and the corresponding offsets if they are text nodes. To confirm that our understanding is right, let's see if we can construct a Selection from code, rather than by selecting with the mouse.

In the following screenshot, I select a position and store the anchorNode and its offset as anchor and anchor_offset. These two variables preserve the start position so I can use it later. Then I did the same thing to store the ending position into focus and focus_offset. In the end, I called setBaseAndExtent to assign the selection start and end to those two positions and, magically, all the content between them got highlighted. It works!

selection_setbaseandextend
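
Roughly, the exercise looks like this in code (variable names follow the screenshot; when to capture each position is up to you):

const s = window.getSelection();

// 1. Click where the selection should start, then capture the position:
const anchor = s.anchorNode;
const anchor_offset = s.anchorOffset;

// 2. Click where the selection should end, then capture that position too:
const focus = s.focusNode;
const focus_offset = s.focusOffset;

// 3. Recreate the selection programmatically from the stored positions:
s.setBaseAndExtent(anchor, anchor_offset, focus, focus_offset);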

Through this exercise, there are clearly two kinds of variables worth mentioning. Variables like s change as the user's selection changes: instead of a value, s preserves a reference, a link to an object that constantly changes. Variables like anchor, anchor_offset, focus and focus_offset, on the other hand, take a copy of the start and end positions at the moment they are created, and regardless of how s changes afterwards, those variables stay intact.

The explanation of that question goes beyond the scope of this article but here are two blog posts that helped explain assign and copy.

Now that we have a good idea of how Selection works, and are even able to recreate a Selection programmatically, the next question is: can we remember selections and store them somewhere for future use? Almost like using selection as a highlighter?

Multiple Selection

Multiple selection appears in several scenarios: you can select multiple icons on your desktop by holding the Ctrl key and then operate on them in a batch fashion, and you can do something similar in Microsoft Excel. In a browser, can you remember ever selecting multiple paragraphs at once? If your answer is no, you are not alone.

A selection object represents the Ranges that the user has selected. Typically, it holds only one range, accessed as follows:

var selObj = window.getSelection();
var range  = selObj.getRangeAt(0);
  • selObj is a Selection object
  • range is a Range object

As the Selection API specification notes, the Selection API was initially created by Netscape and allowed multiple ranges (for instance, to allow the user to select a column from a <table>). However, browsers other than Gecko did not implement multiple ranges, and the specification also requires the selection to always have a single range.

Firefox is a Gecko-based browser, and by holding the Ctrl key you are actually able to select multiple ranges and access them through the getRangeAt(index) method.

selection_multirange

Given that, we probably need to be careful about how we frame our discovery. In this case, I still prefer to say that there can only be one Selection at a time, updated automatically as the user's selection changes; however, one Selection can include multiple Ranges.

Range

Now we understand that a Selection is essentially a set of Ranges; it can include none, one, or many.

The Range interface represents a fragment of a document that can contain nodes and parts of text nodes.

A range can be created by using the Document.createRange() method. Range objects can also be retrieved by using the getRangeAt() method of the Selection object or the caretRangeFromPoint() method of the Document object.

There is also the Range() constructor available. <- highly recommended example

By reading through the documentation of Range, we see that Range has more methods than Selection: not only start and end (instead of anchor and focus), but also methods related to cloning, inserting, bounding rectangles and many others. This suggests that we could store the Range instead: the Selection gets auto-updated all the time, but the Range it contains can easily be copied.
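
For example (an illustrative snippet, assuming something is currently selected):

const sel = window.getSelection();
if (sel.rangeCount > 0) {
  const range = sel.getRangeAt(0);
  const snapshot = range.cloneRange(); // a copy that will not follow the live selection
  console.log(range.startContainer, range.startOffset, range.toString());
}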

However, we want to copy the state of the Range not only to preserve it across the lifecycle of the active tab; we want to put it into a database on disk, send it to an API, and store it for eternity.

range_serialize

Here, not only do we have a difficult time preserving the Range object, even the startContainer cannot be serialized. This is new and interesting to me. In my opinion, there should be a way to just save the object as is, all the attributes, everything, so that next time I don't even have to worry about the construction process and can use it as-is (like Python Pickle). Is it efficient? Of course not. Is it easy? To a certain degree.
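
A quick illustration of the problem, plus one possible workaround (a sketch and an assumption on my part, not a standard API), assuming something is currently selected:

const range = window.getSelection().getRangeAt(0);
console.log(JSON.stringify(range));                // "{}" : nothing useful survives
console.log(JSON.stringify(range.startContainer)); // "{}" : a DOM node cannot be serialized either

// One possible workaround: describe the containers indirectly, e.g. with the
// getXPathForElement snippet from the XPath post above, plus the character offsets.
const stored = {
  startPath: getXPathForElement(range.startContainer.parentElement),
  startOffset: range.startOffset,
  endPath: getXPathForElement(range.endContainer.parentElement),
  endOffset: range.endOffset,
};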

The second approach is to keep going down the dependency tree.

Can we easily store Selection? No, let’s go Range.
Can we easily store Range? No, let’s go startContainer.
Can we easily store startContainer? Maybe.

By reading through the documentation, we know that a container is a Node itself. Once you can store the node, you can even bypass Range and build the Selection directly. It is such a fundamental concept that it might even be worth going through the documentation of Node and researching what unique identifier for a node we could use to recreate it from stored data.

Node

Node is a very important concept in JavaScript. All browser operations center around locating and manipulating certain objects within the page, which translates into locating certain nodes and doing something with them. Reading through the documentation, the Node interface is abstracted to a level where most regular end users won't recognize that it has anything to do with browsing, because it is closely coupled to the underlying data structure and tree navigation.

node

Most of its methods are related to its relative position within the tree, the parent, the children, the sibling, etc.

The good news is that there are many ways to locate a Node within the DOM tree: by its path from the root, by its tag name, by its relative position, by its class name, by its unique id, by its text, etc. I don't believe there exists a universally unique identifier for a node, because the DOM tree itself can get rendered differently in different situations.

There are many ways of constructing an XPath from an Element, and also many ways of locating an Element from a given XPath. The content related to XPath is fascinating, but the amount of information required to cover it easily exceeds the scope of this blog post. We will have a separate article on XPath, but for now let's assume that we are in good shape.

Conclusion

The XPath here was built using the Chrome developer console, as there is a "Copy XPath" option for any inspected element.

Hopefully this post has helped some of you better understand how the Selection and Range objects work.

xpath_getselection

React.js memo – memoization

reactjs

This morning, while learning about dynamic programming, I came across the fact that React has a memoization function, memo. Like any other memoization implementation, it serves the purpose of recycling calculated results.

"If your function component renders the same result given the same props, you can wrap it in a call to React.memo for a performance boost in some cases by memoizing the result." – quoted from the React documentation

This post is to document some of the things that I learned while discussing with my wife, who is much more experienced with react.js.

Our discussion started with this great tutorial by Sam Pakvis. Sam provides some sample code to demonstrate the usage of the memo function, but it is a bit succinct, maybe too succinct, so I will add more details in this post.

Environment – codepen.io

When you learn a new programming language, the most difficult part for starters tends not to be the language itself; it is usually the environment, which can take you hours before you even see the first output of your "hello world".

codepen.io is a great online sandbox / coding editor in which you can test small snippets of code and see the output interactively.

codepen

Component/Props

I am primarily a Python developer, so some of the basics of React or even JavaScript are definitely worth sharing here. A React component is very much like a function, while the input variables or arguments are called props (short for properties).

Here are a few ways how you can define them:

function Welcome(props) { return <p>Hello, {props.name}</p>; }
const Welcome = (props) => { return <p>Hello, {props.name}</p>; };
class Welcome extends React.Component { render() { return <p>Hello, {this.props.name}</p>; } }

Lifecycle

When a component is defined, you naturally need to specify when to do what. As in many programming environments (C# game programming for Unity, Processing for Arduino), there is an initialization followed by iterations of update or refresh, sometimes infinite. In React, they use the term lifecycle to refer to several built-in methods, like componentDidMount, which is called when the component first gets mounted, and componentDidUpdate, which is called whenever the state changes, among many others.

componentDidMount, componentDidUpdate

setInterval(() => {}, 1000) will execute a function (1st argument) every 1000ms (2nd argument). In the tutorial, they set up a name randomization using setInterval within the componentDidMount method.
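
A rough sketch of that pattern (the component, state and name values here are illustrative, not necessarily the tutorial's exact code; View is the memoized component discussed below):

class App extends React.Component {
  state = { name: "BruceSun" };

  componentDidMount() {
    // Randomize the name every second, as in the tutorial.
    this.interval = setInterval(() => {
      const names = ["BruceSun", "PeterSun", "KathySun"]; // third name is made up
      this.setState({ name: names[Math.floor(Math.random() * names.length)] });
    }, 1000);
  }

  componentDidUpdate() {
    console.log("updated at " + Date());
  }

  componentWillUnmount() {
    clearInterval(this.interval);
  }

  render() {
    console.log("rendered at " + Date());
    return <View name={this.state.name} />;
  }
}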

In the end, there is a built-in method called render which is responsible for rendering the page by returning elements.

render() {return } – JSX

There are some very interesting observations I made by adding console.log to componentDidUpdate and render. You can simply plug in console.log('msg' + Date()) and it will print your message along with the timestamp. One quickly realizes that both of these methods get called every second, and the interesting part is that this happens with or without memoization, and with or without the state actually changing. "Change" is a vague term, just as "same" is. In this tutorial, names are selected at random and there is a 1/3 chance of selecting the same name; in that case the value stays the same, but React still considers that the component did update and that it should re-render.

After some research, I realized that the render function might get executed many times (literally every second in the tutorial), but that doesn't mean the web page gets repainted every second. The page stays mostly the same, and only the specific part gets repainted, and only when a different name is selected.

React.memo

React.memo(({prop}) => { return ... }). In this case, whenever a prop is passed in and the result is calculated, the prop and the corresponding result get cached for later use. In the tutorial, every second the prop is a randomized name. If in the next cycle the same name gets selected, since View is now a memoized component, React realizes that it has seen this name before and, instead of re-running the whole component, reuses the cached result.
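
In code, the memoized View might look roughly like this (a sketch following the tutorial's naming):

const View = React.memo(({ name }) => {
  // Only runs when `name` differs from the previous render's prop.
  console.log(name + " rendered at " + Date());
  return <div>{name}</div>;
});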

memo_view, memo_console

As you can tell from the console log, a cycle was "skipped" at 17:18:54 because BruceSun was picked again at that time: when it got passed to View, React saw that the previous name was also BruceSun, so it decided not to execute View again, and hence there is no line printed for 17:18:54. What is interesting is that a cache can usually blow up if the input parameters are very diverse and the cache becomes big. By observing the output, one can clearly see that React is only comparing against the previous props; otherwise there would only be three lines, one per name, each indicating the first time that name was selected. For example, PeterSun got calculated at both 17:18:55 and 17:18:59. So memo in React.js only caches the most recent result, which is already super helpful for fairly static props.

VirtualDOM (VDOM)

The render method gets called every cycle; does that mean our page changes every second? The answer could be yes, but is probably no. Why? This part of React is really smart. Even if render is called, the resulting output could be identical, could differ only in certain elements, or could be a complete rebuild of the page; React efficiently identifies the difference between the two versions of the DOM and applies changes to the real DOM only where necessary.

It is done via something called VirtualDOM.

I watched a YouTube video that goes a bit in depth into how each version of the DOM between intervals is represented as a tree, how the differences are detected, and how the real DOM gets updated.

Chrome Developer Tool Performance

When I first saw the render function being called every second, it strongly suggested to me that the whole page was being rendered or repainted every second too, even if nothing on the page changed, just like a cartoon with multiple frames that are exact copies.

I later changed my mind after recording a session in Developer Tools > Performance > Record and seeing what actually happens. There, the paint event, which literally modifies what is drawn on the page, is only triggered when a name actually changes.

console_performance_recording, console_performance_eventlog, console_performance_recording_replay

It looks like a very powerful tool for front-end developers, and it convinced me that the page only gets repainted when there is a change.

Conclusion

After studying the tutorial for two hours, I have to acknowledge that the React library is a very powerful tool for building dynamic websites. Due to the nature of the web browser, you do have to invest time in learning the basics of React to fully grasp which code serves what purpose, because things work the way they do for reasons you might not be aware of yet, but which are there for good reason.

Again, the best way to learn a new library is to invest time in covering the basics, in whatever way the most experienced React developers recommend.

So RTFM.

C4.5 Study Notes – Criterion

Screen Shot 2020-04-30 at 9.10.32 PM

The book C4.5: Programs for Machine Learning by J. Ross Quinlan

Those who are into A.I. know of a very popular machine learning technique called the decision tree. It is so well known that people without any machine learning background have probably used it in scenarios like "if A happens, we should do B, but if not, we should do C; however, even if A happens, we might …". The easiest way to visualize it is to start at the beginning, called the root, and break it down into different scenarios based on conditions.

Even in finance, traders use trees to understand how an interest rate might move relative to the current spot price; rather than being purely condition-based, they assume the conditions stay the same but assign different probabilities of going up and down, and might use option exercise conditions to prune the tree. In essence, a tree is a great way of visualizing decision making, and using data to draw an optimal tree that best represents the underlying probability distribution, or "predicts mostly right", is the specialized field of constructing decision trees from data.

Arbre_Binomial_Options_Reelles

Wikipedia: Binomial_options_pricing_model – Interest Rate Tree

I have used this technique for years, and even today many of my favorite models still use decision trees as the building block; it is just a matter of ensembling them in different ways, like random forests or gradient boosting machines. Clearly, decision trees are very important, but very few people know their real history. The decision trees everyone is using actually appeared long before modern personal computing.

[picture]

Ross Quinlan – Australian Computer Scientist and Researcher

C4.5 and ID3 are two well-known implementations from Ross Quinlan, who pioneered the development of tree-based induction algorithms back in the old days. This book was first published in 1992, and its predecessor ID3 was developed in the 1980s. The famous CART (Classification and Regression Trees) by Breiman, Friedman, Olshen and Stone was also developed around that time. For reference, DOS (the Disk Operating System) was first introduced around the early 1980s, just so you get an idea. In his book, Quinlan traces the history back even further, saying he was already studying the CLS-related INDUCE systems in the 1970s, and he refers to work in a book called Experiments in Induction by Hunt, which I could barely find any trace of on the Internet. Anyhow, if anyone gets too excited bragging about how awesome they are because they are using A.I. to change the world, thinking it is all new, well, their parents were probably not born yet when similar techniques were being applied in real life. I was so humbled when I read through C4.5: it is so cleanly written (maybe people cared less about fancy words back then?) and things just make sense on their own, unlike much of today's work, where every word seems hyperlinked to another paper.

The book is literally only about 100 pages of content, and the rest is all ancient C code. Within these 100 pages, Chapter 2 is what interested me the most, discussing the criteria for building tests – in modern ML lingo, splitting the tree.

I learned ML from the practical perspective, where cross entropy is just another cost function people commonly use, one that works best for classification problems: it operates on predicted probabilities and, vaguely, it goes down as predictions get better and up as they get worse. I never truly understood the meaning behind cross entropy until I read what Quinlan said.

Here is directly quoted from the book

“…Even though his programs used simple criteria of this kind, Hunt suggested that an approach based on information theory might have advantages [Hunt et al., 1966, p.95] When I was building the forerunner of ID3, I had forgotten this suggestion until the possibility of using information-based methods was raised independently by Peter Gacs. The original ID3 used a criterion called gain, defined below. The information theory that underpins this criterion can be given in one statement: The information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm to base 2 of that probability….” – Quinlan 1992, p.21

Screen Shot 2020-04-30 at 9.54.12 PM
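
For reference, the gain criterion as I recall it from the book (in Quinlan's notation) is roughly:

$$\text{info}(T) = -\sum_{j=1}^{k} \frac{\text{freq}(C_j, T)}{|T|} \times \log_2\!\left(\frac{\text{freq}(C_j, T)}{|T|}\right)$$

$$\text{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \text{info}(T_i), \qquad \text{gain}(X) = \text{info}(T) - \text{info}_X(T)$$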

So basically, Quinlan emphasized that the reason he and many others picked entropy, or cross entropy, as the criterion was based on information theory. OK, so who came up with that, and what does it actually mean?

Now we need to jump back from the 1990s even further, to when the modern measurement of information was introduced by Claude Shannon.

Celebrating Claude Shannon - IEEE Spectrum        Bell Labs: Claude Shannon, Father of Information Theory, Dies at 84

This guy published a paper in 1948, yes, even a year before the People's Republic of China was founded. That paper, A Mathematical Theory of Communication, served as a milestone for the digital age. Now that you know how it revolutionized the A.I. field? Well, that is not even half of it. Have you heard of Huffman coding, the algorithm behind gzip, 7zip and others, the bottom-up greedy tree-construction coding algorithm you scratched your head over in college? David Huffman came up with his coding algorithm while he was a student of Fano, of the famous Shannon-Fano coding, yes, the same Shannon. Imagine if Shannon were still alive; do you know how many GitHub stars he would have, or how many Stack Overflow points he would get? …

Shannon's paper: entropy.pdf

Screen Shot 2020-04-30 at 10.12.24 PM

1948, seriously? Well, let's not give him too much credit here, since we are talking about Quinlan's book. So basically, they used this information-theoretic measurement as the bedrock for the machine learning cost function.

Later on, Quinlan also introduced a modified version of the criterion, the gain ratio, which is pretty much a relative gain rather than the absolute gain.

Screen Shot 2020-05-02 at 9.03.09 AM
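
Again roughly from memory:

$$\text{split info}(X) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \log_2\!\left(\frac{|T_i|}{|T|}\right), \qquad \text{gain ratio}(X) = \frac{\text{gain}(X)}{\text{split info}(X)}$$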

The first interesting part to me was that the gain ratio is not a typical percentage-change ratio; otherwise it would be gain ratio = (gain − info) / info. Moreover, there is a real difference between split info and info.

Split info merely looks at how many records get split into each bucket; it doesn't care how "pure" each bucket is, nor about the class distribution within each bucket. For info_X(T), on the other hand, test X determines which bucket each training record goes into, and info_X(T) is calculated bucket by bucket as the class entropy within each bucket, which has nothing to do with the test itself, purely with how the classes are distributed within that bucket. Put another way, split info has nothing to do with accuracy and doesn't even use the label data, but info does.

Intuitively speaking, info(T) won't change, but as we apply different tests, T gets sliced and diced into different buckets (two buckets for a binary split), and then info_X(T) will change.

Assume we find a group of tests that all split the training data in the same proportions, say always 50/50; then the |T_i|/|T| part will always be the same, but depending on which classes go into which bucket, the info(T_i) part will differ. Suppose we have a perfect split, where all cases within the same bucket have the same class. Then info(T_i) will be 0, as each bucket is so pure that the probability inside the logarithm is always 1 and there is no uncertainty. However, if a bucket is not pure, the log terms are negative, the whole expression becomes positive, and it is maximized when the classes are completely random.

Screen Shot 2020-05-02 at 9.23.26 AM

That sort of explains why, if we are trying to find the best split with the highest gain, we need to find the test X with the lowest info_X(T), i.e., the highest "purity". However, a guaranteed way of getting the purest division is to have one branch per record, which defies the philosophy of generalizable pattern recognition. That is why Quinlan introduced the gain ratio, which discourages this from happening. How? If we overfit, there will be lots of buckets and each |T_i|/|T| becomes small, really small if the split attribute has many distinct values, like a primary key of unique values in the training data. So with n equally sized buckets: split info = −n × (1/n) × log2(1/n) = log2(n), which grows as n increases, so the denominator of the gain ratio increases, which achieves the goal of discouraging overfitting.
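
To make that concrete, here is a toy sketch (my own, not Quinlan's code) that computes gain and gain ratio for a test described as the buckets of class labels it produces:

const log2 = (x) => Math.log(x) / Math.log(2);

// Entropy of a list of class labels, e.g. ["yes", "yes", "no"].
const entropy = (labels) => {
  const counts = {};
  labels.forEach((l) => (counts[l] = (counts[l] || 0) + 1));
  return -Object.values(counts)
    .map((c) => c / labels.length)
    .reduce((sum, p) => sum + p * log2(p), 0);
};

// gain and gain ratio for a test that splits the data into the given buckets.
const evaluateSplit = (buckets) => {
  const all = buckets.flat();
  const infoT = entropy(all);
  const infoXT = buckets.reduce((s, b) => s + (b.length / all.length) * entropy(b), 0);
  const splitInfo = -buckets.reduce(
    (s, b) => s + (b.length / all.length) * log2(b.length / all.length), 0);
  const gain = infoT - infoXT;
  return { gain, gainRatio: gain / splitInfo };
};

// A clean 50/50 split: full gain, gain ratio 1.
console.log(evaluateSplit([["yes", "yes"], ["no", "no"]]));
// A "primary key" style split (one record per branch): same gain, but a penalized gain ratio.
console.log(evaluateSplit([["yes"], ["yes"], ["no"], ["no"]]));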

Well, that makes sense and doesn't make sense at the same time. The goal is clear, but the number of approaches to achieve it is endless. I do have the question of whether there is any reason one criterion is better than another, and if so, whether there exists a criterion that works best without overfitting. Information theory is a theory in essence, and people applying it might not apply it perfectly. For example, another commonly used criterion is Gini impurity, defined as sum(p(1-p)); it has a surprising resemblance to information gain, but instead of -log(p) it uses (1-p). Clearly, anything that provides a counteracting force has its legitimacy, or maybe even other functions like p·sin(p), or p·e^(-p), or anything else.

Also, common decision trees today are usually binary trees that test a yes/no criterion, but is that necessarily the most accurate or even the most efficient way of constructing a tree? Certainly any other decision tree can be represented as a binary tree, but is that the best representation, just as people use B-trees rather than binary search trees in other settings?

Most importantly, Quinlan himself mentioned in his book that building the smallest decision tree consistent with a training set is NP-complete [Hyafil and Rivest, 1976]. There could be four million different trees for the small example that has only a handful of records. How can you be sure that there isn't a better way of doing this? …

Maybe there is an engineering approach to simulate and search for the best, most generic approach, if there is one 🙂