React.js memo – memoization


This morning, while learning about dynamic programming, I discovered that React has a memoization function, memo. Like any other memoization implementation, it serves the purpose of reusing previously calculated results.

“If your function component renders the same result given the same props, you can wrap it in a call to React.memo for a performance boost in some cases by memoizing the result.” – quoted from the React documentation

This post documents some of the things that I learned while discussing this with my wife, who is much more experienced with React.js.

Our discussion started with this great tutorial by Sam Pakvis. Sam provided some sample code to demonstrate the usage of the memo function, but it is a bit succinct, maybe too succinct, so I will add more details in this post.

Environment –

When you learn a new programming language, the most difficult part for starters tends not to be the language itself. It is usually the environment, which can take you hours before you even see the first output of your “hello world”. An online sandbox, or online code editor, is a great place to test out small snippets of code and see the output interactively.



I am primarily a Python developer, so some of the basics of React, or even JavaScript, are definitely worth sharing here. A React component is very much like a function, and its input variables or arguments are called props (short for properties).

Here are a few ways to define one:

function Welcome(props) { return result; }
const Welcome = (props) => { return result; };
class Welcome extends React.Component { render() { return result; } }  // props are available as this.props


Once a component is defined, you naturally need to specify what to do at what time. As in many programming environments, like game programming in C# for Unity or Processing-style sketches for Arduino, there is an initialization step followed by iterations of update or refresh, sometimes infinite. In React, the term lifecycle refers to several built-in methods, such as componentDidMount, which is called when the component is first mounted, componentDidUpdate, which is called after the component updates (for example after a state change), and many others.

componentDidMount, componentDidUpdate

setInterval(() => {}, 1000) will execute a function (1st argument) every 1000ms (2nd argument). In the tutorial, the name randomization is set up using setInterval within the componentDidMount method.

Finally, there is a built-in method called render, which is responsible for describing the page by returning elements.

render() {return } – JSX

I made some very interesting observations by adding console.log to componentDidUpdate and render. You can simply plug in console.log('msg' + Date()) and it will print out your message and the timestamp. One quickly realizes that both methods get called every second, and the interesting part is that this happens with or without memoization, and whether or not the state value actually changed. “Change” is as vague a term as “same”. In this tutorial, a name is selected at random from three, so you have a 1/3 chance of selecting the same name again. In that case the value stays the same, but React still calls componentDidUpdate and runs render again.

After some research, I realized that the render function may be executed many times (literally every second in the tutorial), but that does not mean the web page is “re-rendered” every second. The page stays mostly the same, and only the specific part is repainted, and only when a different name is selected.


React.memo(({prop}) => {return xxx}). In this case, whenever a prop is passed in and the result is calculated, the prop and the corresponding result are cached for later use. In the tutorial, the prop is a randomized name every second. If the same name is selected in the next cycle, View, which is now a memoized component, will realize that it has just seen this name and, instead of executing the whole function again, will return the result from the cache.


As you can tell from the console log, a cycle was “skipped” at 17:18:54: BruceSun got picked again at that time, and when it was passed to View, View realized that the previous name was also BruceSun, so it did not execute again. Hence nothing was printed for 17:18:54. What is interesting is that a cache can usually grow out of control if the input parameters are very diverse. By observing the output, one can clearly see that memo only compares against the previous props. Otherwise there would be only three lines in total, one for the first time each name got selected. For example, PeterSun got calculated at both 17:18:55 and 17:18:59. So memo in React.js only caches one record. This is already super helpful for fairly static records.
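To make the “only one record” observation concrete, here is a minimal sketch in Python (my primary language) of a wrapper that remembers only the most recent argument and result. It is an analogy for the behavior observed in the console log, not React's actual implementation, and memo_last, view and the list of names are all hypothetical.

```python
def memo_last(render):
    """Wrap `render` so it is skipped when called with the same argument as last time."""
    sentinel = object()          # marks "nothing cached yet"
    last_arg = sentinel
    last_result = None

    def wrapped(name):
        nonlocal last_arg, last_result
        if last_arg is not sentinel and name == last_arg:
            return last_result   # same input as the previous call: reuse the result
        last_arg, last_result = name, render(name)
        return last_result

    return wrapped


calls = []                       # stands in for the console.log inside View


def view(name):
    calls.append(name)
    return f"<h1>{name}</h1>"


memo_view = memo_last(view)
for name in ["PeterSun", "BruceSun", "BruceSun", "PeterSun"]:
    memo_view(name)

print(calls)  # ['PeterSun', 'BruceSun', 'PeterSun']
```

The repeated BruceSun is skipped, but PeterSun is computed twice because only the previous input is remembered. A cache keyed on every input ever seen would compute each name once, at the cost of growing with the number of distinct inputs.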

VirtualDOM (VDOM)

The render method is called every cycle, so does that mean our page changes every second? The answer could be yes, but is probably no. Why? This part of React is really smart. Even when render is called, the resulting value could be the same as before, the update could apply only to certain elements, or the update could be a complete rebuild of the page. React efficiently identifies the difference between the two versions of the DOM and applies changes only where necessary.

It is done via something called VirtualDOM.

I watched a YouTube video that went a bit more in depth on how each version of the DOM between intervals is represented as a tree, how the differences are detected, and how the real DOM gets updated.

Chrome Developer Tools Performance

When I first saw the render function being called every second, I strongly suspected that the whole page was being rendered or repainted every second too, even if nothing on the page had changed, just like a cartoon with multiple frames that are exact copies of each other: each frame is still drawn.

I later changed my mind after recording a session in Developer Tools / Performance / Record and seeing what was actually happening. In the recording there is an event called repaint, which appears to be what actually modifies the page, and it was triggered only when the name changed.


The performance tab looks like a very powerful tool for front-end developers, and it convinced me that the page only gets repainted when there is a change.


After studying the tutorial for two hours, I have to acknowledge that React is a very powerful library for building dynamic websites. Due to the nature of the web browser, you do have to invest the time to learn the basics of React in order to fully grasp which code serves what purpose. That is just how it is designed to work, for reasons you might not be aware of yet but that are surely good ones.

Again, the best way to learn a new library is to invest time in covering the basics and to follow whatever approach the most experienced React developers recommend.


C4.5 Study Notes – Criterion

C4.5: Programs for Machine Learning, by J. Ross Quinlan

Those who are into A.I. know about a very popular machine learning technique called the decision tree. It is so well known that people without any machine learning background have probably used it in scenarios like “if A happens, we should do B, but if not, we should do C. However, even if A happens, we might …”. The easiest way to visualize it is to start at the beginning, called the root, and break it down into different scenarios based on conditions.

Even in finance, traders want to understand how the interest rate might move from the current spot price. Rather than branching purely on conditions, they assume the condition stays the same but assign different probabilities to going up and down, and might use option exercise conditions to prune the tree. In essence, a tree is a great way of visualizing decision making, and using data to draw an optimal tree that best represents the underlying probability distribution, or “predicts mostly right”, is the specialized field of constructing decision trees from data.


Wikipedia: Binomial_options_pricing_model – Interest Rate Tree

I have used this technique for years, and even today many of my favorite models still use decision trees as the building unit. It is just a matter of ensembling them in different ways, like random forests or gradient boosting machines. Clearly, decision trees are very important, but very few people know their real history. The decision trees that everyone is using actually predate the personal computer.


Ross Quinlan – Australian Computer Scientist and Researcher

C4.5 and ID3 are two common implementations from Ross Quinlan, who pioneered the development of tree-based induction algorithms back in the old days. This book was first published in 1992, and its predecessor ID3 was developed in the 1980s. The famous CART (classification and regression trees) by Breiman, Friedman, Olshen and Stone was also developed around that time. Just so you get an idea, DOS (Disk Operating System) was first introduced around then, in the 1980s. In his book, Quinlan traces the history back even further, saying that he was already studying this CLS-related INDUCE system in the 1970s, and he even refers to work in a book called Experiments in Induction by Hunt, of which I could barely find any trace on the Internet. Anyhow, if anyone gets too excited and starts bragging about how awesome they are because they are using A.I. to change the world, thinking it is all new, well, their parents were probably not born yet when similar techniques were applied in real life. I felt so humbled reading through this book. It is so cleanly written (maybe people cared less about using fancy words back then?) and things just make sense on their own, unlike today’s work, where it feels like every word is hyperlinked to another book.

The book has only about 100 pages of content; the rest is all ancient C code. In these 100 pages, Chapter 2 interested me the most. It discusses the criterion for building tests or, in modern ML lingo, for splitting the tree.

I came to ML from the practical side, where cross entropy is just another cost function that people commonly use and that works best for classification problems. I had only a vague feeling that it is non-negative and that it goes down when the number of wrong predictions goes down, and vice versa. I never truly understood the meaning behind cross entropy until I read what Quinlan said.

Here is a direct quote from the book:

“…Even though his programs used simple criteria of this kind, Hunt suggested that an approach based on information theory might have advantages [Hunt et al., 1966, p.95] When I was building the forerunner of ID3, I had forgotten this suggestion until the possibility of using information-based methods was raised independently by Peter Gacs. The original ID3 used a criterion called gain, defined below. The information theory that underpins this criterion can be given in one statement: The information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm to base 2 of that probability….” – Quinlan 1992, p.21
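The one-statement summary in the quote (information in bits is minus the base-2 logarithm of the probability) is easy to try out. Here is a minimal Python sketch; the function name information_bits and the example probabilities are my own choices, not from the book.

```python
import math


def information_bits(p):
    """Information conveyed by a message of probability p, in bits: -log2(p)."""
    return -math.log2(p)


print(information_bits(0.5))    # 1.0  (a fair coin flip carries one bit)
print(information_bits(1 / 8))  # 3.0  (one of eight equally likely messages)
```

The rarer the message, the more bits it carries, and a message that is certain (p = 1) carries none.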


So basically, Quinlan emphasized that the reason he and many others picked entropy, or cross entropy, as the KPI / criterion came from information theory. OK, then who came up with that? And what does it actually mean?

Now we need to jump from the 1990s even further back, to when the modern measure of information was introduced by Claude Shannon.


This guy published a paper in 1948, YES, a year before the People's Republic of China was founded. This paper, A Mathematical Theory of Communication, serves as a milestone of the digital age. Now you know how it revolutionized the A.I. field? Well, that is not even half of it. Have you heard of Huffman coding, the algorithm behind your gzip, 7zip, and others, the bottom-up greedy tree-construction coding algorithm that you scratched your head over in college? Well, David Huffman came up with his coding algorithm when he was a student of Fano, and before that came the famous Shannon-Fano coding. Yes, the same Shannon. Imagine if Shannon were still alive: do you know how many GitHub stars he would have, or how many Stack Overflow points he would get? …


1948, seriously? Well, let’s not give him too much credit here, as we are talking about Quinlan’s book. So basically, they used this information-theory-based measurement as the bedrock for the machine learning cost function.

Later on, Quinlan also introduced a modified version of the criterion, the gain ratio, which is pretty much a relative gain rather than the absolute gain.


The interesting part to me at first was that gain ratio is not a typical change ratio. Otherwise it would be something like gain ratio = (info(T) - info_X(T)) / info(T). Instead, the denominator is split info, and there is a real difference between split info and info.

Split info merely looks at how many records get split into each bucket. It does not care how “pure” each bucket is, nor about the class distribution within each bucket. For info_X(T), on the other hand, test X determines which training records go into which bucket, and the buckets will likely be mixed. info_X(T) is then calculated as the weighted sum over the buckets of the info within each bucket, and the info within a bucket has nothing to do with the test itself: it depends purely on how the classes are distributed within that bucket. Put another way, split info has nothing to do with accuracy and does not even use the label data, but info does.

Intuitively speaking, info(T) does not change, but as we apply different tests, T gets sliced and diced into different buckets (two buckets for a binary split), and so info_X(T) changes.

Assume that we find a group of tests that all split the training data in the same proportion, say always 50/50. Then the |T_i| / |T| part will always be the same, but depending on which class goes to which bucket, the info(T_i) part will differ. Now assume that we have a good split, where all cases within the same bucket have the same class. Then info(T_i) will be 0: the bucket is so pure that the probability inside the logarithm is always 1, and there is no uncertainty. However, if a bucket is not pure, the probabilities are below 1, the logarithms are negative, and with the minus sign in front every term becomes positive. The info is maximized when the classes are completely random.
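The 50/50 thought experiment can be checked numerically. Below is a minimal Python sketch, assuming the usual definitions info(T) = -sum(p * log2(p)) over class frequencies p, info_X(T) = sum(|T_i| / |T| * info(T_i)) over buckets, and gain = info(T) - info_X(T). The eight-record toy dataset and both splits are made up for illustration.

```python
import math
from collections import Counter


def info(labels):
    """Entropy of the class labels in one set, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_x(buckets):
    """Weighted average of the info inside each bucket produced by a test X."""
    total = sum(len(b) for b in buckets)
    return sum(len(b) / total * info(b) for b in buckets)


# Hypothetical training set: 4 "yes" and 4 "no" records.
T = ["yes"] * 4 + ["no"] * 4

# Two tests that both split 50/50 by size but differ in purity.
pure_split = [["yes"] * 4, ["no"] * 4]
mixed_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

gain_pure = info(T) - info_x(pure_split)
gain_mixed = info(T) - info_x(mixed_split)

print("info(T)    =", info(T))
print("gain pure  =", gain_pure)
print("gain mixed =", gain_mixed)
```

Both tests have identical |T_i| / |T| weights, so the whole difference in gain comes from the info(T_i) terms: the pure split recovers the full 1 bit, the mixed split recovers nothing.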


That sort of explains why, if we are trying to find the best split with the highest gain, we need to find the split X with the lowest info_X(T), that is, the highest “purity”. However, a surefire way of getting the purest division is to have one branch for each record, but that just memorizes the training data and defies the philosophy of generalizable pattern recognition. That is why Quinlan introduced the gain ratio, which stops that from happening. How? If we tend to overfit, there will be lots of buckets, and |T_i| / |T| will become small, really small if you have lots of buckets, as with training data that has a primary key with unique values. With n equally sized buckets, split info = -n * (1/n) * log2(1/n) = log2(n), which becomes a big number as n increases. The denominator of the gain ratio grows, which achieves the goal of not overfitting.
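Here is a small Python sketch of that penalty, assuming split info(X) = -sum(|T_i| / |T| * log2(|T_i| / |T|)) and gain ratio = gain / split info. The gain of 1 bit and the bucket sizes are made-up numbers chosen so that both splits look equally “pure”.

```python
import math


def split_info(bucket_sizes):
    """Entropy of the bucket sizes only; class labels are never used."""
    total = sum(bucket_sizes)
    return -sum(s / total * math.log2(s / total) for s in bucket_sizes)


def gain_ratio(gain, bucket_sizes):
    return gain / split_info(bucket_sizes)


# Split info of n equal buckets is log2(n).
for n in (2, 8, 1024):
    print(n, split_info([1] * n), math.log2(n))

# Assumed: both tests achieve the same gain of 1 bit on 1024 records.
gain = 1.0
binary_ratio = gain_ratio(gain, [512, 512])         # a clean two-way split
primary_key_ratio = gain_ratio(gain, [1] * 1024)    # one bucket per record

print("binary split     :", binary_ratio)
print("primary-key split:", primary_key_ratio)
```

With identical gain, the one-bucket-per-record split ends up with a gain ratio ten times smaller than the binary split, which is exactly the discouragement described above.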

Well, that makes sense but doesn’t make sense at the same time. The goal is clear, but the number of approaches to achieve it is endless. I do wonder whether there is any reason that one criterion is better than another, and if so, whether there is a criterion that works best without overfitting. Information theory is, in essence, a theory, and people applying it might not have applied it in a perfect way. For example, another commonly used criterion is Gini impurity, which is defined as sum(p(1-p)). It bears a surprising resemblance to info, but instead of -log(p) it uses (1-p). Clearly, anything that pushes in the opposite direction as p grows has some legitimacy, so maybe other functions like p * sin(p), or p * 1/e^p, or … anything else would work too.
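To see the resemblance side by side, here is a minimal Python sketch that evaluates both criteria on a two-class bucket. The helper names and the sample probabilities are my own, and this only compares the two formulas; it says nothing about which one builds better trees.

```python
import math


def entropy_term(p):
    """One term of info: -p * log2(p), taken as 0 when p is 0."""
    return 0.0 if p == 0 else -p * math.log2(p)


def gini_term(p):
    """One term of Gini impurity: p * (1 - p)."""
    return p * (1 - p)


def impurity(probs, term):
    return sum(term(p) for p in probs)


# Two-class bucket with class probabilities (p, 1 - p).
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    dist = (p, 1 - p)
    print(p, round(impurity(dist, entropy_term), 3), round(impurity(dist, gini_term), 3))
```

Both measures are zero for a pure bucket and peak at the 50/50 mix. They differ in scale and in how steeply they rise near the pure ends.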

Also, the common decision trees today are usually binary trees whose tests are answered with yes or no, but is that necessarily the most accurate or even the most efficient way of constructing a tree? Certainly any other decision tree can be represented by a binary tree, but is it the best? Just like people use B-trees rather than binary search trees, etc.

Most importantly, even Quinlan mentions in his book that building the smallest decision tree consistent with a training set is NP-complete [Hyafil and Rivest, 1976]. There could be 4M different trees for the sample example, which has only a handful of records. How can you be sure that there isn’t a better way of doing this? …

Maybe there is an engineering approach to simulate and search for the best, most generic approach, if there is one 🙂









Home Wifi 5GHz vs 2.4GHz

Due to the ongoing COVID-19 pandemic, I have been working from home for almost two months, with limited time even outside the house. Entertainment, work, communication and pretty much all other activities center around the Internet. Previously, I had chosen, if not the cheapest, then close to the cheapest Comcast Internet-only plan, which supports up to 100 Mbps. It was sufficient for the vast majority of my activities, until I realized that my video conferences were lagging and my download speed was limited, so I decided to pay an extra $5 a month to upgrade to 200 Mbps. I definitely saw a huge improvement in my daily activities, and the Internet speed is awesome!

The download speed was definitely beautiful. But I realized that the signal was not ideal in my home office. As in most families, my modem is connected to the TV cable outlet in the living room on the first floor, and so is the router connected to it. My home office is on the second floor, behind stairs, walls, and closed doors. I started to ask myself, “how much of the total bandwidth am I getting now?” The first question is how you measure your Internet speed. A lot of factors go into the “download speed”, like your Internet package, your connection method (Ethernet or wireless) and, if wireless, your physical proximity to the router and even your neighbors’ activities.

Living Room (Router)

A good starting point is to build a baseline. What is the fastest speed that I can get? By connecting to the router directly through Ethernet, this is what we got:

[Screenshot: speed test result over Ethernet]


45 Mbps is already pretty good, but that is not what I paid for. What the H*?

And then I placed my laptop close to the router and tested not only the 2.4GHz but also the 5GHz channel.


5GHz close to router


2.4GHz close to router

The test results say that both Wi-Fi options are faster than Ethernet, and 5GHz reaches 236 Mbps, which is even faster than what I paid for… Well, I am not going to debate whether this speed test is the golden rule for measuring Internet speed, but it is a good starting point. Also, the stats panel in YouTube showed something different but similar in nature.

5GHz streaming 1600×900 HD

Before we run the test in the office, there is another macOS trick that I want to share. If you hold the Option key and then left-click the Wi-Fi icon in the menu bar, it will show more diagnostic information.

[Screenshots: Wi-Fi diagnostic details from the menu bar]

You might now have several questions about what each metric means. For now, let’s wait until we finish our test in the office, and then we will get back to this.

Home Office


5GHz at home office


2.4GHz at home office



RSSI: Received Signal Strength Indicator

Noise: the background noise level picked up by the receiver, in dBm

SNR: signal-to-noise ratio, RSSI - Noise, e.g. -50 dBm - (-90 dBm) = 40 dB

PHY: physical layer

MCS Index: Modulation and Coding Scheme

Full confession: I did get a degree in EE, and my track was RF and microwave, but a few years into my data career I remember pretty much nothing of it. One thing I do still remember is how beautiful Maxwell's equations were. I don’t know if this is a good analogy, but a signal is like a big balloon: the power is fixed as it is transmitted out, so the bigger it gets, the thinner it gets. What happens when it gets “thin”, so “thin” that you cannot accurately receive and recreate the wave on your end as it was sent out? What do you do? It is just like your mom yelling at you to come to dinner, in three steps: raise her voice, make every word longer and, most importantly, keep yelling until you confirm receipt. Well, I guess at a high level it is not that different from what someone with a doctorate does when he or she wants to talk to you via a signal.

It was a bit hard for me to find intuitive material, but here is the paragraph that I found most relevant, from the book “Broadband Access Networks: Technologies and Deployments”.

“5.4 Adaptive Modulation and Coding (AMC)

Adaptive modulation and coding (AMC) technique is a part of an adaptive transmission scheme where transmission parameters, such as modulation, code-rate and power, are adjusted based on the channel state information (CSI). … This scheme has the capability to significantly increase the throughput of the wireless communications system by increasing the average data rate, spectral efficiency, and system capacity. …. In the AMC scheme, when the error rate at the receiver increases due to the interfered and attenuated received signal resulting from the channel, the receiver sends this information back to the transmitter through a feedback path. The transmitter, in turn, automatically shifts to a more robust, though less efficient, AMC technique.”


  1. 5GHz’s signal is weaker than 2.4GHz’s (2.4GHz: -31 → -57 dBm, 5GHz: -43 → -69 dBm). At first I wanted to say that 5GHz drops faster, but looking at the numbers, both channels dropped 26 dB from the living room to the room upstairs.
  2. The 5GHz signal in my office is so low, around -70 dBm, that it is at the edge and very weak. So my point is: if you have a strong signal, use 5GHz.

Curious about some of the mathematics part of it?

Friis equation, here you go.
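As a rough illustration, here is a minimal Python sketch of the free-space path loss that follows from the Friis equation with isotropic (unity-gain) antennas: loss in dB = 20 * log10(4 * pi * d * f / c). The 5 m distance is an arbitrary assumption, and the model ignores walls, floors, antenna gains and transmit power, so it is not a prediction of the measurements in this post.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def free_space_path_loss_db(distance_m, freq_hz):
    """Free-space path loss in dB between two isotropic antennas."""
    return 20 * math.log10(4 * math.pi * distance_m * freq_hz / SPEED_OF_LIGHT)


distance_m = 5.0  # assumed distance to the router, for illustration only
loss_24 = free_space_path_loss_db(distance_m, 2.4e9)
loss_5 = free_space_path_loss_db(distance_m, 5.0e9)
extra_loss_db = loss_5 - loss_24

print(f"2.4 GHz: {loss_24:.1f} dB")
print(f"5 GHz  : {loss_5:.1f} dB")
print(f"extra loss at 5 GHz: {extra_loss_db:.1f} dB")
```

In this model the gap between the two bands depends only on the frequency ratio, 20 * log10(5 / 2.4), and not on the distance. That is at least consistent with both bands dropping by the same 26 dB between the two rooms.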


life is short, use shortcuts

As an analyst, I spend a significant amount of time typing, all kinds of typing: writing code in Jupyter notebooks, navigating servers in bash, browsing the Internet, writing emails and editing text in general, or even, right at this second, typing up this blog post (I guess Jupyter notebooks and blog writing both fall under browser navigation). Most people do not get the opportunity to watch some of the best “typers” up close. But if you constantly pair program with your coworkers, coach junior team members on how to use Linux, or just sit in a meeting watching someone share their screen, you can clearly tell a difference in people’s productivity. It does not necessarily come from IQ or methodology, but merely from the noticeable time saved or wasted through operational efficiency: how fast they can instruct the computer in general, or simply how fast they type.

Blind Typing



To me, blind typing (touch typing) is as necessary a skill in the 21st century as driving became after Ford commercialized the automobile. I have had the opportunity to sit in on several technical interviews with candidates whose experience varied widely by their own claims, and varied just as widely by their own demonstration. I have come across college students who navigate Stack Overflow and Google like they are playing StarCraft 2 at 200 APM, cracking the coding interview like nobody’s business. At the same time, I have been tortured by watching senior staff with self-claimed decades of experience delete lines of text again and again, because they made a typo in an earlier line of code but could not notice it until the very end, since all their attention was on the keyboard, hunting for the right key. People argue that writing or programming is not only about typing, that many top-notch programmers aren’t the fastest typers in the world, and that it is more about the logic and the ideas. I fully agree and bear that philosophy deep in my heart. But in my humble opinion, having great logic and beautifully written code doesn’t conflict with typing fast. Making blind typing second nature, with a minimum of 60 words per minute and decent accuracy (no constant backspacing), will certainly free your mind to focus more on thinking.


LeBron James might be an exception who is “allowed” to type with two index fingers, but only because he doesn’t even need to look at his “monitor” when he does his job, let alone his “keyboard”.


To me, shortcuts are the differentiator, and many people get left behind because they already have a way of doing things. Theoretically, yes, the character keys, backspace and the arrow keys can get you through, if not all, then the vast majority of your tasks. That way is more intuitive and “makes sense”, but it is the black magic that makes your work unique, and it is those mysteries that distinguish you from the rest. Shortcuts are certainly the key. They get you to a place faster: faster so you can afford to make more mistakes, faster to save time, faster to achieve more, and fast enough that the quantitative difference one day becomes a qualitative difference, and the people watching your screen suddenly drop their jaws to the ground.

You can pretty much Google “shortcuts of” plus any software that you use on a day-to-day basis, conquer the inertia of doing things the old way, and you will soon learn them. However, on a software-by-software basis, the marginal benefit is sometimes fairly low, because some shortcuts are not frequently used and will not work in other tools. So it is good to start by analyzing people’s typing behavior and come up with some of the most generic and beneficial shortcuts to get started.

There is no doubt about the usual suspects like C-c (Control + C), C-v and C-z, if you are not already familiar with them. Here I want to share a few shortcuts specifically related to text editing that I wish someone had told me about when I first started my career, or even back when I first got access to a computer at 12 years old. I do have to say that I learned many of these shortcuts when I started using Linux, where a mouse or even a GUI is not always readily available. You need to figure out how to do all the text editing you want using only a keyboard.

Control + E

C-e: go to the end of the current line. This one works EVERYWHERE: in the browser, in code editors, in the terminal and in pretty much every text-editing context that I know of. Can you use the mouse? Sure. Can you hit the right arrow key until you get to the end of the text? Yes. Can you use the down arrow key, which gets you to the end of the line if the line you are editing is indeed the last line? Yes. Can you … I believe there are a dozen ways of getting to the end of the current line. Let me share when I use this shortcut the most.

For example, when I edit Python code, I tend to use Jupyter Notebook. Whenever I type any kind of opening delimiter, like a left parenthesis, a left bracket, or a single/double/triple quote, it auto-completes the matching closing half and places the cursor in the middle. This is a great feature: it saves you the effort of counting brackets, avoids errors from accidentally missing the closing half, and actually saves you half of the typing, because wherever there is an opening, there must be a closing. However, now you are in the middle, and after you finish editing the content, there isn’t necessarily an obvious way to get back outside so you can keep editing. Most people will use the arrow keys or, God forbid, the mouse to navigate back to where they want to edit. This is where C-e comes in: it takes you directly to the end of the line so you can keep editing. If you are comfortable using the Control or Meta (Alt or Option) keys, C-e literally feels like one keystroke. By contrast, moving your right hand completely away, finding the arrow keys, pushing them several times, and then repositioning your index finger on the J key suddenly feels like too much work. There is easily a 2x or 3x+ inefficiency introduced by not using C-e, and imagine how many closing delimiters you have in your code! Every function call has one, every collection (list, dictionary) has some sort of closing delimiter, as do indexing, slicing, etc.

[Screenshot: character counts across sklearn's Python files]

I did a character count on all the Python files in one of the most popular libraries out there, sklearn. As you can tell, about 3% of all your keystrokes can benefit from this little shortcut.
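If you want to run a similar count yourself, here is a minimal Python sketch. It is not the script I used for the number above: which characters to count is an assumption (here only the closing brackets ), ] and }, ignoring quotes), and the root directory you pass in is up to you.

```python
from pathlib import Path

CLOSERS = set(")]}")  # assumed definition of an auto-completed closing half


def closer_fraction(text):
    """Share of characters in `text` that are closing brackets."""
    if not text:
        return 0.0
    return sum(ch in CLOSERS for ch in text) / len(text)


def closer_fraction_in_tree(root):
    """Same share, computed over every .py file below `root`."""
    text = "".join(
        path.read_text(encoding="utf-8", errors="ignore")
        for path in Path(root).rglob("*.py")
    )
    return closer_fraction(text)


sample = "print(len([1, 2, 3]))"
print(closer_fraction(sample))  # 3 closing brackets out of 21 characters
```

To repeat the experiment on a real library, point closer_fraction_in_tree at the directory where the package source lives on your machine.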


People go to the beginning of the line far less frequently than to the end, as you write from beginning to end! But if you do, which I do occasionally, C-a is the opposite of C-e.

Shift Click Selection

Sometimes you need to select a good chunk of text and copy and paste it somewhere. Have you had the experience where your selection spans several pages, and you press the left mouse button and scroll through page after page while holding it down, only to accidentally let go once in a while and have to scroll back up and redo everything? I certainly remember the frustration when it happens, and having to hold my breath. There is a little trick: just click at the start of the text that you want to copy, without holding anything down. It is like placing an invisible marker at the start of your selection. Then you can casually scroll down and use “Shift + Click” to mark the end of the selection, and everything in between will be selected! What is even better, after you mark the start, you can use Page Up, Page Down, Space and other keystrokes to navigate without even rolling that squeaky wheel on the poor mouse.

C-f/b better than Right/Left Arrow

As I mentioned before, moving either of your hands so far away from the F and J keys that you need to reposition should be avoided at all costs. People might then ask: but I need the arrow keys to go forward and backward a few characters, for example when I mistyped several characters, or when I notice a typo in the middle of the line. Don’t worry, there are shortcuts for that too, so you don’t have to use the arrow keys. You can use C-f to move your cursor forward by one character and C-b to move it backward by one character. Holding those keys down, just like with the arrow keys, keeps moving the cursor for as long as you hold them. It took me a while to change this habit, but once I got used to it, the key combinations no longer slowed me down, and reaching for the arrow keys now feels like a 5-hour trip from LA to NY. Reaching for the mouse or trackpad is, figuratively, flying from New York to India in economy class, in the middle seat, with both of your neighbors needing seat belt extensions.

M-f/b faster than C-f/b

M is many contexts refers to the Meta (meta) key or the Alt key for your keyboard. If you are editing one line of texts and want to move way back or forard to make a change, holding down either arrow or C-f/b feels too long. Have you seen this before?


This is where some shortcuts conflict with each other and can be platform dependent. In Linux or Emacs, it is M-f to move forward by a word and M-b to move backward by a word. In the macOS terminal that is still the case, but certainly not in a text editor. There, it looks like Meta + Arrow keys move the cursor by word, just so everyone knows.


Many people know that Tab means autocomplete, but the level of adoption certainly varies. I have seen people treat Tab merely as a suggestion and still prefer to type literally every character out one by one. I have also seen people use the mouse to choose from the autocomplete list, probably trained by years of experience using filters in Excel.



Later on, I realized that many of these shortcuts come from Linux. Here is an Emacs reference card; if not all, the vast majority of the shortcuts related to Motion and Editing are applicable in any text editor. Even the Google search bar 🙂 I will keep adding great shortcuts that I find to this post and share them with you. Also, the ultimate secret to taking shortcuts is a good trade-off between exploration and exploitation. Can you turn a code snippet into a small framework or function? Can you use other people’s libraries?



Customize the Python in PySpark

In the data booming age, there is an unprecedented demand for data processing, especially “big data scale” processing, the kind that you just know won’t work on your laptop. A few common cases are parsing large amounts of HTML pages, preprocessing millions of photos, and many others. There are tools readily available if your data is fairly structured, like Hive and Impala, but there are cases in which you need a bit more flexibility than plain SQL. In this blog post, we will use one of the most popular frameworks, Apache Spark, and share how to ship your own Python environment without worrying at all about Python dependency issues.

In most vanilla big data environments, HDFS is still a common technology for storing data. Cloudera/Hortonworks distributions and cloud provider solutions like AWS EMR mostly use YARN, or Hadoop NextGen, for resource management. Even in environments where Spark is installed, many users run Spark on top of YARN like this. As Python is the most commonly used language, PySpark sounds like the best option.

Managing the Python environment on your own laptop is already “fun”; managing multiple versions and multiple copies on a cluster that is likely shared with other users could be a disaster. Unless configured otherwise, PySpark will use the default Python installed on each node, and your system admin certainly won’t let you mess with the server’s Python interpreter at all. So you and your cluster admin might come to a middle ground in which a new Python, say Anaconda Python, is installed on a network shared file system visible to all the nodes. That gives you more flexibility, as any installation is mirrored to the other nodes and consistency is ensured. However, after only a few days you notice that your colleagues accidentally “broke” the environment by upgrading certain libraries without you knowing, and now you have to police that Python environment. In the end, this comes down to the question of how each user can customize their own Python environment and make sure it runs on the cluster. The answer to Python environment management is certainly Anaconda, and the answer to distributed shipment is the YARN archives argument.

The idea is that for any project, you create your own conda environment and do the development within it, so in the end you use the Python of your own choice in a clean environment where only the libraries necessary for this project are installed. It is like the uber jar idea in Java: everything Python related in one box. Then we submit our PySpark job, first by shipping that environment to all the executors where the workload will be distributed, and second by configuring each executor to use the Python that we shipped. All the resource management, fan out and pull back will be handled by Spark and YARN. So there are three steps in total: the first is a conda environment, the second is a PySpark driver, and the last is the code submission.
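As a rough sketch of the submission step, the helper below assembles a spark-submit command that ships a packed conda environment through the YARN archives argument and points PySpark at the interpreter inside it. The function name build_spark_submit, the archive name myenv.tar.gz, the alias environment and the script name parse_html.py are all made up for illustration, and the sketch assumes the archive unpacks with bin/python at its top level (as an archive made with a tool like conda-pack would).

```python
def build_spark_submit(archive, script, alias="environment"):
    """Return a spark-submit argument list that ships a packed conda env via YARN."""
    # Path of the shipped interpreter, relative to each container's working directory.
    python_in_archive = f"./{alias}/bin/python"
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # YARN unpacks the archive on each node under the alias given after '#'.
        "--archives", f"{archive}#{alias}",
        # Tell the driver (application master) and the executors to use the shipped Python.
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={python_in_archive}",
        "--conf", f"spark.executorEnv.PYSPARK_PYTHON={python_in_archive}",
        script,
    ]
```

Calling " ".join(build_spark_submit("myenv.tar.gz", "parse_html.py")) gives a command line that you can paste into a terminal, or the list can be handed to subprocess.run directly.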

In the following section, I will share how to distribute HTML parsing in a Cloudera environment as an example. Every single step is also available on GitHub in case you want to follow along.


If you are not already familiar with Conda, it is one of the tools that many Python users live and breathe on a day-to-day basis. You can learn more about Anaconda from here.

SSH tunneling via port 22

Recently I got access to a server on which somehow all the ports were blocked except for SSH. Instead of waiting a few weeks until the IT freeze ended and the ports got opened up, I found that you can access the other ports via a technique called SSH tunneling, as long as the SSH server on the remote host allows port forwarding.

ssh -L 9191:localhost:9191 -L 8080:localhost:8080 -L 8088:localhost:8088 user@host

This will ssh into the remote host and keep those three ports mapped from the local client to the remote host. However, this is fragile: if your ssh session breaks, all the port mappings break with it, and if your terminal sits idle and ends up with a broken pipe, your tunnel breaks too. There are other flags, such as running ssh in the background like a daemon, but if you choose that route, make sure you kill the ssh process when you are done. Otherwise, you will never get back your own 9191 🙂
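If the list of ports changes often, the command can be assembled instead of typed by hand. This is only a small sketch; the helper name build_tunnel_command is made up, and it simply repeats the same -L local:host:remote pattern used in the command above, mapping each port to the same port number on the remote side.

```python
def build_tunnel_command(target, ports, remote_host="localhost"):
    """Return an ssh argument list with one -L forwarding per port."""
    cmd = ["ssh"]
    for port in ports:
        # Forward local <port> to <remote_host>:<port> as seen from the remote machine.
        cmd += ["-L", f"{port}:{remote_host}:{port}"]
    cmd.append(target)
    return cmd
```

For example, build_tunnel_command("user@host", [9191, 8080, 8088]) reproduces the command shown earlier as a list that can be passed to subprocess.run.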


Now you can access the port in your client browser just like below.

Screen Shot 2020-04-05 at 11.21.57 PM

Converting Alpha only images to RGB

There is plenty of image data out there in the RGBA format (Red, Green, Blue and Alpha), in which Alpha represents transparency. By leaving RGB all empty and using Alpha to store the information about the shape, the image itself is like a layer that you can add on top of any other photo and that sort of “floats around”, just like lots of icons out there. However, this type of image isn’t necessarily friendly or ready to be fed directly into many machine learning frameworks, which usually work with RGB or greyscale. This post documents what I did to convert some of these alpha-only images into RGB images.

First, there are plenty of libraries out there that process images. I am merely sharing some of the work that I did, without necessarily comparing the performance of different approaches. The libraries that I will cover in this post are imageio and Pillow. I have read that imageio is supposed to be your first choice because it is well maintained while Pillow isn’t anymore, although that reputation really belongs to the original PIL, and Pillow is its actively maintained fork. In practice, I found that imageio is very easy to use, but its functionality is limited and not as diverse as Pillow’s. Imageio, as the name indicates, deals with reading and writing images, while Pillow has more functionality for working with the image itself, like channels, data manipulation, etc.

My very first try was to convert the RGBA image into a matrix of shape (200, 200, 4), as my image already had a square size of 200×200, then populate the RGB channels with the values of the alpha channel and drop the last channel, which is alpha. In that case, R, G and B are populated equally and the photo should look black and white.

Screen Shot 2020-04-05 at 10.59.27 PM

The code is simple. m is the image object that Pillow returns when reading the file. I first resize it to 256×256, which is more ML friendly, and then convert it into an array, data, with one row per pixel and four columns (40000 rows for the original 200×200 image, 65536 after resizing). After assigning RGB and dropping A, we call reshape to turn it back into the image dimensions and reconstruct the image with “fromarray”. Here astype(np.uint8) caught me off guard a bit: fromarray needs 8-bit unsigned values to build an RGB image, and this may be one of the rough edges that make Pillow less than perfect. The invert at the end inverts the colors (black to white and white to black), so the image has the background that I like.
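Since the code itself was only shown as a screenshot, here is a minimal sketch of the array-based approach described above. It is not the exact original; it assumes Pillow and NumPy are installed, and the function name convert_rgb_slow and the size parameter are made up for illustration.

```python
import numpy as np
from PIL import Image, ImageOps


def convert_rgb_slow(m, size=(256, 256)):
    """Array-based sketch: copy the alpha channel into R, G and B."""
    m = m.convert("RGBA").resize(size)
    # One row per pixel, four columns (R, G, B, A).
    data = np.array(m).reshape(-1, 4)
    alpha = data[:, 3]
    for channel in range(3):
        data[:, channel] = alpha
    # Drop A, restore the (height, width, 3) layout and make sure the
    # values are uint8, which fromarray needs to build an RGB image.
    rgb = data[:, :3].reshape(size[1], size[0], 3).astype(np.uint8)
    # Invert so the shape ends up dark on a light background.
    return ImageOps.invert(Image.fromarray(rgb))
```

A fully transparent pixel (alpha 0) comes out white after the inversion, and a fully opaque one comes out black.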

The code is easy to understand, but when I executed it, it was slow. I did not run any benchmark, but it was noticeably slow compared with some other preprocessing that I had run before.

In the end, I realized that there is already a built-in function called getchannel, so you can get a specific channel without dealing with arrays directly.

Screen Shot 2020-04-05 at 11.04.11 PM

The convert_rgb_fast function does the same thing as the function above but executes much faster, likely because I was doing lots of matrix assignment and index slicing, which I guess is not very efficient.
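A possible shape for such a function, as a sketch only: the body below is an assumption and not the original code. It takes the alpha band with getchannel and stacks it three times with Image.merge, so no NumPy array is involved.

```python
from PIL import Image, ImageOps


def convert_rgb_fast(m, size=(256, 256)):
    """getchannel-based sketch: reuse the alpha band as R, G and B."""
    alpha = m.convert("RGBA").resize(size).getchannel("A")
    # Build an RGB image whose three bands are all the alpha band.
    rgb = Image.merge("RGB", (alpha, alpha, alpha))
    # Invert so the shape ends up dark on a light background.
    return ImageOps.invert(rgb)
```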

Also, by using getchannel, you can easily convert the image to have only one channel, which is basically greyscale. Channels don’t carry names, and RGBA is just a convention; if you only have one channel, all the tools out there will assume it is greyscale.
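As a small illustration (the function name alpha_to_greyscale is made up), the band returned by getchannel is already a single-channel image, which Pillow reports as mode “L”:

```python
from PIL import Image


def alpha_to_greyscale(m):
    """Return the alpha band of an image as a single-channel (mode "L") image."""
    return m.convert("RGBA").getchannel("A")
```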

Screen Shot 2020-04-05 at 11.07.47 PM

Image preprocessing is likely as important as training the model itself. The easy part of image processing is that it can be readily distributed using frameworks like MapReduce, Spark and others. Using Python is probably the easiest way for most data professionals, and we will find an opportunity in the future to demonstrate how to speed up the data cleaning.

stylegan2 – using docker image to set up environment

First, here is the proof that I got stylegan2 (using pre-trained model) working 🙂

Screen Shot 2020-04-05 at 10.44.40 PM

An Nvidia GPU can accelerate computing dramatically, especially for training models. However, if you are not careful, all the time that you saved on training can easily be wasted struggling to set up the environment in the first place, if you can get it working at all.

The challenge here is that there are multiple Nvidia-related components, like the basic GPU driver, CUDA, cuDNN and others. Each of those also comes in different versions, and you need to be careful to keep them consistent. That by itself is already a pain that you don’t want to go through. Last, and certainly the least fun, is getting TensorFlow itself not only working but also aligned with the CUDA environment that you have, while at the same time matching the TensorFlow version expected by a project that you likely did not write yourself but forked from some other person’s GitHub repository. The odds of all of those steps working seamlessly are slim, and fighting with them adds no value for someone who just wants to generate some images; it serves no purpose other than frustration.

I know that lots of Python users out there use Anaconda to manage their Python development environments, switching between different versions of Python at will, maintaining multiple environments with different versions of TensorFlow, and sometimes even creating a completely new conda environment for each project just to keep things clean. In the world of getting a GitHub project up and running fast, I guess that workflow is not enough. There is much more to it than just Python, so in the end a tool like Docker is actually the panacea.

If you have not used Docker much in the past, it is as easy as memorizing a few command line instructions to start and stop a Docker container. This Tensorflow with Docker guide from Google is a fantastic tutorial to get started.

For stylegan2, here are some commands that might help you.

sudo docker build - < Dockerfile    # build the docker image using the Dockerfile from stylegan2
sudo docker image ls
sudo docker tag 90bbdeb87871 datafireball/stylegan2:v0.1
sudo docker run --gpus all -it --rm -v `pwd`:/tmp -w /tmp datafireball/stylegan2:v0.1 bash # create a disposable working environment that gets deleted after logout and maps the current host working directory to the /tmp folder inside the container
sudo docker run --gpus all -it -d -v `pwd`:/tmp -w /tmp datafireball/stylegan2:v0.1 bash # run it as a long-running daemon, like a development environment
sudo docker container ls
sudo docker exec -it youthful_sammet /bin/bash # connect to the container and run bash, "like ssh"

I assure you, the pleasure of getting all the examples to run AS-IS is unprecedented. I suddenly changed my view from “nothing F* works, what is wrong with these developers” to “the world is beautiful, I love the open source community”.

You can use Docker not only to get StyleGAN running but also to get tensorflow-gpu-py3 working. Not meaning to jump the gun for the rest of the development world, but I bet there are plenty of other people struggling with environment setup who can benefit from using Docker, and there are just as many people out there who can make the world a better place by starting their project with a Docker image, knowing that no one in the world, including the developer himself, knows exactly how the environment is configured.

Life is short, use [Python inside a docker] 🙂
