Web Technology

I recently ran into an issue with my static file serving strategy. To keep it short, here is a high-level overview of my architecture: I use the Django web framework, with CloudFront serving my static files and S3 as the origin. The problem was that after I ran collectstatic to push files to S3, refreshing the CloudFront URL would randomly give me either the old or the latest version of an object.

A few more details: my CloudFront cache time was set to 24 hours. Yesterday, I ran collectstatic twice. The first time, I uploaded test.txt with the one-line content “test 1”. The website ran for 4 hours, then I updated test.txt with the content “test 2” and ran collectstatic again. Now I have two questions. First, what is in test.txt on S3? Second, what are you going to see in test.txt from CloudFront? The answers: on S3 it is “test 2”, but from CloudFront you will sometimes get “test 1” and sometimes “test 2” as you keep refreshing the URL.

I thought this was super weird. Since I sometimes got the latest version of my object from CloudFront, I expected the latest version to be served all the time. So I took a closer look today: I updated test.txt again to “test 3” and ran collectstatic to get the file up to S3. All three collectstatic runs happened within 24 hours. Question: what is on S3, and what is on CloudFront? Answer: S3 has the latest version, “test 3”, while CloudFront randomly returns “test 1”, “test 2” or “test 3” as I refresh.

I spent some time debugging, trying to get CloudFront to serve only the latest version and make “test 1” and “test 2” disappear, but had no luck.

On second thought, before taking any further action, I started to realize that I was probably thinking in a completely wrong direction. After reading through the CloudFront documentation, I found this:

CloudFront only makes another request to the origin after an object expires. My “test 1” had not expired, so why did I start seeing “test 2” and “test 3”?

This is what I think is happening. I am in San Francisco; assume my edge location has 20 servers. The first time, I ran collectstatic to get “test 1” to S3 and then requested the file. My request hit CF server 1 at my edge location, which did not have the file, so it forwarded the request to S3 and got it from there. Then I refreshed the URL, and this time the request hit CF server 2, which did not have the file either, so it also got it from S3. At this point, CF servers 1 and 2 at my edge location both had “test 1”. The second time, I ran collectstatic to get “test 2” up to S3. When I requested the file, if my request hit CF server 1 or 2, the server would find the cached file and serve it. If my request hit one of the other 18 CF servers at my edge location, that server would ask S3 for the file, get the latest version “test 2”, cache it, and serve it. From that point on, whether I saw “test 1” or “test 2” when I refreshed the URL depended on which CF server got my request.

Someone may ask: doesn’t CF server 1 propagate the latest version to all the other CF servers when it gets it? Apparently it does not. Here is what I think the reason is: if an edge location has hundreds or thousands of servers, and each of them pushed every new version of a file to all the others, the file transfer alone would cost a lot. However, there must be some way to “propagate” a change, and that is where invalidation comes to the rescue. If I were a developer on the CloudFront team, from an implementation point of view, invalidating an object would be much cheaper than propagating it: invalidation only has to delete the object from the servers that have it, while propagation has to transfer the object to every server. That is why I believe the philosophy behind CloudFront is that they want you to bust the old versions of your files once you have a new version.
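To make the theory above concrete, here is a toy Python simulation. It is only a sketch of my guess, not how CloudFront is actually implemented: the EdgeServer class, the 20 servers, and the uniformly random routing are all assumptions for illustration.

```python
import random


class EdgeServer:
    """One hypothetical cache server; it caches independently of its peers."""

    def __init__(self):
        self.cache = {}

    def get(self, key, origin):
        # On a cache miss, fetch from the origin (S3) and keep a copy.
        if key not in self.cache:
            self.cache[key] = origin[key]
        return self.cache[key]

    def invalidate(self, key):
        # Invalidation only deletes the cached copy; nothing is transferred.
        self.cache.pop(key, None)


def simulate(num_servers=20, refreshes=200, seed=0):
    rng = random.Random(seed)
    origin = {}
    servers = [EdgeServer() for _ in range(num_servers)]

    def refresh():
        # Assumption: each request lands on a random server.
        return rng.choice(servers).get("test.txt", origin)

    # First collectstatic: two servers happen to cache "test 1".
    origin["test.txt"] = "test 1"
    servers[0].get("test.txt", origin)
    servers[1].get("test.txt", origin)

    # Second collectstatic: S3 has "test 2", the two caches are unchanged.
    origin["test.txt"] = "test 2"
    seen_before = {refresh() for _ in range(refreshes)}

    # Invalidate everywhere, then refresh again.
    for server in servers:
        server.invalidate("test.txt")
    seen_after = {refresh() for _ in range(refreshes)}
    return seen_before, seen_after


seen_before, seen_after = simulate()
print(sorted(seen_before), sorted(seen_after))
```

Before the invalidation, the refreshes see a mix of both versions, because the answer depends on which server is hit. After it, every server has to go back to S3, so only the latest version is left.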

Finally, I invalidated my test.txt file, and I got “test 3” every time, as expected.
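If you prefer to invalidate from a script instead of the AWS console, one possible way is the boto3 CloudFront client. This is a minimal sketch and not necessarily how I did it here; the helper names and the distribution ID you pass in are placeholders.

```python
import time


def build_invalidation_batch(paths):
    """Build the InvalidationBatch structure that CloudFront expects."""
    return {
        "Paths": {"Quantity": len(paths), "Items": list(paths)},
        # CallerReference must be unique per request; a timestamp is enough here.
        "CallerReference": str(time.time()),
    }


def invalidate(distribution_id, paths):
    """Ask CloudFront to drop the cached copies of the given paths."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("cloudfront")
    return client.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch=build_invalidation_batch(paths),
    )
```

For the example in this post, the call would look like invalidate("YOUR_DISTRIBUTION_ID", ["/test.txt"]), with your own distribution ID filled in.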

Thanks for reading.

Check out the real action here.

This post pits the naive approach of converting an image into an SVG image of circular dots, and applying some effects to it, against d3's native selectors with transitions. Obviously, it is not going to turn out well. Looping through pixels is a dumb way to work with images, but it is sometimes inevitable.

I loaded the image and looped through each pixel to collect the color data and some coordinate info that is used later to render the effect. The process lags badly when loading a large image, so I tried to cut my images down to make the rendering smoother. Also, since my circle radius is set to be small, a large image gives better resolution, which kind of defeats the purpose of making the image look dotty. Eventually, I made each image around 100×100.
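The real thing runs in the browser with d3, but the pixel loop itself is easy to sketch in Python. This is only an illustration under my own assumptions: the image is already decoded into rows of (r, g, b) tuples, and the function names and the radius are made up for the example.

```python
def pixels_to_circles(pixels, radius=2):
    """Turn rows of (r, g, b) pixels into one circle description per pixel."""
    spacing = 2 * radius  # circles sit side by side without overlapping
    circles = []
    for y, row in enumerate(pixels):
        for x, (r, g, b) in enumerate(row):
            circles.append({
                "cx": x * spacing + radius,
                "cy": y * spacing + radius,
                "r": radius,
                "fill": "rgb({},{},{})".format(r, g, b),
            })
    return circles


def circles_to_svg(circles, width, height):
    """Serialize the circles into a standalone SVG document."""
    body = "".join(
        '<circle cx="{cx}" cy="{cy}" r="{r}" fill="{fill}"/>'.format(**c)
        for c in circles
    )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'width="{}" height="{}">{}</svg>'.format(width, height, body)
    )


# A tiny 2x2 "image": red, green / blue, black.
pixels = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (0, 0, 0)]]
circles = pixels_to_circles(pixels)
svg = circles_to_svg(circles, width=8, height=8)
```

A 100×100 image already produces 10,000 circles this way, which is why larger images get laggy so quickly.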

The following effect, which I call Paint, is actually just a little trick with loops.


For the 4 directions effect, once all the dots were generated, I randomly decided which direction each dot should come from and simply marked each circle with a class name for its direction. Everything else is self-explanatory.
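The direction tagging can be sketched in Python as follows. This is only a rough illustration, not my actual d3 code: the direction names, the start_cx and start_cy fields, and the idea of starting each dot on the matching edge of the canvas are assumptions.

```python
import random

DIRECTIONS = ("top", "right", "bottom", "left")


def assign_directions(circles, width, height, rng=random):
    """Tag each circle with a random direction and a start point on that edge."""
    tagged = []
    for c in circles:
        direction = rng.choice(DIRECTIONS)
        start = {
            "top": (c["cx"], 0),
            "bottom": (c["cx"], height),
            "left": (0, c["cy"]),
            "right": (width, c["cy"]),
        }[direction]
        # The direction doubles as the class name used to select the circle.
        tagged.append(dict(c, direction=direction,
                           start_cx=start[0], start_cy=start[1]))
    return tagged


dots = assign_directions([{"cx": 5, "cy": 7}, {"cx": 9, "cy": 3}], 100, 100)
```

With the class name in place, a transition can select all circles of one direction and move them from their start point to their final cx and cy.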


The following is a little bouncing effect on rain.

Apr-24-2016 19-35-55

You can see them in real action here.

Thanks for reading.

I open sourced my playbook for quickly starting an Ubuntu 14.04.4 LTS (Trusty64) virtual box with Docker preinstalled on your local machine using Vagrant. You can download it from here.

If you want to do complex work inside of your box, you can easily add more roles and build on top of it.

If you want to start an Ubuntu Server 14.04 LTS instance on a cloud service such as AWS, it will work there too.

Thanks for reading.


This post is a follow-up to my previous post here. As I mentioned when describing my testing strategy at KuKy World, we use Lettuce for testing, and for each installed app we test both back-end behavior and web browser behavior. Since we have already addressed how to switch to the test database for back-end tests, here I will discuss how to switch to the test database for browser behavior testing.

Since we use SQLAlchemy, the session is created at the global level. That session is imported in each view file, so when webdriver opens a URL, the view uses the globally created session, which usually points at the development database.

The way we deal with this issue is not too hard. The idea is to make the global session use the test database whenever python manage.py harvest gets called. The following piece of code in our local settings file does the job.

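import sys

import sqlalchemy
import sqlalchemy.orm
from sqlalchemy.ext.declarative import declarative_base

# Default to the development database
DB_ENVIRONMENT = 'mysql://root:@localhost/development'

# This is for browser behavior testing on local
if "harvest" in sys.argv:
    DB_ENVIRONMENT = 'mysql://root:@localhost/test'

# Global engine and session shared by every view that imports them
engine = sqlalchemy.create_engine(DB_ENVIRONMENT)
Session = sqlalchemy.orm.sessionmaker(bind=engine)
session = Session()
Base = declarative_base()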

Now, when you run python manage.py harvest, it will start using the test database, and python manage.py runserver will keep using the development database.

Thanks for reading.

I haven’t done any automated testing yet, and I know it is important, so I started thinking about a good way to do tests. To be more specific, at this stage I want to focus on browser behavior testing. I quickly found something called Watir WebDriver, which is in Ruby. Although I personally want to learn some Ruby, everything here is written in Python, so any testing that involves the database would become very hard. Therefore, I kept looking until I found Lettuce. Lettuce has built-in integration with Django; however, for some of its features to work, it assumes that you are using the default Django ORM, which is not the case here at KuKy World because we use SQLAlchemy.

Now, I am going to talk about what we do to switch between our test and development databases. First of all, my testing strategy is the following: I have a features folder under each of my registered apps, and in each features folder I have two subfolders: backend, which is for testing the back end, and web, which is for testing in the browser. The file structure looks like this:

Screen Shot 2014-10-05 at 10.55.58 AM

To make the tests use the test database, I have to define the test database environment in the terrain.py file. The following is what I did:

from lettuce import world
import sqlalchemy, sqlalchemy.orm
from sqlalchemy.ext.declarative import declarative_base

def recrete_test_db():
    testing_db = 'mysql://root:@localhost/test'
    engine = sqlalchemy.create_engine(testing_db)
    Session = sqlalchemy.orm.sessionmaker(bind=engine)
    world.session = Session()
    Base = declarative_base()
    # import all the models
    from webapp.models import *
    # Re create the tables in test database

world.recrete_test_db = recrete_test_db

Then, in a step file such as my signup.py, you can recreate the test database wherever you need to. Here is an example:

from lettuce import *

def say_hello():
    print 'Hello there!'
    print 'Lettuce will recreate the test database and run tests...'

Note: when you communicate with your test database in tests, you will be using world.session rather than other sessions you might have at your global level.

Thanks for reading.

I have been thinking about how to make data visualization more fun. Although this post has absolutely nothing to do with data, it is still a very fun experiment with d3. The idea comes from a super simple and ugly wall decoration at GYPSY:


I suddenly wanted to do something fun with it, which led to the following. Here is the live version.


As you can probably tell, the colorful circles drop to the bottom and get absorbed by the black holes until every black hole is filled by a colorful one.

Thanks for reading.

A lot of people ask me what I use to write my resume. Here is a copy; it is obviously LaTeX. What confuses them is that my resume looks like a combination of some popular LaTeX templates. Indeed, I customized two popular templates, added some other stuff, and eventually turned them into what my resume looks like now.

Today I am using a Jinja environment to turn my static LaTeX resume into a template that reads a JSON file to fill in the content. I call it XNemo. You can download it here and apply it to your own resume. It requires no coding knowledge, just a bit of JSON.

Now, I will go into some of the details how I designed this, and how you can use it for your own resume.

After you download the code, you should see the following file structure, pretty simple.

Screen Shot 2014-09-26 at 4.36.35 AM

At a high level, how it works is easy to understand: you run a Python script, which reads the content of data.json into the templates/template.tex template and generates the LaTeX source file resume.tex and the final resume.pdf. All you need to do is type the following two commands.

Screen Shot 2014-09-26 at 4.00.33 AM

Note: Make sure you have Flask installed, otherwise you will get errors.

Now I will explain some of the details.

First, I defined my own template block syntax in generate_pdf.py as follows:

Screen Shot 2014-09-26 at 4.04.02 AM

Basically, setting block_start_string to ‘((*’ and block_end_string to ‘*))’ means that a for loop in the LaTeX template will look like ((* for <condition> *)) ((* endfor *)). Likewise, setting variable_start_string changes how a variable's value is retrieved, which will look like <(( value ))>. From line 57 to line 60, I customized some template tags that are used later in the template.
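For readers who cannot see the screenshot, here is a minimal sketch of what such a Jinja environment can look like. The block delimiters are the ones described above; the variable delimiters ‘(((’ and ‘)))’, the template string, and the skills list are assumptions made up for this example, not copied from generate_pdf.py.

```python
from jinja2 import Environment

# LaTeX uses { } and % heavily, so the default Jinja delimiters are swapped
# for ones that do not clash with LaTeX syntax.
latex_env = Environment(
    block_start_string="((*",
    block_end_string="*))",
    variable_start_string="(((",   # assumed for this sketch
    variable_end_string=")))",     # assumed for this sketch
)

# A tiny stand-in for templates/template.tex.
template = latex_env.from_string(
    "((* for skill in skills *))\\item ((( skill )))\n((* endfor *))"
)

rendered = template.render(skills=["Python", "LaTeX"])
print(rendered)
```

Each pass through the loop emits one \item line, so the JSON list turns into a LaTeX list without touching the template.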

Now let’s take a look at the data.json file. My resume contains multiple colors and a timeline, and this block of info simply defines them:

Screen Shot 2014-09-26 at 4.16.54 AM

When you set “usecolor” to true, the timeline color and the theme color are applied to your resume; false makes your resume pure black and white. Since my resume has a timeline, “start_year” is the earliest year that will appear in your resume, and the hidden “end_year” defaults to next year. “theme_color” defines all the other colors besides the timeline colors (including the quote).

The rest of the data.json file defines your skills, work experience, projects, education, and interests, the things you will most likely want in your resume. How it works is self-explanatory.

If you want to add more content to your resume, you can look through data.json and templates/template.tex and modify them to fit your needs. This might require a bit of knowledge of LaTeX and JSON, but it is still not too hard.
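The “end_year” default can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in generate_pdf.py: the load_settings helper and the sample values are hypothetical, and only the key names come from data.json.

```python
import datetime
import json


def load_settings(raw_json, today=None):
    """Parse the settings block and fill in the hidden end_year default."""
    settings = json.loads(raw_json)
    today = today or datetime.date.today()
    # end_year is not written in data.json; it defaults to next year.
    settings.setdefault("end_year", today.year + 1)
    return settings


# Sample values made up for the example.
sample = '{"usecolor": true, "start_year": 2010, "theme_color": "blue"}'
settings = load_settings(sample, today=datetime.date(2014, 9, 26))
print(settings["end_year"])
```

Using setdefault means you can still pin “end_year” yourself in the JSON if you ever need to.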

If you have any questions, you can always send me a request or comment on this post. I will respond as soon as possible.

Thanks for reading.

When you work in a team, collaboration becomes essential. At this early stage, I am leading a small team to build our platform. Although I am the one who writes the most code, I have realized how important it is to start doing code review. I know a lot of companies use Review Board as their code review tool, plus other tools for task management and so on. Back when I worked at Minted, Review Board did not impress me, so I started looking for something better and quickly found Phabricator.

Phabricator is a collection of open source web applications that help software companies build better software. It does not do just one thing; it is a suite of applications including task management, code review, and more. It is also free if you host it yourself.

I started looking into it and decided to host Phabricator on an EC2 Ubuntu instance as an experiment. Before deciding, I had the chance to ask the CTO at Phabricator how soon their hosting service would go live. He replied, “At least a few months.” I could not wait.

Just like everyone else who uses Phabricator, I got super addicted to it. I spent hours playing around with it, trying different configurations and exploring different features. The following is an image of my Phabricator home page, which everyone can customize.

Screen Shot 2014-09-24 at 6.24.30 PM

One feature that I find extremely amazing is that their Arcanist tool provides a command, “arc lint”, that analyzes the source code and raises warnings and errors about it, like a syntax checker that makes sure you are writing clean code. For example, the following is what happens for some Python code.

Screen Shot 2014-09-24 at 6.34.33 PM

For more about Phabricator, check out their official website. All you need to believe is that Phabricator is cool.

In terms of price, our hosting strategy at KuKy World is a t2.micro heavy utilization instance reserved for 3 years, for $109 in total plus an hourly rate of $0.002, and a general purpose SSD EBS volume with 1GB of data. You can do the math; it is not expensive.
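If you would rather not do the math by hand, here is a quick sketch. It assumes the $109 is a one-time upfront fee and that the $0.002 hourly rate is charged for every hour of the 3 years; the EBS volume is left out because I did not list its price above.

```python
HOURS_PER_YEAR = 24 * 365


def reserved_cost(upfront, hourly_rate, years):
    """Return (total cost, average cost per month) for a reserved instance."""
    hourly_total = hourly_rate * HOURS_PER_YEAR * years
    total = upfront + hourly_total
    return total, total / (years * 12)


total, per_month = reserved_cost(upfront=109, hourly_rate=0.002, years=3)
print("total: ${:.2f}, per month: ${:.2f}".format(total, per_month))
```

Under those assumptions, the instance comes to roughly $161.56 over three years, or about $4.49 per month.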

Thanks for reading.