Apr 7, 2009
Build Your Own URL Shortening Service
Everyone seems to have their own favourite URL shortening service, TinyURL, is.gd, tr.im, pendek.in, and so on. Some even do more than just shorten URLs, e.g. tr.im can track stats.
It’s relatively easy to roll own your own URL shortening service once you know the magic behind it. I’m going to explain the basic idea so you can start rolling out your own.
UPDATE: check out django-shorturls, a Django app recently created by Simon Willison and Jacob Kaplan-Moss, it uses a similar base 62 approach.
How it works
The basic idea of how these services work is by storing URLs in a database, each with its own unique ID, which is normally an integer. The ID is then magically converted into a series of characters which can be used to make up a complete URL that people can pass around to people.
When someone opens the shortened URL with the code, the server will convert the code back into the unique ID so that it can retrieve the URL in the database. Once the URL is retrieved then the server will send a HTTP redirect response to the browser.
The magic
It’s not magic, really.
The conversion function from an integer to the short code must produce as short code as possible while still being practical: the characters in the code must be valid for URLs, no “funny” characters (“?”, “/”, “&”, “%”, “+” are probably a bad idea).
One of the easiest functions to use for this purpose is to encode the integer (decimal, base 10 number) into base 36.
Normally we use base 10 to count. In base 10 numbers we only have 10 possibilities (0 to 9) for each digit before we need add another digit in front of it and roll over to the starting symbol (“10″ is two digits, because we have no more possibile symbol to use after “9″).
In base 16 (hexadecimal) we add 6 more symbols (“ABCDEF”) so that we can count to “F” (or 15) before we need to add a digit in front and roll over to the starting symbol (“10″, or 16 in hexadecimal). If you’re having trouble following, imagine having 16 fingers and 16 toes to count with.
Same idea for base 36, but we add the whole 26 Roman alphabets, “A” to “Z”. So “Z” is 35 in base 36, and “10″ is 36.
How short is short?
With a 3-digit base 10 numbers, we can only have 1000 possible combinations (0 to 999). If I remember my high school math correctly, the formula is ND where N is the number of possibilities for each digit (or the base number, 10 in this case), and D is the number of digits (3 in this case), so 103 = 1000.
The goal is to have as many combinations as possible with as little number of digits as possible. Since the number of digits will grow over time (as we store more URLs) then the only way we can increase the number of possible combinations is by increasing the base number.
That’s why converting from base 10 to base 36 works.
So why stop at 36, you ask. As a matter of fact, we can make use of both uppercase and lowercase letters of the alphabets so that we can use base 62. If you’re willing to have non-alphanumeric characters in your short codes, you can even push it further with a few more characters. In fact, this is what essentially base 64 encoding does. Be careful though, the commonly found base 64 encoding normally works on strings instead of numbers, converting 6 bits of data at a time, and it can add paddings, the output is 33.3% longer than the input. Quite different from what we want.
I would probably use base 62 because it’s slightly more pragmatic and it has less chance of screwing up, but if you insist on using base 64 (for numbers, not strings), then I suggest you use “-” and “=” as the additional symbols. I’d avoid underscores (“_”) because they can be hard to see in links, which are normally shown underlined.
The following table shows the number of possibilities for some of the bases:
| Digits | Base 10 | Base 26 | Base 36 | Base 62 | Base 64 |
|---|---|---|---|---|---|
| 2 | 100 | 676 | 1,296 | 3,844 | 4,096 |
| 3 | 1,000 | 17,576 | 46,656 | 238,328 | 262,144 |
| 4 | 10,000 | 456,976 | 1,679,616 | 14,776,336 | 16,777,216 |
| 5 | 100,000 | 11,881,376 | 60,466,176 | 916,132,832 | 1,073,741,824 |
| 6 | 1,000,000 | 308,915,776 | 2,176,782,336 | 56,800,235,584 | 68,719,476,736 |
Say, you add 1 million new URLs every month on average. With base 62, it would take a little over a year to use up all the possible 4 digits, and after that it would take another 75 years to use up all the possible 5 digits, and after that it would take another 4,658 years to use up all the possible 6 digits, and so on.
It would probably take even longer to use up the digits if you don’t add URLs that are already in the database. Conversely, you might use up the digits faster if you want to do per-user tracking, for example.
Sample code
Here’s a simple Python code that you can use to convert an integer to a base N “number” (actually string), and vice versa.
# Code adapted from: http://en.wikipedia.org/wiki/Base_36#Python_Conversion_Code
# This is base 62:
ALPHABETS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
def base_n_decode(s, alphabets=ALPHABETS):
n = len(alphabets)
rv = pos = 0
charlist = list(s)
charlist.reverse()
for char in charlist:
rv += alphabets.find(char) * n**pos
pos += 1
return rv
def base_n_encode(num, alphabets=ALPHABETS):
n = len(alphabets)
rv = ""
while num != 0:
rv = alphabets[num % n] + rv
num /= n
return rv
Can’t we just use MD5?
No. We can’t use hash functions like MD5, because hash functions are one-way, irreversible, meaning we can’t get the original input just from knowing the hash code. Also, hash functions can have collisions, meaning from one hash code there is a remote possibility that it “maps” back to two or more URLs. Plus, the output of hash functions are normally long-ish fixed-length strings.
So, what are you waiting for?
Now that the magic is revealed, you can create your own URL shortening service. Be creative and innovative so you can compete with all the other URL shortening services out there.
As a bonus, you can also use this technique to shorten the permalinks or absolute URLs for your
objects in your Django (or Rails, etc.) websites. So instead of http://example.com/post/12345
it would be http://example.com/post/3D7. Note that this has nothing to do with security
whatsoever. The coded ID can still be reversed with some effort. And you shouldn’t rely on
obfuscation to protect anything anyway.
Follow
Boleh percaya boleh tidak, banyak corporate proxy server yang melakukan blocking URL shortening service macam tinyurl, misalkan.
@Dedhi: I can understand why. I’m still undecided myself on whether URL shortening services *as proxies* is a good idea in general, I’m leaning towards no.
I guess the main point of this article is more for creating URL shortening service for your own website. I should have worded my article more carefully.
URL shortening can be useful for content producers, e.g. Facebook could provide their own URL shortening service to produce short links like facebook.com/p/Q74m instead of facebook.com/photos.php?u=1725352662&p=18726367181.
Twitter could automatically shorten URLs in tweets and replace them with e.g. {Q74m} code that can be pasted at the end of twitter.com/url/ for example.
saya pernah baca di mana gtu *lupa* url panjang bisa dikatakan sebagai “pemakan bandwidth” apalagi tuk situs2 sangat sibuk seperti facebook.