How to code a URL shortener?

caw Source

I want to create a URL shortener service where you can write a long URL into an input field and the service shortens the URL to "http://www.example.org/abcdef".

Edit: Due to the ongoing interest in this topic, I've published an efficient solution to GitHub, with implementations for JavaScript, PHP, Python and Java. Add your solutions if you like :)

Instead of "abcdef" there can be any other string with six characters containing a-z, A-Z and 0-9. That makes 56~57 billion possible strings.

My approach:

I have a database table with three columns:

  1. id, integer, auto-increment
  2. long, string, the long URL the user entered
  3. short, string, the shortened URL (or just the six characters)

I would then insert the long URL into the table. Then I would select the auto-increment value for "id" and build a hash of it. This hash should then be inserted as "short". But what sort of hash should I build? Hash algorithms like MD5 create too long strings. I don't use these algorithms, I think. A self-built algorithm will work, too.

My idea:

For "http://www.google.de/" I get the auto-increment id 239472. Then I do the following steps:

short = '';
if divisible by 2, add "a"+the result to short
if divisible by 3, add "b"+the result to short
... until I have divisors for a-z and A-Z.

That could be repeated until the number isn't divisible any more. Do you think this is a good approach? Do you have a better idea?

algorithmurl

Answers

answered 9 years ago shoosh #1

Why would you want to use a hash?
You can just use a simple translation of your auto-increment value to an alphanumeric value. You can do that easily by using some base conversion. Say you character space (A-Z,a-z,0-9 etc') has 40 characters, convert the id to a base-40 number and use the characters are the digits.

answered 9 years ago cr333 #2

Why not just translate your id to a string? You just need a function that maps a digit between, say, 0 and 61 to a single letter (upper/lower case) or digit. Then apply this to create, say, 4-letter codes, and you've got 14.7 million URLs covered.

answered 9 years ago Marcel Jackwerth #3

I would continue your "convert number to string" approach. However you will realize that your proposed algorithm fails if your ID is a prime and greater than 52.

Theoretical background

You need a Bijective Function f. This is necessary so that you can find a inverse function g('abc') = 123 for your f(123) = 'abc' function. This means:

  • There must be no x1, x2 (with x1 ≠ x2) that will make f(x1) = f(x2),
  • and for every y you must be able to find an x so that f(x) = y.

How to convert the ID to a shortened URL

  1. Think of an alphabet we want to use. In your case that's [a-zA-Z0-9]. It contains 62 letters.
  2. Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table for example).

    For this example I will use 12510 (125 with a base of 10).

  3. Now you have to convert 12510 to X62 (base 62).

    12510 = 2×621 + 1×620 = [2,1]

    This requires use of integer division and modulo. A pseudo-code example:

    digits = []
    
    while num > 0
      remainder = modulo(num, 62)
      digits.push(remainder)
      num = divide(num, 62)
    
    digits = digits.reverse
    

    Now map the indices 2 and 1 to your alphabet. This is how your mapping (with an array for example) could look like:

    0  → a
    1  → b
    ...
    25 → z
    ...
    52 → 0
    61 → 9
    

    With 2 → c and 1 → b you will receive cb62 as the shortened URL.

    http://shor.ty/cb
    

How to resolve a shortened URL to the initial ID

The reverse is even easier. You just do a reverse lookup in your alphabet.

  1. e9a62 will be resolved to "4th, 61st, and 0th letter in alphabet".

    e9a62 = [4,61,0] = 4×622 + 61×621 + 0×620 = 1915810

  2. Now find your database-record with WHERE id = 19158 and do the redirect.

Some implementations (provided by commenters)

answered 9 years ago Alister Bulman #4

If you don't want re-invent the wheel ... http://lilurl.sourceforge.net/

answered 9 years ago Ash #5

Not an answer to your question, but I wouldn't use case-sensitive shortened URLs. They are hard to remember, usually unreadable (many fonts render 1 and l, 0 and O and other characters very very similar that they are near impossible to tell the difference) and downright error prone. Try to use lower or upper case only.

Also, try to have a format where you mix the numbers and characters in a predefined form. There are studies that show that people tend to remember one form better than others (think phone numbers, where the numbers are grouped in a specific form). Try something like num-char-char-num-char-char. I know this will lower the combinations, especially if you don't have upper and lower case, but it would be more usable and therefore useful.

answered 9 years ago Michael Stum #6

My approach: Take the Database ID, then Base36 Encode it. I would NOT use both Upper AND Lowercase letters, because that makes transmitting those URLs over the telephone a nightmare, but you could of course easily extend the function to be a base 62 en/decoder.

answered 9 years ago MrChrisRodriguez #7

alphabet = map(chr, range(97,123)+range(65,91)) + map(str,range(0,10))

def lookup(k, a=alphabet):
    if type(k) == int:
        return a[k]
    elif type(k) == str:
        return a.index(k)


def encode(i, a=alphabet):
    '''Takes an integer and returns it in the given base with mappings for upper/lower case letters and numbers 0-9.'''
    try:
        i = int(i)
    except Exception:
        raise TypeError("Input must be an integer.")

    def incode(i=i, p=1, a=a):
        # Here to protect p.                                                                                                                                                                                                                
        if i <= 61:
            return lookup(i)

        else:
            pval = pow(62,p)
            nval = i/pval
            remainder = i % pval
            if nval <= 61:
                return lookup(nval) + incode(i % pval)
            else:
                return incode(i, p+1)

    return incode()



def decode(s, a=alphabet):
    '''Takes a base 62 string in our alphabet and returns it in base10.'''
    try:
        s = str(s)
    except Exception:
        raise TypeError("Input must be a string.")

    return sum([lookup(i) * pow(62,p) for p,i in enumerate(list(reversed(s)))])a

Here's my version for whomever needs it.

answered 8 years ago bhelx #8

You could hash the entire URL, but if you just want to shorten the id, do as marcel suggested. I wrote this python implementation:

https://gist.github.com/778542

answered 7 years ago Davide Muzzarelli #9

This is what I use:

# Generate a [0-9a-zA-Z] string
ALPHABET = map(str,range(0, 10)) + map(chr, range(97, 123) + range(65, 91))

def encode_id(id_number, alphabet=ALPHABET):
    """Convert an integer to a string."""
    if id_number == 0:
        return alphabet[0]

    alphabet_len = len(alphabet) # Cache

    result = ''
    while id_number > 0:
        id_number, mod = divmod(id_number, alphabet_len)
        result = alphabet[mod] + result

    return result

def decode_id(id_string, alphabet=ALPHABET):
    """Convert a string to an integer."""
    alphabet_len = len(alphabet) # Cache
    return sum([alphabet.index(char) * pow(alphabet_len, power) for power, char in enumerate(reversed(id_string))])

It's very fast and can take long integers.

answered 7 years ago Joel Berger #10

For a similar project, to get a new key, I make a wrapper function around a random string generator that calls the generator until I get a string that hasn't already been used in my hashtable. This method will slow down once your name space starts to get full, but as you have said, even with only 6 characters, you have plenty of namespace to work with.

answered 7 years ago Xeoncross #11

Here is my PHP 5 class.

<?php
class Bijective
{
    public $dictionary = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public function __construct()
    {
        $this->dictionary = str_split($this->dictionary);
    }

    public function encode($i)
    {
        if ($i == 0)
        return $this->dictionary[0];

        $result = '';
        $base = count($this->dictionary);

        while ($i > 0)
        {
            $result[] = $this->dictionary[($i % $base)];
            $i = floor($i / $base);
        }

        $result = array_reverse($result);

        return join("", $result);
    }

    public function decode($input)
    {
        $i = 0;
        $base = count($this->dictionary);

        $input = str_split($input);

        foreach($input as $char)
        {
            $pos = array_search($char, $this->dictionary);

            $i = $i * $base + $pos;
        }

        return $i;
    }
}

answered 7 years ago phirschybar #12

// simple approach

$original_id = 56789;

$shortened_id = base_convert($original_id, 10, 36);

$un_shortened_id = base_convert($shortened_id, 36, 10);

answered 6 years ago Simon East #13

Here is a decent URL encoding function for PHP...

// From http://snipplr.com/view/22246/base62-encode--decode/
private function base_encode($val, $base=62, $chars='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ') {
    $str = '';
    do {
        $i = fmod($val, $base);
        $str = $chars[$i] . $str;
        $val = ($val - $i) / $base;
    } while($val > 0);
    return $str;
}

answered 6 years ago Ryan Charmley #14

Don't know if anyone will find this useful - it is more of a 'hack n slash' method, yet is simple and works nicely if you want only specific chars.

$dictionary = "abcdfghjklmnpqrstvwxyz23456789";
$dictionary = str_split($dictionary);

// Encode
$str_id = '';
$base = count($dictionary);

while($id > 0) {
    $rem = $id % $base;
    $id = ($id - $rem) / $base;
    $str_id .= $dictionary[$rem];
}


// Decode
$id_ar = str_split($str_id);
$id = 0;

for($i = count($id_ar); $i > 0; $i--) {
    $id += array_search($id_ar[$i-1], $dictionary) * pow($base, $i - 1);
} 

answered 6 years ago Svetoslav Marinov #15

did you omit O, 0, i on purpose ?

Just created a php class based on Ryan's solution.

<?php

    $shorty = new App_Shorty();

    echo 'ID: ' . 1000;
    echo '<br/> Short link: ' . $shorty->encode(1000);
    echo '<br/> Decoded Short Link: ' . $shorty->decode($shorty->encode(1000));


    /**
     * A nice shorting class based on Ryan Charmley's suggestion see the link on stackoverflow below.
     * @author Svetoslav Marinov (Slavi) | http://WebWeb.ca
     * @see http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener/10386945#10386945
     */
    class App_Shorty {
        /**
         * Explicitely omitted: i, o, 1, 0 because they are confusing. Also use only lowercase ... as
         * dictating this over the phone might be tough.
         * @var string
         */
        private $dictionary = "abcdfghjklmnpqrstvwxyz23456789";
        private $dictionary_array = array();

        public function __construct() {
            $this->dictionary_array = str_split($this->dictionary);
        }

        /**
         * Gets ID and converts it into a string.
         * @param int $id
         */
        public function encode($id) {
            $str_id = '';
            $base = count($this->dictionary_array);

            while ($id > 0) {
                $rem = $id % $base;
                $id = ($id - $rem) / $base;
                $str_id .= $this->dictionary_array[$rem];
            }

            return $str_id;
        }

        /**
         * Converts /abc into an integer ID 
         * @param string
         * @return int $id
         */
        public function decode($str_id) {
            $id = 0;
            $id_ar = str_split($str_id);
            $base = count($this->dictionary_array);

            for ($i = count($id_ar); $i > 0; $i--) {
                $id += array_search($id_ar[$i - 1], $this->dictionary_array) * pow($base, $i - 1);
            }

            return $id;
        }
    }

?>

answered 6 years ago richard #16

public class UrlShortener {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final int    BASE     = ALPHABET.length();

    public static String encode(int num) {
        StringBuilder sb = new StringBuilder();
        while ( num > 0 ) {
            sb.append( ALPHABET.charAt( num % BASE ) );
            num /= BASE;
        }
        return sb.reverse().toString();   
    }

    public static int decode(String str) {
        int num = 0;
        for ( int i = 0; i < str.length(); i++ )
            num = num * BASE + ALPHABET.indexOf(str.charAt(i));
        return num;
    }   
}

answered 5 years ago user1477388 #17

C# version:

public class UrlShortener 
{
    private static String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static int    BASE     = 62;

    public static String encode(int num)
    {
        StringBuilder sb = new StringBuilder();

        while ( num > 0 )
        {
            sb.Append( ALPHABET[( num % BASE )] );
            num /= BASE;
        }

        StringBuilder builder = new StringBuilder();
        for (int i = sb.Length - 1; i >= 0; i--)
        {
            builder.Append(sb[i]);
        }
        return builder.ToString(); 
    }

    public static int decode(String str)
    {
        int num = 0;

        for ( int i = 0, len = str.Length; i < len; i++ )
        {
            num = num * BASE + ALPHABET.IndexOf( str[(i)] ); 
        }

        return num;
    }   
}

answered 3 years ago Graham #18

I have a variant of the problem, in that I store web pages from many different authors and need to prevent discovery of pages by guesswork. So my short URLs add a couple of extra digits to the Base-62 string for the page number. These extra digits are generated from information in the page record itself and they ensure that only 1 in 3844 URLs are valid (assuming 2-digit Base-62). You can see an outline description at http://mgscan.com/MBWL.

answered 3 years ago Jerry Jacobs #19

Very good answer, I have created a Golang implementation of the bjf:

package bjf

import (
    "math"
    "strings"
    "strconv"
)

const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

func Encode(num string) string {
    n, _ := strconv.ParseUint(num, 10, 64)
    t := make([]byte, 0)

    /* Special case */
    if n == 0 {
        return string(alphabet[0])
    }

    /* Map */
    for n > 0 {
        r := n % uint64(len(alphabet))
        t = append(t, alphabet[r])
        n = n / uint64(len(alphabet))
    }

    /* Reverse */
    for i, j := 0, len(t) - 1; i < j; i, j = i + 1, j - 1 {
        t[i], t[j] = t[j], t[i]
    }

    return string(t)
}

func Decode(token string) int {
    r := int(0)
    p := float64(len(token)) - 1

    for i := 0; i < len(token); i++ {
        r += strings.Index(alphabet, string(token[i])) * int(math.Pow(float64(len(alphabet)), p))
        p--
    }

    return r
}

Hosted at github: https://github.com/xor-gate/go-bjf

answered 2 years ago Hrishikesh Mishra #20

/**
 * <p>
 *     Integer to character and vice-versa
 * </p>
 *  
 */
public class TinyUrl {

    private final String characterMap = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private final int charBase = characterMap.length();

    public String covertToCharacter(int num){
        StringBuilder sb = new StringBuilder();

        while (num > 0){
            sb.append(characterMap.charAt(num % charBase));
            num /= charBase;
        }

        return sb.reverse().toString();
    }

    public int covertToInteger(String str){
        int num = 0;
        for(int i = 0 ; i< str.length(); i++)
            num += characterMap.indexOf(str.charAt(i)) * Math.pow(charBase , (str.length() - (i + 1)));

        return num;
    }
}

class TinyUrlTest{

    public static void main(String[] args) {
        TinyUrl tinyUrl = new TinyUrl();
        int num = 122312215;
        String url = tinyUrl.covertToCharacter(num);
        System.out.println("Tiny url:  " + url);
        System.out.println("Id: " + tinyUrl.covertToInteger(url));
    }
}

answered 2 years ago wyx #21

My python3 version

base_list = list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
base = len(base_list)


def encode(num: int):
    result = []
    if num == 0:
        result.append(base_list[0])

    while num > 0:
        result.append(base_list[num % base])
        num //= base

    print("".join(reversed(result)))


def decode(code: str):
    num = 0
    code_list = list(code)
    for index, code in enumerate(reversed(code_list)):
        num += base_list.index(code) * base ** index
    print(num)


if __name__ == '__main__':
    encode(341413134141)
    decode("60FoItT")

answered 2 years ago Hafiz Arslan #22

Here is Node.js implementation that is likely to bit.ly. generate highly random 7 character string. using Node.js crypto to generate highly random 25 charset than random select 7 character.

var crypto = require("crypto");
exports.shortURL = new function () {
    this.getShortURL = function () {
        var sURL = '',
            _rand = crypto.randomBytes(25).toString('hex'),
            _base = _rand.length;
        for (var i = 0; i < 7; i++)
            sURL += _rand.charAt(Math.floor(Math.random() * _rand.length));
        return sURL;
    };
}

answered 1 year ago fisch2 #23

For a quality NodeJS / Javascript solution, see the id-shortener module, which is thoroughly tested and has been used in production for months.

It provides an efficient id / url shortener backed by pluggable storage defaulting to redis, and you can even customize your short id character set and whether or not shortening is idempotent. This is an important distinction that not all URL shorteners take into account.

In relation to other answers here, this module implements the Marcel Jackwerth's excellent accepted answer above.

The core of the solution is provided by the following Redis Lua snippet:

local sequence = redis.call('incr', KEYS[1])

local chars = '0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz'
local remaining = sequence
local slug = ''

while (remaining > 0) do
  local d = (remaining % 60)
  local character = string.sub(chars, d + 1, d + 1)

  slug = character .. slug
  remaining = (remaining - d) / 60
end

redis.call('hset', KEYS[2], slug, ARGV[1])

return slug

answered 8 months ago adrift #24

Implementation in Scala:

class Encoder(alphabet: String) extends (Long => String) {

  val Base = alphabet.size

  override def apply(number: Long) = {
    def encode(current: Long): List[Int] = {
      if (current == 0) Nil
      else (current % Base).toInt :: encode(current / Base)
    }
    encode(number).reverse
      .map(current => alphabet.charAt(current)).mkString
  }
}

class Decoder(alphabet: String) extends (String => Long) {

  val Base = alphabet.size

  override def apply(string: String) = {
    def decode(current: Long, encodedPart: String): Long = {
      if (encodedPart.size == 0) current
      else decode(current * Base + alphabet.indexOf(encodedPart.head),encodedPart.tail)
    }
    decode(0,string)
  }
}

Test example with Scala test:

import org.scalatest.{FlatSpec, Matchers}

class DecoderAndEncoderTest extends FlatSpec with Matchers {

  val Alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

  "A number with base 10" should "be correctly encoded into base 62 string" in {
    val encoder = new Encoder(Alphabet)
    encoder(127) should be ("cd")
    encoder(543513414) should be ("KWGPy")
  }

  "A base 62 string" should "be correctly decoded into a number with base 10" in {
    val decoder = new Decoder(Alphabet)
    decoder("cd") should be (127)
    decoder("KWGPy") should be (543513414)
  }

}

answered 6 months ago Firas Omrane #25

A node js and mongodb solution

Since we know the format that mongodb uses to create a new ObjectId with 12 bytes.

  • a 4-byte value representing the seconds since the Unix epoch,
  • a 3-byte machine identifier,
  • a 2-byte process id
  • a 3-byte counter (in your machine), starting with a random value.

Example (I choose a random sequence) a1b2c3d4e5f6g7h8i9j1k2l3

  • a1b2c3d4 represents the seconds since the Unix epoch,
  • 4e5f6g7 represents machine identifier,
  • h8i9 represents process id
  • j1k2l3 represents the counter, starting with a random value.

Since the counter will be unique if we are storing the data in the same machine we can get it with no doubts that it will be duplicate.

So the short URL will be the counter and here is a code snippet assuming that your server is running properly.

const mongoose = require('mongoose');
const Schema = mongoose.Schema;

// create a schema
const shortUrl = new Schema({
    long_url: { type: String, required: true },
    short_url: { type: String, required: true, unique: true },
  });
const ShortUrl = mongoose.model('ShortUrl', shortUrl);

//The user can request to get a short URL by providing a long URL using a form

app.post('/shorten', function(req ,res){
    //create a new shortUrl*/
    //the submit form has an input with longURL as its name attribute. 
    const longUrl = req.body["longURL"];
    const newUrl = ShortUrl({
        long_url : longUrl,
        short_url : "",
    });
    const shortUrl = newUrl._id.toString().slice(-6);
    newUrl.short_url = shortUrl;
    console.log(newUrl);
    newUrl.save(function(err){
        console.log("the new url is added");
    })
});

answered 6 months ago AJcodez #26

I keep incrementing a integer sequence per domain in the database and use Hashids to encode the integer into a URL path.

static hashids = Hashids(salt = "my app rocks", minSize = 6)

I ran a script to see how long it takes until it exhausts the character length. For 6 characters it can do 164,916,224 links and then goes up to 7 characters. Bitly uses 7 characters. Under 5 characters looks weird to me.

Hashids can decode the URL path back to a integer but a simpler solution is to use the entire short link sho.rt/ka8ds3 as a primary key.

Here is the full concept:

function addDomain(domain) {
    table("domains").insert("domain", domain, "seq", 0)
}

function addURL(domain, longURL) {
    seq = table("domains").where("domain = ?", domain).increment("seq")
    shortURL = domain + "/" + hashids.encode(seq)
    table("links").insert("short", shortURL, "long", longURL)
    return shortURL
}

// GET /:hashcode
function handleRequest(req, res) {
    shortURL = req.host + "/" + req.param("hashcode")
    longURL = table("links").where("short = ?", shortURL).get("long")
    res.redirect(301, longURL)
}

answered 4 months ago Luis Neighbur #27

Function based in Xeoncross Class

function shortly($input){
$dictionary = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','0','1','2','3','4','5','6','7','8','9'];
if($input===0)
    return $dictionary[0];
$base = count($dictionary);
if(is_numeric($input)){
    $result = [];
    while($input > 0){
        $result[] = $dictionary[($input % $base)];
        $input = floor($input / $base);
    }
    return join("", array_reverse($result));
}
$i = 0;
$input = str_split($input);
foreach($input as $char){
    $pos = array_search($char, $dictionary);
    $i = $i * $base + $pos;
}
return $i;
}

comments powered by Disqus