TIL from this video [1] that Python’s urllib.parse.urlparse [2] is quite slow at parsing URLs. I’ve always used urlparse to destructure URLs and didn’t know that the standard library has a faster alternative. The official documentation also recommends the alternative function.

The urlparse function splits a supplied URL into separate components and returns a ParseResult object. Consider this example:

In [1]: from urllib.parse import urlparse

In [2]: url = "https://httpbin.org/get?q=hello&r=22"

In [3]: urlparse(url)
Out[3]: ParseResult(
        scheme='https', netloc='httpbin.org',
        path='/get', params='', query='q=hello&r=22',
        fragment=''
    )

You can see how the function disassembles the URL and builds a ParseResult object from its components. The urlparse function can also parse an obscure kind of URL that you’ll most likely never need. If you look closely at the previous example, you’ll see a params field in the ParseResult object. This field gets parsed whether you need it or not, and that adds some overhead. The params field is populated if you have a URL like this:

In [1]: from urllib.parse import urlparse

In [2]: url = "https://httpbin.org/get;a=mars&b=42?q=hello&r=22"

In [3]: urlparse(url)
Out[3]: ParseResult(
    scheme='https', netloc='httpbin.org', path='/get',
    params='a=mars&b=42', query='q=hello&r=22', fragment=''
    )

Notice the part of the URL that appears after https://httpbin.org/get: a semicolon followed by a few more parameters, ;a=mars&b=42. The resulting ParseResult now has the params field populated with the value a=mars&b=42. Unless you need this params support, there’s a better and faster alternative in the standard library. The urlsplit [3] function does the same thing as urlparse minus the params parsing and, in my measurement below, is about twice as fast. Here’s how you’d use urlsplit:

In [1]: from urllib.parse import urlsplit

In [2]: url = "https://httpbin.org/get?q=hello&r=22"

In [3]: urlsplit(url)
Out[3]: SplitResult(
    scheme='https', netloc='httpbin.org', path='/get',
    query='q=hello&r=22', fragment=''
    )
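One caveat worth sketching when switching over: since urlsplit doesn’t split off params, a semicolon-delimited segment stays inside path. The snippet below reuses the URL from the earlier params example; the helper name compare_paths is made up for illustration and isn’t part of urllib.

```python
from urllib.parse import urlparse, urlsplit

# Same URL as in the earlier params example.
URL_WITH_PARAMS = "https://httpbin.org/get;a=mars&b=42?q=hello&r=22"


def compare_paths(url):
    """Return the path as seen by urlparse and by urlsplit (hypothetical helper)."""
    return urlparse(url).path, urlsplit(url).path


parsed_path, split_path = compare_paths(URL_WITH_PARAMS)
print(parsed_path)  # /get
print(split_path)   # /get;a=mars&b=42
```

So if your code relies on path being free of the ;a=mars&b=42 part, urlsplit is not a drop-in replacement for that kind of URL.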

The urlsplit function returns a SplitResult object similar to the ParseResult object you’ve seen before. Notice that SplitResult has no params field. I measured the speed difference like this:

In [1]: from urllib.parse import urlparse, urlsplit

In [2]: url = "https://httpbin.org/get?q=hello&r=22"

In [3]: %timeit urlparse(url)
1.7 µs ± 2.91 ns per loop (
    mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [4]: %timeit urlsplit(url)
885 ns ± 10.9 ns per loop (
    mean ± std. dev. of 7 runs, 1,000,000 loops each)

Wow, that’s almost a 2x speed improvement. This shouldn’t be much of an issue in a typical codebase, but it can matter if you’re parsing URLs in a critical hot path.
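If you’d like to repeat the comparison outside IPython, here’s a rough sketch using the standard timeit module. The helper best_time_ns and its number/repeat defaults are my own choices for illustration, and the absolute numbers will differ by machine and Python version.

```python
import timeit
from urllib.parse import urlparse, urlsplit

URL = "https://httpbin.org/get?q=hello&r=22"


def best_time_ns(func, url, number=100_000, repeat=5):
    """Return the best per-call time in nanoseconds over `repeat` runs.

    Hypothetical helper: the lambda wrapper adds a small constant
    overhead to both measurements, so treat the result as a relative
    comparison rather than an exact cost.
    """
    runs = timeit.repeat(lambda: func(url), number=number, repeat=repeat)
    return min(runs) / number * 1e9


if __name__ == "__main__":
    for func in (urlparse, urlsplit):
        print(f"{func.__name__}: {best_time_ns(func, URL):.0f} ns per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; it reports the fastest observed run rather than the mean that %timeit prints.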
