TIL from this video that Python’s urllib.parse.urlparse is quite slow at parsing
URLs. I’ve always used urlparse to destructure URLs and didn’t know that the standard
library ships a faster alternative, which the official documentation also recommends.
The urlparse function splits a supplied URL into multiple separate components and returns
a ParseResult object. Consider this example:
In [1]: from urllib.parse import urlparse
In [2]: url = "https://httpbin.org/get?q=hello&r=22"
In [3]: urlparse(url)
Out[3]: ParseResult(
scheme='https', netloc='httpbin.org',
path='/get', params='', query='q=hello&r=22',
fragment=''
)
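Since ParseResult is a named tuple, its fields can be read as attributes or unpacked positionally, and the raw query string can be destructured further with urllib.parse.parse_qs. A small sketch using the same example URL as above:

```python
from urllib.parse import urlparse, parse_qs

url = "https://httpbin.org/get?q=hello&r=22"
result = urlparse(url)

# ParseResult is a named tuple, so components are attributes...
print(result.netloc)  # httpbin.org
print(result.path)    # /get

# ...and the whole result can be unpacked positionally.
scheme, netloc, path, params, query, fragment = result

# parse_qs turns the query string into a dict of value lists.
print(parse_qs(result.query))  # {'q': ['hello'], 'r': ['22']}

# geturl() reassembles the URL from its parts.
print(result.geturl())  # https://httpbin.org/get?q=hello&r=22
```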
You can see how the function disassembles the URL and builds a ParseResult object from the
URL components. Along with this, the urlparse function can also parse an obscure kind of
URL component that you’ll most likely never need. If you look closely at the previous example,
you’ll see that there’s a params field in the ParseResult object. This params
component gets parsed whether you need it or not, and that adds some overhead. The params
field will be populated if you have a URL like this:
In [1]: from urllib.parse import urlparse
In [2]: url = "https://httpbin.org/get;a=mars&b=42?q=hello&r=22"
In [3]: urlparse(url)
Out[3]: ParseResult(
scheme='https', netloc='httpbin.org', path='/get',
params='a=mars&b=42', query='q=hello&r=22', fragment=''
)
Notice the part of the URL that appears after https://httpbin.org/get. There’s a
semicolon followed by a few more parameters: ;a=mars&b=42. The resulting
ParseResult now has the params field populated with the parsed value
a=mars&b=42. Unless you need this params support, there’s a better and faster alternative
in the standard library. The urlsplit function does the same thing as
urlparse minus the params parsing, and is roughly twice as fast. Here’s how you’d use urlsplit:
In [1]: from urllib.parse import urlsplit
In [2]: url = "https://httpbin.org/get?q=hello&r=22"
In [3]: urlsplit(url)
Out[3]: SplitResult(
scheme='https', netloc='httpbin.org', path='/get',
query='q=hello&r=22', fragment=''
)
The urlsplit function returns a SplitResult object similar to the ParseResult object
you’ve seen before. Notice that there’s no params field in the output here. I measured the
speed difference like this:
In [1]: from urllib.parse import urlparse, urlsplit
In [2]: url = "https://httpbin.org/get?q=hello&r=22"
In [3]: %timeit urlparse(url)
1.7 µs ± 2.91 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [4]: %timeit urlsplit(url)
885 ns ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Wow, that’s almost a 2x speed improvement. This shouldn’t be much of an issue in a typical codebase, but it can matter if you’re parsing URLs in a critical hot path.
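If you want to reproduce the comparison outside an IPython session, the standard library’s timeit module works just as well. A minimal sketch; absolute numbers will vary by machine and Python version:

```python
import timeit

setup = "from urllib.parse import urlparse, urlsplit"
url = "https://httpbin.org/get?q=hello&r=22"
n = 100_000

# Time both parsers over the same URL and report the per-call cost.
for func in ("urlparse", "urlsplit"):
    total = timeit.timeit(f"{func}({url!r})", setup=setup, number=n)
    print(f"{func}: {total / n * 1e9:.0f} ns per call")
```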