Peter Bengtsson: Update to speed comparison for Redis vs PostgreSQL storing blobs of JSON
(published: Sept. 30, 2019, 3:06 a.m.)
Last week, I blogged about "How much faster is Redis at storing a blob of JSON compared to PostgreSQL?". Judging from a lot of comments, people misinterpreted this. (By the way, Redis is persistent). It's no surprise that Redis is faster.
However, it's a fact that I have do have a lot of blobs stored and need to present them via the web API as fast as possible. It's rare that I want to do relational or batch operations on the data. But Redis isn't a slam dunk for simple retrieval because I don't know if I trust its integrity with the 3GB worth of data that I both don't want to lose and don't want to load all into RAM.
But is it entirely wrong to look at WHICH database to get the best speed?
Reviewing this corner of Song Search helped me rethink this. PostgreSQL is, in my view, a better database for storing stuff. Redis is faster for individual lookups. But you know what's even faster? Nginx
The way the application works is that a React web app is requesting the Amazon product data for the sake of presenting an appropriate affiliate link. This is done by the browser essentially doing:
const response = await fetch('https://songsear.ch/api/song/5246889/amazon');
Internally, in the app, what it does is that it looks this up, by ID, on the
AmazonAffiliateLookup ORM model. Suppose it wasn't there in the PostgreSQL, it uses the Amazon Affiliate Product Details API, to look it up and when the results come in it stores a copy of this in PostgreSQL so we can re-use this URL without hitting rate limits on the Product Details API. Lastly, in a piece of Django view code, it carefully scrubs and repackages this result so that only the fields used by the React rendering code is shipped between the server and the browser. That "scrubbed" piece of data is actually much smaller. Partly because it limits the results to the first/best match and it deletes a bunch of things that are never needed such as
TrackSequence etc. The proportion is roughly 23x. I.e. of the 3GB of JSON blobs stored in PostgreSQL only 130MB is ever transported from the server to the users.
Nginx has a built in reverse HTTP proxy cache which is easy to set up but a bit hard to do purges on. The biggest flaw, in my view, is that it's hard to get a handle of how much RAM this it's eating up. Well, if the total possible amount of data within the server is 130MB, then that is something I'm perfectly comfortable to let Nginx handle cache in RAM.
Good HTTP performance benchmarking is hard to do but here's a teaser from my local laptop version of Nginx:
▶ hey -n 10000 -c 10 https://songsearch.local/api/song/1810960/affiliate/amazon-itunes Summary: Total: 0.9882 secs Slowest: 0.0279 secs Fastest: 0.0001 secs Average: 0.0010 secs Requests/sec: 10119.8265 Response time histogram: 0.000  | 0.003  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.006  | 0.008  | 0.011  | 0.014  | 0.017  | 0.020  | 0.022  | 0.025  | 0.028  | Latency distribution: 10% in 0.0003 secs 25% in 0.0006 secs 50% in 0.0008 secs 75% in 0.0010 secs 90% in 0.0013 secs 95% in 0.0016 secs 99% in 0.0068 secs Details (average, fastest, slowest): DNS+dialup: 0.0000 secs, 0.0001 secs, 0.0279 secs DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0026 secs req write: 0.0000 secs, 0.0000 secs, 0.0011 secs resp wait: 0.0008 secs, 0.0001 secs, 0.0206 secs resp read: 0.0001 secs, 0.0000 secs, 0.0013 secs Status code distribution:  10000 responses
10,000 requests across 10 clients at rougly 10,000 requests per second. That includes doing all the HTTP parsing, WSGI stuff, forming of a SQL or Redis query, the deserialization, the Django JSON HTTP response serialization etc. The cache TTL is controlled by simply setting a
Cache-Control HTTP header with something like
Now, repeated fetches for this are cached at the Nginx level and it means it doesn't even matter how slow/fast the database is. As long as it's not taking seconds, with a long
Cache-Control, Nginx can hold on to this in RAM for days or until the whole server is restarted (which is rare).
If you the total amount of data that can and will be cached is controlled, putting it in a HTTP reverse proxy cache is probably order of magnitude faster than messing with chosing which database to use.