2023-12-31 02:54:51 +05:30
19 changed files with 212 additions and 0 deletions
--- a/bench/results/README.md
+++ b/bench/results/README.md
@@ -0,0 +1,212 @@
+# Benchmark Report
+
+Benchmarks were run at various stages of development to keep track of
+performance. Tech stacks were changed and the implementation was
+optimized to increase throughput. This report summarizes the findings
+of those benchmarks.
+
+Ultimately, we were able to identify a bottleneck that was previously
+hidden in mCaptcha (hidden because a different bottleneck like DB access
+eclipsed it :p) [and were able to increase performance of the critical
+path by ~147 times](https://git.batsense.net/mCaptcha/dcache/pulls/3)
+through a trivial optimization.
+
+## Environment
+
+These benchmarks were run on a noisy development laptop and should be
+used for guidance only. 
+
+- CPU: AMD Ryzen 5 5600U with Radeon Graphics (12) @ 4.289GHz
+- Memory: 22849MiB
+- OS: Arch Linux x86_64
+- Kernel: 6.6.7-arch1-1
+- rustc: 1.73.0 (cc66ad468 2023-10-03)
+
+## Baseline: Tech stack version 1
+
+Actix Web based networking with JSON as the message format. This stack
+was chosen for prototyping and was later used to set a baseline.
+
+## Without connection pooling in server-to-server communications
+
+### Single requests (no batching)
+
+
+<details>
+
+
+<summary>Peak throughput observed was 1117 requests/second (please click
+to see charts)</summary>
+
+
+#### Total number of requests vs time
+
+![number of requests](./v1/nopooling/nopipelining/total_requests_per_second_1703969194.png)
+
+#### Response times(ms) vs time
+
+![response times(ms)](<./v1/nopooling/nopipelining/response_times_(ms)_1703969194.png>)
+
+#### Number of concurrent users vs time
+
+![number of concurrent
+users](./v1/nopooling/nopipelining/number_of_users_1703969194.png)
+
+
+</details>
+
+### Batched requests
+
+<details>
+<summary>
+Each network request contained 1,000 application requests, so peak throughput observed was 1,800 requests/second.
+Please click to see charts</summary>
+
+
+#### Total number of requests vs time
+
+![number of requests](./v1/nopooling/pipelining/total_requests_per_second_1703969381.png)
+
+#### Response times(ms) vs time
+
+![response times(ms)](<./v1/nopooling/pipelining/response_times_(ms)_1703969381.png>)
+
+#### Number of concurrent users vs time
+
+![number of concurrent
+users](./v1/nopooling/pipelining/number_of_users_1703969381.png)
+
+
+</details>
+
+## With connection pooling in server-to-server communications
+
+
+### Single requests (no batching)
+
+<details>
+<summary>
+Peak throughput observed was 3904 requests/second. Please click to see
+charts.</summary>
+
+
+#### Total number of requests vs time
+
+![number of requests](./v1/pooling/nopipelining/total_requests_per_second_1703968214.png)
+
+#### Response times(ms) vs time
+
+![response times(ms)](<./v1/pooling/nopipelining/response_times_(ms)_1703968215.png>)
+
+#### Number of concurrent users vs time
+
+![number of concurrent
+users](./v1/pooling/nopipelining/number_of_users_1703968215.png)
+
+
+</details>
+
+### Batched requests
+
+
+<details>
+<summary>
+Each network request contained 1,000 application requests, so peak throughput observed was 15,800 requests/second.
+Please click to see charts.
+</summary>
+
+
+#### Total number of requests vs time
+
+![number of requests](./v1/pooling/pipelining/total_requests_per_second_1703968582.png)
+
+#### Response times(ms) vs time
+
+![response times(ms)](<./v1/pooling/pipelining/response_times_(ms)_1703968582.png>)
+
+#### Number of concurrent users vs time
+
+![number of concurrent
+users](./v1/pooling/pipelining/number_of_users_1703968582.png)
+
+</details>
+
+
+## Tech stack version 2
+
+Tonic for the network stack and gRPC for the wire format. We ran over a
+dozen benchmarks with this tech stack. The trend was similar to the one
+observed above: throughput was higher when a connection pool was used
+and even higher when requests were batched. _But_ the throughput in all
+of these benchmarks was lower than in the baseline benchmarks!
+
+The CPU was busier. We profiled the server with
+[flamegraph](https://github.com/flamegraph-rs/flamegraph) while hitting
+it with the same test suite to identify compute-heavy areas. The result
+was unexpected:
+
+![flamegraph indicating libmcaptcha being
+slow](./v2/libmcaptcha-bottleneck/problem/flamegraph.svg)
+
+libmCaptcha's [AddVisitor
+handler](https://github.com/mCaptcha/libmcaptcha/blob/e3f456f35b2c9e55e0475b01b3e05d48b21fd51f/src/master/embedded/counter.rs#L124)
+was taking up 59% of the CPU time of the entire test run. This is a
+critical part of the variable difficulty factor PoW algorithm that
+mCaptcha uses. We never ran into this bottleneck before because in other
+cache implementations, it was always preceded by a database request.
+It surfaced here because dcache uses in-memory data sources.
+
+libmCaptcha uses an actor-based approach with message passing for clean
+concurrent state management. Message passing is generally faster, but in
+our case, sharing memory using the CPU's concurrency primitives turned
+out to be significantly faster:
+
+![flamegraph after the fix, with libmcaptcha no longer
+dominating](./v2/libmcaptcha-bottleneck/solution/flamegraph.svg)
+
+CPU time was reduced from 59% to 0.4%, a reduction of roughly 147 times!
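+
+To illustrate the difference between the two approaches, here is a
+minimal sketch that uses only the Rust standard library. It is not the
+libmCaptcha code: the `Msg` type and both function names are
+hypothetical, a plain thread with an `mpsc` channel stands in for the
+actor, and an atomic integer is assumed as one example of a CPU
+concurrency primitive. Both functions count the same number of visits;
+the actor version pays for a message and a reply per visit, while the
+atomic version performs a single `fetch_add`.
+
+```rust
+use std::sync::atomic::{AtomicU32, Ordering};
+use std::sync::mpsc;
+use std::sync::Arc;
+use std::thread;
+
+// Hypothetical message type for the actor-style counter.
+enum Msg {
+    // Add one visitor and send the new count back on the reply channel.
+    AddVisitor(mpsc::Sender<u32>),
+}
+
+// Actor style: one thread owns the count. Callers send a message and
+// block until the reply arrives.
+fn count_with_actor(workers: u32, visits_per_worker: u32) -> u32 {
+    let (tx, rx) = mpsc::channel::<Msg>();
+    let actor = thread::spawn(move || {
+        let mut count = 0u32;
+        // The loop ends once every sender has been dropped.
+        for msg in rx {
+            match msg {
+                Msg::AddVisitor(reply) => {
+                    count += 1;
+                    let _ = reply.send(count);
+                }
+            }
+        }
+        count
+    });
+
+    let handles: Vec<_> = (0..workers)
+        .map(|_| {
+            let tx = tx.clone();
+            thread::spawn(move || {
+                for _ in 0..visits_per_worker {
+                    let (reply_tx, reply_rx) = mpsc::channel();
+                    tx.send(Msg::AddVisitor(reply_tx)).unwrap();
+                    reply_rx.recv().unwrap();
+                }
+            })
+        })
+        .collect();
+    for handle in handles {
+        handle.join().unwrap();
+    }
+    drop(tx); // drop the last sender so the actor loop can finish
+    actor.join().unwrap()
+}
+
+// Shared-memory style: every worker increments the same atomic integer.
+fn count_with_atomic(workers: u32, visits_per_worker: u32) -> u32 {
+    let count = Arc::new(AtomicU32::new(0));
+    let handles: Vec<_> = (0..workers)
+        .map(|_| {
+            let count = Arc::clone(&count);
+            thread::spawn(move || {
+                for _ in 0..visits_per_worker {
+                    count.fetch_add(1, Ordering::Relaxed);
+                }
+            })
+        })
+        .collect();
+    for handle in handles {
+        handle.join().unwrap();
+    }
+    count.load(Ordering::Relaxed)
+}
+
+fn main() {
+    // Both approaches agree on the final count.
+    assert_eq!(count_with_actor(4, 1_000), 4_000);
+    assert_eq!(count_with_atomic(4, 1_000), 4_000);
+}
+```
+
+The sketch only shows where the per-request overhead comes from. It
+makes no claim about the exact speedup, which depends on the actor
+framework and the workload.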
+
+With this fix in place:
+
+
+### Connection pooled server-to-server communications, single requests (no batching)
+
+Peak throughput observed was 4816 requests/second, ~900 requests/second
+more than the connection-pooled baseline (3904 requests/second).
+
+
+#### Total number of requests vs time
+
+![number of requests](./v2/grpc-conn-pool-post-bottleneck/single/total_requests_per_second_1703970940.png)
+
+#### Response times(ms) vs time
+
+![response times(ms)](./v2/grpc-conn-pool-post-bottleneck/single/response_times_(ms)_1703970940.png)
+
+#### Number of concurrent users vs time
+
+![number of concurrent
+users](./v2/grpc-conn-pool-post-bottleneck/single/number_of_users_1703970940.png)
+
+
+### Connection pooled server-to-server communications, batched requests
+
+
+<details>
+<summary>
+Each network request contained 1,000 application requests, so peak throughput observed was 95,700 requests/second. This is six times higher than the baseline.
+Please click to see charts.
+</summary>
+
+
+#### Total number of requests vs time
+
+![number of requests](./v2/grpc-conn-pool-post-bottleneck/pipeline/total_requests_per_second_1703971082.png)
+
+#### Response times(ms) vs time
+
+![response times(ms)](./v2/grpc-conn-pool-post-bottleneck/pipeline/response_times_(ms)_1703971082.png)
+
+#### Number of concurrent users vs time
+
+![number of concurrent
+users](./v2/grpc-conn-pool-post-bottleneck/pipeline/number_of_users_1703971082.png)
+
+</details>
--- a/bench/results/v1/nopooling/nopipelining/number_of_users_1703969194.png
+++ b/bench/results/v1/nopooling/nopipelining/number_of_users_1703969194.png
--- a/bench/results/v1/nopooling/nopipelining/response_times_(ms)_1703969194.png
+++ b/bench/results/v1/nopooling/nopipelining/response_times_(ms)_1703969194.png
--- a/bench/results/v1/nopooling/nopipelining/total_requests_per_second_1703969194.png
+++ b/bench/results/v1/nopooling/nopipelining/total_requests_per_second_1703969194.png
--- a/bench/results/v1/nopooling/pipelining/number_of_users_1703969381.png
+++ b/bench/results/v1/nopooling/pipelining/number_of_users_1703969381.png
--- a/bench/results/v1/nopooling/pipelining/response_times_(ms)_1703969381.png
+++ b/bench/results/v1/nopooling/pipelining/response_times_(ms)_1703969381.png
--- a/bench/results/v1/nopooling/pipelining/total_requests_per_second_1703969381.png
+++ b/bench/results/v1/nopooling/pipelining/total_requests_per_second_1703969381.png
--- a/bench/results/v1/pooling/nopipelining/number_of_users_1703968215.png
+++ b/bench/results/v1/pooling/nopipelining/number_of_users_1703968215.png
--- a/bench/results/v1/pooling/nopipelining/response_times_(ms)_1703968215.png
+++ b/bench/results/v1/pooling/nopipelining/response_times_(ms)_1703968215.png
--- a/bench/results/v1/pooling/nopipelining/total_requests_per_second_1703968214.png
+++ b/bench/results/v1/pooling/nopipelining/total_requests_per_second_1703968214.png
--- a/bench/results/v1/pooling/pipelining/number_of_users_1703968582.png
+++ b/bench/results/v1/pooling/pipelining/number_of_users_1703968582.png
--- a/bench/results/v1/pooling/pipelining/response_times_(ms)_1703968582.png
+++ b/bench/results/v1/pooling/pipelining/response_times_(ms)_1703968582.png
--- a/bench/results/v1/pooling/pipelining/total_requests_per_second_1703968582.png
+++ b/bench/results/v1/pooling/pipelining/total_requests_per_second_1703968582.png
--- a/bench/results/v2/grpc-conn-pool-post-bottleneck/pipeline/number_of_users_1703971082.png
+++ b/bench/results/v2/grpc-conn-pool-post-bottleneck/pipeline/number_of_users_1703971082.png
--- a/bench/results/v2/grpc-conn-pool-post-bottleneck/pipeline/response_times_(ms)_1703971082.png
+++ b/bench/results/v2/grpc-conn-pool-post-bottleneck/pipeline/response_times_(ms)_1703971082.png
--- a/bench/results/v2/grpc-conn-pool-post-bottleneck/pipeline/total_requests_per_second_1703971082.png
+++ b/bench/results/v2/grpc-conn-pool-post-bottleneck/pipeline/total_requests_per_second_1703971082.png
--- a/bench/results/v2/grpc-conn-pool-post-bottleneck/single/number_of_users_1703970940.png
+++ b/bench/results/v2/grpc-conn-pool-post-bottleneck/single/number_of_users_1703970940.png
--- a/bench/results/v2/grpc-conn-pool-post-bottleneck/single/response_times_(ms)_1703970940.png
+++ b/bench/results/v2/grpc-conn-pool-post-bottleneck/single/response_times_(ms)_1703970940.png
--- a/bench/results/v2/grpc-conn-pool-post-bottleneck/single/total_requests_per_second_1703970940.png
+++ b/bench/results/v2/grpc-conn-pool-post-bottleneck/single/total_requests_per_second_1703970940.png