feat: benchmark report
Aravinth Manivannan 2023-12-31 02:54:07 +05:30
parent f20d044537
commit a82b9044d5
Signed by: realaravinth
GPG key ID: F8F50389936984FF
19 changed files with 212 additions and 0 deletions

bench/results/README.md Normal file

@ -0,0 +1,212 @@
# Benchmark Report
Benchmarks were run at various stages of development to keep track of
performance. Tech stacks were changed and the implementation optimized
to increase throughput. This report summarizes the findings of the
benchmarks.

Ultimately, we were able to identify a bottleneck that was previously
hidden in mCaptcha (hidden because a different bottleneck like DB access
eclipsed it :p) [and were able to increase performance of the critical
path by ~147 times](https://git.batsense.net/mCaptcha/dcache/pulls/3)
through a trivial optimization.
## Environment
These benchmarks were run on a noisy development laptop and should be
used for guidance only.
- CPU: AMD Ryzen 5 5600U with Radeon Graphics (12) @ 4.289GHz
- Memory: 22849MiB
- OS: Arch Linux x86_64
- Kernel: 6.6.7-arch1-1
- rustc: 1.73.0 (cc66ad468 2023-10-03)
## Baseline: Tech stack version 1
Actix Web-based networking with JSON as the message format. This stack
was chosen for prototyping and was later used to set a baseline.
## Without connection pooling in server-to-server communications
### Single requests (no batching)
<details>
<summary>Peak throughput observed was 1117 requests/second (please click
to see charts)</summary>

#### Total number of requests vs time

![number of requests](./v1/nopooling/nopipelining/total_requests_per_second_1703969194.png)

#### Response times(ms) vs time

![response times(ms)](<./v1/nopooling/nopipelining/response_times_(ms)_1703969194.png>)

#### Number of concurrent users vs time

![number of concurrent
users](./v1/nopooling/nopipelining/number_of_users_1703969194.png)

</details>
### Batched requests
<details>
<summary>
Each network request contained 1,000 application requests, so peak throughput observed was 1,800 requests/second.
Please click to see charts.</summary>

#### Total number of requests vs time

![number of requests](./v1/pooling/pipelining/total_requests_per_second_1703968582.png)

#### Response times(ms) vs time

![response times(ms)](<./v1/pooling/pipelining/response_times_(ms)_1703968582.png>)

#### Number of concurrent users vs time

![number of concurrent
users](./v1/pooling/pipelining/number_of_users_1703968582.png)

</details>
## With connection pooling in server-to-server communications
### Single requests (no batching)
<details>
<summary>
Peak throughput observed was 3904 requests/second. Please click to see
charts.</summary>

#### Total number of requests vs time

![number of requests](./v1/pooling/nopipelining/total_requests_per_second_1703968214.png)

#### Response times(ms) vs time

![response times(ms)](<./v1/pooling/nopipelining/response_times_(ms)_1703968215.png>)

#### Number of concurrent users vs time

![number of concurrent
users](./v1/pooling/nopipelining/number_of_users_1703968215.png)

</details>
### Batched requests
<details>
<summary>
Each network request contained 1,000 application requests, so peak throughput observed was 15,800 requests/second.
Please click to see charts.
</summary>

#### Total number of requests vs time

![number of requests](./v1/pooling/pipelining/total_requests_per_second_1703968582.png)

#### Response times(ms) vs time

![response times(ms)](<./v1/pooling/pipelining/response_times_(ms)_1703968582.png>)

#### Number of concurrent users vs time

![number of concurrent
users](./v1/pooling/pipelining/number_of_users_1703968582.png)

</details>
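
The batched runs above pack 1,000 application requests into each network request. A minimal sketch of that grouping follows. `AppRequest`, `into_batches` and `application_throughput` are hypothetical names used only for illustration; the actual dcache message types and batching code are not shown in this report. The throughput helper assumes, as the wording above suggests, that the reported figures count application requests rather than network requests.

```rust
/// Hypothetical application-level request (an assumption for this sketch;
/// not the real dcache message type).
#[derive(Debug, Clone, PartialEq)]
struct AppRequest {
    id: u64,
}

/// Group application requests into batches of at most `batch_size`,
/// so that each batch can be sent as a single network request.
fn into_batches(requests: &[AppRequest], batch_size: usize) -> Vec<Vec<AppRequest>> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    requests
        .chunks(batch_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}

/// Convert a network-request rate into an application-request rate
/// for a fixed batch size.
fn application_throughput(network_requests_per_sec: f64, batch_size: usize) -> f64 {
    network_requests_per_sec * batch_size as f64
}
```

Under that assumption, 2,500 queued requests with a batch size of 1,000 become three network requests (1,000 + 1,000 + 500), and a rate of 15.8 network requests/second corresponds to 15,800 application requests/second.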
## Tech stack version 2
Tonic for the network stack and gRPC for the wire format. We ran over a
dozen benchmarks with this tech stack. The trend was similar to the one
observed above: throughput was higher when a connection pool was used and
even higher when requests were batched. _But_ the throughput in all of
these benchmarks was lower than in the baseline benchmarks!
The CPU was also busier. We profiled the server with
[flamegraph](https://github.com/flamegraph-rs/flamegraph) while hitting it
with the same test suite to identify compute-heavy areas. The result was
unexpected:
![flamegraph indicating libmcaptcha being
slow](./v2/libmcaptcha-bottleneck/problem/flamegraph.svg)
libmCaptcha's [AddVisitor
handler](https://github.com/mCaptcha/libmcaptcha/blob/e3f456f35b2c9e55e0475b01b3e05d48b21fd51f/src/master/embedded/counter.rs#L124)
was taking up 59% of the CPU time of the entire test run. This is a
critical part of the variable difficulty factor PoW algorithm that
mCaptcha uses. We never ran into this bottleneck before because in other
cache implementations it was always preceded by a database request.
It surfaced here because dcache uses in-memory data sources.
libmCaptcha uses an actor-based approach with message passing for clean
concurrent state management. Message passing is generally the faster
option, but in our case, sharing memory using the CPU's concurrency
primitives turned out to be significantly faster:
![flamegraph after the libmcaptcha
fix](./v2/libmcaptcha-bottleneck/solution/flamegraph.svg)
The handler's share of CPU time dropped from 59% to 0.4%, a reduction of
roughly 147 times (59 / 0.4 ≈ 147)!
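
To make the two styles concrete, here is a minimal sketch that counts visitors both ways using only the Rust standard library. It is an illustration, not the libmCaptcha code or the actual fix: the function names, the plain `u32` counter and the choice of `AtomicU32` are all assumptions made for this example.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Message-passing style: one "actor" thread owns the counter, and every
/// increment is a message to it followed by a reply on a fresh channel.
fn count_with_messages(workers: u32, visits_per_worker: u32) -> u32 {
    let (tx, rx) = mpsc::channel::<mpsc::Sender<u32>>();

    // The actor handles one message at a time and replies with the new count.
    let actor = thread::spawn(move || {
        let mut visitors = 0u32;
        for reply_to in rx {
            visitors += 1;
            // The requester may have gone away; ignore a failed reply.
            let _ = reply_to.send(visitors);
        }
        visitors
    });

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                for _ in 0..visits_per_worker {
                    let (reply_tx, reply_rx) = mpsc::channel();
                    tx.send(reply_tx).unwrap();
                    reply_rx.recv().unwrap();
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    drop(tx); // Close the channel so the actor's loop ends.
    actor.join().unwrap()
}

/// Shared-memory style: every thread increments one atomic integer directly.
fn count_with_atomics(workers: u32, visits_per_worker: u32) -> u32 {
    let visitors = Arc::new(AtomicU32::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let visitors = Arc::clone(&visitors);
            thread::spawn(move || {
                for _ in 0..visits_per_worker {
                    visitors.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    visitors.load(Ordering::Relaxed)
}
```

Both functions return the same total for the same inputs. The message-passing version pays for a channel send, a wake-up of the actor thread and a reply on every increment, while the atomic version performs a single `fetch_add`, which is the kind of per-call overhead that can dominate once no database request sits in front of the handler.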
With this fix in place:
### Connection pooled server-to-server communications, single requests (no batching)
Peak throughput observed was 4816 requests/second, ~900 requests/second
more than the pooled, unbatched baseline (3904 requests/second).
#### Total number of requests vs time
![number of requests](./v2/grpc-conn-pool-post-bottleneck/single/total_requests_per_second_1703970940.png)
#### Response times(ms) vs time
![response times(ms)](./v2/grpc-conn-pool-post-bottleneck/single/response_times_(ms)_1703970940.png)
#### Number of concurrent users vs time
![number of concurrent
users](./v2/grpc-conn-pool-post-bottleneck/single/number_of_users_1703970940.png)
### Connection pooled server-to-server communications, batched requests
Each network request contained 1,000 application requests, so peak throughput observed was 95,700 requests/second. This is about six times the pooled, batched baseline (15,800 requests/second).
#### Total number of requests vs time
![number of requests](./v2/grpc-conn-pool-post-bottleneck/pipeline/total_requests_per_second_1703971082.png)
#### Response times(ms) vs time
![response times(ms)](./v2/grpc-conn-pool-post-bottleneck/pipeline/response_times_(ms)_1703971082.png)
#### Number of concurrent users vs time
![number of concurrent
users](./v2/grpc-conn-pool-post-bottleneck/pipeline/number_of_users_1703971082.png)
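
The headline ratios quoted in this report can be re-derived from the raw figures. The sketch below only restates numbers that already appear above; `ratio`, `COMPARISONS` and `summary` are hypothetical helper names, and the pairing of each v2 result with its pooled v1 counterpart is an assumption about which baseline is meant.

```rust
/// Ratio of one measurement to another.
fn ratio(numerator: f64, denominator: f64) -> f64 {
    numerator / denominator
}

/// Figures copied from this report: (label, numerator, denominator).
const COMPARISONS: [(&str, f64, f64); 3] = [
    ("pooled, single requests: v2 vs baseline", 4816.0, 3904.0),
    ("pooled, batched requests: v2 vs baseline", 95_700.0, 15_800.0),
    ("AddVisitor CPU share (%): before vs after", 59.0, 0.4),
];

/// One formatted line per comparison, with the ratio to two decimal places.
fn summary() -> Vec<String> {
    COMPARISONS
        .iter()
        .map(|(label, num, den)| format!("{}: {:.2}x", label, ratio(*num, *den)))
        .collect()
}
```

Printing each line of `summary()` gives ratios of about 1.23, 6.06 and 147.5, in line with the "~147 times" and "six times" figures quoted above.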
