# Benchmark Report

Benchmarks were run at various stages of development to keep track of
performance. Tech stacks were changed and the implementation was
optimized to increase throughput. This report summarizes the findings of
those benchmarks.

Ultimately, we identified a bottleneck that was previously hidden in
mCaptcha (hidden because a different bottleneck, such as DB access,
eclipsed it :p) [and were able to increase performance of the critical
path by ~147 times](https://git.batsense.net/mCaptcha/dcache/pulls/3)
through a trivial optimization.

## Environment

These benchmarks were run on a noisy development laptop and should be
used for guidance only. 

- CPU: AMD Ryzen 5 5600U with Radeon Graphics (12) @ 4.289GHz
- Memory: 22849MiB
- OS: Arch Linux x86_64
- Kernel: 6.6.7-arch1-1
- rustc: 1.73.0 (cc66ad468 2023-10-03)

## Baseline: Tech stack version 1

Actix Web based networking with JSON as the message format. It was
chosen for prototyping and was later used to set a baseline.

## Without connection pooling in server-to-server communications

### Single requests (no batching)


<details>


<summary>Peak throughput observed was 1117 requests/second (please click
to see charts)</summary>


#### Total number of requests vs time

![number of requests](./v1/nopooling/nopipelining/total_requests_per_second_1703969194.png)

#### Response times(ms) vs time

![response times(ms)](<./v1/nopooling/nopipelining/response_times_(ms)_1703969194.png>)

#### Number of concurrent users vs time

![number of concurrent
users](./v1/nopooling/nopipelining/number_of_users_1703969194.png)


</details>

### Batched requests

<details>
<summary>
Each network request contained 1,000 application requests, so peak throughput observed was 1,800 requests/second.
Please click to see charts.</summary>


#### Total number of requests vs time

![number of requests](./v1/pooling/pipelining/total_requests_per_second_1703968582.png)

#### Response times(ms) vs time

![response times(ms)](<./v1/pooling/pipelining/response_times_(ms)_1703968582.png>)

#### Number of concurrent users vs time

![number of concurrent
users](./v1/pooling/pipelining/number_of_users_1703968582.png)


</details>
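The batching used in these runs can be sketched roughly as follows. This is a minimal illustration and not dcache's actual code: `into_batches` and `application_throughput` are hypothetical helpers, and the second one assumes the load generator counts network requests, so that the application-level figure is the network rate multiplied by the batch size.

```rust
/// Groups application requests into batches of at most `batch_size`,
/// so that each batch can travel in a single network request.
/// (Hypothetical helper, for illustration only.)
pub fn into_batches<T: Clone>(requests: &[T], batch_size: usize) -> Vec<Vec<T>> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    requests
        .chunks(batch_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}

/// Application-level throughput implied by a network-level request rate,
/// assuming every network request carries a full batch.
pub fn application_throughput(network_requests_per_second: f64, batch_size: usize) -> f64 {
    network_requests_per_second * batch_size as f64
}
```

For example, `into_batches` splits 2,500 requests with a batch size of 1,000 into three batches, the last holding the remaining 500.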

## With connection pooling in server-to-server communications


### Single requests (no batching)

<details>
<summary>
Peak throughput observed was 3904 requests/second. Please click to see
charts.</summary>


#### Total number of requests vs time

![number of requests](./v1/pooling/nopipelining/total_requests_per_second_1703968214.png)

#### Response times(ms) vs time

![response times(ms)](<./v1/pooling/nopipelining/response_times_(ms)_1703968215.png>)

#### Number of concurrent users vs time

![number of concurrent
users](./v1/pooling/nopipelining/number_of_users_1703968215.png)


</details>
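To illustrate why pooling helps, here is a minimal sketch of a connection pool using only the standard library. It is not the pool used in dcache: `Connection` and `Pool` are hypothetical stand-ins, and "dialing" a connection is simulated by handing out a fresh id. The point is that a released connection is reused, so connection setup is paid once rather than on every request.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Stand-in for a real network connection; `id` only lets us observe reuse.
#[derive(Debug)]
pub struct Connection {
    pub id: usize,
}

/// Minimal pool: idle connections wait in a Vec behind a Mutex.
pub struct Pool {
    idle: Mutex<Vec<Connection>>,
    created: AtomicUsize,
}

impl Pool {
    pub fn new() -> Self {
        Pool {
            idle: Mutex::new(Vec::new()),
            created: AtomicUsize::new(0),
        }
    }

    /// Reuses an idle connection if one exists, otherwise "dials" a new one.
    pub fn acquire(&self) -> Connection {
        if let Some(conn) = self.idle.lock().unwrap().pop() {
            return conn;
        }
        // fetch_add returns the previous value, so add 1 for the new id.
        let id = self.created.fetch_add(1, Ordering::Relaxed) + 1;
        Connection { id }
    }

    /// Returns a connection to the pool so that later callers can reuse it.
    pub fn release(&self, conn: Connection) {
        self.idle.lock().unwrap().push(conn);
    }

    /// Number of connections that had to be created so far.
    pub fn connections_created(&self) -> usize {
        self.created.load(Ordering::Relaxed)
    }
}
```

Acquiring, releasing and acquiring again yields the same connection, and `connections_created()` stays at 1 until two connections are held at the same time.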

### Batched requests


<details>
<summary>
Each network request contained 1,000 application requests, so peak throughput observed was 15,800 requests/second.
Please click to see charts.
</summary>


#### Total number of requests vs time

![number of requests](./v1/pooling/pipelining/total_requests_per_second_1703968582.png)

#### Response times(ms) vs time

![response times(ms)](<./v1/pooling/pipelining/response_times_(ms)_1703968582.png>)

#### Number of concurrent users vs time

![number of concurrent
users](./v1/pooling/pipelining/number_of_users_1703968582.png)

</details>


## Tech stack version 2

Tonic for the network stack and gRPC for the wire format. We ran over a
dozen benchmarks with this tech stack. The trend was similar to the one
observed above: throughput was higher when a connection pool was used
and even higher when requests were batched. _But_ the throughput in all
of these benchmarks was lower than in the baseline benchmarks!

The CPU was busier. We profiled dcache with
[flamegraph](https://github.com/flamegraph-rs/flamegraph) while hitting
it with the same test suite to identify compute-heavy areas. The result
was unexpected:

![flamegraph indicating libmcaptcha being
slow](./v2/libmcaptcha-bottleneck/problem/flamegraph.svg)

libmCaptcha's [AddVisitor
handler](https://github.com/mCaptcha/libmcaptcha/blob/e3f456f35b2c9e55e0475b01b3e05d48b21fd51f/src/master/embedded/counter.rs#L124)
was taking up 59% of the CPU time of the entire test run. This is a very
critical part of the variable difficulty factor PoW algorithm that
mCaptcha uses. We never ran into this bottleneck before because in other
cache implementations, it was always preceded by a database request.
It surfaced here because dcache uses in-memory data sources.

libmCaptcha uses an actor-based approach with message passing for clean
concurrent state management. Message passing is faster in most cases,
but in our case, sharing memory using the CPU's concurrency primitives
turned out to be significantly faster:

![flamegraph after replacing message passing with shared
memory](./v2/libmcaptcha-bottleneck/solution/flamegraph.svg)

CPU time was reduced from 59% to 0.4%, a reduction of roughly 147 times!
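The difference between the two approaches can be sketched with the standard library alone. This is not libmCaptcha's code: the function names are hypothetical, a plain channel stands in for the actor mailbox, and the use of an atomic integer for the shared-memory variant is an assumption made for illustration. Both functions count the same number of visits. The first routes every increment through a message to a single owner thread, while the second lets each worker increment a shared atomic directly.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Message-passing style: one owner thread holds the counter and every
/// increment is a message sent over a channel.
pub fn count_with_messages(workers: u32, visits_per_worker: u32) -> u32 {
    let (tx, rx) = mpsc::channel::<()>();
    let owner = thread::spawn(move || {
        let mut visitors = 0u32;
        // The loop ends once every sender has been dropped.
        for _ in rx {
            visitors += 1;
        }
        visitors
    });
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                for _ in 0..visits_per_worker {
                    tx.send(()).expect("owner thread is alive");
                }
            })
        })
        .collect();
    // Drop the original sender so the owner loop can finish.
    drop(tx);
    for handle in handles {
        handle.join().unwrap();
    }
    owner.join().unwrap()
}

/// Shared-memory style: every worker increments the same atomic integer.
pub fn count_with_atomics(workers: u32, visits_per_worker: u32) -> u32 {
    let visitors = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let visitors = Arc::clone(&visitors);
            thread::spawn(move || {
                for _ in 0..visits_per_worker {
                    visitors.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    visitors.load(Ordering::Relaxed)
}
```

Both variants return the same total. The atomic version avoids allocating, queueing and dispatching a message per visit, which is the kind of overhead the flamegraph attributed to the handler.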

With this fix in place:


### Connection pooled server-to-server communications, single requests (no batching)

Peak throughput observed was 4816 requests/second, roughly 900
requests/second more than the connection-pooled baseline (3904
requests/second).


#### Total number of requests vs time

![number of requests](./v2/grpc-conn-pool-post-bottleneck/single/total_requests_per_second_1703970940.png)

#### Response times(ms) vs time

![response times(ms)](./v2/grpc-conn-pool-post-bottleneck/single/response_times_(ms)_1703970940.png)

#### Number of concurrent users vs time

![number of concurrent
users](./v2/grpc-conn-pool-post-bottleneck/single/number_of_users_1703970940.png)


### Connection pooled server-to-server communications, batched requests


<details>
<summary>
Each network request contained 1,000 application requests, so peak throughput observed was 95,700 requests/second. This is about six times higher than the baseline.
Please click to see charts.
</summary>


#### Total number of requests vs time

![number of requests](./v2/grpc-conn-pool-post-bottleneck/pipeline/total_requests_per_second_1703971082.png)

#### Response times(ms) vs time

![response times(ms)](./v2/grpc-conn-pool-post-bottleneck/pipeline/response_times_(ms)_1703971082.png)

#### Number of concurrent users vs time

![number of concurrent
users](./v2/grpc-conn-pool-post-bottleneck/pipeline/number_of_users_1703971082.png)

</details>