dcache/bench/results
Aravinth Manivannan a82b9044d5
feat: benchmark report
2023-12-31 02:54:11 +05:30

Benchmark Report

Benchmarks were run at various stages of development to keep track of performance. Tech stacks were changed and the implementation was optimized to increase throughput. This report summarizes the findings of those benchmarks.

Ultimately, we were able to identify a bottleneck that was previously hidden in mCaptcha (hidden because a different bottleneck like DB access eclipsed it :p) and were able to increase performance of the critical path by ~147 times through a trivial optimization.

Environment

These benchmarks were run on a noisy development laptop and should be used for guidance only.

  • CPU: AMD Ryzen 5 5600U with Radeon Graphics (12) @ 4.289GHz
  • Memory: 22849MiB
  • OS: Arch Linux x86_64
  • Kernel: 6.6.7-arch1-1
  • rustc: 1.73.0 (cc66ad468 2023-10-03)

Baseline: Tech stack version 1

Actix Web-based networking with JSON as the message format. This stack was chosen for prototyping and was later used to set a baseline.

Without connection pooling in server-to-server communications

Single requests (no batching)

Peak throughput observed was 1,117 requests/second.

Charts (not reproduced here):

  • Total number of requests vs time
  • Response times (ms) vs time
  • Number of concurrent users vs time

Batched requests

Each network request contained 1,000 application requests, so the peak throughput observed was 1,800 requests/second.
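Batching here means packing many application requests into one network request. A minimal sketch of that idea follows. The `AppRequest` and `NetworkRequest` types and the `batch` function are hypothetical and only illustrate the grouping; they are not dcache's actual message types.

```rust
/// Hypothetical application-level request (not dcache's real message type).
#[derive(Debug, Clone, PartialEq)]
pub struct AppRequest {
    pub id: u64,
}

/// Hypothetical network-level request that carries many application requests.
#[derive(Debug, Clone, PartialEq)]
pub struct NetworkRequest {
    pub items: Vec<AppRequest>,
}

/// Groups application requests into network requests holding at most
/// `batch_size` items each. The last batch may be smaller.
pub fn batch(requests: Vec<AppRequest>, batch_size: usize) -> Vec<NetworkRequest> {
    assert!(batch_size > 0, "batch_size must be positive");
    requests
        .chunks(batch_size)
        .map(|chunk| NetworkRequest {
            items: chunk.to_vec(),
        })
        .collect()
}
```

With a batch size of 1,000, as used in these benchmarks, 2,500 application requests would travel in three network requests, so the per-request network overhead is paid three times instead of 2,500 times.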

Charts (not reproduced here):

  • Total number of requests vs time
  • Response times (ms) vs time
  • Number of concurrent users vs time

With connection pooling in server-to-server communications

Single requests (no batching)

Peak throughput observed was 3,904 requests/second.

Charts (not reproduced here):

  • Total number of requests vs time
  • Response times (ms) vs time
  • Number of concurrent users vs time

Batched requests

Each network request contained 1,000 application requests, so the peak throughput observed was 15,800 requests/second.

Charts (not reproduced here):

  • Total number of requests vs time
  • Response times (ms) vs time
  • Number of concurrent users vs time

Tech stack version 2

Tonic for the network stack and gRPC as the wire format. We ran over a dozen benchmarks with this tech stack. The trend was similar to the one observed above: throughput was higher when a connection pool was used, and higher still when requests were batched. But the throughput in all of these benchmarks was lower than in the baseline benchmarks!

The CPU was busier. We profiled the server with flamegraph while running the same test suite to identify compute-heavy areas. The result was unexpected:

[Figure: flamegraph indicating that libmCaptcha is slow]

libmCaptcha's AddVisitor handler was taking up 59% of the CPU time of the entire test run. This handler is a critical part of the variable difficulty factor PoW algorithm that mCaptcha uses. We never ran into this bottleneck before because, in other cache implementations, it was always preceded by a database request. It surfaced here because dcache uses in-memory data sources.

libmCaptcha uses an actor-based approach with message passing for clean concurrent state management. Message passing is fast enough in most cases, but in our case, sharing memory using the CPU's concurrency primitives turned out to be significantly faster:

[Figure: flamegraph after the optimization]

CPU time was reduced from 59% to 0.4%, a reduction of roughly 147 times!
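To make the contrast concrete, here is a rough sketch of the two styles applied to a simple visitor counter. It is not libmCaptcha's code: the `Msg` enum, the function names, and the use of `std` threads and channels as a stand-in for an actor framework are all assumptions made for illustration, as is the choice of an atomic integer as the shared-memory primitive.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Messages understood by the hypothetical counter actor.
pub enum Msg {
    AddVisitor,
    Get(mpsc::Sender<u32>),
}

/// Message-passing style: a single owner thread holds the counter and
/// every update has to travel through a channel to reach it.
pub fn spawn_counter_actor() -> mpsc::Sender<Msg> {
    let (tx, rx) = mpsc::channel::<Msg>();
    thread::spawn(move || {
        let mut visitors: u32 = 0;
        // The loop ends once every sender has been dropped.
        for msg in rx {
            match msg {
                Msg::AddVisitor => visitors += 1,
                Msg::Get(reply) => {
                    let _ = reply.send(visitors);
                }
            }
        }
    });
    tx
}

/// Counts visitors by sending one message per visitor to the actor.
pub fn count_with_actor(threads: usize, per_thread: u32) -> u32 {
    let tx = spawn_counter_actor();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                for _ in 0..per_thread {
                    tx.send(Msg::AddVisitor).unwrap();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // All AddVisitor messages are queued before this Get message.
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Msg::Get(reply_tx)).unwrap();
    reply_rx.recv().unwrap()
}

/// Shared-memory style: every thread updates the same atomic integer
/// directly, with no queue and no dedicated owner thread.
pub fn count_with_atomics(threads: usize, per_thread: u32) -> u32 {
    let visitors = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let visitors = Arc::clone(&visitors);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    visitors.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    visitors.load(Ordering::Relaxed)
}
```

Both functions return the same count. The difference is in the work done per update: the actor version allocates and queues a message and wakes another thread, while the atomic version performs a single atomic add. How large that gap is in practice depends on the workload, so the figures in this report should not be read off this sketch.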

With this fix in place:

Connection pooled server-to-server communications, single requests (no batching)

Peak throughput observed was 4,816 requests/second, roughly 900 requests/second more than the pooled baseline (3,904 requests/second).

Charts (not reproduced here):

  • Total number of requests vs time
  • Response times (ms) vs time
  • Number of concurrent users vs time

Connection pooled server-to-server communications, batched requests

Each network request contained 1,000 application requests, so the peak throughput observed was 95,700 requests/second. This is about six times higher than the pooled, batched baseline (15,800 requests/second).

Charts (not reproduced here):

  • Total number of requests vs time
  • Response times (ms) vs time
  • Number of concurrent users vs time
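The ratios quoted in this report can be re-derived from the peak figures it lists. The sketch below does only that arithmetic; the `speedup` and `report_ratios` helpers and the labels are hypothetical, and the input numbers are copied from the sections above.

```rust
/// Ratio of a new measurement to an old one.
pub fn speedup(new: f64, old: f64) -> f64 {
    new / old
}

/// Ratios between the peak figures reported in this document.
pub fn report_ratios() -> Vec<(&'static str, f64)> {
    vec![
        // Tech stack v1, single requests: pooled vs unpooled.
        ("v1 single, pooling vs no pooling", speedup(3904.0, 1117.0)),
        // Tech stack v1, batched requests: pooled vs unpooled.
        ("v1 batched, pooling vs no pooling", speedup(15800.0, 1800.0)),
        // Tech stack v2 with the fix vs pooled v1 baseline, single requests.
        ("v2 vs v1, pooled single", speedup(4816.0, 3904.0)),
        // Tech stack v2 with the fix vs pooled v1 baseline, batched requests.
        ("v2 vs v1, pooled batched", speedup(95700.0, 15800.0)),
        // Share of CPU time spent in AddVisitor: before vs after the fix.
        ("AddVisitor CPU share, before vs after", speedup(59.0, 0.4)),
    ]
}
```

Printing each pair, for example with `println!("{label}: {ratio:.2}x")`, gives a compact summary of the report. Because the measurements come from a noisy development laptop, these ratios are indicative only.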