Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.0%) to scientific vocabulary
Repository
Wrk load testing of SciCat
Basic Info
- Host: GitHub
- Owner: Ingvord
- License: MIT
- Language: HTML
- Default Branch: main
- Size: 3.84 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Research Summary: SciCat metadataTypes API Endpoint Performance with Caching
Overview
This research investigates the performance of a SciCat metadataTypes API endpoint backed by MongoDB under various load scenarios. The study evaluates the impact of caching and scaling factors, such as the number of records queried and the request rate (RPS). Key performance metrics include latency percentiles and request throughput. The findings highlight bottlenecks and the effectiveness of caching in enhancing system performance.
Motivation
Efficient system design and capacity planning are crucial for ensuring a seamless user experience, especially for applications expected to handle high loads. This study aims to:
- Establish baseline performance for a Node.js API server under varying loads.
- Assess the role of caching in reducing latency and improving throughput.
- Identify bottlenecks when querying large datasets.
Methodology
Environment:
- Application: nginx + 4 SciCat backend instances.
- Database: one instance of MongoDB, filled with generated data.
- Testing Tool: wrk, a modern HTTP benchmarking tool.
- VM:

Scenarios:
- Dataset sizes: 1,000, 10,000, 100,000, and 250,000 records.
- Number of metadata fields: 100 and 1,000.
- Requests per second (RPS): 100, 500, 1,000.
- Connections: fixed at 10 (-c 10).
Caching:
- Results were compared between cache-disabled and cache-enabled scenarios.
- Pre-computation of the cache for larger datasets was analyzed for its time cost and effectiveness.
Metrics:
- Latency (50th, 75th, 90th, and 99th percentiles).
- Request throughput (RPS).
- Failed requests (timeouts, errors).
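The latency percentiles above come from wrk's `--latency` output. A minimal sketch of extracting them into structured form for the analysis below; the sample text is illustrative, not from an actual run:

```python
import re

# Convert wrk latency units to milliseconds.
UNIT_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0}

def parse_latency_distribution(output: str) -> dict:
    """Return {percentile: latency_ms} parsed from the
    'Latency Distribution' section of wrk --latency output."""
    result = {}
    for pct, value, unit in re.findall(
        r"^\s*(\d+(?:\.\d+)?)%\s+(\d+(?:\.\d+)?)(us|ms|s)\s*$", output, re.M
    ):
        result[float(pct)] = float(value) * UNIT_MS[unit]
    return result

# Illustrative sample, mimicking wrk's output format.
sample = """
  Latency Distribution
     50%    1.21ms
     75%    2.04ms
     90%    4.87ms
     99%   38.22ms
"""
print(parse_latency_distribution(sample))
```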
Findings
Without Cache:
- High latency and increased failure rates for large datasets (10,000+ records).
- Performance degraded significantly as RPS increased, especially for datasets over 100,000 records.
- Without caching, the application is effectively unusable at these load levels.
With Cache:
The 100,000- and 250,000-record datasets with 1,000 metadata keys could not be generated; the generator process was killed (likely out of memory):

~/scicat# ./generate_scicat_data.py 100000 1000
Killed

The computation times are displayed on a logarithmic scale, and "N/A" is shown where data for 1,000 metadata keys is unavailable because those datasets could not be generated.
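The "Killed" above suggests the generator materializes the whole dataset in memory. A hedged sketch of a streaming alternative that writes one record per line in constant memory; the record schema here is hypothetical and not taken from the actual generate_scicat_data.py:

```python
import json

def write_datasets(path: str, n_records: int, n_keys: int) -> None:
    """Stream one JSON document per line instead of building the
    full dataset in memory (hypothetical record schema)."""
    with open(path, "w") as f:
        for i in range(n_records):
            record = {
                "pid": f"dataset-{i}",
                "scientificMetadata": {f"key_{k}": k for k in range(n_keys)},
            }
            # One record lives in memory at a time, so large
            # n_records values no longer trigger the OOM killer.
            f.write(json.dumps(record) + "\n")
```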
- Dramatic reduction in latency for all dataset sizes.
- Near-linear scaling up to 500 RPS for datasets of 1,000 and 10,000 records.
- The 99th-percentile growth at 1,000 RPS highlights a significant increase in tail latency: even with caching, a small subset of requests experiences disproportionately higher delays as the request rate rises.
- Larger datasets (100,000+ records) showed improved performance but required significant time (5+ minutes) to precompute the cache.
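The precomputation cost mentioned above is a one-off expense that deployment plans must budget for. A minimal sketch of measuring it; compute_metadata_types is a stand-in for the real MongoDB aggregation, not the actual SciCat cache code:

```python
import time

def compute_metadata_types(records):
    """Stand-in for the expensive metadataTypes aggregation."""
    types = {}
    for rec in records:
        for key, value in rec.items():
            types[key] = type(value).__name__
    return types

def precompute_cache(records):
    """Build the cache once, returning it with its build time."""
    start = time.perf_counter()
    cache = compute_metadata_types(records)
    elapsed = time.perf_counter() - start
    return cache, elapsed  # elapsed is the one-off cost to budget for

records = [{"temperature": 1.5, "sample": "Si"} for _ in range(1000)]
cache, cost = precompute_cache(records)
```

After warmup, requests are served from `cache` without touching the database, which is what masks the query latency in the results above.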


These screenshots are from Elastic APM and provide insight into the performance of the GET /backend/api/v3/datasets/metadataTypes endpoint during the test runs.
Here's a breakdown of the metrics displayed:
Latency:
The average response time of the endpoint over the test duration is shown in milliseconds. The graph indicates that latency starts high and stabilizes at a much lower value, with occasional spikes, possibly reflecting the test load or system adjustments during the run.
Throughput:
The throughput, measured in transactions per minute (tpm), demonstrates the request-handling capacity of the endpoint. The graph shows periods of varying activity, including ramp-up phases and peak transaction processing times.
Failed Transaction Rate:
This metric tracks the percentage of failed requests. In this test, the failed transaction rate remains near 0%, indicating a high success rate during the tests, which reflects good system reliability under the given conditions.
Time Spent by Span Type:
This section displays the distribution of time spent by the application in processing the requests. The chart shows that 100% of the time is allocated to the application logic, highlighting that most delays are intrinsic to the application and not caused by external dependencies.
Key Observations:
- For smaller datasets (1,000 and 10,000 records), latency remained low across all percentiles, even at high RPS.
- Larger datasets (100,000 and 250,000 records) exhibited latency spikes at high RPS, particularly in the 90th and 99th percentiles.
- Caching effectively masked database query latency but was limited by the precomputation overhead for very large datasets.
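The tail-latency percentiles discussed above can also be recomputed from raw latency samples rather than taken from wrk's summary; a minimal sketch using only the standard library:

```python
import statistics

def latency_percentiles(samples, pcts=(50, 75, 90, 99)):
    """Compute latency percentiles (ms) from raw samples."""
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(sorted(samples), n=100)
    return {p: cuts[p - 1] for p in pcts}

# Uniform 1..100 ms samples make the percentiles easy to sanity-check.
samples = list(range(1, 101))
print(latency_percentiles(samples))
```

Comparing the 50th and 99th percentiles of such a dict is a quick way to quantify how heavy the tail is at a given RPS.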
Visualizations
To aid analysis, latency percentiles (50th, 75th, 90th, and 99th) were plotted against RPS for each dataset size, highlighting:
- The impact of increasing RPS on latency.
- The role of caching in reducing high-latency tail events.
Flamegraphs
OrigDatablocks: 100 files ingestion with wrk -c10 -t10 -R10
OrigDatablocks: 75K files ingestion with curl
OrigDatablocks: 1K files ingestion with wrk -c10 -t10 -R10
OrigDatablocks: 10K files ingestion with wrk -c10 -t10 -R10
OrigDatablocks: 75K files ingestion with wrk -c10 -t10 -R10
Conclusion
This research underscores the importance of caching for read-heavy workloads, especially when dealing with large datasets. While caching significantly improves performance, the time cost of precomputing the cache for large datasets must be accounted for in deployment strategies.
Repository
This research is part of the Performance Benchmarking for Node.js Applications project. Contributions and feedback are welcome!
Owner
- Login: Ingvord
- Kind: user
- Location: San Diego Supercomputer Center
- Company: @rcsb
- Website: http://www.ingvord.ru
- Repositories: 3
- Profile: https://github.com/Ingvord
Citation (CITATION.cff)
cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Khokhriakov
    given-names: Igor
    email: "mail@ingvord.ru"
    orcid: "https://orcid.org/0000-0002-8553-6984"
doi: "10.5281/zenodo.15056190"
url: "https://github.com/Ingvord/animated-garbanzo"
title: Ingvord/animated-garbanzo
version: zenodo-1
date-released: 2025-03-20
GitHub Events
Total
- Release event: 1
- Push event: 58
- Create event: 3
Last Year
- Release event: 1
- Push event: 58
- Create event: 3
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0