Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.0%) to scientific vocabulary
Repository
Wrk load testing of SciCat
Basic Info
- Host: GitHub
- Owner: Ingvord
- License: MIT
- Language: HTML
- Default Branch: main
- Size: 3.84 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Research Summary: SciCat metadataTypes API Endpoint Performance with Caching
Overview
This research investigates the performance of a SciCat metadataTypes API endpoint backed by MongoDB under various load scenarios. The study evaluates the impact of caching and scaling factors, such as the number of records queried and the request rate (RPS). Key performance metrics include latency percentiles and request throughput. The findings highlight bottlenecks and the effectiveness of caching in enhancing system performance.
Motivation
Efficient system design and capacity planning are crucial for ensuring a seamless user experience, especially for applications expected to handle high loads. This study aims to:
- Establish baseline performance for a Node.js API server under varying loads.
- Assess the role of caching in reducing latency and improving throughput.
- Identify bottlenecks when querying large datasets.
Methodology
Environment:
- Application: nginx + 4 SciCat backend instances.
- Database: one instance of MongoDB, filled with generated data.
- Testing Tool: wrk, a modern HTTP benchmarking tool.
- VM:

Scenarios:
- Dataset sizes: 1,000, 10,000, 100,000, and 250,000 records.
- Number of metadata fields: 100 and 1,000.
- Requests per second (RPS): 100, 500, 1,000.
- Connections: fixed at 10 (-c 10).
Caching:
- Results were compared between cache-disabled and cache-enabled scenarios.
- Pre-computation of the cache for larger datasets was analyzed for its time cost and effectiveness.
Metrics:
- Latency (50th, 75th, 90th, and 99th percentiles).
- Request throughput (RPS).
- Failed requests (timeouts, errors).
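The latency percentiles above come from wrk's `--latency` output. A minimal sketch of extracting them into structured form for the analysis below; the sample text is illustrative, not from an actual run:

```python
import re

# Convert wrk latency units to milliseconds.
UNIT_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0}

def parse_latency_distribution(output: str) -> dict:
    """Return {percentile: latency_ms} parsed from the
    'Latency Distribution' section of wrk --latency output."""
    result = {}
    for pct, value, unit in re.findall(
        r"^\s*(\d+(?:\.\d+)?)%\s+(\d+(?:\.\d+)?)(us|ms|s)\s*$", output, re.M
    ):
        result[float(pct)] = float(value) * UNIT_MS[unit]
    return result

# Illustrative sample, mimicking wrk's output format.
sample = """
  Latency Distribution
     50%    1.21ms
     75%    2.04ms
     90%    4.87ms
     99%   38.22ms
"""
print(parse_latency_distribution(sample))
```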
Findings
Without Cache:
- High latency and increased failure rates for large datasets (10,000+ records).
- Performance degraded significantly as RPS increased, especially for datasets over 100,000 records.
- Without caching, the application is effectively unusable at these load levels.
With Cache:
The 100,000- and 250,000-record datasets with 1,000 metadata keys could not be generated; the generator process was killed (likely out of memory):

~/scicat# ./generate_scicat_data.py 100000 1000
Killed

The computation times are displayed on a logarithmic scale, and "N/A" is shown where data for 1,000 metadata keys is unavailable because those datasets could not be generated.
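The "Killed" above suggests the generator materializes the whole dataset in memory. A hedged sketch of a streaming alternative that writes one record per line in constant memory; the record schema here is hypothetical and not taken from the actual generate_scicat_data.py:

```python
import json

def write_datasets(path: str, n_records: int, n_keys: int) -> None:
    """Stream one JSON document per line instead of building the
    full dataset in memory (hypothetical record schema)."""
    with open(path, "w") as f:
        for i in range(n_records):
            record = {
                "pid": f"dataset-{i}",
                "scientificMetadata": {f"key_{k}": k for k in range(n_keys)},
            }
            # One record lives in memory at a time, so large
            # n_records values no longer trigger the OOM killer.
            f.write(json.dumps(record) + "\n")
```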
- Dramatic reduction in latency for all dataset sizes.
- Near-linear scaling up to 500 RPS for datasets of 1,000 and 10,000 records.
- The 99th-percentile growth at 1,000 RPS highlights a significant increase in tail latency: even with caching, a small subset of requests experiences disproportionately higher delays as the request rate rises.
- Larger datasets (100,000+ records) showed improved performance but required significant time (5+ minutes) to precompute the cache.
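The precomputation cost mentioned above is a one-off expense that deployment plans must budget for. A minimal sketch of measuring it; compute_metadata_types is a stand-in for the real MongoDB aggregation, not the actual SciCat cache code:

```python
import time

def compute_metadata_types(records):
    """Stand-in for the expensive metadataTypes aggregation."""
    types = {}
    for rec in records:
        for key, value in rec.items():
            types[key] = type(value).__name__
    return types

def precompute_cache(records):
    """Build the cache once, returning it with its build time."""
    start = time.perf_counter()
    cache = compute_metadata_types(records)
    elapsed = time.perf_counter() - start
    return cache, elapsed  # elapsed is the one-off cost to budget for

records = [{"temperature": 1.5, "sample": "Si"} for _ in range(1000)]
cache, cost = precompute_cache(records)
```

After warmup, requests are served from `cache` without touching the database, which is what masks the query latency in the results above.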


These screenshots are from Elastic APM and provide insight into the performance of the GET /backend/api/v3/datasets/metadataTypes endpoint during the test runs.
Here's a breakdown of the metrics displayed:
Latency:
The average response time of the endpoint over the test duration is shown in milliseconds. The graph indicates that latency starts high and stabilizes at a much lower value, with occasional spikes, possibly reflecting the test load or system adjustments during the run.
Throughput:
The throughput, measured in transactions per minute (tpm), demonstrates the request-handling capacity of the endpoint. The graph shows periods of varying activity, including ramp-up phases and peak transaction processing times.
Failed Transaction Rate:
This metric tracks the percentage of failed requests. In this test, the failed transaction rate remains near 0%, indicating a high success rate during the tests, which reflects good system reliability under the given conditions.
Time Spent by Span Type:
This section displays the distribution of time spent by the application in processing the requests. The chart shows that 100% of the time is allocated to the application logic, highlighting that most delays are intrinsic to the application and not caused by external dependencies.
Key Observations:
- For smaller datasets (1,000 and 10,000 records), latency remained low across all percentiles, even at high RPS.
- Larger datasets (100,000 and 250,000 records) exhibited latency spikes at high RPS, particularly in the 90th and 99th percentiles.
- Caching effectively masked database query latency but was limited by the precomputation overhead for very large datasets.
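The tail-latency percentiles discussed above can also be recomputed from raw latency samples rather than taken from wrk's summary; a minimal sketch using only the standard library:

```python
import statistics

def latency_percentiles(samples, pcts=(50, 75, 90, 99)):
    """Compute latency percentiles (ms) from raw samples."""
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(sorted(samples), n=100)
    return {p: cuts[p - 1] for p in pcts}

# Uniform 1..100 ms samples make the percentiles easy to sanity-check.
samples = list(range(1, 101))
print(latency_percentiles(samples))
```

Comparing the 50th and 99th percentiles of such a dict is a quick way to quantify how heavy the tail is at a given RPS.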
Visualizations
To aid analysis, latency percentiles (50th, 75th, 90th, and 99th) were plotted against RPS for each dataset size, highlighting:
- The impact of increasing RPS on latency.
- The role of caching in reducing high-latency tail events.
Flamegraphs
OrigDatablocks: 100 files ingestion with wrk -c10 -t10 -R10
OrigDatablocks: 75K files ingestion with curl
OrigDatablocks: 1K files ingestion with wrk -c10 -t10 -R10
OrigDatablocks: 10K files ingestion with wrk -c10 -t10 -R10
OrigDatablocks: 75K files ingestion with wrk -c10 -t10 -R10
Conclusion
This research underscores the importance of caching for read-heavy workloads, especially when dealing with large datasets. While caching significantly improves performance, the time cost of precomputing the cache for large datasets must be accounted for in deployment strategies.
Repository
This research is part of the Performance Benchmarking for Node.js Applications project. Contributions and feedback are welcome!
Owner
- Login: Ingvord
- Kind: user
- Location: San Diego Supercomputer Center
- Company: @rcsb
- Website: http://www.ingvord.ru
- Repositories: 3
- Profile: https://github.com/Ingvord
Citation (CITATION.cff)
cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Khokhriakov
    given-names: Igor
    email: "mail@ingvord.ru"
    orcid: "https://orcid.org/0000-0002-8553-6984"
doi: "10.5281/zenodo.15056190"
url: "https://github.com/Ingvord/animated-garbanzo"
title: Ingvord/animated-garbanzo
version: zenodo-1
date-released: 2025-03-20
GitHub Events
Total
- Release event: 1
- Push event: 58
- Create event: 3
Last Year
- Release event: 1
- Push event: 58
- Create event: 3
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0