https://github.com/cdcgov/recordlinker

The RecordLinker is a service that links records from two datasets based on a set of common attributes. The service is designed to be used in a variety of public health contexts, such as linking patient records from different sources or linking records from different public health surveillance systems.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.9%) to scientific vocabulary

Keywords

fastapi python sqlalchemy

Keywords from Contributors

archival projection dibbs fhir fhir-client publichealth interactive generic sequences observability
Last synced: 5 months ago

Repository

Basic Info
Statistics
  • Stars: 6
  • Watchers: 2
  • Forks: 2
  • Open Issues: 32
  • Releases: 18
Topics
fastapi python sqlalchemy
Created over 1 year ago · Last pushed 7 months ago
Metadata Files
Readme Contributing License Code of conduct Codeowners

README.md

Record Linker

General disclaimer This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.

Overview

The Record Linker is a service that links records from two datasets based on a set of common attributes. The service is designed to be used in a variety of public health contexts, such as linking patient records from different sources or linking records from different public health surveillance systems. The service uses a probabilistic record linkage algorithm to determine the likelihood that two records refer to the same entity. The service is implemented as a RESTful API that can be accessed over HTTP. The API provides endpoints for uploading datasets, configuring the record linkage process, and retrieving the results of the record linkage process.
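To make "probabilistic record linkage" concrete, here is a minimal sketch in the Fellegi-Sunter style, where each compared field adds or subtracts log-odds evidence. This is not Record Linker's actual algorithm or API: the field names, the m/u probabilities, and the `match_log_odds` function are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-field probabilities (assumptions for illustration only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "first_name": (0.90, 0.05),
    "last_name": (0.95, 0.02),
    "birth_date": (0.98, 0.01),
}


def match_log_odds(record_a: dict, record_b: dict) -> float:
    """Sum per-field log-odds weights; a higher total suggests the same entity."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        value_a = record_a.get(field)
        value_b = record_b.get(field)
        if value_a is None or value_b is None:
            continue  # a missing value gives no evidence either way
        if value_a == value_b:
            total += math.log(m / u)  # agreement is evidence for a match
        else:
            total += math.log((1 - m) / (1 - u))  # disagreement is evidence against
    return total


if __name__ == "__main__":
    a = {"first_name": "Ann", "last_name": "Lee", "birth_date": "1990-01-01"}
    b = {"first_name": "Ann", "last_name": "Lee", "birth_date": "1990-01-01"}
    print(match_log_odds(a, b))
```

A real linkage service would typically use fuzzy comparisons rather than exact equality and would compare the total against a tuned threshold; both are left out here to keep the sketch short.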

Getting Started

Pre-requisites

  • Python 3.11 or higher
  • Docker

Initial Setup

Set up a Python virtual environment and install the required development dependencies:

    source scripts/bootstrap.sh

NOTE: Sourcing the script is recommended over simply executing it, since sourcing keeps the virtual environment active in your shell.

Note: If you are running in WSL on a Windows machine, you will need to run the bootstrap file directly with ./scripts/bootstrap.sh and then activate the virtual environment by running source .venv/bin/activate.

Running the API

To run the API locally, use the following command:

    ./scripts/local_server.sh

The API will be available at http://localhost:8000. Visit http://localhost:8000/redoc to view the API documentation.
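Once the server is running, a quick way to confirm that the documentation page responds is a small standard-library check. This is only a sketch: the helper names `docs_url` and `is_reachable` are hypothetical, and the only facts taken from this README are the base address and the /redoc path.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # local address given in this README


def docs_url(base_url: str = BASE_URL) -> str:
    """Build the ReDoc documentation URL from the API base URL."""
    return base_url.rstrip("/") + "/redoc"


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


if __name__ == "__main__":
    url = docs_url()
    print(url, "reachable" if is_reachable(url) else "not reachable")
```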

Database Management

Record Linker supports 4 different database systems for managing the Master Patient Index (MPI) data. All will require the DB_URI environment variable to be set. (See Database Options for details.) For more information on setting up the database and/or managing the schema, see the Migrations README.
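As a sketch of how a script might validate DB_URI before starting the service, the helper below reads the variable and splits it into its parts. It assumes the value follows the SQLAlchemy-style `dialect+driver://user:password@host:port/database` form; that format, the `describe_db_uri` name, and the example PostgreSQL URI are assumptions, not something this README specifies.

```python
import os
from urllib.parse import urlsplit


def describe_db_uri(env=None) -> dict:
    """Read DB_URI from the environment and split it into its main parts."""
    env = os.environ if env is None else env
    uri = env.get("DB_URI")
    if not uri:
        raise RuntimeError("DB_URI is not set; export it before starting the service")
    parts = urlsplit(uri)
    # A scheme such as "postgresql+psycopg2" carries both dialect and driver.
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "host": parts.hostname,
        "database": parts.path.lstrip("/") or None,
    }


if __name__ == "__main__":
    # Hypothetical example value, for illustration only.
    example = {"DB_URI": "postgresql+psycopg2://user:pw@localhost:5432/mpi"}
    print(describe_db_uri(example))
```

Failing early with a clear message when DB_URI is missing is usually easier to debug than a connection error raised later during startup.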

Testing

The Record Linker system comes with a number of built-in tests spread across several different types. Some of these tests are run automatically (e.g. by GitHub), while others must be manually executed by a developer.

  • tests/unit: These comprise basic unit (and in some cases integration) tests providing code coverage to Record Linker. These tests demonstrate the functionality of different parts of the code base under different logical conditions and with different inputs and outputs. They are automatically executed by a GitHub Actions workflow as part of a PR.
  • tests/algorithm: This is a set of scripts developed to test an algorithm configuration with a known set of particular edge cases. In response to frequent questions of how the DIBBs algorithm handles case X, this mini-project was created to help answer those questions by giving developers some persistent evaluation tools. These tests are not automated, and developers will need to go through the steps in the README in the relevant directory in order to run them.
  • tests/performance: Another set of scripts developed to see how fast the API can process linkage requests using synthetic data. This is useful for verifying refactors are still performant and helping developers identify bottlenecks along the way. These tests are not automated, and developers need to go through the steps in the README of the relevant directory in order to run them.
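The idea behind the performance scripts, timing how quickly a batch of synthetic records is processed, can be sketched as follows. This is not the code in tests/performance: `measure_throughput`, the `handler` callable, and the synthetic records are hypothetical stand-ins for a real linkage request.

```python
import time
from typing import Callable, Iterable


def measure_throughput(handler: Callable[[dict], object], records: Iterable[dict]) -> dict:
    """Time how long `handler` takes over all records and report records per second."""
    count = 0
    start = time.perf_counter()
    for record in records:
        handler(record)  # stand-in for sending one linkage request
        count += 1
    elapsed = time.perf_counter() - start
    per_second = count / elapsed if elapsed > 0 else float("inf")
    return {"count": count, "seconds": elapsed, "per_second": per_second}


if __name__ == "__main__":
    synthetic = [{"id": i, "last_name": f"Name{i}"} for i in range(1000)]
    print(measure_throughput(lambda record: record["last_name"].lower(), synthetic))
```

Running the same measurement before and after a refactor gives a rough signal of whether processing speed changed.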

Running unit tests

To run all the unit tests, use the following command:

    pytest

To run a single unit test, use the following command:

    pytest tests/unit/test_utils.py::test_bind_functions

Running type checks

To run type checks, use the following command:

    mypy

Running code formatting checks

To run code formatting checks, use the following command:

    ruff check

For more information on developer workflows, see the Developer Guide.

Standard Notices

Public Domain Standard Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

License Standard Notice

The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.

This source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html

The source code forked from other open source projects will inherit its license.

Privacy Standard Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Standard Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Standard Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.

Related documents

Additional Standard Notices

Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.

Owner

  • Name: Centers for Disease Control and Prevention
  • Login: CDCgov
  • Kind: organization
  • Email: data@cdc.gov
  • Location: Atlanta, GA

CDC's collaborative software projects to protect America from health, safety, and security threats, both foreign and in the U.S.

GitHub Events

Total
  • Create event: 193
  • Release event: 16
  • Issues event: 355
  • Watch event: 5
  • Delete event: 162
  • Member event: 7
  • Issue comment event: 325
  • Push event: 1,244
  • Pull request review comment event: 710
  • Pull request event: 346
  • Pull request review event: 895
  • Fork event: 5
Last Year
  • Create event: 193
  • Release event: 16
  • Issues event: 355
  • Watch event: 5
  • Delete event: 162
  • Member event: 7
  • Issue comment event: 325
  • Push event: 1,244
  • Pull request review comment event: 710
  • Pull request event: 346
  • Pull request review event: 895
  • Fork event: 5

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 204
  • Total Committers: 11
  • Avg Commits per committer: 18.545
  • Development Distribution Score (DDS): 0.338
Past Year
  • Commits: 204
  • Committers: 11
  • Avg Commits per committer: 18.545
  • Development Distribution Score (DDS): 0.338
Top Committers
  • Eric Buckley (e****y@g****m): 135 commits
  • Marcelle (5****s): 22 commits
  • bamader (4****r): 18 commits
  • cbrinson-rise8 (1****8): 13 commits
  • Alex Hayward (4****d): 6 commits
  • Derek A Dombek (5****k): 5 commits
  • dependabot[bot] (4****]): 1 commit
  • Eileen Ruberto (e****o): 1 commit
  • Boris Ning (4****s): 1 commit
  • Alis Akers (9****x): 1 commit
  • Alex Hayward (a****d@a****n): 1 commit

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 208
  • Total pull requests: 299
  • Average time to close issues: 19 days
  • Average time to close pull requests: 5 days
  • Total issue authors: 5
  • Total pull request authors: 10
  • Average comments per issue: 0.49
  • Average comments per pull request: 1.2
  • Merged pull requests: 228
  • Bot issues: 0
  • Bot pull requests: 4
Past Year
  • Issues: 208
  • Pull requests: 299
  • Average time to close issues: 19 days
  • Average time to close pull requests: 5 days
  • Issue authors: 5
  • Pull request authors: 10
  • Average comments per issue: 0.49
  • Average comments per pull request: 1.2
  • Merged pull requests: 228
  • Bot issues: 0
  • Bot pull requests: 4
Top Authors
Issue Authors
  • ericbuckley (165)
  • m-goggins (25)
  • johanna-skylight (13)
  • bamader (9)
  • alhayward (4)
  • cbrinson-rise8 (1)
Pull Request Authors
  • ericbuckley (193)
  • m-goggins (49)
  • derekadombek (38)
  • bamader (33)
  • johanna-skylight (23)
  • cbrinson-rise8 (20)
  • alhayward (8)
  • dependabot[bot] (7)
  • alismx (2)
  • eileenruberto (1)
Top Labels
Issue Labels
feature (36) enhancement (25) api (24) ui (16) qa (13) bug (13) documentation (11) spike (10) automation (9) epic (9) ops (3) a11y (2) needs content (1) needs design (1) dependencies (1)
Pull Request Labels
feature (40) bug (25) enhancement (22) qa (21) automation (19) documentation (11) api (10) dependencies (9) ops (8) ui (5) spike (3) a11y (3) python (2)

Dependencies

Dockerfile docker
  • ghcr.io/cdcgov/phdi/dibbs latest build
pyproject.toml pypi
  • fastapi *
  • fhirpathpy *
  • phdi *
  • pydantic *
  • pyway *
  • rapidfuzz *
  • sqlalchemy *
  • uvicorn *
.github/workflows/check_code_vulnerabilities.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v3 composite
  • github/codeql-action/init v3 composite
.github/workflows/check_lint.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v5 composite
.github/workflows/check_unit_tests.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v5 composite
  • postgres 13 docker