stata-codefinder
Efficient string finding in Stata using multiprocessing.
Repository
Basic Info
- Host: GitHub
- Owner: jonathanbatty
- License: mit
- Language: Stata
- Default Branch: main
- Size: 28.4 MB
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md

Installation | Syntax | Examples | Feedback | Change log | Roadmap
Codefinder for Stata
(v1.00, 14 Jun 2024)
This repository contains the code required to install and run codefinder, a package that uses multiprocessing, associative arrays and optimised Mata functions to speed up many-to-many string matching in Stata. This can be used to identify the presence of lists of codes (e.g. ICD, SNOMED-CT, Read/CTV3, Emis, etc) in variables containing data in string format.
At present, codefinder is at an early stage of development and has only been tested on Windows 11. Over the coming weeks, it will be fully tested on Windows 10 and 11, macOS and UNIX (including HPC) machines, prior to release via SSC.
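To illustrate why an associative array helps with many-to-many matching, the following Mata sketch builds a lookup table from a short list of codes and then tests individual values against it. It is not taken from the codefinder source: the code values are made up, and exact (whole-string) matching is an assumption made for the illustration.

```stata
mata:
// Hypothetical code list (illustrative values only)
codes = ("I10" \ "I11" \ "I12")

// Associative array with string scalar keys (the asarray_create() default)
A = asarray_create()
for (i = 1; i <= rows(codes); i++) {
    asarray(A, codes[i], 1)   // store a dummy value under each code
}

// Each membership test is a single hashed lookup, rather than a
// comparison against every code in the list
asarray_contains(A, "I10")
asarray_contains(A, "J45")
end
```

With a lookup table of this kind, the cost of checking one value does not grow with the length of the code list, which matters when many variables are searched for many codes.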
Installation
The package can be installed from GitHub using net install:
```
net install codefinder, from("https://raw.githubusercontent.com/jonathanbatty/stata-codefinder/main/installation/") replace
```
Syntax
Codefinder should be used with no data open in Stata. The syntax for codefinder is as follows:
```
codefinder varstosearch, dataset() codefiles() id() [options]

[options] = n_cores() summary
```
See the help file using `help codefinder` for full details of each option.
The basic usage is as follows:
`codefinder dx*, dataset(".\data\patient_data.dta") codefiles("hypertension.txt diabetes.txt") id(id_var) n_cores(16)`

Here, the variables matching `dx*` (e.g. dx1, dx2, dx3, ... , dxn) in patient_data.dta are searched for the diagnosis codes (strings) listed in hypertension.txt and diabetes.txt (one code per line in each file). Each row of data must be identified by a unique identifier, passed via `id()` (`id_var` in this example). In this example, codefinder runs the string matching procedure on 16 CPU cores. It returns a dataset in memory containing `id_var` and, for each text file, a variable indicating whether one or more codes from that file are present in each original observation (i.e. in any of the `dx*` variables in this case).
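For a quick end-to-end trial, the sketch below builds a small toy dataset and two code files, then runs the command described above. The patient rows are made up, the files are written to the current working directory, and `n_cores(2)` is an arbitrary choice for a small machine; these are assumptions for illustration rather than requirements of the package.

```stata
* Toy data: three patients with two diagnosis columns (values are made up)
clear
input long id_var str8 dx1 str8 dx2
1 "I10" "E11"
2 "J45" ""
3 "E11" "I10"
end
save "patient_data.dta", replace

* Code files: one code per line in each file
file open fh using "hypertension.txt", write text replace
file write fh "I10" _n
file close fh

file open fh using "diabetes.txt", write text replace
file write fh "E11" _n
file close fh

* codefinder should be run with no data open in Stata
clear
codefinder dx*, dataset("patient_data.dta") codefiles("hypertension.txt diabetes.txt") id(id_var) n_cores(2)
```

After the command finishes, the returned dataset can be inspected with `describe` and `list` to see which variables were created for each code file.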
Feedback
Please open an issue to report errors, suggest feature enhancements, and/or make any other requests.
Change Log
v1.01 (16/06/24) - Minor bug fixes: installation now works with a single command.
v1.00 (14/06/24) - Initial release.
Roadmap
- Test on Unix / Mac machines.
- Improvements in error reporting functionality: workers to flag errors to main Stata instance, which should handle these appropriately.
- Further incremental improvements to speed and stability.
Acknowledgements
JB received funding from the Wellcome Trust 4ward North Clinical Research Training Fellowship (227498/Z/23/Z; R127002).
This work was done while JB was a member of the Survivorship and Multimorbidity Epidemiology Group at the University of Leeds, led by Dr Marlous Hall.
Suggested Citation
Batty, J. A. (2024). Stata package ``codefinder'': efficient many-to-many string searching in Stata using multiprocessing (Version 1.0) [Computer software]. https://github.com/jonathanbatty/stata-codefinder
Owner
- Login: jonathanbatty
- Kind: user
- Repositories: 1
- Profile: https://github.com/jonathanbatty
Citation (citation.cff)
cff-version: 1.2.0
authors:
  - family-names: "Batty"
    given-names: "Jonathan A"
    orcid: "https://orcid.org/0000-0003-4102-5418"
title: "Stata package ``codefinder''"
version: 1.0
date-released: 2024-06-14
url: "https://github.com/jonathanbatty/stata-nmf"