stata-codefinder

Efficient string finding in Stata using multiprocessing.

https://github.com/jonathanbatty/stata-codefinder

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: jonathanbatty
  • License: mit
  • Language: Stata
  • Default Branch: main
  • Size: 28.4 MB
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created about 2 years ago · Last pushed almost 2 years ago

README.md

Codefinder



Installation | Syntax | Examples | Feedback | Change log | Roadmap


Codefinder for Stata

(v1.00, 14 Jun 2024)

This repository contains the code required to install and run codefinder, a package that uses multiprocessing, associative arrays and optimised Mata functions to speed up many-to-many string matching in Stata. This can be used to identify the presence of lists of codes (e.g. ICD, SNOMED-CT, Read/CTV3, Emis, etc) in variables containing data in string format.
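To illustrate the general idea of matching many values against many codes with an associative array, here is a minimal single-process Mata sketch. It is not codefinder's actual implementation: the function name `has_code()` and the toy codes are assumptions made for illustration. The code list is loaded once into an `asarray`, so each cell in the data is checked with a single hash lookup rather than being compared against every code in turn.

```stata
clear all

mata:
// Hypothetical helper (not part of codefinder): returns a 0/1 column
// with one entry per row of X, set to 1 if any cell in that row
// exactly matches one of the supplied codes.
real colvector has_code(string matrix X, string colvector codes)
{
    transmorphic A
    real colvector hit
    real scalar i, j

    // Build the lookup table once; keys are string scalars by default
    A = asarray_create()
    for (i = 1; i <= rows(codes); i++) asarray(A, codes[i], 1)

    hit = J(rows(X), 1, 0)
    for (i = 1; i <= rows(X); i++) {
        for (j = 1; j <= cols(X); j++) {
            if (asarray_contains(A, X[i, j])) {
                hit[i] = 1
                break        // one match is enough for this row
            }
        }
    }
    return(hit)
}

// Toy data: three observations, two diagnosis columns
X = ("I10", "E11" \ "J45", "" \ "K21", "I10")
hit = has_code(X, ("I10" \ "I11"))
hit
end
```

With these toy values, rows 1 and 3 contain `I10`, so `hit` is `(1 \ 0 \ 1)`. A multiprocessing approach would additionally split the rows across several worker processes; that part is not shown here.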

At present, codefinder is still in development and has only been tested on Windows 11. Over the coming weeks, it will be fully tested on Windows 10 and 11, macOS and UNIX (including HPC) machines, prior to release via SSC.

Installation

The package can be installed from GitHub using net install:

```
net install codefinder, from("https://raw.githubusercontent.com/jonathanbatty/stata-codefinder/main/installation/") replace
```

Syntax

Codefinder should be used with no data open in Stata. The syntax for codefinder is as follows:

```
codefinder varstosearch, dataset() codefiles() id() [options]

[options] = n_cores() summary
```

See the help file (`help codefinder`) for full details of each option.

The basic usage is as follows:

codefinder dx*, dataset(".\data\patient_data.dta") codefiles("hypertension.txt diabetes.txt") id(id_var) n_cores(16)

Here, the variables matching dx* (e.g. dx1, dx2, dx3, ..., dxn) in patient_data.dta are searched for the diagnosis codes (strings) listed in hypertension.txt and diabetes.txt (one code per line in each file). Each row of data must be identified by a unique identifier, here id_var. In this example, codefinder runs the string matching procedure on 16 CPU cores. It returns a dataset in memory that contains id_var and, for each text file, a variable indicating whether one or more of its codes is present in each original observation (i.e. in any of dx* in this case).
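For trying the command on a small scale, the following sketch builds a toy dataset and two code files in the current working directory and then calls codefinder on them. The file contents, the use of the working directory rather than `.\data\`, and `n_cores(2)` are assumptions chosen for illustration; the last line requires codefinder to be installed.

```stata
clear all

* Toy data: three patients, two diagnosis columns (illustrative values)
input long id_var str6 dx1 str6 dx2
1 "I10" "E11"
2 "J45" ""
3 "K21" "I10"
end
save "patient_data.dta", replace

* Code files: one code per line
file open fh using "hypertension.txt", write text replace
file write fh "I10" _n "I11" _n
file close fh

file open fh using "diabetes.txt", write text replace
file write fh "E10" _n "E11" _n
file close fh

* codefinder should be run with no data open in Stata
clear

* Search dx1 and dx2 for the codes in both files, using 2 cores
codefinder dx*, dataset("patient_data.dta") codefiles("hypertension.txt diabetes.txt") id(id_var) n_cores(2)
```

Based on the description above, the resulting dataset should contain id_var plus one indicator per code file; with these toy values, patients 1 and 3 carry a hypertension code and patient 1 a diabetes code. The names of the indicator variables are not stated here, so check `help codefinder` or `describe` the result.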

Feedback

Please open an issue to report errors, suggest feature enhancements, and/or make any other requests.

Change Log

v1.01 (16/06/24) - Minor bug fixes: installation now works with a single command.

v1.00 (14/06/24) - Initial release.

Roadmap

  • Test on Unix / Mac machines.
  • Improvements in error reporting functionality: workers to flag errors to main Stata instance, which should handle these appropriately.
  • Further incremental improvements to speed and stability.

Acknowledgements

JB received funding from the Wellcome Trust 4ward North Clinical Research Training Fellowship (227498/Z/23/Z; R127002).

This work was done while JB was a member of the Survivorship and Multimorbidity Epidemiology Group at the University of Leeds, led by Dr Marlous Hall.

Suggested Citation

Batty, J. A. (2024). Stata package ``codefinder'': efficient many-to-many string searching in Stata using multiprocessing (Version 1.0) [Computer software]. https://github.com/jonathanbatty/stata-codefinder

Owner

  • Login: jonathanbatty
  • Kind: user

Citation (citation.cff)

cff-version: 1.2.0
authors:
- family-names: "Batty"
  given-names: "Jonathan A"
  orcid: "https://orcid.org/0000-0003-4102-5418"
title: "Stata package ``codefinder''"
version: 1.0
date-released: 2024-06-14
url: "https://github.com/jonathanbatty/stata-nmf"
