Recent Releases of https://github.com/cthoyt/cthoyt.github.io
https://github.com/cthoyt/cthoyt.github.io - New post: Validating the FAIRness of knowledge graphs and ontologies in RDF using the Bioregistry
Using standard CURIE prefixes and URI prefixes in semantic web artifacts such as Resource Description Framework (RDF) promotes interoperability, enables reuse in downstream data integration, and makes data more FAIR. The Bioregistry defines a set of standard CURIE prefixes and URI prefixes against which RDF files can be validated/standardized. This blog post describes a new CLI tool bioregistry validate ttl in the Bioregistry Python package that can run validation on Turtle files (a common serialization of RDF).
Full post: https://cthoyt.com/2025/09/04/bioregistry-turtle-validation.html
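As a rough illustration of the kind of check described above, here is a minimal sketch that compares the `@prefix` declarations in a Turtle document against a small table of standard prefixes. It is not the implementation behind `bioregistry validate ttl`: the `STANDARD` table is a hand-written sample rather than data loaded from the Bioregistry, and `check_prefixes` is a hypothetical helper that uses a regular expression instead of a real Turtle parser.

```python
import re

# Hand-written sample of standard prefix -> URI prefix pairs (an assumption
# for illustration; a real check would load these from the Bioregistry).
STANDARD = {
    "chebi": "http://purl.obolibrary.org/obo/CHEBI_",
    "go": "http://purl.obolibrary.org/obo/GO_",
}

# Matches Turtle prefix declarations such as: @prefix chebi: <http://...> .
PREFIX_RE = re.compile(r"^\s*@prefix\s+([^:\s]*):\s*<([^>]*)>\s*\.", re.MULTILINE)


def check_prefixes(turtle: str) -> list[str]:
    """Return one message per declaration that disagrees with STANDARD."""
    uri_to_prefix = {uri: prefix for prefix, uri in STANDARD.items()}
    messages = []
    for prefix, uri in PREFIX_RE.findall(turtle):
        if prefix in STANDARD:
            # Known prefix: the URI prefix must match the standard one.
            if STANDARD[prefix] != uri:
                messages.append(f"{prefix}: expected <{STANDARD[prefix]}>, found <{uri}>")
        elif uri in uri_to_prefix:
            # Known URI prefix declared under a different CURIE prefix.
            messages.append(f"{prefix}: nonstandard prefix, use {uri_to_prefix[uri]}")
        else:
            messages.append(f"{prefix}: unknown prefix")
    return messages


EXAMPLE = """\
@prefix chebi: <http://purl.obolibrary.org/obo/CHEBI_> .
@prefix GO: <http://purl.obolibrary.org/obo/GO_> .
@prefix go: <https://example.org/go/> .
"""

for message in check_prefixes(EXAMPLE):
    print(message)
```

The sketch flags both directions of mismatch: a standard prefix bound to an unexpected URI prefix, and a standard URI prefix bound to a nonstandard CURIE prefix.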
- HTML
Published by cthoyt 9 months ago
https://github.com/cthoyt/cthoyt.github.io - New post: A historical analysis of ChEMBL
I've recently submitted an article to the Journal of Open Source Software (JOSS) describing chembl-downloader, a Python package for automating downloading and using ChEMBL data in a reproducible way. In this post, I use chembl-downloader to show how the numbers of compounds, assays, activities, and other entities in ChEMBL have changed over time.
Full post: https://cthoyt.com/2025/08/26/chembl-history.html
- HTML
Published by cthoyt 9 months ago
https://github.com/cthoyt/cthoyt.github.io - New post: Measuring the impact of the Bioregistry
The Bioregistry is a database and toolchain for standardization of prefixes, CURIEs, and URIs that appear in linked (open) data. While I created it in 2019 as a component of PyOBO in order to support parsing database cross-references appearing in biomedical ontologies, it has since become an independent project with a community-driven governance model and much broader applications. This post is a first attempt to quantify its usage and impact.
Full post: https://cthoyt.com/2025/08/22/bioregistry-impact.html
- HTML
Published by cthoyt 9 months ago
https://github.com/cthoyt/cthoyt.github.io - New post: Bioregistry and BiomarkerKB
The Bioregistry is a community-driven registry of semantic spaces and their metadata. When I learned about BiomarkerKB at the International Society for Biocuration's 18th Annual International Biocuration Conference, I was excited to curate new records (and prefixes) in the Bioregistry to cover BiomarkerKB's semantic spaces on biomarkers. This post summarizes the discussions I've had with its maintainers, Jeet and Raja, throughout the Bioregistry curation process and also gives insight into how databases can benefit from being represented in the Bioregistry.
- HTML
Published by cthoyt 9 months ago
https://github.com/cthoyt/cthoyt.github.io - Blog post: Text-based embedding of ontology terms
The Ontology Lookup Service (OLS) is now indexing dense embeddings for ontology terms constructed from term labels, synonyms, and descriptions using LLMs. I maintain a Python client library for the OLS (ols-client) and was recently asked in https://github.com/cthoyt/ols-client/issues/9 to implement a wrapper around the OLS's API endpoint that exposes these embeddings. This post is a demo of how to use that code, and how I replicated the same embedding functionality in PyOBO in https://github.com/biopragmatics/pyobo/pull/412 to extend it to arbitrary ontologies and databases not in OLS.
- HTML
Published by cthoyt 10 months ago
https://github.com/cthoyt/cthoyt.github.io - Blog Post: Exploring Event Venues in Wikidata
I was working on making data about scholarly conferences more FAIR and a big question crossed my mind: what are all the conference venues? This post covers some queries I wrote for Wikidata, the data issues I found, a few drive-by curations I did while looking for an answer, and my ideas for the future.
Full text: https://cthoyt.com/2025/01/17/event-venues-in-wikidata.html
- HTML
Published by cthoyt over 1 year ago
https://github.com/cthoyt/cthoyt.github.io - Blog Post: Dependency Groups and ReadTheDocs
PEP 735 introduced dependency groups in packaging metadata, which are complementary to optional dependencies in that they might not correspond to features in the package, but rather be something like development or release dependencies. I am slowly working towards updating my cookiecutter template cookiecutter-snekpack to use PEP 735. So far, uv and tox have released support; all that’s left is ReadTheDocs. This post summarizes the issue I added to their issue tracker and the following discussion.
Full post: https://cthoyt.com/2024/11/19/rtfd-dependency-groups.html
- HTML
Published by cthoyt over 1 year ago
https://github.com/cthoyt/cthoyt.github.io - Blog Post: Building Graphviz when installing PyGraphviz
Graphviz is software for graph visualization written in C. PyGraphviz provides a nice Python wrapper for it. The issue is that the way to make Python aware of the C headers changes every few months. I’ll try to keep this blog post updated every time there are some changes.
Full post: https://cthoyt.com/2024/11/05/installing-pygraphviz.html
- HTML
Published by cthoyt over 1 year ago
https://github.com/cthoyt/cthoyt.github.io - Blog Post: Some Haskell I Tried to Write
I’m working through making a contribution to pandoc that adds first-class support for author role annotations using the Contribution Role Taxonomy (CRediT) and also outputs compliant Journal Publishing Tag Set (JATS) XML. This has led me down a (losing) journey of learning the Haskell programming language, so I thought I would post a short note on a function I tried to understand.
Full post: https://cthoyt.com/2024/09/26/some-haskell.html
- HTML
Published by cthoyt over 1 year ago
https://github.com/cthoyt/cthoyt.github.io - Blog Post: Easy ORCID
The Open Researcher and Contributor Identifier (ORCID) database is an invaluable resource that supports the unambiguous identification of researchers. However, its first-party data dump is too complex, verbose, and unstandardized for many use cases. This post describes open source software I wrote that automates downloading, processing, and exporting ORCID into a more usable form. I put the results on Zenodo under the CC0 license.
- HTML
Published by cthoyt almost 2 years ago
https://github.com/cthoyt/cthoyt.github.io - Blog Post: Discussions and Follow-ups from Biocuration 2024
I've just returned from the 17th Annual International Biocuration Conference at the Indian Biological Data Centre (IBDC) in Faridabad, India. I wanted to highlight some of the interesting conversations I had while I was there, and ideas for follow-up. Most were centered around the Bioregistry and the Semantic Mapping Assembler and Reasoner (SeMRA), which I gave an oral presentation on.
What's Changed
- New Post: Discussions and Follow-ups from Biocuration 2024 by @cthoyt in https://github.com/cthoyt/cthoyt.github.io/pull/60
Full Changelog: https://github.com/cthoyt/cthoyt.github.io/compare/books-2023...biocuration2024-discussions
- HTML
Published by cthoyt about 2 years ago
https://github.com/cthoyt/cthoyt.github.io - Blog Post: Books I Read in 2023
Spoilers: it's a lot of Brandon Sanderson
- HTML
Published by cthoyt over 2 years ago
https://github.com/cthoyt/cthoyt.github.io - Blog Post: Unlocking UMLS
The Unified Medical Language System (UMLS) is a widely used biomedical and clinical vocabulary maintained by the United States National Library of Medicine. However, it is notoriously difficult to access and work with due to licensing restrictions and its complex download system. In the same vein as my previous posts about DrugBank and ChEMBL, this post describes open source software I’ve developed for downloading and working with this data. It also works for RxNorm, SemMedDB, SNOMED-CT, and any other data accessible through the UMLS Terminology Services (UTS) ticket granting system.
- HTML
Published by cthoyt over 2 years ago
https://github.com/cthoyt/cthoyt.github.io - Blog Post: Reproducibility Pilot in the Journal of Cheminformatics
I’ve been working on improving reproducibility in the field of cheminformatics for some time now. For example, I’ve written posts about making data from DrugBank and ChEMBL more actionable. Over the last year, I’ve been preparing a concept with the editors of the Journal of Cheminformatics on how to include an assessment of reproducibility in reviews of manuscripts submitted to the journal. This has resulted in an editorial, "Improving reproducibility and reusability in the Journal of Cheminformatics", as well as a call for papers. In this post, I want to summarize the first-generation review criteria we developed and give an example of them applied in practice.
- HTML
Published by cthoyt over 2 years ago
https://github.com/cthoyt/cthoyt.github.io - Blog: Querying Journals and Publishers in Wikidata
This post is about three SPARQL queries I wrote to get bibliometric information about journals and publishers out of Wikidata.
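To give a flavor of this kind of query, here is a minimal sketch that pairs scientific journals with their publishers and builds a request URL for the public Wikidata SPARQL endpoint. It is not one of the three queries from the post. It assumes the Wikidata identifiers P31 (instance of), Q5633421 (scientific journal), and P123 (publisher); `build_request_url` is a hypothetical helper, and the sketch only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: scientific journals together with their publishers.
QUERY = """\
SELECT ?journal ?journalLabel ?publisher ?publisherLabel WHERE {
  ?journal wdt:P31 wd:Q5633421 ;
           wdt:P123 ?publisher .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""


def build_request_url(query: str) -> str:
    """Build a GET URL that asks the endpoint for JSON results."""
    return f"{WIKIDATA_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


print(build_request_url(QUERY))
```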
- HTML
Published by cthoyt almost 3 years ago
https://github.com/cthoyt/cthoyt.github.io - Blog: Modeling and Querying Awards in Wikidata
I was recently nominated for the International Society for Biocuration’s Excellence in Biocuration Early Career Award. This made me curious about how to model nominations and awards on Wikidata. In this post, I’ll describe how to curate awards, nominations, and recipients, and how to write SPARQL queries to retrieve them.
- HTML
Published by cthoyt almost 3 years ago
https://github.com/cthoyt/cthoyt.github.io - Blog: Re-implementing the N2T ARK (Meta)Resolver
- HTML
Published by cthoyt about 3 years ago
https://github.com/cthoyt/cthoyt.github.io - Blog: Resources masquerading as OBO Foundry ontologies
- HTML
Published by cthoyt about 3 years ago