diaphanous
Exploring the limits of social media transparency data
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 4
Diaphanous: Transparency Disclosures About the Sexual Exploitation of Minors
This repository curates quantitative transparency disclosures about the online sexual exploitation of minors, i.e., people under the age of eighteen, in machine-readable form. It also includes a 4,400-line Python library for validating and tidying the data, as well as Python and R notebooks with the analysis for the corresponding report, Putting the Count Back Into Accountability: An Analysis of Transparency Data About the Sexual Exploitation of Minors, which is also available through this repository.
Please cite as: Robert Grimm. Diaphanous: Transparency Disclosures About the
Sexual Exploitation of Minors. Zenodo, 12 Dec. 2024,
.
The Code
To run the code in this repository, you'll need the following tools:
- According to vermin, the minimum required Python version is 3.11.
- The analysis/platform.ipynb notebook is written in Python and R. The necessary bindings are provided by the rpy2 Python package. The package is installed like other Python packages, as described in the next bullet point, but it does require a working R installation (e.g., `brew install r`).
- Required Python packages are listed in the repository's pyproject.toml. The simplest way of installing the project's dependencies is to create a local clone of this repository and then install it as follows:

  ```sh
  $ python -m venv .venv        # Create virtual environment
  $ . .venv/bin/activate        # Activate virtual environment
  $ pip install -e .            # Install diaphanous as editable
  ```

  Thanks to the `-e` option, `pip install` creates a so-called editable install, i.e., it makes the Python code in the `diaphanous` package executable without copying it. It also installs all necessary dependencies.
Building the report requires additional tools, i.e., a working LaTeX installation, though the necessary incantations are scripted.
The Data
While a few CSV files contain tidy data, others are decidedly untidy with, for example, individual columns combining two variables. The organization of a dataset usually reflects that of the original disclosure and helps ensure the correctness of data transcription. The Python package includes several examples of how to tidy up such data.
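To illustrate the kind of tidying involved, the following minimal sketch splits a column that combines two variables into two separate columns. The column names (`platform`, `metric_year`, `count`), the platform name, and all values are hypothetical placeholders rather than data from this repository, and the function is not part of the `diaphanous` package.

```python
import csv
import io

# Hypothetical untidy input: "metric_year" packs two variables into one column.
UNTIDY_CSV = """\
platform,metric_year,count
ExampleNet,reports 2021,10
ExampleNet,pieces 2021,25
ExampleNet,reports 2022,12
"""


def tidy_rows(text: str) -> list[dict[str, object]]:
    """Split the combined column into separate metric and year columns."""
    tidy = []
    for row in csv.DictReader(io.StringIO(text)):
        # The year is the last space-separated token; the rest is the metric.
        metric, year = row["metric_year"].rsplit(" ", 1)
        tidy.append(
            {
                "platform": row["platform"],
                "metric": metric,
                "year": int(year),
                "count": int(row["count"]),
            }
        )
    return tidy


rows = tidy_rows(UNTIDY_CSV)
for row in rows:
    print(row)
```

Keeping the untidy file as transcribed and tidying it in code, as sketched here, preserves the one-to-one correspondence with the original disclosure.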
Dataset 1: CyberTipline Reports per Year (1998 onward)
The CyberTipline reports per year dataset captures the number of reports NCMEC has received on its CyberTipline since its inception in March 1998, largely based on the table included in Appendix A of its 2022, 2023, and 2024 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.
We Are Social & Meltwater's Digital 2024 Global Overview Report includes statistics on the global number of social media accounts. They make for an effective estimate of population size when normalizing yearly CyberTipline report counts.
Dataset 2: CyberTipline Report Contents and Recipients (2020 onward)
The CyberTipline report contents and recipients dataset breaks down the reports NCMEC received by:
- the category of sexual exploitation, e.g., whether a report concerns child pornography, misleading words/images, online enticement, child sex trafficking, obscene material sent to a child, misleading domain names, child sexual molestation, or child sex tourism;
- the kind of attachments, e.g., photos, videos, or other;
- the uniqueness of attachments as determined by a precise hash (MD5) and a perceptual hash (PhotoDNA, Videntifier);
- their level of detail, i.e., whether they are actionable or only informational;
- their recipients in dedicated units, local, federal, or international law enforcement.
Labels for the uniqueness classification use "unique" for precisely hashed attachments and "similar" for perceptually hashed ones. The dataset combines several tables from NCMEC's 2022, 2023, and 2024 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.
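To make the "unique" label concrete, the following minimal sketch counts how many items in a collection are distinct under a precise hash. It only illustrates exact hashing with MD5 from Python's standard library. The helper names and the placeholder byte strings are hypothetical, and it does not reflect how NCMEC computes its statistics. Perceptual hashing is not sketched here.

```python
import hashlib
from collections import Counter


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a byte string as a hex string."""
    # usedforsecurity=False marks this as a non-cryptographic use of MD5.
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


def count_unique(files: list[bytes]) -> tuple[int, int]:
    """Return (total number of files, number distinct by exact hash)."""
    digests = Counter(md5_hex(data) for data in files)
    return len(files), len(digests)


# Arbitrary placeholder byte strings stand in for file contents.
samples = [b"alpha", b"beta", b"alpha", b"alpha "]
total, unique = count_unique(samples)
print(f"{total} files, {unique} unique by MD5")
```

Because a precise hash changes completely when a single byte differs, the last sample counts as distinct from the first. That sensitivity is why the dataset reports perceptually "similar" attachments separately.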
Dataset 3: CyberTipline Reports per Platform (2019 onward)
The CyberTipline reports per platform dataset is the project's main dataset. It collects:
- disclosures about child sexual exploitation by major non-Chinese social networks and other large service providers;
- corresponding disclosures about service providers' reporting by NCMEC.
The above linked JSON format is automatically generated from a Python module. Both formats have the same structure and contain the same information.
The dataset incorporates information about the following technology organizations and their platforms:
- Amazon
  - Twitch
- Apple
- Automattic
  - Tumblr
  - WordPress
- Aylo (née MindGeek)
  - Pornhub
- Discord
- Google
  - YouTube
- MediaLab
  - Amino
  - Imgur
  - Kik
- Meta
  - Threads
- Microsoft
  - GitHub
- Omegle
- Quora
- Snap
- Telegram
- TikTok
- Wikimedia
- X (née Twitter)
Surveyed organizations fall into at least one of the following categories:
- Social media based on Buffer's list of top social media sites;
- Popular platforms based on the European Commission's list of very large online platforms;
- Platforms with at least 100,000 CyberTipline reports in one year.
A separate codebook documents the JSON and Python formats. In short, each consists of a top-level object that maps organization names to an object with the data about that organization. Since platforms vary widely in the metrics they disclose, the format is necessarily rather generic and collects all of a platform's quantitative disclosures within one table:
- Since platforms make transparency disclosures for quarter, half, and full years, each table also organizes metrics into time periods with the same granularity.
- To faithfully capture disclosures, time periods may vary within a table. They may also overlap, both to capture several partial disclosures and to capture several redundant disclosures. A flag clearly marks the latter entries.
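The notion of overlapping periods with different granularity can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `Period` class, the `parse_period` helper, and the label syntax (`"2023"`, `"2023 H1"`, `"2023 Q3"`) are assumptions made for this example. The codebook remains the authoritative description of the actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    """A reporting period as an inclusive range of months within one year."""

    year: int
    first_month: int
    last_month: int

    def overlaps(self, other: "Period") -> bool:
        """Check whether two periods share at least one month."""
        return (
            self.year == other.year
            and self.first_month <= other.last_month
            and other.first_month <= self.last_month
        )


def parse_period(label: str) -> Period:
    """Parse hypothetical labels such as "2023", "2023 H1", or "2023 Q3"."""
    parts = label.split()
    year = int(parts[0])
    if len(parts) == 1:
        return Period(year, 1, 12)  # A full year
    kind, index = parts[1][0], int(parts[1][1:])
    months = {"Q": 3, "H": 6}[kind]  # Length of a quarter or half in months
    if not 1 <= index <= 12 // months:
        raise ValueError(f"invalid period label: {label!r}")
    return Period(year, (index - 1) * months + 1, index * months)


first_half = parse_period("2023 H1")
second_quarter = parse_period("2023 Q2")
print(first_half.overlaps(second_quarter))
```

Under these assumptions, a half-year entry and a quarterly entry for the same months are detected as overlapping. A check like this could help decide which entries need the redundancy flag.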
Where possible, the table uses standard labels for equivalent metrics:
- `reports` tallies CyberTipline reports to NCMEC;
- `pieces` tallies instances of CSAM such as pictures and videos;
- `accounts` tallies user registrations implicated and terminated for CSAM.
Instead of "account termination," many platforms use a euphemism such as "permanent suspension." User registrations affected in this way are included under `accounts`. Registrations that are only temporarily affected are not.
Comparable CyberTipline report counts and per-provider comparable CyberTipline report counts are materialized views onto the same data. Both views are in long format and only include rows for counts that were disclosed by both electronic service provider and NCMEC.
The latter, more precise view has year, observer, count, and topic columns, with the topic column enabling the grouping of rows with service provider and NCMEC as observers. The former, simplified view has only id, observer, and count columns, with the ID column effectively combining the other view's year and topic columns and the observer column only distinguishing between a generic ServiceProvider and NCMEC.
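The relationship between the two views can be sketched as a small transformation from the precise view to the simplified one. This is an illustrative sketch only: the way `year` and `topic` are joined into an `id`, the platform name, and the counts are hypothetical and not taken from the repository's data.

```python
def simplify(rows: list[dict[str, object]]) -> list[dict[str, object]]:
    """Derive id/observer/count rows from year/observer/count/topic rows."""
    simplified = []
    for row in rows:
        # Collapse every observer other than NCMEC into a generic label.
        observer = "NCMEC" if row["observer"] == "NCMEC" else "ServiceProvider"
        simplified.append(
            {
                # Assumed ID scheme: year and topic joined by a space.
                "id": f"{row['year']} {row['topic']}",
                "observer": observer,
                "count": row["count"],
            }
        )
    return simplified


# Placeholder rows: one count disclosed by a provider and one by NCMEC.
precise = [
    {"year": 2022, "observer": "ExampleNet", "count": 100, "topic": "ExampleNet"},
    {"year": 2022, "observer": "NCMEC", "count": 110, "topic": "ExampleNet"},
]
simple = simplify(precise)
for row in simple:
    print(row)
```

In both views, each group of rows sharing a topic and year (or an ID) pairs a provider-reported count with the corresponding NCMEC-reported count, which makes the two directly comparable.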
Dataset 4: CyberTipline Reports per Country (2019 onward)
CyberTipline reports per country collects NCMEC's per-country breakdown of CyberTipline reports for 2019, 2020, 2021, 2022, 2023, and 2024 in machine-readable form. The CSV table is mostly straightforward: Its first two columns contain the country name and ISO three-letter code, followed by a column per year from 2019 through 2024.
To preserve all information from NCMEC's disclosures, the table includes rows for the Netherlands Antilles (ANT), "Europe" (EEE), Bouvet Island (BVT), and "No Country Listed" (no code). NCMEC explains neither its inclusion of Europe in addition to individual European countries nor that of the Netherlands Antilles in addition to its 2010 successors Bonaire, Sint Eustatius, and Saba (BES), Curaçao (CUW), and Sint Maarten (SXM). Nor does it explain the inclusion of Bouvet Island; this subantarctic dependency of Norway is an uninhabited nature reserve and hence rather unlikely to serve as the actual location of internet users.
This repository's Python package includes code that enriches this dataset with population counts, social account numbers, geometries, and region/continent information. It leverages the following data:
- Per-country population counts by the United Nations Population Division;
- Per-country internet user counts prepared by Our World in Data from statistics released by the International Telecommunication Union via the World Bank as well as the United Nations;
- Per-country ratios of social accounts per capita based on:
  - We Are Social & Hootsuite's Digital 2021 Local Country Headlines Report
  - We Are Social & Kepios' Digital 2022 Local Country Headlines Report
  - We Are Social & Meltwater's Digital 2023 Local Country Headlines Report
  - We Are Social & Meltwater's Digital 2024 Local Country Headlines Report
  - We Are Social & Meltwater's Digital 2025 Local Country Headlines Report

  By dividing the reports per capita by social accounts per capita, we can determine per-country reports per social account, i.e., report counts normalized by likely population size.
- Administrative boundaries for countries by Natural Earth, version 5.1.1;
- Per-country ISO 3166 Alpha-2 and Alpha-3 codes scraped from ISO's website and corresponding region names based on Luke Duncalfe's ISO-3166 dataset.
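The normalization by social accounts described above amounts to a simple division. The following minimal sketch spells it out. The function name and all numbers are hypothetical placeholders, not values from the dataset, and the function is not part of the `diaphanous` package.

```python
def reports_per_account(
    reports: int, population: int, accounts_per_capita: float
) -> float:
    """Normalize a country's report count by its likely number of accounts."""
    if population <= 0 or accounts_per_capita <= 0:
        raise ValueError("population and accounts per capita must be positive")
    reports_per_capita = reports / population
    # Dividing per-capita reports by per-capita accounts cancels the
    # population, leaving reports per social account.
    return reports_per_capita / accounts_per_capita


# Placeholder inputs: 5,000 reports, 1,000,000 people, 0.5 accounts per person.
ratio = reports_per_account(5_000, 1_000_000, 0.5)
print(f"{ratio:.4f} reports per social account")
```

With these placeholder inputs, the country has an estimated 500,000 social accounts, so the result is 5,000 / 500,000 = 0.01 reports per account.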
The following choropleths using the Equal Earth projection visualize CyberTipline reports per year per country per capita:
Dataset 5: Platform Data (2020 onward)
Discord, Meta, Microsoft, and TikTok have released (some) data in machine-readable form. This dataset contains the corresponding files. Discord's and Meta's data is in CSV format, Microsoft's in Excel format, and TikTok's first in Excel and later in CSV format. Meta's and TikTok's files include historical data whereas Discord's and Microsoft's do not. Since Meta re-uses the same URL every quarter, files released before Q2 2022 were retrieved from the Internet Archive's snapshots.
Dataset 6: Relationship between Offender and Victim
The CSAM pieces by relationship to victim dataset captures the relationship between suspected offenders and victims as determined by law enforcement agencies and tabulated by NCMEC. It is included in NCMEC's 2022 and 2023 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.
Since the number of victims in NCMEC's database seems to be very small, I pulled in two more datasets characterizing relationships as well. The first stems from OJJDP's Statistical Briefing Book and covers years 2018 and 2019. The data was originally extracted from the FBI's National Incident-Based Reporting System Master Files. Note that all counts are relative to "typical 1,000 sexual assaults." The second stems from LEARCAT and covers the year 2016. It also draws on the FBI's National Incident-Based Reporting System. While the Briefing Book data is helpful indeed, the choice of relationship bins for the LEARCAT data renders it close to useless in this context.
Other Data
The data directory contains a few more tables, including one with global population sizes also provided by the UN Population Division and one with Meta's daily and monthly active people, which captures the number of users who logged into Facebook, Instagram, Messenger, or WhatsApp at least once over a day or month. Both tables are used to calculate Meta's daily and monthly active people as a fraction of the world population.
Repository Layout
In addition to the data, this repository also contains the Python code for analyzing it as well as resulting figures. In particular:
- The `analysis` directory contains notebooks with the high-level analysis code. The `index.ipynb` notebook includes almost all other notebooks.
- The `diaphanous` directory contains the Python library code used by the notebooks.
  - The remaining code in `diaphanous.main` should be refactored into notebooks.
  - The `show()` function in `diaphanous.show` is more generally useful. Most of this functionality should be up-streamed to Pandas because it significantly improves on the default table format.
- The `figure` directory contains SVG figures.
- The `stubs` directory contains typing stubs.
- The `report` directory contains the LaTeX sources for the article discussing the work.
Acronyms
- CSAM: Child Sexual Abuse Material
- CSE: Child Sexual Exploitation
- NCMEC: National Center for Missing and Exploited Children
- OCSE: Online Child Sexual Exploitation
- OJJDP: Office of Juvenile Justice and Delinquency Prevention (at the US Department of Justice)
Licensing
The code in this repository is ©️ 2023–2024 by Robert Grimm and has been released under the Apache 2.0 open source license. The datasets in this repository combine disclosures by electronic service providers as well as the National Center for Missing and Exploited Children (NCMEC) and make this data more easily accessible in machine-readable form. They have been released under the CC BY 4.0 license.
Owner
- Name: Robert Grimm
- Login: apparebit
- Kind: user
- Location: New York City
- Website: https://apparebit.com
- Twitter: apparebit
- Repositories: 5
- Profile: https://github.com/apparebit
Software engineer by day. Apocalyptic prophet at night.
Citation (CITATION.cff)
```yaml
# This CITATION.cff file was generated with cffinit.
# Visit https://citation-file-format.github.io/cff-initializer-javascript/#/
# to generate yours today!

cff-version: 1.2.0
title: >-
  Diaphanous: Transparency Disclosures About the Sexual
  Exploitation of Minors
message: >-
  If you use this dataset and attendant software, please
  cite as below.
type: dataset
authors:
  - given-names: Robert
    family-names: Grimm
    email: rgrimm@alum.mit.edu
    orcid: 'https://orcid.org/0000-0002-8300-2153'
identifiers:
  - type: doi
    value: 10.5281/zenodo.13896437
    description: Zenodo
  - type: doi
    value: 10.48550/arXiv.2402.14625
    description: arXiv
abstract: >-
  This dataset collects transparency disclosures about
  online child sexual exploitation. Sources are the US
  clearinghouse for legally mandated reports about such
  activities and materials, the National Center for Missing
  and Exploited Children, as well as the technology industry
  firms filing the reports. Criteria for inclusion of the
  latter include industry rankings, population reach, and
  report volume.
keywords:
  - child sexual exploitation
  - child sexual abuse material
  - CSAM
  - transparency report
  - social media
  - CyberTipline
  - National Center for Missing and Exploited Children
  - NCMEC
license: CC-BY-4.0
commit: 549a2fd55c7b22d79098adbdc37ade8dbc5f9761
version: v0.4
date-released: '2024-12-12'
```
GitHub Events
Total
- Release event: 2
- Watch event: 1
- Push event: 26
- Create event: 2
Last Year
- Release event: 2
- Watch event: 1
- Push event: 26
- Create event: 2
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- geopandas *
- ipykernel *
- jinja2 *
- kaleido *
- matplotlib *
- nbformat *
- pandas *
- plotly *