diaphanous

Exploring the limits of social media transparency data

https://github.com/apparebit/diaphanous

Keywords

csam cybertipline ncmec social-media transparency

Scientific Fields

Political Science Social Sciences - 39% confidence

Repository

Basic Info

Host: GitHub
Owner: apparebit
License: apache-2.0
Language: Jupyter Notebook
Default Branch: boss
Homepage:
Size: 110 MB

Statistics

Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 4

Created about 3 years ago · Last pushed 11 months ago

Diaphanous: Transparency Disclosures About the Sexual Exploitation of Minors

This repository curates quantitative transparency disclosures about the online sexual exploitation of minors, i.e., people under the age of eighteen, in machine-readable form. It also includes a 4,400-line Python library for validating and tidying the data and Python as well as R notebooks with the analysis for the corresponding report Putting the Count Back Into Accountability: An Analysis of Transparency Data About the Sexual Exploitation of Minors, which is also available through this repository.

Please cite as: Robert Grimm. Diaphanous: Transparency Disclosures About the Sexual Exploitation of Minors. Zenodo, 12 Dec. 2024.

The Code

To run the code in this repository, you'll need the following tools:

According to vermin, the minimum required Python version is 3.11.
The analysis/platform.ipynb notebook is written in Python and R. The necessary bindings are provided by the rpy2 Python package. The package is installed like other Python packages as described in the next bullet point. But it does require a working R installation (e.g., brew install r).
Required Python packages are listed in the repository's pyproject.toml. The simplest way of installing the project's dependencies is to create a local clone of this repository and then install it as follows:

    $ python -m venv .venv       # Create virtual environment
    $ . .venv/bin/activate       # Activate virtual environment
    $ pip install -e .           # Install diaphanous as editable

Thanks to the -e option, pip install creates a so-called editable install, i.e., it makes the Python code in the diaphanous package executable without copying it. It also installs all necessary dependencies.

Building the report requires additional tools, i.e., a working LaTeX installation, though the necessary incantations are scripted.

The Data

While a few CSV files contain tidy data, others are decidedly untidy with, for example, individual columns combining two variables. The organization of a dataset usually reflects that of the original disclosure, which helps ensure the correctness of data transcription. The Python package includes several examples of how to tidy up such data.

Dataset 1: CyberTipline Reports per Year (1998 onward)

The CyberTipline reports per year dataset captures the number of reports NCMEC received on its CyberTipline since inception in March 1998, largely based on the table included in Appendix A of its 2022, 2023, and 2024 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.

We Are Social & Meltwater's Digital 2024 Global Overview Report includes statistics on the global number of social media accounts. They make for an effective estimate of population size when normalizing yearly CyberTipline report counts.
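The normalization described above can be sketched as follows. This is a minimal illustration, not the repository's own code: the function name is hypothetical and the figures are placeholders, not values from the dataset or the Digital 2024 report.

```python
def reports_per_million_accounts(reports: int, accounts: int) -> float:
    """Scale a yearly report count by the number of social media accounts."""
    if accounts <= 0:
        raise ValueError("accounts must be positive")
    # Multiply first so that round inputs yield exact results.
    return reports * 1_000_000 / accounts


# Placeholder figures for illustration only; they are NOT taken from the dataset.
yearly = {
    "year A": {"reports": 2_000, "accounts": 4_000_000},
    "year B": {"reports": 3_000, "accounts": 5_000_000},
}

rates = {
    year: reports_per_million_accounts(row["reports"], row["accounts"])
    for year, row in yearly.items()
}

for year, rate in rates.items():
    print(f"{year}: {rate:.1f} reports per million accounts")
```

Expressing counts per million accounts makes years with very different numbers of social media accounts comparable.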

Dataset 2: CyberTipline Report Contents and Recipients (2020 onward)

The CyberTipline report contents and recipients dataset breaks down the reports NCMEC received by:

the category of sexual exploitation, e.g., whether a report concerns child pornography, misleading words/images, online enticement, child sex trafficking, obscene material sent to a child, misleading domain names, child sexual molestation, or child sex tourism;
the kind of attachments, e.g., photos, videos, or other;
the uniqueness of attachments as determined by a precise hash (MD5) and a perceptual hash (PhotoDNA, Videntifier);
their level of detail, i.e., whether they are actionable or only informational;
their recipients in dedicated units, local, federal, or international law enforcement.

Labels for the uniqueness classification use "unique" for precisely hashed attachments and "similar" for perceptually hashed ones. The dataset combines several tables from NCMEC's 2022, 2023, and 2024 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.
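To illustrate what "unique" means under a precise hash, here is a minimal sketch that counts distinct attachments by MD5 digest. The byte strings are made-up stand-ins for files, and the helper name is hypothetical. Perceptual hashes such as PhotoDNA are proprietary and are not sketched here.

```python
import hashlib
from collections import Counter


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of an attachment's bytes."""
    # usedforsecurity=False signals that MD5 serves as a fingerprint only.
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


# Made-up stand-ins for attachment contents. The last entry differs from
# the first by a single trailing byte.
attachments = [b"file-one", b"file-one", b"file-two", b"file-one "]

digest_counts = Counter(md5_hex(blob) for blob in attachments)
total = len(attachments)
unique = len(digest_counts)

print(f"{total} attachments, {unique} unique by MD5")
```

Because a precise hash changes completely when a single byte changes, the near-duplicate in the last position counts as a separate unique attachment. A perceptual hash aims to group such near-duplicates, which is why the dataset reports "similar" counts separately.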

Dataset 3: CyberTipline Reports per Platform (2019 onward)

The CyberTipline reports per platform dataset is the project's main dataset. It collects:

disclosures about child sexual exploitation by major non-Chinese social networks and other large service providers;
corresponding disclosures about service providers' reporting by NCMEC.

The above linked JSON format is automatically generated from a Python module. Both formats have the same structure and contain the same information.

The dataset incorporates information about the following technology organizations and their platforms:

Amazon
- Twitch
Apple
Automattic
- Tumblr
- WordPress
Aylo (née MindGeek)
- Pornhub
Discord
Google
- YouTube
MediaLab
- Amino
- Imgur
- Kik
Meta
- Facebook
- Instagram
- Threads
- WhatsApp
Microsoft
- GitHub
- LinkedIn
Omegle
Pinterest
Quora
Reddit
Snap
Telegram
TikTok
Wikimedia
X (née Twitter)

Surveyed organizations fall into at least one of the following categories:

Social media based on Buffer's list of top social media sites,
Popular platforms based on the European Commission's list of very large online platforms,
Platforms with at least 100,000 CyberTipline reports in one year

A separate codebook documents the JSON and Python formats. Basically, they consist of a top-level object that maps organization names to an object with the data about that organization. Since platforms vary widely in what metrics they disclose, the format necessarily is rather generic and collects all of a platform's quantitative disclosures within one table:

Since platforms make transparency disclosures for quarter, half, and full years, each table also organizes metrics into time periods with the same granularity.
To faithfully capture disclosures, time periods may vary within a table. They may also overlap, both to capture several partial disclosures and to capture several redundant disclosures. A flag clearly marks the latter entries.
Where possible, the table uses standard labels for equivalent metrics:
- reports tallies CyberTipline reports to NCMEC;
- pieces tallies instances of CSAM such as pictures and videos;
- accounts tallies user registrations implicated and terminated for CSAM;
Instead of "account termination," many platforms use a euphemism such as "permanent suspension." User registrations thusly impacted are included under accounts. However, temporarily impacted registrations are not.
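The structure described above might be handled as in the following sketch. The field names (period, reports, redundant), the period notation ("2022", "2022 H1", "2023 Q1"), and the organization name are assumptions made for illustration; the codebook defines the actual format. The sketch shows why overlapping periods and the redundancy flag matter when summing a metric per year.

```python
from typing import Optional

# Hypothetical layout: organization name -> rows of that organization's table.
data = {
    "ExampleOrg": [
        {"period": "2022 H1", "reports": 40, "redundant": False},
        {"period": "2022 H2", "reports": 60, "redundant": False},
        {"period": "2022", "reports": 100, "redundant": True},
        {"period": "2023 Q1", "reports": 30, "redundant": False},
    ],
}

# Number of periods of each kind that make up one full year.
PARTS_PER_YEAR = {"": 1, "H": 2, "Q": 4}


def split_period(period: str) -> tuple[int, str]:
    """Return (year, kind), where kind is '' (full year), 'H', or 'Q'."""
    year, _, part = period.partition(" ")
    return int(year), part[:1]


def yearly_total(rows: list[dict], year: int, metric: str) -> Optional[int]:
    """Sum a metric over one year, using only a complete set of periods."""
    by_kind: dict[str, list[int]] = {}
    for row in rows:
        row_year, kind = split_period(row["period"])
        # Skip other years, entries flagged as redundant, and missing values.
        if row_year != year or row["redundant"] or row.get(metric) is None:
            continue
        by_kind.setdefault(kind, []).append(row[metric])
    # Prefer the coarsest granularity that covers the whole year.
    for kind, parts in PARTS_PER_YEAR.items():
        values = by_kind.get(kind, [])
        if len(values) == parts:
            return sum(values)
    return None


print(yearly_total(data["ExampleOrg"], 2022, "reports"))
print(yearly_total(data["ExampleOrg"], 2023, "reports"))
```

Summing every row for a year would double count whenever a platform discloses both partial and full-year figures, so the sketch never mixes granularities and returns None when a year is only partially covered.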

Comparable CyberTipline report counts and per-provider comparable CyberTipline report counts are materialized views onto the same data. Both views are in long format and only include rows for counts that were disclosed by both electronic service provider and NCMEC.

The latter, more precise view has year, observer, count, and topic columns, with the topic column enabling the grouping of rows with service provider and NCMEC as observers. The former, simplified view has only id, observer, and count columns, with the ID column effectively combining the other view's year and topic columns and the observer column only distinguishing between a generic ServiceProvider and NCMEC.
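The relationship between the two views can be sketched as below. The column names come from the description above, but the way the ID combines year and topic, the assumption that topic holds the provider's name, and the sample rows are all hypothetical.

```python
# Hypothetical rows in the detailed view's shape: year, observer, count, topic.
detailed = [
    {"year": 2022, "observer": "ExampleOrg", "count": 100, "topic": "ExampleOrg"},
    {"year": 2022, "observer": "NCMEC", "count": 110, "topic": "ExampleOrg"},
    {"year": 2023, "observer": "ExampleOrg", "count": 90, "topic": "ExampleOrg"},
]


def simplify(rows: list[dict]) -> list[dict]:
    """Derive id, observer, and count columns from the detailed rows."""
    return [
        {
            # Assumed ID scheme; the real view may combine the columns differently.
            "id": f"{row['year']}-{row['topic']}",
            "observer": "NCMEC" if row["observer"] == "NCMEC" else "ServiceProvider",
            "count": row["count"],
        }
        for row in rows
    ]


def pair_counts(rows: list[dict]) -> dict[str, dict[str, int]]:
    """Group counts by ID, keeping only IDs observed by both parties."""
    pairs: dict[str, dict[str, int]] = {}
    for row in rows:
        pairs.setdefault(row["id"], {})[row["observer"]] = row["count"]
    return {key: value for key, value in pairs.items() if len(value) == 2}


pairs = pair_counts(simplify(detailed))
for key, counts in pairs.items():
    print(key, counts["NCMEC"] - counts["ServiceProvider"])
```

Pairing rows by ID makes it straightforward to compute the gap between what a provider says it reported and what NCMEC says it received. The sketch drops the 2023 row because only one observer disclosed a count, mirroring the views' own inclusion rule.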

Dataset 4: CyberTipline Reports per Country (2019 onward)

CyberTipline reports per country collects NCMEC's per-country breakdown of CyberTipline reports for 2019, 2020, 2021, 2022, 2023, and 2024 in machine-readable form. The CSV table is mostly straightforward: Its first two columns comprise the country name and ISO three-letter code, followed by a column per year from 2019 through 2022.
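Reshaping such a wide table into long format might look like the sketch below. The header names (country, iso3), the country names and codes, and the counts are placeholders, not the contents of the actual CSV file.

```python
import csv
import io

# Placeholder table in the described shape: name, ISO code, one column per year.
SAMPLE = """country,iso3,2019,2020
Examplia,XXA,10,20
Samplestan,XXB,,5
"""


def to_long(text: str) -> list[dict]:
    """Convert a wide per-country table into one row per country and year."""
    reader = csv.DictReader(io.StringIO(text))
    # Treat every all-digit column name as a year column.
    year_columns = [name for name in reader.fieldnames or [] if name.isdigit()]
    rows = []
    for record in reader:
        for year in year_columns:
            cell = record[year].strip()
            if not cell:
                continue  # No count disclosed for this country and year.
            rows.append(
                {
                    "country": record["country"],
                    "iso3": record["iso3"],
                    "year": int(year),
                    "reports": int(cell),
                }
            )
    return rows


long_rows = to_long(SAMPLE)
for row in long_rows:
    print(row)
```

Detecting year columns by name instead of position keeps the function working when more years are added to the table.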

To preserve all information from NCMEC's disclosures, the table includes rows for the Netherlands Antilles (ANT), "Europe" (EEE), Bouvet Island (BVT), and "No Country Listed" (no code). NCMEC does not explain its inclusion of Europe in addition to individual European countries nor the Netherlands Antilles in addition to its 2010 successors Bonaire, Sint Eustatius, and Saba (BES), Curaçao (CUW), and Sint Maarten (SXM). Neither do they explain the inclusion of Bouvet Island; the subantarctic dependency of Norway is an uninhabited nature reserve and hence rather unlikely to serve as actual location of internet users.

This repository's Python package includes code that enriches this dataset with population counts, social account numbers, geometries, and region/continent information. It leverages the following data:

Per-country population counts by the United Nations Population Division;
Per-country internet user counts prepared by Our World in Data from statistics released by the International Telecommunication Union via WorldBank as well as the United Nations;
Per-country ratios of social accounts per capita based on:
By dividing the reports per capita by social accounts per capita, we can determine per-country reports per social accounts, i.e., report counts normalized by likely population size.
Administrative boundaries for countries by Natural Earth, version 5.1.1;
Per-country ISO 3166 Alpha-2 and Alpha-3 codes scraped from ISO's website and corresponding region names based on Luke Duncalfe's ISO-3166 dataset.
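The per-country normalization mentioned in the list above amounts to a simple division, sketched here with placeholder numbers. The function name and inputs are hypothetical and do not reflect the Python package's actual interface.

```python
import math


def reports_per_account(
    reports: int, population: int, accounts_per_capita: float
) -> float:
    """Normalize a country's report count by its likely number of social accounts."""
    if population <= 0 or accounts_per_capita <= 0:
        raise ValueError("population and accounts_per_capita must be positive")
    reports_per_capita = reports / population
    return reports_per_capita / accounts_per_capita


# Placeholder inputs for illustration only; NOT values from the dataset.
rate = reports_per_account(
    reports=500, population=1_000_000, accounts_per_capita=0.5
)
print(f"{rate:.4f} reports per social account")

# Equivalent formulation: divide reports by the estimated number of accounts.
estimated_accounts = 1_000_000 * 0.5
assert math.isclose(rate, 500 / estimated_accounts)
```

Since the population cancels out, the result equals the report count divided by the estimated number of social accounts in that country.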

The following choropleths using the Equal Earth projection visualize CyberTipline reports per year per country per capita:

Figure: CyberTipline reports per capita per country per year

Dataset 5: Platform Data (2020 onward)

Discord, Meta, Microsoft, and TikTok have released (some) data in machine-readable form. This dataset contains the corresponding files. Discord's and Meta's data is in CSV format, Microsoft's in Excel format, and TikTok's in Excel and later on CSV format. Meta's and TikTok's files include historical data whereas Discord's and Microsoft's do not. Since Meta re-uses the same URL every quarter, files released before Q2 2022 were retrieved from the Internet Archive's snapshots.

Dataset 6: Relationship between Offender and Victim

The CSAM pieces by relationship to victim dataset captures the relationship between suspected offenders and victims as determined by law enforcement agencies and tabulated by NCMEC. It is included in NCMEC's 2022 and 2023 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.

Since the number of victims in NCMEC's database seems to be very small, I pulled in two more datasets characterizing relationships as well. The first stems from OJJDP's Statistical Briefing Book and covers years 2018 and 2019. The data was originally extracted from the FBI's National Incident-Based Reporting System Master Files. Note that all counts are relative to "typical 1,000 sexual assaults." The second stems from LEARCAT and covers the year 2016. It also draws on the FBI's National Incident-Based Reporting System. While the Briefing Book data is helpful indeed, the choice of relationship bins for the LEARCAT data renders it close to useless in this context.

Other Data

The data directory contains a few more tables, including one with global population sizes also provided by the UN Population Division and one with Meta's daily and monthly active people, which captures the number of users who logged into Facebook, Instagram, Messenger, or WhatsApp at least once over a day or month. Both tables are used to calculate Meta's daily and monthly active people as a fraction of the world population.

Repository Layout

In addition to the data, this repository also contains the Python code for analyzing it as well as resulting figures. In particular:

The analysis directory contains notebooks with the high-level analysis code. The index.ipynb notebook includes almost all other notebooks.
The diaphanous directory contains the Python library code used by the notebooks.
- The remaining code in diaphanous.main should be refactored into notebooks.
- The show() function in diaphanous.show is more generally useful. Most of this functionality should be up-streamed to Pandas because it significantly improves on the default table format.
The figure directory contains SVG figures.
The stubs directory contains typing stubs.
The report directory contains the LaTeX sources for the article discussing the work.

Acronyms

CSAM: Child Sexual Abuse Material
CSE: Child Sexual Exploitation
NCMEC: National Center for Missing and Exploited Children
OCSE: Online Child Sexual Exploitation
OJJDP: Office of Juvenile Justice and Delinquency Prevention (at the US Department of Justice)

Licensing

The code in this repository is ©️ 2023–2024 by Robert Grimm and has been released under the Apache 2.0 open source license. The datasets in this repository combine disclosures by electronic service providers as well as the National Center for Missing and Exploited Children (NCMEC) and make this data more easily accessible in machine-readable form. The datasets have been released under the CC BY 4.0 license.

Owner

Name: Robert Grimm
Login: apparebit
Kind: user
Location: New York City

Website: https://apparebit.com
Twitter: apparebit
Repositories: 5
Profile: https://github.com/apparebit

Software engineer by day. Apocalyptic prophet at night.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://citation-file-format.github.io/cff-initializer-javascript/#/
# to generate yours today!

cff-version: 1.2.0
title: >-
  Diaphanous: Transparency Disclosures About the Sexual
  Exploitation of Minors
message: >-
  If you use this dataset and attendant software, please
  cite as below.
type: dataset
authors:
  - given-names: Robert
    family-names: Grimm
    email: rgrimm@alum.mit.edu
    orcid: 'https://orcid.org/0000-0002-8300-2153'
identifiers:
  - type: doi
    value: 10.5281/zenodo.13896437
    description: Zenodo
  - type: doi
    value: 10.48550/arXiv.2402.14625
    description: arXiv
abstract: >-
  This dataset collects transparency disclosures about
  online child sexual exploitation. Sources are the US
  clearinghouse for legally mandated reports about such
  activities and materials, the National Center for Missing
  and Exploited Children, as well as the technology industry
  firms filing the reports. Criteria for inclusion of the
  latter include industry rankings, population reach, and
  report volume.
keywords:
  - child sexual exploitation
  - child sexual abuse material
  - CSAM
  - transparency report
  - social media
  - CyberTipline
  - National Center for Missing and Exploited Children
  - NCMEC
license: CC-BY-4.0
commit: 549a2fd55c7b22d79098adbdc37ade8dbc5f9761
version: v0.4
date-released: '2024-12-12'

GitHub Events

Total

Release event: 2
Watch event: 1
Push event: 26
Create event: 2

Last Year

Release event: 2
Watch event: 1
Push event: 26
Create event: 2

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0