main-database
Main database of generative AI systems
https://github.com/language-technology-assessment/main-database
Science Score: 49.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: found 8 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.3%) to scientific vocabulary
Repository
Main database of generative AI systems
Basic Info
Statistics
- Stars: 11
- Watchers: 2
- Forks: 2
- Open Issues: 16
- Releases: 2
Metadata Files
readme.md
European Open Source AI Index main database
How to contribute
Every system is a separate yaml file. The first few fields contain basic metadata about the system/model, the rest of the file is a set of triples of _class, _link and _notes. Class can be one of three values: 🟩 open, 🟧 partial or 🟥 closed (leave empty to signify NA). Link is a URL providing evidence for the openness classification. Notes provide context and reasoning for the classification.
You're free to build on this work and reuse the data. It is licensed under CC-BY 4.0, with the stipulation that attribution should come in the form of a citation of the index files. doi:10.5281/zenodo.15386042
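As a rough sketch, one openness dimension in such a yaml file might look like this (the `datasources_basemodel` key prefix is illustrative; check an existing file in the repository for the actual field names):

```yaml
# One openness dimension, recorded as a _class/_link/_notes triple.
# Key names are hypothetical examples, not the repository's actual schema.
datasources_basemodel_class: 🟧 partial
datasources_basemodel_link: https://example.org/data-documentation
datasources_basemodel_notes: >
  The data card describes the pretraining mix at a high level,
  but individual data sources are not enumerated.
```

Leaving `_class` empty signifies NA.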
Which models are included?
The index aims to include any instruct-tuned generative AI system or model that is described by the responsible organisation or builder as "open-source" or "open", or that is marketed as such by official outlets of the responsible organisation or builder. Generally, the index aims to:
Refer to the model by its most recent version, without naming the model size. Evaluation is then based on the largest model in the family. This may skip over nuances where different sizes use different base models, and it does not capture how models have evolved over time.
The index is periodically updated by our small team (or community contributors!) to capture how models and the information supplied about them evolve. For instance, as new models are first released and later get preprints, the related 'preprint' and 'paper' entries may be updated in due course.
Models spanning different modalities may be included in more than one modality category (text, image, video, etc.), leading to multiple entries in the index.
Openness Criteria and system information
For each model, the yaml files in this database collect (1) general information about the system, (2) information about the organisation behind it, and (3) assessments of 14 dimensions of openness. The list below spells out the criteria for the system information and organisation features, followed by openness criteria grouped into 'Availability', 'Documentation' and 'Access'. Use these guidelines to document determinations of openness levels as precisely as possible, including links to evidence. Notes are optional.
System
name: Name of the model, including version number or size indication where applicable, e.g. Llama 3.1 or Olmo-7B-instruct
link: Link to official model publisher website or, if that does not exist, platform hosting the model.
type: Model type in one word, e.g. text, video, audio. Multiple keywords possible.
performanceclass: The performance class of the model. See https://osai-index.eu/news/performance-classes.
basemodelname: If applicable, name of base model ("foundation model") that was used.
endmodelname: Name of the model the end user interacts with.
endmodellicense: License that applies to end-user interaction with the model.
releasedate: Earliest release date of the model through any official source, in YYYY MMM format, e.g. 2024 NOV.
Organisation
name: Organisation that released the model. Usually synonymous with model builder.
link: Link to official source of information about the model release, e.g. official website or blog.
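Put together, the metadata header of a system file might look roughly like this (the nesting and exact keys are assumptions for illustration; the fields mirror the descriptions above):

```yaml
# Illustrative metadata header of a system file; keys and nesting are assumed.
system:
  name: ExampleModel-7B-instruct   # hypothetical model
  link: https://example.org/examplemodel
  type: text
  basemodelname: ExampleModel-7B
  endmodelname: ExampleModel-7B-instruct
  endmodellicense: Apache-2.0
  releasedate: 2024 NOV
org:
  name: Example Lab
  link: https://example.org/blog/release
```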
Availability
Datasources Basemodel
Are datasources for training the base model comprehensively documented and made available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end model data entries.
🟥 Training data sources of the base large language model are not open for inspection or shared.
🟧 Some of the training data sources of the base large language model are open for inspection or shared.
🟩 All training data sources of the base large language model are open for inspection or shared.
Datasources Endmodel
Are datasources for training the model that the end user interacts with comprehensively documented and made available?
🟥 Training data sources of the end model are not open for inspection or shared.
🟧 Some of the training data sources of the end large language model are open for inspection or shared.
🟩 All training data sources of the end large language model are open for inspection or shared.
Weights basemodel
Are the weights of the base models made freely available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end model data entries.
🟥 Weights of the base model are not shared.
🟧 Weights of the base model are partially/not fully shared.
🟩 Weights of the base model are shared.
Weights endmodel
Are the weights of the model that the end user interacts with made freely available?
🟥 Weights of the user-facing end model are not shared.
🟧 Weights of the user-facing end model are partially/not fully shared.
🟩 Weights of the user-facing end model are shared.
Training Code
Is the source code of dataset processing, model training and tuning comprehensively made available?
🟥 No source code available.
🟧 Some source code is open.
🟩 Project source code is openly available and fully open for inspection.
Documentation
Code Documentation
Is the source code of datasource processing, model training and tuning comprehensively documented?
🟥 Code documentation not available.
🟧 Some components of the system feature code documentation, but not every step of base and/or end model training and tuning is documented (irrespective of whether these components are shared).
🟩 All components of the system feature comprehensive code documentation.
Hardware Architecture Documentation
Is the hardware architecture used for datasource processing and model training comprehensively documented?
🟥 System architecture and model training setup are not documented.
🟧 System architecture and model training setup are partially documented.
🟩 System architecture and model training setup are fully documented.
Preprint
Are archived preprint(s) available that detail all major parts of the system, including datasource processing, model training and tuning steps?
🟥 No archived preprint(s) available.
🟧 Archived preprint(s) are available that detail some parts of the system, including datasource processing, model training and tuning steps.
🟩 Archived preprint(s) are available that detail all major parts of the system including datasource processing, model training and tuning steps.
Paper
Are peer-reviewed scientific publications available that detail all major parts of the system including datasource processing, model training and tuning steps?
🟥 No peer-reviewed paper(s) available.
🟧 Peer-reviewed paper(s) detail parts of the software including base models, fine-tuning, or RLHF components.
🟩 Peer-reviewed paper(s) are available that cover all parts of the software including base models, fine-tuning, and RLHF components.
Model card
Is a model card available in standardized format that provides comprehensive insight on model architecture, training, fine-tuning, and evaluation?
🟥 Model card(s) not available.
🟧 Model card(s) that provide partial insight on model architecture, training, fine-tuning, and evaluation are available.
🟩 Model card(s) are available that provide comprehensive insight on model architecture, training, fine-tuning, and evaluation.
Datasheet
Is a datasheet as defined in "Datasheets for Datasets" (Gebru et al. 2021) available?
🟥 Datasheet(s) are not available.
🟧 Datasheet(s) that provide partial insight on data collection and curation are available.
🟩 Datasheet(s) are available that provide comprehensive insight on data collection and curation, following the standards defined in Datasheets for Datasets (https://doi.org/10.1145/3458723) by Gebru et al. (2021).
Access methods
Package
Is a packaged release of the model available on a software repository (e.g. the Python Package Index, Homebrew)?
🟥 No software package is available.
🟧 User-oriented code or a web interface is available, but not as a versioned package.
🟩 A packaged release of the model is available on a software repository (e.g. the Python Package Index, Homebrew).
API
Is an API available that provides unrestricted access to the model (other than security and CDN restrictions)? If applicable, this entry also collects information on the use and availability of meta prompts.
🟥 No API access.
🟧 Commercial or restricted-access user API is available.
🟩 An API is available that provides unrestricted access to the model (other than security and CDN restrictions).
Licenses
Is the project fully covered by Open Source Initiative (OSI)-approved licenses, including all data sources and training pipeline code?
🟥 The project is not licensed clearly or does not use an Open Source Initiative (OSI)-approved license.
🟧 Only parts of the model and data sources are released under an Open Source Initiative (OSI)-approved license, such as model weights.
🟩 The project is fully covered by Open Source Initiative (OSI)-approved licenses, including all data sources and training pipeline code.
Owner
- Name: Language technology assessment
- Login: Language-Technology-Assessment
- Kind: organization
- Repositories: 1
- Profile: https://github.com/Language-Technology-Assessment
The language technology assessment group is based at Radboud University (The Netherlands) and conducts technology assessment research on generative AI systems.
GitHub Events
Total
- Create event: 5
- Release event: 1
- Issues event: 88
- Watch event: 14
- Delete event: 1
- Issue comment event: 79
- Member event: 3
- Push event: 326
- Public event: 1
- Pull request event: 22
- Fork event: 1
Last Year
- Create event: 5
- Release event: 1
- Issues event: 88
- Watch event: 14
- Delete event: 1
- Issue comment event: 79
- Member event: 3
- Push event: 326
- Public event: 1
- Pull request event: 22
- Fork event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 40
- Total pull requests: 9
- Average time to close issues: about 2 months
- Average time to close pull requests: 3 days
- Total issue authors: 9
- Total pull request authors: 2
- Average comments per issue: 1.13
- Average comments per pull request: 1.11
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 40
- Pull requests: 9
- Average time to close issues: about 2 months
- Average time to close pull requests: 3 days
- Issue authors: 9
- Pull request authors: 2
- Average comments per issue: 1.13
- Average comments per pull request: 1.11
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- DBlankvoort (18)
- liesenf (9)
- mdingemanse (9)
- behnaz-motavali (5)
- Adekiii (4)
- loxoron218 (2)
- sandbergja (1)
- LinguList (1)
- nityaak5 (1)
Pull Request Authors
- mdingemanse (6)
- DBlankvoort (5)
- liesenf (2)