https://github.com/commoncrawl/web-languages

Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary

Keywords

crawling dataset language-detection

Keywords from Contributors

transformer cryptocurrency cryptography jax audio deepseek gemma glm model-hub pretrained-models
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 52
  • Watchers: 7
  • Forks: 68
  • Open Issues: 6
  • Releases: 0
Topics
crawling dataset language-detection
Created over 1 year ago · Last pushed 6 months ago
Metadata Files
Readme

README.md

Web Languages Project

Welcome! This is a crowd-sourced effort to improve crawling of low-resource languages. This dataset is public.

Common Crawl recognizes many languages, but our crawls contain too little content in widely spoken languages like Hindi (500 million speakers!), smaller national languages like Hungarian, and regional languages like Catalan. We are interested in languages from all over the world. If you choose to help, you'll be helping create lists of websites related to languages that you read or speak.

How can I contribute?

If you look below you'll see a huge list of living languages. If you see one that looks interesting, click on it. You'll see a language-specific document, probably mostly blank, that you can fill out.

There are two ways to add to this document. If you aren't very familiar with GitHub, you can copy the entire document into an email, fill it out, and send it to web-languages ZAT commoncrawl ZOT org. We'll do the rest.

If you are familiar with GitHub and are logged in, click on the pen icon in the upper right corner to start editing the document. GitHub will ask you to fork the repo. Do that, edit the document, and finally create a pull request.

To see a partially completed example, look at the Welsh entry.

Sometimes asking a Large Language Model can be helpful: "What are some top websites written in the Welsh language?"

You can also join our Discord server where we have a dedicated channel for discussing this project.

What kind of websites are you looking for?

If you look at the template, we have requested URLs in a few categories: News, Culture/History, Government, Political Parties, and Other. Remember that we're looking for websites in this particular language. If the language is only a part of the website, and that's visible in the URL as https://example.com/catalan/, then that's the URL you should add.
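As a rough illustration of the last point, the sketch below trims a deep link down to the language-specific section of a site. The helper name language_section_url and the example page URL are hypothetical and not part of the project; it assumes the language appears as its own path segment, as in the https://example.com/catalan/ example.

```python
from urllib.parse import urlsplit, urlunsplit


def language_section_url(page_url: str, segment: str) -> str:
    """Trim page_url to the path prefix that ends at `segment`.

    Hypothetical helper for illustration only. If `segment` is not a
    path component, the URL is returned with its path unchanged. The
    query string and fragment are always dropped.
    """
    parts = urlsplit(page_url)
    pieces = parts.path.split("/")
    if segment in pieces:
        # Keep everything up to and including the language segment.
        idx = pieces.index(segment)
        path = "/".join(pieces[: idx + 1]) + "/"
    else:
        path = parts.path
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(language_section_url("https://example.com/catalan/news/2024/story.html", "catalan"))
# prints https://example.com/catalan/
```

A site that marks language with a subdomain or a query parameter would need a different rule, so treat this only as a starting point.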

For a language like Hindi, with 500 million speakers, there are a lot of websites to choose from. Please suggest websites that are important and influential, and please think about diversity. Are all geographic regions represented?

For a country-wide language like Hungarian, there is still probably a wide variety of websites to choose from. If a website is entirely in English, however, that's not what we're looking for.

For a regional language like Catalan, things are trickier. Catalan has multiple names -- it's called Valencian in some parts of Spain -- and use of the Catalan language is a part of a vigorous debate in Spanish national and regional politics. You might not be able to find Catalan-language content for every political party, and government websites might offer Catalan content one day and remove it after the next election. In that case, please do the best you can.

If your favorite language has its own Wikipedia -- check the list here -- please include this link under "Other".

What if my favorite language isn't in the list?

If you don't see your language, please open a GitHub issue, or send us an email at web-languages ZAT commoncrawl ZOT org. It could be that your language is here but has an unfamiliar name, or perhaps we need to add it. This list was started from the list in ISO 639-3, which is, like any world-wide standard, an imperfect list.

See also: Constructed, Extinct, Historical, Special

Languages with more than 50 million speakers

Languages

License

This work is marked with CC0 1.0

By editing this file, contributors are agreeing to release their contributions under the CC0 license.

Owner

  • Name: Common Crawl Foundation
  • Login: commoncrawl
  • Kind: organization
  • Email: info@commoncrawl.org

Common Crawl provides an archive of webpages going back to 2007.

GitHub Events

Total
  • Create event: 43
  • Commit comment event: 1
  • Issues event: 5
  • Watch event: 52
  • Delete event: 40
  • Member event: 1
  • Issue comment event: 67
  • Push event: 122
  • Pull request review comment event: 76
  • Pull request review event: 162
  • Pull request event: 222
  • Fork event: 58
Last Year
  • Create event: 43
  • Commit comment event: 1
  • Issues event: 5
  • Watch event: 52
  • Delete event: 40
  • Member event: 1
  • Issue comment event: 67
  • Push event: 122
  • Pull request review comment event: 76
  • Pull request review event: 162
  • Pull request event: 222
  • Fork event: 58

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 131
  • Total Committers: 40
  • Avg Commits per committer: 3.275
  • Development Distribution Score (DDS): 0.832
Past Year
  • Commits: 131
  • Committers: 40
  • Avg Commits per committer: 3.275
  • Development Distribution Score (DDS): 0.832
Top Committers
Name Email Commits
Evan Pacini e****i@g****m 22
Greg Lindahl g****g@c****g 12
underwood 5****t 12
Twan Goosen t****n@c****u 7
Swapnil Tripathi s****t@g****m 6
Joy 7****s 5
Samuel R. s****a@y****r 4
Prakash Rajendran hi@p****h 4
Ethan Wenokur e****r@g****m 4
Mejans 6****s 4
Greg Lindahl l****l@p****m 4
Dmitry Gaynullin 1****i 4
Kirӧ j****n@g****m 3
Alex 1****d 3
ButterflyOfFire 4****e 3
Sujith K 8****8 3
jenenglish 4****h 3
Chris Emezue c****e@g****m 2
Jean Maillard j****n@m****t 2
Christopher Nguyen c****n 2
Manuel Goulão m****o@g****m 2
Wannaphong Phatthiyaphaibun w****g@y****m 2
codemurt e****e@i****u 1
Tomáš Mlynář 4****m 1
Swati Rajwal 1****l 1
Samuel Arcadinho s****v@h****m 1
Pierre 6****a 1
Neechalkaran n****n@g****m 1
Maja Swieczkowska 7****a 1
Hanna Yukhymenko 4****h 1
and 10 more...
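
The two derived figures in the committer statistics can be reproduced from the raw counts. The sketch below assumes the Development Distribution Score is one minus the top committer's share of all commits. That definition is not stated on this page, but it is consistent with the numbers shown (22 of 131 commits). The function names are illustrative.

```python
def average_commits(total_commits: int, total_committers: int) -> float:
    """Mean number of commits per committer."""
    return total_commits / total_committers


def development_distribution_score(top_commits: int, total_commits: int) -> float:
    """Assumed definition: 1 - (top committer's commits / all commits)."""
    if total_commits <= 0:
        raise ValueError("total_commits must be positive")
    return 1 - top_commits / total_commits


# Figures taken from the statistics above.
avg = average_commits(131, 40)
dds = development_distribution_score(22, 131)

print(round(avg, 3))  # 3.275
print(round(dds, 3))  # 0.832
```

Under this definition a score near 0 means one person wrote almost every commit, and a score near 1 means the work is spread across many committers.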
Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 5
  • Total pull requests: 175
  • Average time to close issues: 2 days
  • Average time to close pull requests: about 22 hours
  • Total issue authors: 5
  • Total pull request authors: 57
  • Average comments per issue: 1.4
  • Average comments per pull request: 0.27
  • Merged pull requests: 147
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 5
  • Pull requests: 175
  • Average time to close issues: 2 days
  • Average time to close pull requests: about 22 hours
  • Issue authors: 5
  • Pull request authors: 57
  • Average comments per issue: 1.4
  • Average comments per pull request: 0.27
  • Merged pull requests: 147
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • tiendung (1)
  • rutsam (1)
  • jenenglish (1)
  • evanpacini (1)
  • thunderpoot (1)
Pull Request Authors
  • e-Winnie (33)
  • thunderpoot (29)
  • VojvyvKiro (6)
  • twagoo (6)
  • BitsandGits (6)
  • gaydmi (6)
  • Mejans (6)
  • steveisd (6)
  • rutsam (4)
  • rasuljasirdent (4)
  • jeanm (4)
  • ctn (4)
  • swaptr (4)
  • dearprakash (4)
  • jenenglish (4)
Top Labels
Issue Labels
Pull Request Labels
  • documentation (1)