generate-a-corpus-with-an-llm
The notebook in this repository is provided for students in DIGI405 at the University of Canterbury to query a Large Language Model (LLM) to generate a corpus. Students can adapt the code to generate their own data for an assignment.
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 1 DOI reference(s) in README -
✓Academic publication links
Links to: zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (15.9%) to scientific vocabulary
Repository
The notebook in this repository is provided for students in DIGI405 at the University of Canterbury to query a Large Language Model (LLM) to generate a corpus. Students can adapt the code to generate their own data for an assignment.
Basic Info
- Host: GitHub
- Owner: polsci
- License: mit
- Language: Jupyter Notebook
- Default Branch: master
- Homepage: https://geoffford.nz/generate-a-corpus
- Size: 35.2 KB
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 3
Metadata Files
README.md
Generate a corpus with an LLM
Geoff Ford
https://geoffford.nz
The notebook in this repository is provided for students in DIGI405 at the University of Canterbury to query a Large Language Model (LLM) to generate a corpus. Students can adapt the examples to generate their own data.
I appreciate this notebook may be relevant for others. If you use it please retain the authorship information and links or cite it.
To learn more take a look at the notebook. I've also written a post on my website about how we are used this in class in 2024.
Changes are documented in the CHANGELOG.
Note on OpenRouter support
Note: that version 1.1.1 of this repository used the OpenRouter API. From version 1.1.2 the notebook uses Cerebras. You can download the 1.1.1 release from the repository if you want to use OpenRouter.
Note on Cerebras
The notebook provides code to query Cerebras's API. Cerebras provides an API endpoint that provides access to multiple LLMs with generous rate limits for development and testing. Cerebras provides good documentation and access to a range of well-regarded models. API calls are rate limited.
If anyone from Cerebras sees this - free access to API calls and the provided rate limits are very helpful for educators. Thank you!
Create a Cerebras API key
Go to Cerebras and click the link to "Get an API key". For students in DIGI405, you can signup with your UC email address. You should indicate you are a student. You will be shown an API key (partially obscured) and sample code. Copy and paste the key into your password manager for future use. There is a field in the notebook where you need to paste in your key. Don't share your key with anyone else.
Instructions for DIGI405 students - warning about excessive, rapid or repeated requests during lab times
This is the first semester we are using the Cerebras service in DIGI405, please avoid making excessive, rapid or repeated requests during the lab times as there is the potential this could cause our network to be flagged as malicious and create problems for your classmates accessing the API.
Installation
If you are a DIGI405 student running this on our JupyterHub instance, all required libraries are pre-installed. If you want to install this on your own machine, there is a requirements.txt file with required libraries. To install the required libraries run:
pip install -r requirements.txt
Owner
- Name: Geoff Ford
- Login: polsci
- Kind: user
- Location: Ōtautahi, NZ
- Company: University of Canterbury Arts Digital Lab
- Website: https://polsci.github.io/
- Twitter: ageoffford
- Repositories: 17
- Profile: https://github.com/polsci
Citation (CITATION.cff)
cff-version: 1.2.0 message: "If you use this repository for your teaching or research, please cite it as below." authors: - family-names: "Ford" given-names: "Geoffrey" orcid: "https://orcid.org/0000-0001-7088-4073" title: "Generate a corpus with an LLM" version: 1.1.2 doi: 10.5281/zenodo.13364418 date-released: 2024-08-23 url: "https://github.com/polsci/generate-a-corpus-with-an-LLM"
GitHub Events
Total
- Release event: 1
- Watch event: 2
- Push event: 1
- Create event: 1
Last Year
- Release event: 1
- Watch event: 2
- Push event: 1
- Create event: 1
Issues and Pull Requests
Last synced: 12 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- python-slugify *
- requests *