https://github.com/spencerpresley/academicmetrics
AI-powered toolkit for analyzing and classifying academic research publications using LLMs and automated data collection. Output options: a MongoDB database (by providing your database URI), JSON files, or Excel spreadsheets. See the README for quick setup and the documentation for implementation details.
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 9.1%, to scientific vocabulary)
Keywords
Repository
Basic Info
- Host: GitHub
- Owner: SpencerPresley
- License: mit
- Language: Python
- Default Branch: master
- Homepage: https://academicmetrics.readthedocs.io/en/latest/
- Size: 44.3 MB
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Topics
Metadata Files
README.md

Welcome to Academic Metrics
```bash
pip install academic-metrics
```
What is Academic Metrics?
Academic Metrics is an AI-powered toolkit for collecting, classifying, and analyzing academic publications.
The system can be used to:
- Collect publication data from Crossref API based on institutional affiliation
- Classify research into NSF PhD research focus areas utilizing LLMs
- Extract and analyze themes and methodologies from abstracts
- Generate comprehensive analytics at article, author, and category levels
- Store results in MongoDB (local or live via atlas), local JSON files, and Excel files
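As a quick preview of what a run looks like, here is a minimal sketch that processes a single month of publications for one institution. The PipelineRunner parameters shown here are walked through in detail in the setup steps below, including the .env file the environment variables come from:

```python
# Minimal preview: process one month of publications for one institution.
# The full setup (installing the package, creating the .env file, etc.)
# is covered step by step in the sections below.
import os

from dotenv import load_dotenv
from academic_metrics.runners import PipelineRunner

load_dotenv()  # loads MONGODB_URI, DB_NAME, and OPENAI_API_KEY from your .env file

runner = PipelineRunner(
    ai_api_key=os.getenv("OPENAI_API_KEY"),
    crossref_affiliation="Salisbury University",
    data_from_month=1,
    data_to_month=1,
    data_from_year=2024,
    data_to_year=2024,
    mongodb_uri=os.getenv("MONGODB_URI"),
    db_name=os.getenv("DB_NAME"),
)
runner.run_pipeline()
```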
[!TIP] Academic Metrics utilizes an early version of AIChainComposer for working with LLMs.
AIChainComposer is a powerful tool to quickly, easily, and efficiently build out programmatic workflows with LLMs.
AIChainComposer is now provided as a standalone package and is available on PyPI, so you can use the same tooling that powered the development of Academic Metrics in your own projects.
Table of Contents
- Welcome to Academic Metrics
- Table of Contents
- Features
- Documentation
- Example Site and Demo
- Installation and Setup Steps
- 0. External Setup
- 1. Installation
- 2. Creating the directory and necessary files
- 3. Virtual Environment (Optional but Recommended)
- 4. Environment Variables
- 5. Setting required environment variables
- 1. Open the .env file you just created, and add the following variables
- 2. Retrieve and set your MongoDB URI
- 3. Set your database name
- 4. Set your OpenAI API Key
- 6. Using the package
- Option 1 (Short Script)
- Option 2 (Command Line Interface)
- Wrapping Up
Features
| Category | Features | Benefits |
|----------|----------|----------|
| 📊 Data Collection | • Crossref API Integration<br>• Smart Web Scraping<br>• Automated DOI Processing<br>• Multi-Source Data Fusion | • Comprehensive data gathering<br>• Enhanced data completeness<br>• Reliable source tracking<br>• Efficient data collection |
| 🤖 AI Classification | • LLM-Powered Analysis<br>• NSF PhD Focus Areas<br>• Theme Extraction<br>• Methodology Detection | • Accurate categorization<br>• Standardized classifications<br>• Insightful themes<br>• Research trend analysis |
| 📈 Analytics Engine | • Citation Tracking<br>• Author Statistics<br>• Department Analytics<br>• Category Analysis | • Impact measurement<br>• Performance tracking<br>• Department insights<br>• Research trends |
| 💾 Data Management | • MongoDB Integration<br>• JSON Export<br>• Excel Reports<br>• Flexible Storage | • Scalable storage<br>• Easy data sharing<br>• Familiar formats<br>• Data accessibility |
| 🔄 Processing Pipeline | • Async Processing<br>• Error Handling<br>• Rate Limiting<br>• Retry Logic | • Fast performance<br>• Reliable operation<br>• API compliance<br>• Robust processing |
| 🎯 Research Metrics | • Citation Impact<br>• Author Collaboration<br>• Research Focus<br>• Publication Trends | • Research evaluation<br>• Collaboration insights<br>• Focus area tracking<br>• Trend analysis |
| 🛠️ Developer Tools | • AIChainComposer Integration<br>• Modular Design<br>• Extensive Documentation<br>• CLI Interface | • Easy LLM integration<br>• Simple customization<br>• Quick learning<br>• Flexible usage |
| 📱 Integration Ready | • Example Site<br>• Data Export<br>• Custom Endpoints | • Quick visualization<br>• Data portability<br>• System integration |
| 🔐 Security & Control | • API Key Management<br>• Rate Control<br>• Error Logging<br>• Data Validation | • Secure operation<br>• Resource protection<br>• Better monitoring<br>• Data integrity |
Documentation
For implementation details covering code logic, structure, prompts, and more, you can check out our documentation. The documentation is built with Sphinx, so it should feel familiar and easy to navigate.
Academic Metrics Documentation
Example Site and Demo
We also built an example site with the data we collected so that you can get a small idea of the potential uses for the data. This is by no means the only use case, but it does serve as a nice introduction to decide if this package would be useful for you.
[!NOTE] The source code for the example site is available here
[!TIP] You can use our site source code for your own site! To easily launch your own website using the data you collect and classify via Academic Metrics see Site Creation Guide
To see a demo of the site, you can watch the below video:
Installation and Setup Steps
Hey all, Spencer here. We are pleased to announce that as of January 1st, 2025, you can install the Academic Metrics package via pip and run the entire system through a short script or a command line interface.
Below is a step-by-step walkthrough of each piece of the process, starting with installing Python and setting up your environment, to help you get started if you are a non-technical user.
If you don't need help with the starting steps such as Python setup, MongoDB setup, or virtual environment setup, or want to jump straight to the code, first see 1. Installation (or just install it via pip install academic-metrics), then skip to 6. Using the package.
0. External Setup
Installing and setting up Python 3.12:
While you should be able to use any version of Python >= 3.7, we recommend using Python 3.12 as that is the version we used to develop the system, and the one it's been tested on.
For a detailed Python installation guide, see our Python Installation Guide.
Installing and setting up MongoDB:
For a detailed MongoDB installation and setup guide, see our MongoDB Installation Guide.
Once you have MongoDB installed and running, you can create a database to store your data in, if you haven't already.
To create a new database, you can run the following in mongosh:

```bash
use <db_name>
```

If you need more help, the MongoDB Installation Guide goes into more detail on how to create a database and verify it exists.

Collection creation is handled by the system; you do not need to create collections yourself.
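If you'd like to double-check from Python that your MongoDB server is reachable, a small pymongo snippet like the one below can help. This is an optional sanity check, not part of Academic Metrics; you may need to `pip install pymongo` if it isn't already installed.

```python
# Optional sanity check: confirm the MongoDB server is reachable
# and list the databases it currently knows about.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # or your Atlas URI
print(client.list_database_names())
```

Note that MongoDB only lists a database once it contains data, so a freshly created, empty database may not appear here until the first documents are written.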
1. Installation
Install academic_metrics>=1.0.98 via pip.
To install the latest version of the package, you can run the following command:
```bash
pip install academic-metrics
```
2. Creating the directory and necessary files
- Create the directory and navigate into it:
For this example we will be using am_data_collection as the name of the directory, but you can name it whatever you want.
**All systems (separate commands):**
```bash
mkdir am_data_collection
cd am_data_collection
```
Or as a single line:
**Linux / Mac / Windows Command Prompt**:
```bash
mkdir am_data_collection && cd am_data_collection
```
**Windows Powershell**:
```powershell
mkdir am_data_collection; cd am_data_collection
```
3. Virtual Environment (Optional but Recommended)
Now that you've created and entered your project directory, you can set up a virtual environment.
For detailed instructions on setting up and using virtual environments, see our Python Installation Guide - Virtual Environments Section.
After setting up your virtual environment, return here to continue with the next steps.
4. Environment Variables
Create a .env file inside the directory you just created.
Linux/Mac:
```bash
touch .env
```

Windows (Command Prompt):

```cmd
type nul > .env
```

Windows (PowerShell):

```powershell
New-Item -Path .env -Type File
```
You should now have a .env file in your directory.
5. Setting required environment variables
1. Open the .env file you just created, and add the following variables

- a variable to store your MongoDB URI, I recommend `MONGODB_URI`
- a variable to store your database name, I recommend `DB_NAME`
- a variable to store your OpenAI API key, I recommend `OPENAI_API_KEY`

After each variable name, add `=""` to the end.

Once you've done this, your .env file should look something like this:

```python
MONGODB_URI=""
DB_NAME=""
OPENAI_API_KEY=""
```
2. Retrieve and set your MongoDB URI
For local MongoDB it's typically:
```python
MONGODB_URI="mongodb://localhost:27017"
```
For live MongoDB:
For a live version you should use the MongoDB Atlas URI. It should look something like this:
```bash
mongodb+srv://<username>:<password>@<cluster-name>.<unique-id>.mongodb.net/?retryWrites=true&w=majority&appName=<YourAppNameOnAtlas>
```
So in the .env file you should have something that looks like this:
Local:
```python
MONGODB_URI="mongodb://localhost:27017"
```

Live:

```python
MONGODB_URI="mongodb+srv://<username>:<password>@<cluster-name>.<unique-id>.mongodb.net/?retryWrites=true&w=majority&appName=<YourAppNameOnAtlas>"
```
[!WARNING] I recommend starting locally unless you need to use a live MongoDB instance. This avoids having to set up MongoDB Atlas, which, while not difficult, is an added step.
3. Set your database name
You can pick any name you want for DB_NAME, but it needs to be the name of a database that exists on your MongoDB server. To create one from the command line you can run:
```bash
mongosh
use <db_name>
```
For this demonstration we will be using academic_metrics_data as the DB_NAME.
First we'll create the database on the command line:
```bash
mongosh
use academic_metrics_data
```
This is to ensure the database actually exists so that the system can access it.
Now that the database exists, we'll set the DB_NAME in the .env file.
```python
DB_NAME="academic_metrics_data"
```
4. Set your OpenAI API Key
If you do not have an OpenAI API key you will need to create one, but do not worry, it's easy.
Go to the following link and click on "+ Create new secret key":
https://platform.openai.com/api-keys
Give the key a name, and then copy the key.
Then in the .env file paste the key in the OPENAI_API_KEY variable.
It should look similar to this, but with the full key instead of sk-proj...:
```python
OPENAI_API_KEY="sk-proj..."
```
[!IMPORTANT] You will need to add funds to your OpenAI account to use the API.
When using the default model for the system (gpt-4o-mini), it cost us about $3-4 to process all of the data from Salisbury University from 2009-2024.
For larger models such as gpt-4o, the cost will be much higher.
We saw good results using gpt-4o-mini, and it's also the most cost effective, so I recommend starting with that.
Additionally, whether you use our command line interface or your own script, the data is processed one month at a time and saved to the database. If you run out of funds on your OpenAI account you will not lose data for the entire run, only the current month being processed; simply add funds to your account and continue.
You do not have to change anything in the code when you run it again; the system checks for existing data and only processes data that has not yet been processed.
All together your .env file should look like this:
```python
MONGODB_URI="mongodb://localhost:27017"
DB_NAME="academic_metrics_data"
OPENAI_API_KEY="sk-proj..."
```
6. Using the package
To use the system, you have two options:

1. Writing a short script (code provided) to loop over a range of dates you'd like to collect.
2. Using a provided function to run a command line interface version.

For most users, I recommend the second option: it's only a few lines of code which you can copy and paste, and the rest of the usage is handled by the command line interface without any additional coding. You can find it in the Option 2 (Command Line Interface) section.
On the other hand, if you plan on using the main system or other tools from the package within your own scripts, or just don't enjoy command line interfaces, I recommend the first option.
While I recommend the second option unless you're planning on using the package's offerings in a more complex manner, the basic code to run the system for the first option is provided in full in the Option 1 (Short Script) section.
To see some more complex use cases with examples, you can check out the Other Uses section.
Option 1 (Short Script)
For this option you need to do the following:
1. Create the Python file
Within your directory, create a new Python file. For this example we will be using run_am.py, but you can name it whatever you want.
Linux/Mac:
```bash
touch run_am.py
```

Windows (Command Prompt):

```cmd
type nul > run_am.py
```

Windows (PowerShell):

```powershell
New-Item -Path run_am.py -Type File
```
You should now have a python file in your directory whose name matches the one you created.
2. Copy and paste the following code into the file you just created
```python
# dotenv is the python package responsible for handling env files
from dotenv import load_dotenv

# os is used to get the environment variables from the .env file
import os

# PipelineRunner is the main class used to run the pipeline
from academic_metrics.runners import PipelineRunner

# load_dotenv is used to load the environment variables from the .env file
load_dotenv()

# Get the environment variables from the .env file
ai_api_key = os.getenv("OPENAI_API_KEY")
mongodb_uri = os.getenv("MONGODB_URI")
db_name = os.getenv("DB_NAME")

# Set the date range you want to process
#
# Years is a list of years as strings you want to process
# Months is a list of strings representing the months you want processed for each year
#
# For example if you want to process data from 2009-2024 for all months out of the year, you would do:
#
# Note: the process runs left to right, so from beginning of list to the end of the list,
# so this will process 2024, then 2023, then 2022, etc.
#
# Data will be saved after each month is processed.
years = [
    "2024", "2023", "2022", "2021", "2020", "2019", "2018", "2017",
    "2016", "2015", "2014", "2013", "2012", "2011", "2010", "2009",
]
months = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"]

# Loop over the years and months and run the pipeline for each month
# New objects are created for each month to avoid memory issues as well as to avoid overwriting data
for year in years:
    for month in months:
        # Create a new PipelineRunner object for each month
        # parameters:
        #   ai_api_key: the OpenAI API key
        #   crossref_affiliation: the affiliation to use for the Crossref API
        #   data_from_month: the month to start collecting data from
        #   data_to_month: the month to end collecting data on
        #   data_from_year: the year to start collecting data from
        #   data_to_year: the year to end collecting data on
        #   mongodb_uri: the URL of the MongoDB server
        #   db_name: the name of the database to use
        pipeline_runner = PipelineRunner(
            ai_api_key=ai_api_key,
            crossref_affiliation="Salisbury University",
            data_from_month=int(month),
            data_to_month=int(month),
            data_from_year=int(year),
            data_to_year=int(year),
            mongodb_uri=mongodb_uri,
            db_name=db_name,
        )

        # Run the pipeline for the current month
        pipeline_runner.run_pipeline()
```
If you'd like to save the data to Excel files in addition to the other data formats, you can do so by importing the function get_excel_report from academic_metrics.runners and calling it at the end of the script.
Full code for convenience:
```python
# dotenv is the python package responsible for handling env files
from dotenv import load_dotenv

# os is used to get the environment variables from the .env file
import os

# PipelineRunner is the main class used to run the pipeline
#
# get_excel_report is the function used to save the data to excel files;
# it takes in a DatabaseWrapper object as a parameter, which connects to the database
# and retrieves the data before writing it to 3 separate excel files, one for each data type
from academic_metrics.runners import PipelineRunner, get_excel_report

# DatabaseWrapper is the class used to connect to the database and retrieve the data
from academic_metrics.DB import DatabaseWrapper

# load_dotenv is used to load the environment variables from the .env file
load_dotenv()

# Get the environment variables from the .env file
#
# If you used the same names as the ones in the examples, you can just copy paste these;
# if you used different names, you will need to change them to match the ones in your .env file
ai_api_key = os.getenv("OPENAI_API_KEY")
mongodb_uri = os.getenv("MONGODB_URI")
db_name = os.getenv("DB_NAME")

# Set the date range you want to process
#
# Years is a list of years as strings you want to process
# Months is a list of strings representing the months you want processed for each year
#
# For example if you want to process data from 2009-2024 for all months out of the year, you would do:
#
# Note: the process runs left to right, so from beginning of list to the end of the list,
# so this will process 2024, then 2023, then 2022, etc.
#
# Data will be saved after each month is processed.
years = [
    "2024", "2023", "2022", "2021", "2020", "2019", "2018", "2017",
    "2016", "2015", "2014", "2013", "2012", "2011", "2010", "2009",
]
months = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"]

# Loop over the years and months and run the pipeline for each month
# New objects are created for each month
# to avoid memory issues as well as to avoid overwriting data
for year in years:
    for month in months:
        # Create a new PipelineRunner object for each month
        # parameters:
        #   ai_api_key: the OpenAI API key
        #   crossref_affiliation: the affiliation to use for the Crossref API
        #   data_from_month: the month to start collecting data from
        #   data_to_month: the month to end collecting data on
        #   data_from_year: the year to start collecting data from
        #   data_to_year: the year to end collecting data on
        #   mongodb_uri: the URL of the MongoDB server
        #   db_name: the name of the database to use
        pipeline_runner = PipelineRunner(
            ai_api_key=ai_api_key,
            crossref_affiliation="Salisbury University",
            data_from_month=int(month),
            data_to_month=int(month),
            data_from_year=int(year),
            data_to_year=int(year),
            mongodb_uri=mongodb_uri,
            db_name=db_name,
        )

        # Run the pipeline for the current month
        pipeline_runner.run_pipeline()

# Create a new DatabaseWrapper object so it can be given to get_excel_report
db = DatabaseWrapper(db_name=db_name, mongo_uri=mongodb_uri)

# Call the get_excel_report function, passing in the db object, to save the data to excel files
#
# Once this finishes running, you should have 3 excel files in your directory:
# article_data.xlsx, faculty_data.xlsx, and category_data.xlsx
get_excel_report(db)
```
3. Run the script
```bash
python run_am.py
```
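While the pipeline is running (or after a partial run), you can check how much has been saved so far by querying the database directly. The sketch below uses pymongo rather than anything from Academic Metrics, and makes no assumptions about the collection names the system creates:

```python
# Optional progress check: count the documents stored so far in each
# collection of your database. Requires pymongo (pip install pymongo).
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()

client = MongoClient(os.getenv("MONGODB_URI"))
db = client[os.getenv("DB_NAME")]

for name in db.list_collection_names():
    print(f"{name}: {db[name].count_documents({})} documents")
```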
Option 2 (Command Line Interface)
For this option you will still need to create a Python file, but the code will only be a couple of lines long, as you'll be passing your arguments in via the command line.
1. Create the Python file
Within your directory, create a new Python file. For this example we will be using run_am.py, but you can name it whatever you want.
Linux/Mac:
```bash
touch run_am.py
```

Windows (Command Prompt):

```cmd
type nul > run_am.py
```

Windows (PowerShell):

```powershell
New-Item -Path run_am.py -Type File
```
You should now have a python file in your directory whose name matches the one you created.
2. Copy and paste the following code into the file you just created
```python
from dotenv import load_dotenv
from academic_metrics.runners import command_line_runner

load_dotenv()

command_line_runner()
```
[!WARNING] If you did not use `MONGODB_URI` and `OPENAI_API_KEY` as the variable names in the .env file, you will need to make a couple of changes to the above code.
How to use with different variable names:
The `command_line_runner` function takes in 2 optional arguments:

- `openai_api_key_env_var_name`
- `mongodb_uri_env_var_name`

which correspond to the names of the environment variables you used in your .env file.
To use the different names, do the following:
```python
from dotenv import load_dotenv
from academic_metrics.runners import command_line_runner

load_dotenv()

# The strings should be changed to match the names you used in your .env file
command_line_runner(
    openai_api_key_env_var_name="YOUR_OPENAI_API_KEY_ENV_VAR_NAME",
    mongodb_uri_env_var_name="YOUR_MONGODB_URI_ENV_VAR_NAME",
)
```
3. Run the script
For this option you will still run the script from the command line, but you will also pass in arguments; details are laid out below.
There are various command line arguments you can pass in. Almost all are detailed here, but to see a complete list you can run:
```bash
python run_am.py --help
```
When running the script, you can configure the pipeline by passing in the following arguments:
- `--from-month` - The month to start collecting data from, defaults to 1
- `--to-month` - The month to end collecting data on, defaults to 12
- `--from-year` - The year to start collecting data from, defaults to 2024
- `--to-year` - The year to end collecting data on, defaults to 2024
- `--db-name` - The name of the database to use (required)
- `--crossref-affiliation` - The affiliation to use for the Crossref API, defaults to Salisbury University (required)
If you want to save the data to excel files you can pass in the --as-excel argument.
[!NOTE] The `--as-excel` argument is an additional action; it doesn't remove the saving to other data formats, but merely adds the Excel saving functionality.
Examples
Say you want to collect data for every month from 2019 to 2024 for Salisbury University and save it to excel files. You would run the following command:
```bash
python run_am.py --from-month=1 \
  --to-month=12 \
  --from-year=2019 \
  --to-year=2024 \
  --crossref-affiliation="Salisbury University" \
  --as-excel \
  --db-name="Your_Database_Name"
```
To make this simpler, we can actually take advantage of the default values for some of the arguments.
Recall from before:
- `--from-month` defaults to `1`
- `--to-month` defaults to `12`
- `--from-year` defaults to `2024`
- `--to-year` defaults to `2024`
- `--crossref-affiliation` defaults to `Salisbury University`
Using the defaults, we can make that command much more concise:
```bash
python run_am.py \
  --from-year=2019 \
  --as-excel \
  --db-name="Your_Database_Name"
```
On AI Models:
The default AI (LLM) model used for all phases is gpt-4o-mini. You can specify a different model for each phase independently by passing in the following arguments:
- `--pre-classification-model` - The model to use for the pre-classification step
- `--classification-model` - The model to use for the classification step
- `--theme-model` - The model to use for the theme extraction step
Here's how you would run the pipeline using the larger gpt-4o model:
```bash
python run_am.py --from-month=1 \
  --to-month=12 \
  --from-year=2019 \
  --to-year=2024 \
  --crossref-affiliation="Salisbury University" \
  --as-excel \
  --db-name="Your_Database_Name" \
  --pre-classification-model="gpt-4o" \
  --classification-model="gpt-4o" \
  --theme-model="gpt-4o"
```
and taking advantage of the defaults:
```bash
python run_am.py \
  --from-year=2019 \
  --as-excel \
  --db-name="Your_Database_Name" \
  --pre-classification-model="gpt-4o" \
  --classification-model="gpt-4o" \
  --theme-model="gpt-4o"
```
[!WARNING] This process consumes a lot of tokens, and OpenAI API usage is billed based on the number of input/output tokens used, with each model having a different cost per input/output token.
You can check the cost of each model at https://openai.com/api/pricing/.
During testing we found that using `gpt-4o-mini` was the most cost effective. In addition, we spent a lot of time testing prompts and models; our prompts have been tuned to a point where they elicit good results from `gpt-4o-mini`, so a larger model may not be necessary to get the results you want. If you want to use a larger model like `gpt-4o`, whether out of curiosity or to see if it provides better results, I still recommend you start with a smaller date range to get an idea of what it will cost. If you find the cost acceptable, then you can start expanding the date range.
Other institutions:
Our system uses the Crossref API to collect available data, then it scrapes the DOI link to get any missing data as well as any additional data that may be available.
We found that the Crossref API sometimes misses abstracts, for example; our scraping process will fill in nearly all, if not all, of the missing abstracts.
Due to this, and the wealth of institutions Crossref covers, you can use the system for any institution whose publications have DOI links.
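If you want a rough sense of how much data an affiliation query will return before committing to a full run, you can query the Crossref REST API directly. This exploratory sketch uses the requests library and is not part of Academic Metrics; Crossref's affiliation query matches loosely, so treat the count as an estimate:

```python
# Rough estimate of how many Crossref works match an affiliation query.
# Requires requests: pip install requests
import requests

resp = requests.get(
    "https://api.crossref.org/works",
    params={
        "query.affiliation": "University of Maryland",
        "filter": "from-pub-date:2019-01-01",
        "rows": 0,  # we only want the total count, not the records
    },
    timeout=30,
)
resp.raise_for_status()
total = resp.json()["message"]["total-results"]
print(f"Approximate matching works since 2019: {total}")
```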
Here's how you'd run the same query on the system but for University of Maryland data:
```bash
python run_am.py \
  --from-year=2019 \
  --as-excel \
  --db-name="Your_Database_Name" \
  --crossref-affiliation="University of Maryland"
```
You can even go back as far as you want. For example, say you want to collect all data from the beginning of the 21st century:
```bash
python run_am.py \
  --from-year=2000 \
  --as-excel \
  --db-name="Your_Database_Name" \
  --crossref-affiliation="University of Maryland"
```
Or maybe you want to collect all data as far back as possible, so you can see long-term trends and the history of the institution:
```bash
python run_am.py \
  --from-year=1900 \
  --as-excel \
  --db-name="Your_Database_Name" \
  --crossref-affiliation="University of Maryland"
```
The from-year does not require that there be data going that far back; it simply sets the cutoff point for the data you want to collect.
So if you're not entirely sure what year your university started, or how far back Crossref coverage goes, you can simply enter a very early year, like 1900, and the system will collect all data from that year onwards.
Wrapping Up
That's it! You've now successfully installed and run the system.
If you have any questions, need help, or are interested in collaborating on this project or others, feel free to reach out to me; contact information is provided below.
If you are a potential employer, please reach out to me by email or LinkedIn; contact information is provided below.
Contact information:
- Email: spencerpresley96@gmail.com
- LinkedIn: https://www.linkedin.com/in/spencerpresley96/
Happy coding!
Owner
- Name: Spencer Presley
- Login: SpencerPresley
- Kind: user
- Location: Salisbury, MD
- Repositories: 1
- Profile: https://github.com/SpencerPresley
Computer Science Major at Salisbury University | Class of 2025
GitHub Events
Total
- Release event: 1
- Watch event: 3
- Push event: 18
- Create event: 3
Last Year
- Release event: 1
- Watch event: 3
- Push event: 18
- Create event: 3
Committers
Last synced: 10 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Spencer Presley | 1****y | 219 |
| Cole | c****2@g****m | 34 |
| Jude Maggitti | j****3@g****m | 1 |
| Cole Barbes | c****s@d****p | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads:
  - pypi: 184 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 20
- Total maintainers: 1
pypi.org: academic-metrics
A tool to extract and format academic data from Web of Science and Crossref
- Homepage: https://github.com/SpencerPresley/AcademicMetrics
- Documentation: https://academic-metrics.readthedocs.io/
- License: MIT License
- Latest release: 1.0.99 (published about 1 year ago)
