https://github.com/amatofrancesco99/bigquery-sensei

Delivering actionable, data-driven insights from BigQuery logs to empower users and administrators. By monitoring query performance, cost, and quality, it leverages GenAI to provide personalized suggestions and reports that drive optimization and enhance the BigQuery experience.

https://github.com/amatofrancesco99/bigquery-sensei

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.1%) to scientific vocabulary

Keywords

bigquery-logging generative-ai google-cloud langgraph mail-report python terraform
Last synced: 5 months ago · JSON representation

Repository

Delivering actionable, data-driven insights from BigQuery logs to empower users and administrators. By monitoring query performance, cost, and quality, it leverages GenAI to provide personalized suggestions and reports that drive optimization and enhance the BigQuery experience.

Basic Info
  • Host: GitHub
  • Owner: Amatofrancesco99
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 7.83 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
bigquery-logging generative-ai google-cloud langgraph mail-report python terraform
Created 10 months ago · Last pushed 7 months ago
Metadata Files
Readme License

README.md

# **[`bigquery-sensei`](./): Harnessing AI Agents for Optimizing BigQuery Performance, Enhancing Efficiency, and Streamlining Data Management for Smarter Insights** ![Terraform](https://img.shields.io/badge/terraform-7C4DFF?style=for-the-badge&logo=terraform&logoColor=white) ![Google Cloud](https://img.shields.io/badge/GoogleCloud-4285F4?style=for-the-badge&logo=google-cloud&logoColor=white) ![Python](https://img.shields.io/badge/python-EA4335?style=for-the-badge&logo=python&logoColor=white) ![LangGraph](https://img.shields.io/badge/langgraph-0F9D58?style=for-the-badge&logo=langgraph&logoColor=white)
![Views](https://komarev.com/ghpvc/?username=BigQuery-Sensei&label=Views&color=F4B400&style=for-the-badge) [![image](https://licensebuttons.net/l/by-nc-nd/4.0/88x31.png)](LICENSE.md)

[!TIP] Transform your BigQuery experience and cut costs—all for just cents a day! Powered by intelligent AI agents, bigquery-sensei automatically uncovers opportunities to supercharge query performance, significantly reduce cloud spend, and perfect your BigQuery SQL. Receive actionable, data-driven insights conveniently via automated, personalized email reports, empowering your team for continuous learning and smarter data management! 🏆

🌱 Aim

bigquery-sensei is a cutting-edge, open-source initiative designed to revolutionize your interaction with Google Cloud's BigQuery. We provide intelligent, automated analysis and deliver actionable insights directly to data practitioners and platform administrators.

By continuously monitoring and evaluating your BigQuery query performance, cost efficiency, and overall quality, bigquery-sensei fosters a data-driven culture where users are empowered with the knowledge to write more efficient, cost-effective, and performant SQL. Our core purpose is to transform raw query logs into valuable, personalized suggestions and reports, driving optimization and enhancing the overall BigQuery experience across your organization.

Imagine a world where your data team automatically receives tailored feedback on their queries, helping them grow their skills and reduce cloud spend – that’s the power of bigquery-sensei.


🚀 Getting Started

Getting started with bigquery-sensei is designed to be straightforward, enabling you to deploy this powerful analysis engine quickly within your Google Cloud environment. Due to the infrastructure requirements on GCP, the setup process is detailed in a separate, dedicated guide.

Please refer to the SETUP.md document for comprehensive instructions on prerequisites, environment configuration, and deployment steps using the provided scripts and Terraform code.


🧠 Architecture Explained (How it Works)

architecture

bigquery-sensei operates as a robust, serverless, and batch-driven system built entirely on Google Cloud Platform, orchestrated by Terraform. The core workflow revolves around fetching BigQuery job logs, analyzing them using custom AI Agents, and delivering insights through automated reports, always validating provided information. Here’s an overview of the system’s key components:

Key Components:

  • User Interaction: Deployment and destruction are initiated through the provided init.sh and destroy.sh scripts.
  • Terraform: Manages GCP infrastructure, including APIs, IAM roles, BigQuery datasets/tables, Cloud Storage, Cloud Functions, Cloud Scheduler jobs, Cloud Tasks queues, and Secret Manager.
  • BigQuery: Central data warehousing for querying BQ logs (via INFORMATION_SCHEMA) and storage cloud functions outputs layer.
  • Cloud Storage: Securely stores zipped source code for Cloud Functions.
  • Cloud Functions: Performs tasks like analyzing logs, running queries, and sending reports:
    • cf_query_simulator: Simulates predefined SQL queries to populate initial test data (optional).
    • cf_stats_builder: Periodically fetches and filters relevant job logs for deep analysis with a BigQuery Jobs evaluator Agent.
    • cf_mail_notifier: Aggregates analyzed job data and sends personalized email reports, with a custom Agent that composes the mail content and build useful graphs.
  • Cloud Scheduler: Cron-based scheduling for tasks like log analysis and report sending.
  • Cloud Tasks: Triggers asynchronous tasks like backfilling historical data or query simulations.
  • Secret Manager: Securely stores sensitive credentials.
  • IAM: Controls access to GCP resources with fine-grained roles.
  • Cloud Logging: Centralized logging for monitoring and debugging.

While the current implementation focuses on core analysis and email reporting, the architecture is designed for extensibility, enabling future integration with more sophisticated workflows like Agentspace and Looker Studio.


✨ Features

bigquery-sensei offers a powerful suite of features to help you optimize your BigQuery usage:

  • Automated BigQuery Job Analysis: Fetches and analyzes job logs for insights into cost, performance, security, readability, complexity and many more (can be added).
  • Structured Analysis Results: Stores detailed analysis outcomes in dedicated BigQuery tables for easy querying and reporting.
  • Failed Analysis Tracking: Tracks and logs analysis failures for continuous pipeline improvement.
  • Run Summary Logging: Maintains a historical record of analysis execution, including processed job counts and missing windows.
  • Automated User Reporting: Sends personalized email reports summarizing information about query performance, improvements, and learning resources. This information is generated by analyzing verified sources: relevant content is extracted, sources are compared to find those adding new details, and exclusive sources are prioritized.
  • Developer Activity Monitoring: Provides high-level usage statistics for administrators.
  • Configurable Reporting Frequency: Flexible email report schedules (daily, weekly, monthly).
  • Secure Credential Management: Sensitive credentials are stored in Google Cloud Secret Manager.
  • Optional Query Simulation: Simulate queries to test the analysis pipeline and initialize data.
  • Historical Backfill: Analyze historical data by triggering a Cloud Task for backfilling.

🤝 Contributions

We are thrilled to welcome contributions from the community to help make bigquery-sensei even better! Your ideas, bug fixes, feature requests, and documentation improvements are highly valued and essential to the project's growth. Join us in enhancing this tool!


🤔 Why bigquery-sensei?

In modern data-driven organizations, managing BigQuery costs, query performance, and adhering to best practices can be a significant challenge. Often, users lack visibility into the true impact of their queries, leading to inefficiencies and unexpected costs. Motivating continuous improvement in query writing across a team can also be enhanced with engaging features.

bigquery-sensei addresses these challenges by:

  • Proactively Identifying Issues: Automatically scans job logs to flag potentially costly or inefficient queries before they become major problems.
  • Empowering Users: Provides personalized, actionable feedback and curated learning resources, directly improving individual SQL skills and understanding.
  • Driving Improvement through Gamification: Possibility to introduce gamification features, such as user rankings, performance classifications (e.g., "Top Performer", "Needs Review"), and leaderboards, to foster friendly competition and motivate continuous improvement in query practices.
  • Providing Visibility: Offers aggregated insights into BigQuery usage patterns team-wide, helping administrators identify cost drivers, optimization opportunities, and areas where training might be needed.
  • Leveraging Cutting-Edge AI: Uses Large Language Models and custom AI agents to offer nuanced, human-readable recommendations that go beyond simple rule-based checks, providing deeper context.
  • Automating the Process: Significantly reduces manual effort required for monitoring, analyzing, and providing feedback on queries across the organization.

By providing clear, automated feedback and insights, and with planned engaging features like gamification, bigquery-sensei aims to help organizations significantly optimize BigQuery costs, boost performance, and cultivate a more skilled, cost-aware, and motivated data team.


🎨 Examples

To illustrate the actionable insights and rich visualizations provided by bigquery-sensei, explore samples of the automated reports delivered to both individual users and platform administrators.

[!NOTE] The values, dates, and periods presented in these examples are generated for demonstration purposes only and are fictitious. They serve solely to showcase the behavior and potential in producing reports and analyses.

📧 User Reports

Personalized reports empower individual data practitioners with tailored feedback on their BigQuery query performance and cost efficiency, helping them continuously refine their SQL skills.

Below you can find a sample user report, including a spider graph of performance metrics and trendlines showing historical improvement, divided into several key sections:

Example > ![User Report Example - part 1](examples/user_report/cards.png) > ![User Report Example - part 2](examples/user_report/spider_graph.png) > ![User Report Example - part 3](examples/user_report/trendlines_summary.png) > ![User Report Example - part 4](examples/user_report/improvements_urls.png)

📊 Developer Reports

Developer reports provide platform administrators and data leads with a high-level overview of activity across the organization, highlighting overall usage patterns and key statistics for human users and service accounts.

See a sample developer activity summary below:

Example > ![Developers Report Example](examples/developers_report/summary.png)

Owner

  • Name: Francesco Amato
  • Login: Amatofrancesco99
  • Kind: user
  • Location: Orio Litta (LO), 26863, Italy
  • Company: Università degli Studi di Pavia

Student of data science who loves coding. Born in 27/12/1999.

GitHub Events

Total
  • Push event: 9
  • Public event: 1
Last Year
  • Push event: 9
  • Public event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

src/cloud-functions/mail-notifier/requirements.txt pypi
  • beautifulsoup4 ==4.13.4
  • google-cloud-aiplatform ==1.87.0
  • google-cloud-bigquery ==3.31.0
  • google-cloud-secret-manager ==2.23.3
  • langchain-core ==0.3.54
  • langchain-google-vertexai ==2.0.20
  • langgraph ==0.3.31
  • lxml ==5.4.0
  • mailersend ==0.5.8
  • markdown ==3.8
  • matplotlib ==3.10.1
  • numpy ==2.2.5
  • pydantic ==2.11.3
  • requests ==2.32.3
src/cloud-functions/query-simulator/requirements.txt pypi
  • google-api-core ==2.24.2
  • google-cloud-bigquery ==3.31.0
  • google-cloud-storage ==3.1.0
src/cloud-functions/table-stats-builder/requirements.txt pypi
  • functions-framework ==3.5.0
  • google-cloud-aiplatform ==1.87.0
  • google-cloud-bigquery ==3.31.0
  • langchain-core ==0.3.54
  • langchain-google-vertexai ==2.0.20
  • langgraph ==0.3.31
  • pydantic ==2.11.3