https://github.com/australian-text-analytics-platform/docframe

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.0%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Australian-Text-Analytics-Platform
  • Language: Python
  • Default Branch: main
  • Size: 3.92 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 10 months ago
Metadata Files
Readme

README.md

DocFrame

A text analysis library inspired by GeoPandas' design philosophy, using Polars as the backend for efficient document processing and analysis.

🚀 Features

  • DocDataFrame & DocLazyFrame: Document-aware DataFrames with automatic document column detection
  • Polars Backend: Leverages Polars' performance advantages for large-scale text processing
  • Text Namespace: Unified text processing API via Polars namespace registration (df.text, series.text, pl.col().text)
  • Intelligent Auto-Detection: Automatically identifies the document column using a longest-average-text-length heuristic
  • Rich Text Processing: Built-in tokenization, cleaning, n-grams, word/character/sentence counting
  • Memory Efficient: Lazy evaluation and optimized memory usage through Polars
  • Comprehensive I/O: Support for CSV, Parquet, JSON, Excel, and more with document column preservation
  • Serialization: JSON-based serialization with complete metadata preservation
  • Document Management: Easy document column switching, renaming, and manipulation
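
The auto-detection heuristic named in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DocFrame's implementation: `guess_document_column` is a hypothetical helper, and it assumes that only columns made up entirely of strings are candidates.

```python
from statistics import mean
from typing import Dict, List, Optional


def guess_document_column(columns: Dict[str, List]) -> Optional[str]:
    """Return the name of the all-string column with the longest average length.

    Hypothetical helper that illustrates the heuristic only.
    """
    best_name, best_avg = None, -1.0
    for name, values in columns.items():
        # Skip empty columns and columns holding any non-string value
        if not values or not all(isinstance(v, str) for v in values):
            continue
        avg_len = mean(len(v) for v in values)
        if avg_len > best_avg:
            best_name, best_avg = name, avg_len
    return best_name


data = {
    "title": ["Short title", "Another title"],
    "content": [
        "This is a much longer document with substantial content",
        "Another detailed document with comprehensive text",
    ],
    "year": [2023, 2024],
}
print(guess_document_column(data))  # content
```

A heuristic like this can pick the wrong column when a metadata field (for example a long URL) is on average longer than the text, which is presumably why the readers also accept an explicit `document_column` argument.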

📦 Installation

```bash
pip install docframe
```

🚀 Quick Start

Creating DocDataFrames

```python
import docframe as dp

# From dictionary (auto-detects document column)
df = dp.DocDataFrame({
    'title': ['Short title', 'Another title'],
    'content': [
        'This is a much longer document with substantial content for analysis',
        'Another detailed document with comprehensive text for processing'
    ],
    'category': ['news', 'blog']
})

# DocFrame automatically detects 'content' as the document column
print(f"Document column: {df.active_document_name}")  # content

# From list of texts with metadata
df = dp.DocDataFrame.from_texts(
    texts=['Hello world!', 'Text analysis is fun.', 'Polars is fast.'],
    metadata={
        'author': ['Alice', 'Bob', 'Charlie'],
        'category': ['greeting', 'opinion', 'fact']
    }
)
```

Text Processing

```python
# Access document text directly
documents = df.document  # Returns polars Series

# Add text statistics
df_stats = (df
    .add_word_count()
    .add_char_count()
    .add_sentence_count()
)

# Text cleaning and processing
df_processed = df.clean_documents(
    lowercase=True,
    remove_punct=True,
    remove_extra_whitespace=True
)

# Filter by text properties
long_docs = df.filter_by_length(min_words=10)
pattern_docs = df.filter_by_pattern(r'\b(analysis|processing)\b')

# Get text statistics summary
stats = df.describe_text()
print(stats)
```

Text Namespace Usage

```python
import polars as pl
import docframe  # Registers text namespace

# Use text namespace on expressions
df_with_tokens = df.select([
    pl.col('*'),
    pl.col('document').text.tokenize().alias('tokens'),
    pl.col('document').text.word_count().alias('word_count'),
    pl.col('document').text.char_count().alias('char_count'),
    pl.col('document').text.clean().alias('cleaned_text')
])

# Advanced text processing
df_advanced = df.select([
    pl.col('*'),
    pl.col('document').text.ngrams(n=2).alias('bigrams'),
    pl.col('document').text.sentence_count().alias('sentences')
])
```

Document-Term Matrix

```python
# Create document-term matrix for text analysis
dtm = df.to_dtm(method='count')
print(dtm.head())

# Binary DTM
dtm_binary = df.to_dtm(method='binary')

# TF-IDF (requires additional dependencies)
dtm_tfidf = df.to_dtm(method='tfidf')
```
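
To make the three `method` values concrete, here is a small standard-library sketch of what count, binary, and TF-IDF weighting compute. It is an illustration under assumptions, not the code behind `to_dtm`: `build_dtm` is a hypothetical function, tokenization is plain whitespace splitting, and the TF-IDF variant is the unsmoothed `tf * log(N / df)` form, which may differ from the weighting DocFrame uses.

```python
import math
from collections import Counter


def build_dtm(docs, method="count"):
    """Return (vocabulary, rows) for a small whitespace-tokenized corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    if method == "count":
        # Raw term frequency per document
        rows = [[c[t] for t in vocab] for c in counts]
    elif method == "binary":
        # 1 if the term occurs at all, else 0
        rows = [[1 if c[t] else 0 for t in vocab] for c in counts]
    elif method == "tfidf":
        n_docs = len(docs)
        # Document frequency: number of documents containing each term
        doc_freq = {t: sum(1 for c in counts if c[t]) for t in vocab}
        rows = [[c[t] * math.log(n_docs / doc_freq[t]) for t in vocab]
                for c in counts]
    else:
        raise ValueError(f"unknown method: {method}")
    return vocab, rows


docs = ["polars is fast", "text analysis is fun", "polars text"]
vocab, rows = build_dtm(docs, "count")
print(vocab)    # ['analysis', 'fast', 'fun', 'is', 'polars', 'text']
print(rows[0])  # [0, 1, 0, 1, 1, 0]
```

Each row corresponds to one document and each column to one vocabulary term, so the number of columns is the vocabulary size.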

I/O Operations

```python
# Read files with automatic document column detection
df = dp.read_csv('documents.csv')  # Auto-detects document column
df = dp.read_parquet('data.parquet', document_column='text')
df = dp.read_json('data.json', document_column='content')

# Write preserving DocDataFrame structure
df.write_csv('output.csv')
df.write_parquet('output.parquet')

# Lazy operations for large datasets
lazy_df = dp.scan_csv('large_file.csv')
processed = (lazy_df
    .filter(pl.col('category') == 'news')
    .select([
        pl.col('*'),
        pl.col('document').text.word_count().alias('words')
    ])
    .collect()  # Returns DocDataFrame
)
```

Data Conversion

```python
# Convert from pandas
import pandas as pd
pdf = pd.DataFrame({'text': ['hello', 'world'], 'label': ['A', 'B']})
df = dp.from_pandas(pdf, document_column='text')

# Convert to regular polars DataFrame
polars_df = df.to_polars()

# Convert to lazy frame
lazy_df = df.to_doclazyframe()
```

Document Column Management

```python
# Switch document column
df_switched = df.set_document('title')  # Use 'title' as document column

# Rename document column
df_renamed = df.rename_document('text')  # Rename 'document' to 'text'

# Join with document preservation
other_df = pl.DataFrame({'id': [1, 2], 'extra': ['A', 'B']})
joined = df.join(other_df, on='id')  # Preserves DocDataFrame type
```

Serialization

```python
# Serialize with complete metadata preservation
json_str = df.serialize('json')

# Restore exact DocDataFrame
df_restored = dp.DocDataFrame.deserialize(json_str, format='json')
assert df_restored.active_document_name == df.active_document_name
```
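
The round trip above works only if the serialized form carries the document column name alongside the data. The sketch below shows one way such a payload could look using the standard `json` module. The `metadata`/`data` layout and both function names are assumptions made for illustration; the README does not specify DocFrame's actual JSON schema.

```python
import json


def serialize_frame(columns, document_column):
    """Bundle column data with the document column name (hypothetical layout)."""
    payload = {
        "metadata": {"document_column": document_column},
        "data": columns,
    }
    return json.dumps(payload)


def deserialize_frame(json_str):
    """Recover (columns, document_column) from the payload above."""
    payload = json.loads(json_str)
    return payload["data"], payload["metadata"]["document_column"]


columns = {
    "document": ["Hello world!", "Polars is fast."],
    "author": ["Alice", "Charlie"],
}
restored_columns, restored_name = deserialize_frame(
    serialize_frame(columns, "document")
)
assert restored_columns == columns
assert restored_name == "document"
```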

🎯 Advanced Examples

Large-Scale Text Analysis

```python
# Process large document collections efficiently
large_df = (dp.scan_csv('large_corpus.csv')
    .filter(pl.col('language') == 'en')
    .with_columns([
        pl.col('document').text.word_count().alias('word_count'),
        pl.col('document').text.char_count().alias('char_count')
    ])
    .filter(pl.col('word_count') > 50)
    .collect()
)

# Text analysis pipeline
analysis_results = (large_df
    .add_sentence_count()
    .filter_by_length(min_words=100, max_words=1000)
    .sample(n=1000)
    .describe_text()
)
```

Multi-Document Processing

```python
# Concatenate multiple document collections
news_docs = dp.read_csv('news.csv')
blog_docs = dp.read_csv('blogs.csv')
academic_docs = dp.read_csv('papers.csv')

all_docs = dp.concat_documents([news_docs, blog_docs, academic_docs])

# Process by category
results = {}
for category in all_docs['category'].unique():
    category_docs = all_docs.filter(pl.col('category') == category)
    results[category] = {
        'count': len(category_docs),
        'avg_length': category_docs.describe_text()['word_count_mean'][0],
        'vocabulary': category_docs.to_dtm().shape[1]
    }
```

Custom Text Processing

```python
# Combine DocFrame with custom processing
def analyze_sentiment(text: str) -> float:
    # Your sentiment analysis logic
    return 0.5  # placeholder

# Apply custom functions
df_sentiment = df.with_columns([
    pl.col('document').map_elements(analyze_sentiment, return_dtype=pl.Float64).alias('sentiment')
])

# Complex text filtering
complex_filter = (df
    .filter(
        (pl.col('document').text.word_count() > 20) &
        (pl.col('document').text.sentence_count() > 2) &
        (pl.col('category').is_in(['news', 'academic']))
    )
)
```

🏗️ Architecture

DocFrame follows GeoPandas' design philosophy adapted for text data:

  • Document Column: Like GeoPandas' geometry column, DocFrame centers around a designated document column
  • Transparent Operations: All Polars operations work seamlessly while preserving document metadata
  • Namespace Integration: Text processing capabilities integrate directly into Polars' expression system
  • Lazy Evaluation: Full support for Polars' lazy evaluation for memory-efficient processing
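
The document-column idea in the list above can be illustrated with a toy container that remembers which column holds the text and carries that choice through operations. This is a conceptual sketch only: `MiniDocFrame` is a hypothetical class built on plain dictionaries, and it does not reflect how DocFrame wraps Polars internally.

```python
class MiniDocFrame:
    """Toy column store with a designated document column (illustration only)."""

    def __init__(self, columns, document_column):
        if document_column not in columns:
            raise KeyError(f"no column named {document_column!r}")
        self.columns = columns
        self.document_column = document_column

    @property
    def document(self):
        # Analogous to the geometry accessor in GeoPandas
        return self.columns[self.document_column]

    def filter(self, predicate):
        """Keep rows where predicate(row) is true; the document column survives."""
        n_rows = len(self.document)
        keep = [
            i for i in range(n_rows)
            if predicate({name: vals[i] for name, vals in self.columns.items()})
        ]
        kept = {name: [vals[i] for i in keep] for name, vals in self.columns.items()}
        return MiniDocFrame(kept, self.document_column)

    def set_document(self, name):
        """Return a new frame that treats another column as the document."""
        return MiniDocFrame(self.columns, name)


frame = MiniDocFrame(
    {
        "title": ["Short title", "Another title"],
        "content": ["A longer news document", "A longer blog document"],
        "category": ["news", "blog"],
    },
    document_column="content",
)
news = frame.filter(lambda row: row["category"] == "news")
print(news.document_column)  # content
print(news.document)         # ['A longer news document']
```

Returning a new wrapper from every operation, rather than a bare table, is what keeps the document metadata from being lost after a filter or a column switch.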

📚 API Reference

Core Classes

  • DocDataFrame: Document-aware DataFrame for eager evaluation
  • DocLazyFrame: Document-aware LazyFrame for lazy evaluation

I/O Functions

  • read_csv(), read_parquet(), read_json(), read_excel() - Read various formats
  • scan_csv(), scan_parquet() - Lazy reading operations
  • from_pandas(), from_arrow() - Convert from other formats

Utility Functions

  • concat_documents() - Concatenate DocDataFrames
  • info() - Library information

Text Namespace Methods

Available on pl.col().text, series.text, and df.text:

  • tokenize() - Tokenize text
  • clean() - Clean text with various options
  • word_count(), char_count(), sentence_count() - Count statistics
  • ngrams() - Extract n-grams
  • contains_pattern() - Pattern matching
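
For readers unfamiliar with these operations, the following plain-Python sketch shows roughly what tokenizing, n-gram extraction, and sentence counting compute. The regex-based rules are simplifying assumptions chosen for illustration. The functions share names with the namespace methods but are standalone stand-ins, and DocFrame's own tokenization rules may differ.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens with a simple regex."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_count(text):
    """Count non-empty segments separated by '.', '!' or '?' (naive rule)."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


text = "Text analysis is fun. Polars is fast!"
tokens = tokenize(text)
print(len(tokens))           # 7
print(ngrams(tokens)[:2])    # ['text analysis', 'analysis is']
print(sentence_count(text))  # 2
```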

🚧 Performance

DocFrame leverages Polars' performance advantages:

  • Memory Efficiency: Lazy evaluation and zero-copy operations
  • Parallel Processing: Automatic parallelization of text operations
  • Columnar Storage: Efficient memory layout for text data
  • Query Optimization: Polars' query optimizer works with text operations

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

```bash
git clone https://github.com/your-org/docframe.git
cd docframe
pip install -e ".[dev]"
pytest
```

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Polars: For the excellent backend DataFrame library
  • GeoPandas: For the design philosophy inspiration
  • NLTK/spaCy: For text processing concepts

📞 Support


DocFrame - Making text analysis as intuitive as data analysis. 🚀

Owner

  • Name: Australian-Text-Analytics-Platform
  • Login: Australian-Text-Analytics-Platform
  • Kind: organization

GitHub Events

Total
  • Push event: 5
  • Create event: 2
Last Year
  • Push event: 5
  • Create event: 2

Dependencies

pyproject.toml pypi
  • nltk >=3.8
  • polars >=1.0.0
  • spacy >=3.7.0
  • tqdm >=4.66.0