https://github.com/australian-text-analytics-platform/docframe

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.0%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: Australian-Text-Analytics-Platform
Language: Python
Default Branch: main
Size: 3.92 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 10 months ago · Last pushed 10 months ago

Metadata Files

Readme

DocFrame

A text analysis library inspired by GeoPandas' design philosophy. It uses Polars as the backend for efficient document processing and analysis.

🚀 Features

DocDataFrame & DocLazyFrame: Document-aware DataFrames with automatic document column detection
Polars Backend: Leverages Polars' performance advantages for large-scale text processing
Text Namespace: Unified text processing API via Polars namespace registration (df.text, series.text, pl.col().text)
Intelligent Auto-Detection: Automatically identifies the document column by picking the text column with the longest average text length
Rich Text Processing: Built-in tokenization, cleaning, n-grams, word/character/sentence counting
Memory Efficient: Lazy evaluation and optimized memory usage through Polars
Comprehensive I/O: Support for CSV, Parquet, JSON, Excel, and more with document column preservation
Serialization: JSON-based serialization with complete metadata preservation
Document Management: Easy document column switching, renaming, and manipulation
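
The auto-detection heuristic mentioned above can be illustrated with a small standalone sketch. This is not DocFrame's implementation; the function name `detect_document_column` and the plain-dict input are assumptions made for illustration. It only shows the idea of choosing the string column with the longest average text length.

```python
from statistics import mean
from typing import Optional


def detect_document_column(columns: dict) -> Optional[str]:
    """Return the name of the all-string column with the longest average length.

    Hypothetical helper: it mirrors the heuristic described in the feature
    list, not DocFrame's actual code.
    """
    best_name, best_avg = None, -1.0
    for name, values in columns.items():
        # Only consider non-empty columns made up entirely of strings
        if not values or not all(isinstance(v, str) for v in values):
            continue
        avg_len = mean(len(v) for v in values)
        if avg_len > best_avg:
            best_name, best_avg = name, avg_len
    return best_name


data = {
    'title': ['Short title', 'Another title'],
    'content': [
        'This is a much longer document with substantial content for analysis',
        'Another detailed document with comprehensive text for processing',
    ],
    'category': ['news', 'blog'],
}
print(detect_document_column(data))  # content
```

A heuristic like this can pick the wrong column when a metadata field holds long strings, which is presumably why the I/O functions also accept an explicit `document_column` argument.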

📦 Installation

pip install docframe

🚀 Quick Start

Creating DocDataFrames

```python
import docframe as dp

# From dictionary (auto-detects document column)
df = dp.DocDataFrame({
    'title': ['Short title', 'Another title'],
    'content': [
        'This is a much longer document with substantial content for analysis',
        'Another detailed document with comprehensive text for processing'
    ],
    'category': ['news', 'blog']
})

# DocFrame automatically detects 'content' as the document column
print(f"Document column: {df.active_document_name}")  # content

# From list of texts with metadata
df = dp.DocDataFrame.from_texts(
    texts=['Hello world!', 'Text analysis is fun.', 'Polars is fast.'],
    metadata={
        'author': ['Alice', 'Bob', 'Charlie'],
        'category': ['greeting', 'opinion', 'fact']
    }
)
```

Text Processing

```python
# Access document text directly
documents = df.document  # Returns polars Series

# Add text statistics
df_stats = (df
    .add_word_count()
    .add_char_count()
    .add_sentence_count()
)

# Text cleaning and processing
df_processed = df.clean_documents(
    lowercase=True,
    remove_punct=True,
    remove_extra_whitespace=True
)

# Filter by text properties
long_docs = df.filter_by_length(min_words=10)
pattern_docs = df.filter_by_pattern(r'\b(analysis|processing)\b')

# Get text statistics summary
stats = df.describe_text()
print(stats)
```
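
To make the filtering and summary steps concrete, here is a plain-Python sketch of what length filtering, pattern filtering, and a word-count summary might compute. The helper names reuse the method names above for readability, but the bodies are assumptions: DocFrame's own word-counting rules and the exact keys returned by `describe_text()` may differ.

```python
import re
from statistics import mean


def word_count(text):
    """Count whitespace-separated words (a simplifying assumption)."""
    return len(text.split())


def filter_by_length(docs, min_words=0, max_words=None):
    """Keep documents whose word count lies in [min_words, max_words]."""
    return [
        d for d in docs
        if word_count(d) >= min_words
        and (max_words is None or word_count(d) <= max_words)
    ]


def filter_by_pattern(docs, pattern):
    """Keep documents in which the regular expression matches somewhere."""
    return [d for d in docs if re.search(pattern, d)]


def describe_text(docs):
    """Summarise word counts across the collection."""
    counts = [word_count(d) for d in docs]
    return {
        'word_count_mean': mean(counts),
        'word_count_min': min(counts),
        'word_count_max': max(counts),
    }


docs = ['Hello world!', 'Text analysis is fun.', 'Polars is fast.']
print(filter_by_length(docs, min_words=3))  # drops 'Hello world!'
print(filter_by_pattern(docs, r'\b(analysis|processing)\b'))
print(describe_text(docs))
```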

Text Namespace Usage

```python
import polars as pl
import docframe  # Registers text namespace

# Use text namespace on expressions
df_with_tokens = df.select([
    pl.col('*'),
    pl.col('document').text.tokenize().alias('tokens'),
    pl.col('document').text.word_count().alias('word_count'),
    pl.col('document').text.char_count().alias('char_count'),
    pl.col('document').text.clean().alias('cleaned_text')
])

# Advanced text processing
df_advanced = df.select([
    pl.col('*'),
    pl.col('document').text.ngrams(n=2).alias('bigrams'),
    pl.col('document').text.sentence_count().alias('sentences')
])
```
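
For readers unfamiliar with these operations, the following standalone sketch shows roughly what tokenization, n-gram extraction, and sentence counting produce for a single string. The regular expressions are naive assumptions chosen for illustration; the tokenizer and sentence splitter behind the `text` namespace may behave differently.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r'\w+', text.lower())


def ngrams(tokens, n=2):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_count(text):
    """Count sentences by splitting on '.', '!' or '?' (a naive rule)."""
    return len([s for s in re.split(r'[.!?]+', text) if s.strip()])


text = 'Text analysis is fun. Polars is fast.'
tokens = tokenize(text)
print(tokens)                # ['text', 'analysis', 'is', 'fun', 'polars', 'is', 'fast']
print(ngrams(tokens, n=2))   # bigrams such as 'text analysis', 'analysis is', ...
print(sentence_count(text))  # 2
```

Note that this simple bigram function crosses sentence boundaries ('fun polars' is produced), which a production tokenizer might avoid.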

Document-Term Matrix

```python
# Create document-term matrix for text analysis
dtm = df.to_dtm(method='count')
print(dtm.head())

# Binary DTM
dtm_binary = df.to_dtm(method='binary')

# TF-IDF (requires additional dependencies)
dtm_tfidf = df.to_dtm(method='tfidf')
```
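
The three `method` values correspond to three common weighting schemes. The sketch below builds a small document-term matrix in plain Python to show the difference. It is an illustration under assumptions: whitespace tokenization, and a TF-IDF weight of `count * log(N / document_frequency)`, which is one of several common variants and may not match the formula DocFrame uses.

```python
import math
from collections import Counter


def to_dtm(docs, method='count'):
    """Build a document-term matrix; returns (vocabulary, rows)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    if method == 'count':
        rows = [[c[t] for t in vocab] for c in counts]
    elif method == 'binary':
        rows = [[int(c[t] > 0) for t in vocab] for c in counts]
    elif method == 'tfidf':
        n_docs = len(docs)
        # Document frequency: number of documents containing each term
        doc_freq = {t: sum(1 for c in counts if c[t] > 0) for t in vocab}
        rows = [
            [c[t] * math.log(n_docs / doc_freq[t]) for t in vocab]
            for c in counts
        ]
    else:
        raise ValueError(f'unknown method: {method}')
    return vocab, rows


docs = ['polars is fast', 'polars is fun fun']
vocab, rows = to_dtm(docs, method='count')
print(vocab)  # ['fast', 'fun', 'is', 'polars']
print(rows)   # [[1, 0, 1, 1], [0, 2, 1, 1]]
print(to_dtm(docs, method='binary')[1])  # [[1, 0, 1, 1], [0, 1, 1, 1]]
```

With this TF-IDF variant, terms that occur in every document (here 'is' and 'polars') receive a weight of zero, since `log(N / N) = 0`.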

I/O Operations

```python
# Read files with automatic document column detection
df = dp.read_csv('documents.csv')  # Auto-detects document column
df = dp.read_parquet('data.parquet', document_column='text')
df = dp.read_json('data.json', document_column='content')

# Write preserving DocDataFrame structure
df.write_csv('output.csv')
df.write_parquet('output.parquet')

# Lazy operations for large datasets
lazy_df = dp.scan_csv('large_file.csv')
processed = (lazy_df
    .filter(pl.col('category') == 'news')
    .select([
        pl.col('*'),
        pl.col('document').text.word_count().alias('words')
    ])
    .collect()  # Returns DocDataFrame
)
```

Data Conversion

```python
# Convert from pandas
import pandas as pd
pdf = pd.DataFrame({'text': ['hello', 'world'], 'label': ['A', 'B']})
df = dp.from_pandas(pdf, document_column='text')

# Convert to regular polars DataFrame
polars_df = df.to_polars()

# Convert to lazy frame
lazy_df = df.to_doclazyframe()
```

Document Column Management

```python
# Switch document column
df_switched = df.set_document('title')  # Use 'title' as document column

# Rename document column
df_renamed = df.rename_document('text')  # Rename 'document' to 'text'

# Join with document preservation
other_df = pl.DataFrame({'id': [1, 2], 'extra': ['A', 'B']})
joined = df.join(other_df, on='id')  # Preserves DocDataFrame type
```

Serialization

```python
# Serialize with complete metadata preservation
json_str = df.serialize('json')

# Restore exact DocDataFrame
df_restored = dp.DocDataFrame.deserialize(json_str, format='json')
assert df_restored.active_document_name == df.active_document_name
```
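
The idea behind metadata-preserving serialization can be sketched with the standard `json` module: the payload carries the column data together with the name of the active document column, so the round trip restores both. The payload layout and the free functions below are hypothetical and are not DocFrame's actual on-disk format.

```python
import json


def serialize(columns, document_column):
    """Bundle column data and the document column name into one JSON string."""
    payload = {
        'metadata': {'document_column': document_column},  # assumed layout
        'data': columns,
    }
    return json.dumps(payload)


def deserialize(json_str):
    """Return (columns, document_column) from a string made by serialize()."""
    payload = json.loads(json_str)
    return payload['data'], payload['metadata']['document_column']


columns = {'text': ['hello', 'world'], 'label': ['A', 'B']}
json_str = serialize(columns, document_column='text')
restored_columns, restored_name = deserialize(json_str)
print(restored_name)  # text
```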

🎯 Advanced Examples

Large-Scale Text Analysis

```python
# Process large document collections efficiently
large_df = (dp.scan_csv('large_corpus.csv')
    .filter(pl.col('language') == 'en')
    .with_columns([
        pl.col('document').text.word_count().alias('word_count'),
        pl.col('document').text.char_count().alias('char_count')
    ])
    .filter(pl.col('word_count') > 50)
    .collect()
)

# Text analysis pipeline
analysis_results = (large_df
    .add_sentence_count()
    .filter_by_length(min_words=100, max_words=1000)
    .sample(n=1000)
    .describe_text()
)
```

Multi-Document Processing

```python
# Concatenate multiple document collections
news_docs = dp.read_csv('news.csv')
blog_docs = dp.read_csv('blogs.csv')
academic_docs = dp.read_csv('papers.csv')

all_docs = dp.concat_documents([news_docs, blog_docs, academic_docs])

# Process by category
results = {}
for category in all_docs['category'].unique():
    category_docs = all_docs.filter(pl.col('category') == category)
    results[category] = {
        'count': len(category_docs),
        'avg_length': category_docs.describe_text()['word_count_mean'][0],
        'vocabulary': category_docs.to_dtm().shape[1]
    }
```

Custom Text Processing

```python
# Combine DocFrame with custom processing
def analyze_sentiment(text: str) -> float:
    # Your sentiment analysis logic
    return 0.5  # placeholder

# Apply custom functions
df_sentiment = df.with_columns([
    pl.col('document').map_elements(analyze_sentiment, return_dtype=pl.Float64).alias('sentiment')
])

# Complex text filtering
complex_filter = (df
    .filter(
        (pl.col('document').text.word_count() > 20) &
        (pl.col('document').text.sentence_count() > 2) &
        (pl.col('category').is_in(['news', 'academic']))
    )
)
```
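
The `analyze_sentiment` placeholder above could be filled in many ways. As one purely illustrative option, the sketch below scores text against two tiny word lists. The lexicons are made up for this example and are far too small for real use; any function with the same `str -> float` signature could be passed to `map_elements` instead.

```python
import re

# Toy lexicons, invented for illustration only
POSITIVE = {'good', 'great', 'fun', 'fast', 'excellent'}
NEGATIVE = {'bad', 'slow', 'poor', 'terrible', 'boring'}


def analyze_sentiment(text: str) -> float:
    """Return a score in [-1, 1]: (positive - negative) / matched words.

    Returns 0.0 when no lexicon word occurs in the text.
    """
    tokens = re.findall(r'\w+', text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


print(analyze_sentiment('Polars is fast and fun'))  # 1.0
print(analyze_sentiment('Slow but good'))           # 0.0
```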

🏗️ Architecture

DocFrame follows GeoPandas' design philosophy adapted for text data:

Document Column: Like GeoPandas' geometry column, DocFrame centers around a designated document column
Transparent Operations: All Polars operations work seamlessly while preserving document metadata
Namespace Integration: Text processing capabilities integrate directly into Polars' expression system
Lazy Evaluation: Full support for Polars' lazy evaluation for memory-efficient processing
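
The "document column" idea can be shown with a toy container that remembers which column holds the documents and carries that choice through operations. The class `DocTable` below is hypothetical and built on plain Python lists. It illustrates the design principle only, not how DocDataFrame wraps Polars.

```python
class DocTable:
    """Toy column store that tracks a designated document column."""

    def __init__(self, columns, document_column):
        if document_column not in columns:
            raise KeyError(f'no column named {document_column!r}')
        self.columns = columns
        self.document_column = document_column

    @property
    def document(self):
        """The values of the designated document column."""
        return self.columns[self.document_column]

    def filter(self, predicate):
        """Keep rows where predicate(row_dict) is true; metadata is preserved."""
        names = list(self.columns)
        rows = zip(*(self.columns[n] for n in names))
        kept = [r for r in rows if predicate(dict(zip(names, r)))]
        new_columns = {n: [r[i] for r in kept] for i, n in enumerate(names)}
        return DocTable(new_columns, self.document_column)

    def set_document(self, name):
        """Return a new table that uses another column as the document column."""
        return DocTable(self.columns, name)


table = DocTable(
    {
        'title': ['Short title', 'Another title'],
        'content': ['A longer news document', 'A longer blog document'],
        'category': ['news', 'blog'],
    },
    document_column='content',
)
news = table.filter(lambda row: row['category'] == 'news')
print(news.document_column)                  # content
print(news.document)                         # ['A longer news document']
print(table.set_document('title').document)  # ['Short title', 'Another title']
```

Returning a new `DocTable` from every operation, rather than a bare dictionary, is what keeps the document designation from being lost, which parallels how GeoPandas keeps its geometry column across DataFrame operations.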

📚 API Reference

Core Classes

DocDataFrame: Document-aware DataFrame for eager evaluation
DocLazyFrame: Document-aware LazyFrame for lazy evaluation

I/O Functions

read_csv(), read_parquet(), read_json(), read_excel() - Read various formats
scan_csv(), scan_parquet() - Lazy reading operations
from_pandas(), from_arrow() - Convert from other formats

Utility Functions

concat_documents() - Concatenate DocDataFrames
info() - Library information

Text Namespace Methods

Available on pl.col().text, series.text, and df.text:

tokenize() - Tokenize text
clean() - Clean text with various options
word_count(), char_count(), sentence_count() - Count statistics
ngrams() - Extract n-grams
contains_pattern() - Pattern matching

🚧 Performance

DocFrame leverages Polars' performance advantages:

Memory Efficiency: Lazy evaluation and zero-copy operations
Parallel Processing: Automatic parallelization of text operations
Columnar Storage: Efficient memory layout for text data
Query Optimization: Polars' query optimizer works with text operations

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

git clone https://github.com/australian-text-analytics-platform/docframe.git
cd docframe
pip install -e ".[dev]"
pytest

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Polars: For the excellent backend DataFrame library
GeoPandas: For the design philosophy inspiration
NLTK/spaCy: For text processing concepts

📞 Support

Documentation: Full documentation
Issues: GitHub Issues
Discussions: GitHub Discussions

DocFrame - Making text analysis as intuitive as data analysis. 🚀

Owner

Name: Australian-Text-Analytics-Platform
Login: Australian-Text-Analytics-Platform
Kind: organization

Website: https://atap.edu.au
Repositories: 9
Profile: https://github.com/Australian-Text-Analytics-Platform

GitHub Events

Total

Push event: 5
Create event: 2

Last Year

Push event: 5
Create event: 2

Dependencies

pyproject.toml pypi

nltk >=3.8
polars >=1.0.0
spacy >=3.7.0
tqdm >=4.66.0
