https://github.com/australian-text-analytics-platform/docframe
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Australian-Text-Analytics-Platform
- Language: Python
- Default Branch: main
- Size: 3.92 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
DocFrame
A powerful text analysis library inspired by GeoPandas' design philosophy, using Polars as the backend for efficient document processing and analysis.
🚀 Features
- DocDataFrame & DocLazyFrame: Document-aware DataFrames with automatic document column detection
- Polars Backend: Leverages Polars' performance advantages for large-scale text processing
- Text Namespace: Unified text processing API via Polars namespace registration (df.text, series.text, pl.col().text)
- Intelligent Auto-Detection: Automatically identifies document columns using a longest-average-text-length heuristic
- Rich Text Processing: Built-in tokenization, cleaning, n-grams, word/character/sentence counting
- Memory Efficient: Lazy evaluation and optimized memory usage through Polars
- Comprehensive I/O: Support for CSV, Parquet, JSON, Excel, and more with document column preservation
- Serialization: JSON-based serialization with complete metadata preservation
- Document Management: Easy document column switching, renaming, and manipulation
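The auto-detection heuristic named in the feature list (longest average text length) can be sketched in plain Python. This is only an illustration of the idea, not DocFrame's implementation: `guess_document_column` is a hypothetical helper, and treating only all-string columns as candidates is an assumption.

```python
from statistics import mean
from typing import Dict, List, Optional


def guess_document_column(columns: Dict[str, List]) -> Optional[str]:
    """Return the name of the string column with the longest average length.

    Hypothetical helper for illustration; not part of DocFrame's API.
    """
    best_name, best_avg = None, -1.0
    for name, values in columns.items():
        # Assumption: only non-empty columns whose values are all strings qualify.
        if not values or not all(isinstance(v, str) for v in values):
            continue
        avg_len = mean(len(v) for v in values)
        if avg_len > best_avg:
            best_name, best_avg = name, avg_len
    return best_name


data = {
    "title": ["Short title", "Another title"],
    "content": [
        "This is a much longer document with substantial content for analysis",
        "Another detailed document with comprehensive text for processing",
    ],
    "category": ["news", "blog"],
}
print(guess_document_column(data))
```

With the sample data, the `content` column wins because its strings are far longer on average than those in `title` or `category`.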
📦 Installation
bash
pip install docframe
🚀 Quick Start
Creating DocDataFrames
```python
import docframe as dp

# From dictionary (auto-detects document column)
df = dp.DocDataFrame({
    'title': ['Short title', 'Another title'],
    'content': [
        'This is a much longer document with substantial content for analysis',
        'Another detailed document with comprehensive text for processing'
    ],
    'category': ['news', 'blog']
})

# DocFrame automatically detects 'content' as the document column
print(f"Document column: {df.active_document_name}")  # content

# From list of texts with metadata
df = dp.DocDataFrame.from_texts(
    texts=['Hello world!', 'Text analysis is fun.', 'Polars is fast.'],
    metadata={
        'author': ['Alice', 'Bob', 'Charlie'],
        'category': ['greeting', 'opinion', 'fact']
    }
)
```
Text Processing
```python
# Access document text directly
documents = df.document  # Returns polars Series

# Add text statistics
df_stats = (df
    .add_word_count()
    .add_char_count()
    .add_sentence_count()
)

# Text cleaning and processing
df_processed = df.clean_documents(
    lowercase=True,
    remove_punct=True,
    remove_extra_whitespace=True
)

# Filter by text properties
long_docs = df.filter_by_length(min_words=10)
pattern_docs = df.filter_by_pattern(r'\b(analysis|processing)\b')

# Get text statistics summary
stats = df.describe_text()
print(stats)
```
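To make the cleaning and counting steps concrete, here is a minimal plain-Python sketch of what such operations typically do. The functions are hypothetical stand-ins rather than DocFrame's code: whitespace tokenization and splitting sentences on `.`, `!` and `?` are simplifying assumptions, and only the keyword names mirror the options shown in the example.

```python
import re
import string


def clean_text(text, lowercase=True, remove_punct=True, remove_extra_whitespace=True):
    """Apply the three cleaning options in order (illustrative only)."""
    if lowercase:
        text = text.lower()
    if remove_punct:
        # Drop ASCII punctuation characters.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_extra_whitespace:
        # Collapse whitespace runs and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    return text


def word_count(text):
    # Assumption: words are whitespace-separated tokens.
    return len(text.split())


def sentence_count(text):
    # Assumption: sentences end at runs of '.', '!' or '?'.
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


sample = "  Hello,   World!  Text analysis is fun. "
print(clean_text(sample))
print(word_count(sample), sentence_count(sample))
```

A production tokenizer (for example one from NLTK or spaCy, both listed as dependencies) would handle abbreviations and Unicode punctuation that this sketch ignores.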
Text Namespace Usage
```python
import polars as pl
import docframe  # Registers text namespace

# Use text namespace on expressions
df_with_tokens = df.select([
    pl.col('*'),
    pl.col('document').text.tokenize().alias('tokens'),
    pl.col('document').text.word_count().alias('word_count'),
    pl.col('document').text.char_count().alias('char_count'),
    pl.col('document').text.clean().alias('cleaned_text')
])

# Advanced text processing
df_advanced = df.select([
    pl.col('*'),
    pl.col('document').text.ngrams(n=2).alias('bigrams'),
    pl.col('document').text.sentence_count().alias('sentences')
])
```
Document-Term Matrix
```python
# Create document-term matrix for text analysis
dtm = df.to_dtm(method='count')
print(dtm.head())

# Binary DTM
dtm_binary = df.to_dtm(method='binary')

# TF-IDF (requires additional dependencies)
dtm_tfidf = df.to_dtm(method='tfidf')
```
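For readers unfamiliar with the three weighting methods, the sketch below builds a tiny document-term matrix by hand. It is not DocFrame's implementation: `build_dtm` is a hypothetical function, tokenization is a lowercase whitespace split, and the TF-IDF variant used here (raw count times `ln(N / df)`, no smoothing) is one common definition chosen as an assumption.

```python
import math
from collections import Counter


def build_dtm(docs, method="count"):
    """Return (vocabulary, matrix) with one row per document (illustrative)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]

    if method == "count":
        matrix = [[c[t] for t in vocab] for c in counts]
    elif method == "binary":
        matrix = [[int(c[t] > 0) for t in vocab] for c in counts]
    elif method == "tfidf":
        n_docs = len(docs)
        # Number of documents that contain each term.
        doc_freq = {t: sum(1 for c in counts if c[t] > 0) for t in vocab}
        matrix = [
            [c[t] * math.log(n_docs / doc_freq[t]) for t in vocab]
            for c in counts
        ]
    else:
        raise ValueError(f"unknown method: {method!r}")
    return vocab, matrix


docs = ["polars is fast", "text analysis is fun", "fast fast text"]
vocab, count_matrix = build_dtm(docs, method="count")
print(vocab)
print(count_matrix)
```

Under this unsmoothed definition, a term that appears in every document gets a weight of zero, which is why libraries often add smoothing terms.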
I/O Operations
```python
# Read files with automatic document column detection
df = dp.read_csv('documents.csv')  # Auto-detects document column
df = dp.read_parquet('data.parquet', document_column='text')
df = dp.read_json('data.json', document_column='content')

# Write preserving DocDataFrame structure
df.write_csv('output.csv')
df.write_parquet('output.parquet')

# Lazy operations for large datasets
lazy_df = dp.scan_csv('large_file.csv')
processed = (lazy_df
    .filter(pl.col('category') == 'news')
    .select([
        pl.col('*'),
        pl.col('document').text.word_count().alias('words')
    ])
    .collect()  # Returns DocDataFrame
)
```
Data Conversion
```python
# Convert from pandas
import pandas as pd
pdf = pd.DataFrame({'text': ['hello', 'world'], 'label': ['A', 'B']})
df = dp.from_pandas(pdf, document_column='text')

# Convert to regular polars DataFrame
polars_df = df.to_polars()

# Convert to lazy frame
lazy_df = df.to_doclazyframe()
```
Document Column Management
```python
# Switch document column
df_switched = df.set_document('title')  # Use 'title' as document column

# Rename document column
df_renamed = df.rename_document('text')  # Rename 'document' to 'text'

# Join with document preservation
other_df = pl.DataFrame({'id': [1, 2], 'extra': ['A', 'B']})
joined = df.join(other_df, on='id')  # Preserves DocDataFrame type
```
Serialization
```python
# Serialize with complete metadata preservation
json_str = df.serialize('json')

# Restore exact DocDataFrame
df_restored = dp.DocDataFrame.deserialize(json_str, format='json')
assert df_restored.active_document_name == df.active_document_name
```
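The idea of preserving the document column through a JSON round trip can be illustrated without DocFrame. In the sketch below, the payload layout (a `metadata` object next to the column data) and both function names are assumptions made for the example; the README does not document the actual format.

```python
import json


def serialize_frame(columns, document_column):
    """Bundle column data with the document-column name (hypothetical layout)."""
    if document_column not in columns:
        raise KeyError(f"unknown document column: {document_column!r}")
    payload = {
        "metadata": {"document_column": document_column},
        "columns": columns,
    }
    return json.dumps(payload)


def deserialize_frame(json_str):
    """Recover both the data and the document-column name."""
    payload = json.loads(json_str)
    return payload["columns"], payload["metadata"]["document_column"]


columns = {
    "document": ["Hello world!", "Polars is fast."],
    "author": ["Alice", "Charlie"],
}
json_str = serialize_frame(columns, document_column="document")
restored_columns, restored_name = deserialize_frame(json_str)
print(restored_name)
```

Storing the column name alongside the data is what lets the restored object know which column holds the documents without re-running auto-detection.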
🎯 Advanced Examples
Large-Scale Text Analysis
```python
# Process large document collections efficiently
large_df = (dp.scan_csv('large_corpus.csv')
    .filter(pl.col('language') == 'en')
    .with_columns([
        pl.col('document').text.word_count().alias('word_count'),
        pl.col('document').text.char_count().alias('char_count')
    ])
    .filter(pl.col('word_count') > 50)
    .collect()
)

# Text analysis pipeline
analysis_results = (large_df
    .add_sentence_count()
    .filter_by_length(min_words=100, max_words=1000)
    .sample(n=1000)
    .describe_text()
)
```
Multi-Document Processing
```python
# Concatenate multiple document collections
news_docs = dp.read_csv('news.csv')
blog_docs = dp.read_csv('blogs.csv')
academic_docs = dp.read_csv('papers.csv')

all_docs = dp.concat_documents([news_docs, blog_docs, academic_docs])

# Process by category
results = {}
for category in all_docs['category'].unique():
    category_docs = all_docs.filter(pl.col('category') == category)
    results[category] = {
        'count': len(category_docs),
        'avg_length': category_docs.describe_text()['word_count_mean'][0],
        'vocabulary': category_docs.to_dtm().shape[1]
    }
```
Custom Text Processing
```python
# Combine DocFrame with custom processing
def analyze_sentiment(text: str) -> float:
    # Your sentiment analysis logic
    return 0.5  # placeholder

# Apply custom functions
df_sentiment = df.with_columns([
    pl.col('document').map_elements(analyze_sentiment, return_dtype=pl.Float64).alias('sentiment')
])

# Complex text filtering
complex_filter = (df
    .filter(
        (pl.col('document').text.word_count() > 20) &
        (pl.col('document').text.sentence_count() > 2) &
        (pl.col('category').is_in(['news', 'academic']))
    )
)
```
🏗️ Architecture
DocFrame follows GeoPandas' design philosophy adapted for text data:
- Document Column: Like GeoPandas' geometry column, DocFrame centers around a designated document column
- Transparent Operations: All Polars operations work seamlessly while preserving document metadata
- Namespace Integration: Text processing capabilities integrate directly into Polars' expression system
- Lazy Evaluation: Full support for Polars' lazy evaluation for memory-efficient processing
📚 API Reference
Core Classes
- DocDataFrame: Document-aware DataFrame for eager evaluation
- DocLazyFrame: Document-aware LazyFrame for lazy evaluation
I/O Functions
- read_csv(), read_parquet(), read_json(), read_excel() - Read various formats
- scan_csv(), scan_parquet() - Lazy reading operations
- from_pandas(), from_arrow() - Convert from other formats
Utility Functions
- concat_documents() - Concatenate DocDataFrames
- info() - Library information
Text Namespace Methods
Available on pl.col().text, series.text, and df.text:
- tokenize() - Tokenize text
- clean() - Clean text with various options
- word_count(), char_count(), sentence_count() - Count statistics
- ngrams() - Extract n-grams
- contains_pattern() - Pattern matching
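As a reminder of what n-gram extraction produces, here is a minimal sliding-window sketch in plain Python. The function below is a hypothetical stand-in, not DocFrame's `ngrams()`: returning tuples of tokens and tokenizing on whitespace are both assumptions.

```python
def ngrams(tokens, n=2):
    """Return all contiguous n-token windows as tuples (illustrative only)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields no windows.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "text analysis is fun".split()
bigrams = ngrams(tokens, n=2)
print(bigrams)
```

A sequence of `k` tokens yields `k - n + 1` n-grams, so the four tokens above produce three bigrams.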
🚧 Performance
DocFrame leverages Polars' performance advantages:
- Memory Efficiency: Lazy evaluation and zero-copy operations
- Parallel Processing: Automatic parallelization of text operations
- Columnar Storage: Efficient memory layout for text data
- Query Optimization: Polars' query optimizer works with text operations
🤝 Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
Development Setup
bash
git clone https://github.com/australian-text-analytics-platform/docframe.git
cd docframe
pip install -e ".[dev]"
pytest
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Polars: For the excellent backend DataFrame library
- GeoPandas: For the design philosophy inspiration
- NLTK/spaCy: For text processing concepts
📞 Support
- Documentation: Full documentation
- Issues: GitHub Issues
- Discussions: GitHub Discussions
DocFrame - Making text analysis as intuitive as data analysis. 🚀
Owner
- Name: Australian-Text-Analytics-Platform
- Login: Australian-Text-Analytics-Platform
- Kind: organization
- Website: https://atap.edu.au
- Repositories: 9
- Profile: https://github.com/Australian-Text-Analytics-Platform
GitHub Events
Total
- Push event: 5
- Create event: 2
Last Year
- Push event: 5
- Create event: 2
Dependencies
- nltk >=3.8
- polars >=1.0.0
- spacy >=3.7.0
- tqdm >=4.66.0