https://github.com/awslabs/observation-extractor

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (15.0%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: awslabs
License: apache-2.0
Language: Python
Default Branch: main
Size: 131 KB

Statistics

Stars: 4
Watchers: 2
Forks: 1
Open Issues: 1
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme License Threat model

Observation Extractor

tl;dr: Observation Extractor is a tool for collecting observations from data.

Observations are useful bits of data, extracted from the data you pass in, that relate to questions you define. Use Observation Extractor to process PDF files (and maybe someday other file types) into formats like CSV (and later Parquet), turning unstructured documents into structured observations that you can query and use directly or through your application. When you output to a format like CSV or Parquet, observations are the row-level records.

What does it do?

Observation Extractor takes an unstructured data file as input (like a pdf) and outputs a list of Observation objects. Each observation includes standard fields that are extracted from the document together with metadata like the document name and page number.
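To make the shape of an observation concrete, here is a minimal sketch of a record type and its CSV serialization. The class layout, the field names (`document_name`, `page_number`, `question`, `answer`), and the sample values are assumptions for illustration; the project's actual Observation fields may differ.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Observation:
    """Hypothetical observation record; field names are illustrative only."""
    document_name: str  # metadata: the source document
    page_number: int    # metadata: the page the observation came from
    question: str       # the user-defined question this observation relates to
    answer: str         # the extracted text


def to_csv(observations):
    """Serialize observations to CSV text, one row per observation."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Observation)])
    writer.writeheader()
    for obs in observations:
        writer.writerow(asdict(obs))
    return buffer.getvalue()


# Made-up sample values, not real extractor output.
observations = [
    Observation("sample-record-2.pdf", 1, "When did the accident occur?", "placeholder answer"),
]
csv_text = to_csv(observations)
print(csv_text)
```

Each observation becomes one row, which matches the idea of observations as row-level records.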

So what?

You can populate observations into a datastore and make them available to your human and AI users. They can be queried based on metadata like date and the specific questions they relate to. To start mapping a document into useful observations, you can define question sets that represent the thought process of a subject-matter expert coming up to speed on a case.
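As a rough illustration of querying by metadata, the sketch below filters stored observations by question and by date. The record layout (`question`, `date`, `text`) and the sample values are assumptions, and an in-memory list stands in for a real datastore.

```python
from datetime import date

# Hypothetical stored observations; keys and values are placeholders.
observations = [
    {"question": "q1", "date": date(2024, 1, 5), "text": "placeholder A"},
    {"question": "q2", "date": date(2024, 2, 10), "text": "placeholder B"},
    {"question": "q1", "date": date(2024, 3, 15), "text": "placeholder C"},
]


def query(records, question=None, on_or_after=None):
    """Return records matching the optional question and date filters."""
    results = []
    for record in records:
        if question is not None and record["question"] != question:
            continue
        if on_or_after is not None and record["date"] < on_or_after:
            continue
        results.append(record)
    return results


recent_q1 = query(observations, question="q1", on_or_after=date(2024, 2, 1))
print(recent_q1)
```

In a real deployment the same filters would likely be expressed as a query against the datastore instead of a Python loop.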

Diagram: a document is split into chunks, which are then processed into observations and put in a datastore.
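The flow in the diagram can be sketched as three small steps: split, process, store. This is a simplified illustration, not the project's implementation: the function names are hypothetical, `process_chunk` is a stand-in that only copies page text where the real tool would extract answers to your questions, and a Python list stands in for the datastore.

```python
def split_into_chunks(pages, pages_per_chunk=2):
    """Group consecutive pages into chunks, keeping 1-based page numbers."""
    numbered = list(enumerate(pages, start=1))
    return [
        numbered[i:i + pages_per_chunk]
        for i in range(0, len(numbered), pages_per_chunk)
    ]


def process_chunk(chunk, document_name):
    """Stand-in for extraction: emit one record per non-empty page."""
    return [
        {"document": document_name, "page": page_number, "text": text.strip()}
        for page_number, text in chunk
        if text.strip()
    ]


def ingest(pages, document_name, datastore):
    """Run every chunk through processing and append results to the datastore."""
    for chunk in split_into_chunks(pages):
        datastore.extend(process_chunk(chunk, document_name))
    return datastore


datastore = []  # an in-memory list stands in for a real datastore
pages = ["page one text", "", "page three text"]
ingest(pages, "example.pdf", datastore)
print(datastore)
```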

Where does it fit?

You can use Observation Extractor as a local script for ad-hoc extractions or as a component in data ingestion pipelines. The CLI provides a simple interface that you can use in evaluation, tuning, and scaled ingestion.
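For pipeline use, one option is to build the CLI invocation programmatically. The sketch below assembles an `observer` command from the flags listed in the help text in the Usage section; the helper name `build_command` and the `-out.csv` output naming are assumptions for illustration.

```python
from pathlib import Path


def build_command(pdf_path, questions_path, case_id=None, count=1):
    """Build an argument list for one `observer` run (hypothetical helper)."""
    pdf = Path(pdf_path)
    out = pdf.with_name(pdf.stem + "-out.csv")  # assumed naming convention
    command = [
        "observer",
        "-f", str(pdf),
        "-t", "pdf",
        "-o", str(out),
        "-q", str(questions_path),
        "-c", str(count),
    ]
    if case_id is not None:
        command += ["-i", case_id]
    return command


command = build_command("report.pdf", "questions.txt", case_id="case-001")
print(command)
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the package is installed and AWS credentials are available in the environment.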

What does this look like on AWS?

Here is one way. You could use almost any compute environment, including SageMaker, EC2, or Lambda (with a potential maximum run time limitation).

Scaled Ingestion Example

How do I use it?

Setup

```bash
virtualenv .venv
source ./.venv/bin/activate
python -m pip install -r requirements.txt
python -m pip install -e . # local install for dev
```

Usage

Observation Extractor uses AWS credentials from the runtime environment. AWS IAM lets you securely manage identities and access to AWS services and resources. Credentials can be managed using the AWS CLI for local operations or by adding IAM permissions to the compute runtime (for example, AWS Lambda, ECS, EC2, or SageMaker). A best practice is to use separate roles for local development and automated pipelines to mitigate the risk of data corruption.

Use --help to view available options

```bash
(.venv) localhost % observer -h
Initializing main class
Parsing arguments
usage: Observer [-h] [-v] [-f FILE] [-i case-id] [-d DYNAMODB_TABLE] [-t TYPE] [-o OUT] [-j out-type] [-q QUESTIONS] [-c COUNT]

A tool for collecting observations from data

options:
  -h, --help            show this help message and exit
  -v, --verbose         Enable verbose outputs
  -f, --file FILE       input file path
  -i, --case-id case-id
                        a case id to associate with observations from this document
  -d, --dynamodb-table DYNAMODB_TABLE
                        name of an Amazon DynamoDB table to write observations to
  -t, --type TYPE       type of input [pdf] # todo: more
  -o, --out OUT         output file path or table name
  -j, --out-type out-type
                        output file format [csv, ddb] # todo: more
  -q, --questions QUESTIONS
                        path to a text file with questions for your data
  -c, --count COUNT     maximum questions to include in a prompt

Use --help to see more options
```
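The help output above has the shape of a Python `argparse` parser. As a reading aid, here is a sketch of a parser that accepts the same flags. It is reconstructed from the help text only and is not the project's source; in particular, `type=int` for `--count` and the absence of defaults are assumptions.

```python
import argparse


def build_parser():
    """Reconstruct a parser with the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="Observer",
        description="A tool for collecting observations from data",
        epilog="Use --help to see more options",
    )
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose outputs")
    parser.add_argument("-f", "--file", help="input file path")
    parser.add_argument("-i", "--case-id", metavar="case-id",
                        help="a case id to associate with observations from this document")
    parser.add_argument("-d", "--dynamodb-table",
                        help="name of an Amazon DynamoDB table to write observations to")
    parser.add_argument("-t", "--type", help="type of input [pdf]")
    parser.add_argument("-o", "--out", help="output file path or table name")
    parser.add_argument("-j", "--out-type", metavar="out-type",
                        help="output file format [csv, ddb]")
    parser.add_argument("-q", "--questions",
                        help="path to a text file with questions for your data")
    parser.add_argument("-c", "--count", type=int,  # int conversion is an assumption
                        help="maximum questions to include in a prompt")
    return parser


args = build_parser().parse_args(
    ["-v", "-f", "sample-record-2.pdf", "-t", "pdf", "-c", "1",
     "-o", "out.csv", "-q", "questions.txt"]
)
print(args)
```

Note that `argparse` maps hyphenated flags to underscored attributes, so `--case-id` is read as `args.case_id` and `--out-type` as `args.out_type`.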

By example

```bash
(.venv) localhost % observer -v \
  -f sample-record-2.pdf \
  -t pdf \
  -c 1 \
  -o sample-record-2-auto-filtered-out.csv \
  -q observer/examples/auto-accident.txt
```
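The `-q` and `-c` options together suggest that questions are read from a text file and grouped so that at most COUNT questions go into each prompt. The sketch below shows one way such batching could work. It assumes one question per line, which the README does not confirm, and the sample questions are made up and not taken from `auto-accident.txt`.

```python
def load_questions(text):
    """Parse questions from file contents, assuming one question per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def batch_questions(questions, count):
    """Split questions into groups of at most `count` per prompt."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return [questions[i:i + count] for i in range(0, len(questions), count)]


# Made-up question file contents for illustration.
sample_text = """
When did the accident occur?
Which vehicles were involved?

Were any injuries reported?
"""

questions = load_questions(sample_text)
one_per_prompt = batch_questions(questions, 1)   # mirrors `-c 1`
two_per_prompt = batch_questions(questions, 2)
print(one_per_prompt)
print(two_per_prompt)
```

A smaller count means more prompts per chunk but fewer questions competing for attention in each one, which is a trade-off worth evaluating on your own documents.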

Owner

Name: Amazon Web Services - Labs
Login: awslabs
Kind: organization
Location: Seattle, WA

Website: http://amazon.com/aws/
Repositories: 914
Profile: https://github.com/awslabs

AWS Labs

GitHub Events

Total

Watch event: 2
Public event: 1
Push event: 3
Pull request event: 3
Fork event: 1

Last Year

Watch event: 2
Public event: 1
Push event: 3
Pull request event: 3
Fork event: 1

Dependencies

pyproject.toml pypi

requirements.txt pypi

boto3 *
pycryptodome ==3.15.0
pydantic *
pypdf *
shortuuid *
