nv-ingest

NeMo Retriever extraction is a scalable, performance-oriented document content and metadata extraction microservice. NeMo Retriever extraction uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images that you can use in downstream generative applications.

https://github.com/nvidia/nv-ingest

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary

Keywords from Contributors

agents optimism multi-agents interactive embedded diffusion sequencers hacking projects multi-modality
Last synced: 7 months ago

Repository


Basic Info
Statistics
  • Stars: 2,735
  • Watchers: 30
  • Forks: 261
  • Open Issues: 122
  • Releases: 7
Created over 1 year ago · Last pushed 7 months ago
Metadata Files
Readme, Changelog, Contributing, License, Code of conduct, Citation, Codeowners, Security

README.md

What is NeMo Retriever Extraction?

NeMo Retriever extraction is a scalable, performance-oriented document content and metadata extraction microservice. NeMo Retriever extraction uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images that you can use in downstream generative applications.

[!Note] NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.

NeMo Retriever extraction parallelizes the work of splitting documents into pages. On each page, artifacts such as text, tables, charts, and images are classified, extracted, and further contextualized through optical character recognition (OCR) into a well-defined JSON schema. From there, NeMo Retriever extraction can optionally compute embeddings for the extracted content and optionally store them in a Milvus vector database.

[!Note] Cached and Deplot are deprecated. Instead, NeMo Retriever extraction now uses the yolox-graphic-elements NIM. With this change, you should now be able to run NeMo Retriever Extraction on a single 24GB A10G or better GPU. If you want to use the old pipeline, with Cached and Deplot, use the NeMo Retriever Extraction 24.12.1 release.

The following diagram shows the NeMo Retriever extraction pipeline.

Pipeline Overview

Table of Contents

  1. What NeMo Retriever Extraction Is
  2. Prerequisites
  3. Quickstart
  4. GitHub Repository Structure
  5. Notices

What NeMo Retriever Extraction Is

NeMo Retriever Extraction is a library and microservice that does the following:

  • Accept a job specification that contains a document payload and a set of ingestion tasks to perform on that payload.
  • Store the result of each job to retrieve later. The result is a dictionary that contains a list of metadata that describes the objects extracted from the base document, and processing annotations and timing/trace data.
  • Support multiple methods of extraction for each document type to balance trade-offs between throughput and accuracy. For example, for .pdf documents, extraction can be performed by using pdfium, nemoretriever-parse, Unstructured.io, or Adobe Content Extraction Services.
  • Support various types of pre-processing and post-processing operations, including text splitting and chunking, transformation and filtering, embedding generation, and image offloading to storage.

NeMo Retriever Extraction supports the following file types:

  • bmp
  • docx
  • html (converted to markdown format)
  • jpeg
  • json (treated as text)
  • md (treated as text)
  • pdf
  • png
  • pptx
  • sh (treated as text)
  • tiff
  • txt
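As an illustration only, the list above can be turned into a small pre-flight filter that separates files with a supported extension from everything else before you submit a job. The helper name `partition_by_support` is hypothetical and not part of the nv-ingest API, and the sketch assumes that only the extensions spelled exactly as listed count as supported (so `.jpg` is treated as unsupported here).

```python
from pathlib import Path

# Extensions taken from the supported file types listed above.
SUPPORTED_EXTENSIONS = {
    "bmp", "docx", "html", "jpeg", "json", "md",
    "pdf", "png", "pptx", "sh", "tiff", "txt",
}


def partition_by_support(paths):
    """Split paths into (supported, unsupported) lists by file extension."""
    supported, unsupported = [], []
    for raw in paths:
        path = Path(raw)
        # Path.suffix includes the leading dot; compare case-insensitively.
        extension = path.suffix.lower().lstrip(".")
        if extension in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


if __name__ == "__main__":
    ok, skipped = partition_by_support(["report.PDF", "slides.pptx", "notes.rtf"])
    print("submit:", [p.name for p in ok])
    print("skip:", [p.name for p in skipped])
```

Checking extensions up front is only a convenience; the service itself remains the authority on what it accepts.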

What NeMo Retriever Extraction Isn't

NeMo Retriever extraction does not do the following:

  • Run a static pipeline or fixed set of operations on every submitted document.
  • Act as a wrapper for any specific document parsing library.

For more information, see the full NeMo Retriever Extraction documentation.

Prerequisites

For production-level performance and scalability, we recommend that you deploy the pipeline and supporting NIMs by using Docker Compose or Kubernetes (Helm charts). For more information, refer to prerequisites.

Library Mode Quickstart

For small-scale workloads, such as workloads of fewer than 100 PDFs, you can use library mode. Library mode depends on NIMs that are already self-hosted or, by default, NIMs that are hosted on build.nvidia.com.

Library mode deployment of nv-ingest requires:

  • Linux operating systems (Ubuntu 22.04 or later recommended)
  • Python 3.12
  • We strongly advise using an isolated Python virtual environment, such as one provided by uv or conda

Step 1: Prepare Your Environment

Create a fresh Python virtual environment and install nv-ingest and its dependencies. The following command uses uv.

```shell
uv venv --python 3.12 nvingest && \
  source nvingest/bin/activate && \
  uv pip install nv-ingest==25.6.2 nv-ingest-api==25.6.2 nv-ingest-client==25.6.3
```
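To double-check the install, a minimal sketch such as the following compares the installed package versions against the pins in the command above. It uses only the standard-library `importlib.metadata` module. The helper name `find_mismatches` and its `get_version` parameter are hypothetical and exist only for this illustration.

```python
from importlib import metadata

# Versions pinned by the install command above.
PINNED = {
    "nv-ingest": "25.6.2",
    "nv-ingest-api": "25.6.2",
    "nv-ingest-client": "25.6.3",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: installed_version_or_None} for packages that differ from the pin."""
    mismatches = {}
    for package, wanted in pinned.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != wanted:
            mismatches[package] = installed
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(PINNED)
    print("all pinned versions installed" if not problems else f"mismatches: {problems}")
```

Run it with the virtual environment activated; otherwise it reports on whichever Python happens to be first on your path.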

Set your NVIDIA_BUILD_API_KEY and NVIDIA_API_KEY. If you don't have a key, you can get one on build.nvidia.com. For instructions, refer to Generate Your NGC Keys.

```
# Note: these should be the same value
export NVIDIA_BUILD_API_KEY=nvapi-...
export NVIDIA_API_KEY=nvapi-...
```
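Before starting a longer run, it can help to verify that both environment variables, NVIDIA_BUILD_API_KEY and NVIDIA_API_KEY, are set and hold the same value, as the note above requires. The following is a minimal sketch; the helper name `check_api_keys` is hypothetical and not part of nv-ingest.

```python
import os

REQUIRED_KEYS = ("NVIDIA_BUILD_API_KEY", "NVIDIA_API_KEY")


def check_api_keys(env):
    """Return a list of human-readable problems; an empty list means the keys look usable."""
    problems = [f"{name} is not set" for name in REQUIRED_KEYS if not env.get(name)]
    # Collect the distinct non-empty values; more than one means the keys differ.
    values = {env.get(name) for name in REQUIRED_KEYS if env.get(name)}
    if len(values) > 1:
        problems.append("NVIDIA_BUILD_API_KEY and NVIDIA_API_KEY differ")
    return problems


if __name__ == "__main__":
    for problem in check_api_keys(os.environ):
        print("warning:", problem)
```

The check only looks at presence and equality; it does not contact any service to validate the key.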

Step 2: Ingest Documents

You can submit jobs programmatically in Python.

To confirm that you have activated your virtual environment, run which python and confirm that you see nvingest in the result. You can do this before any python command that you run.

which python
/home/dev/projects/nv-ingest/nvingest/bin/python

If you have a very high number of CPUs, and see the process hang without progress, we recommend that you use taskset to limit the number of CPUs visible to the process. Use the following code.

taskset -c 0-3 python your_ingestion_script.py
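If you prefer to set the limit from inside the script, the standard-library call `os.sched_setaffinity` changes the CPU affinity of the current process, which is the same mechanism that taskset uses. The following sketch is Linux-only and assumes, without confirmation from the nv-ingest documentation, that restricting affinity this way helps in the same situations as taskset. The helper name `limit_visible_cpus` is hypothetical.

```python
import os


def limit_visible_cpus(max_cpus):
    """Restrict the current process to at most max_cpus of its allowed CPUs (Linux only)."""
    if max_cpus < 1:
        raise ValueError("max_cpus must be at least 1")
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform does not expose CPU affinity through the os module
    allowed = sorted(os.sched_getaffinity(0))
    chosen = set(allowed[:max_cpus])
    os.sched_setaffinity(0, chosen)  # pid 0 means the calling process
    return chosen


if __name__ == "__main__":
    print("CPUs visible to this process:", limit_visible_cpus(4))
```

Call the helper before you start the pipeline so that any child processes inherit the restricted affinity.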

On a low-end laptop with 4 CPU cores, the following code should take about 10 seconds.

```python
import logging, os, time, sys

from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import PipelineCreationSchema
from nv_ingest_api.util.logging.configuration import configure_logging as configure_local_logging
from nv_ingest_client.client import Ingestor, NvIngestClient
from nv_ingest_api.util.message_brokers.simple_message_broker import SimpleClient
from nv_ingest_client.util.process_json_files import ingest_json_results_to_blob

# Start the pipeline subprocess for library mode
config = PipelineCreationSchema()

run_pipeline(config, block=False, disable_dynamic_scaling=True, run_in_subprocess=True)

client = NvIngestClient(
    message_client_allocator=SimpleClient,
    message_client_port=7671,
    message_client_hostname="localhost"
)

# gpu_cagra accelerated indexing is not available in milvus-lite
# Provide a filename for milvus_uri to use milvus-lite
milvus_uri = "milvus.db"
collection_name = "test"
sparse = False

# do content extraction from files
ingestor = (
    Ingestor(client=client)
    .files("data/multimodal_test.pdf")
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=True,
        paddle_output_format="markdown",
        extract_infographics=True,
        # extract_method="nemoretriever_parse", # Slower, but maximally accurate, especially for PDFs with pages that are scanned images
        text_depth="page"
    ).embed()
    .vdb_upload(
        collection_name=collection_name,
        milvus_uri=milvus_uri,
        sparse=sparse,
        # for llama-3.2 embedder, use 1024 for e5-v5
        dense_dim=2048
    )
)

print("Starting ingestion..")
t0 = time.time()

# Return both successes and failures
# Use for large batches where you want successful chunks/pages to be committed,
# while collecting detailed diagnostics for failures.
results, failures = ingestor.ingest(show_progress=True, return_failures=True)

# Return only successes (alternative to the call above)
# results = ingestor.ingest(show_progress=True)

t1 = time.time()
print(f"Total time: {t1 - t0} seconds")

# results blob is directly inspectable
print(ingest_json_results_to_blob(results[0]))
```

You can see the extracted text that represents the content of the ingested test document.

```shell
Starting ingestion..
Total time: 9.243880033493042 seconds

TestingDocument A sample document with headings and placeholder text Introduction This is a placeholder document that can be used for any purpose. It contains some headings and some placeholder text to fill the space. The text is not important and contains no real value, but it is useful for testing. Below, we will have some simple tables and charts that we can use to confirm Ingest is working as expected. Table 1 This table describes some animals, and some activities they might be doing in specific locations. Animal Activity Place Gira@e Driving a car At the beach Lion Putting on sunscreen At the park Cat Jumping onto a laptop In a home o@ice Dog Chasing a squirrel In the front yard Chart 1 This chart shows some gadgets, and some very fictitious costs.
... document extract continues ...
```
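Beyond printing the text blob, you may want a quick count of what was extracted. The following sketch tallies the elements of one document's result by type. It assumes that each element of `results[0]` is a dictionary with a `"document_type"` key; that key name is an assumption of this sketch, so check it against your own output. The helper name `count_document_types` is hypothetical.

```python
from collections import Counter


def count_document_types(document_result):
    """Count extracted elements per type for one document's result list."""
    # Assumption: each element is a dict with a "document_type" key;
    # elements without it are counted under "unknown".
    return Counter(element.get("document_type", "unknown") for element in document_result)


if __name__ == "__main__":
    # Hand-written stand-in for results[0]; not real pipeline output.
    sample = [
        {"document_type": "text", "metadata": {}},
        {"document_type": "structured", "metadata": {}},
        {"document_type": "structured", "metadata": {}},
        {"metadata": {}},
    ]
    print(dict(count_document_types(sample)))
```

With real output you would call `count_document_types(results[0])` after the ingestion step.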

Step 3: Query Ingested Content

To query for relevant snippets of the ingested content, and use them with an LLM to generate answers, use the following code.

```python
import os

from openai import OpenAI
from nv_ingest_client.util.milvus import nvingest_retrieval

milvus_uri = "milvus.db"
collection_name = "test"
sparse = False

queries = ["Which animal is responsible for the typos?"]

retrieved_docs = nvingest_retrieval(
    queries,
    collection_name,
    milvus_uri=milvus_uri,
    hybrid=sparse,
    top_k=1,
)

# simple generation example
extract = retrieved_docs[0][0]["entity"]["text"]
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_BUILD_API_KEY"]
)

prompt = f"Using the following content: {extract}\n\n Answer the user query: {queries[0]}"
print(f"Prompt: {prompt}")
completion = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-70b-instruct",
    messages=[{"role": "user", "content": prompt}],
)
response = completion.choices[0].message.content

print(f"Answer: {response}")
```

```shell
Prompt: Using the following content: TestingDocument A sample document with headings and placeholder text Introduction This is a placeholder document that can be used for any purpose. It contains some headings and some placeholder text to fill the space. The text is not important and contains no real value, but it is useful for testing. Below, we will have some simple tables and charts that we can use to confirm Ingest is working as expected. Table 1 This table describes some animals, and some activities they might be doing in specific locations. Animal Activity Place Gira@e Driving a car At the beach Lion Putting on sunscreen At the park Cat Jumping onto a laptop In a home o@ice Dog Chasing a squirrel In the front yard Chart 1 This chart shows some gadgets, and some very fictitious costs.

Answer the user query: Which animal is responsible for the typos?
Answer: A clever query!

After carefully examining the provided content, I'd like to point out the potential "typos" (assuming you're referring to the unusual or intentionally incorrect text) and attempt to playfully "assign blame" to an animal based on the context:

  1. Gira@e (instead of Giraffe) - Animal blamed: Giraffe (Table 1, first row)
    • The "@" symbol in "Gira@e" suggests a possible typo or placeholder character, which we'll humorously attribute to the Giraffe's alleged carelessness.
  2. o@ice (instead of Office) - Animal blamed: Cat
    • The same "@" symbol appears in "o@ice", which is related to the Cat's activity in the same table. Perhaps the Cat was in a hurry while typing and introduced the error?

So, according to this whimsical analysis, both the Giraffe and the Cat are "responsible" for the typos, with the Giraffe possibly being the more egregious offender given the more blatant character substitution in its name.
```
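The query example above indexes the first hit directly, which raises an error if nothing was retrieved. A slightly more defensive sketch builds the prompt from up to a few hits and returns `None` when there are none. It assumes each hit has the `["entity"]["text"]` shape used in the example. The helper name `build_prompt` and its `max_hits` parameter are hypothetical.

```python
def build_prompt(hits, query, max_hits=3):
    """Build a prompt from retrieved hits, or return None when nothing was retrieved."""
    # Each hit is expected to look like {"entity": {"text": "..."}}.
    texts = [hit["entity"]["text"] for hit in list(hits)[:max_hits]]
    if not texts:
        return None
    context = "\n\n".join(texts)
    return f"Using the following content: {context}\n\n Answer the user query: {query}"


if __name__ == "__main__":
    # Hand-written stand-in for the hits of one query; not real retrieval output.
    sample_hits = [{"entity": {"text": "Cat Jumping onto a laptop"}}]
    print(build_prompt(sample_hits, "Which animal is responsible for the typos?"))
```

In the example above you would call `build_prompt(retrieved_docs[0], queries[0])` and skip the generation step when the result is `None`.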

[!TIP] Beyond inspecting the results, you can read them into things like llama-index or langchain retrieval pipelines.

Also check out our demo on build.nvidia.com, which uses a retrieval pipeline to query document content that was pre-extracted with NVIDIA Ingest.

GitHub Repository Structure

The following is a description of the folders in the GitHub repository.

  • .devcontainer — VSCode containers for local development
  • .github — GitHub repo configuration files
  • api — Core API logic shared across python modules
  • ci — Scripts used to build the nv-ingest container and other packages
  • client — Readme, examples, and source code for the nv-ingest-cli utility
  • conda — Conda environment and packaging definitions
  • config — Various .yaml files defining configuration for OTEL, Prometheus
  • data — Sample PDFs for testing
  • deploy — Brev.dev-hosted launchable
  • docker — Scripts used by the nv-ingest docker container
  • docs — Documentation for NV Ingest
  • evaluation — Notebooks that demonstrate how to test recall accuracy
  • examples — Notebooks, scripts, and tutorial content
  • helm — Documentation for deploying nv-ingest to a Kubernetes cluster via Helm chart
  • skaffold — Skaffold configuration
  • src — Source code for the nv-ingest pipelines and service
  • tests — Unit tests for nv-ingest

Notices

Third Party License Notice:

If configured to do so, this project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use:

https://pypi.org/project/pdfservices-sdk/

  • INSTALL_ADOBE_SDK:
    • Description: If set to true, the Adobe SDK will be installed in the container at launch time. This is required if you want to use the Adobe extraction service for PDF decomposition. Please review the license agreement for the pdfservices-sdk before enabling this option.
  • DOWNLOAD_LLAMA_TOKENIZER (Built With Llama):
    • Description: The Split task uses the meta-llama/Llama-3.2-1B tokenizer, which will be downloaded from HuggingFace at build time if DOWNLOAD_LLAMA_TOKENIZER is set to True. Please review the license agreement for Llama 3.2 materials before using this. This is a gated model so you'll need to request access and set HF_ACCESS_TOKEN to your HuggingFace access token in order to use it.
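Because the tokenizer download needs a HuggingFace token, a small check can catch the case where DOWNLOAD_LLAMA_TOKENIZER is enabled but HF_ACCESS_TOKEN is missing. The following is a sketch only: the set of spellings accepted as "true" is an assumption, and the helper name `tokenizer_token_missing` is hypothetical.

```python
import os

TRUE_VALUES = {"1", "true", "yes"}  # assumption: how an enabled flag may be spelled


def tokenizer_token_missing(env):
    """Return True when the tokenizer download is enabled but no HuggingFace token is set."""
    flag = env.get("DOWNLOAD_LLAMA_TOKENIZER", "").strip().lower()
    download_enabled = flag in TRUE_VALUES
    return download_enabled and not env.get("HF_ACCESS_TOKEN")


if __name__ == "__main__":
    if tokenizer_token_missing(os.environ):
        print("DOWNLOAD_LLAMA_TOKENIZER is enabled but HF_ACCESS_TOKEN is not set")
```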

Contributing

We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.

Any contribution that contains commits that are not signed off is not accepted.

To sign off on a commit, use the --signoff (or -s) option when you commit your changes, as shown in the following example.

$ git commit --signoff --message "Add cool feature."

This appends the following text to your commit message.

Signed-off-by: Your Name <your@email.com>
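To check a commit message for the sign-off line before you push, a minimal sketch like the following looks for a line of the form shown above. The regular expression is a deliberately loose approximation of a name and email address, not the project's official validation, and the helper name `has_signoff` is hypothetical.

```python
import re

# A line of the form "Signed-off-by: Name <email>".
SIGNOFF_PATTERN = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_signoff(commit_message):
    """Return True if the commit message contains a Signed-off-by line."""
    return SIGNOFF_PATTERN.search(commit_message) is not None


if __name__ == "__main__":
    message = "Add cool feature.\n\nSigned-off-by: Your Name <your@email.com>"
    print(has_signoff(message))
```

You can obtain the message of the most recent commit with `git log -1 --format=%B` and pass it to the helper.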

Developer Certificate of Origin (DCO)

The following is the full text of the Developer Certificate of Origin (DCO).

```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
```

```
Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or

(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or

(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.

(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
```

Owner

  • Name: NVIDIA Corporation
  • Login: NVIDIA
  • Kind: organization
  • Location: 2788 San Tomas Expressway, Santa Clara, CA, 95051

Citation (CITATION.md)

# Citation Guide

## To Cite NVIDIA Ingest
If you use NVIDIA Ingest in a publication, please use citations in the following format (BibTeX entry for LaTeX):
```tex
@Manual{,
  title = {NVIDIA Ingest: An accelerated pipeline for document ingestion},
  author = {NVIDIA Ingest Development Team},
  year = {2024},
  url = {https://github.com/NVIDIA/nv-ingest},
}
```


## Sample Citations:

Using [RAPIDS](https://rapids.ai/) citations for reference.

### Bringing UMAP Closer to the Speed of Light <br> with GPU Acceleration
```tex
@misc{
      nolet2020bringing,
      title={Bringing UMAP Closer to the Speed of Light with GPU Acceleration},
      author={Corey J. Nolet and Victor Lafargue and Edward Raff and Thejaswi Nanditale and Tim Oates and John Zedlewski and Joshua Patterson},
      year={2020},
      eprint={2008.00325},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

### Machine Learning in Python: <br> Main developments and technology trends in data science, machine learning, and artificial intelligence
```tex
@article{
  raschka2020machine,
  title={Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence},
  author={Raschka, Sebastian and Patterson, Joshua and Nolet, Corey},
  journal={Information},
  volume={11},
  number={4},
  pages={193},
  year={2020},
  publisher={Multidisciplinary Digital Publishing Institute}
}
```

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 475
  • Total Committers: 25
  • Avg Commits per committer: 19.0
  • Development Distribution Score (DDS): 0.768
Past Year
  • Commits: 475
  • Committers: 25
  • Avg Commits per committer: 19.0
  • Development Distribution Score (DDS): 0.768
Top Committers
Name Email Commits
Jeremy Dyer j****4@g****m 110
Edward Kim 1****v 102
Devin Robison d****0 84
Julio Perez 3****9 43
nkmcalli n****r@n****m 41
ChrisJar c****0@g****m 21
Randy Gelhausen r****n@n****m 15
Chris Jarrett c****t@i****m 11
sosahi s****i@n****m 11
Sam Oluwalana s****a@n****m 5
mpenn m****n@n****m 4
dependabot[bot] 4****] 4
Zenobia "Z" Redeaux 1****7 4
Guilherme Pombo c****o@g****m 4
Ben Jarmak 1****v 4
David Gardner 9****v 2
Jeffrey Carpenter j****r@d****m 2
Filippo Broggini f****2 1
Ikko Eltociear Ashimine e****r@g****m 1
Lior 4****t 1
faywang123 5****3 1
lihoang6 l****g@n****m 1
mara004 g****l@g****m 1
reliseinv r****i@n****m 1
Deborah Shekinah Jacob d****3@g****m 1
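The summary figures above can be reproduced from the per-committer counts in the table. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is an assumption here, although it does reproduce the 0.768 shown. The function name `development_distribution_score` is hypothetical.

```python
# Commit counts copied from the Top Committers table, in table order.
COMMIT_COUNTS = [
    110, 102, 84, 43, 41, 21, 15, 11, 11, 5,
    4, 4, 4, 4, 4, 2, 2, 1, 1, 1,
    1, 1, 1, 1, 1,
]


def development_distribution_score(commit_counts):
    """Return 1 - (commits by the top committer / total commits)."""
    total = sum(commit_counts)
    if total == 0:
        raise ValueError("no commits to score")
    return 1 - max(commit_counts) / total


if __name__ == "__main__":
    total = sum(COMMIT_COUNTS)
    committers = len(COMMIT_COUNTS)
    print(f"Total commits: {total}")
    print(f"Avg commits per committer: {total / committers:.1f}")
    print(f"DDS: {development_distribution_score(COMMIT_COUNTS):.3f}")
```

A score near 0 would mean that one committer wrote almost everything; a score near 1 means the work is spread across many committers.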

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 163
  • Total pull requests: 1,360
  • Average time to close issues: 13 days
  • Average time to close pull requests: 4 days
  • Total issue authors: 33
  • Total pull request authors: 42
  • Average comments per issue: 0.23
  • Average comments per pull request: 0.64
  • Merged pull requests: 1,011
  • Bot issues: 0
  • Bot pull requests: 11
Past Year
  • Issues: 150
  • Pull requests: 1,325
  • Average time to close issues: 13 days
  • Average time to close pull requests: 4 days
  • Issue authors: 33
  • Pull request authors: 41
  • Average comments per issue: 0.22
  • Average comments per pull request: 0.65
  • Merged pull requests: 991
  • Bot issues: 0
  • Bot pull requests: 7
Top Authors
Issue Authors
  • sosahi (41)
  • drobison00 (35)
  • jdye64 (17)
  • randerzander (16)
  • edknv (7)
  • jperez999 (4)
  • ChrisJar (4)
  • jarmak-nv (4)
  • abeltre1 (4)
  • nkmcalli (3)
  • FengJi2021 (3)
  • dagardner-nv (2)
  • nikhildigde (2)
  • SURFLOU (2)
  • duongkstn (1)
Pull Request Authors
  • jdye64 (307)
  • edknv (283)
  • drobison00 (216)
  • nkmcalli (128)
  • jperez999 (119)
  • ChrisJar (85)
  • randerzander (44)
  • sosahi (35)
  • VibhuJawa (18)
  • jioffe502 (14)
  • zredeaux07 (12)
  • jarmak-nv (11)
  • dependabot[bot] (11)
  • guilherme-pombo (8)
  • soluwalana (7)
Top Labels
Issue Labels
feature request (60), doc (54), bug (39), epic (2), good first issue (1), 24.10 (1), 24.12 (1)
Pull Request Labels
doc (165), bug (63), feature request (57), dependencies (11), ok-to-test (9), help wanted (4), 25.07 (4), epic (2), 24.12 (2), VDR (2), 25.08 (25.07) (2), 25.09 (25.08 25.07) (1)

Packages

  • Total packages: 3
  • Total downloads:
    • pypi 31,950 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 633
  • Total maintainers: 2
pypi.org: nv-ingest

Python module for multimodal document ingestion

  • Documentation: https://nv-ingest.readthedocs.io/
  • License: Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
  • Latest release: 25.6.2
    published 10 months ago
  • Versions: 94
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 4,374 Last month
Rankings
Dependent packages count: 9.1%
Average: 30.2%
Dependent repos count: 51.3%
Maintainers (1)
Last synced: 7 months ago
pypi.org: nv-ingest-api

Python module with core document ingestion functions.

  • Documentation: https://nv-ingest-api.readthedocs.io/
  • License: Apache License Version 2.0, January 2004 http://www.apache.org/licenses/
  • Latest release: 25.6.2
    published 10 months ago
  • Versions: 274
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 13,009 Last month
Rankings
Dependent packages count: 9.6%
Average: 31.9%
Dependent repos count: 54.3%
Maintainers (1)
Last synced: 7 months ago
pypi.org: nv-ingest-client

Python client for the nv-ingest service

  • Documentation: https://nv-ingest-client.readthedocs.io/
  • License: Apache License Version 2.0, January 2004 http://www.apache.org/licenses/
  • Latest release: 25.6.3
    published 9 months ago
  • Versions: 265
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 14,567 Last month
Rankings
Dependent packages count: 9.7%
Average: 32.0%
Dependent repos count: 54.4%
Maintainers (2)
Last synced: 7 months ago

Dependencies

Dockerfile docker
  • $BASE_IMG $BASE_IMG_TAG build
  • base latest build
client/requirements.txt pypi
  • charset-normalizer *
  • pydantic *
  • python-magic *
  • redis *
  • setuptools *
client/setup.py pypi
pyproject.toml pypi
requirements.txt pypi
  • aiohttp ==3.9.4
  • charset-normalizer *
  • click *
  • dataclasses *
  • farm-haystack *
  • fastparquet ==2024.2.0
  • fsspec *
  • minio *
  • more_itertools *
  • nltk ==3.9.1
  • olefile ==0.47
  • onnx ==1.16.0
  • openai ==1.40.6
  • opencv-python ==4.10.0.84
  • opentelemetry-api *
  • opentelemetry-exporter-otlp *
  • opentelemetry-sdk *
  • pandas *
  • pydantic ==1.10.14
  • pyinstrument *
  • pypdfium2 *
  • python-docx *
  • python-pptx ==0.6.23
  • redis *
  • setuptools ==70.0.0
  • tabulate *
  • torchvision ==0.18.0
  • unstructured-client ==0.23.3
setup.py pypi
test-requirements.txt pypi
  • autoflake ==2.3.1 test
  • black ==23.11.0 test
  • flake8 ==7.0.0 test
  • isort ==5.13.2 test
  • pre-commit ==3.5.0 test
  • pytest ==7.4.3 test
  • pytest-cov ==4.1.0 test
  • pytest-mock * test
  • pytest-mock ==3.14.0 test
  • yapf ==0.40.2 test
util-requirements.txt pypi
  • click *
  • tqdm *
.github/workflows/docker-build.yml actions
  • actions/checkout v4 composite
  • actions/upload-artifact v4 composite
  • docker/setup-buildx-action v3 composite
.github/workflows/docker-nightly-publish.yml actions
  • actions/checkout v4 composite
  • docker/setup-buildx-action v3 composite
extra-requirements.txt pypi
  • nv_ingest_client *