Recent Releases of tarsier

tarsier - v0.6.0 - Microsoft OCR Support

Highlights 🔥

  • Added support for the Azure OCR service; previously, the only provider was AWS
  • Improved positioning of text chunks and fonts

What's Changed 👀

  • 🍌 Check rectangle by @asim-shrestha in https://github.com/reworkd/tarsier/pull/19
  • 🍌 Absolutely position some tags by @asim-shrestha in https://github.com/reworkd/tarsier/pull/24
  • 📸 Snapshot by @asim-shrestha in https://github.com/reworkd/tarsier/pull/40
  • 🍌 Consistent font sizes by @asim-shrestha in https://github.com/reworkd/tarsier/pull/42
  • 🍌 Spaces instead of tabs by @asim-shrestha in https://github.com/reworkd/tarsier/pull/43
  • ✨ Fix absolute positioning to be left of element instead of on top by @asim-shrestha in https://github.com/reworkd/tarsier/pull/45
  • ✨ Group text chunks to fix paragraphs/sentence spacing by @asim-shrestha in https://github.com/reworkd/tarsier/pull/46
  • 🚫 ignore all descendants of interactable elements by @asim-shrestha in https://github.com/reworkd/tarsier/pull/50
  • 🔎 Add support for MS Azure Vision OCR by @ml5ah in https://github.com/reworkd/tarsier/pull/85

New Contributors ❤️

  • @hargup made their first contribution in https://github.com/reworkd/tarsier/pull/33
  • @ml5ah made their first contribution in https://github.com/reworkd/tarsier/pull/85

Full Changelog: https://github.com/reworkd/tarsier/compare/v0.5.0...v0.6.0

Published by awtkns over 1 year ago

tarsier - v0.5.0 - Multiple Tag Types

What's Changed

  • Tag interfering with Xpath fix by @KhoomeiK in https://github.com/reworkd/tarsier/pull/14
  • Bump mypy from 1.7.0 to 1.7.1 by @dependabot in https://github.com/reworkd/tarsier/pull/13
  • fixed leaf text tagging by @KhoomeiK in https://github.com/reworkd/tarsier/pull/16
  • Tagging improvements by @KhoomeiK in https://github.com/reworkd/tarsier/pull/18

New Contributors

  • @KhoomeiK made their first contribution in https://github.com/reworkd/tarsier/pull/14
  • @dependabot made their first contribution in https://github.com/reworkd/tarsier/pull/13

Full Changelog: https://github.com/reworkd/tarsier/compare/v0.4.0...v0.5.0

Published by awtkns about 2 years ago

tarsier - v0.4.0 - Improved Tagging

🎉 What's Changed

  • ✍️ Fix readme citation link by @Krupskis in https://github.com/reworkd/tarsier/pull/3
  • ✍️ Fix Citation Repository URL in Readme by @debanjum in https://github.com/reworkd/tarsier/pull/4
  • 🚀 Remove Annotations and Tag All text elements (optionally) by @awtkns in https://github.com/reworkd/tarsier/pull/8
  • 🆑 Make spans have red background with white text by @awtkns in https://github.com/reworkd/tarsier/pull/9

👀 New Contributors

  • @Krupskis made their first contribution in https://github.com/reworkd/tarsier/pull/3
  • @debanjum made their first contribution in https://github.com/reworkd/tarsier/pull/4
  • @awtkns made their first contribution in https://github.com/reworkd/tarsier/pull/8

Full Changelog: https://github.com/reworkd/tarsier/compare/v0.3.1...v0.4.0

Published by awtkns over 2 years ago

tarsier - v0.3.1 - Initial Release

🙈 Vision utilities for web interaction agents 🙈

🔗 Main site   •   🐦 Twitter   •   📢 Discord

Announcing Tarsier

If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:

  • How do you map LLM responses back into web elements?
  • How can you mark up a page so an LLM can better understand its action space?
  • How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects. Because of this, we're now open-sourcing this simple utility library for multimodal web agents... Tarsier! The video below demonstrates Tarsier in action, feeding a page snapshot into a LangChain agent and letting it take actions.

https://github.com/reworkd/tarsier/assets/50181239/af12beda-89b5-4add-b888-d780b353304b

How does it work?

Tarsier works by visually "tagging" interactable elements on a page with bracketed IDs such as [1]. In doing so, we provide a mapping between elements and IDs for GPT-4(V) to take actions upon. We define interactable elements as buttons, links, or input fields that are visible on the page.
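To illustrate how an agent might use this mapping, here is a minimal sketch of resolving a tag ID mentioned in an LLM response back to an element's XPath. The `resolve_tag` helper and the example mapping are hypothetical, written for illustration; they are not part of Tarsier's API:

```python
import re

def resolve_tag(llm_response: str, tag_to_xpath: dict) -> str:
    """Extract the first bracketed tag ID (e.g. '[1]') from an LLM
    response and look up the corresponding element XPath."""
    match = re.search(r"\[(\d+)\]", llm_response)
    if match is None:
        raise ValueError("no [id] tag found in LLM response")
    return tag_to_xpath[int(match.group(1))]

# Hypothetical mapping, shaped like what page_to_text might return:
mapping = {1: "//a[@id='login']", 2: "//input[@name='q']"}
print(resolve_tag("I will CLICK [1] to sign in.", mapping))  # //a[@id='login']
```

The resolved XPath can then be handed to the browser driver (e.g. Playwright) to perform the click.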

Tarsier can also provide a purely textual representation of the page, which means it enables deeper interaction even for non-multimodal LLMs. This is important to note given the performance issues of existing vision-language models. To this end, Tarsier provides OCR utilities that convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
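As a rough illustration of the idea (not Tarsier's actual implementation), OCR word boxes can be placed on a character grid so that page layout survives as whitespace in the resulting string. The `words_to_text` helper and its scaling constants below are hypothetical:

```python
def words_to_text(words, px_per_char=10, px_per_line=20):
    """Lay out OCR'd (x, y, text) word boxes on a character grid,
    preserving rough page layout as whitespace."""
    rows = {}
    for x, y, text in words:
        # Bucket words into rows by y, and into columns by x.
        rows.setdefault(y // px_per_line, []).append((x // px_per_char, text))
    out = []
    for row in sorted(rows):
        line = ""
        for col, text in sorted(rows[row]):
            # Pad with spaces up to the word's column (at least one space).
            line += " " * max(col - len(line), 1 if line else 0)
            line += text
        out.append(line)
    return "\n".join(out)

print(words_to_text([(0, 0, "Login"), (300, 0, "Signup"), (0, 40, "Search")]))
```

Words that sat side by side on the page end up separated by runs of spaces on the same line, so the LLM can infer columns, menus, and tables from the whitespace alone.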

Usage

Visit our cookbook for agent examples using Tarsier:

  • An autonomous LangChain web agent 🦜⛓️
  • An autonomous LlamaIndex web agent 🦙

Otherwise, basic Tarsier usage might look like the following:

```python
import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService


async def main():
    google_cloud_credentials = {}

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
```

Supported OCR Services

Special shoutout to @KhoomeiK for making this happen! ❤️

Published by awtkns over 2 years ago