Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: idobenshaul10
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 84.1 MB
Statistics
  • Stars: 7
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Citation

README.md

SoftPandas - Pandas with Semantic Querying

https://github.com/idobenshaul10/SoftPandas/assets/41121256/82c467b7-701a-4cfd-9277-df7b63a66330

Description:

SoftPandas is an early-stage package for working with pandas DataFrames and querying them by semantic similarity. This lets you write soft conditions (e.g. all products that are similar to "red and black swim shorts"). The current version supports text and image data types; if an image link is present, the image is downloaded and embedded using OpenClip.

Currently supports:

  1. Language Encoder Model: any model using SentenceTransformer
  2. MultiModal Encoder Model: any model using OpenClip

At the moment, queries can only be expressed as text.
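To make the idea of a soft condition concrete, here is a minimal sketch of thresholded cosine similarity, the general technique the encoder settings in this README (`metric`, `threshold`) suggest. It is not SoftPandas' actual implementation: the helper names `cosine_similarity` and `soft_match` and the toy 3-dimensional vectors are assumptions made up for illustration, and a real encoder would produce the embeddings from text or images.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def soft_match(query_vec, row_vecs, threshold):
    """Return indices of rows whose similarity to the query meets the threshold."""
    return [
        i for i, vec in enumerate(row_vecs)
        if cosine_similarity(query_vec, vec) >= threshold
    ]

# Hypothetical embeddings, one per DataFrame row (made-up numbers).
rows = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query = [1.0, 0.0, 0.0]  # hypothetical embedding of the text query

matches = soft_match(query, rows, threshold=0.8)
print(matches)
```

Rows 0 and 1 point in nearly the same direction as the query and pass the threshold; row 2 is orthogonal to it and is filtered out.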

This project is a work in progress! If you find any issues, please report them.

Installation:

Requires Python 3.10 or later. Install the latest version from the GitHub repository:

pip install git+https://github.com/idobenshaul10/SoftPandas.git

Then install the requirements:

pip install -r requirements.txt

Example Usage:

Let's say we want to get all red and black swim shorts that cost less than $600. We can load example data from a CSV file and then query it using SoftPandas.

The full script is in demo.ipynb.

Imports and encoder setup:

```python
# Imports
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

from softpandas.core.data_types import InputDataType
from softpandas.core.soft_dataframe import SoftDataFrame
from softpandas.embedders.clip_embedder import OpenClipEmbedder
from softpandas.embedders.sentence_transformer_embedder import SentenceTransformerEmbedder

# Text encoder (any SentenceTransformer model)
lang_model = SentenceTransformerEmbedder('thenlper/gte-small', metric=cosine_similarity,
                                         threshold=0.8, device="cpu")

# Image encoder (any OpenClip model)
vision_model = OpenClipEmbedder('ViT-B-32-256', metric=cosine_similarity,
                                threshold=0.22, pretrained="datacomp_s34b_b86k")
```

Then let's query using soft + hard queries:

```python
df = pd.read_csv("sample_data/men-swimwear.csv")
df = SoftDataFrame(
    df,
    soft_columns={'NAME': InputDataType.text,
                  'DESCRIPTION & COLOR': InputDataType.text,
                  'IMAGE': InputDataType.image},
    models={InputDataType.text: lang_model, InputDataType.image: vision_model},
)

# Soft (semantic) conditions, one predicate per call
df = df.soft_query("'DESCRIPTION & COLOR' ~= 'swim shorts'")
df = df.soft_query("'IMAGE' ~= 'red and black'")
# Hard condition using regular pandas querying
df = df.query("PRICE < 600")
print(df.head()['DESCRIPTION & COLOR'].values)
```
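Chaining a soft query and a hard query as above amounts to keeping only the rows that satisfy both conditions. The sketch below illustrates that with plain boolean masks; it is an assumption about the general idea, not the library's internals. The rows, the precomputed `similarity` scores, and the helpers `soft_mask` and `hard_mask` are all hypothetical.

```python
# Toy rows: a price plus a precomputed similarity to the text query (made-up numbers).
rows = [
    {"name": "A", "price": 450, "similarity": 0.91},
    {"name": "B", "price": 700, "similarity": 0.88},
    {"name": "C", "price": 300, "similarity": 0.40},
]

def soft_mask(rows, threshold):
    """True where the row is semantically close enough to the query."""
    return [row["similarity"] >= threshold for row in rows]

def hard_mask(rows, max_price):
    """True where the row satisfies the exact (hard) price condition."""
    return [row["price"] < max_price for row in rows]

# A row is kept only if it passes both the soft and the hard condition.
combined = [s and h for s, h in zip(soft_mask(rows, 0.8), hard_mask(rows, 600))]
selected = [row["name"] for row, keep in zip(rows, combined) if keep]
print(selected)
```

Row B is similar enough but too expensive, and row C is cheap enough but not similar, so only row A remains.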

Saving and loading:

df.to_pickle("relevant_items.p")
df = pd.read_pickle("relevant_items.p")

TODOs:

  1. ~~Add saving methods for SoftDataFrame~~
  2. ~~Method for adding new columns~~
  3. Handle NaNs
    • ~~if a column is NaN, just ignore it~~
    • If a value is missing, it shouldn't pass the condition - similar to normal querying
  4. Handle multiple predicates in one query - at the moment, a query with more than one predicate crashes.
  5. Use indices instead of cosine similarity - it's too slow
  6. Batch the initial encoding
    • don't encode items one by one
    • ~~use device (cuda, mps, tpu, etc.)~~
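As an illustration of TODO items 3 and 6, the sketch below encodes a column in batches and leaves missing values unencoded so they can never pass a soft condition. It is one possible approach under stated assumptions, not the planned implementation: `batched`, `encode_column`, and `fake_encode_batch` are hypothetical names, missing values are represented as `None` rather than pandas NaN, and the fake encoder simply returns the text length as a one-element vector.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def encode_column(values, encode_batch, batch_size=32):
    """Encode non-missing values in batches; missing values stay None."""
    present = [(i, v) for i, v in enumerate(values) if v is not None]
    embeddings = [None] * len(values)
    for chunk in batched(present, batch_size):
        vectors = encode_batch([value for _, value in chunk])
        for (i, _), vec in zip(chunk, vectors):
            embeddings[i] = vec
    return embeddings

# Stand-in encoder (hypothetical): records batch sizes and returns a
# one-element "embedding" holding the text length.
batch_sizes = []

def fake_encode_batch(texts):
    batch_sizes.append(len(texts))
    return [[float(len(text))] for text in texts]

values = ["red shorts", None, "blue trunks", "black shorts", None]
embeddings = encode_column(values, fake_encode_batch, batch_size=2)
print(embeddings)
print(batch_sizes)
```

With three non-missing values and a batch size of 2, the encoder is called twice instead of three times, and the two missing rows keep `None` as their embedding.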

Long Term Goals:

  1. Add automatic feature extraction from images into new columns
    • allows hard querying using visual data!
  2. Add ability to soft query based on Image
  3. Expand to more modalities

Citation (CITATION.cff)

cff-version: 1.2.0
preferred-citation:
  type: software
  message: If you use SoftPandas, please cite it as below.
  authors:
  - family-names: Ben-Shaul
    given-names: Ido
    orcid: "https://orcid.org/0000-0002-3954-035X"
  title: "SoftPandas"
  version: 0.01
  doi: 10.5281/zenodo.3908559
  date-released: 2024-02-03
  license: Apache-2.0
  url: "https://github.com/idobenshaul10/SoftPandas?tab=readme-ov-file"
