eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found)
✓ .zenodo.json file (found)
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (16.1%) to scientific vocabulary

Keywords

big-data data-analysis dataframe dataframes eland elasticsearch etl lightgbm machine-learning pandas python scikit-learn time-series-forecasting

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: elastic
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://eland.readthedocs.io
Size: 20.9 MB

Statistics

Stars: 685
Watchers: 207
Forks: 111
Open Issues: 83
Releases: 37

Topics

big-data data-analysis dataframe dataframes eland elasticsearch etl lightgbm machine-learning pandas python scikit-learn time-series-forecasting

Created over 6 years ago · Last pushed 6 months ago

Metadata Files

Readme Changelog Contributing License

About

Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar Pandas-compatible API.

Where possible, the package uses existing Python APIs and data structures to make it easy to switch from numpy, pandas, or scikit-learn to their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows Eland to access large datasets stored in Elasticsearch.

Eland also provides tools to upload trained machine learning models from common libraries like scikit-learn, XGBoost, and LightGBM into Elasticsearch.

Getting Started

Eland can be installed from PyPI with Pip:

```bash
$ python -m pip install eland
```

If you are using Eland to upload NLP models to Elasticsearch, install the PyTorch extras:

```bash
$ python -m pip install 'eland[pytorch]'
```

Eland can also be installed from Conda Forge with Conda:

```bash
$ conda install -c conda-forge eland
```

Compatibility

Supports Python 3.9, 3.10, 3.11 and 3.12.
Supports Pandas 1.5 and 2.
Supports Elasticsearch 8+ clusters; 8.16 or later is recommended for all features to work. If you are using the NLP with PyTorch feature, make sure your Eland minor version matches the minor version of your Elasticsearch cluster. For all other features, it is sufficient for the major versions to match.
You need to install the appropriate version of PyTorch to import an NLP model. Run python -m pip install 'eland[pytorch]' to install that version.
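The version-matching rule above can be sketched as a small check. This is only an illustration: the function names `parse_version` and `is_compatible` are hypothetical and not part of Eland, and "minor versions match" is assumed to mean that both the major and minor components are equal.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '8.16.1'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def is_compatible(eland_version: str, es_version: str, uses_nlp: bool) -> bool:
    """Apply the stated rule: NLP with PyTorch needs matching minor
    versions; every other feature only needs matching major versions."""
    eland_major, eland_minor = parse_version(eland_version)
    es_major, es_minor = parse_version(es_version)
    if eland_major != es_major:
        return False
    if uses_nlp:
        return eland_minor == es_minor
    return True


print(is_compatible("8.15.0", "8.16.3", uses_nlp=False))
print(is_compatible("8.15.0", "8.16.3", uses_nlp=True))
```

In practice the two version strings would likely come from `eland.__version__` and from `es.info()["version"]["number"]` on a connected client.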

Prerequisites

Users installing Eland on Debian-based distributions may need to install prerequisite packages for the transitive dependencies of Eland:

```bash
$ sudo apt-get install -y \
  build-essential pkg-config cmake \
  python3-dev libzip-dev libjpeg-dev
```

Note that other distributions such as CentOS, RedHat, Arch, etc. may require using a different package manager and specifying different package names.

Docker

If you want to use Eland without installing it just to run the available scripts, use the Docker image. It can be used interactively:

```bash
$ docker run -it --rm --network host docker.elastic.co/eland/eland
```

Running installed scripts is also possible without an interactive shell, e.g.:

```bash
$ docker run -it --rm --network host \
    docker.elastic.co/eland/eland \
    eland_import_hub_model \
      --url http://host.docker.internal:9200/ \
      --hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \
      --task-type ner
```

Connecting to Elasticsearch

Eland uses the Elasticsearch low-level client to connect to Elasticsearch. This client supports a range of connection and authentication options.

You can pass either an instance of elasticsearch.Elasticsearch to Eland APIs or a string containing the host to connect to:

```python
import eland as ed

# Connecting to an Elasticsearch instance running on 'http://localhost:9200'
df = ed.DataFrame("http://localhost:9200", es_index_pattern="flights")

# Connecting to an Elastic Cloud instance
from elasticsearch import Elasticsearch

es = Elasticsearch(
    cloud_id="cluster-name:...",
    basic_auth=("elastic", "<password>")
)
df = ed.DataFrame(es, es_index_pattern="flights")
```

DataFrames in Eland

eland.DataFrame wraps an Elasticsearch index in a Pandas-like API and defers all processing and filtering of data to Elasticsearch instead of your local machine. This means you can process large amounts of data within Elasticsearch from a Jupyter Notebook without overloading your machine.

➤ Eland DataFrame API documentation

➤ Advanced examples in a Jupyter Notebook

```python
>>> import eland as ed

# Connect to 'flights' index via localhost Elasticsearch node
>>> df = ed.DataFrame('http://localhost:9200', 'flights')

# eland.DataFrame instance has the same API as pandas.DataFrame
# except all data is in Elasticsearch. See .info() memory usage.
>>> df.head()
   AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp
0      841.265642      False  ...         0 2018-01-01 00:00:00
1      882.982662      False  ...         0 2018-01-01 18:27:00
2      190.636904      False  ...         0 2018-01-01 17:11:14
3      181.694216       True  ...         0 2018-01-01 10:33:28
4      730.041778      False  ...         0 2018-01-01 05:13:00

[5 rows x 27 columns]

>>> df.info()
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      13059 non-null  float64
 1   Cancelled           13059 non-null  bool
 2   Carrier             13059 non-null  object
...
 24  OriginWeather       13059 non-null  object
 25  dayOfWeek           13059 non-null  int64
 26  timestamp           13059 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 80.0 bytes
Elasticsearch storage usage: 5.043 MB

# Filtering of rows using comparisons
>>> df[(df.Carrier=="Kibana Airlines") & (df.AvgTicketPrice > 900.0) & (df.Cancelled == True)].head()
     AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp
8        960.869736       True  ...         0 2018-01-01 12:09:35
26       975.812632       True  ...         0 2018-01-01 15:38:32
311      946.358410       True  ...         0 2018-01-01 11:51:12
651      975.383864       True  ...         2 2018-01-03 21:13:17
950      907.836523       True  ...         2 2018-01-03 05:14:51

[5 rows x 27 columns]

# Running aggregations across an index
>>> df[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
     DistanceKilometers  AvgTicketPrice
sum        9.261629e+07    8.204365e+06
min        0.000000e+00    1.000205e+02
std        4.578263e+03    2.663867e+02
```
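To make "defers all processing and filtering of data to Elasticsearch" more concrete, the sketch below builds Elasticsearch Query DSL bodies that correspond roughly to the filter and the aggregation shown above. It is an illustration only: the helper names (`term`, `greater_than`, `all_of`) and the aggregation keys are hypothetical, and the requests Eland actually generates may differ.

```python
import json


def term(field, value):
    """Exact-match clause, e.g. df.Carrier == "Kibana Airlines"."""
    return {"term": {field: value}}


def greater_than(field, value):
    """Range clause, e.g. df.AvgTicketPrice > 900.0."""
    return {"range": {field: {"gt": value}}}


def all_of(*clauses):
    """Combine clauses, as `&` combines boolean masks in pandas."""
    return {"bool": {"must": list(clauses)}}


# Rough equivalent of the three-way filter in the example above.
filter_query = all_of(
    term("Carrier", "Kibana Airlines"),
    greater_than("AvgTicketPrice", 900.0),
    term("Cancelled", True),
)

# Rough equivalent of aggregate(['sum', 'min', 'std']) for one column.
# The standard deviation is part of the extended_stats response.
aggregations = {
    "price_sum": {"sum": {"field": "AvgTicketPrice"}},
    "price_min": {"min": {"field": "AvgTicketPrice"}},
    "price_stats": {"extended_stats": {"field": "AvgTicketPrice"}},
}

print(json.dumps({"query": filter_query}, indent=2))
print(json.dumps({"size": 0, "aggs": aggregations}, indent=2))
```

Because only these small request bodies and the resulting rows or aggregate values cross the network, the local process does not need to hold the index in memory.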

Machine Learning in Eland

Regression and classification

Eland can serialize trained regression and classification models from the scikit-learn, XGBoost, and LightGBM libraries so that they can be used as inference models in Elasticsearch.

➤ Eland Machine Learning API documentation

```python
>>> from sklearn import datasets
>>> from xgboost import XGBClassifier
>>> from eland.ml import MLModel

# Train and exercise an XGBoost ML model locally
>>> training_data = datasets.make_classification(n_features=5)
>>> xgb_model = XGBClassifier(booster="gbtree")
>>> xgb_model.fit(training_data[0], training_data[1])

>>> xgb_model.predict(training_data[0])
[0 1 1 0 1 0 0 0 1 0]

# Import the model into Elasticsearch
>>> es_model = MLModel.import_model(
...     es_client="http://localhost:9200",
...     model_id="xgb-classifier",
...     model=xgb_model,
...     feature_names=["f0", "f1", "f2", "f3", "f4"],
... )

# Exercise the ML model in Elasticsearch with the training data
>>> es_model.predict(training_data[0])
[0 1 1 0 1 0 0 0 1 0]
```

NLP with PyTorch

For NLP tasks, Eland allows importing PyTorch-trained BERT models into Elasticsearch. Models can be either plain PyTorch models or supported transformers models from the Hugging Face model hub.

```bash
$ eland_import_hub_model \
    --url http://localhost:9200/ \
    --hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \
    --task-type ner \
    --start
```

The example above will automatically start a model deployment. This is a good shortcut for initial experimentation, but for anything that needs good throughput you should omit the --start argument from the Eland command line and instead start the model using the ML UI in Kibana. The --start argument will deploy the model with one allocation and one thread per allocation, which will not offer good performance. When starting the model deployment using the ML UI in Kibana or the Elasticsearch API you will be able to set the threading options to make the best use of your hardware.
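As a rough sketch of starting the deployment through the Elasticsearch API instead of `--start`, the helper below calls the `ml.start_trained_model_deployment` method of the 8.x Python client with explicit threading options. The wrapper name `start_deployment` and the example values are assumptions; suitable allocation and thread counts depend on your hardware.

```python
def start_deployment(es, model_id, allocations, threads_per_allocation):
    """Start a trained model deployment with explicit threading options.

    `es` is expected to be an elasticsearch.Elasticsearch client (8.x).
    """
    return es.ml.start_trained_model_deployment(
        model_id=model_id,
        number_of_allocations=allocations,
        threads_per_allocation=threads_per_allocation,
    )


# Example usage against a running cluster (the values are illustrative):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# start_deployment(es, "<model-id>", allocations=2, threads_per_allocation=4)
```

Here `<model-id>` stands for the ID under which the model was imported into Elasticsearch.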

```python
>>> import elasticsearch
>>> from pathlib import Path
>>> from eland.common import es_version
>>> from eland.ml.pytorch import PyTorchModel
>>> from eland.ml.pytorch.transformers import TransformerModel

>>> es = elasticsearch.Elasticsearch("http://elastic:mlqaadmin@localhost:9200")
>>> es_cluster_version = es_version(es)

# Load a Hugging Face transformers model directly from the model hub
>>> tm = TransformerModel(
...     model_id="elastic/distilbert-base-cased-finetuned-conll03-english",
...     task_type="ner",
...     es_version=es_cluster_version,
... )
Downloading: 100%|██████████| 257/257 [00:00<00:00, 108kB/s]
Downloading: 100%|██████████| 954/954 [00:00<00:00, 372kB/s]
Downloading: 100%|██████████| 208k/208k [00:00<00:00, 668kB/s]
Downloading: 100%|██████████| 112/112 [00:00<00:00, 43.9kB/s]
Downloading: 100%|██████████| 249M/249M [00:23<00:00, 11.2MB/s]

# Export the model in a TorchScript representation which Elasticsearch uses
>>> tmp_path = "models"
>>> Path(tmp_path).mkdir(parents=True, exist_ok=True)
>>> model_path, config, vocab_path = tm.save(tmp_path)

# Import model into Elasticsearch
>>> ptm = PyTorchModel(es, tm.elasticsearch_model_id())
>>> ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)
100%|██████████| 63/63 [00:12<00:00,  5.02it/s]
```

Owner

Name: elastic
Login: elastic
Kind: organization
Email: info@elastic.co

Website: https://www.elastic.co/
Repositories: 844
Profile: https://github.com/elastic

GitHub Events

Total

Create event: 27
Release event: 7
Issues event: 21
Watch event: 39
Delete event: 22
Issue comment event: 69
Public event: 1
Push event: 61
Pull request review comment event: 42
Pull request review event: 73
Pull request event: 111
Fork event: 12

Last Year

Create event: 27
Release event: 7
Issues event: 21
Watch event: 39
Delete event: 22
Issue comment event: 69
Public event: 1
Push event: 61
Pull request review comment event: 42
Pull request review event: 73
Pull request event: 111
Fork event: 12

Committers

Last synced: 9 months ago

All Time

Total Commits: 545
Total Committers: 43
Avg Commits per committer: 12.674
Development Distribution Score (DDS): 0.774

Past Year

Commits: 60
Committers: 14
Avg Commits per committer: 4.286
Development Distribution Score (DDS): 0.5

Top Committers

Name	Email	Commits
Stephen Dodson	s**n@e**o	123
Seth Michael Larson	s**n@e**o	101
Quentin Pradet	q**t@e**o	61
David Kyle	d**e@e**o	42
P. Sai Vinay	3****8	39
Winterflower	c**n@e**o	30
Benjamin Trent	b**t@g**m	22
Bart Broere	m**l@b**u	15
Michael Hirsch	m**h@g**m	11
Daniel Mesejo-León	m**n@g**m	10
Aurélien FOUCRET	a**t@g**m	9
Josh Devins	j**s@e**o	8
Enrico Zimuel	e**l@g**m	8
Valeriy Khakhutskyy	1****2	7
Michael Hirsch	m**h@e**o	6
Youhei Sakurai	s**i@g**m	4
David Olaru	d**u@v**m	4
István Zoltán Szabó	s**e@g**m	4
Colleen McGinnis	c**s@g**m	3
Fernando Briano	f**o@p**t	3
Liam Thompson	3****o	3
Lisa Cawley	l**y@e**o	3
Miguel Grinberg	m**g@g**m	2
Mark J. Hoy	m**y@e**o	2
Jabin Kong	1**4@q**m	2
Iulia Feroli	i**i@g**m	2
Ed Savage	e**e@e**o	2
Dai Sugimori	d**e@g**m	2
Adam Demjen	d**d@g**m	2
Ashton Sidhu	a**u@e**o	2
and 13 more...
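The all-time averages above can be reproduced from the listed commit counts. The sketch assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is not stated on this page, but it agrees with the reported value.

```python
total_commits = 545
total_committers = 43
top_committer_commits = 123  # first row of the Top Committers table

# Average commits per committer.
avg_commits = total_commits / total_committers

# Assumed definition: DDS = 1 - (top committer's commits / total commits).
dds = 1 - top_committer_commits / total_commits

print(round(avg_commits, 3))  # 12.674
print(round(dds, 3))          # 0.774
```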

Committer Domains (Top 20 + Academic)

elastic.co: 13
pm.me: 1
qq.com: 1
picandocodigo.net: 1
vaevixen.com: 1
bartbroere.eu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 128
Total pull requests: 316
Average time to close issues: 8 months
Average time to close pull requests: 26 days
Total issue authors: 76
Total pull request authors: 41
Average comments per issue: 1.65
Average comments per pull request: 0.91
Merged pull requests: 251
Bot issues: 0
Bot pull requests: 19

Past Year

Issues: 9
Pull requests: 128
Average time to close issues: 4 months
Average time to close pull requests: 10 days
Issue authors: 7
Pull request authors: 12
Average comments per issue: 1.78
Average comments per pull request: 0.43
Merged pull requests: 99
Bot issues: 0
Bot pull requests: 17

Top Authors

Issue Authors

davidkyle (16)
sethmlarson (8)
joshdevins (6)
pquentin (5)
ppf2 (4)
jt0dd (3)
jeffvestal (3)
igaloly (3)
bartbroere (3)
walkingmug (3)
valeriy42 (2)
OneZe1023 (2)
temiwale88 (2)
tariksetia (2)
tlaraMQ (2)

Pull Request Authors

pquentin (113)
davidkyle (68)
bartbroere (26)
afoucret (18)
github-actions[bot] (17)
benwtrent (14)
sethmlarson (10)
valeriy42 (9)
colleenmcginnis (8)
markjhoy (6)
redcinelli (6)
picandocodigo (5)
V1NAY8 (4)
miguelgrinberg (4)
iuliaferoli (4)

Top Labels

Issue Labels

enhancement (35)
topic:dataframe (29)
topic:NLP (14)
topic:ml (13)
bug (12)
help wanted (5)
good first issue (5)
topic:gbdt (4)
ci (2)
topic:series (1)
documentation (1)

Pull Request Labels

topic:ml (36)
enhancement (35)
documentation (27)
topic:NLP (23)
ci (18)
bug (17)
backport 8.x (17)
refactor (5)
topic:dataframe (3)

Packages

Total packages: 3
Total downloads:
- pypi 14,609 last-month

Total dependent packages: 2
(may contain duplicates)
Total dependent repositories: 22
(may contain duplicates)
Total versions: 70
Total maintainers: 5

pypi.org: eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

Homepage: https://github.com/elastic/eland
Documentation: https://eland.readthedocs.io/
License: Apache-2.0
Latest release: 9.0.1
published 10 months ago

Versions: 44
Dependent Packages: 2
Dependent Repositories: 22
Downloads: 14,557 Last month
Docker Downloads: 0

Rankings

Dependent packages count: 2.1%

Stargazers count: 2.7%

Dependent repos count: 3.1%

Average: 3.3%

Docker downloads count: 3.4%

Downloads: 3.7%

Forks count: 4.8%

Maintainers (4)

miguelgrinberg quentinp ezimuel sjd171

Last synced: 6 months ago

pypi.org: bartbroere-eland

[Development fork!] Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

Homepage: https://github.com/elastic/eland
Documentation: https://bartbroere-eland.readthedocs.io/
License: Apache-2.0
Latest release: 8.17.1
published about 1 year ago

Versions: 13
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 52 Last month

Rankings

Stargazers count: 2.7%

Forks count: 4.8%

Dependent packages count: 7.4%

Downloads: 18.2%

Average: 20.4%

Dependent repos count: 69.1%

Maintainers (1)

bartbroere

Last synced: 6 months ago

conda-forge.org: eland

Eland is an Elasticsearch client Python package to analyse, explore and manipulate data that resides in Elasticsearch. Where possible, the package uses existing Python APIs and data structures to make it easy to switch from numpy, pandas, or scikit-learn to their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows Eland to access large datasets stored in Elasticsearch.

Homepage: https://eland.readthedocs.io
License: Apache-2.0
Latest release: 8.3.0
published over 3 years ago

Versions: 13
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Stargazers count: 16.8%

Forks count: 21.5%

Average: 30.9%

Dependent repos count: 34.0%

Dependent packages count: 51.2%

Last synced: 6 months ago

Dependencies

.buildkite/Dockerfile docker

python ${PYTHON_VERSION} build

Dockerfile docker

python 3.10-slim build

docs/requirements-docs.txt pypi

furo *
matplotlib *
nbsphinx *
nbval *
sphinx ==5.3.0

requirements-dev.txt pypi

build * development
mypy * development
nbval * development
nox * development
numpydoc >=0.9.0 development
pytest >=5.2.1 development
pytest-cov * development
pytest-mock * development
shap ==0.43.0 development
twine * development

setup.py pypi

elasticsearch >=8.3,<9
matplotlib >=3.6
numpy >=1.2.0,<2
packaging *
pandas >=1.5,<2
