Improving Representations via Similarities

Last update: Jan 08, 2023


Overview

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

Improving Representations via Similarities

The object to implement:

Embetter(multi_output=True, epochs=50, **sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=None)

Observation: especially when multi_output=True, there is an opportunity regarding NaN y-values: we can simply choose which values to translate and which to ignore.
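A minimal sketch of that observation, assuming NumPy and float-encoded labels where a missing label is NaN. The helper `valid_rows_per_output` is hypothetical and not part of embetter; it only illustrates how each output column could keep its own labelled rows and ignore the rest.

```python
import numpy as np


def valid_rows_per_output(X, y):
    """Yield (output_index, X_subset, y_subset), skipping rows whose label is NaN.

    Hypothetical helper for illustration; not an embetter function.
    """
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)
    if y.ndim == 1:
        # Treat a single-output y as one column.
        y = y.reshape(-1, 1)
    for j in range(y.shape[1]):
        keep = ~np.isnan(y[:, j])  # rows that carry a label for output j
        yield j, X[keep], y[keep, j]


X = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
y = np.array([[1.0, np.nan], [0.0, 1.0], [np.nan, 0.0]])
subsets = {j: (Xj, yj) for j, Xj, yj in valid_rows_per_output(X, y)}
```

Each output ends up with a different subset of rows, so a partially labelled row still contributes to the outputs it does have labels for.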

Comments

[WIP] Feature/progress bar
Fixes issue #20

[x] Adds progress bar to all text and image embedders.

[x] Tests for SentenceEncoder.

[ ] Use perfplot for progress bar?

[ ] Can we ensure fast NumPy vectorization while using a progress bar?
opened by CarloLepelaars 5
[BUG] `device` should be attribute on `SentenceEncoder`
The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

The scikit-learn development docs make it clear every argument should be defined as an attribute:

every keyword argument accepted by __init__ should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.
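The mechanism behind that rule can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not scikit-learn's actual code: it reads the `__init__` signature and looks up each argument as an attribute of the same name. `get_params_like_sklearn`, `BrokenEncoder` and `FixedEncoder` are hypothetical names used only for this illustration.

```python
import inspect


def get_params_like_sklearn(estimator):
    """Simplified stand-in for scikit-learn's parameter lookup (not the real code)."""
    signature = inspect.signature(type(estimator).__init__)
    names = [name for name in signature.parameters if name != "self"]
    # Each __init__ argument is read back from the instance by name.
    return {name: getattr(estimator, name) for name in names}


class BrokenEncoder:
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name  # `device` is accepted but never stored


class FixedEncoder:
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name
        self.device = device


try:
    get_params_like_sklearn(BrokenEncoder())
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

fixed_params = get_params_like_sklearn(FixedEncoder(device="cpu"))
```

Printing a Pipeline needs the parameters of every step, which is why a missing attribute surfaces as soon as the representation is requested.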

Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.

Reproduction: Python 3.8 with embetter = "^0.2.2"

from embetter.text import SentenceEncoder

se = SentenceEncoder()
repr(se)

Fix:

Add self.device on SentenceEncoder

class SentenceEncoder(EmbetterBase):
    ...
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        if not device:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device
        self.name = name
        self.tfm = SBERT(name, device=self.device)
opened by CarloLepelaars 4
Color Histograms - Additional Tricks

This approach could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

To do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/
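As a rough sketch of the idea, assuming NumPy and an image given as an (H, W, 3) uint8 array: compute one histogram per color channel, normalize it, and concatenate the results into a feature vector. The function name `color_histogram` and the default of 8 bins are assumptions for illustration, not taken from either linked article.

```python
import numpy as np


def color_histogram(image, bins=8):
    """Concatenate normalized per-channel histograms of an (H, W, 3) uint8 image."""
    image = np.asarray(image)
    features = []
    for channel in range(image.shape[-1]):
        counts, _ = np.histogram(image[..., channel], bins=bins, range=(0, 256))
        # Normalize so images of different sizes stay comparable.
        features.append(counts / counts.sum())
    return np.concatenate(features)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # synthetic image
vector = color_histogram(image, bins=8)
```

The resulting vector has `3 * bins` entries, so the number of bins directly controls the embedding size.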

opened by koaning 4
Support for word embeddings
Hi,

Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

A filename to a local embedding file (e.g., glove.6b.100d.txt)

Either a callable tokenizer or regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).

A (name of a) pooling function (e.g., "mean", "max", "sum").

The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can do the PR somewhere next week.
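The three parameters could fit together roughly as follows. This is a sketch under assumptions, not the proposed PR: `StaticWordEmbedder` and `parse_embedding_lines` are hypothetical names, only the regex option is shown (no callable tokenizer), the default pattern mirrors TfidfVectorizer's default `token_pattern`, and texts without any known token fall back to a zero vector.

```python
import re

import numpy as np

POOLERS = {"mean": np.mean, "max": np.max, "sum": np.sum}


def parse_embedding_lines(lines):
    """Parse GloVe-style text lines: a token followed by space-separated floats."""
    vectors = {}
    for line in lines:
        token, *values = line.rstrip().split(" ")
        vectors[token] = np.array(values, dtype=float)
    return vectors


class StaticWordEmbedder:
    """Hypothetical embedder: tokenize, look up vectors, pool per document."""

    def __init__(self, vectors, token_pattern=r"(?u)\b\w\w+\b", pooling="mean"):
        self.vectors = vectors
        self.token_pattern = token_pattern
        self.pooling = pooling

    def fit(self, X, y=None):
        return self  # nothing to learn, the vectors are pretrained

    def transform(self, X):
        dim = len(next(iter(self.vectors.values())))
        pool = POOLERS[self.pooling]
        rows = []
        for text in X:
            tokens = re.findall(self.token_pattern, text.lower())
            found = [self.vectors[t] for t in tokens if t in self.vectors]
            # Assumption: documents without any known token become a zero vector.
            rows.append(pool(found, axis=0) if found else np.zeros(dim))
        return np.array(rows)


# In practice the lines would come from a local file such as glove.6b.100d.txt.
lines = ["good 1.0 0.0", "bad 0.0 1.0", "movie 0.5 0.5"]
embedder = StaticWordEmbedder(parse_embedding_lines(lines), pooling="mean")
X_emb = embedder.fit(None).transform(["Good movie", "bad movie", "unknown"])
```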

Stéphan
opened by stephantul 3
[FEATURE] SpaCyEmbedder
I think it would be a nice addition to add an embedder that can easily vectorize text through SpaCy. I already have an implementation class for this and would be happy to contribute it here.

SpaCy Docs on vector: https://spacy.io/api/doc#vector

Example code for single string:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This here text")
doc.vector
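Extending the single-string example to a batch of texts might look like the sketch below. It is not the implementation mentioned in this issue: the class body is an assumption, and `_StubPipeline` is a stand-in so the sketch runs without downloading a spaCy model. With spaCy installed, the object returned by `spacy.load("en_core_web_sm")` would be passed instead.

```python
import numpy as np


class SpaCyEmbedder:
    """Hypothetical embedder: map each text to the `.vector` of its parsed doc."""

    def __init__(self, nlp):
        self.nlp = nlp  # e.g. the result of spacy.load("en_core_web_sm")

    def fit(self, X, y=None):
        return self  # stateless, the pipeline is already trained

    def transform(self, X):
        return np.array([self.nlp(text).vector for text in X])


class _StubDoc:
    """Stand-in for a spaCy Doc, exposing only a `.vector` attribute."""

    def __init__(self, text):
        self.vector = np.array([float(len(text)), 1.0])


class _StubPipeline:
    """Stand-in for a loaded spaCy pipeline; calling it returns a doc-like object."""

    def __call__(self, text):
        return _StubDoc(text)


embedder = SpaCyEmbedder(_StubPipeline())
X_vec = embedder.fit(None).transform(["This here text", "More text"])
```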
opened by CarloLepelaars 2
`get_feature_names_out` for encoders

I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).

opened by CarloLepelaars 1
Remove the classification layer in timm models

I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

opened by kacperlukawski 1
xception mobilenet

https://keras.io/api/applications/

https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2

https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

opened by koaning 0

'SentenceEncoder' object has no attribute 'device'

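import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

text_emb_pipeline = make_pipeline(
  ColumnGrabber("text"),
  SentenceEncoder('all-MiniLM-L6-v2')
)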

# This pipeline can also be trained to make predictions, using
# the embedded features. 
text_clf_pipeline = make_pipeline(
  text_emb_pipeline,
  LogisticRegression()
)

dataf = pd.DataFrame({
  "text": ["positive sentiment", "super negative"],
  "label_col": ["pos", "neg"]
})

X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col'])

This code gives this error: 'SentenceEncoder' object has no attribute 'device'

opened by nicholas-dinicola 6

Releases(0.2.2)

0.2.2(Dec 20, 2022)

Adds GPU support for Sentence Encoders.
0.2.1(Dec 5, 2022)

Fixed some error messages related to installing extra dependencies.
0.2.0(Oct 10, 2022)

Fixes a bug related to the Timm vision models.
0.1.0(Sep 19, 2022)

The first original release. Should have enough components to be interesting.

Owner

vincent d warmerdam

Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].

