Overview

NVIDIA Merlin

NVIDIA Merlin is an open source library designed to accelerate recommender systems on NVIDIA’s GPUs. It enables data scientists, machine learning engineers, and researchers to build high-performing recommenders at scale. Merlin includes tools to address common ETL, training, and inference challenges. Each stage of the Merlin pipeline is optimized to support hundreds of terabytes of data, which is all accessible through easy-to-use APIs. With Merlin, better predictions and increased click-through rates (CTRs) are within reach. For more information, see NVIDIA Merlin.

Benefits

NVIDIA Merlin is a scalable and GPU-accelerated solution, making it easy to build recommender systems from end to end. With NVIDIA Merlin, you can:

  • transform data (ETL) for preprocessing and feature engineering.
  • accelerate existing training pipelines in TensorFlow, PyTorch, or FastAI by leveraging optimized, custom-built dataloaders.
  • scale large deep learning recommender models by distributing large embedding tables that exceed available GPU and CPU memory.
  • deploy data transformations and trained models to production with only a few lines of code.

Components of NVIDIA Merlin

NVIDIA Merlin is a collection of open source libraries:

  • NVTabular
  • HugeCTR
  • Triton Inference Server

NVTabular
NVTabular is a feature engineering and preprocessing library for tabular data and is essentially the ETL component of the Merlin ecosystem. It is designed to quickly and easily manipulate terabyte-scale datasets that are used to train deep learning based recommender systems. NVTabular offers a high-level API for defining complex data transformation workflows (sketched after the list below), and those transformations can run 100 to 1,000 times faster than the same transformations on optimized CPU clusters. With NVTabular, you can:

  • prepare datasets quickly and easily for experimentation so that more models can be trained.
  • process datasets that exceed GPU and CPU memory without having to worry about scale.
  • focus on what to do with the data and not how to do it by using abstraction at the operation level.
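
A minimal sketch of an NVTabular workflow, assuming a Parquet file with hypothetical user_id, item_id, price, and label columns; the operator chaining with >> is the high-level API mentioned above:

    import nvtabular as nvt
    from nvtabular import ops

    # Chain operators onto column selections; nothing executes until fit/transform.
    cat_features = ["user_id", "item_id"] >> ops.Categorify()  # map IDs to contiguous integers
    cont_features = ["price"] >> ops.Normalize()               # zero mean, unit variance
    labels = ["label"]

    workflow = nvt.Workflow(cat_features + cont_features + labels)

    # nvt.Dataset wraps files larger than memory and streams them in chunks.
    train = nvt.Dataset("train.parquet")

    workflow.fit(train)                                  # compute statistics (categories, means, stds)
    workflow.transform(train).to_parquet("train_out/")   # apply the transformations and write results
    workflow.save("workflow/")                           # persist for reuse at inference time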

NVTabular DataLoaders
NVTabular provides seamless integration with common deep learning frameworks, such as TensorFlow, PyTorch, and HugeCTR. When training deep learning recommender system models, dataloading can be a bottleneck. Therefore, we’ve developed custom, highly optimized dataloaders to accelerate existing TensorFlow and PyTorch training pipelines; they can make the same GPU training pipeline up to nine times faster than the frameworks’ default dataloaders (see the sketch after this list). With the NVTabular dataloaders, you can:

  • remove bottlenecks from dataloading by processing large chunks of data at a time instead of item by item.
  • process datasets that don’t fit within the GPU or CPU memory by streaming from the disk.
  • prepare batches asynchronously and load them onto the GPU ahead of time to hide the cost of CPU-GPU communication.
  • integrate easily into existing TensorFlow or PyTorch training pipelines by using a similar API.
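
A minimal sketch of the TensorFlow dataloader, continuing from the NVTabular output above; note that the loader has moved between releases (newer versions ship it in the standalone merlin-dataloader package), so treat the import path as an assumption:

    import nvtabular as nvt
    from nvtabular.loader.tensorflow import KerasSequenceLoader

    # Stream the preprocessed Parquet files in large, GPU-resident chunks.
    train_loader = KerasSequenceLoader(
        nvt.Dataset("train_out/*.parquet"),
        batch_size=65536,                  # large batches are cheap: data is read chunk-wise
        cat_names=["user_id", "item_id"],  # categorical inputs
        cont_names=["price"],              # continuous inputs
        label_names=["label"],
        shuffle=True,
    )

    features, labels = next(iter(train_loader))  # features is a dict of on-GPU tensors

    # The loader is a drop-in replacement for a tf.data pipeline:
    # model.fit(train_loader, epochs=1)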

HugeCTR
HugeCTR is a GPU-accelerated framework designed to estimate click-through rates and to distribute training across multiple GPUs and nodes. HugeCTR contains optimized dataloaders that prepare batches with GPU acceleration, and it can scale large deep learning recommendation models. The neural network architectures often contain large embedding tables that represent hundreds of millions of users and items, and these tables can easily exceed CPU and GPU memory. HugeCTR provides strategies for scaling large embedding tables beyond available memory (see the sketch after this list). With HugeCTR, you can:

  • scale embedding tables over multiple GPUs or nodes.
  • load a subset of an embedding table into the GPU in a coarse-grained, on-demand manner during the training stage.
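
A minimal sketch of HugeCTR's Python interface for multi-GPU training, following the pattern of the public HugeCTR samples; the file paths, the vvgpu device map, and workspace_size_per_gpu_in_mb (the per-GPU memory budget for the embedding table) are illustrative values, not tuned settings:

    import hugectr

    solver = hugectr.CreateSolver(
        batchsize=16384,
        batchsize_eval=16384,
        max_eval_batches=300,
        lr=0.001,
        vvgpu=[[0, 1]],      # one node, two GPUs; the embedding table is sharded across them
        repeat_dataset=True,
    )
    reader = hugectr.DataReaderParams(
        data_reader_type=hugectr.DataReaderType_t.Norm,
        source=["./train/file_list.txt"],
        eval_source="./valid/file_list.txt",
        check_type=hugectr.Check_t.Sum,
    )
    optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.Adam)

    model = hugectr.Model(solver, reader, optimizer)
    model.add(hugectr.Input(
        label_dim=1, label_name="label",
        dense_dim=13, dense_name="dense",
        data_reader_sparse_param_array=[hugectr.DataReaderSparseParam("data1", 1, True, 26)],
    ))
    model.add(hugectr.SparseEmbedding(
        embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
        workspace_size_per_gpu_in_mb=75,  # caps the embedding memory used on each device
        embedding_vec_size=16,
        combiner="sum",
        sparse_embedding_name="sparse_embedding1",
        bottom_name="data1",
        optimizer=optimizer,
    ))
    model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Reshape,
        bottom_names=["sparse_embedding1"], top_names=["reshape1"],
        leading_dim=416))                 # 26 slots * 16-dim vectors
    model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Concat,
        bottom_names=["reshape1", "dense"], top_names=["concat1"]))
    model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["concat1"], top_names=["fc1"], num_output=1))
    model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
        bottom_names=["fc1", "label"], top_names=["loss"]))
    model.compile()
    model.summary()
    model.fit(max_iter=2000, display=200, eval_interval=1000)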

Triton
NVTabular and HugeCTR both support the Triton Inference Server to provide GPU-accelerated inference. The Triton Inference Server is open source inference serving software that simplifies the deployment of trained AI models from any framework to production (a client request is sketched after the list below). With Triton, you can:

  • deploy NVTabular ETL workflows and trained deep learning models to production with a few lines of code.
  • deploy an ensemble of NVTabular ETL and trained deep learning models to ensure that the same data transformations are applied in production.
  • deploy models concurrently on GPUs to maximize utilization.
  • enable low-latency real-time inference or batch inference to maximize GPU and CPU utilization.
  • scale the production environment with Kubernetes for orchestration, metrics, and auto-scaling using a Docker container.
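
A minimal sketch of querying a deployed ensemble with Triton's HTTP client; the server address, the deepfm_ens model name, and the input/output names and shapes are placeholders that must match your deployment's model configuration:

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Raw features are sent to the ensemble; the NVTabular stage applies the same
    # transformations used during training before the model sees the data.
    user_ids = np.array([[17492]], dtype=np.int64)
    inputs = [httpclient.InferInput("user_id", list(user_ids.shape), "INT64")]
    inputs[0].set_data_from_numpy(user_ids)

    outputs = [httpclient.InferRequestedOutput("output")]
    response = client.infer("deepfm_ens", inputs, outputs=outputs)
    print(response.as_numpy("output"))  # e.g., the predicted click-through probability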

Examples

A collection of end-to-end examples is available within this repository in the form of Jupyter notebooks. The example notebooks demonstrate how to:

  • download and prepare the dataset.
  • preprocess data and engineer features.
  • train deep learning recommendation models with TensorFlow, PyTorch, FastAI, or HugeCTR.
  • deploy the models to production.

These examples are based on different datasets and provide a wide range of real-world use cases.

Resources

For more information about NVIDIA Merlin and its components, see the NVIDIA Merlin documentation.

Comments
  • Update Criteo Example with Merlin Models and Merlin Systems

    • Added an example notebook that trains a DLRM Merlin Model on the Criteo 1TB dataset.
    • Modified README.md, examples/README.md, and examples/scaling-criteo/README.md to add a link to and information about the example.
    • Modified examples/scaling-criteo/02-ETL-with-NVTabular.ipynb to add Target and BINARY_CLASSIFICATION tag to the label.
    documentation examples 
    opened by rvk007 31
  • [BUG] Cannot load an exported deepfm model with NGC 22.03 inference container

    I run into the following errors:

    I0318 00:00:18.082645 172 hugectr.cc:1926] TRITONBACKEND_ModelInstanceInitialize: deepfm_0 (device 0)
    I0318 00:00:18.082694 172 hugectr.cc:1566] Triton Model Instance Initialization on device 0
    I0318 00:00:18.082792 172 hugectr.cc:1576] Dense Feature buffer allocation:
    I0318 00:00:18.083026 172 hugectr.cc:1583] Categorical Feature buffer allocation:
    I0318 00:00:18.083095 172 hugectr.cc:1601] Categorical Row Index buffer allocation:
    I0318 00:00:18.083143 172 hugectr.cc:1611] Predict result buffer allocation:
    I0318 00:00:18.083203 172 hugectr.cc:1939] ******Loading HugeCTR Model******
    I0318 00:00:18.083217 172 hugectr.cc:1631] The model origin json configuration file path is: /ensemble_models/deepfm/1/deepfm.json
    [HCTR][00:00:18][INFO][RK0][main]: Global seed is 1305961709
    [HCTR][00:00:19][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
    [HCTR][00:00:19][INFO][RK0][main]: Start all2all warmup
    [HCTR][00:00:19][INFO][RK0][main]: End all2all warmup
    [HCTR][00:00:19][INFO][RK0][main]: Create inference session on device: 0
    [HCTR][00:00:19][INFO][RK0][main]: Model name: deepfm
    [HCTR][00:00:19][INFO][RK0][main]: Use mixed precision: False
    [HCTR][00:00:19][INFO][RK0][main]: Use cuda graph: True
    [HCTR][00:00:19][INFO][RK0][main]: Max batchsize: 64
    [HCTR][00:00:19][INFO][RK0][main]: Use I64 input key: True
    [HCTR][00:00:19][INFO][RK0][main]: start create embedding for inference
    [HCTR][00:00:19][INFO][RK0][main]: sparse_input name data1
    [HCTR][00:00:19][INFO][RK0][main]: create embedding for inference success
    [HCTR][00:00:19][INFO][RK0][main]: Inference stage skip BinaryCrossEntropyLoss layer, replaced by Sigmoid layer
    I0318 00:00:19.826815 172 hugectr.cc:1639] ******Loading HugeCTR model successfully
    I0318 00:00:19.827763 172 model_repository_manager.cc:1149] successfully loaded 'deepfm' version 1
    E0318 00:00:19.827767 172 model_repository_manager.cc:1152] failed to load 'deepfm_nvt' version 1: Internal: TypeError: 'NoneType' object is not subscriptable
    
    At:
      /ensemble_models/deepfm_nvt/1/model.py(91): _set_output_dtype
      /ensemble_models/deepfm_nvt/1/model.py(76): initialize
    
    E0318 00:00:19.827960 172 model_repository_manager.cc:1332] Invalid argument: ensemble 'deepfm_ens' depends on 'deepfm_nvt' which has no loaded version
    I0318 00:00:19.828048 172 server.cc:522]
    +------------------+------+
    | Repository Agent | Path |
    +------------------+------+
    +------------------+------+
    
    I0318 00:00:19.828117 172 server.cc:549]
    +---------+---------------------------------------------------------+-----------------------------------------------+
    | Backend | Path                                                    | Config                                        |
    +---------+---------------------------------------------------------+-----------------------------------------------+
    | hugectr | /opt/tritonserver/backends/hugectr/libtriton_hugectr.so | {"cmdline":{"ps":"/ensemble_models/ps.json"}} |
    +---------+---------------------------------------------------------+-----------------------------------------------+
    
    I0318 00:00:19.828209 172 server.cc:592]
    +------------+---------+--------------------------------------------------------------------------+
    | Model      | Version | Status                                                                   |
    +------------+---------+--------------------------------------------------------------------------+
    | deepfm     | 1       | READY                                                                    |
    | deepfm_nvt | 1       | UNAVAILABLE: Internal: TypeError: 'NoneType' object is not subscriptable |
    |            |         |                                                                          |
    |            |         | At:                                                                      |
    |            |         |   /ensemble_models/deepfm_nvt/1/model.py(91): _set_output_dtype          |
    |            |         |   /ensemble_models/deepfm_nvt/1/model.py(76): initialize                 |
    +------------+---------+--------------------------------------------------------------------------+
    
    I0318 00:00:19.845925 172 metrics.cc:623] Collecting metrics for GPU 0: Tesla T4
    I0318 00:00:19.846404 172 tritonserver.cc:1932]
    +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
    | Option                           | Value                                                                                                                              |
    +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
    | server_id                        | triton                                                                                                                             |
    | server_version                   | 2.19.0                                                                                                                             |
    | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_mem |
    |                                  | ory cuda_shared_memory binary_tensor_data statistics trace                                                                         |
    | model_repository_path[0]         | /ensemble_models                                                                                                                   |
    | model_control_mode               | MODE_NONE                                                                                                                          |
    | strict_model_config              | 1                                                                                                                                  |
    | rate_limit                       | OFF                                                                                                                                |
    | pinned_memory_pool_byte_size     | 268435456                                                                                                                          |
    | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                           |
    | response_cache_byte_size         | 0                                                                                                                                  |
    | min_supported_compute_capability | 6.0                                                                                                                                |
    | strict_readiness                 | 1                                                                                                                                  |
    | exit_timeout                     | 30                                                                                                                                 |
    +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
    
    
    

    Aha! Link: https://nvaiinfa.aha.io/features/MERLIN-818

    bug 
    opened by mengdong 26
  • Add example notebook for training and inference on Sagemaker

    This PR adds an example notebook that demonstrates how to use the Merlin container for training and inference on the AWS Sagemaker platform. It can be roughly split into three sections:

    1. Test the training script locally to make sure that the script runs as expected,
    2. Train the model on Sagemaker by calling sagemaker.Estimator.fit(entry_point='train.py') (see the sketch after this list), and
    3. Create an endpoint with Triton, and test it by sending a request.
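
    A minimal sketch of step 2, assuming the custom ECR image described in the note below has the sagemaker-training toolkit installed (which is what lets the generic Estimator run an entry_point script); the image URI, role, instance type, and S3 path are placeholders:

        import sagemaker
        from sagemaker.estimator import Estimator

        estimator = Estimator(
            image_uri="<account>.dkr.ecr.<region>.amazonaws.com/merlin-sagemaker:latest",
            role=sagemaker.get_execution_role(),
            instance_count=1,
            instance_type="ml.g4dn.xlarge",  # single-GPU instance; placeholder choice
            entry_point="train.py",          # requires sagemaker-training inside the image
        )
        estimator.fit({"train": "s3://<bucket>/merlin/train/"})  # channel mounted in the container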

    Note:

    • The notebook builds a custom image (built from merlin-tensorflow image) and uploads it to ECR. This step can be simplified/removed by installing the sagemaker-training library directly in the Merlin image, so I opened a separate PR: https://github.com/NVIDIA-Merlin/Merlin/pull/707.

    Steps for testing (NV-specific):

    • Ask Ben to give you access to the AWS account. After he adds you to the group, wait a couple of hours for permissions to propagate.
    • If permissions have propagated, you will see AWS NVIDIA Account on https://myapps.microsoft.com/. This can be used to access the console.
    • Install aws-azure-login and run aws-azure-login --mode=gui to authenticate. This will create the credentials in $HOME/.aws. More details here: https://gitlab-master.nvidia.com/kaizen-efficiency/wiki/-/wikis/Guides/aws-cli-sso
    • As described in the README, add -v $HOME/.aws:/root/.aws to your usual docker run command.
    examples 
    opened by edknv 25
  • Upgrade upstream container references, get pyarrow from upstream

    This PR upgrades the container references in the Dockerfile to the latest available DLFW (deep learning frameworks) images. The Dockerfile also gets the latest compatible pyarrow directly from the upstream DLFW container, along with a library newly required by tritonserver.

    chore ci 
    opened by jperez999 21
  • [RMP] Recsys Tutorial & Demo - Flesh out the multi-stage recommender example architecture

    Problem:

    Customers need a clear example of a multi-stage recommender pipeline that they can follow and use to create their own versions. The upcoming Recsys tutorial will be a public sharing of this example and serves as the deadline for its completion.

    Goal:

    • Provide a clear example of a multi-stage recommender pipeline that does retrieval, filtering, ranking, and ordering.
    • Highlight NVTabular, our dataloader, Merlin models, and Merlin systems and how they work together.

    Constraints:

    • NVTabular doesn't support splitting a workflow between user and item features, so two separate workflows are needed to process each.

    Starting Point:

    • [ ] https://github.com/NVIDIA-Merlin/systems/issues/99 (this has been removed from scope)
    • [ ] #458
    • [ ] #449 - This is an optional requirement (removed)
    examples 
    opened by karlhigley 18
  • Migrate the legacy examples to the Merlin repo

    We may (or may not) want to keep these examples but they've overstayed their welcome in the NVTabular repo, which is burdened with the accumulation of a lot of historical cruft. Since some of these examples use inference code that's moving to Systems, it makes more sense for them to live in the Merlin repo (if we want to keep them.)

    The PR on the other side of this migration is https://github.com/NVIDIA-Merlin/NVTabular/pull/1711

    chore 
    opened by karlhigley 16
  • Unable to export graph as ensemble when using multi-hot categorical column

    I am using the 22.05 inference container. When materializing Feast in cell 6 of the multi-stage recsys deployment notebook, I get an exception while materializing user_features. Item features worked just fine. Please download the sample data and notebooks used for this from the link here

    Using the tensorflow-inference 22.05 container. The error is attached as a screenshot (Screen Shot 2022-06-02 at 6.47.45 AM).

    bug P0 
    opened by mkumari-ed 16
  • Pin pyarrow version for integration tests

    The pip install 'feast<0.20' command is installing the latest version of pyarrow (currently 10.0.1). This is incompatible with the current version of cudf.

    Follow-up

    The container we're testing with already has pyarrow installed. So a follow-up action would be to figure out why the existing pyarrow version is not being detected and take any actions required for this to be found.

    • import pyarrow; pyarrow.__version__ reports 8.0.0.
      • However, pip show pyarrow doesn't report anything.
      • This suggests that the pyarrow installation is incomplete (missing the dist-info directory)
    ci 
    opened by oliverholworthy 14
  • [BUG] Cannot run example - Deploying a Multi-Stage Recommender System

    Hi! Thanks for providing the multi-stage recommender system example. This is really helpful to learn to use Merlin in a more realistic scenario. I'm trying to run the two examples of Building-and-deploying-multi-stage-RecSys.

    My system CUDA driver is 11.5 and I'm using nvcr.io/nvidia/merlin/merlin-tensorflow-inference:latest.

    At first I had a problem similar to #158 and I solved it by following the suggestion there: specifically, I downgraded the CUDA toolkit to 11.5 in the container. After this, I could successfully run example 01-Building-Recommender-Systems-with-Merlin.ipynb, but I hit a problem at almost the last step while running 02-Deploying-multi-stage-RecSys-with-Merlin-Systems.ipynb. As shown in the attached screenshot, I can't start a Triton server, while all other steps in both examples seem normal.

    Could you please help me figure out a solution? Thanks!

    opened by future-xy 13
  • [wip] Split out Systems tests from notebooks

    The idea here is to split some of our integration tests away from the notebook examples. This way we can write more complete tests without worrying about the narrative that we provide to new users.

    There is currently a single subclass of unittest.TestCase with a heavy setUpClass method that:

    • Generates synthetic data
    • Defines NVT workflows
    • Trains models (TT, DLRM)
    • Ingests data into Feast
    • Ingests item vectors into FAISS

    The individual tests configure various Systems ensembles and ensure, for now, that they compile.

    I am having issues with the retrieval model signature including item features, which it shouldn't. I'm opening a draft PR to see how it works on CI.

    chore 
    opened by nv-alaiacano 12
  • fixes in the PoC first notebook

    This PR

    • fixes the raw dataframe fed to both models so that the user and item catalogs won't be confused
    • trains the retrieval model first and then the ranking model (logical order)
    enhancement examples 
    opened by rnyak 12
  • [RMP] Multi-Task Learning with Merlin Models

    Problem:

    Multi-Task Learning (MTL) is a popular approach for improving model accuracy, robustness, and compression by training models to predict multiple outputs.
    In RecSys, MTL has been widely used in ranking models, where multiple binary targets (e.g., the likelihood of a click, like, share, or purchase) are predicted for user-item pairs. It has also been popular for training retrieval models, as in the OTTO – Multi-Objective Recommender System Kaggle competition, where the task is to predict clicks, cart additions, and orders.

    Goal:

    The goal of this RMP is to provide to our users easy building blocks in Merlin Models for building Multi-Task Learning models.

    Constraints:

    • The model tasks should be inferred automatically from the schema, where columns can be tagged as "target", "regression" and "classification" ("binary" and "multi_class").
    • The user should be able to define the tasks manually by creating one or multiple tasks per target column, with or without a task-specific tower.
    • It should be possible to set loss_weights so that the task weights can be balanced in the final loss (illustrated after this list).
    • It should be possible to set metrics and weighted_metrics per task.
    • It should be possible to set sample_weight per task.
    • We should provide, as building blocks, some state-of-the-art architectures especially designed for multi-task learning: MMOE, PLE.
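
    For illustration only, this is how per-task losses, loss weights, and metrics are balanced in plain Keras; the output names and weights are made up, and this is not the proposed Merlin Models API:

        import tensorflow as tf

        # A toy two-task model: a shared tower with one output head per binary target.
        inputs = tf.keras.Input(shape=(32,), name="features")
        shared = tf.keras.layers.Dense(64, activation="relu")(inputs)
        click = tf.keras.layers.Dense(1, activation="sigmoid", name="click")(shared)
        purchase = tf.keras.layers.Dense(1, activation="sigmoid", name="purchase")(shared)
        model = tf.keras.Model(inputs, [click, purchase])

        # Per-task losses and metrics, with loss_weights balancing the final loss.
        model.compile(
            optimizer="adam",
            loss={"click": "binary_crossentropy", "purchase": "binary_crossentropy"},
            loss_weights={"click": 1.0, "purchase": 2.0},  # purchases count double
            metrics={"click": ["AUC"], "purchase": ["AUC"]},
        )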

    Starting Point:

    The early releases of Merlin Models already contained some building blocks for MTL, like

    • parse_prediction_tasks() which returned a ParallelPredictionBlock containing a number of PredictionTask
    • A preliminary version of MMOEBlock
    • Some logic to map task outputs to losses and metrics

    In this RMP, we list the latest tasks / fixes / improvements needed to officially release Multi-Task Learning with Merlin Models:

    • [x] https://github.com/NVIDIA-Merlin/models/pull/902
    • [ ] https://github.com/NVIDIA-Merlin/models/pull/772
    • [ ] https://github.com/NVIDIA-Merlin/models/issues/914
    • [ ] https://github.com/NVIDIA-Merlin/models/issues/687
    opened by gabrielspmoreira 0
  • Install hps backend and hps trt plugin in hugectr/tf/pytorch

    1. Install common dependencies in merlin-base;
    2. Install related components (hugectr inference, hps_backend, hps trt plugin) in merlin-hugectr/merlin-tf/merlin-pytorch;
    chore ci 
    opened by EmmaQiaoCh 2
  • [QST] Fail to docker build from dockerfile.ctr, pull access denied for nvcr.io/nvstaging/merlin/merlin-base

    ❓ Questions & Help

    I've succeeded in logging in to nvcr.io, but I still cannot pull from nvcr.io/nvstaging/merlin/merlin-base. Is this NGC image still available?

    cmd: cd docker && docker build --pull -t hugectr:devel -f ./dockerfile.ctr --build-arg RELEASE=false --build-arg RMM_VER=vnightly --build-arg CUDF_VER=vnightly --build-arg NVTAB_VER=vnightly --build-arg HUGECTR_DEV_MODE=true --no-cache .

    question 
    opened by heroes999 0
  • [INF] Unresolved architectural decisions

    Problem:

    Merlin now has a bunch of libraries that need to interoperate smoothly, but a general lack of shared abstractions, conventions, and standards that would make that possible.

    Goal:

    • Build a solid foundation for the Merlin libraries via improvements in Core

    New Functionality

    • Core:
      • Shape in column schemas (for consistent tracking across libraries)
      • Cross-framework dtype translation (e.g. via Merlin dtypes)
      • Cross-framework data transfer via zero-copy protocols (for Columns and DictArrays -> Series and Dataframes)
      • Bespoke Merlin schema file format (i.e. a Protobuf schema for Merlin schema that isn't from Tensorflow Metadata)
    • Corresponding updates in all downstream libraries

    Constraints:

    • All functionality entailed by this issue has to work in and be adoptable by all Merlin libraries

    Starting Point:

    • [ ] Proposals:
      • [ ] Shapes
      • [ ] Dtypes
      • [ ] Data transfer
      • [ ] Schema file format
    roadmap infrastructure 
    opened by karlhigley 1
  • [Task] Use pre-commit for linting in GitHub Actions Workflow

    Description

    We have been using pre-commit in projects for linting in our local commits. However, we've been running checks using a different mechanism in our GitHub actions workflows.

    Updating to use pre-commit in a GitHub Actions workflow ensures that the checks we run locally are consistent with the ones we run in CI, reducing the risk of version discrepancies between CI and local development.

    • [X] Models https://github.com/NVIDIA-Merlin/models/pull/106
    • [X] Dataloader https://github.com/NVIDIA-Merlin/dataloader/pull/55
    • [X] Transformers4Rec https://github.com/NVIDIA-Merlin/Transformers4Rec/pull/545
    • [X] NVTabular https://github.com/NVIDIA-Merlin/NVTabular/pull/1723
    • [x] Core https://github.com/NVIDIA-Merlin/core/pull/184
    • [x] Systems https://github.com/NVIDIA-Merlin/systems/pull/254
    • [ ] Merlin
    chore ci 
    opened by oliverholworthy 0
Releases (v22.11.00)
  • v22.11.00 (Nov 22, 2022)

    What’s Changed

    🐜 Bug Fixes

    • Update dockerfile.ci to find the NVT dev requirements file @karlhigley (#740)
    • Restrict cmake<3.25.0 to avoid an issue finding CUDA toolkit @karlhigley (#739)
    • add required dataloader dependency to run unit tests for dataloader @jperez999 (#735)
    • anchor xgboost to 1.6.2 to make tests pass @jperez999 (#726)
    • adding new metric loss_batch from merlin models @jperez999 (#727)

    📄 Documentation

    • Add example notebook for training and inference on Sagemaker @edknv (#692)

    🔧 Maintenance

    • Add Jenkinsfile @AyodeAwe (#734)
    • Update dockerfile.ci to find the NVT dev requirements file @karlhigley (#740)
    • Restrict cmake<3.25.0 to avoid an issue finding CUDA toolkit @karlhigley (#739)
    • add required dataloader dependency to run unit tests for dataloader @jperez999 (#735)
    • add dataloader unit testing to container run @jperez999 (#728)
    • Increase timeout of multi-stage notebook test from 120 to 180 @oliverholworthy (#729)
    • anchor xgboost to 1.6.2 to make tests pass @jperez999 (#726)
    • adding new metric loss_batch from merlin models @jperez999 (#727)
  • untagged-c73c3c7a63c75e538917 (Nov 9, 2022)

    What’s Changed

    • Remove nvtabular backend @jperez999 (#378)
    • Adding instructions for running the E2E example on CPU @AshishSardana (#332)
    • add fastai to torch dockerfile @jperez999 (#363)
    • Fail the build if tritonserver is missing from inference containers @benfred (#358)
    • Use python setup.py install to install SOK @jperez999 (#357)
    • hugectr dockerfile update @jperez999 (#355)
    • Add sparse operation kit and distributed embeddings to tf image @jperez999 (#354)
    • Update hugectr and torch dockerfiles @jperez999 (#353)
    • Update merlin base dockerfile for 22.05 @jperez999 (#348)
    • Update Arrow and triton versions @jperez999 (#347)
    • Sys tests fix @jperez999 (#342)
    • add Tf keras to dockerfile @jperez999 (#341)
    • Fix ci routes @jperez999 (#331)
    • Ci update @jperez999 (#337)
    • Update CI dockerfile with new cuda keyring and keys for repo @jperez999 (#334)
    • Fix CVE @jperez999 (#328)
    • Distributed embeddings failing to install because of PYTHONPATH Edit before hand @jperez999 (#326)
    • Fix for 'import error' in merlin-tensorflow-training:22.05 @EmmaQiaoCh (#317)
    • Merlin rm bad key @jperez999 (#316)
    • Merlin rm bad key @jperez999 (#314)
    • Remove bad key for Nvidia apt repositories @jperez999 (#311)
    • Add torch inf @jperez999 (#307)
    • Add torch inf @jperez999 (#305)
    • Remove merlin PYTHONPATH edits @jperez999 (#304)
    • remove horovod upgrade to fix tf perf issue @zehuanw (#302)
    • Add container infos to notebooks @bschifferer (#298)
    • Rm pip e @jperez999 (#300)

    ⚠ Breaking Changes

    • Update for change of hugectr branch name @EmmaQiaoCh (#705)
    • Fix ci order @jperez999 (#581)
    • Fix int test @jperez999 (#578)
    • Add numpy anchor version after all package building and updates @jperez999 (#566)
    • Fix unit scaling criteo inference serving @jperez999 (#559)
    • Moving loss and metrics to model.compile @oliverholworthy (#340)

    🐜 Bug Fixes

    • adding dataloader repo to dockerfile @jperez999 (#722)
    • add in reinstall for dask and distributed after feast install @jperez999 (#713)
    • create resilient directory rm @jperez999 (#718)
    • fix missing lib issue by adding from upstream @jperez999 (#689)
    • get numba from upstream container @jperez999 (#690)
    • Adding git pull command to nightly docker to ensure latest commit @jperez999 (#617)
    • Fix unit scaling criteo inference serving @jperez999 (#559)
    • Fix integration test and update NB @radekosmulski (#544)
    • Fix nightly container builds @benfred (#518)
    • Fix typo in CI script file name @karlhigley (#445)
    • remove extra CMD from containers @jperez999 (#390)
    • Add pytorch tritonserver backend to ci dockerfile @jperez999 (#441)
    • updates from entrypoint and cupy cuda116 @jperez999 (#389)
    • Add matplotlib to torch container @jperez999 (#386)
    • Add nvt backend back into containers @jperez999 (#382)
    • Remove entrypoint from inference/training images. @oliverholworthy (#336)
    • fix: "illegal instruction" error in hugectr test jobs @EmmaQiaoCh (#295)

    🚀 Features

    • Add Dropna to remove nulls in the dataset that creates error in the integration test of multi-stage deployment nb @rnyak (#629)
    • Update multi-stage deployment notebooks and integration test @rnyak (#627)
    • modify multi-stage deployment nbs wrt recent changes in systems  @rnyak (#621)
    • Changes for hugectr @EmmaQiaoCh (#502)
    • fixes in the PoC first notebook @rnyak (#487)
    • Switch over HDFS build/install scripts @bashimao (#434)
    • Add jupyter ENVs in case launching container with normal user @EmmaQiaoCh (#439)
    • fix poc nbs and move poc unit test @rnyak (#387)
    • Moving loss and metrics to model.compile @oliverholworthy (#340)

    📄 Documentation

    • docs: semver to calver banner @mikemckiernan (#715)
    • docs: Add configuration for SEO @mikemckiernan (#723)
    • docs: Remove SM from the support matrix @mikemckiernan (#721)
    • Support matrix updates for 22.10 @nvidia-merlin-bot (#719)
    • fix: Update HugeCTR version to 4.1.1 @mikemckiernan (#717)
    • Support matrix updates for 22.09 @nvidia-merlin-bot (#711)
    • Support matrix updates for 22.10 @nvidia-merlin-bot (#710)
    • fix links in Multi-stage Bulding-and-deployment nbs @rnyak (#697)
    • Support matrix updates for 22.09 @nvidia-merlin-bot (#636)
    • Address virtual dev review comments @mikemckiernan (#626)
    • modify multi-stage deployment nbs wrt recent changes in systems  @rnyak (#621)
    • Add a "New Functionality" section to the Roadmap issue template @karlhigley (#596)
    • Update Logos of all examples @bschifferer (#569)
    • Update Merlin libs graphic @mikemckiernan (#560)
    • Support matrix updates for 22.06 @nvidia-merlin-bot (#500)
    • Support matrix updates for 22.07 @nvidia-merlin-bot (#499)
    • Get TF version from Python and not Pip @mikemckiernan (#498)
    • docs: Readability improvements @ryanrussell (#440)
    • Update Criteo Example with Merlin Models and Merlin Systems @rvk007 (#380)
    • fixes in the PoC first notebook @rnyak (#487)
    • Remove unnecessary dependencies from docs builds @mikemckiernan (#466)
    • add integration tests @radekosmulski (#310)
    • Update README with installation steps @rvk007 (#430)
    • Support matrix updates for 22.06 @nvidia-merlin-bot (#435)
    • Add NGC overview descriptions for our containers @benfred (#399)
    • Add run timestamp to data @mikemckiernan (#415)
    • Update URLs for Criteo dataset @mikemckiernan (#400)
    • Document the change to three containers @mikemckiernan (#379)
    • fix poc nbs and move poc unit test @rnyak (#387)
    • Add issue templates @karlhigley (#345)
    • Hand-edit the HugeCTR versions @mikemckiernan (#329)
    • Support matrix updates for 22.05 @nvidia-merlin-bot (#321)
    • chore: Add release-drafter @mikemckiernan (#308)

    🔧 Maintenance

    • adding in instructions from the jenkins dockerfile to keep in one place @jperez999 (#725)
    • Pass mpi environment variable to tox @edknv (#724)
    • adding dataloader repo to dockerfile @jperez999 (#722)
    • docs: Add two projects to the support matrix @mikemckiernan (#720)
    • add in reinstall for dask and distributed after feast install @jperez999 (#713)
    • create resilient directory rm @jperez999 (#718)
    • Install horovod in the ci-runner image for CI testing @edknv (#712)
    • Update for change of hugectr branch name @EmmaQiaoCh (#705)
    • Add dist-info directories for packages copied in the tensorflow image. @oliverholworthy (#704)
    • revert change from 2209 requirements @jperez999 (#701)
    • revert to 22.08 @jperez999 (#698)
    • add llvmlite to base for numba dep @jperez999 (#691)
    • fix missing lib issue by adding from upstream @jperez999 (#689)
    • get numba from upstream container @jperez999 (#690)
    • update TFDE build @FDecaYed (#659)
    • Upgrade upstream container references, get pyarrow from upstream @jperez999 (#656)
    • fix: RMM and cuDF are no longer installed with pip @mikemckiernan (#637)
    • Adding hugectr to nightly build dockerfile @jperez999 (#632)
    • Increase timeout of unit test for second multi-stage notebook @oliverholworthy (#630)
    • Add Dropna to remove nulls in the dataset that creates error in the integration test of multi-stage deployment nb @rnyak (#629)
    • Enable unittest for 2stage notebooks @bschifferer (#628)
    • Update paths using BASE_DIR in multi-stage notebook to handle non-default value @oliverholworthy (#622)
    • Update multi-stage deployment notebooks and integration test @rnyak (#627)
    • anchor tf version to avoid errors in 2.10.0 libnvinfer look ups @jperez999 (#625)
    • Revert to working tritonserver call in notebook using testbooks @jperez999 (#619)
    • Remove lint env from tox config @karlhigley (#624)
    • Add a tox config file @karlhigley (#623)
    • Update ci container to use nightly base container @jperez999 (#620)
    • Adding git pull command to nightly docker to ensure latest commit @jperez999 (#617)
    • Add dock nite @jperez999 (#616)
    • Add dockerfile for nightly builds @jperez999 (#615)
    • Args mv @jperez999 (#603)
    • Args mv @jperez999 (#602)
    • anchor Tf version @jperez999 (#601)
    • Add a "New Functionality" section to the Roadmap issue template @karlhigley (#596)
    • Don't install faiss with the integration tests @benfred (#591)
    • Install numpy for building faiss @benfred (#590)
    • Fixes for faiss install @benfred (#587)
    • Include a SM80 enabled version of faiss on merlin-base container @benfred (#584)
    • Rearrange testing for faster feedback @jperez999 (#583)
    • Second fix @jperez999 (#582)
    • Update triage github actions workflow @benfred (#580)
    • Fix ci order @jperez999 (#581)
    • Reduce the size of synthetic data used in Criteo test @karlhigley (#579)
    • Fix int test @jperez999 (#578)
    • Fix int + unit tests @jperez999 (#577)
    • Add hadoop envs @EmmaQiaoCh (#565)
    • Add numpy anchor version after all package building and updates @jperez999 (#566)
    • Docker edit @jperez999 (#564)
    • Docker fix @jperez999 (#563)
    • Add skip tf crit unit @jperez999 (#561)
    • Fix unit scaling criteo inference serving @jperez999 (#559)
    • update base dockerfile @benfred (#556)
    • Update the Merlin repos in the CI image build @karlhigley (#558)
    • Remove dependencies of hugectr & install hps tf plugin to merlin-tf @EmmaQiaoCh (#549)
    • Skip the multi-stage example notebook (for now) @karlhigley (#543)
    • Skip multi-stage example integration test (for now) @karlhigley (#541)
    • Revert CMake changes @karlhigley (#538)
    • Install CMake in Merlin base image (instead of copying from build) @karlhigley (#524)
    • Mark "scaling Criteo" notebook to be skipped without TF @karlhigley (#537)
    • install tox in base image @nv-alaiacano (#532)
    • Pin fsspec==22.5.0 directly in the Merlin base image @karlhigley (#533)
    • Remove duplicate CMake installs @karlhigley (#523)
    • Make integration test script executable @karlhigley (#521)
    • Install Feast/Faiss before running Merlin integration tests @karlhigley (#520)
    • Downgrade onnxruntime due to security issue of mpmath @EmmaQiaoCh (#486)
    • Convert the default text in the roadmap issue template to comments @karlhigley (#483)
    • Add tests in the Merlin repo to the CI test scripts @karlhigley (#450)
    • Remove unnecessary dependencies from docs builds @mikemckiernan (#466)
    • Adds integration tests to Merlin Models @gabrielspmoreira (#438)
    • Add wandb @jperez999 (#470)
    • added fiddle to container for models testing @jperez999 (#469)
    • add integration tests @radekosmulski (#310)
    • Refactor the container test script to run all the SW checks, unit tests, or integration tests before failing @karlhigley (#444)
    • Switch over HDFS build/install scripts @bashimao (#434)
    • remove extra CMD from containers @jperez999 (#390)
    • Add pytorch tritonserver backend to ci dockerfile @jperez999 (#441)
    • Add jupyter ENVs in case launching container with normal user @EmmaQiaoCh (#439)
    • remove excess python path setting @jperez999 (#432)
    • Remove stale doc reviews @mikemckiernan (#417)
    • Fix failing treelite install @jperez999 (#416)
    • Add FIL support to Base container, Add e2e support in ci container @jperez999 (#414)
    • Always run NVT integration tests @benfred (#401)
    • Inline hugectr container tests @benfred (#398)
    • Update test_container.sh script @benfred (#396)
    • Restrict tritonclient to 2.22.0 @EmmaQiaoCh (#391)
    • Add sok test new @EmmaQiaoCh (#384)
    • Add skip checks to examples tests @jperez999 (#392)
    • updates from entrypoint and cupy cuda116 @jperez999 (#389)
    • Add matplotlib to torch container @jperez999 (#386)
    • Rm old docker @jperez999 (#383)
    • Add a GA workflow that requires labels on PR's @benfred (#381)
    • Support matrix updates for 22.05 @nvidia-merlin-bot (#352)
    • Support matrix updates for 21.11 @nvidia-merlin-bot (#372)
    • Support matrix updates for 21.09 @nvidia-merlin-bot (#373)
    • Support matrix updates for 22.03 @nvidia-merlin-bot (#368)
    • Support matrix updates for 22.02 @nvidia-merlin-bot (#370)
    • Support matrix updates for 22.01 @nvidia-merlin-bot (#369)
    • Support matrix updates for 21.12 @nvidia-merlin-bot (#371)
    • Refactor SMX for blossom @mikemckiernan (#351)
    • Support matrix updates for 22.04 @nvidia-merlin-bot (#367)
    • Update for hdfs @EmmaQiaoCh (#365)
    • Enable running the SMX data job in Blossom @mikemckiernan (#325)
    • Request that PRs are labeled @mikemckiernan (#327)
    • Also use horovod from DLFW for hugectr training container @benfred (#303)
  • v22.05 (Jun 9, 2022)

    What's Changed

    • Removes --user when installing NVTabular by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/64
    • Fix typos in docker files by @benfred in https://github.com/NVIDIA-Merlin/Merlin/pull/68
    • Fix link in README and update merlin examples by @benfred in https://github.com/NVIDIA-Merlin/Merlin/pull/70
    • Adds package and fixes typo by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/69
    • 21.11 DLFW by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/73
    • update unified container to support hugectr v3.3 by @zehuanw in https://github.com/NVIDIA-Merlin/Merlin/pull/76
    • fix illegal instruction by adding "PORTABLE=1" by @zehuanw in https://github.com/NVIDIA-Merlin/Merlin/pull/77
    • Fixing the issue of missing install.sh in tf container build by @zehuanw in https://github.com/NVIDIA-Merlin/Merlin/pull/79
    • Rel 21.12 by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/78
    • Add new issues to the backlog project by @benfred in https://github.com/NVIDIA-Merlin/Merlin/pull/80
    • Updates dlfw 21.12 by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/86
    • Cuda compat removes shell by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/89
    • [Ready to be reviewed] Support Customized HCTR Repo in the unified containers by @zehuanw in https://github.com/NVIDIA-Merlin/Merlin/pull/85
    • mask_nvt_for_tf1_image by @shijieliu in https://github.com/NVIDIA-Merlin/Merlin/pull/92
    • added hdfs and s3 support by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/94
    • Arrow s3 hdfs by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/95
    • activate orc by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/96
    • Release 22.02 by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/93
    • Arrow s3 hdfs by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/98
    • Reduce containers size by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/100
    • Fix security issue jershi by @jershi425 in https://github.com/NVIDIA-Merlin/Merlin/pull/123
    • add mpi4py/onnx/onnxruntime for ctr/tf by @shijieliu in https://github.com/NVIDIA-Merlin/Merlin/pull/122
    • New containers by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/119
    • Rel 22.03 by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/124
    • CI container by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/132
    • Ci container fixes by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/133
    • Merlin Release 22.03 by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/147
    • Software Versions Tools by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/141
    • Fix inference container by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/135
    • Inference PoC with Merlin Systems by @rnyak in https://github.com/NVIDIA-Merlin/Merlin/pull/169
    • Update the Merlin README with additional edits from docs bash by @karlhigley in https://github.com/NVIDIA-Merlin/Merlin/pull/167
    • Copy MovieLens and Criteo example notebooks from NVTabular by @karlhigley in https://github.com/NVIDIA-Merlin/Merlin/pull/166
    • docs: add a docs build by @mikemckiernan in https://github.com/NVIDIA-Merlin/Merlin/pull/174
    • docs: Add documentation badge by @mikemckiernan in https://github.com/NVIDIA-Merlin/Merlin/pull/175
    • add basic tests by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/176
    • Initial attempt at support matrices by @mikemckiernan in https://github.com/NVIDIA-Merlin/Merlin/pull/182
    • Updating Example README by @bschifferer in https://github.com/NVIDIA-Merlin/Merlin/pull/183
    • docs: Add redirect page by @mikemckiernan in https://github.com/NVIDIA-Merlin/Merlin/pull/184
    • Update deploying multi-stage RecSys PoC nb by @rnyak in https://github.com/NVIDIA-Merlin/Merlin/pull/186
    • Merlin Container Release 22.04 by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/178
    • Set pandas version by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/188
    • Print more testing info by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/190
    • Test merlin-systems as part of verifying containers by @benfred in https://github.com/NVIDIA-Merlin/Merlin/pull/191
    • Update packages for scan by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/192
    • Set dask_cuda version by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/193
    • Keep HugeCTR source by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/195
    • Typo by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/198
    • removed path from feast as updated in systems by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/196
    • Fail on errors in test_container.sh by @benfred in https://github.com/NVIDIA-Merlin/Merlin/pull/194
    • docs: Add 22.04 support matrix by @mikemckiernan in https://github.com/NVIDIA-Merlin/Merlin/pull/201
    • Set pandas version in CI container by @albert17 in https://github.com/NVIDIA-Merlin/Merlin/pull/200
    • fix torch horovod: explicitly does not need tensorflow by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/206
    • Torch fix by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/207
    • Torch container horovod release version by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/215
    • Add Arrow S3 support to the tensorflow-training container by @karlhigley in https://github.com/NVIDIA-Merlin/Merlin/pull/217
    • add distributed-embeddings to merlin dockerfile by @FDecaYed in https://github.com/NVIDIA-Merlin/Merlin/pull/208
    • Revert "Add Arrow S3 support to the tensorflow-training container (… by @karlhigley in https://github.com/NVIDIA-Merlin/Merlin/pull/220
    • Install required packages for multi-stage notebooks using %pip by @karlhigley in https://github.com/NVIDIA-Merlin/Merlin/pull/221
    • WIP: Fix HDFS linking by @bashimao in https://github.com/NVIDIA-Merlin/Merlin/pull/151
    • Restructure container builds to use multi-stage builds and a Merlin base image by @karlhigley in https://github.com/NVIDIA-Merlin/Merlin/pull/234
    • add args, ARG can only handle one arg at a time by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/245
    • Args fix by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/246
    • Replace run_ensemble_on_tritonserver() with send_triton_request() and do minor updates by @rnyak in https://github.com/NVIDIA-Merlin/Merlin/pull/244
    • Add a few Python and system packages for the example notebooks by @karlhigley in https://github.com/NVIDIA-Merlin/Merlin/pull/247
    • remove horovod from torch container by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/249
    • Apply key rotation fixes to existing container Dockerfiles by @karlhigley in https://github.com/NVIDIA-Merlin/Merlin/pull/252
    • add entrypoint to all containers by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/254
    • Add hadoop scripts by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/260
    • Copy hadoop by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/261
    • Rm hadoop xtra by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/262
    • Rm e pip by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/283
    • Add Unit test for Building and deploying multi-stage Recsys nbs by @rnyak in https://github.com/NVIDIA-Merlin/Merlin/pull/281
    • docs: Use MyST and sphinx-external-toc by @mikemckiernan in https://github.com/NVIDIA-Merlin/Merlin/pull/251
    • Add an ARG which could disable distributed_embeddings by @EmmaQiaoCh in https://github.com/NVIDIA-Merlin/Merlin/pull/291
    • Add Unit test for Building and deploying multi-stage Recsys nbs by @rnyak in https://github.com/NVIDIA-Merlin/Merlin/pull/288
    • docs: Tooling and automation for support matrix by @mikemckiernan in https://github.com/NVIDIA-Merlin/Merlin/pull/290
    • Add key update mechanism for ci dockerfile by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/292
    • Rm nvm by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/296
    • Revert rm by @jperez999 in https://github.com/NVIDIA-Merlin/Merlin/pull/297

    New Contributors

    • @jershi425 made their first contribution in https://github.com/NVIDIA-Merlin/Merlin/pull/123
    • @rnyak made their first contribution in https://github.com/NVIDIA-Merlin/Merlin/pull/169
    • @FDecaYed made their first contribution in https://github.com/NVIDIA-Merlin/Merlin/pull/208
    • @bashimao made their first contribution in https://github.com/NVIDIA-Merlin/Merlin/pull/151

    Full Changelog: https://github.com/NVIDIA-Merlin/Merlin/compare/v0.7.1...v22.05

Owner
Merlin is a framework providing end-to-end GPU-accelerated recommender systems, from feature engineering to deep learning training and deployment to production
Spotify API Recommnder System

This project will access your last listened songs on Spotify using its API, then it will request the user to select 5 favorite songs in that list, on which the API will proceed to make 50 recommendat

Kevin Luke 1 Dec 14, 2021
The source code for "Global Context Enhanced Graph Neural Network for Session-based Recommendation".

GCE-GNN Code This is the source code for SIGIR 2020 Paper: Global Context Enhanced Graph Neural Networks for Session-based Recommendation. Requirement

98 Dec 28, 2022
An Efficient and Effective Framework for Session-based Social Recommendation

SEFrame This repository contains the code for the paper "An Efficient and Effective Framework for Session-based Social Recommendation". Requirements P

Tianwen CHEN 23 Oct 26, 2022
Plex-recommender - Get movie recommendations based on your current PleX library

plex-recommender Description: Get movie/tv recommendations based on your current

5 Jul 19, 2022
[ICDMW 2020] Code and dataset for "DGTN: Dual-channel Graph Transition Network for Session-based Recommendation"

DGTN: Dual-channel Graph Transition Network for Session-based Recommendation This repository contains PyTorch Implementation of ICDMW 2020 (NeuRec @ I

Yujia 25 Nov 17, 2022
Real time recommendation playground

concierge A continuous learning collaborative filter1 deployed with a light web server2. Distributed updates are live (real time pubsub + delta traini

Mark Essel 16 Nov 07, 2022
Code for ICML2019 Paper "Compositional Invariance Constraints for Graph Embeddings"

Dependencies NOTE: This code has been updated, if you were using this repo earlier and experienced issues that was due to an outaded codebase. Please

Avishek (Joey) Bose 43 Nov 25, 2022
Jointly Learning Explainable Rules for Recommendation with Knowledge Graph

Jointly Learning Explainable Rules for Recommendation with Knowledge Graph

57 Nov 03, 2022
Pytorch domain library for recommendation systems

TorchRec (Experimental Release) TorchRec is a PyTorch domain library built to provide common sparsity & parallelism primitives needed for large-scale

Meta Research 1.3k Jan 05, 2023
Movie Recommender System

Movie-Recommender-System Movie-Recommender-System is a web application using which a user can select his/her watched movie from list and system will r

1 Jul 14, 2022
Codes for AAAI'21 paper 'Self-Supervised Hypergraph Convolutional Networks for Session-based Recommendation'

DHCN Codes for AAAI 2021 paper 'Self-Supervised Hypergraph Convolutional Networks for Session-based Recommendation'. Please note that the default link

Xin Xia 124 Dec 14, 2022
Price-aware Recommendation with Graph Convolutional Networks,

PUP This is the official implementation of our ICDE'20 paper: Yu Zheng, Chen Gao, Xiangnan He, Yong Li, Depeng Jin, Price-aware Recommendation with Gr

S4rawBer2y 3 Oct 30, 2022
A large-scale recommendation algorithm library, covering classic and state-of-the-art recommender-system algorithms: LR, Wide&Deep, DSSM, TDM, MIND, Word2Vec, DeepWalk, SSR, GRU4Rec, Youtube_dnn, NCF, GNN, FM, FFM, DeepFM, DCN, DIN, DIEN, DLRM, MMOE, PLE, ESMM, MAML, xDeepFM, DeepFEFM, NFM, AFM, RALM, Deep Crossing, PNN, BST, AutoInt, FGCNN, FLEN, ListWise, and more

(Chinese docs | Simplified Chinese | English) What is a recommender system? Against the backdrop of the explosive growth of information on the internet, recommender systems are the key to helping users efficiently find the information they care about; they are also a silver bullet for products to attract users, retain them, increase stickiness, and improve conversion rates. Countless excellent products have built strong reputations on recommender systems that users can perceive, and countless companies

3.6k Dec 30, 2022
A PyTorch implementation of "Say No to the Discrimination: Learning Fair Graph Neural Networks with Limited Sensitive Attribute Information" (WSDM 2021)

FairGNN A PyTorch implementation of "Say No to the Discrimination: Learning Fair Graph Neural Networks with Limited Sensitive Attribute Information" (

31 Jan 04, 2023
The official implementation of "DGCN: Diversified Recommendation with Graph Convolutional Networks" (WWW '21)

DGCN This is the official implementation of our WWW'21 paper: Yu Zheng, Chen Gao, Liang Chen, Depeng Jin, Yong Li, DGCN: Diversified Recommendation wi

FIB LAB, Tsinghua University 37 Dec 18, 2022
Bundle Graph Convolutional Network

Bundle Graph Convolutional Network This is our Pytorch implementation for the paper: Jianxin Chang, Chen Gao, Xiangnan He, Depeng Jin and Yong Li. Bun

55 Dec 25, 2022
Global Context Enhanced Social Recommendation with Hierarchical Graph Neural Networks

SR-HGNN ICDM-2020 《Global Context Enhanced Social Recommendation with Hierarchical Graph Neural Networks》 Environments python 3.8 pytorch-1.6 DGL 0.5.

xhc 9 Nov 12, 2022
A Python implementation of LightFM, a hybrid recommendation algorithm.

LightFM Build status Linux OSX (OpenMP disabled) Windows (OpenMP disabled) LightFM is a Python implementation of a number of popular recommendation al

Lyst 4.2k Jan 02, 2023
NVIDIA Merlin is an open source library designed to accelerate recommender systems on NVIDIA’s GPUs.

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in

420 Jan 04, 2023
Recommender System Papers

Included Conferences: SIGIR 2020, SIGKDD 2020, RecSys 2020, CIKM 2020, AAAI 2021, WSDM 2021, WWW 2021

RUCAIBox 704 Jan 06, 2023