Facilitating the design, comparison and sharing of deep text matching models.

Overview


MatchZoo is a general-purpose text matching toolkit designed to make it easy to quickly implement, compare, and share the latest deep text matching models.


🔥 News: MatchZoo-py (the PyTorch version of MatchZoo) is now available.

The goal of MatchZoo is to provide a high-quality codebase for deep text matching research on tasks such as document retrieval, question answering, conversational response ranking, and paraphrase identification. With its unified data processing pipeline, simplified model configuration, and automatic hyper-parameter tuning, MatchZoo is flexible and easy to use.

Tasks                      Text 1     Text 2       Objective
Paraphrase Identification  string 1   string 2     classification
Textual Entailment         text       hypothesis   classification
Question Answering         question   answer       classification/ranking
Conversation               dialog     response     classification/ranking
Information Retrieval      query      document     ranking
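
Each objective in the table maps onto a MatchZoo task object. A minimal sketch (using the 2.x API shown in the quick-start below):

import matchzoo as mz

# Ranking: score text pairs so that better matches are ranked higher,
# e.g. ranking candidate answers for a question.
ranking_task = mz.tasks.Ranking()

# Classification: predict a discrete label for a text pair,
# e.g. paraphrase vs. non-paraphrase.
classification_task = mz.tasks.Classification(num_classes=2)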

Get Started in 60 Seconds

To train a Deep Structured Semantic Model (DSSM), import matchzoo and prepare the input data.

import matchzoo as mz

train_pack = mz.datasets.wiki_qa.load_data('train', task='ranking')
valid_pack = mz.datasets.wiki_qa.load_data('dev', task='ranking')
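
The built-in WikiQA loader returns a DataPack. Your own data can be packed the same way from a pandas DataFrame; a minimal sketch (the example row is illustrative, and the column names follow the DataPack convention of text_left / text_right / label):

import pandas as pd

df = pd.DataFrame({
    'text_left': ['how are glacier caves formed ?'],
    'text_right': ['A glacier cave is a cave formed within the ice of a glacier .'],
    'label': [1],
})
my_pack = mz.pack(df)  # a DataPack usable in place of the WikiQA packs above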

Preprocess your input data in three lines of code; the preprocessor keeps track of the parameters that need to be passed into the model.

preprocessor = mz.preprocessors.DSSMPreprocessor()
train_processed = preprocessor.fit_transform(train_pack)
valid_processed = preprocessor.transform(valid_pack)

Make use of MatchZoo's customized loss functions and evaluation metrics:

ranking_task = mz.tasks.Ranking(loss=mz.losses.RankCrossEntropyLoss(num_neg=4))
ranking_task.metrics = [
    mz.metrics.NormalizedDiscountedCumulativeGain(k=3),
    mz.metrics.MeanAveragePrecision()
]

Initialize the model and fine-tune the hyper-parameters.

model = mz.models.DSSM()
model.params['input_shapes'] = preprocessor.context['input_shapes']
model.params['task'] = ranking_task
model.guess_and_fill_missing_params()
model.build()
model.compile()

Generate pair-wise training data on the fly, and evaluate model performance on validation data using customized callbacks.

train_generator = mz.PairDataGenerator(train_processed, num_dup=1, num_neg=4, batch_size=64, shuffle=True)
valid_x, valid_y = valid_processed.unpack()
evaluate = mz.callbacks.EvaluateAllMetrics(model, x=valid_x, y=valid_y, batch_size=len(valid_x))
history = model.fit_generator(train_generator, epochs=20, callbacks=[evaluate], workers=5, use_multiprocessing=False)
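
After training, the fitted model can be evaluated on held-out data and saved for later reuse. A minimal sketch (the 'test' split and the save path are illustrative):

test_pack = mz.datasets.wiki_qa.load_data('test', task='ranking')
test_x, test_y = preprocessor.transform(test_pack).unpack()

print(model.evaluate(test_x, test_y))   # dict of metric -> score
scores = model.predict(test_x)          # raw matching scores per pair

model.save('dssm-wikiqa-model')
# reloaded = mz.load_model('dssm-wikiqa-model')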

References

Tutorials

English Documentation

Chinese Documentation (中文文档)

If you're interested in cutting-edge research progress, please take a look at awesome neural models for semantic match.

Install

MatchZoo depends on Keras and TensorFlow. There are two ways to install MatchZoo:

Install MatchZoo from PyPI:

pip install matchzoo

Install MatchZoo from the GitHub source:

git clone https://github.com/NTMC-Community/MatchZoo.git
cd MatchZoo
python setup.py install
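
Either way, a quick import check confirms that the installation worked (the printed version depends on the release you installed):

python -c "import matchzoo; print(matchzoo.__version__)"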

Models

  1. DRMM: this model is an implementation of A Deep Relevance Matching Model for Ad-hoc Retrieval.

  2. MatchPyramid: this model is an implementation of Text Matching as Image Recognition

  3. ARC-I: this model is an implementation of Convolutional Neural Network Architectures for Matching Natural Language Sentences

  4. DSSM: this model is an implementation of Learning Deep Structured Semantic Models for Web Search using Clickthrough Data

  5. CDSSM: this model is an implementation of Learning Semantic Representations Using Convolutional Neural Networks for Web Search

  6. ARC-II: this model is an implementation of Convolutional Neural Network Architectures for Matching Natural Language Sentences

  7. MV-LSTM: this model is an implementation of A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations

  8. aNMM: this model is an implementation of aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model

  9. DUET: this model is an implementation of Learning to Match Using Local and Distributed Representations of Text for Web Search

  10. K-NRM: this model is an implementation of End-to-End Neural Ad-hoc Ranking with Kernel Pooling

  11. CONV-KNRM: this model is an implementation of Convolutional neural networks for soft-matching n-grams in ad-hoc search

  12. Models under development: Match-SRNN, DeepRank, BiMPM, ...
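
All models above share the same parameter-table interface, so swapping architectures in the quick-start mostly means changing the model class and its matching preprocessor. A hedged sketch for K-NRM (parameter names follow the 2.x convention; some models need extra parameters such as embedding sizes, which guess_and_fill_missing_params fills with defaults):

preprocessor = mz.preprocessors.BasicPreprocessor()
train_processed = preprocessor.fit_transform(train_pack)

model = mz.models.KNRM()
model.params.update(preprocessor.context)   # carries over vocab size, input shapes, etc.
model.params['task'] = ranking_task
model.guess_and_fill_missing_params()
model.build()
model.compile()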

Citation

If you use MatchZoo in your research, please use the following BibTeX entry.

@inproceedings{Guo:2019:MLP:3331184.3331403,
 author = {Guo, Jiafeng and Fan, Yixing and Ji, Xiang and Cheng, Xueqi},
 title = {MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching},
 booktitle = {Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR'19},
 year = {2019},
 isbn = {978-1-4503-6172-9},
 location = {Paris, France},
 pages = {1297--1300},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/3331184.3331403},
 doi = {10.1145/3331184.3331403},
 acmid = {3331403},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {matchzoo, neural network, text matching},
} 

Development Team


faneshion
Fan Yixing

Core Dev
ASST PROF, ICT

bwanglzu
Wang Bo

Core Dev
M.S. TU Delft

uduse
Wang Zeyi

Core Dev
B.S. UC Davis

pl8787
Pang Liang

Core Dev
ASST PROF, ICT

yangliuy
Yang Liu

Core Dev
PhD. UMASS

wqh17101
Wang Qinghua

Documentation
B.S. Shandong Univ.

ZizhenWang
Wang Zizhen

Dev
M.S. UCAS

lixinsu
Su Lixin

Dev
PhD. UCAS

zhouzhouyang520
Yang Zhou

Dev
M.S. CQUT

rgtjf
Tian Junfeng

Dev
M.S. ECNU

Contribution

Please make sure to read the Contributing Guide before creating a pull request. If you have a MatchZoo-related paper/project/component/tool, send a pull request to this awesome list!

Thank you to all the people who already contributed to MatchZoo!

Jianpeng Hou, Lijuan Chen, Yukun Zheng, Niuguo Cheng, Dai Zhuyun, Aneesh Joshi, Zeno Gantner, Kai Huang, stanpcf, ChangQF, Mike Kellogg

Project Organizers

  • Jiafeng Guo
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage
  • Yanyan Lan
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage
  • Xueqi Cheng
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage

License

Apache-2.0

Copyright (c) 2015-present, Yixing Fan (faneshion)

Issues
  • Run aNMM

    Run aNMM

    I am new to MatchZoo. I wonder how to run aNMM. The docs don't have usage instructions for aNMM. I think I have to run a script to calculate the bin_sizes for aNMM, but I cannot find where this script lies.

    Furthermore, my training data needs to have a format like the one here: https://github.com/NTMC-Community/MatchZoo/blob/master/matchzoo/datasets/toy/train.csv

    right?

    And where are the batches created? Since you have positive and negative documents for each query, the batch should contain examples with pos and neg samples, right?

    How can I load my own data?

    Thanks.

    question 
    opened by ctrado18 26
  • Suggestions for MatchZoo 2.0

    Suggestions for MatchZoo 2.0

    Anybody wanting to make suggestions for MZ 2.0, please add it in this issue.

    Here are my suggestions:

    • [x] Add docstrings for all functions and classes
    • [ ] Make MZ OS independent
    • [ ] Make MZ usable by providing custom data
    • [ ] Allow External Benchmarking
    • [ ] Siamese Recurrent Networks (Proposed Model)
    • [ ] docker, conda, virtualenv support (wishlist)

    More details at https://github.com/faneshion/MatchZoo/issues/106

    2.0 discussion 
    opened by aneesh-joshi 24
  • Reproduction of Benchmark Results

    Reproduction of Benchmark Results

    When running through the procedure described in the readme for the benchmark results of WikiQA, the reproduced values for [email protected], [email protected], and MAP are roughly half of the values shown in the table. Could you provide insight as to why this may be occurring?

    bug question 
    opened by ghost 21
  • External benchmarking on Match Zoo

    External benchmarking on Match Zoo

    Hi, I am trying to establish benchmark results on all the document similarity models at MatchZoo. While there are some established benchmarks, it would be good if we had a MatchZoo-code independent system for evaluating results.

    Eg:

        input_data -> MZ -> result_data
        result_data - > independent_evaluation_code -> metric scores (Example: [email protected], map, etc.)
    

    The current scenario is that the evaluation code is strongly ingrained in the MZ code, which can cause problems with different commits over time. As seen in https://github.com/faneshion/MatchZoo/issues/99

    1. Is there already a way of doing this? I assume TREC is for that. Could someone direct me on how to use it?
    2. Could someone direct me on how to go about writing such evaluation code? (Once developed, I will push it back into MZ and it could be like a Continuous Integration test.)

    What do you think, @faneshion @yangliuy @bwanglzu @millanbatra @mandroid6?

    Thanks!

    opened by aneesh-joshi 20
  • add preparation data for TREC data set

    add preparation data for TREC data set

    I've added all modules for processing the TREC dataset. The modifications make it possible to get a TREC-like run with corresponding ids for queries and documents, so evaluation with trec_eval is now possible, in addition to performing n-fold cross-validation with MatchZoo. Soon, I'll add programs for constructing the TREC input files that are needed by the added functions.

    opened by thiziri 18
  • Is TensorFlow 2.0 fully supported yet?

    Is TensorFlow 2.0 fully supported yet?

    As the title says: my current environment uses TF 2.0 with Keras 2.3.0, but it fails to run. The error message is as follows:

    ~/anaconda3/lib/python3.7/site-packages/keras/engine/training.py in _prepare_total_loss(self, masks)
        690 
        691                     output_loss = loss_fn(
    --> 692                         y_true, y_pred, sample_weight=sample_weight)
        693 
        694                 if len(self.outputs) > 1:
    
    TypeError: __call__() got an unexpected keyword argument 'sample_weight'
    
    question 
    opened by hezhefly 18
  • DSSM returning NaN for loss when used with tensorflow-gpu backend.

    DSSM returning NaN for loss when used with tensorflow-gpu backend.

    I have been running DSSM on quite a large dataset and was looking at tensorflow-gpu to speed up training. However, the returned loss and MAE are always NaN for both the training and evaluation phases. I have tried a very basic TensorFlow model from their tutorials and it works fine.

    I'm not really sure where to start debugging this; any help would be greatly appreciated.

    The model works fine with the cpu version of tensorflow. Example:

    model.fit(x,y, epochs=2)
    
    Epoch 1/2
    10000/10000 [==============================] - 1s 139us/step - loss: nan - mean_absolute_error: nan
    Epoch 2/2
    10000/10000 [==============================] - 1s 138us/step - loss: nan - mean_absolute_error: nan
    
    opened by MichaelStarkey 18
  • Using a model as a search engine

    Using a model as a search engine

    I see that the models usually need a text1 and text2 to perform training and predictions. Usually in search engines I just need the text2 (document) to perform the indexing step (training).

    How can I train the model like a search engine? I.e., I don't have the text1 information (query/question) and I want to index my documents.

    Does using the same text for text1 and text2 work for training?

    question 
    opened by denisb411 18
  • A question about the manner of input data to model.fit_generator()

    A question about the manner of input data to model.fit_generator()

    I find that the input data is fed to the model through an outer iteration loop, and I am not sure why it is done this way. I changed it so that I simply call model.fit_generator() to feed the data and train (because I want to use TensorBoard via a callback function).

    However, this raises an exception caused by validation_data. I traced it into the Keras internals and found that it occurs when the model starts to run evaluate_generator(): the eval data generator is empty, which leads to an exception in some epoch. Strangely, the exception does not happen in the first epoch, but only after many epochs. After tracing the code I suspect this may be a Keras bug; is that correct? Also, is the outer iteration loop used to train the model precisely to work around this problem? Any guidance would be appreciated, thanks!

    question 
    opened by Star-in-Sky 17
  • v1.0 config equivalent in v2.0

    v1.0 config equivalent in v2.0

    Hi,

    I found v2.0 no longer supports the training config from v1.0.

    Is categorical_crossentropy equivalent to losses.rank_crossentropy_loss in 2.0?

    question 2.0 
    opened by logicmd 17
  • support keras 2.3 and tensorflow 2.0

    support keras 2.3 and tensorflow 2.0

    • update requirements.txt: keras=2.3.0 and tensorflow >= 2.0.0
    • upgrade pip in .travis.yml (tf 2.0 requires pip >= 19)
    • make ranking losses inherit keras.losses.Loss to support the sample_weight keyword param
    • replace some keras.backend.tf with tf (K.tf does not exist anymore in 2.3.0 as keras is going to be synced with tf.keras and drop multi-backend)
    • add clear_session before prepare in model tests to prevent OOM during CI test

    fix #789

    opened by matthew-z 17
  • Predict a new query

    Predict a new query

    I already searched here. I am using v1 right now. Is there any sample code (I just found a broken link)? I have my trained DRMM model and want to rank documents for a new query.

    What is the current state of this in v2?

    I managed to train the model on my own custom text data with my own word embeddings. Normally I would just predict a new query, but the output is the text IDs. So for DRMM, are new words that have no embedding in the dict ignored?

    Thank you very much!

    question 
    opened by datistiquo 15
  • Use trained CDSSM model to predict on another dataset

    Use trained CDSSM model to predict on another dataset

    Hi I used MatchZoo to train a CDSSM model cdssm_classify.weights.1000. However I don't know how to use the trained model to predict on another dataset input.csv.

    The format of input.csv is below

    continue read the king of cards\tresume watching house of cards ...

    I saw the documentation says that

    python matchzoo/main.py --phase predict --model_file examples/mymodel/config/cdssm_classify.config

    However, I don't know how to generate the relation_file in the prediction phase. I presume the model is based on the ids (triletter_dict.txt, word_dict.txt, word_triletter_map.txt) generated in the training phase. If I rerun test_preparation_for_classify.py on input.csv, I guess I will get different ids (triletter_dict.txt, word_dict.txt, word_triletter_map.txt), which would cause issues when running the existing model.

    "predict": {
            "input_type": "Triletter_PointGenerator",
            "dtype": "cdssm",
            "phase": "PREDICT",
            "batch_size": 128,
            "relation_file": "/home/ec2-user/oneshot_cleaned/relation_test.txt"
        }
    
    question 
    opened by logicmd 15
  • Segmentation fault running DSSM on another dataset

    Segmentation fault running DSSM on another dataset

    python matchzoo/main.py --phase train --model_file examples/config/dssm_ranking.config 
    Using TensorFlow backend.
    2018-01-08 11:47:26.702599: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX
    {
      "inputs": {
        "test": {
          "phase": "EVAL", 
          "input_type": "Triletter_ListGenerator", 
          "batch_list": 10, 
          "relation_file": "./data/relation_test.txt", 
          "dtype": "dssm"
        }, 
        "predict": {
          "phase": "PREDICT", 
          "input_type": "Triletter_ListGenerator", 
          "batch_list": 10, 
          "relation_file": "./data/relation_test.txt", 
          "dtype": "dssm"
        }, 
        "train": {
          "relation_file": "./data/relation_train.txt", 
          "input_type": "Triletter_PairGenerator", 
          "batch_size": 100, 
          "batch_per_iter": 5, 
          "dtype": "dssm", 
          "phase": "TRAIN", 
          "query_per_iter": 3, 
          "use_iter": true
        }, 
        "share": {
          "vocab_size": 3484, 
          "embed_size": 10, 
          "target_mode": "ranking", 
          "text1_corpus": "./data/corpus_preprocessed.txt", 
          "text2_corpus": "./data/corpus_preprocessed.txt", 
          "word_triletter_map_file": "./data/word_triletter_map.txt"
        }, 
        "valid": {
          "phase": "EVAL", 
          "input_type": "Triletter_ListGenerator", 
          "batch_list": 10, 
          "relation_file": "./data/relation_valid.txt", 
          "dtype": "dssm"
        }
      }, 
      "global": {
        "optimizer": "adam", 
        "num_iters": 10, 
        "save_weights_iters": 10, 
        "learning_rate": 0.0001, 
        "test_weights_iters": 10, 
        "weights_file": "examples/weights/dssm_ranking.weights", 
        "model_type": "PY", 
        "display_interval": 10
      }, 
      "outputs": {
        "predict": {
          "save_format": "TREC", 
          "save_path": "predict.test.dssm_ranking.txt"
        }
      }, 
      "losses": [
        {
          "object_name": "rank_hinge_loss", 
          "object_params": {
            "margin": 1.0
          }
        }
      ], 
      "metrics": [
        "[email protected]", 
        "[email protected]", 
        "map"
      ], 
      "net_name": "dssm", 
      "model": {
        "model_py": "dssm.DSSM", 
        "setting": {
          "dropout_rate": 0.5, 
          "hidden_sizes": [
            100, 
            30
          ]
        }, 
        "model_path": "matchzoo/models/"
      }
    }
    [Embedding] Embedding Load Done.
    [Input] Process Input Tags. [u'train'] in TRAIN, [u'test', u'valid'] in EVAL.
    [./data/corpus_preprocessed.txt]
            Data size: 71849
    [Dataset] 1 Dataset Load Done.
    {u'relation_file': u'./data/relation_train.txt', u'vocab_size': 3484, u'embed_size': 10, u'target_mode': u'ranking', u'input_type': u'Triletter_PairGenerator', u'text1_corpus': u'./data/corpus_preprocessed.txt', u'batch_size': 100, u'batch_per_iter': 5, u'text2_corpus': u'./data/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'TRAIN', 'embed': array([[-0.18291523, -0.00574826, -0.13887608, ..., -0.13666791,
             0.00907838,  0.13784599],
           [ 0.03368587,  0.13503729,  0.00107509, ...,  0.18584302,
             0.03414046, -0.14042418],
           [ 0.03610065,  0.19066425,  0.11800677, ...,  0.14983599,
            -0.09182639, -0.0633784 ],
           ..., 
           [ 0.1179866 , -0.19746014,  0.08622313, ..., -0.02868197,
            -0.07183626,  0.06968395],
           [-0.02044802,  0.17994043, -0.0810562 , ...,  0.03050527,
             0.03873055, -0.14228183],
           [ 0.04971068,  0.16548306,  0.08958763, ...,  0.0537957 ,
             0.04853643,  0.09921838]], dtype=float32), u'query_per_iter': 3, u'use_iter': True}
    [./data/relation_train.txt]
            Instance size: 32953
    [Triletter_PairGenerator] init done
    {u'relation_file': u'./data/relation_test.txt', u'vocab_size': 3484, u'embed_size': 10, u'target_mode': u'ranking', u'input_type': u'Triletter_ListGenerator', u'batch_list': 10, u'text1_corpus': u'./data/corpus_preprocessed.txt', u'text2_corpus': u'./data/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'EVAL', 'embed': array([[-0.18291523, -0.00574826, -0.13887608, ..., -0.13666791,
             0.00907838,  0.13784599],
           [ 0.03368587,  0.13503729,  0.00107509, ...,  0.18584302,
             0.03414046, -0.14042418],
           [ 0.03610065,  0.19066425,  0.11800677, ...,  0.14983599,
            -0.09182639, -0.0633784 ],
           ..., 
           [ 0.1179866 , -0.19746014,  0.08622313, ..., -0.02868197,
            -0.07183626,  0.06968395],
           [-0.02044802,  0.17994043, -0.0810562 , ...,  0.03050527,
             0.03873055, -0.14228183],
           [ 0.04971068,  0.16548306,  0.08958763, ...,  0.0537957 ,
             0.04853643,  0.09921838]], dtype=float32)}
    [./data/relation_test.txt]
            Instance size: 25535
    List Instance Count: 1445
    [Triletter_ListGenerator] init done
    {u'relation_file': u'./data/relation_valid.txt', u'vocab_size': 3484, u'embed_size': 10, u'target_mode': u'ranking', u'input_type': u'Triletter_ListGenerator', u'batch_list': 10, u'text1_corpus': u'./data/corpus_preprocessed.txt', u'text2_corpus': u'./data/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'EVAL', 'embed': array([[-0.18291523, -0.00574826, -0.13887608, ..., -0.13666791,
             0.00907838,  0.13784599],
           [ 0.03368587,  0.13503729,  0.00107509, ...,  0.18584302,
             0.03414046, -0.14042418],
           [ 0.03610065,  0.19066425,  0.11800677, ...,  0.14983599,
            -0.09182639, -0.0633784 ],
           ..., 
           [ 0.1179866 , -0.19746014,  0.08622313, ..., -0.02868197,
            -0.07183626,  0.06968395],
           [-0.02044802,  0.17994043, -0.0810562 , ...,  0.03050527,
             0.03873055, -0.14228183],
           [ 0.04971068,  0.16548306,  0.08958763, ...,  0.0537957 ,
             0.04853643,  0.09921838]], dtype=float32)}
    [./data/relation_valid.txt]
            Instance size: 24919
    List Instance Count: 1443
    [Triletter_ListGenerator] init done
    [DSSM] init done
    [layer]: Input  [shape]: [None, 3484] 
     [Memory] Total Memory Use: 249.0977 MB          Resident: 261197824 Shared: 0 UnshareData: 0 UnshareStack: 0 
    [layer]: Input  [shape]: [None, 3484] 
     [Memory] Total Memory Use: 249.1133 MB          Resident: 261214208 Shared: 0 UnshareData: 0 UnshareStack: 0 
    [layer]: MLP    [shape]: [None, 30] 
     [Memory] Total Memory Use: 250.2773 MB          Resident: 262434816 Shared: 0 UnshareData: 0 UnshareStack: 0 
    [layer]: MLP    [shape]: [None, 30] 
     [Memory] Total Memory Use: 250.5195 MB          Resident: 262688768 Shared: 0 UnshareData: 0 UnshareStack: 0 
    [layer]: Dot    [shape]: [None, 1] 
     [Memory] Total Memory Use: 250.6992 MB          Resident: 262877184 Shared: 0 UnshareData: 0 UnshareStack: 0 
    [Model] Model Compile Done.
    Segmentation fault: 11
    
    opened by levyfan 14
  • input_dpool_index for match_pyramid

    input_dpool_index for match_pyramid

    Hi! How to build input_dpool_index for match_pyramid? Thanks

    inputs = [input_left, input_right, input_dpool_index]

    question 
    opened by Decalogue 14
  • mtrand error: 'NoneType' object cannot be interpreted as an index

    mtrand error: 'NoneType' object cannot be interpreted as an index

    hey, hoping someone can help me :)

    After this code:

    python gen_w2v.py /home/ba/MatchZoo/data/toy_example/ranking/word_dict.txt /home/ba/MatchZoo/data/toy_example/ranking/glove.840B.300d.txt /home/ba/MatchZoo/data/toy_example/ranking/embed_glove_d300

    I am getting this error:

    2196017it [9:41:20, 62.96it/s]
    load word vectors ...
    Loading vectors from /home/ba/MatchZoo/data/toy_example/ranking/word_dict.txt
    100%|████████████████████████████████████████████████████████████████| 212644/212644 [00:00<00:00, 1206949.69it/s]
    Traceback (most recent call last):
      File "gen_w2v.py", line 129, in <module>
        embeddings = load_word_embedding(word_dict, w2v_file)
      File "gen_w2v.py", line 90, in load_word_embedding
        curr_embed = (2.0 * np.random.random_sample([dim]) - 1.0) * alpha
      File "mtrand.pyx", line 861, in mtrand.RandomState.random_sample
      File "mtrand.pyx", line 167, in mtrand.cont0_array
    TypeError: 'NoneType' object cannot be interpreted as an index

    Word vectors were successfully loaded before; then this error appears when trying to load the word dict. I tried to reinstall numpy and some other packages but nothing helped.

    The word_dict file was produced by preprocess.py and looks like this (it's just a small part of the dict; it has 212644 lines):

    extrastress 190601 aceton 13407 autosom 5845 chemoradiotherapeut 14713 polarprob 71802 93,134 118825 opn/ca-125 203195 procavia 160607 pmit+ 135356 143,991 204032 noxious 20925 choedochoscop 163663 hhai. 203721 edelfosin 160238

    Thanks for any help :)

    question 
    opened by h3n0r1k 14
  • Continuous Integration

    Continuous Integration

    Realized a lot of issues came out related to installation error, I would suggest integrating MatchZoo with continuous integration.

    @faneshion @yangliuy @uduse What do you think? There is a related PR #13 but apparently, CI does not necessarily limit to flake8 testing, but also python docstring check, PEP8 check and unit tests.

    If you're willing to do so, I suggest the project administrator use his GitHub account to log in to TravisCI, which is a cloud-based integration test engine, and turn on CI for MatchZoo. It's always free for open source projects.

    Every time someone sends a PR to the master branch, CI will automatically deploy all the dependencies on the server. It's a nice approach to reduce technical debt.

    enhancement 
    opened by bwanglzu 13
  • n-cross validation with MatchZoo

    n-cross validation with MatchZoo

    MatchZoo splits the data into train, test and valid sets; hence, the test data is a subset of queries (with corresponding documents) for the ranking task. I would like to perform 5-fold cross-validation with MatchZoo in order to have all queries in the test file. Is it possible? Could you please give me some pointers? Thanks in advance.

    question 
    opened by thiziri 13
  • Missing data

    Missing data

    Hi,

    I have tried to set up the project as described, however the data is missing. Looking at the model config files, I can see that they reference unavailable folders. If you try to set up the project in a vanilla environment you will find something like this:

    IOError: [Errno 2] No such file or directory: u'../data/mq2007/embed.idf'
    
    opened by DavidGOrtega 11
  • wechat

    wechat

    Your WeChat MatchZoo group is full and I can't join; could you please add me? My WeChat ID is: hshrimp. Thank you.

    opened by hshrimp 11
  • error about matchpyramid on wikiqa

    error about matchpyramid on wikiqa

    I succeeded in training with the DRMM model. However, when I want to train with MatchPyramid, it reports the error

    /DynamicMaxPooling.py:37: RuntimeWarning: divide by zero encountered in divide stride2 = 1.0 * max_len2 / len2_one py:36: RuntimeWarning: divide by zero encountered in divide stride1 = 1.0 * max_len1 / len1_one

    I know what the error means, but what should I do about it?

    Thank you

    bug 
    opened by ckqsars 10
  • Why is there no example with a Chinese corpus?

    Why is there no example with a Chinese corpus?

    I'm not sure whether Chinese is supported; would there be any difference?

    2.0 
    opened by xxllp 10
  • Error running matchpyramid

    Error running matchpyramid

    Describe the bug

    When I ran the tutorial code https://github.com/NTMC-Community/MatchZoo/blob/master/tutorials/wikiqa/matchpyramid.ipynb, I got the following error:

    2019-02-23 22:51:08.473125: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    Epoch 1/20
     51/102 [==============>...............] - ETA: 6s - loss: 4.7849Traceback (most recent call last):
      File "msrp.py", line 150, in <module>
        history = model.fit_generator(train_generator, epochs=20, callbacks=[evaluate], workers=30, use_multiprocessing=True)
      File "/home/shixingzhou/softwares/Python-3.6.4rc1/env/lib/python3.6/site-packages/matchzoo/engine/base_model.py", line 265, in fit_generator
        verbose=verbose, **kwargs
      File "/home/shixingzhou/softwares/Python-3.6.4rc1/env/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
        return func(*args, **kwargs)
      File "/home/shixingzhou/softwares/Python-3.6.4rc1/env/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
        initial_epoch=initial_epoch)
      File "/home/shixingzhou/softwares/Python-3.6.4rc1/env/lib/python3.6/site-packages/keras/engine/training_generator.py", line 217, in fit_generator
        class_weight=class_weight)
      File "/home/shixingzhou/softwares/Python-3.6.4rc1/env/lib/python3.6/site-packages/keras/engine/training.py", line 1217, in train_on_batch
        outputs = self.train_function(ins)
      File "/home/shixingzhou/softwares/Python-3.6.4rc1/env/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
        return self._call(inputs)
      File "/home/shixingzhou/softwares/Python-3.6.4rc1/env/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
        fetched = self._callable_fn(*array_vals)
      File "/home/shixingzhou/softwares/Python-3.6.4rc1/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1399, in __call__
        run_metadata_ptr)
      File "/home/shixingzhou/softwares/Python-3.6.4rc1/env/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 526, in __exit__
        c_api.TF_GetCode(self.status.status))
    tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Dimensions of inputs should match: shape[0] = [8,1] vs. shape[1] = [7,1]
             [[{{node loss/dense_1_loss/concat_1}} = ConcatV2[N=5, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](loss/dense_1_loss/lambda_2/strid
    ed_slice, loss/dense_1_loss/lambda_4/strided_slice, loss/dense_1_loss/lambda_6/strided_slice, loss/dense_1_loss/lambda_8/strided_slice, loss/dense_1_loss/lambda_10/strided_slice,
     training/Adam/gradients/loss/dense_1_loss/Mean_3_grad/Maximum)]]
    

    And I have modified model.params['embedding_input_dim'] = preprocessor.context['vocab_size'] into model.params['embedding_input_dim'] = preprocessor.context['vocab_size'] + 1; otherwise, an error happens at model.load_embedding_matrix(embedding_matrix).

    To Reproduce

    Just https://github.com/NTMC-Community/MatchZoo/blob/master/tutorials/wikiqa/matchpyramid.ipynb

    Context

    • OS [e.g. Windows 10, macOS 10.14]: Ubuntu 14.04
    • Hardware [e.g. CPU only, GTX 1080 Ti]: CPU

    MatchZoo version is 2.0

    bug duplicate 2.0 
    opened by shizhouxing 10
  • Strange behavior for my own data - is it overfitting?

    Strange behavior for my own data - is it overfitting?

    Hey!

    I don't know if my issue is due to overfitting, because I have relatively little data, around 20,000 training samples. I do IR with just single sentences (whether they match), so for a query I test it against all documents I have. I hope I can get some advice before I test my data with the ARC models. Up to now I use a siamese CNN network (representation based) with a dropout layer at the end. This model is very good on trained data and also generalises somewhat (some cases work and some don't). I could live with that.

    But the strange thing (if it is strange?) is that if the query is just an empty string, it gets very high confidence (mostly 1!) with several documents!

    Also, if the query is just a single word which is totally out of scope (not trained, and the word does not occur in any of the documents!), it also gets high confidence with some documents.

    Has anyone had a similar experience, or any thoughts on what is going on here?

    opened by datistiquo 10
  • how to save the model on each epoch?

    how to save the model on each epoch?

    hi, I want to save the model on each epoch. I tried:

    evaluate = mz.callbacks.EvaluateAllMetrics(model, x=valid_x, y=valid_y, batch_size=16, once_every=1, model_save_path='path_to_model')

    but I get the following error : TypeError: can't pickle SwigPyObject objects

    Does anyone have an idea on how to solve this? thanks a lot!

    question 
    opened by lis-kp 10
  • How to contribute

    How to contribute

    NOTICE: BEFORE SENDING A PR, PLEASE CREATE AN ISSUE FIRST, DISCUSS WITH US THEN CONTRIBUTE. SENDING PRs DIRECTLY TO OUR CODE BASE = WASTE YOUR & OUR TIME.

    Hi all, today we officially introduced TravisCI into MatchZoo to get rid of technical debt.

    If anyone happens to find a bug or wants to propose an enhancement, please follow this pipeline:

    1. Fork the latest version of MatchZoo into your repo (core developers excluded).
    2. Create an issue under faneshion/MatchZoo describing the bug/enhancement.
    3. Clone your forked MatchZoo onto your machine and edit the code.
    4. Run make init and make test in a terminal/command line to ensure the dependency check & unit tests pass on your computer.
    5. Push to your forked repo and send the pull request; in the PR, link to the issue you created using #[issue_id] and describe what has been changed.
    6. Wait until CI passes.
    7. Wait for Codecov to generate the coverage report.
    8. We'll review the code and merge the PR.

    Thanks!

    discussion 
    opened by bwanglzu 10
  • model.fit_generator is too slow when dataset is large

    model.fit_generator is too slow when dataset is large

    # training is too slow when the dataset is large
    genfun = generator.get_batch_generator()
    history = model.fit_generator(genfun, steps_per_epoch=display_interval, epochs=1, shuffle=False, verbose=1)

    2.0 
    opened by granthst 9
  • How to train MatchZoo models on very large data sets

    How to train MatchZoo models on very large data sets

    Hi, I have a dataset with a lot of training pairs. When running MatchZoo models for training, I get the following memory error (with 10 GB of RAM):

    /usr/local/miniconda/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
      from ._conv import register_converters as _register_converters
    Using TensorFlow backend.
    Traceback (most recent call last):
      File "matchzoo/main_gpu.py", line 351, in <module>
        main(sys.argv)
      File "matchzoo/main_gpu.py", line 343, in main
        train(config, int(args.gpus))
      File "matchzoo/main_gpu.py", line 117, in train
        train_gen[tag] = generator( config = conf )
      File "~/MatchZoo/matchzoo/inputs/pair_generator.py", line 99, in __init__
        super(PairGenerator, self).__init__(config=config)
      File "~/MatchZoo/matchzoo/inputs/pair_generator.py", line 25, in __init__
        self.pair_list = self.make_pair_static(self.rel)
      File "~/MatchZoo/matchzoo/inputs/pair_generator.py", line 49, in make_pair_static
        pair_list.append( (d1, high_d2, low_d2) )
    MemoryError

    Here is the running log before this error occurs (while running MV-LSTM):

    {
      "net_name": "MVLSTM",
      "global": {
        "model_type": "PY",
        "weights_file": "~/MatchZoo/examples/trec_millionQueries/multi_grad_judgement/rerank_2k_okapi/weights/my_models/amvlstm_reranking_okapi_fold_0.weights",
        "save_weights_iters": 50,
        "num_iters": 500,
        "display_interval": 10,
        "test_weights_iters": 300,
        "optimizer": "adadelta",
        "learning_rate": 0.1
      },
      "inputs": {
        "share": {
          "text1_corpus": "~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/corpus_preprocessed.txt",
          "text2_corpus": "~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/corpus_preprocessed.txt",
          "use_dpool": false,
          "embed_size": 300,
          "embed_path": "~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/glove_extendStem_300_norm",
          "vocab_size": 667606,
          "train_embed": false,
          "target_mode": "ranking",
          "text1_maxlen": 10,
          "text2_maxlen": 40
        },
        "train": {
          "input_type": "PairGenerator",
          "phase": "TRAIN",
          "use_iter": false,
          "query_per_iter": 50,
          "batch_per_iter": 5,
          "batch_size": 128,
          "relation_file": "~/MatchZoo_latest/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/relation_train.txt"
        },
        "valid": {
          "input_type": "ListGenerator",
          "phase": "EVAL",
          "batch_list": 10,
          "relation_file": "~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/relation_valid.txt"
        },
        "test": {
          "input_type": "ListGenerator",
          "phase": "EVAL",
          "batch_list": 10,
          "relation_file": "~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/relation_test.txt"
        },
        "predict": {
          "input_type": "ListGenerator",
          "phase": "PREDICT",
          "batch_list": 10,
          "relation_file": "~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/relation_test.txt"
        }
      },
      "outputs": {
        "predict": {
          "save_format": "TREC",
          "save_path": "~/MatchZoo/examples/trec_millionQueries/multi_grad_judgement/rerank_2k_okapi/predictions/predict.test.amvlstm_fold_0.reranking.txt"
        }
      },
      "model": {
        "model_path": "matchzoo/models/",
        "model_py": "mvlstm.MVLSTM",
        "setting": {
          "hidden_size": 50,
          "topk": 100,
          "dropout_rate": 0.5
        }
      },
      "losses": [
        {
          "object_name": "rank_hinge_loss",
          "object_params": {
            "margin": 0.5
          }
        }
      ],
      "metrics": [
        "[email protected]",
        "[email protected]",
        "map"
      ]
    }
    [~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/glove_extendStem_300_norm]
    	Embedding size: 667606
    Generate numpy embed: (667606, 300)
    [Embedding] Embedding Load Done.
    [Input] Process Input Tags. odict_keys(['train']) in TRAIN, odict_keys(['valid', 'test']) in EVAL.
    [~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/corpus_preprocessed.txt]
    	Data size: 578619
    [Dataset] 1 Dataset Load Done.
    {'text1_corpus': '~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/corpus_preprocessed.txt', 'text2_corpus': '~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/corpus_preprocessed.txt', 'use_dpool': False, 'embed_size': 300, 'embed_path': '~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/glove_extendStem_300_norm', 'vocab_size': 667606, 'train_embed': False, 'target_mode': 'ranking', 'text1_maxlen': 10, 'text2_maxlen': 40, 'embed': array([[ 0.004176,  0.06535 , -0.033108, ...,  0.068528, -0.0289  ,
            -0.039592],
           [-0.025855, -0.056063, -0.016158, ..., -0.008576,  0.010382,
             0.096152],
           [-0.050733, -0.023928,  0.008703, ..., -0.048408, -0.087827,
             0.066331],
           ...,
           [ 0.100099,  0.030999, -0.053699, ..., -0.003202,  0.089405,
            -0.068456],
           [ 0.037297, -0.046732,  0.007727, ...,  0.112233, -0.03772 ,
            -0.074903],
           [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
             0.      ]], dtype=float32), 'input_type': 'PairGenerator', 'phase': 'TRAIN', 'use_iter': False, 'query_per_iter': 50, 'batch_per_iter': 5, 'batch_size': 128, 'relation_file': '~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/relation_train.txt'}
    [~/data/Robust/train_on_million_queries/multi_graddedJudgements_okapi/rerank_2k_okapi/fold_0/relation_train.txt]
    	Instance size: 101405725
    

    Thanks in advance @pl8787 @bwanglzu @faneshion @yangliuy

    question 
    opened by thiziri 9
  • Design DataPack Interface

    Design DataPack Interface


    2.0 
    opened by pl8787 9
  • fstring not supported in python 3.5

    fstring not supported in python 3.5

    Describe the bug

    Traceback (most recent call last):
      File "/MRSwithOutline/MatchZooModels.py", line 2, in <module>
        import matchzoo as mz
      File "/py3.5/lib/python3.5/site-packages/MatchZoo-2.0.0-py3.5.egg/matchzoo/__init__.py", line 18, in <module>
        from .data_pack import DataPack
      File "/py3.5/lib/python3.5/site-packages/MatchZoo-2.0.0-py3.5.egg/matchzoo/data_pack/__init__.py", line 1, in <module>
        from .data_pack import DataPack, load_data_pack
      File "/py3.5/lib/python3.5/site-packages/MatchZoo-2.0.0-py3.5.egg/matchzoo/data_pack/data_pack.py", line 230
        f"inplace parameter of {func} not documented.\n"

    To Reproduce

    import matchzoo as mz
    import os

    os.environ['CUDA_VISIBLE_DEVICES'] = '1'

    predict_pack = mz.datasets.wiki_qa.load_data('test', task='ranking')
    preprocessor = mz.preprocessors.DSSMPreprocessor()
    predict_pack_processed = preprocessor.transform(predict_pack)
    print(predict_pack_processed)

    Describe your attempts

    It seems like there is a typo in the code; please check it.


    Context

    • OS: ubuntu 16.04
    • Hardware: Tesla K80 GPU

    MatchZoo version: 2.0

    2.0 
    opened by DaoD 9
  • DSSM model.predict() scores rank does not match with the rank by dot layer cosine similarity

    DSSM model.predict() scores rank does not match with the rank by dot layer cosine similarity

    Describe the Question

    I have a trained DSSM model and wanted to compare the items ranked by the model.predict() scores against the cosine similarity scores taken after the model's dot layer. I would expect the two rankings to be the same, since model.predict() is just the final score after a linear activation, but the results are completely the opposite, and I'm trying to understand how that might be, given that the linear coefficient from the final dense layer is positive.

    Describe your attempts

    • [x] I walked through the tutorials
    • [x] I checked the documentation
    • [x] I checked to make sure that this is not a duplicate question

    1. DSSM model summary
    2. Predicted scores comparison
    3. Predicted dataframe with two sets of scores, sorted by pred_score here, which gives a completely opposite rank compared to sorting by dot score

    question 
    opened by jchen0529 0
  • set_up.py missing tensorflow

    set_up.py missing tensorflow

    Describe the bug

    The project needs TensorFlow, but setup.py does not list the package. Although requirements.txt contains it, when executing the command pip install -e . the package is not installed, which leads to a 'no module' error. Is there a reason TensorFlow is not included in setup.py?

    To Reproduce

    pip3 install -e .
    python3 -m pytest -v tests/unit_test/processor_units/test_processor_units.py
    ============================= test session starts ==============================
    platform linux -- Python 3.7.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /mnt/zejun/smp/data/python_star_2000repo/MatchZoo/venv_test_7/bin/python3.7
    cachedir: .pytest_cache
    rootdir: /mnt/zejun/smp/data/python_star_2000repo/MatchZoo
    plugins: cov-3.0.0, mock-3.6.1
    collecting ... collected 0 items / 1 error

    ==================================== ERRORS ====================================
    ___ ERROR collecting tests/unit_test/processor_units/test_processor_units.py ___
    ImportError while importing test module '/mnt/zejun/smp/data/python_star_2000repo/MatchZoo/tests/unit_test/processor_units/test_processor_units.py'.
    Hint: make sure your test modules/packages have valid Python names.
    Traceback:
    /usr/lib/python3.7/importlib/__init__.py:127: in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
    tests/unit_test/processor_units/test_processor_units.py:4: in <module>
        from matchzoo.preprocessors import units
    matchzoo/__init__.py:20: in <module>
        from . import preprocessors
    matchzoo/preprocessors/__init__.py:1: in <module>
        from . import units
    matchzoo/preprocessors/units/__init__.py:13: in <module>
        from .tokenize import Tokenize
    matchzoo/preprocessors/units/tokenize.py:2: in <module>
        from matchzoo.utils.bert_utils import is_chinese_char,
    matchzoo/utils/__init__.py:4: in <module>
        from .make_keras_optimizer_picklable import make_keras_optimizer_picklable
    matchzoo/utils/make_keras_optimizer_picklable.py:1: in <module>
        import keras
    venv_test_7/lib/python3.7/site-packages/keras/__init__.py:21: in <module>
        from tensorflow.python import tf2
    E   ModuleNotFoundError: No module named 'tensorflow'
    =========================== short test summary info ============================
    ERROR tests/unit_test/processor_units/test_processor_units.py
    !!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
    =============================== 1 error in 0.91s ===============================

    Describe your attempts

    • [x] I checked the documentation and found no answer
    • [x] I checked to make sure that this is not a duplicate issue

    Context

    • Ubutun
    bug 
    opened by zjzh 0
  • GPU-Utils is low 1%

    GPU-Utils is low 1%

    Describe the bug

    run the example in Get Started in 60 Seconds


    Context

    • OS: Ubuntu 18.04
    • Hardware: Tesla K80, CUDA 10.1, cuDNN 7.0
    • matchzoo 2.2.0, tensorflow 2.2.0, keras 2.3.0


    bug 
    opened by lonelydancer 0
  • keras should be replaced by tf.keras

    keras should be replaced by tf.keras

    Reason: https://www.pyimagesearch.com/2019/10/21/keras-vs-tf-keras-whats-the-difference-in-tensorflow-2-0/

    Solution: I will send a pull request later.

    enhancement 
    opened by songzy12 6
  • matchzoo.contrib.models.ESIM(), model.save, raise ValueError: substring not found

    matchzoo.contrib.models.ESIM(), model.save, raise ValueError: substring not found

    Describe the bug

    When I run tutorials.wikiqa.esim.ipynb and reach the last step, model.save(SAVE_PATH), it raises ValueError: substring not found.

    File "D:\ProgramData\Anaconda3\envs\Matchzoo\lib\site-packages\tensorflow_core\python\ops\variables.py", line 1150, in _shared_name return self.name[:self.name.index(":")]

    bug 
    opened by paulxin001 1
  • How to setting learning rate in the model params??

    How to setting learning rate in the model params??

    For example, when building the model, where can we set the learning rate?

    model = mz.models.MatchPyramid()
    model.params.update(preprocessor.context)
    model.params['task'] = ranking_task
    model.params['embedding_output_dim'] = 128
    model.params['embedding_input_dim'] = preprocessor.context['embedding_input_dim']
    model.params['embedding_trainable'] = True
    model.params['num_blocks'] = 2
    model.params['kernel_count'] = [8, 16]
    model.params['kernel_size'] = [[5, 5], [3, 3]]
    model.params['dpool_size'] = [3, 3]
    model.params['optimizer'] = 'adam'
    model.params['dropout_rate'] = 0.3
    model.build()
    model.compile()

    question 
    opened by gaoliming123 0
  • Import Error

    Import Error

    Describe the bug

    error occurs when import matchzoo

    Describe your attempts

    >>> import matchzoo as mz
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/app/python_venv/lib/python3.5/site-packages/matchzoo/__init__.py", line 13, in <module>
        from .data_pack import DataPack
      File "/app/python_venv/lib/python3.5/site-packages/matchzoo/data_pack/__init__.py", line 1, in <module>
        from .data_pack import DataPack, load_data_pack
      File "/app/python_venv/lib/python3.5/site-packages/matchzoo/data_pack/data_pack.py", line 215
        f'{data_file_path} already exist, fail to save')
    SyntaxError: invalid syntax

    Context

    CentOS 7, MatchZoo 2.2.0

    bug 
    opened by lovekittynine 1
  • The use of the GPU.

    The use of the GPU.

    Describe the Question


    I can't use GPU.

    Environment:

    TensorFlow-gpu==2.0.0

    >>>tf.__version__
    '2.0.0'
    

    CUDA==10.0

    ~$ nvcc -V
    Cuda compilation tools, release 10.0, V10.0.130
    
    >>> import tensorflow as tf
    >>> tf.test.is_gpu_available()
    ...
    ...
    True
    

    I called the GPU in the program.

    import os
    # os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    

    However

    ~$ nvidia-smi
    Sun Dec 20 22:20:35 2020       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  TITAN X (Pascal)    Off  | 00000000:02:00.0 Off |                  N/A |
    | 23%   36C    P8    10W / 250W |    269MiB / 12196MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  TITAN X (Pascal)    Off  | 00000000:03:00.0 Off |                  N/A |
    | 23%   40C    P8    11W / 250W |     10MiB / 12196MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  TITAN X (Pascal)    Off  | 00000000:82:00.0 Off |                  N/A |
    | 23%   41C    P8     9W / 250W |     10MiB / 12196MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  TITAN X (Pascal)    Off  | 00000000:83:00.0 Off |                  N/A |
    | 25%   44C    P0    58W / 250W |     10MiB / 12196MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0     21783      C   /home/cuifulai/Lib/Python36/bin/python3      259MiB |
    +-----------------------------------------------------------------------------+
    

    Describe your attempts

    • [√] I walked through the tutorials
    • [√] I checked the documentation
    • [√] I checked to make sure that this is not a duplicate question
    question 
    opened by Ambitioner-c 0
  • Hello, as long as I use a structure with a convolutional layer, memory overflow will occur for small data. How to solve the memory overflow?

    Hello, as long as I use a structure with a convolutional layer, memory overflow will occur for small data. How to solve the memory overflow?

    Describe the Question


    Describe your attempts

    • [x] I walked through the tutorials
    • [x] I checked the documentation
    • [x] I checked to make sure that this is not a duplicate question


    question 
    opened by zhangchen-fzu 1
  • Using callbacks for early stopping in DSSM

    Using callbacks for early stopping in DSSM

    Describe the Question


    Describe your attempts

    • [x] I walked through the tutorials
    • [x] I checked the documentation
    • [x] I checked to make sure that this is not a duplicate question




    Hello, I'm trying to run the DSSM code, and I want to use keras.callbacks.EarlyStopping with it. I ran the DSSM tutorial, and the only thing I changed was the last few lines.

    Original code was like this,

    train_generator = mz.DataGenerator(train_pack_processed, mode='pair', num_dup=1, num_neg=4, batch_size=32, shuffle=True)
    len(train_generator)
    
    history = model.fit_generator(train_generator, epochs=20, callbacks=[evaluate], workers=5, use_multiprocessing=False)
    

    And what I changed was like this.

    train_generator = mz.DataGenerator(train_pack_processed, mode='pair', num_dup=1, num_neg=4, batch_size=32, shuffle=True)
    len(train_generator)
    
    from keras.callbacks import EarlyStopping
    from keras.callbacks import ModelCheckpoint
    from keras.models import load_model
    
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
    
    history = model.fit_generator(train_generator, epochs=2000, callbacks=[evaluate, es], workers=5, use_multiprocessing=False)
    

    And there was an error.

    Epoch 1/2000
    17/17 [==============================] - 2s 95ms/step - loss: 1.5384
    Validation: [email protected](0.0): 0.021622386820576125 - [email protected](0.0): 0.029349502492551117 - mean_average_precision(0.0): 0.0341616525519229
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-17-5e8d82ce978c> in <module>()
    ----> 1 history = model.fit_generator(train_generator, epochs=2000, callbacks=[evaluate, es], workers=5, use_multiprocessing=False)
    
    6 frames
    /usr/local/lib/python3.6/dist-packages/matchzoo/engine/base_model.py in fit_generator(self, generator, epochs, verbose, **kwargs)
        274             generator=generator,
        275             epochs=epochs,
    --> 276             verbose=verbose, **kwargs
        277         )
        278 
    
    /usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
         89                 warnings.warn('Update your `' + object_name + '` call to the ' +
         90                               'Keras 2 API: ' + signature, stacklevel=2)
    ---> 91             return func(*args, **kwargs)
         92         wrapper._original_function = func
         93         return wrapper
    
    /usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
       1730             use_multiprocessing=use_multiprocessing,
       1731             shuffle=shuffle,
    -> 1732             initial_epoch=initial_epoch)
       1733 
       1734     @interfaces.legacy_generator_methods_support
    
    /usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
        258                     break
        259 
    --> 260             callbacks.on_epoch_end(epoch, epoch_logs)
        261             epoch += 1
        262             if callbacks.model.stop_training:
    
    /usr/local/lib/python3.6/dist-packages/keras/callbacks/callbacks.py in on_epoch_end(self, epoch, logs)
        150         logs = logs or {}
        151         for callback in self.callbacks:
    --> 152             callback.on_epoch_end(epoch, logs)
        153 
        154     def on_train_batch_begin(self, batch, logs=None):
    
    /usr/local/lib/python3.6/dist-packages/keras/callbacks/callbacks.py in on_epoch_end(self, epoch, logs)
        814 
        815     def on_epoch_end(self, epoch, logs=None):
    --> 816         current = self.get_monitor_value(logs)
        817         if current is None:
        818             return
    
    /usr/local/lib/python3.6/dist-packages/keras/callbacks/callbacks.py in get_monitor_value(self, logs)
        844                 'Early stopping conditioned on metric `%s` '
        845                 'which is not available. Available metrics are: %s' %
    --> 846                 (self.monitor, ','.join(list(logs.keys()))), RuntimeWarning
        847             )
        848         return monitor_value
    
    TypeError: sequence item 1: expected str instance, NormalizedDiscountedCumulativeGain found
    

    I'm attaching this sample code for early stopping:

    # mlp overfit on the moons dataset with simple early stopping
    from sklearn.datasets import make_moons
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.callbacks import EarlyStopping 
    from matplotlib import pyplot
    # generate 2d classification dataset
    X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # simple early stopping
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1) 
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=1, callbacks=[es]) 
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot training history
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()
    

    When I use keras.callbacks.EarlyStopping I need to set the monitor argument, the criterion by which the model decides whether to stop. In the sample code above I use the 'accuracy' metric, and the reported quantities are loss, accuracy, val_loss and val_accuracy, so I can pass 'val_loss' as the monitor. In the DSSM code, performance is reported as a loss, two NDCG scores and a MAP score. How can I choose one of these scores for the early stopping callback? And is there any other way you use to stop training at the right time? I'm not very experienced with deep learning yet. Thank you in advance.
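    One workaround I'm considering (just a rough sketch I put together myself, so the class below and my assumption that MatchZoo's evaluate() returns a dict keyed by metric objects may be off) is a small custom callback that evaluates one MatchZoo metric on the validation data each epoch and stops training when it stops improving, instead of using keras.callbacks.EarlyStopping:

    import keras

    class MetricEarlyStopping(keras.callbacks.Callback):
        """Stop training when a chosen MatchZoo metric stops improving on validation data."""

        def __init__(self, mz_model, x, y, metric_name='mean_average_precision(0.0)', patience=5):
            super().__init__()
            self._mz_model = mz_model      # the MatchZoo model wrapper (same `model` used for the evaluate callback)
            self._x, self._y = x, y        # the unpacked validation data used by the evaluate callback
            self._metric_name = metric_name
            self._patience = patience
            self._best = -float('inf')
            self._wait = 0

        def on_epoch_end(self, epoch, logs=None):
            # The evaluate() results seem to be keyed by metric *objects* (which is why
            # keras' EarlyStopping crashes above), so match the wanted metric by its string form.
            results = self._mz_model.evaluate(self._x, self._y)
            current = next(v for k, v in results.items() if str(k) == self._metric_name)
            if current > self._best:
                self._best, self._wait = current, 0
            else:
                self._wait += 1
                if self._wait >= self._patience:
                    print('Epoch %d: %s stopped improving, stopping.' % (epoch + 1, self._metric_name))
                    self.model.stop_training = True

    es = MetricEarlyStopping(model, valid_x, valid_y, patience=5)
    history = model.fit_generator(train_generator, epochs=2000, callbacks=[es], workers=5, use_multiprocessing=False)

    Would a callback like this be a reasonable substitute, or is there a built-in way to do it?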

    question 
    opened by saekomdalkom 0
Releases(v2.2)
  • v2.2(Oct 9, 2019)

  • v2.1 (Apr 4, 2019)

    • add automation modules (see the sketch after these notes)
      • mz.auto.tuner, which automatically searches for model hyper-parameters
      • mz.auto.preparer, which unifies the model preprocessing and training processes
    • add the QuoraQP dataset
    • rewrite mz.DataGenerator to be callback-based
    • fix model behavior under classification tasks
    • reorganize the project structure, most significantly moving processor_units to preprocessors.units
    • trim redundant names (e.g. NaiveModel -> Naive, TokenizeUnit -> Tokenize)
    • update the tutorials
    • various other updates
    Source code(tar.gz)
    Source code(zip)
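As a rough illustration of the automation modules added in v2.1, a minimal tuning run might look like the sketch below. The constructor arguments shown (params, train_data, test_data, num_runs) are assumptions based on the model-tuning tutorial rather than a verified API, so check the documentation of the installed release before relying on them.

import matchzoo as mz

# Hypothetical sketch: search a built model's hyper-parameters with mz.auto.
# The argument names are assumptions taken from the tuning tutorial and may
# differ in the installed release.
tuner = mz.auto.Tuner(
    params=model.params,            # hyper-parameter table of a built, compiled model
    train_data=train_processed,     # preprocessed training DataPack
    test_data=valid_processed,      # preprocessed validation DataPack
    num_runs=5                      # number of search trials
)
results = tuner.tune()              # trials plus the best configuration found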
Owner
Neural Text Matching Community