Methylation/modified base calling separated from basecalling.


Remora

Methylation/modified base calling separated from basecalling. Remora primarily provides an API to call modified bases for basecaller programs such as Bonito. Remora also provides the tools to prepare datasets, train modified base models and run simple inference.

Installation

Install from PyPI:

pip install ont-remora

Install from GitHub source for development:

git clone git@github.com:nanoporetech/remora.git
pip install -e remora/[tests]

It is recommended that Remora be installed in a virtual environment. For example, python3.8 -m venv --prompt remora --copies venv; source venv/bin/activate.

See help for any Remora sub-command with the -h flag.

Getting Started

Remora models are trained to perform binary or categorical prediction of modified base content of a nanopore read. Models may also be trained to perform canonical base prediction, but this feature may be removed at a later time. The rest of the documentation will focus on the modified base detection task.

The Remora training/prediction input unit (referred to as a chunk) consists of:

  1. Section of normalized signal
  2. Canonical bases attributed to the section of signal
  3. Mapping between these two

Chunks have a fixed signal length defined at data preparation time and saved as a model attribute. A fixed position within the chunk is defined as the "focus position". This position is the center of the base of interest.
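
As a rough illustration of the chunk definition above, the following sketch cuts a fixed-length window of signal around a focus position. The function name extract_chunk, the rule of taking the midpoint of the base's signal span as the focus position, and the choice to skip chunks that would run off either end of the signal are all assumptions made for illustration; this is not Remora's internal implementation.

```python
from typing import List, Optional, Sequence, Tuple


def extract_chunk(
    sig: Sequence[float],
    seq_to_sig_map: Sequence[int],
    base_index: int,
    chunk_context: Tuple[int, int] = (50, 50),
) -> Optional[List[float]]:
    """Return a fixed-length window of signal centred on one base.

    Hypothetical helper for illustration only (not part of Remora).
    """
    before, after = chunk_context
    # Signal span assigned to the base of interest.
    start, end = seq_to_sig_map[base_index], seq_to_sig_map[base_index + 1]
    # Assumed focus position: midpoint of the base's signal span.
    focus = (start + end) // 2
    lo, hi = focus - before, focus + after
    # Assumption: drop chunks that would extend beyond the signal.
    if lo < 0 or hi > len(sig):
        return None
    return list(sig[lo:hi])


# Toy read: 4 bases with 10 signal points each.
sig = [float(i) for i in range(40)]
seq_to_sig_map = [0, 10, 20, 30, 40]
chunk = extract_chunk(sig, seq_to_sig_map, base_index=1, chunk_context=(5, 5))
print(len(chunk), chunk[0], chunk[-1])
```

Every chunk returned by this sketch has the same length (before + after), which mirrors the fixed signal length described above.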

Pre-trained Models

Pre-trained models are included in the Remora repository. To see the selection of models included in the current installation run remora model list_pretrained.

Python API

The Remora API can be applied to make modified base calls given a basecalled read via a RemoraRead object. sig should be a float32 numpy array. seq is a string derived from sig (can be either basecalls or other downstream derived sequence; e.g. mapped reference positions). seq_to_sig_map should be an int32 numpy array of length len(seq) + 1 and elements should be indices within sig array assigned to each base in seq.

from remora.model_util import load_model
from remora.data_chunks import RemoraRead
from remora.inference import call_read_mods

# sig, seq and seq_to_sig_map must already be defined as described above
model, model_metadata = load_model("remora_train_results/model_best.onnx")
read = RemoraRead(sig, seq_to_sig_map, str_seq=seq)
# the second return value is not used in this example
mod_probs, _, pos = call_read_mods(
  read,
  model,
  model_metadata,
  return_mod_probs=True,
)

mod_probs will contain the probability of each modeled modified base as found in model_metadata["mod_long_names"]. For example, run mod_probs.argmax(axis=1) to obtain the prediction for each input unit. pos contains the position (index in input sequence) for each prediction within mod_probs.
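
The array shapes involved can be illustrated without a trained model. The sketch below builds toy inputs that satisfy the constraints described above and then interprets a made-up mod_probs array with argmax. All values, the check_read_inputs helper and the mod_long_names list are hypothetical. The assumptions that mapping indices are non-decreasing and that the columns of mod_probs follow the order of model_metadata["mod_long_names"] should be checked against the installed Remora version.

```python
import numpy as np


def check_read_inputs(sig: np.ndarray, seq: str, seq_to_sig_map: np.ndarray) -> None:
    """Hypothetical sanity checks mirroring the constraints described above."""
    assert sig.dtype == np.float32, "sig should be float32"
    assert seq_to_sig_map.dtype == np.int32, "seq_to_sig_map should be int32"
    assert seq_to_sig_map.shape == (len(seq) + 1,), "map needs len(seq) + 1 entries"
    # Indices must lie within sig; non-decreasing order is an assumption.
    assert seq_to_sig_map[0] >= 0 and seq_to_sig_map[-1] <= sig.size
    assert np.all(np.diff(seq_to_sig_map) >= 0)


# Toy inputs: 4 bases with 5 signal points per base.
seq = "ACGT"
sig = np.random.default_rng(0).normal(size=20).astype(np.float32)
seq_to_sig_map = np.arange(0, 21, 5, dtype=np.int32)
check_read_inputs(sig, seq, seq_to_sig_map)

# Made-up output for two positions and two modelled modified bases.
mod_long_names = ["5hmC", "5mC"]  # stand-in for model_metadata["mod_long_names"]
mod_probs = np.array([[0.10, 0.85], [0.60, 0.05]], dtype=np.float32)
pos = np.array([1, 3])
best = mod_probs.argmax(axis=1)
calls = [(int(p), mod_long_names[b]) for p, b in zip(pos, best)]
print(calls)
```

Pairing pos with the argmax result in this way gives one (sequence index, modified base name) tuple per prediction.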

Data Preparation

Remora data preparation begins from Taiyaki mapped signal files generally produced from Megalodon containing modified base annotations. This requires installation of Taiyaki via pip install git+https://github.com/nanoporetech/taiyaki.

An example dataset might be pre-processed with the following commands.

megalodon \
  pcr_fast5s/ \
  --reference ref.mmi \
  --output-directory mega_res_pcr \
  --outputs mappings signal_mappings \
  --num-reads 10000 \
  --guppy-config dna_r9.4.1_450bps_fast.cfg \
  --devices 0 \
  --processes 20
# Note the --ref-mods-all-motifs option defines the modified base characteristics
megalodon \
  sssI_fast5s/ \
  --ref-mods-all-motifs m 5mC CG 0 \
  --reference ref.mmi \
  --output-directory mega_res_sssI \
  --outputs mappings signal_mappings \
  --num-reads 10000 \
  --guppy-config dna_r9.4.1_450bps_fast.cfg \
  --devices 0 \
  --processes 20

python \
  taiyaki/misc/merge_mappedsignalfiles.py \
  mapped_signal_train_data.hdf5 \
  --input mega_res_pcr/signal_mappings.hdf5 None \
  --input mega_res_sssI/signal_mappings.hdf5 None \
  --allow_mod_merge \
  --batch_format
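
The two Megalodon calls above differ only in the input directory, the output directory and the --ref-mods-all-motifs option. When scripting the pre-processing, that shared structure could be captured as in the sketch below. The megalodon_cmd helper is hypothetical; it only assembles the arguments already shown above and does not run anything.

```python
import shlex
from typing import List, Optional


def megalodon_cmd(
    fast5_dir: str, out_dir: str, ref_mods: Optional[List[str]] = None
) -> List[str]:
    """Assemble the megalodon argument list used in the commands above."""
    cmd = ["megalodon", fast5_dir]
    if ref_mods is not None:
        # e.g. ["m", "5mC", "CG", "0"] for the fully methylated sample
        cmd += ["--ref-mods-all-motifs", *ref_mods]
    cmd += [
        "--reference", "ref.mmi",
        "--output-directory", out_dir,
        "--outputs", "mappings", "signal_mappings",
        "--num-reads", "10000",
        "--guppy-config", "dna_r9.4.1_450bps_fast.cfg",
        "--devices", "0",
        "--processes", "20",
    ]
    return cmd


pcr = megalodon_cmd("pcr_fast5s/", "mega_res_pcr")
sss = megalodon_cmd("sssI_fast5s/", "mega_res_sssI", ["m", "5mC", "CG", "0"])
print(shlex.join(pcr))
print(shlex.join(sss))
# To execute a command: subprocess.run(pcr, check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems if paths contain spaces.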

After the construction of a training dataset, chunks must be extracted and saved in a Remora-friendly format. The following command performs this task in Remora.

remora \
  dataset prepare \
  mapped_signal_train_data.hdf5 \
  --output-remora-training-file remora_train_chunks.npz \
  --motif CG 0 \
  --mod-bases m \
  --chunk-context 50 50 \
  --kmer-context-bases 6 6 \
  --max-chunks-per-read 20 \
  --log-filename log.txt

The resulting remora_train_chunks.npz file can then be used to train a Remora model.
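
To make the --motif CG 0 argument in the command above more concrete, the sketch below lists the sequence positions it would select, on the assumption that the two values are a motif and the zero-based offset of the base of interest within that motif. The motif_focus_positions helper is hypothetical and handles only literal motifs (no ambiguity codes such as N); it illustrates the concept and is not Remora's implementation.

```python
from typing import List


def motif_focus_positions(seq: str, motif: str, focus_offset: int) -> List[int]:
    """Return indices in seq of the focus base for each motif occurrence."""
    hits = []
    start = seq.find(motif)
    while start != -1:
        hits.append(start + focus_offset)
        # Step by one so overlapping occurrences are also found.
        start = seq.find(motif, start + 1)
    return hits


print(motif_focus_positions("ACGTTCGCG", "CG", 0))
```

With offset 0 the selected positions are the C of each CG, which is where a 5mC call would be made.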

Model Training

Models are trained with the remora model train command, for example:

remora \
  model train \
  remora_train_chunks.npz \
  --model remora/models/Conv_w_ref.py \
  --device 0 \
  --output-path remora_train_results

This command will produce a final model in ONNX format for use in Bonito, Megalodon or remora infer commands.

Model Inference

For testing purposes, inference within Remora is provided given Taiyaki mapped signal files as input. The commands below call the held-out validation dataset from the data preparation section above.

remora \
  infer from_taiyaki_mapped_signal \
  mega_res_pcr/split_signal_mappings.split_a.hdf5 \
  remora_train_results/model_best.onnx \
  --output-path remora_infer_results_pcr.txt \
  --device 0
remora \
  infer from_taiyaki_mapped_signal \
  mega_res_sssI/split_signal_mappings.split_a.hdf5 \
  remora_train_results/model_best.onnx \
  --output-path remora_infer_results_sssI.txt \
  --device 0

Note that in order to perform inference on a GPU device the onnxruntime-gpu package must be installed.

GPU Troubleshooting

Note that standard Remora models are small enough to run quite quickly on CPU resources, and this is the primary recommendation. Running Remora models on GPU compute resources is considered experimental with minimal support.

Deployment of Remora models is facilitated by the Open Neural Network Exchange (ONNX) format. The onnxruntime python package is used to run the models. In order to support running models on GPU resources the GPU compatible package must be installed (pip install onnxruntime-gpu).

Once installed the remora infer command takes a --device argument. Similarly, the API remora.model_util.load_model function takes a device argument. These arguments specify the GPU device ID to use for inference.

Once the device option is specified, Remora will attempt to load the model on the GPU resources. If this fails, a RemoraError will be raised. The likely cause is a mismatch in the required CUDA and cuDNN dependency versions. See the requirements table on the onnxruntime documentation page.
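
Before comparing CUDA and cuDNN versions, it can help to confirm that the installed onnxruntime build exposes a GPU execution provider at all. The sketch below uses onnxruntime's get_available_providers function for this; the cuda_provider_available wrapper is a hypothetical helper that also returns False when onnxruntime is not installed.

```python
def cuda_provider_available() -> bool:
    """Report whether the installed onnxruntime build lists the CUDA provider."""
    try:
        import onnxruntime as ort
    except ImportError:
        # Neither onnxruntime nor onnxruntime-gpu is installed.
        return False
    return "CUDAExecutionProvider" in ort.get_available_providers()


print("CUDA execution provider available:", cuda_provider_available())
```

A True result only shows that the package was built with CUDA support. Loading a model can still fail if the CUDA or cuDNN libraries found at runtime do not match the versions that build expects.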

To check the versions of the various dependencies see the following commands.

# check cuda version
nvcc --version
# check cuDNN version
grep -A 2 "define CUDNN_MAJOR" `whereis cudnn | cut -f2 -d" "`
# check onnxruntime version
python -c "import onnxruntime as ort; print(ort.__version__)"

These versions should match a row in the requirements table in the onnxruntime documentation. CUDA and cuDNN can be downloaded from the NVIDIA website. The cuDNN download can be specified at runtime as in the following example.

CUDA_PATH=/path/to/cuda/include/cuda.h \
  CUDNN_H_PATH=/path/to/cuda/include/cudnn.h \
  remora \
  infer [arguments]

The onnxruntime dependency can be set via the python package install command. For example pip install "onnxruntime-gpu<1.7".

Terms and Licence

This is a research release provided under the terms of the Oxford Nanopore Technologies' Public Licence. Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. Much as we would like to rectify every issue, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid change by Oxford Nanopore Technologies.

© 2021 Oxford Nanopore Technologies Ltd. Remora is distributed under the terms of the Oxford Nanopore Technologies' Public Licence.


Comments
  • Error while installing remora


    Hello Everyone,

    I am currently trying to install Remora and the basecaller Bonito on our HPC. I am using the pip install command, but I always get this error:

          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'remora.trained_models' as an importable package, however it is
          included in the distribution as "data".
          This behavior is likely to change in future versions of setuptools (and
          therefore is considered deprecated).
      
          Please make sure that 'remora.trained_models' is included as a package by using
          setuptools' `packages` configuration field or the proper discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
      
          You can read more about "package discovery" and "data files" on setuptools
          documentation page.
      
      
      !!
      
        check.warn(importable)
      error: command 'icc' failed: No such file or directory
      [end of output]
    

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for ont-remora
    Failed to build ont-remora
    ERROR: Could not build wheels for ont-remora, which is required to install pyproject.toml-based projects

    Maybe this is a known issue or someone can help me out. I am using a PyPi mirror currently since the HPC has no net connection.

    I would appreciate any help!

    kind regards,

    Azlan

    opened by AzlanNI 16
  • how to interpret the results from "remora infer from_taiyaki_mapped_signal"

    Thanks for the great tool!

    Just wondering how the "read_pos" in the results from "remora infer from_taiyaki_mapped_signal" are chosen - so which positions of a read are shown in the result, please? If only modified positions are shown, why will we have different class_pred values? Also the read_pos is the relative position within the read but not the position

    And for "class_pred", 1 means modified and 0 means unmodified, don't they?

    What's the meaning of "label", please?

    Thanks! Jon

    opened by jon-xu 10
  • Insect 5hmC values anomalous


    I am working with an insect genome and trying to call 5mC and optionally 5hmC. When using megalodon with the --remora-modified-bases dna_r9.4.1_e8 model calling 5mC only, I get around 6% 5mC (too high). When calling 5hmC_5mC, I get 0.55% 5mC (about right), but near 60% 5hmC, which seems FAR too high to be realistic. I've never heard of an insect with such incredibly high 5hmC.

    1. Shouldn't the 5mC values match from both calls?
    2. What's going on with the 5hmC? If I can't trust that one, why should I trust the 5hmC levels?

    My calls are

    1. megalodon /path/to/wasp-runs/ --sort-mappings --outputs mod_mappings mods per_read_mods --reference waspassembly.fasta --devices 0 --processes 23 --output-directory megalodon-out-5mc-sup --guppy-params " --use_tcp" --overwrite --guppy-server-path /opt/ont/guppy/bin/guppy_basecall_server --guppy-config dna_r9.4.1_450bps_sup.cfg --remora-modified-bases dna_r9.4.1_e8 sup 0.0.0 5mc CG 0
    2. megalodon /path/to/wasp-runs/ --sort-mappings --outputs mod_mappings mods per_read_mods --reference waspassembly.fasta --devices 0 --processes 23 --output-directory megalodon-out-5hmC-5mc-sup --guppy-params " --use_tcp" --overwrite --guppy-server-path /opt/ont/guppy/bin/guppy_basecall_server --guppy-config dna_r9.4.1_450bps_sup.cfg --remora-modified-bases dna_r9.4.1_e8 sup 0.0.0 5hmc_5mc CG 0
    opened by dithiii 7
  • Installation of taiyaki


    Hello!

    I am attempting to get everything prepared for training remora models.

    I have megalodon installed and I am attempting to install taiyaki. I followed what was suggested in the README, namely: Remora data preparation begins from Taiyaki mapped signal files generally produced from Megalodon containing modified base annotations. This requires installation of Taiyaki via pip install git+https://github.com/nanoporetech/taiyaki.

    However, when I run this command in a clean python virtual environment, I encounter the following error: [screenshot of the error, "Screen Shot 2022-03-17 at 4 11 17 PM"]

    Any thoughts on why this is happening? Are there any specific requirements needed to install taiyaki in this way?

    Thanks, Paul

    opened by pwh124 7
  • Is remora 5hmC/5mC ready for "prime time"?

    Is the 5hmC/5mC remora mode quantitative enough for biological inference now? Unfortunately I haven't seen any benchmarking papers/preprints out there, and I haven't seen any data on 5hmC performance aside from its introduction in some of the nanopore conferences.

    We know that the regular 5mC model is essentially as good/better than bisulfite 5mC calling. Do you have that information for 5hmC/5mC?

    opened by billytcl 6
  • Several issues on remora usage


    Hi, Thanks for this amazing tool. I have several questions and would really appreciate it if you can help.

    1. What's the difference between the pre-trained models dna_r9.4.1_e8 and dna_r9.4.1_e8.1?

    2. What's the relationship of the remora pre-trained models and models in rerio repo? In the latest Megalodon it seems that Megalodon will call the remora model, but in the previous one Megalodon is using rerio model. A little confused here.

    3. Is remora independent from Megalodon and Taiyaki? Will remora replace Megalodon somehow in the future? Can you provide more information on its usages?

    4. Is remora a new methylation calling tool or not? If so, how can I use the remora to call methylations? Any plan on a detailed tutorial like Megalodon?

    5. Do you have any plan to release the training datasets for remora shown on NCM2021?

    6. It seems that in default remora ont-pyguppy-client-lib==5.1.9, however, the latest version of Guppy is only 5.0.16 in the community. Is there a delay for the Guppy release? Or is it possible to use remora on the older version of Guppy?

    Thank you so much for your help!

    Best, Ziwei

    question 
    opened by PanZiwei 6
  • Install ont-remora==2.0.0 failed, due to pod5 install failed


    When I install 2.0.0, it failed:

    pip install ont-remora==2.0.0
    Collecting ont-remora==2.0.0
      Using cached ont-remora-2.0.0.tar.gz (76 kB)
      Installing build dependencies ... done
      Getting requirements to build wheel ... done
      Installing backend dependencies ... done
      Preparing metadata (pyproject.toml) ... done
    Collecting requests
      Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
         |████████████████████████████████| 63 kB 786 kB/s
    ERROR: Could not find a version that satisfies the requirement pod5>=0.0.43 (from ont-remora) (from versions: none)
    ERROR: No matching distribution found for pod5>=0.0.43
    

    However, when I install pod5, still failed:

    opened by liuyangzzu 5
  • Model improvement questions


    Greetings,

    This is mostly about how to improve the quality of Remora models; a few other questions follow below. I have trained a custom modification Remora model and used it to basecall a modified strand of DNA. The resulting model is pretty poor in that it mistakes unmodified CG sites for modified ones. I presumed this was due to poor modification efficiency on my end. I used the default 0.0.0 Remora mC model to basecall a methylated control strand and trained a model from it as well. I was surprised to see that my mC model was of poor quality as well. I was wondering if you have any suggestions on how to improve the model training itself, as I am unable to train a basic mC model using nearly 100% methylated DNA strands. I'm attaching some IGV pictures visualizing Remora's pre-trained mC model, a trained mC model, and a custom model for our modification, in that order:

    Megalodon mod_mappings using a pre-trained Remora 5mC model [IGV screenshot: rem-pre-mC]

    Megalodon mod_mappings using a 5mC model trained by me [IGV screenshot: rem-cmC]

    Megalodon mod_mappings using a 5ahyC model trained by me [IGV screenshot: rem-ahyC]

    Note that IGV shows 5ahyC in blue just like 5hmC

    The pre-trained model makes me believe that the methylation efficiency is sufficient. It is rather the model training where I could make some improvement. I used the workflow written in the repo's README. The models were prepared on a different ~1kb substrate of relatively spaced CG motifs (similar amount and spacing as in the pictures). At this moment I have a few questions regarding this type of model training and some unrelated ones:

    1. Do you perhaps have any suggestions how I could improve the model training process? Some settings to fiddle with? Or is it the substrate that is lacking?
    2. How exactly is the hmC-mC model trained? Is it possible to train a model which could separate hmC and my custom modification as the hmC-mC model does?
    3. Is it possible to train a model using only + strands? I.e. mapping the signals of + strands only or separating them afterwards? Or rather, is there a way to process a fast5 file to separate the strands, assuming the sequence is not palindromic and is barcoded? This is important since we have difficulty modifying both strands.
    4. What exactly is described by accuracy when a remora model is in training? Note that my trained models had >0.99 accuracy
    5. More importantly what kind of substrate do you recommend for model creation? CG content/length etc.?
    opened by jorisbalc 5
  • Running Megalodon for Remora 5mC_all_context_sup_r1041_e82 model


    We are interested in trying out doing methylation calling on data generated from an R10.4.1 flow cell that has already been basecalled using the SUP basecalling model.

    From everything we've read, it seems like this is the exact use for the Rerio model: 5mC_all_context_sup_r1041_e82

    I had a few questions about the logistics of actually running this model though. We have successfully downloaded the file and have the .onnx, but I'm not sure what we should be using for the following parameters:

    Should we use the --do-not-use-guppy-server command if we want to use the basecalling that has already been done and is in our fast5 files?

    If not, what should we specify for our --guppy-config file? My intuition is: dna_r10.4.1_e8.2_260bps_sup.cfg, but when we try this it times out without ever starting. When looking at the logs: "Could not load guppy server configuration state: 'Configurations'"

    For this rerio model, should we specify --remora-modified-bases?

    If this is of any help, here is basically what we are trying, which does run:

    megalodon /SSD/TestData/fast5_pass/ --reference testref.mmi --devices 0 --guppy-server-path /opt/ont/guppy/bin/basecall_server --outputs mod_mappings mods mappings --output-directory /SSD/test_directory/ --processes 30 --remora-model /opt/ont/guppy/data/remora_models_5mc_all_context_sup_r1041_e82.onnx --guppy-config dna_r10.3_450bps_hac.cfg

    It seems to be running, but I am not sure if it is appropriate, in particular the guppy-config file.

    Thanks in advance

    opened by jcolicchio-soundag 5
  • 'Remora model list_pretrained'


    Hello. Thank you very much for sharing this useful tool. When I install 'remora' from the GitHub source for development, it succeeds. However, when I run 'remora model list_pretrained', there is an error:

    '''
    Traceback (most recent call last):
      File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/bin/remora", line 33, in <module>
        sys.exit(load_entry_point('ont-remora', 'console_scripts', 'remora')())
      File "/lustre/home/rongqiao/remora1.1.0/remora/src/remora/main.py", line 69, in run
        cmd_func(args)
      File "/lustre/home/rongqiao/remora1.1.0/remora/src/remora/parsers.py", line 674, in run_list_pretrained
        from remora.model_util import get_pretrained_models
      File "/lustre/home/rongqiao/remora1.1.0/remora/src/remora/model_util.py", line 11, in <module>
        import onnx
      File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/lib/python3.8/site-packages/onnx/__init__.py", line 11, in <module>
        from onnx.external_data_helper import load_external_data_for_model, write_external_data_tensors, convert_model_to_external_data
      File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/lib/python3.8/site-packages/onnx/external_data_helper.py", line 14, in <module>
        from .onnx_pb import TensorProto, ModelProto
      File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/lib/python3.8/site-packages/onnx/onnx_pb.py", line 8, in <module>
        from .onnx_ml_pb2 import * # noqa
      File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/lib/python3.8/site-packages/onnx/onnx_ml_pb2.py", line 33, in <module>
        _descriptor.EnumValueDescriptor(
      File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 755, in __new__
        _message.Message._CheckCalledFromGeneratedFile()
    TypeError: Descriptors cannot not be created directly.
    If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
    If you cannot immediately regenerate your protos, some other possible workarounds are:

    1. Downgrade the protobuf package to 3.20.x or lower.
    2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

    More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates '''

    Is this because I install the package from github source?

    opened by Flower9618 5
  • CRF models are not fully supported.


    Hello,

    Thanks for developing this software!

    I'm trying to run Remora with Megalodon with the following command:

    megalodon ecoli_ci_test_fast5 --guppy-config dna_r9.4.1_450bps_fast.cfg --remora-modified-bases dna_r9.4.1_e8 fast 0.0.0 5hmc_5mc CG 0 --outputs basecalls mappings mod_mappings mods --reference /projects/li-lab/Nanopore_compare/nf_input/reference_genome/ecoli/Ecoli_k12_mg1655.fasta --devices 0 --processes 20 --guppy-server-path /projects/li-lab/software/ont-guppy-gpu_5.0.16/bin/guppy_basecall_server --overwrite
    

    When I run it, I get a message that says CRF models are not fully supported. It appears to be taking a very long time to run. Also, at the end of the guppy_log, it says

    2022-01-18 17:48:32.768178 [guppy/info] New client connected Client 1 anonymous_client_1 id: 9fcd2103-4b21-4e72-9e42-b0ae103868a9 (connection string = 'dna_r9.4.1_450bps_fast:>timeout_interval=15000>client_name=>alignment_type=auto:::').
    2022-01-18 17:48:32.813812 [guppy/info] Client 1 anonymous_client_1 id: 9fcd2103-4b21-4e72-9e42-b0ae103868a9 has disconnected.
    

    I'm using Megalodon version 2.4.1 with PyGuppy and Guppy GPU version 5.0.16. This is similar to an error mentioned in Issue #2, specifically this comment: https://github.com/nanoporetech/remora/issues/2#issuecomment-985712066. Do you know why this error is occurring?

    I'm also a little bit confused, since it says on the Remora GitHub that running Remora on GPU resources is experimental with little support. However, Megalodon, which is running the Remora trained model (if I understand it correctly), requires a path to a Guppy basecall server, which greatly benefits from GPU usage. As a result, I am running on GPU resources. Could this be a source of any error?

    Any help is greatly appreciated. Thank you!

    opened by twslocum 5
  • Any support for 5hmC on the newest Remora models for r9.4.1?


    We have a ton of data using R9.4.1 (Kit 10) chemistry (eg. in the order of hundreds+ human samples) and would like to explore 5hmC calling. Is that possible with the new remora v2 models or do we have to use the older initial Remora ones? As far as I know those ones have substantially less accuracy?

    opened by billytcl 0
  • Data preparation scripts for Remora models with random bases


    Hello Remora Team,

    In this year's ONT update, Clive mentioned that the newer models that perform better than BS-seq are trained with sequences that contain a modified position with +-30 random bases around that position, if I understand it correctly. Are the scripts to prepare the training data for this kind of input data publicly available? Right now only fully modified and unmodified reads are applicable with the data preparation scripts uploaded here, correct?

    Thanks for your help!

    Cheers, Anna

    opened by AnWiercze 2
  • Various questions about Remora


    Hello,

    I have a few questions that are not really related to each other.

    First: I always assumed when training and running the model on new data, you had to know the reference in each case in order to generate ground-truth sequences. However, I just noticed this paragraph:

    "The Remora API can be applied to make modified base calls given a basecalled read via a RemoraRead object. sig should be a float32 numpy array. seq is a string derived from sig (can be either basecalls or other downstream derived sequence; e.g. mapped reference positions). seq_to_sig_map should be an int32 numpy array of length len(seq) + 1 and elements should be indices within sig array assigned to each base in seq."

    Let's say I know the reference sequence for the training data but may not know the reference for some new unseen data. Would it be advisable to do the following process?

    Training:

    1. Basecall using Guppy
    2. Generate ground-truth sequences by mapping basecalls to a reference
    3. Use Taiyaki prepare_mapped_reads.py to map signals to ground-truth sequences
    4. Convert Taiyaki .hdf5 into Remora .npz using remora dataset prepare
    5. Remora model train on resulting .npz
    6. Generate .onnx model file

    Testing on new data:

    1. Basecall using Guppy
    2. Using Taiyaki prepare_mapped_reads.py (?) to map signals to basecalls (NOT a reference)
    3. Convert Taiyaki .hdf5 into Remora .npz using remora dataset prepare
    4. Run remora infer from_remora_dataset on resulting .npz file with the .onnx model file generated during training.

    If the answer to the above question is yes, then what is the best way to map signals to the basecalls? Would I just use the same process (prepare_mapped_reads.py with basecalls.fastq as reference?)

    My second question has to do with remora dataset prepare. The default for the --motif parameter is N 0. However, from what I understand, a canonical base (ACTG or any combination) motif/position has to be declared when running Taiyaki's prepare_mapped_reads.py. Generating predictions in any context would be ideal for my situation, but I'm not sure how to get the default here to work. If I try to run prepare_mapped_reads.py with --alphabet ACTG --mod Y N mod_long_name_here, it throws an assertion error saying "Canonical coding for modified base must be a canonical base, got N." If I try running remora dataset prepare with default parameters after successfully running prepare_mapped_reads.py with something that works like --alphabet ACTG --mod Y A mod_long_name_here, then remora throws a RemoraError saying "Canonical base within motif does not match canonical equivalent for modified base (A)."

    What I'm getting at here is it doesn't seem to be possible to run remora with the default "any context" --motif parameter N 0 because of limitations of the tools used further upstream, such as Taiyaki's prepare_mapped_reads.py. If there is a way to generate a dataset in which the default Remora --motif parameter works, it would be of great help to know how to do that.

    Thanks!

    opened by tcb72 8
Releases(v2.0.0)
  • v2.0.0(Dec 6, 2022)

    Remora v2.0.0 release

    Feature additions:

    • Updated kit14 5mC+5hmC models
    • Simplified POD5+BAM input pipeline
    • Remove ONNX model format (pytorch only unified with Dorado)
    • Automatic model downloads
    • Inference and validation from modBAM format
    • Duplex modified base calling
    • Remove Taiyaki/Megalodon dependency
    • Basecall-anchored training
  • v1.1.1(Jun 16, 2022)

    Remora v1.1.1 release

    Feature additions:

    • Guppy-compatible model export including version 1 Remora models

    Bug Fixes

    • onnxruntime protobuf dependency version issue
    • remora validate from_modbams using strand from --regions-bed
    • Fix bug in unused chunk extraction code
  • v1.1.0(May 18, 2022)

    Remora 1.1.0 release

    Feature additions:

    • Kit14 (R10.4.1 E8.2) model releases
    • Kit12 (R10.4 E8.1) model updates
    • Improved validation tools (remora validate from_modbams)
    • Update to new modbam tag specifications (? notation)
    • Improved support for custom training data
    • Fix scaling bug causing numpy divide by zero warning
    • All-context 5mC Kit14 research model added to Rerio (https://github.com/nanoporetech/rerio#remora-models)
  • v1.0.0(Mar 25, 2022)

    Remora 1.0.0 release

    Feature additions:

    • Sequence-based signal re-scaling
    • Signal mapping refinement based on expected levels
    • Improved API
    • Added model with improved performance for R10.4 E8.1 SUP basecalling model (full suite of improved models to be included in next release)

    Various bug fixes.

  • v0.1.2(Jan 26, 2022)

    This release includes some key bug fixes and feature additions:

    • Bug fix for onnx model stalling/segfault issue
    • Better training dataset manipulation
    • External validation datasets during training
    • Allow multiple modified base motifs in one model
  • v0.1.1(Dec 1, 2021)

  • v0.1.0(Dec 1, 2021)

Owner
Oxford Nanopore Technologies
Nanopores for single molecule (DNA/RNA, protein) analysis using the MinION, GridION and PromethION systems

Trung-Duy Nguyen 27 Nov 01, 2022
Repositori untuk menyimpan material Long Course STMKGxHMGI tentang Geophysical Python for Seismic Data Analysis

Long Course "Geophysical Python for Seismic Data Analysis" Instruktur: Dr.rer.nat. Wiwit Suryanto, M.Si Dipersiapkan oleh: Anang Sahroni Waktu: Sesi 1

Anang Sahroni 0 Dec 04, 2021
X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

Nguyễn Quang Huy 5 Sep 28, 2022
Generates a simple report about the current Covid-19 cases and deaths in Malaysia

Generates a simple report about the current Covid-19 cases and deaths in Malaysia. Results are delay one day, data provided by the Ministry of Health Malaysia Covid-19 public data.

Yap Khai Chuen 7 Dec 15, 2022
INFO-H515 - Big Data Scalable Analytics

INFO-H515 - Big Data Scalable Analytics Jacopo De Stefani, Giovanni Buroni, Théo Verhelst and Gianluca Bontempi - Machine Learning Group Exercise clas

Yann-Aël Le Borgne 58 Dec 11, 2022
A tool to compare differences between dataframes and create a differences report in Excel

similarpanda A module to check for differences between pandas Dataframes, and generate a report in Excel format. This is helpful in a workplace settin

Andre Pretorius 9 Sep 15, 2022
Data exploration done quick.

Pandas Tab Implementation of Stata's tabulate command in Pandas for extremely easy to type one-way and two-way tabulations. Support: Python 3.7 and 3.

W.D. 20 Aug 27, 2022
Intercepting proxy + analysis toolkit for Second Life compatible virtual worlds

Hippolyzer Hippolyzer is a revival of Linden Lab's PyOGP library targeting modern Python 3, with a focus on debugging issues in Second Life-compatible

Salad Dais 6 Sep 01, 2022
Generate lookml for views from dbt models

dbt2looker Use dbt2looker to generate Looker view files automatically from dbt models. Features Column descriptions synced to looker Dimension for eac

lightdash 126 Dec 28, 2022
Full ELT process on GCP environment.

Rent Houses Germany - GCP Pipeline Project: The goal of the project is to extract data about house rentals in Germany, store, process and analyze it u

Felipe Demenech Vasconcelos 2 Jan 20, 2022
MIR Cheatsheet - Survival Guidebook for MIR Researchers in the Lab

MIR Cheatsheet - Survival Guidebook for MIR Researchers in the Lab

SeungHeonDoh 3 Jul 02, 2022