Extract longest transcript or longest CDS transcript from GTF annotation file or gencode transcripts fasta file.

Overview

GetTransTool Package

There are four methods for extracting the longest transcript, or the longest CDS region with the longest transcript, from a transcripts FASTA file or a GTF file.

  • 1. Extract the longest transcript from a gencode transcripts FASTA file.

  • 2. Extract the longest transcript from a GTF annotation file based on the gencode/ensembl/ucsc database.

  • 3. Extract the longest CDS region with the longest transcript from a gencode transcripts FASTA file.

  • 4. Extract the longest CDS region with the longest transcript from a GTF annotation file based on the ensembl/ucsc database.
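
The selection rule shared by these methods can be sketched in a few lines of Python. This is only an illustration of "keep the longest transcript per gene", not the package's actual implementation; the record layout and the function name `pick_longest` are assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

# (gene_name, transcript_id, transcript_length): hypothetical record layout
Record = Tuple[str, str, int]


def pick_longest(records: Iterable[Record]) -> Dict[str, Record]:
    """Keep, for each gene, the record with the greatest transcript length."""
    best: Dict[str, Record] = {}
    for rec in records:
        gene, _tx, length = rec
        # Replace the stored record only when a strictly longer one appears,
        # so the first transcript seen wins a tie.
        if gene not in best or length > best[gene][2]:
            best[gene] = rec
    return best


# Toy input, not real annotation data.
toy = [
    ("GeneA", "tx1", 900),
    ("GeneA", "tx2", 1500),
    ("GeneB", "tx3", 300),
]
longest = pick_longest(toy)
for gene, rec in longest.items():
    print(gene, rec[1], rec[2])
```

How the real tool breaks ties between transcripts of equal length is not stated here, so the "first seen wins" rule above is a placeholder.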

Install

$ pip install GetTransTool

Usage

1. Get the longest transcript from a gencode transcripts FASTA file:

help information:

$ GetLongestTransFromGencode -h
usage: GetLongestTransFromGencode --file gencode.vM28.transcripts.fa.gz --outfile longest_trans.fa

Get longest transcripts from gencode transcripts fasta file.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -f transfile, --file transfile
                        input your transcripts file with ".gz" format. (gencode.vM28.transcripts.fa.gz)
  -o longestfile, --outfile longestfile
                        output your longest transcript file. (longest_trans.fa)

Thank your for your support, if you have any questions or suggestions please contact me: [email protected].

usage:

$ GetLongestTransFromGencode --file gencode.vM28.transcripts.fa.gz --outfile longest_trans_gencode.fa
Your job is running, please wait...
Your job is done!
Running with 32.33 seconds!

Three files are produced: name_changed.fa, longest_transcripts_info.csv, and longest_trans_gencode.fa.

name_changed.fa:

>4933401J01Rik_ENSMUSG00000102693.2_ENSMUST00000193812.2_1070
AAGGAAAGAGGATAACACTTGAAATGTAAATAAAGAAAATACCTAATAAAAATAAATAAA
AACATGCTTTCAAAGGAAATAAAAAGTTGGATTCAAAAATTTAACTTTTGCTCATTTGGT
ATAATCAAGGAAAAGACCTTTGCATATAAAATATATTTTGAATAAAATTCAGTGGAAGAA
...
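
Judging from the sample above, each header in name_changed.fa joins the gene name, gene ID, transcript ID and transcript length with underscores. A minimal sketch for splitting such a header follows; the field meanings are inferred from the sample rather than documented, and `parse_trans_header` is a hypothetical helper name.

```python
from typing import NamedTuple


class TransHeader(NamedTuple):
    gene_name: str
    gene_id: str
    transcript_id: str
    length: int


def parse_trans_header(header: str) -> TransHeader:
    """Split a header of the assumed form '>Name_GeneID_TranscriptID_Length'."""
    # Split from the right so that underscores inside a gene name survive.
    gene_name, gene_id, transcript_id, length = header.lstrip(">").rsplit("_", 3)
    return TransHeader(gene_name, gene_id, transcript_id, int(length))


parsed = parse_trans_header(
    ">4933401J01Rik_ENSMUSG00000102693.2_ENSMUST00000193812.2_1070"
)
print(parsed.gene_name, parsed.transcript_id, parsed.length)
```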

longest_transcripts_info.csv:

This file lists the exon length of the longest transcript for each gene.

fullname,gene_name,translength
snoZ196_ENSMUSG00002074855.1_ENSMUST00020182568.1_35,snoZ196,35
snoZ159_ENSMUSG00002075734.1_ENSMUST00020182611.1_87,snoZ159,87
n-R5s93_ENSMUSG00000119639.1_ENSMUST00000240071.1_119,n-R5s93,119
...
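
The CSV can be loaded with the standard library alone. The sketch below maps each gene name to its transcript length, using the column names from the sample above; the function name `gene_lengths` is made up for the example, and the in-memory sample stands in for the real file.

```python
import csv
import io
from typing import Dict, TextIO


def gene_lengths(handle: TextIO) -> Dict[str, int]:
    """Map gene_name to translength for every row of the info CSV."""
    reader = csv.DictReader(handle)
    return {row["gene_name"]: int(row["translength"]) for row in reader}


# Two rows copied from the sample; for a real run, pass
# open("longest_transcripts_info.csv", newline="") instead.
sample = io.StringIO(
    "fullname,gene_name,translength\n"
    "snoZ196_ENSMUSG00002074855.1_ENSMUST00020182568.1_35,snoZ196,35\n"
    "snoZ159_ENSMUSG00002075734.1_ENSMUST00020182611.1_87,snoZ159,87\n"
)
lengths = gene_lengths(sample)
print(lengths)
```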

longest_trans_gencode.fa:

This is the filtered FASTA file containing only the longest transcripts.

>4933401J01Rik_ENSMUSG00000102693.2_ENSMUST00000193812.2_1070
AAGGAAAGAGGATAACACTTGAAATGTAAATAAAGAAAATACCTAATAAAAATAAATAAA
AACATGCTTTCAAAGGAAATAAAAAGTTGGATTCAAAAATTTAACTTTTGCTCATTTGGT
ATAATCAAGGAAAAGACCTTTGCATATAAAATATATTTTGAATAAAATTCAGTGGAAGAA
...
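
One way to sanity-check the output is to compare the length declared at the end of each header with the actual sequence length. The reader below is a small stand-alone sketch, not part of GetTransTool, and the records in `toy` are invented for the demonstration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented records that follow the Name_GeneID_TranscriptID_Length pattern.
toy = [">GeneA_G1_T1_8", "ACGT", "ACGT", ">GeneB_G2_T2_3", "AAA"]
for header, seq in read_fasta(toy):
    declared = int(header.rsplit("_", 1)[1])
    print(header, "ok" if declared == len(seq) else "length mismatch")
```

For the real output, an open file handle such as `open("longest_trans_gencode.fa")` can be passed in place of the list.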

2. Extract longest transcript from gtf format annotation file based on gencode/ensembl/ucsc database:

help information:

$ GetLongestTransFromGTF -h
usage: GetLongestTransFromGTF --database ensembl --gtffile Homo_sapiens.GRCh38.101.gtf.gz --genome Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz --outfile longest_trans.fa

Extract longest transcript from gtf format annotation file based on gencode/ensembl/ucsc database.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d databse, --database databse
                        which annotation database you choose. (default="ensembl", ucsc/ensembl/gencode)
  -g gtffile, --gtffile gtffile
                        input your GTF file with ".gz" format.
  -fa genome, --genome genome
                        your genome fasta file matched with your GTF file with ".gz" format. (Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz)
  -o longestfile, --outfile longestfile
                        output your longest transcript file. (longest_trans.fa)

Thank your for your support, if you have any questions or suggestions please contact me: [email protected].

usage:

$ GetLongestTransFromGTF --database ensembl --gtffile Homo_sapiens.GRCh38.103.gtf.gz --genome Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz --outfile longest_trans_ensembl.fa
Your job is running, please wait...
Your job is done! 
Running with 159.51 seconds!

Three files are produced: longest_transcripts_info.csv, longest_trans.gtf, and longest_trans_ensembl.fa.

longest_transcripts_info.csv:

,transcript_length,gene_name
snoZ196_ENSG00000281780_ENST00000625269_snoRNA,89,snoZ196
hsa-mir-423_ENSG00000266919_ENST00000586878_lncRNA,94,hsa-mir-423
hsa-mir-1253_ENSG00000272920_ENST00000609567_lncRNA,105,hsa-mir-1253
...

longest_trans.gtf:

This file holds the GTF records for the longest transcripts.

1	havana	gene	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1	havana	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1	havana	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1	havana	exon	12613	12721	.	+	.	gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";
1	havana	exon	13221	14409	.	+	.	gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";
1	havana	gene	14404	29570	.	-	.	gene_id "ENSG00000227232"; gene_version "5"; gene_name "WASH7P"; gene_source "havana"; gene_biotype "unprocessed_pseudogene";
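
A transcript's length follows from its exon records: GTF coordinates are 1-based and inclusive, so each exon contributes end - start + 1 bases. The sketch below sums exon lengths per transcript_id; it is an independent illustration rather than the tool's own code, and the demo lines are shortened copies of the three DDX11L1 exon records above.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def exon_lengths(gtf_lines: Iterable[str]) -> Dict[str, int]:
    """Sum exon lengths per transcript from tab-separated GTF lines."""
    totals: Dict[str, int] = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            # 1-based inclusive coordinates: length = end - start + 1.
            totals[match.group(1)] += int(fields[4]) - int(fields[3]) + 1
    return dict(totals)


attrs = 'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";'
demo = [
    f"1\thavana\texon\t11869\t12227\t.\t+\t.\t{attrs}",
    f"1\thavana\texon\t12613\t12721\t.\t+\t.\t{attrs}",
    f"1\thavana\texon\t13221\t14409\t.\t+\t.\t{attrs}",
]
print(exon_lengths(demo))  # {'ENST00000456328': 1657}
```

The three exons contribute 359, 109 and 1189 bases, giving 1657 in total.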

longest_trans_ensembl.fa:

>DDX11L1_ENSG00000223972_ENST00000456328_transcribed_unprocessed_pseudogene
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC
TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA
TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG
...

for ucsc:

$ GetLongestTransFromGTF --database ucsc --gtffile hg19.ncbiRefSeq.gtf.gz --genome hg19.fa.gz --outfile longest_trans_ucsc.fa
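
When the command is run for several annotation sources, it may be convenient to assemble the argument list in Python. The helper below is a hypothetical wrapper, not something shipped with the package; it uses only the flags shown in the help text above and assumes the GetLongestTransFromGTF executable is on PATH when the command is eventually run.

```python
import shlex
from typing import List

# Values accepted by --database according to the help text.
DATABASES = {"ensembl", "gencode", "ucsc"}


def build_gtf_command(database: str, gtffile: str, genome: str, outfile: str) -> List[str]:
    """Assemble the argument list for one GetLongestTransFromGTF run."""
    if database not in DATABASES:
        raise ValueError(f"unsupported database: {database!r}")
    return [
        "GetLongestTransFromGTF",
        "--database", database,
        "--gtffile", gtffile,
        "--genome", genome,
        "--outfile", outfile,
    ]


cmd = build_gtf_command("ucsc", "hg19.ncbiRefSeq.gtf.gz", "hg19.fa.gz", "longest_trans_ucsc.fa")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```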

3. Extract the longest CDS region with the longest transcript from a gencode transcripts FASTA file:

help information:

$ GetCDSLongestFromGencode -h
usage: GetCDSLongestFromGencode --file gencode.vM28.pc_transcripts.fa.gz --outfile longest_cds_trans.fa

Extract longest CDS regeion with longest transcript from gencode database transcripts fasta file.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -f transfile, --file transfile
                        input your protein-coding transcripts file with ".gz" format. (gencode.vM28.pc_transcripts.fa.gz)
  -o longestfile, --outfile longestfile
                        output your longest transcript file. (longest_cds_trans.fa)

Thank your for your support, if you have any questions or suggestions please contact me: [email protected].

usage:

$ GetCDSLongestFromGencode --file gencode.vM28.pc_transcripts.fa.gz --outfile longest_cds_trans_gencode.fa
Your job is running, please wait...
Your job is done! 
Running with 17.67 seconds!

Four files are produced: name_changed.fa, All_transcripts_cds_info.csv, longest_cds_transcripts_info.csv, and longest_cds_trans_gencode.fa.

name_changed.fa:

>Xkr4_ENSMUSG00000051951.6_ENSMUST00000070533.5_151_2094_3634
GCGGCGGCGGGCGAGCGGGCGCTGGAGTAGGAGCTGGGGAGCGGCGCGGCCGGGGAAGGA
AGCCAGGGCGAGGCGAGGAGGTGGCGGGAGGAGGAGACAGCAGGGACAGGTGTCAGATAA
AGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAG
...
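
In this mode the header appears to carry three trailing numbers: CDS start, CDS end and transcript length. That reading is inferred from the samples (for example, mt-Nd6 ends in 1_519_519 and has a cdslength of 519) and is not documented explicitly. Under that assumption, the CDS and UTR lengths can be derived as in the sketch below; `CdsHeader` and `parse_cds_header` are hypothetical names.

```python
from typing import NamedTuple


class CdsHeader(NamedTuple):
    gene_name: str
    gene_id: str
    transcript_id: str
    cds_start: int
    cds_end: int
    trans_length: int

    @property
    def cds_length(self) -> int:
        # Positions are treated as 1-based and inclusive.
        return self.cds_end - self.cds_start + 1

    @property
    def utr5_length(self) -> int:
        return self.cds_start - 1

    @property
    def utr3_length(self) -> int:
        return self.trans_length - self.cds_end


def parse_cds_header(header: str) -> CdsHeader:
    """Split '>Name_GeneID_TranscriptID_CdsStart_CdsEnd_Length' (assumed layout)."""
    parts = header.lstrip(">").rsplit("_", 5)
    name, gene_id, transcript_id = parts[:3]
    cds_start, cds_end, trans_length = (int(x) for x in parts[3:])
    return CdsHeader(name, gene_id, transcript_id, cds_start, cds_end, trans_length)


xkr4 = parse_cds_header(">Xkr4_ENSMUSG00000051951.6_ENSMUST00000070533.5_151_2094_3634")
print(xkr4.cds_length, xkr4.utr5_length, xkr4.utr3_length)  # 1944 150 1540
```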

All_transcripts_cds_info.csv:

This file lists the CDS length and exon length of every transcript.

fullname,gene_name,translength,cdslength
>mt-Nd6_ENSMUSG00000064368.1_ENSMUST00000082419.1_1_519_519,>mt-Nd6,519,519
>mt-Nd5_ENSMUSG00000064367.1_ENSMUST00000082418.1_1_1824_1824,>mt-Nd5,1824,1824
>mt-Nd4l_ENSMUSG00000065947.1_ENSMUST00000084013.1_1_297_297,>mt-Nd4l,297,297
...

longest_cds_transcripts_info.csv:

fullname,gene_name,translength,cdslength
>mt-Nd6_ENSMUSG00000064368.1_ENSMUST00000082419.1_1_519_519,>mt-Nd6,519,519
>mt-Nd5_ENSMUSG00000064367.1_ENSMUST00000082418.1_1_1824_1824,>mt-Nd5,1824,1824
>mt-Nd4l_ENSMUSG00000065947.1_ENSMUST00000084013.1_1_297_297,>mt-Nd4l,297,297
...

longest_cds_trans_gencode.fa:

>Xkr4_ENSMUSG00000051951.6_ENSMUST00000070533.5_151_2094_3634
GCGGCGGCGGGCGAGCGGGCGCTGGAGTAGGAGCTGGGGAGCGGCGCGGCCGGGGAAGGA
AGCCAGGGCGAGGCGAGGAGGTGGCGGGAGGAGGAGACAGCAGGGACAGGTGTCAGATAA
AGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAG
...

4. Extract the longest CDS region with the longest transcript from a GTF annotation file based on the ensembl/ucsc database:

help information:

$ GetCDSLongestFromGTF -h
usage: GetCDSLongestFromGTF --database ensembl --gtffile Homo_sapiens.GRCh38.101.gtf.gz --genome Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz --outfile longest_cds_trans.fa

Extract longest CDS regeion with longest transcript from gtf format annotation file based on ensembl/ucsc database.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d databse, --database databse
                        which annotation database you choose. (default="ensembl", ucsc/ensembl)
  -g gtffile, --gtffile gtffile
                        input your GTF file with ".gz" format.
  -fa genome, --genome genome
                        your genome fasta file matched with your GTF file with ".gz" format. (Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz)
  -o cdslongestfile, --outfile cdslongestfile
                        output your longest transcript file. (longest_cds_trans.fa)

Thank your for your support, if you have any questions or suggestions please contact me: [email protected].

usage:

$ GetCDSLongestFromGTF  --database ensembl --gtffile Homo_sapiens.GRCh38.103.gtf.gz --genome Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz --outfile longest_cds_trans_ensembl.fa
Your job is running, please wait...
Your job is done! 
Running with 152.38 seconds!

Four files are produced: CDS_longest_trans.gtf, All_transcripts_cds_info.csv, longest_cds_transcripts_info.csv, and longest_cds_trans_ensembl.fa.

CDS_longest_trans.gtf:

1	ensembl_havana	gene	65419	71585	.	+	.	gene_id "ENSG00000186092"; gene_version "6"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
1	havana	transcript	65419	71585	.	+	.	gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
1	havana	exon	65419	65433	.	+	.	gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; exon_number "1"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; exon_id "ENSE00003812156"; exon_version "1"; tag "basic";
1	havana	exon	65520	65573	.	+	.	gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; exon_number "2"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; exon_id "ENSE00003813641"; exon_version "1"; tag "basic";
1	havana	CDS	65565	65573	.	+	0	gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; exon_number "2"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; protein_id "ENSP00000493376"; protein_version "2"; tag "basic";
1	havana	start_codon	65565	65567	.	+	0	gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; exon_number "2"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
1	havana	exon	69037	71585	.	+	.	gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; exon_number "3"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; exon_id "ENSE00003813949"; exon_version "1"; tag "basic";
1	havana	CDS	69037	70005	.	+	0	gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; exon_number "3"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; protein_id "ENSP00000493376"; protein_version "2"; tag "basic";
1	havana	stop_codon	70006	70008	.	+	0	gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; exon_number "3"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
1	havana	five_prime_utr	65419	65433	.	+	.	gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
1	havana	five_prime_utr	65520	65564	.	+	.	gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
1	havana	three_prime_utr	70009	71585	.	+	.	gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
1	ensembl_havana	gene	450740	451678	.	-	.	gene_id "ENSG00000284733"; gene_version "2"; gene_name "OR4F29"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
...
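
The OR4F5 records above can be tied to the three trailing numbers in the FASTA header of the output (61, 1038, 2618). The worked example below uses the coordinates of the plus-strand transcript ENST00000641515; interpreting the header numbers as CDS start, CDS end and transcript length is an inference from the data, not a documented guarantee.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def total_length(intervals: Iterable[Interval]) -> int:
    """Sum the lengths of inclusive genomic intervals."""
    return sum(end - start + 1 for start, end in intervals)


# Coordinates copied from the OR4F5 records above.
exons = [(65419, 65433), (65520, 65573), (69037, 71585)]
five_prime_utr = [(65419, 65433), (65520, 65564)]
cds = [(65565, 65573), (69037, 70005)]

trans_len = total_length(exons)          # 15 + 54 + 2549 = 2618
utr5_len = total_length(five_prime_utr)  # 15 + 45 = 60
cds_len = total_length(cds)              # 9 + 969 = 978

# Transcript coordinates of the CDS: it starts right after the 5' UTR.
cds_start = utr5_len + 1
cds_end = utr5_len + cds_len
print(cds_start, cds_end, trans_len)  # 61 1038 2618
```

Note that the CDS records exclude the stop codon, which is listed as a separate stop_codon feature.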

All_transcripts_cds_info.csv:

This file lists the CDS length and exon length of every transcript.

cdslength,ID,translength,utr5length,gene_name
2709,ZZZ3_ENSG00000036549_ENST00000370801,6412,476,ZZZ3
1227,ZZZ3_ENSG00000036549_ENST00000370798,2468,486,ZZZ3
173,ZZZ3_ENSG00000036549_ENST00000433749,603,430,ZZZ3
...

longest_cds_transcripts_info.csv:

cdslength,ID,translength,utr5length,gene_name
2709,ZZZ3_ENSG00000036549_ENST00000370801,6412,476,ZZZ3
8883,ZZEF1_ENSG00000074755_ENST00000381638,11466,135,ZZEF1
1716,ZYX_ENSG00000159840_ENST00000322764,2228,80,ZYX
...
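
The step from All_transcripts_cds_info.csv to longest_cds_transcripts_info.csv can be approximated as "largest cdslength per gene, with translength as tie-breaker". That ordering is an assumption based on the phrase "longest CDS region with longest transcript"; the sketch below is not taken from the package source, and `longest_cds_per_gene` is a hypothetical name.

```python
import csv
import io
from typing import Dict, TextIO, Tuple


def longest_cds_per_gene(handle: TextIO) -> Dict[str, str]:
    """Return gene_name -> ID of the transcript with the longest CDS."""
    best: Dict[str, Tuple[Tuple[int, int], str]] = {}
    for row in csv.DictReader(handle):
        # Compare by CDS length first, then by transcript length.
        key = (int(row["cdslength"]), int(row["translength"]))
        gene = row["gene_name"]
        if gene not in best or key > best[gene][0]:
            best[gene] = (key, row["ID"])
    return {gene: tx_id for gene, (_key, tx_id) in best.items()}


# The three ZZZ3 rows from All_transcripts_cds_info.csv.
sample = io.StringIO(
    "cdslength,ID,translength,utr5length,gene_name\n"
    "2709,ZZZ3_ENSG00000036549_ENST00000370801,6412,476,ZZZ3\n"
    "1227,ZZZ3_ENSG00000036549_ENST00000370798,2468,486,ZZZ3\n"
    "173,ZZZ3_ENSG00000036549_ENST00000433749,603,430,ZZZ3\n"
)
chosen = longest_cds_per_gene(sample)
print(chosen)
```

For ZZZ3 this picks ENST00000370801, which matches the first row of longest_cds_transcripts_info.csv.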

longest_cds_trans_ensembl.fa:

>OR4F5_ENSG00000186092_ENST00000641515_61_1038_2618
CCCAGATCTCTTCAGTTTTTATGCCTCATTCTGTGAAAATTGCTGTAGTCTCTTCCAGTT
ATGAAGAAGGTAACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAAC
TCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTC
...

for ucsc:

$ GetCDSLongestFromGTF --database ucsc --gtffile hg19.ncbiRefSeq.gtf.gz --genome hg19.fa.gz --outfile longest_cds_trans_ucsc.fa

END

Thank you for your support. If you have any questions or suggestions, please contact me: [email protected].

Owner
laojunjun