Full Spectrum Bioinformatics - a free online text designed to introduce key topics in Bioinformatics using the Python programming language

Last update: Dec 28, 2022

Overview

Full Spectrum Bioinformatics is a free online text designed to introduce key topics in Bioinformatics using the Python programming language. The text is written in interactive Jupyter Notebooks, which allow you to try out and modify example code and analyses.

In addition to explanations of concepts, Full Spectrum Bioinformatics also includes Bioinformatics Vignettes written by readers of the text. Each vignette focuses on a particular core concept and shows how a reader has applied that concept to their research project.

If you are already familiar with GitHub and Jupyter Notebooks, you can download the entire project and run it interactively, or click the 'Open in Colab' links to open interactive versions of each section in Google Colab (you will need to 'Save as' your own copy in order to change code). You can also view a static version of each section using the nbviewer links. The direct GitHub links sometimes produce a GitHub error message; reloading the page or using the nbviewer link usually avoids this issue.

Lead Author: Jesse Zaneveld¹
Vignette Authors: Nia Prabhu*¹, Aziz Bajouri*¹,², Ayomikun Akinrinade*¹,³

* Vignette authors contributed equally and are listed in chronological order of first contribution.
¹ Division of Biological Sciences, School of STEM, University of Washington, Bothell, Washington, USA
² Division of Computer and Software Systems, School of STEM, University of Washington, Bothell, Washington, USA
³ Division of Health Studies, School of Nursing and Health Studies, University of Washington, Bothell, Washington, USA

The text is currently in prototype status. Chapters with content you can preview are linked below:

Chapter 1. Foreword
Chapter 2. Introduction
- The Many Paths to Bioinformatics
- Speaking Each Other's Language
  - An Absurdly Brief Introduction to Biology
  - An Absurdly Brief Introduction to Computer Science
  - An Absurdly Brief Introduction to Statistics
Chapter 3. The Command Line
- Using the Command Line
- Exercise: Little Brother is Missing
Chapter 4. Exploring Python
- Warm-up Exercise: Spot the Difference
- Exploring Python
- A Tour of Python Data Types
- A Tour of Python Syntax (functions, conditions, iteration, classes)
Chapter 5. Project Design
- Using Literature Surveys to Ask Good Questions and Propose Testable Hypotheses
Chapter 6. Biological Sequences
- An introduction to Biological Sequences
- Representing and Manipulating Biological Sequences as Python Strings
- Analyzing Biological Sequences with For Loops and If Statements
- Reading and writing FASTA files using Python
- Bioinformatics Vignette (Aziz Bajouri): Using set objects to find circular RNAs involved in multiple diseases
- Exercise: Error Bingo
- Error Messages in Python
- Bioinformatics Vignette (Nia Prabhu): Using For Loops and Dictionaries to Compare Nucleotide Composition in Pandemic and Non-Pandemic Causing Influenza Strains
- Capstone: testing for depletion of CG dinucleotides in the human genome
Chapter 7. 'Omics
- An Introduction to 'Omics
- Working with Tabular 'Omic data in Python using Pandas
- Analyzing Microbiome Alpha Diversity in Python
- Analyzing Microbiome Beta Diversity in Python
- Simulating the Effect of Sequencing Depth on Diversity Estimates
Chapter 8. Visualization
- Graphs as a Visual Language
- Exercise: Anger Tufte
- Representing Correlation
- Representing Distribution
Chapter 9. Alignment and Phylogenetics
- 9a. Alignment
  - Homology and Alignment
  - Global Alignment with the Needleman-Wunsch algorithm
  - Local Alignment with the Smith-Waterman algorithm
  - BLAST and the k-mer trick
  - Exercise: Duck vs. Yeast
- 9b. Phylogenetics
  - Tree thinking
  - Representing Phylogenetic Trees with Python Classes
  - Generating Trees Using Birth-Death Models
  - Working with Traits on Trees
  - Maximum Parsimony Ancestral State Reconstruction
  - Hidden State Prediction
  - Phylogenetic Comparative Methods
Chapter 10. Simulation
- Simulating Biological Networks
- Simulating the Population Genetics of Natural Selection and Genetic Drift
- Simulating the Evolution of Social Behavior
Chapter 11. Statistics
- Linear Models - a Statistical Swiss Army Knife
- Monte Carlo simulation and the Fundamental Unity of Statistical Hypothesis Tests
- Statistical Distributions and Parametric Tests
- Rank Transformations
- Monte Carlo simulation of Effect Size, Sample Size, and Significance
- Dealing with Multiple Comparisons
- Exercise: Revising your writing about statistical results
- An Introduction to Maximum Likelihood optimization
- The Best Model of A Cat is a Cat - model complexity, overfitting, and the AIC
- An Introduction to Bayesian Approaches
Chapter 12. Multivariate Statistics and Machine Learning
- Unsupervised Classification: of ordination, clustering and fishtanks
- Supervised Classification: from lines to trees to forests.
- Bioinformatics Vignette (Ayomikun Akinrinade): Using K-Nearest Neighbors and Binary Decision Tree Algorithms to Predict Enzyme Function from Protein Sequences
Chapter 13. Presenting Research
- Presentations as Verbal Chess
Chapter 14. Polishing and Publishing
- Presenting Research
- From Data to Conclusion: building a research manuscript brick by brick
- Resistance is Futile: becoming a language Borg
- Exercise: generating a targeted title using templating
- The Inverted Pyramid: optimizing your text from a reader's perspective
Chapter 15. Careers that draw on Bioinformatics
- Fighting for an Inclusive Workplace
  - Examining Privilege and Identity
  - Making Your Science and Teaching Accessible and Inclusive
  - Campus and Local Activism
  - Improving University Policy
- Happiness Matters
- Radical Collaboration
- Cognitive Bias and Networking
- Open-source Science as Shield and Sword
- Applying for Grants
Appendices:
- Appendix A - Data Sources for Bioinformatics Projects
- Appendix B - Timesaving Starter Code
  - Template Script with Interface and Test Code
  - IUPAC codes in python
  - Standard Translation Tables in Python
- Appendix C - Contributing a Community Example
- Appendix D - Paper Formatting Kit
- Appendix E - Project Specifications

This project is being developed with support from NSF Integrative and Organismal Systems award .

Feedback

You can submit feedback about completed chapters at the following link

Comments

Bump nokogiri from 1.10.9 to 1.11.1
Bumps nokogiri from 1.10.9 to 1.11.1.



dependencies
opened by dependabot[bot] 2
Missing reading response link on "Error Messages in Python"

There is no reading response link at the bottom of content/04_exploring_python/error_messages_in_python.ipynb. Additionally, on the reading response form, there is no entry for this reading.

opened by LucaOnline 1
Add discussion of HISAT2 & transcriptomics

HISAT2: https://anaconda.org/bioconda/hisat2

Salmon intro (another alternative that interoperates well with DESeq2) https://combine-lab.github.io/salmon/getting_started/

opened by zaneveld 0
Literature Synthesis section -- discuss cutting extra phrases that don't add meaning in literature

In addition we found a more recent study that showed that [research finding] (cite1;cite2). --> [research finding]

In a 2016 study it was shown that [finding] (cite1) --> [finding]

opened by zaneveld 0
More database links

Open resources on CUREs shared in the 2022 AACU Talks (shared by Robin Angotti). One talk was "CUREing Cancer: How a Virtual Cancer Genomics CURE Made Research Accessible to Students During COVID" and another was "Expanding Access to Undergraduate Research Through BCEENET CUREs Using Digitized Collections Data":

- https://www.cbioportal.org/ (Cancer research database)
- https://www.idigbio.org/ (Integrated digitized biocollections)
- https://www.gbif.org/ (biodiversity data)
- https://bceenetwork.org/cure-summaries/
- https://docs.google.com/document/d/1gC-sj3p8aUKgEDxVPJfq793Mm4n5niZm/edit (overview of databases for genes and genomics for cancer)

opened by zaneveld 0

Releases(release-2022.3.1)

release-2022.3.1 (Mar 2, 2022)

What's Changed

The 2022.3.1 release of Full Spectrum Bioinformatics greatly expands the scope and maturity of the text, including contributions from 3 undergraduate co-authors. This text has now been used to support multiple classes, and has 35 sections that are linked from the table of contents and ready for classroom use.

Here are some of the major changes:

The text has several new sections:
- An overview of Python syntax now shows how to recognize Python syntax before we dive into studying the details
- A first chapter on sequence alignment now covers Needleman-Wunsch alignment, both worked by hand using a simple example and implemented in NumPy
- The text now discusses linear models, with accompanying illustrations as well as figures
- An Error Bingo exercise now encourages students to intentionally trigger and learn from errors
- An extensive section has been added discussing common errors in Python, why they most commonly occur, and how to fix them
- 3 undergraduate contributors have added Bioinformatics Vignettes showing how to apply the principles in the text to biological problems:
  - Nia Prabhu (nucleotide composition)
  - Aziz Bajouri (set analysis)
  - Ayomikun Akinrinade (machine learning)
- A section has been added on revising writing about statistical results
- An initial draft section on visualizing correlation has been added, showing how a scatterplot can be revised to add linear regression results and 95% confidence intervals, and to better meet recommendations for data visualization
- The Data Sources page has been greatly updated, and now includes logos for linked resources

New Draft Sections:
- A draft section on student activism and fighting for an inclusive workplace has been added
- A draft section on network analysis has several in-progress code commits (not yet linked from main table of contents)

Other changes:
- Full Spectrum Bioinformatics has now adopted a code of conduct
- Many minor fixes
- Exercises have been added to many sections that previously lacked them
- The exercise on calculating CG content in the human genome has been updated
- Several chapters have been updated to include Feedback links that were previously missing
- Unused Jupyter Book files have been removed

Full Changelog: https://github.com/zaneveld/full_spectrum_bioinformatics/compare/release-2020.12.1...release-2022.3.1
Source code(tar.gz)
Source code(zip)
full_spectrum_bioinformatics_2022.3.0.zip(182.17 MB)
release-2020.12.1 (Dec 8, 2020)

This is an initial development release of the Full Spectrum Bioinformatics online textbook. This is not a full release of the entire planned textbook, but rather an incremental development release of some content that is sufficiently developed that it has been used in classes.

Some current features include:
- A series of open-access Jupyter Notebooks discussing topics in Bioinformatics.
- Links to Google Colab to allow students to run notebooks in a browser without installing software
- An outline table of contents shows planned sections, with sections that are in beta status available as live links.
- This release includes 21 new sections, covering topics ranging from sequence analysis to how to revise one's writing about statistical results:

Foreword
The Command Line
- Using the Command Line
- Exercise: Little Brother is Missing
Exploring Python
- Exploring Python
- A Tour of Python Data Types
Project Design
- Using Literature Surveys to Ask Good Questions and Propose Testable Hypotheses
Biological Sequences
- An introduction to Biological Sequences
- Representing and Manipulating Biological Sequences as Python Strings
- Analyzing Biological Sequences with For Loops and If Statements
- Reading and writing FASTA files using Python
'Omics
- An Introduction to 'Omics
- Working with Tabular 'Omic data in Python using Pandas
Phylogenetic Trees
- Representing Phylogenetic Trees with Python Classes
- Generating Trees Using Birth-Death Models
Simulation
- Simulating the Population Genetics of Natural Selection and Genetic Drift
Statistics
- Rank Transformations
- Monte Carlo simulation of Effect Size, Sample Size, and Significance
- Dealing with Multiple Comparisons
- Exercise: Revising your writing about statistical results
Polishing and Publishing
- Presenting Research
Careers that draw on Bioinformatics
- Applying for Grants

NOTE: this is very similar to release-2020.12.0, other than minor edits to the README, but I needed to re-release to trigger Zenodo to generate a DOI.
Source code(tar.gz)
Source code(zip)
release-2020.12.0 (Dec 7, 2020)

This is an initial development release of the Full Spectrum Bioinformatics online textbook. This is not a full release of the entire planned textbook, but rather an incremental development release of some content that is sufficiently developed that it has been used in classes.

Some current features include:
- A series of open-access Jupyter Notebooks discussing topics in Bioinformatics.
- Links to Google Colab to allow students to run notebooks in a browser without installing software
- An outline table of contents shows planned sections, with sections that are in beta status available as live links.
- This release includes 21 new sections, covering topics ranging from sequence analysis to how to revise one's writing about statistical results:

Foreword
The Command Line
- Using the Command Line
- Exercise: Little Brother is Missing
Exploring Python
- Exploring Python
- A Tour of Python Data Types
Project Design
- Using Literature Surveys to Ask Good Questions and Propose Testable Hypotheses
Biological Sequences
- An introduction to Biological Sequences
- Representing and Manipulating Biological Sequences as Python Strings
- Analyzing Biological Sequences with For Loops and If Statements
- Reading and writing FASTA files using Python
'Omics
- An Introduction to 'Omics
- Working with Tabular 'Omic data in Python using Pandas
Phylogenetic Trees
- Representing Phylogenetic Trees with Python Classes
- Generating Trees Using Birth-Death Models
Simulation
- Simulating the Population Genetics of Natural Selection and Genetic Drift
Statistics
- Rank Transformations
- Monte Carlo simulation of Effect Size, Sample Size, and Significance
- Dealing with Multiple Comparisons
- Exercise: Revising your writing about statistical results
Polishing and Publishing
- Presenting Research
Careers that draw on Bioinformatics
- Applying for Grants
Source code(tar.gz)
Source code(zip)
full_spectrum_bioinformatics.zip(84.89 MB)

Owner

Jesse Zaneveld

