HTGAA Week : Protein Design

Shuguang Zhang (MIT), Thras Karydis (DeepCure)

This week is all about proteins! The homework is divided into two parts. Part A focuses on protein analysis and protein informatics. In Part B, you will have a fun introduction to the challenging world of protein folding.

Part A: Protein analysis

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins.

Exercises

Answer any of the following questions
- How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)
- Why are there only 20 natural amino acids?
- Why are most molecular helices right-handed?
- Where did amino acids come from before enzymes that make them, and before life started?
- What do digital databases and nucleosomes have in common?
Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions.
- Briefly describe the protein you selected and why you selected it.
- Identify the amino acid sequence of your protein.
  - How long is it? What is the most frequent amino acid?
  - How many protein sequence homologs are there for your protein?
    Hint: Use pBLAST (protein BLAST, blastp) to search for homologs and Clustal Omega to align and visualize them.
  - Does your protein belong to any protein family?
- Identify the structure page of your protein in RCSB
  - When was the structure solved? Is it a good quality structure?
  - Are there any other molecules in the solved structure apart from protein?
  - Does your protein belong to any structure classification family?
- Open the structure of your protein in any 3D molecule visualization software
  - Visualize the protein as "cartoon", "ribbon" and "ball and stick".
  - Color the protein by secondary structure. Does it have more $\alpha$ helices or $\beta$ sheets?
  - Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
  - Visualize the surface of the protein. Does it have any "holes" (aka binding pockets)?
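The sequence questions above (length, most frequent amino acid, rough hydrophobic content) can also be checked with a few lines of Python once you have copied the sequence from the database entry. The following is a minimal sketch, not a required tool: the function name `summarize_sequence`, the toy sequence, and the choice of which residues count as hydrophobic are all assumptions made for illustration (textbooks differ on the classification).

```python
from collections import Counter

# Assumption: residues treated as hydrophobic here; classifications vary by source.
HYDROPHOBIC = set("AVILMFW")


def summarize_sequence(sequence):
    """Return length, most frequent residue, and hydrophobic fraction.

    `sequence` is a one-letter amino acid string; whitespace and line
    breaks (e.g. from a pasted FASTA body) are ignored.
    """
    seq = "".join(sequence.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    residue, count = counts.most_common(1)[0]
    hydrophobic = sum(counts[r] for r in HYDROPHOBIC)
    return {
        "length": len(seq),
        "most_frequent": residue,
        "most_frequent_count": count,
        "hydrophobic_fraction": hydrophobic / len(seq),
    }


# Hypothetical toy sequence, not a real protein.
example = "MKTLLLAVLLGA"
print(summarize_sequence(example))
```

Paste the sequence without the FASTA header line (the line starting with `>`), since the function counts every non-whitespace character as a residue.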

Resources

Databases

UniProt: A comprehensive, high-quality and freely accessible resource of protein sequence and functional information. It is linked to almost every other database. Example Entry: G3ECR1
RCSB: Collection of all publicly available biological macromolecular structures. Example Entry: 4CMP
Pfam: A large collection of protein families, i.e. groups of proteins with similar sequence/function. Example Entry: PF02171
SCOP: A large collection of structural protein families. Proteins are organized according to their structural and evolutionary relationships. Example Entry: 6PGDH C-terminal helical region-like
ExPASy: The SIB Bioinformatics Resource Portal, which provides access to scientific databases and software tools in different areas of the life sciences, including proteomics, genomics, phylogeny, systems biology, population genetics, and transcriptomics.

3D Molecule Visualization software

PyMOL (https://pymol.org/edu/?q=educational): PyMOL is a user-sponsored molecular visualization system on an open-source foundation, maintained and distributed by Schrödinger.
- Practical PyMOL for Beginners
- Video Tutorials: Video 1, Video 2 (and tons more… just search "PyMOL tutorial" on YouTube).
Chimera: A highly extensible program for interactive visualization and analysis of molecular structures and related data, including density maps, supramolecular assemblies, sequence alignments, docking results, trajectories, and conformational ensembles.
- Chimera Tutorials
- Video Tutorials: Video 1, Video 2 (and tons more… just search "Chimera tutorial" on YouTube).
VMD: A molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting
- VMD Tutorials
- Video Tutorials: Video 1 Video 2 (and tons more… you know the drill)
NGLViewer: NGL Viewer is a collection of tools for web-based molecular graphics. WebGL is employed to display molecules like proteins and DNA/RNA with a variety of representations.
- Web application (really cool demos)
- Jupyter Widget Tutorial

Sequence alignment and homology

BLAST: BLAST finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences (protein BLAST, often called pBLAST or blastp) to sequence databases and calculates the statistical significance of the matches.
- BLAST Video Tutorial
- BLAST Extensive Tutorial
Clustal Omega: A new multiple sequence alignment program that uses seeded guide trees and HMM profile-profile techniques to generate alignments between three or more sequences.
- How to Use
The BLOSUM matrices are used during alignment to score how similar amino acids are to each other. Here is the BLOSUM62 matrix, the one most commonly used when no a priori information is available about the evolutionary relationship of the protein sequences.
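To make the idea of a substitution score concrete, here is a minimal Python sketch that looks up scores for a handful of residue pairs. The dictionary contains only a small hand-copied subset of BLOSUM62 entries, and the names `BLOSUM62_SUBSET` and `blosum_score` are made up for this illustration. Positive scores mark substitutions seen more often than expected by chance in related proteins; negative scores mark unfavorable ones.

```python
# A hand-copied SUBSET of BLOSUM62 scores, for illustration only.
# Keys are alphabetically sorted residue pairs (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("I", "V"): 3,   # two branched hydrophobic residues: favorable
    ("I", "L"): 2,
    ("D", "E"): 2,   # both negatively charged
    ("K", "R"): 2,   # both positively charged
    ("F", "Y"): 3,   # both aromatic
    ("S", "T"): 1,
    ("D", "L"): -4,  # charged vs. hydrophobic: unfavorable
    ("G", "W"): -2,
}


def blosum_score(a, b, table=BLOSUM62_SUBSET):
    """Return the substitution score for residues a and b.

    Raises KeyError if the pair is not in the (partial) table.
    """
    key = tuple(sorted((a.upper(), b.upper())))
    if key not in table:
        raise KeyError("pair %s-%s is not in this partial table" % key)
    return table[key]


print(blosum_score("V", "I"))  # favorable substitution
print(blosum_score("L", "D"))  # unfavorable substitution
```

For real work you would load the full matrix rather than a subset; Biopython, for example, provides it through `Bio.Align.substitution_matrices.load("BLOSUM62")`, assuming a recent Biopython version is installed.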

Part B: How to (almost) Fold (almost) Anything

In this part you will be folding protein sequences into 3D structures. The goal is to get an understanding of how computational protein modeling works, as well as to see firsthand the great computing power needed for molecular simulations in biology.

For questions 1 and 2 you will be using the Python version of the Rosetta protein structure prediction software, while for question 3 (extra credit) you can use any of the available software listed in the resources.

The files for this exercise are available to clone or download from the following GitHub repository: https://github.com/thrakar9/protein_folding_workshop.

Questions

1. Folding a small (30 aa) peptide. Follow the "Setting up PyRosetta" instructions below and make sure you have a working PyRosetta installation.
   a. Open the "Protein Folding with Pyrosetta" Jupyter notebook. Run the code in the notebook interactively and answer the questions therein. When you are done, save the notebook (with the answers and all outputs) as an HTML file and link it from your class page.
   b. Pick the lowest-energy model and compare it structurally (visually) to the native. How close is it to the native? If it's different, which parts did the computer program get wrong? Note: To compare the structures, you first have to align the model to the native. You can do that very easily in PyMOL. Here is a short video tutorial on aligning structures with PyMOL.
   c. Pick the lowest-RMSD model and compare it structurally to the native. How close is it to the native? If it differs from the lowest-energy model, how is it different? Remember that in a blind case, we will not have the benefit of an RMSD column.
2. Fold your own sequence! In question 1 we used the sequence of a human protein as input to the folding algorithm. Yet, in principle, you can give any arbitrary sequence of amino acids as input.
   a. Use any process to create a sequence of 30-50 amino acids, and predict its 3D structure using the notebook from Q1. You can try running the script with multiple parameter combinations and compare the results. Log the parameters that gave the best outcome.
   b. Compare the resulting structures of 2(a) with those from question 1. Do the structures in both cases look protein-like? If not, can you think of an explanation?
   c. Try folding multiple sequences to come up with the most protein-looking structure!
3. Folding protein homologs (extra credit). For this exercise you will be running multiple protein folding simulations. If you don't have access to a powerful machine, use any of the folding servers listed in the resources.
   a. Take the protein sequence from question 1 and randomly change 5 letters to any other amino acid. Predict the protein structure of the unedited (probably done already in Q1) and edited protein and compare the results. Did the changes you introduced change the structure significantly?
   b. Take the original sequence from Q1 again and now change 5 letters to favorable alternatives according to the BLOSUM matrix. Predict the protein structure for the new sequence and compare it with the results of 3(a). Did the new changes have the same effect on the structure?
   c. Using the BLOSUM matrix as a guide, try to introduce as many changes as possible to the protein sequence without significantly changing its structure.
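For question 3(a), the random substitutions can be generated programmatically instead of by hand. The sketch below is one possible way to do it, under stated assumptions: the function name `mutate_randomly` and the example sequence are hypothetical, and the sequence shown is a placeholder rather than the peptide from question 1. Each chosen position is replaced with a different amino acid, so the result always differs from the input at exactly `n_mutations` positions.

```python
import random

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_randomly(sequence, n_mutations=5, seed=None):
    """Return a copy of `sequence` with `n_mutations` random substitutions.

    Positions are sampled without replacement, and each one is changed to
    a residue different from the original. Pass `seed` for reproducibility.
    """
    if n_mutations > len(sequence):
        raise ValueError("more mutations requested than positions available")
    rng = random.Random(seed)
    residues = list(sequence)
    for pos in rng.sample(range(len(residues)), n_mutations):
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


# Placeholder 30-residue sequence; substitute the sequence from question 1.
original = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
mutated = mutate_randomly(original, n_mutations=5, seed=42)
print(original)
print(mutated)
```

Fixing the seed lets you record exactly which variant you folded. For 3(b) and 3(c), the same loop could restrict `choices` to residues with a positive BLOSUM62 score against the original residue.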

Resources

Setting up PyRosetta

Download and install Anaconda.
- Visit https://www.anaconda.com/distribution/
- Select the Python 3.7 version and follow the instructions to install
Create a Python 3.6.8 virtual environment with conda
- `conda create -n protein_design python=3.6.8`
- Verify you have the correct Python version by activating the environment (`conda activate protein_design`) and running the command `python`. You should see output similar to this:


  Python 3.6.8 | Anaconda, Inc. | (default, Dec 29 2018, 19:04:46)
  [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.

Download and install PyRosetta

Visit http://www.pyrosetta.org/dow
Select the Python 3.6 version for your system
Download the package. Username: levinthal, Password: paradox
Note: This combination of username/password is only for academic use.
Activate the virtual environment we created above: conda activate protein_design
Extract and install PyRosetta to the environment.


tar -vjxf PyRosetta-<version>.tar.bz2
cd setup && python3.6 setup.py install

Verify the installation in Python.


python -c "import pyrosetta; pyrosetta.init()"
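If the one-line check above fails, it helps to know whether the problem is the Python version or a missing package. The following is a small diagnostic sketch using only the standard library; the function name `check_environment` is made up for this example, and the expected version (3.6) is taken from the setup steps above. It only tests whether the module can be found and does not initialize PyRosetta.

```python
import importlib.util
import sys


def check_environment(expected=(3, 6), module="pyrosetta"):
    """Report the running Python version and whether `module` is importable."""
    major_minor = tuple(sys.version_info[:2])
    return {
        "python": "%d.%d" % major_minor,
        # True only if the interpreter matches the expected major.minor version.
        "version_ok": major_minor == tuple(expected),
        # find_spec returns None when a top-level module cannot be located.
        "module_found": importlib.util.find_spec(module) is not None,
    }


print(check_environment())
```

Run it inside the activated `protein_design` environment. If `version_ok` is false, the environment is probably not activated; if `module_found` is false, the install step likely targeted a different Python.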

Working with Jupyter notebooks

Jupyter Notebooks are simply amazing. If you haven't used them before, today is your lucky day. Some resources:

Video Tutorial
You can save your notebook as an HTML file (File->Download as->HTML)

Protein folding (structure prediction) webservers