HTGAA Week : Protein Design

Shuguang Zhang (MIT), Thras Karydis (DeepCure)

This week is all about proteins! The homework is divided into two parts. Part A focuses on protein analysis and protein informatics. In Part B, you will get a fun introduction to the challenging world of protein folding.

Part A: Protein analysis

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins.


  1. Answer any of the following questions

    • How many amino acid molecules do you ingest with a 500-gram piece of meat? (On average, an amino acid weighs ~100 daltons.)
    • Why are there only 20 natural amino acids?
    • Why are most molecular helices right-handed?
    • Where did amino acids come from before enzymes that make them, and before life started?
    • What do digital databases and nucleosomes have in common?
  2. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions.

    • Briefly describe the protein you selected and why you selected it.

    • Identify the amino acid sequence of your protein.

      • How long is it? What is the most frequent amino acid?

      • How many protein sequence homologs are there for your protein?

        Hint: Use the BLASTp (protein BLAST) tool to search for homologs and Clustal Omega to align and visualize them.

      • Does your protein belong to any protein family?

    • Identify the structure page of your protein in RCSB

      • When was the structure solved? Is it a good quality structure?
      • Are there any other molecules in the solved structure apart from protein?
      • Does your protein belong to any structure classification family?
    • Open the structure of your protein in any 3D molecule visualization software

      • Visualize the protein as "cartoon", "ribbon" and "ball and stick".
      • Color the protein by secondary structure. Does it have more helices or sheets?
      • Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
      • Visualize the surface of the protein. Does it have any "holes" (aka binding pockets)?
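The sequence questions above (length and most frequent amino acid) can be answered with a few lines of Python once you have the sequence in one-letter code. The following is a minimal sketch, not part of the course material: the `composition` helper and the example sequence are hypothetical placeholders, and you would paste in the sequence of the protein you selected.

```python
from collections import Counter

# Hypothetical placeholder sequence (one-letter amino acid codes).
# Replace it with the sequence of the protein you selected.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"


def composition(seq):
    """Return (length, Counter of residues) for a one-letter protein sequence."""
    # Drop whitespace and newlines so a pasted multi-line FASTA body works too.
    cleaned = "".join(seq.split()).upper()
    return len(cleaned), Counter(cleaned)


length, counts = composition(sequence)
residue, n = counts.most_common(1)[0]
print(f"Length: {length} aa")
print(f"Most frequent residue: {residue} ({n} occurrences, {100 * n / length:.1f}%)")
```

The `Counter` also gives the full residue distribution through `counts.most_common()`, which can help when you later color the structure by residue type.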



3D Molecule Visualization software

Sequence alignment and homology

Part B: How to (almost) Fold (almost) Anything

In this part you will be folding protein sequences into 3D structures. The goal is to understand how computational protein modeling works and to see firsthand how much computing power molecular simulations in biology require.

For questions 1 and 2 you will be using the Python version of the Rosetta protein structure prediction software, while for question 3 (extra credit) you can use any of the available software listed in the resources.

The files for this exercise are available to clone or download from the following GitHub repository:


  1. Folding a small (30 aa) peptide. Follow the "Setting up PyRosetta" instructions below and make sure you have a working PyRosetta installation.

    a. Open the "Protein Folding with Pyrosetta" Jupyter notebook. Run the code in the notebook interactively and answer the questions in it. When you are done, save the notebook (with the answers and all outputs) as an HTML file and link it from your class page.

    b. Pick the lowest energy model and structurally (visually) compare it to the native. How close is it to the native? If it's different, what parts did the computer program get wrong? Note: To compare the structures, you first have to align them to the native. You can do that very easily in PyMOL. Here is a short video tutorial on aligning structures with PyMOL.

    c. Pick the lowest RMSD model and structurally compare it to the native. How close is it to the native? If it's different from the lowest energy model, how is it different? Remember that in a blind case, we will not have the benefit of an RMSD column.

  2. Fold your own sequence! In question 1 we used the sequence from a human protein as input to the folding algorithm. Yet, in principle, you can give any arbitrary sequence of amino acids as an input.

    a. Use any process to create a sequence of 30-50 amino acids, and predict its 3D structure using the notebook from Q1. You can try running the script with multiple parameter combinations and compare the results. Log the parameters that gave the best outcome.

    b. Compare the resulting structures of 2(a) with those from question 1. Do the structures in both cases look protein-like? If not, can you think of an explanation?

    c. Try folding multiple sequences to come up with the most protein-looking structure!

  3. Folding protein homologs (extra credit). For this exercise you will be running multiple protein folding simulations. If you don't have access to a powerful machine, use any of the folding servers listed in the resources.

    a. Take the protein sequence from question 1 and randomly change 5 letters to any other amino acid. Predict the protein structure of the unedited (probably done already in Q.1) and edited protein and compare the results. Did the changes you introduced change the structure significantly?

    b. Take the original sequence from Q.1 again and now change 5 letters to favorable alternatives according to the BLOSUM matrix. Predict the protein structure for the new sequence and compare it with the results of 3(a). Did the new changes have the same effect on the structure?

    c. Using the BLOSUM matrix as a guide, try to introduce as many changes as possible to the protein sequence without significantly changing its structure.
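For question 2(a), where "any process" is allowed, and for the random edits in question 3(a), one possible approach is to generate and mutate sequences programmatically. The sketch below is only an illustration and is not part of the course notebook: `random_sequence` and `mutate` are hypothetical helper names, and the fixed seed is an assumption made so that a run can be reproduced. The BLOSUM-guided substitutions of 3(b) and 3(c) are not covered here, since they need a substitution matrix as input.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def random_sequence(length, rng):
    """Build a uniformly random sequence of the given length."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mutate(seq, n_mutations, rng):
    """Replace n_mutations distinct positions, each with a different amino acid."""
    positions = rng.sample(range(len(seq)), n_mutations)
    residues = list(seq)
    for pos in positions:
        # Exclude the current residue so every chosen position really changes.
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues), sorted(positions)


rng = random.Random(0)  # fixed seed (an assumption) so the run is reproducible
original = random_sequence(30, rng)
mutant, changed = mutate(original, 5, rng)
print(original)
print(mutant)
print("Changed positions (0-based):", changed)
```

Recording the changed positions makes it easier to relate any structural differences you see after folding to the specific edits that caused them.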


Setting up PyRosetta

  1. Download and install Anaconda.

  2. Create a Python 3.6.8 virtual environment with conda

    • conda create -n protein_design python=3.6.8
    • Verify you have the correct Python version by activating the environment with `conda activate protein_design` and running the command `python`. The interpreter's startup banner should report Python 3.6.8.
  3. Download and install PyRosetta

    • Visit

    • Select the Python 3.6 version for your system

    • Download the package. Username: levinthal, password: paradox

      Note: This combination of username/password is only for academic use.

    • Activate the virtual environment we created above: conda activate protein_design

    • Extract and install PyRosetta to the environment.

    • Verify the installation in Python.
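The two verification steps above (Python version and PyRosetta installation) can be combined into one small check run inside the activated environment. This is a hedged sketch: `check_environment` is a hypothetical helper, and it only tests whether a package named `pyrosetta` can be located, not whether it works. A fuller check would be to run `import pyrosetta` followed by `pyrosetta.init()` in the same interpreter.

```python
import importlib.util
import sys


def check_environment(expected=(3, 6)):
    """Report whether the interpreter and PyRosetta look as the setup expects."""
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "python_matches": sys.version_info[:2] == tuple(expected),
        # find_spec only checks that the package can be located; it does not import it.
        "pyrosetta_found": importlib.util.find_spec("pyrosetta") is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `python_matches` is false, the most likely cause is that the `protein_design` environment was not activated before starting Python.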

Working with Jupyter notebooks

Jupyter Notebooks are simply amazing. If you haven't used them before, today is your lucky day. Some resources:

Protein folding (structure prediction) webservers