MAS.S66: How to Grow (Almost) Anything

Priya Pillai

Protein Design: Analyzing Protein Folding and Structure

March 21, 2019

Proteins are the workhorses of the cell, running nearly all of cellular activities. In order to be able to perform a vast array of activity, they need to form a large variety of functional shapes. However, for the sake of simplicity and encoding within the genome, they must also be made of standardized parts. This is accomplished naturally via the use of 20 different amino acids. Each amino acid has a "backbone" that is used to connect them together and a "side chain" which allows the chemical variety necessary to produce different three dimensional structures and functions.

Amino Acid Codon Chart
The mRNA genetic code to produce an amino acid-a set of 3 base pairs translates to 1 amino acid

An interesting question to consider is why there are only 20 amino acids that are able to form all of the many functions of proteins. Since each amino acid is encoded by 3 DNA/RNA bases and there are 4 possible bases in each, we could have up to 63 different amino acids encoded (assuming one was used as a stop codon). Scientists have not really found an explicit answer as to why we only have 20, but we can consider some of the potential advantages this system has. The 20 amino acids span a large variety of sizes, polarities, and charges. 2 amino acids are capable of bonding their side chains to form sulfur bridges through the amino acid. It may simply be that the vast majority of potential chemical functions that would be necessary in biology are already accounted for within these 20. Also, having significantly less than 63 amino acids means that the genetic code can have small mutations without necessarily changing an amino acid. Even when an amino acid is changed, it is highly likely that the amino acid shift is to something very similar to the first amino acid. Biology needs genetic variety and mutations in order to create its large variety, and using significantly more amino acids could risk a species long term survival potential due to genetic mutations. Finally, having a repetitive genetic code means that the DNA can mean more than just the protein sequence. Regulation of protein production can rely on the specific genetic sequence, via mRNA folding of the transcript itself, via RNA interfence, via protein bonding to the DNA or the RNA, or via a number of other mechanisms. The repetitive genetic code means that two genes with similar protein sequences can be independently identified and regulated.

APAF1 Details
APAF1, the apoptosome and the "Wheel of Death" protein. Source: Acehan, Devrim & Yu, Xinchao & Wang, Liu & Jiang, Xuejun & Booth, Christopher & Wang, Xiaodong & W Akey, Christopher. (2004). Structures of the Human and Drosophila Apoptosome Determined by Single Particle Electron Cryo-microscopy. Microscopy and Microanalysis - MICROSC MICROANAL. 10. 10.1017/S1431927604883090.

A protein with a particularly interesting three dimensional structure to analyze is APAF1, also known as the "Wheel of Death" protein. This protein is crucial in beginning apopotosis. When bound to cytochrome-C released by the mitochondria, it forms a heptamer in a wheel like structure. APAF1 then binds to procaspase-9, modifies procaspase-9 to caspase-9 (of CRISPR-Cas9 fame), which finally cuts up the DNA to kill the cell. The most frequent amino acid in it is leucine, a nonpolar amino acid, with 139 leucines of 1248 total amino acids (in its largest and most common isoform). According to the NCBI database, there are 10 homologs of APAF1 in humans, chimpanzees, Rhesus monkeys, dogs, cows, mice, rats, chickens, zebrafishes, and frogs, as well as 257 orthologs. This makes sense, as it is a core protein in the apoptosis pathway, necessary in multicellular organisms. It is a part of the family TF323866 on Pfam.

A detailed structure of APAF1 has been solved since 2013, though more recent papers have gotten its structure to the atomic level. It binds with either ATP or DTP (an energy molecule), a heme iron containing group (HEC or HEM), and magnesium in its various structures. Some of the structures also include it bound to procaspase-9. Almost all include it bound to cytochrome-c, as it will only form its wheel structure when bound to cytochrome-c. While it is not included in the SCOP database, it appears to segregate its alpha helices and beta pleated sheets, and would likely be classified as a+b.

Cartoon Model of APAF1
Cartoon visualization of APAF1, colored by position in sequence
Ribbon Model of APAF1
Ribbon visualization of APAF1, colored by secondary structure

Looking at the visualizations og APAF1, we can see it has slightly more alpha helices than beta pleated sheets, although the areas are highly segregated. It also has its characteristic "wheel" shape, where the outer edges connect to a cytochrome-C and its center binds to procaspase-9.

Ball and Stick Model of APAF1
Ball and Stick visualization of APAF1, colored by hydrophobicity
Surface Model of APAF1
Surface visualization of APAF1, colored by hydrophobicity

From the visualizations of its hydrophobicity and its surface, we find that it has a large variety of hydrophobicity on its outside. This is likely because it is cytoplasmic and therefore does not need particular hydrophobicity for the membrance. It is more hydrophillic on the outside than its inner barrel, which is more hydrophobic. While its surface does not have any visible binding pockets, the cartoon visualization does show binding at the ends and in the center.

Initial results from PyRosetta MonteCarlo Folding Algorithm
Initial Results from PyRosetta Monte Carlo Folding Algorithm of 30 aa chain (bottom let is the original molecule, other is Monte Carlo result)
Initial RMSD Graph
Initial results' RMSDs from the original molecule

When using PyRosetta, I ran into a number of difficulties; most notably, I had to use Ubuntu for Windows on my computer, and so had to create a separate python file rather than use the Jupyter notebook. The first time I ran the script, something went wrong during this process, as the vast majority of the protein did not fold correctly. Looking into the error output revealed that the fragments of 3 mers and 9 mers had not been read into the file correctly, leading to a lack of information on the secondary structure.

Original Molecule
Original Molecular Fold of 30 aa chain
Final Results from PyRosetta MonteCarlo Folding Algorithm
Final results from PyRosetta MonteCarlo Folding Algorithm of 30 aa chain
RMSD Graph
Final results' RMSDs from the original molecule

The python script with the questions and answers from the Jupyter notebook is below:

import os
import glob
import shutil
import pandas as pd
#import nglview as ngl
import pyrosetta as prs
prs.init()
from pyrosetta import rosetta

outstr = "\n"

scorefxn_low = prs.create_score_function('score3')
scorefxn_high = prs.get_fa_scorefxn()

native_pose = prs.pose_from_pdb('../../data/1BL0/1BL0_chainA.pdb')

outstr = outstr + "\n" + native_pose.sequence()

DSSP = prs.rosetta.protocols.moves.DsspMover()
DSSP.apply(native_pose)    # populates the pose's Pose.secstruct

outstr = outstr + "\n" + 'Native Pose Res 1:' + native_pose.residue(1).__str__()

pose = prs.pose_from_sequence(native_pose.sequence())
test_pose = prs.Pose()
test_pose.assign(pose)
test_pose.pdb_info().name('Linearized Pose')

test_pose.dump_file('outputs/linearized_pose.pdb')

to_centroid = prs.SwitchResidueTypeSetMover('centroid')
to_full_atom = prs.SwitchResidueTypeSetMover('fa_standard')
to_full_atom.apply(test_pose)
outstr = outstr + "\n" + 'Full Atom Score:' + scorefxn_high(test_pose).__str__()
to_centroid.apply(test_pose)
outstr = outstr + "\n" + 'Centroid Score:' + scorefxn_low(test_pose).__str__()

# Q: Visualize the centroid-only structure and see the difference with the full atom that we visualized above?
# Print again the information for the first residue and compare.
test_pose.assign(pose)
to_centroid.apply(test_pose)
test_pose.dump_file('outputs/centroid_only.pdb')
outstr = outstr + 'Centroid Pose Res 1:' + test_pose.residue(1).__str__()

# Loading the files with the pre-computed fragmets
long_frag_filename = '../../data/1BL0/9_fragments.txt'
long_frag_length = 9
short_frag_filename = '../../data/1BL0/3_fragments.txt'
short_frag_length = 3

# Defining parameters of the folding algorithm
long_inserts=5  # How many 9-fragment pieces to insest during the search
short_inserts=10 # How many 3-fragment pieces to insest during the search

kT = 3.0 # Simulated Annealing temperature
cycles = 1000 # How many cycles of Monte Carlo search to run
jobs = 50 # How many trajectories in parallel to compute.
job_output = 'outputs/1BL0/decoy' # The prefix of the filenames to store the results

movemap = prs.MoveMap()
movemap.set_bb(True)

fragset_long = rosetta.core.fragment.ConstantLengthFragSet(long_frag_length, long_frag_filename)
long_frag_mover = rosetta.protocols.simple_moves.ClassicFragmentMover(fragset_long, movemap)

fragset_short = rosetta.core.fragment.ConstantLengthFragSet(short_frag_length, short_frag_filename)
short_frag_mover = rosetta.protocols.simple_moves.ClassicFragmentMover(fragset_short, movemap)

insert_long_frag = prs.RepeatMover(long_frag_mover, long_inserts)
insert_short_frag = prs.RepeatMover(short_frag_mover, short_inserts)

# Q: How many 9-mer and 3-mer fragmets do we have in our database?
# A: We have 200 9-mers and 200 3-mers.

# Making sure the structure is in centroid-only mode for the search
test_pose.assign(pose)
to_centroid.apply(test_pose)

# Defining what sequence of actions to do between each scoring step
folding_mover = prs.SequenceMover()
folding_mover.add_mover(insert_long_frag)
folding_mover.add_mover(insert_short_frag)

mc = prs.MonteCarlo(test_pose, scorefxn_low, kT)
trial = prs.TrialMover(folding_mover, mc)

# Setting up how many cycles of search to do in each trajectory
folding = prs.RepeatMover(trial, cycles)

fast_relax_mover = prs.rosetta.protocols.relax.FastRelax(scorefxn_high)

scores = [0] * (jobs + 1)
scores[0] = scorefxn_low(test_pose)

if os.path.isdir(os.path.dirname(job_output)):
    shutil.rmtree(os.path.dirname(job_output), ignore_errors=True)
os.makedirs(os.path.dirname(job_output))
jd = prs.PyJobDistributor(job_output, nstruct=jobs, scorefxn=scorefxn_high)

counter = 0 
while not jd.job_complete:
    # a. set necessary variables for the new trajectory
    # -reload the starting pose
    test_pose.assign(pose)
    to_centroid.apply(test_pose)
    # -change the pose's PDBInfo.name, for the PyMOL_Observer
    counter += 1
    test_pose.pdb_info().name(job_output + '_' + str(counter))
    # -reset the MonteCarlo object (sets lowest_score to that of test_pose)
    mc.reset(test_pose)

    #### if you create a custom protocol, you may have additional
    ####    variables to reset, such as kT

    #### if you create a custom protocol, this section will most likely
    ####    change, many protocols exist as single Movers or can be
    ####    chained together in a sequence (see above) so you need
    ####    only apply the final Mover
    # b. apply the refinement protocol
    folding.apply(test_pose)

    ####
    # c. export the lowest scoring decoy structure for this trajectory
    # -recover the lowest scoring decoy structure
    mc.recover_low(test_pose)
    # -store the final score for this trajectory
    # -convert the decoy to fullatom
    # the sidechain conformations will all be default,
    #    normally, the decoys would NOT be converted to fullatom before
    #    writing them to PDB (since a large number of trajectories would
    #    be considered and their fullatom score are unnecessary)
    # here the fullatom mode is reproduced to make the output easier to
    #    understand and manipulate, PyRosetta can load in PDB files of
    #    centroid structures, however you must convert to fullatom for
    #    nearly any other application
    to_full_atom.apply(test_pose)
    fast_relax_mover.apply(test_pose)
    scores[counter] = scorefxn_high(test_pose)
    # -output the fullatom decoy structure into a PDB file
    jd.output_decoy(test_pose)
    # -export the final structure to PyMOL
    test_pose.pdb_info().name(job_output + '_' + str(counter) + '_fa')

decoy_poses = [prs.pose_from_pdb(f) for f in glob.glob(job_output + '*.pdb')]

def align_and_get_rmsds(native_pose, decoy_poses):
    prs.rosetta.core.pose.full_model_info.make_sure_full_model_info_is_setup(native_pose)
    rmsds = []
    for p in decoy_poses:
        prs.rosetta.core.pose.full_model_info.make_sure_full_model_info_is_setup(p)
        rmsds += [prs.rosetta.protocols.stepwise.modeler.align.superimpose_with_stepwise_aligner(native_pose, p)]
    return rmsds

rmsds = align_and_get_rmsds(native_pose, decoy_poses)
rmsd_data = []
for i in range(1, len(decoy_poses)):  # print out the job scores
    rmsd_data.append({'structure': decoy_poses[i].pdb_info().name(), 
                      'rmsd': rmsds[i],
                      'energy_score': scores[i]})

rmsd_df = pd.DataFrame(rmsd_data)
outstr = outstr + "\n" + rmsd_df.sort_values('rmsd').__str__()

print("\n")
print(outstr)
# Q: Plot the energy_score VS rmsd. Are the lowest energy structures the closest to the native one?
# No, the relation between rmsd and energy is not particularly clear.
# This may be due to the algorithm not calculating all of the energy/entropy effects that would exist in the cell.

The output from the above code was the following:

DAITIHSILDWIEDNLESPLSLEKVSERSGYSKWHLQRMFKKETGHSLGQYIRSRKMTEIAQKLKESNEPILYLAERYGFESQQTLTRTFKNYFDVPPHKYRMTNMQGESRFLHPL
Native Pose Res 1:Residue 1: ASP:NtermProteinFull (ASP, D):
Base: ASP
 Properties: POLYMER PROTEIN CANONICAL_AA LOWER_TERMINUS SC_ORBITALS POLAR CHARGED NEGATIVE_CHARGE METALBINDING ALPHA_AA
  L_AA
 Variant types: LOWER_TERMINUS_VARIANT
 Main-chain atoms:  N    CA   C
 Backbone atoms:    N    CA   C    O   1H   2H   3H    HA
 Side-chain atoms:  CB   CG   OD1  OD2 1HB  2HB
Atom Coordinates:
   N  : 0.229, 36.012, 74.172
   CA : 0.041, 35.606, 75.594
   C  : -0.096, 36.849, 76.498
   O  : -0.951, 36.895, 77.382
   CB : 1.225, 34.718, 76.092
   CG : 2.159, 34.156, 74.999
   OD1: 1.688, 33.361, 74.151
   OD2: 3.378, 34.497, 75.007
  1H  : 1.056, 35.74, 73.68
  2H  : -0.43, 35.723, 73.478
  3H  : 0.251, 36.981, 73.928
   HA : -0.884, 35.037, 75.696
  1HB : 1.839, 35.199, 76.854
  2HB : 0.67, 33.892, 76.539
Mirrored relative to coordinates in ResidueType: FALSE

Full Atom Score:46662.2385289
Centroid Score:454.122863078Centroid Pose Res 1:Residue 1: ASP:NtermProteinFull (ASP, D):
Base: ASP
 Properties: POLYMER PROTEIN CANONICAL_AA LOWER_TERMINUS POLAR CHARGED NEGATIVE_CHARGE ALPHA_AA L_AA
 Variant types: LOWER_TERMINUS_VARIANT
 Main-chain atoms:  N    CA   C
 Backbone atoms:    N    CA   C    O    H
 Side-chain atoms:  CB   CEN
Atom Coordinates:
   N  : 0, 0, 0
   CA : 1.458, 0, 0
   C  : 2.00885, 1.42017, 0
   O  : 1.25096, 2.39022, -2.58987e-16
   CB : 1.99452, -0.771871, -1.208
   CEN: 2.35051, -1.69379, -1.45468
   H  : -0.5, -0.433013, -0.75
Mirrored relative to coordinates in ResidueType: FALSE

    energy_score       rmsd                  structure
35   -231.095101   8.653948  outputs/1BL0/decoy_41.pdb
23   -209.375508   9.021131  outputs/1BL0/decoy_30.pdb
12   -228.351223   9.443739  outputs/1BL0/decoy_20.pdb
38   -233.549446  11.155148  outputs/1BL0/decoy_44.pdb
40   -217.730875  11.352236  outputs/1BL0/decoy_46.pdb
4    -262.381694  11.430146  outputs/1BL0/decoy_13.pdb
19   -240.134107  11.453890  outputs/1BL0/decoy_27.pdb
8    -198.935451  11.468517  outputs/1BL0/decoy_17.pdb
43   -228.512963  11.658983  outputs/1BL0/decoy_49.pdb
29   -197.366819  11.712007  outputs/1BL0/decoy_36.pdb
14   -199.338112  11.787165  outputs/1BL0/decoy_22.pdb
48   -228.067943  11.992335   outputs/1BL0/decoy_9.pdb
37   -204.112821  12.066104  outputs/1BL0/decoy_43.pdb
34   -230.291555  12.549900  outputs/1BL0/decoy_40.pdb
30   -223.302879  12.796835  outputs/1BL0/decoy_37.pdb
24   -229.092550  12.844552  outputs/1BL0/decoy_31.pdb
28   -223.952602  13.011182  outputs/1BL0/decoy_35.pdb
13   -251.331840  13.097167  outputs/1BL0/decoy_21.pdb
20   -243.358432  13.125232  outputs/1BL0/decoy_28.pdb
27   -208.356426  13.137354  outputs/1BL0/decoy_34.pdb
36   -218.443369  13.211121  outputs/1BL0/decoy_42.pdb
16   -229.827580  13.268226  outputs/1BL0/decoy_24.pdb
33   -228.583472  13.314477   outputs/1BL0/decoy_4.pdb
1    -226.267214  13.456331  outputs/1BL0/decoy_10.pdb
39   -221.819451  13.663134  outputs/1BL0/decoy_45.pdb
44   -204.486623  13.768157   outputs/1BL0/decoy_5.pdb
10   -227.297727  13.816728  outputs/1BL0/decoy_19.pdb
0    -220.112053  13.845310   outputs/1BL0/decoy_1.pdb
18   -209.300823  13.892698  outputs/1BL0/decoy_26.pdb
11   -235.324900  13.967921   outputs/1BL0/decoy_2.pdb
41   -243.745637  14.020139  outputs/1BL0/decoy_47.pdb
45   -233.379122  14.049248   outputs/1BL0/decoy_6.pdb
42   -192.659804  14.243319  outputs/1BL0/decoy_48.pdb
47   -170.373117  14.254252   outputs/1BL0/decoy_8.pdb
17   -218.493161  14.306151  outputs/1BL0/decoy_25.pdb
25   -240.833522  14.316915  outputs/1BL0/decoy_32.pdb
22   -193.748537  14.333572   outputs/1BL0/decoy_3.pdb
7    -248.869317  14.352598  outputs/1BL0/decoy_16.pdb
21   -211.495260  14.411678  outputs/1BL0/decoy_29.pdb
31   -227.040126  14.453772  outputs/1BL0/decoy_38.pdb
3    -229.143745  14.565474  outputs/1BL0/decoy_12.pdb
46   -247.448064  14.780769   outputs/1BL0/decoy_7.pdb
2    -227.853903  14.909472  outputs/1BL0/decoy_11.pdb
5    -231.810258  15.146400  outputs/1BL0/decoy_14.pdb
9    -233.614844  15.340778  outputs/1BL0/decoy_18.pdb
26   -219.723730  15.978415  outputs/1BL0/decoy_33.pdb
15   -208.108354  15.992851  outputs/1BL0/decoy_23.pdb
32   -201.849057  16.030535  outputs/1BL0/decoy_39.pdb
6    -225.259435  16.802127  outputs/1BL0/decoy_15.pdb