HTGAA by Belen Vicente

Week 4: Protein Design

Part A: Protein Analysis

1. Prelab questions

How many amino acid molecules do you take in with a 500-gram piece of meat? (on average an amino acid is ~100 Daltons)
Different types of meat have different percentages of protein, but we will assume that the average is 26%. Therefore, 500 g of meat contains 130 g of protein.
Given that 1 Dalton is 1.66054 × 10^−24 g, 130 g of protein corresponds to 130 g / (1.66054 × 10^−24 g/Da) = 7.829 × 10^25 Da. Since each amino acid is ~100 Daltons, 500 g of meat contains about 7.829 × 10^23 amino acids.
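The arithmetic above can be checked with a few lines of Python. This is only a sketch of the same estimate; the 26% protein fraction and the 100 Da average residue mass are the assumptions stated in the answer, and the variable names are my own.

```python
# Rough estimate of the number of amino acids in 500 g of meat.
# Assumptions (from the text): 26% protein by mass, ~100 Da per amino acid.
DALTON_IN_GRAMS = 1.66054e-24   # mass of 1 Da in grams
MEAT_MASS_G = 500.0
PROTEIN_FRACTION = 0.26
AVG_AMINO_ACID_DA = 100.0

protein_mass_g = MEAT_MASS_G * PROTEIN_FRACTION        # ~130 g of protein
protein_mass_da = protein_mass_g / DALTON_IN_GRAMS     # same mass expressed in Da
n_amino_acids = protein_mass_da / AVG_AMINO_ACID_DA    # number of residues

print(f"Protein mass: {protein_mass_g:.0f} g")
print(f"Amino acids:  {n_amino_acids:.3e}")
```

Dividing by the mass of one Dalton is equivalent to multiplying the number of moles (130 g / 100 g/mol) by Avogadro's number, so both routes give roughly 7.8 × 10^23.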

Why are there only 20 natural amino acids?
Although the answer is not clear, it seems to be related to their chemical properties. Although the 64 (4^3) possible codons could theoretically encode up to 64 amino acids, the 20 natural amino acids seem to have higher polymerization reactivity and fewer side reactions, which makes them more stable and better suited to supporting life.
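The 4^3 = 64 figure is simply the number of three-letter words that can be written with four bases. A minimal sketch that enumerates them (using the RNA alphabet ACGU as an illustrative choice):

```python
from itertools import product

BASES = "ACGU"  # the four RNA bases

# Every codon is an ordered triplet of bases, so there are 4 * 4 * 4 of them.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(codons))   # number of possible codons
print(codons[:5])    # first few codons in enumeration order
```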

Why are most molecular helices right-handed?
Jack Dunitz published a paper in 2001 explaining how Pauling, in his first representation of the alpha helix, had predicted a left-handed structure that turned out to be wrong, since the correct structure was its enantiomer. As with DNA, amino acid helices bear a diastereomeric relationship to the chirality of the amino acids; both have to be inverted to get the proper enantiomer. In other words, because natural amino acids are L-enantiomers, the right-handed helix is the sterically favored one. I found a great explanation on this page.

Where did amino acids come from before enzymes that make them, and before life started?
In 1953, Miller and Urey attempted to re-create the conditions of primordial Earth. In a flask, they combined ammonia, hydrogen, methane, and water vapor plus electrical sparks (Miller 1953). They found that new molecules were formed, and they identified these molecules as eleven standard amino acids.

What do digital databases and nucleosomes have in common?
Nucleosomes are the basic structural unit for DNA packaging in cells. Each one is formed by a segment of DNA wound around eight histone proteins. Databases also consist of repeated structures of information that can be retrieved when needed.

2. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions.

Briefly describe the protein you selected and why you selected it
I picked 6LU7, which is the main protease of the 2019-nCoV coronavirus (also called SARS-CoV-2), which is currently posing dangers worldwide. This protein is a dimer with 2 identical subunits (i.e. 2 chains) that form 2 active sites.

Identify the amino acid sequence of your protein

How long is it? What is the most frequent amino acid?
It has 306 residues and the most frequent one is leucine. Below is the RCSB sequence chain view
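Length and residue frequency can also be computed directly from the sequence. The sketch below uses a short made-up toy sequence as input, since the 6LU7 sequence is not reproduced on this page; in practice one would paste the one-letter sequence downloaded from RCSB. The function name `composition` is my own.

```python
from collections import Counter

def composition(sequence):
    """Return (length, most frequent residue, its count) for a one-letter sequence."""
    seq = sequence.strip().upper()
    counts = Counter(seq)
    residue, n = counts.most_common(1)[0]
    return len(seq), residue, n

# Hypothetical toy sequence, NOT the 6LU7 sequence.
toy_sequence = "MLLKALLG"
length, residue, n = composition(toy_sequence)
print(f"{length} residues, most frequent: {residue} ({n} times)")
```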
How many protein sequence homologs are there for your protein?
Using the protein BLAST (blastp) tool, we found more than 100 homologs in other viruses, such as Bat SARS-like coronavirus and Bat coronavirus RaTG13.

Snapshot of the results obtained in the BLAST

Snapshot of the distance tree of results
Does your protein belong to any protein family?
It is a proteinase (a cysteine protease, also known as the 3C-like protease).

Identify the structure page of your protein in RCSB

When was the structure solved? Is it a good quality structure?
It was deposited on January 26th, 2020 and released on February 5th, 2020, with a new version released on February 26th.
Regarding the quality, here is the full report. The reported resolution of this entry is 2.16 Å, which indicates a reasonably good-quality structure.
Are there any other molecules in the solved structure apart from protein?
Yes, it is studied with an inhibitor: N-[(5-METHYLISOXAZOL-3-YL)CARBONYL]ALANYL-L-VALYL-N~1~-((1R,2Z)-4-(BENZYLOXY)-4-OXO-1-{[(3R)-2-OXOPYRROLIDIN-3-YL]METHYL}BUT-2-ENYL)-L-LEUCINAMIDE
Does your protein belong to any structure classification family?
It is only classified as a viral protein. A structure classification family is not yet available, since the structure was published very recently.

Open the structure of your protein in any 3D molecule visualization software
Proteins can be represented as asymmetric unit, biological unit, unit cell and supercell. For 6LU7, we will use the biological unit in all representations, since that is considered the functional unit (2 chains).

Asymmetric unit

Biological unit

Unit cell

Super cell

Visualize the protein as "cartoon", "ribbon" and "ball and stick".
I decided to represent the biological unit of the protein, which is formed by 2 chains.

Ribbon; Cartoon; Ball and stick
Color the protein by secondary structure. Does it have more helices or sheets?

There is a similar amount of helices and sheets.
Pink: alpha helix; Yellow: beta strands; White: coil; Blue: beta turns
Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
Visualize the surface of the protein. Does it have any "holes" (aka binding pockets)?

A cysteine residue and a nearby histidine perform the protein-cutting reaction. This structure has a peptide-like inhibitor bound in the active site.


Part B: Folding Proteins

1. Folding a small (30 aa) peptide. Follow the "Setting up PyRosetta" instructions below and make sure you have a working PyRosetta installation

Open the "Protein Folding with Pyrosetta" Jupyter notebook. Execute interactively the code in the notebook and answer the questions therein. Link the notebook to your class page.

Click here to see the notebook script

Pick the lowest energy model and the lowest RMSD model and structurally (visually) compare them to the native. How close are they to the native? If they are different, what parts did the computer program get wrong?

Alignment results of native structure (green) vs. lowest energy model (blue) and lowest RMSD (pink)

2. Fold your own sequence! In question 1 we used the sequence from a human protein as input to the folding algorithm. Yet, in principle, you can give any arbitrary sequence of amino acids as an input.

Use any process to create a sequence of 30-50 amino acids, and predict its 3D structure using the notebook from Q1. You can try to run the script with multiple parameter combinations and compare the results. Log the parameters that had the best outcome.

I created a protein from the names of all the students in HTGAA: EYAL, MANVITHA, LYNCED, THOMAS, JACK, JOE, ANJALI and BELEN (letters that are not amino acid codes were replaced). The sequence was:

EYALMANVITHALYNCEDTHQMASIACKIQEANIALIVELEN
CHECK THE COMPLETE SIMULATION!
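Not every letter of the alphabet is a one-letter amino acid code, so names have to be adapted before folding. The sketch below rebuilds the sequence above from the list of names. The replacement table (J to I, O to Q, B to V) is inferred from comparing the names with the final sequence, so treat it as an assumption; the function name is my own.

```python
# The 20 standard one-letter amino acid codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# Replacements inferred from the sequence on this page (assumption).
REPLACEMENTS = {"J": "I", "O": "Q", "B": "V"}

def names_to_sequence(names):
    """Concatenate names into a sequence, replacing letters that are not amino acids."""
    residues = []
    for letter in "".join(names).upper():
        letter = REPLACEMENTS.get(letter, letter)
        if letter not in STANDARD_AMINO_ACIDS:
            raise ValueError(f"No amino acid code for letter {letter!r}")
        residues.append(letter)
    return "".join(residues)

students = ["EYAL", "MANVITHA", "LYNCED", "THOMAS", "JACK", "JOE", "ANJALI", "BELEN"]
sequence = names_to_sequence(students)
print(sequence, len(sequence))
```

Raising an error for unmapped letters (for example X or Z) avoids silently passing an invalid sequence to the folding notebook.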

Here is a list of the top folding structures with lowest energy scores:

And represented here the top 3 structures with the lowest energy scores:

decoy_33; decoy_7; decoy_30

Compare the resulting structures of 2(a) with those from question 1. Do the structures in both cases look protein-like ? If not, can you think of an explanation?
The structures actually look protein-like. However, at the beginning I used the fragments from the previous exercise and got structures that did not look like a protein. Then I generated specific fragments for my protein using Rosetta, and I got the results shown above.

Try folding multiple sequences to come up with the most protein-looking structure!
Since amino acids are coded as letters, I ran a few simulations on proteins created with the following sequences:

-My name misspelled: VELENVELENVELENVELENVELENVELENVELENVELEN

Click here to see the notebook script

And apparently, my name has homology with this cute boy!

-My sister's name (also misspelled): ALVAVICENTEVLACQVECALVAVICENTEVLACQVEC

Click here to see the notebook script

-Some sentence in Spanish: ESTAPRQTEINAFVECREADAPQRVELENENDQSMILVEINTEENPLENACRISISDELCQRQNAVIRVS

Click here to see the notebook script

3. Folding protein homologs. For this exercise you will be running multiple protein folding simulations. If you don't have access to a powerful machine, use any of the folding servers listed in the resources

Take the protein sequence from question 1 and randomly change 5 letters to any other amino acid. Predict the protein structure of the unedited (probably done already in Q.1) and edited protein and compare the results. Did the changes you introduced change the structure significantly?

Here I show the original sequence and the sequence with the changed amino acids:
Original: DAITIHSILDWIEDNLESPLSLEKVSERSGYSKWHLQRMFKKETGHSLGQYIRSRKMTEIAQKLKESNEPILYLAERYGFESQQTLTRTFKNYFDVPPHKYRMTNMQGESRFLHPL
Newseq: DAITIHSILDWIEDNLESPLSTEKVSERSGYYKWHLQRMFKKETGHSLGQYIRQRKMTEIAQKLKESNEPILYLAERYGTESQQTLTRTFKNYFDFPPHKYRMTNMQGESRFLHPL
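Because the two sequences are long, the five changes are hard to spot by eye. A small sketch that compares them position by position and reports each substitution (1-based positions; the function name is my own):

```python
def substitutions(original, mutated):
    """List (position, original residue, new residue) for every mismatch (1-based)."""
    if len(original) != len(mutated):
        raise ValueError("Sequences must have the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(original, mutated))
        if a != b
    ]

ORIGINAL = ("DAITIHSILDWIEDNLESPLSLEKVSERSGYSKWHLQRMFKKETGHSLGQYIRSRKMTEIAQKLKESNEPILY"
            "LAERYGFESQQTLTRTFKNYFDVPPHKYRMTNMQGESRFLHPL")
MUTATED = ("DAITIHSILDWIEDNLESPLSTEKVSERSGYYKWHLQRMFKKETGHSLGQYIRQRKMTEIAQKLKESNEPILY"
           "LAERYGTESQQTLTRTFKNYFDFPPHKYRMTNMQGESRFLHPL")

for position, old, new in substitutions(ORIGINAL, MUTATED):
    print(f"{old}{position}{new}")
```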

I created fragments for the new sequence in Robetta, and ran the folding simulation:

Click here to see the notebook script

For some reason, the folding simulation worked but I could not retrieve the decoy energies. Therefore, I looked at the predicted structures, and here is one that looked quite similar!

Take again the original sequence from Q.1 and now change 5 letters to favorable alternatives according to the BLOSUM matrix. Predict the protein structure for the new sequence and compare with the results of 3(a). Did the new changes have the same effect to the structure?

Scores within a BLOSUM matrix are log-odds scores: for each pair of amino acids, the score is the logarithm of the ratio between the likelihood of finding the two residues aligned in related proteins (i.e. for biological reasons) and the likelihood of finding them aligned by chance. Each matrix is named after the percentage identity used to cluster the aligned protein sequences it was calculated from. Every possible identity or substitution is assigned a score based on its observed frequency in alignments of related proteins. A positive score is given to the more likely substitutions, while a negative score is given to the less likely ones.
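To make the log-odds idea concrete, here is a minimal sketch of how one score could be computed from frequencies. The frequencies below are made-up illustrative numbers, not real BLOSUM data, and the function name is my own. The factor of 2 reflects that BLOSUM62 scores are reported in half-bit units and rounded to integers.

```python
import math

def log_odds_score(p_pair, q_a, q_b, scale=2.0):
    """Rounded log-odds score: scale * log2(observed pair frequency / chance frequency)."""
    expected_by_chance = q_a * q_b
    return round(scale * math.log2(p_pair / expected_by_chance))

# Hypothetical frequencies, for illustration only.
# Pair seen twice as often as chance predicts -> positive score.
print(log_odds_score(p_pair=0.02, q_a=0.1, q_b=0.1))
# Pair seen half as often as chance predicts -> negative score.
print(log_odds_score(p_pair=0.005, q_a=0.1, q_b=0.1))
```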

The new amino acid changes, chosen from favorable BLOSUM options, are the following:
Original: DAITIHSILDWIEDNLESPLSLEKVSERSGYSKWHLQRMFKKETGHSLGQYIRSRKMTEIAQKLKESNEPILYLAERYGFESQQTLTRTFKNYFDVPPHKYRMTNMQGESRFLHPL
Newseq: DAITIHSILDWIEDNLESPLSIEKVSERSGYAKWHLQRMFKKETGHSLGQYIRARKMTEIAQKLKESNEPILYLAERYGMESQQTLTRTFKNYFDIPPHKYRMTNMQGESRFLHPL
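As a rough check, the two sets of substitutions can be scored against BLOSUM62. The sketch below hard-codes only the nine matrix entries needed here, quoted from the standard BLOSUM62 table; it is not a full matrix, and the helper names are my own.

```python
# Subset of BLOSUM62 entries needed for the substitutions on this page.
BLOSUM62_SUBSET = {
    ("L", "T"): -1, ("S", "Y"): -2, ("S", "Q"): 0, ("F", "T"): -2, ("V", "F"): -1,
    ("L", "I"): 2, ("S", "A"): 1, ("F", "M"): 0, ("V", "I"): 3,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def total_score(pairs):
    return sum(pair_score(a, b) for a, b in pairs)

random_changes = [("L", "T"), ("S", "Y"), ("S", "Q"), ("F", "T"), ("V", "F")]
blosum_changes = [("L", "I"), ("S", "A"), ("S", "A"), ("F", "M"), ("V", "I")]

print("Random changes:       ", total_score(random_changes))
print("BLOSUM-guided changes:", total_score(blosum_changes))
```

The random set sums to a negative total and the BLOSUM-guided set to a positive one, which is consistent with the second set being more conservative. For real work, a complete matrix from a bioinformatics library would be a safer source than a hand-typed subset.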

I created fragments for the new sequence in Robetta, and ran the folding simulation:

Click here to see the notebook script

For some reason, the folding simulation worked but I could not retrieve the decoy energies. Therefore, I performed all the alignments with PyMOL, where I could retrieve the RMSD scores, and found the decoy with the lowest RMSD, which is decoy #14 in this case. Compared with the previous case, the structures now look almost identical.
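For reference, RMSD is the square root of the mean squared distance between matched atoms. The sketch below computes it for coordinates that are assumed to be already superimposed (PyMOL performs the superposition before reporting RMSD; this code does not) and picks the decoy with the lowest value. The coordinates and decoy names are made-up toy data, not my simulation results.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) points, without superposition."""
    if len(coords_a) != len(coords_b):
        raise ValueError("Coordinate lists must have the same length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

def lowest_rmsd_decoy(native, decoys):
    """Return the name of the decoy closest to the native structure."""
    return min(decoys, key=lambda name: rmsd(native, decoys[name]))

# Hypothetical toy coordinates (three atoms each).
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
decoys = {
    "decoy_a": [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)],
    "decoy_b": [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5), (2.0, 0.0, 0.5)],
}

for name, coords in decoys.items():
    print(name, round(rmsd(native, coords), 3))
print("Best:", lowest_rmsd_decoy(native, decoys))
```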
