MIT Class Site
Laura Maria Gonzalez
March 16, 2021
The first part of this week involved answering a few questions to get ourselves into the world of proteins. Similar to the previous week I had to familiarize myself with a few terms and have included them at the bottom of this page.

How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is about 100 Daltons)
100 grams of meat is on average 26 g of protein (according to Google), so 500 grams contains about 130 g of protein.
130 grams = 7.829e25 Daltons (1 gram is about 6.022e23 Daltons)
7.829e25 Daltons / 100 Daltons per amino acid = 7.829e23 amino acids. A very large amount!
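The same arithmetic can be written as a short Python sketch. The function name and the default parameters are my own choices for illustration; the 26% protein fraction and the 100 Dalton average are the rough estimates used above.

```python
# Daltons per gram is numerically equal to Avogadro's number.
DALTONS_PER_GRAM = 6.02214076e23


def amino_acids_in_meat(meat_g, protein_fraction=0.26, residue_da=100.0):
    """Estimate the number of amino acids in a piece of meat.

    protein_fraction and residue_da are rough averages, not measured values.
    """
    protein_g = meat_g * protein_fraction        # grams of protein
    protein_da = protein_g * DALTONS_PER_GRAM    # same mass in Daltons
    return protein_da / residue_da               # number of amino acids


count = amino_acids_in_meat(500)
print(f"{count:.3e}")  # about 7.829e+23
```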

Why are there only 20 natural amino acids?
An answer that made sense to me was that DNA is read in codons, groups of 3 bases that each encode 1 amino acid. With 4 possible bases per position there are 4x4x4 = 64 potential combinations. Often the first 2 bases are enough to determine the amino acid and the 3rd base can vary. This puts redundancy into the system that would be missing if each combination coded for its own "natural" amino acid. So with 64 codons and 20 amino acids (plus stop signals), each amino acid has about 3 codons on average as backup, and some have as many as 6.
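As a quick check of the counting argument, here is a small Python sketch that enumerates every possible codon and groups them by their first two bases. The variable names are mine, and I use the RNA alphabet (A, C, G, U).

```python
from itertools import product

BASES = "ACGU"

# Every ordered triple of bases is one codon: 4 * 4 * 4 = 64.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]
print(len(codons))  # 64

# Group codons that share their first two bases (third base free to vary).
families = {}
for codon in codons:
    families.setdefault(codon[:2], []).append(codon)

print(len(families))    # 16 groups of 4 codons each
print(families["GG"])   # all four GG* codons encode glycine in the standard code

# Three codons are stop signals in the standard code, leaving 61 for 20 amino acids.
print((len(codons) - 3) / 20)  # 3.05 codons per amino acid on average
```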

Why are most molecular helices right handed?
An answer to this question might be that right handedness is energetically more favorable because of fewer steric clashes between the side chains and the main chains that would result due to an overlap of any non-bonding atoms in a protein structure...

Where did amino acids come from before enzymes that made them, and before life started?
From an electrified primordial soup! In 1953, Miller and Urey combined ammonia, hydrogen, methane, and water vapor with some electrical sparks and were able to form new molecules. They identified the molecules as eleven standard amino acids and hypothesized that the first organisms could have come from environments similar to this.

What do digital databases and nucleosomes have in common?
Nucleosomes are the basic structural unit of DNA packaging in eukaryotes. To fit inside the cell nucleus, DNA must be packaged into compact structures, and nucleosomes are formed by interactions between DNA and histone proteins. They are dynamic and can spontaneously slide, split, or dissociate. This makes them similar to digital databases, which compact information through digitization and can also be reconfigured, moved, or removed.
I chose to investigate aquaporins (AQPs), specifically AQP1. AQPs were first discovered by Peter Agre in 1992 while he was investigating a protein that caused Rh disease. I also chose this protein after reading about its potential use in water filtration systems and its prevalence in a variety of organisms, including fungi, animals, and plants! Another interesting aspect of aquaporin proteins is that they are unusually stable. The red blood cells that carry aquaporins don't have repair-and-replace mechanisms, so red blood cell proteins need to be very stable to withstand the journey through the vascular system. They can also survive up to four months!

What are aquaporins?
They are "the plumbing system for cells". They selectively conduct water molecules in and out of the cell while preventing the passage of other molecules. (Some can also transport other small molecules such as ammonia, CO2, glycerol, and urea.)

Human (Homo sapiens), Bacteria
Water-specific channel that provides the plasma membranes of red cells and kidney proximal tubules with high permeability to water, thereby permitting water to move in the direction of an osmotic gradient.

269 in length
Most Frequent Amino Acid: L (33, 12%), A (28, 10%), G (27, 10%) ----- Leucine, Alanine, Glycine *Interesting to see the more prevalent aa are hydrophobic
Acts as a glycerol transporter in skin. Involved in skin hydration, wound healing, and tumorigenesis. + others

292 in length
Most Frequent Amino Acid: G (34, 11%), L (34, 11%), A (27, 9%) ----- Glycine, Leucine, Alanine
Channel that permits osmotically driven movement of water in both directions. It mediates rapid entry or exit of water in response to abrupt changes in osmolarity.

2ABM (E.coli)
231 in length
Most Frequent Amino Acid: G (35, 15%), A (34, 14%), L (26, 11%) ----- Glycine, Alanine, Leucine
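The "Most Frequent Amino Acid" figures above are residue counts and percentages of the sequence length. A minimal sketch of how such a tally could be computed is below. The function name is mine, and the sequence is a made-up toy string, not a real aquaporin sequence.

```python
from collections import Counter


def top_residues(sequence, n=3):
    """Return the n most common residues as (letter, count, percent) tuples."""
    seq = sequence.upper()
    counts = Counter(seq)
    return [(aa, k, round(100 * k / len(seq))) for aa, k in counts.most_common(n)]


# Hypothetical toy sequence for illustration only.
toy_sequence = "MLLAGGLLAAGVLLGAA"
for letter, count, percent in top_residues(toy_sequence):
    print(f"{letter} ({count}, {percent}%)")
```

Pasting in a real one-letter sequence (for example, from a FASTA file with the header line removed) in place of the toy string should give the same kind of summary.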
How many protein sequence homologs are there for the protein?

Using BLAST I found 100 sequence homologs. Almost all of the hits were strong matches, with alignment scores over 200.

Does it belong to any protein family?

Yes! These proteins belong to a superfamily known as Major Intrinsic Proteins (MIPs), which are transmembrane protein channels that are grouped together on the basis of homology (similar relation, relative position, and structure). The MIP superfamily has three subfamilies: aquaporins (AQPs), which are water selective; aquaglyceroporins, which are permeable to water and other small uncharged molecules; and superaquaporins (S-aquaporins). The MIP family is large, with thousands of members that form transmembrane channels and transport water, small carbohydrates like glycerol, urea, ammonia, carbon dioxide, and hydrogen peroxide.
When was the structure solved? Is it a good quality structure?

Using RCSB I found that the earliest model of AQP1 was released in October 2000 and was solved using electron crystallography. The AQPZ structure was released in September 2005 and was solved using X-ray diffraction. The structures appear to be of good quality.

Are there any other molecules in the solved structure apart from the protein?

No other molecules were included in the models I worked with.

Does your protein belong to any structure classification family?

Searching on SCOP I saw Aquaporin-like as a family. I'm not sure if there is another structure classification family beyond this. Aquaporins consist of six transmembrane alpha-helices. They form four-part clusters in the cell membrane, and each of the four monomers acts as a water channel. Different aquaporins have different sized water channels. The smallest channel types allow only water through. The profile of an aquaporin consists of conical entrances and an overall hourglass shape. The hourglass spans the thickness of the cell membrane and its central opening serves as a highly selective channel to ferry water bidirectionally.
AQP1 (Human) visualized in RCSB
AQPZ (Bacteria) visualized in RCSB
AQP1 visualized as "cartoon"
AQP1 visualized as "ribbon"
AQP1 visualized as "stick"
AQP1 visualized by residue type showing hydrophobic and hydrophilic distribution
In aquaporins the amino acids are positioned so that those that attract lipids form the outside of the hourglass and interact with the lipid cell membrane, while those that attract water line the internal surface of the hourglass. I was able to see this in the AQP1 visualizations by residue type. The hydrophobic residues formed the outer central portion of the protein, while the ends and inner structure were made from hydrophilic amino acids. Another interesting visualization came from the cartoon view, which showed alternating negative and positive charges. In AQPs this is used to "ferry" water molecules, and only water molecules, through the channel.

Water molecules have an asymmetric distribution of charges: the single oxygen atom makes one side negatively charged while the two hydrogen atoms make the other side positively charged. The alternating negative and positive charges in the lining of the aquaporin pore therefore escort the water molecules through one by one (at a rate of 3 billion per second!).
AQP1 secondary structure
AQP1 surface "holes"
The next step this week was to use Robetta (an online Rosetta engine) to explore protein folding. To start off, we needed to choose a protein that is less than 100 amino acids long. I chose a random protein at first that was about 98 amino acids long, and it finished processing in under an hour! Seeing how fast I was getting results, I decided to submit the full AQP1 sequence (269 aa) to Robetta for folding. Results came back in just a few hours!
AQP1 Robetta Folding Results
To align, I used PyMOL and aligned the entire protein based on its sequence. The results seemed fairly accurate! The known protein is in beige and the Robetta folding result is in magenta. Part of the reason I think the Robetta fold comes so close to matching the known protein is that the structure is primarily composed of alpha helices, which appear in blue in the Robetta preview when the color is set to error estimate. The ends, however (the loops), appear much less accurate, which also shows in the PyMOL analysis.
Results of PyMOL Alignment
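A common way to put a number on how closely a predicted structure matches a known one is the root-mean-square deviation (RMSD) between matched atoms. The sketch below is a plain Python illustration of that formula only. It assumes the two structures are already superimposed and that the atoms are already paired up, which are steps an alignment tool handles before reporting a value. The function name and the coordinates are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures need the same number of matched atoms")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical coordinates (in angstroms): the last atom is displaced by 2.
known = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 2.0, 0.0)]
print(round(rmsd(known, predicted), 3))  # 1.155
```

A lower RMSD means a closer match, so well-predicted helices would contribute small deviations and floppy loops would contribute large ones.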
The last part of this week's protein adventure was to learn about the use of Machine Learning in protein design and think about how this process could be useful to design proteins or optimize existing proteins to have better or worse properties and abilities. To do this we used a Google Colab Notebook. The model in the notebook was trained on data about mutations in a specific enzyme called beta-lactamase, which gives bacteria antibiotic resistance, and can be used to predict the effects of missense mutations.

The steps in the notebook involved loading the embeddings from a file and formatting them into a matrix. We then separated the data into a train/test split, using 80% of the data to train the model and 20% to test how it performed. Afterwards, instead of using the original embeddings, you can reduce the number of features through Principal Component Analysis (PCA), which computes a new, smaller set of features that best explains the variance in the data. We then chose one of three machine learning models: K-Nearest-Neighbors, SVM, or a random forest regressor. In the end we got a score for how well the model makes its predictions, which shows how useful the embeddings are. The Spearman correlation score assesses how well the relationship between two variables can be described by a monotonic function. The result is always between +1 and -1 (perfect positive correlation to perfect negative correlation).
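To make the split-then-predict idea concrete, here is a minimal standard-library Python sketch of an 80/20 train/test split followed by a K-Nearest-Neighbors regressor. It is not the notebook's code: the data is synthetic (random 2-D points standing in for embeddings), the PCA step is left out, and all function names are my own.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def knn_predict(train, x, k=5):
    """Predict a value for x as the mean target of its k closest training points."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Synthetic stand-in data: the "activity" is simply the sum of the two features.
rng = random.Random(1)
data = []
for _ in range(100):
    features = (rng.random(), rng.random())
    data.append((features, sum(features)))

train, test = train_test_split(data)
errors = [abs(knn_predict(train, x) - y) for x, y in test]
print(len(train), len(test))      # 80 20
print(sum(errors) / len(errors))  # mean absolute error on the held-out 20%
```

The held-out 20% is never seen while predicting from neighbors, which is what makes the final score an honest estimate of how the model handles new mutations.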

Spearman correlation score with PCA(48) and K-Nearest-Neighbors: 0.7766183167583087

Spearman correlation score with PCA(100) and K-Nearest-Neighbors: 0.8065715450817204

Spearman correlation score with PCA(1000) and K-Nearest-Neighbors: 0.774468719550606

By testing the K-Nearest-Neighbors model we see a high positive correlation (about 0.77 to 0.81), which suggests that the model can be used to predict how mutations alter the activity of beta-lactamase.
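For intuition about what the Spearman score measures, here is a small pure-Python sketch: it ranks both lists (giving tied values their average rank) and then computes the ordinary Pearson correlation of the ranks. The function names are mine and the inputs are toy numbers, not results from the notebook.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))       # close to +1: same ordering
print(spearman([1, 2, 3, 4], [8, 6, 4, 2]))        # close to -1: reversed ordering
print(spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # close to 0.8: mostly the same ordering
```

Because only the ordering matters, a model can score well even if its predicted values are on a different scale from the measured ones, as long as it ranks the mutations correctly.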

How can this be useful?
Going back to aquaporins, so much of their functionality results from their form and their hydrophilic/hydrophobic relationships. Could we use a similar ML method to study AQP mutations and examine whether we can make the proteins even more efficient, perhaps by moving water more quickly? This might be useful when studying AQPs in plants and figuring out whether there are AQP amino acid configurations that allow a plant to be more drought tolerant.
Homologs -> similar proteins. We can know a good amount about a protein by looking at similar proteins from other organisms.

Transmembrane Protein (TP) ->
a type of membrane protein that spans the entirety of the cell membrane. Many transmembrane proteins function as gateways to permit the transport of specific substances across the membrane.

Amino Acid Parts ->
amine group, carboxylic acid group, and a residue. The amine and carboxylic groups give the name 'amino acid' and these two parts are identical to those of other amino acids. The residue is unique among the amino acids.

Ligand ->
a substance that has the ability to bind to and form complexes with other biomolecules in order to perform biological processes. It is a molecule that triggers signals and binds to the active site of a protein through intermolecular forces.
“Water, Water Everywhere.” The Age of Living Machines: How Biology Will Build the next Technology Revolution, by Susan Hockfield, W.W. Norton & Company, 2020, pp. 49–72.

Kovach, Tracy. “Classification of Amino Acids (Video).” Khan Academy, Khan Academy,

“Overview of Protein Structure (Video).” Khan Academy, Khan Academy,

Gutiérrez-Preciado, Ana. “An Evolutionary Perspective on Amino Acids.” Nature, Nature Publishing Group, 2010,
Missense Mutation -> a point mutation in which a single nucleotide change results in a codon that codes for a different amino acid.

Enzymes -> proteins that act as biological catalysts (biocatalysts). The molecules upon which enzymes may act are called substrates, and the enzyme converts the substrates into different molecules known as products.

Beta-lactamases -> enzymes produced by bacteria that provide multi-resistance to beta-lactam antibiotics such as penicillins, cephalosporins, cephamycins, and carbapenems.

Dalton -> also known as an atomic mass unit, it is a unit of mass equal to one twelfth of the mass of a free carbon-12 atom at rest. Its value is approximately 1.660 x 10^-27 kg.

AQP1 visualized in PyMOL