SARS-CoV-2 genome analysis | Tulga-Erdene Sodjargal

I took a class called “Bio-Data Structures” in my second year. Don’t pay attention to the name, the class had nothing to do with data structures. It was based on the famous Compeau and Pevzner book, and was aimed to introduce programming principles and to teach how to translate biological problems into well-defined computational ones. In the final project, we were asked to analyze SARS-CoV-2 genomes from April 19, 2021 to June 28, 2021. Even though I ended up doing more computational chemistry/biophysics things rather than computational omics, I quite enjoyed the course and it ignited my interest in computational biology. Below is a modified version of my final report.

The dataset was in .fasta format, which is basically just text files with the genomic sequences and some metadata. The analysis pipeline was split into three main steps:

Run a multiple sequence alignment (MSA) against a reference spike RNA sequence.
Scan the aligned sequences for mutations and dump them into .csv files.
Plot the mutation frequencies to see how they evolve over time.

Multiple Sequence Alignment (MSA)

Let’s start with the alignment part. The algorithm we used is heuristic: instead of aligning all sequences at once (which is technically the “optimal” way but computationally insane), it builds up a profile from previously aligned sequences and matches new ones against it. That means it’s not perfectly accurate, since the algorithm doesn’t know what future sequences will look like. But in this case, the genomes are all from the same virus, collected within just a couple of months, so the accuracy loss is predicted (and assumed) to be acceptable.

For context: the heuristic method is O(n²) in both time and space. The “true” optimal alignment would blow up to O(nᴺ) because it tries to align all N sequences in one giant multidimensional graph. One can reduce the space complexity to O(n) using divide-and-conquer tricks when facing memory limitations, but then you double the runtime.

The scoring system we used was slightly tweaked compared to the defaults:

match = +1
mismatch = 0
indel (insertion/deletion) = −5

The penalties are a bit lower than widely used numbers as we know the genomic sequences are of the same organism, and the organism studied in this project is an RNA virus with a tendency to quickly mutate, so we should allow a small number of mismatches and indels by design and avoid overfitted sequences.

Mutation Detection

Once the sequences are aligned, we start looking for mismatches and count how often they show up. Each mismatch/indel is considered one single-nucleotide polymorphism. Such simple approach helps us to find these mutations very easily, without major limitations. The main limitation of this approach is the inability to find more complex limitations, i.e., mutations other than single-nucleotide mutations. So, when this simple approach is used, subsequent manual analyses is needed to find more complex mutations and extract more information from the genomic sequences.

To make the mutation data more biologically meaningful, we adjust the nucleotide positions. For example, the most frequent mutation in our dataset was an A→G change at position 1844. When compared to the reference SARS-CoV-2 genome, that actually maps to position 23403, which sits right inside the S gene (the one coding for the spike protein). Once we shifted the numbering baseline, everything lined up correctly with known mutations.

Here are the most common single-nucleotide mutations we found:

Mutation index	Nucleotide(s)’ position	Nucleotide(s) change	Cumulative frequency in assignment dataset	Cumulative frequency in open-source dataset
1	21618	C>G	59.4	77.8
2	21765	TACATG>-	37.4	21.1
3	21846	C>T	48.5	43.4
4	21987	G>A	56.1	47.6
5	21991	TTA>-	37.8	21.4
6*	22028	GATTCA>-	59.0	-
7	22029	AGTTCA>-	59.0	74.9
8	22917	T>G	60.2	77.0
9	22995	C>A	59.4	76.7
10	23063	A>T	39.3	21.0
11	23271	C>A	39.1	21.2
12	23403	A>G	100	99.98
13	23604	C>G	60.0	78.0
14	23604	C>A	39.3	21.5
15	23709	C>T	39.1	21.2
16	24410	G>A	57.5	76.7
17	24506	T>G	39.1	21.2
18	24914	G>C	38.9	21.2

*excluded from the graph as the mutation is not found in open-source dataset

Interestingly, while the frequencies in our small dataset didn’t exactly match the open-source dataset, the trends were consistent. If you plot them, the correlation is basically a straight line with slope ~1, intercept 0. The small discrepancies are just sampling bias: our 100 genomes weren’t fully “representative,” while the open-source dataset is much larger.

Multiple-Nucleotide Mutations

Now, we can see that there are 3 or 6 indels show up right after the other. For example, the sets of mutations at positions 21765–21770, 21991–21993, and 22029–22034 clearly move together. We can combine these the single-nucleotide mutations into a multiple-nucleotide mutation, and this combination is corroborated by statistical and biological evidence:

In the assignment dataset, the frequency plots of the single-nucleotide changes are identical in both frequency and time. Thus, there is a strong chance these changes are connected.
In the open-source dataset, the frequency of only one or few of these single-nucleotide changes is much smaller than the frequency of the multiple-nucleotide mutation.
These nucleotide changes occur for 3 or 6 nucleotides, which is equal to the length of 1 and 2 codons, respectively. So, when 3 or 6 nucleotides are deleted, a whole codon or 2 codons are dropped, which indicates an amino acid deletion mutation.

From Nucleotides to Amino Acids

Once you know the nucleotide positions, it’s straightforward to map them to amino acids using a codon table. That lets you check which changes are synonymous (don’t affect the protein sequence) versus non-synonymous (do change the protein sequence).

Mutation index	Nucleotide(s)’ position	Nucleotide(s) change	Amino acid change
1	21618	C>G	T19R
2	21765	TACATG>-	HV 69–70 deletion
3	21846	C>T	T95I
4	21987	G>A	G142D
5	21991	TTA>-	Y144 deletion
6*	22028	GATTCA>-	-
7	22029	AGTTCA>-	EF 156-157 deletion
8	22917	T>G	L452R
9	22995	C>A	T478K
10	23063	A>T	N501Y
11	23271	C>A	A570D
12	23403	A>G	D614G
13	23604	C>G	P681H
14	23604	C>A	P681R
15	23709	C>T	T716I
16	24410	G>A	D950N
17	24506	T>G	S982A
18	24914	G>C	D1118H

This is where things get interesting, and I think this was the point of the whole project. Some of these mutations we found are known “defining mutations” of major variants:

Delta (21A/S:478K): mutations 1, 4, 7, 8, 9, 12, 14, 16
Alpha (S.501Y.V1): mutations 2, 5, 10, 11, 12, 13, 15, 17, 18

If you track their frequencies over time, you can actually see the variant takeover happening. Before May 3, Alpha mutations dominate (~75–80%) and Delta is still rare (~10%). After May 3, Alpha drops off and Delta rapidly rises. Mutation 12 is present in both, so it stays at 100%. This matches real-world data almost perfectly: by June 4, 2021, the Delta variant had officially become dominant in the UK.

Functional Impact

Using this data, we can also determine which mutations are more likely to be responsible for increased pathogenicity. One of such cases is the case of P681R and P681H mutations of the Delta and Alpha strands, respectively. It has been determined that the P681 mutation is key to the increased pathogenicity of the Delta variant. The reason behind the effectiveness of this mutation is the cleavage of S1 and S2 subunits which is predicted to improve cell-surface-mediated virus entry. The alpha variant also exhibits enhanced S1/S2 cleavage thanks to its P681H mutation, however, the P681R mutation is much more effective than the P681H mutation.