Diffing Coronaviruses

  • diffing is not the right approach, it can't detect smaller blocks moving about.

    since the genome is made out of only A,T,G,C for any change beyond the trivial diffing may generate the incorrect picture, it lines up the "wrong" bases.

    Here is a much simpler solution that does a much better job and you don't have to muck around so much. First install some tools:

      conda install -c bioconda entrez-direct emboss
    
    Now do a:

      # Get the data.
      efetch -db nuccore -id MG772933 -format fasta > MG772933.fa
      efetch -db nuccore -id MN988713 -format fasta > MN988713.fa
      
      # Align globally.
      stretcher -filter  MG772933.fa MN988713.fa  > aligned.txt
    
    A much simpler explanation of what is going on (I picked a variable region here):

      ...
      MG7729 CACTCCTCACTATTCATAGAGGA-----GAC-CCT----ATGCCTAATAA
             :  : ::  :: : :::::: :      ::: :::    :: : :  : :
      MN9887 CTTTACTTGCTTTACATAGAAGTTATTTGACTCCTGGTGATTCTTCTTCA
      ...
    
    it also reports:

      # Length:           29922
      # Identity:   26272/29922 (87.8%)
      # Gaps:         160/29922 ( 0.5%)

  • I did something similar recently using my custom terminal diff tool. Here is a more colourful terminal-based view of the differences between MN908947.3 and sars NC_004718.3:

    https://twitter.com/RobertElderSoft/status/12226906294923796...

    To produce that image I used this tool:

    https://github.com/RobertElderSoftware/roberteldersoftwaredi... With these two genomes:

    https://www.ncbi.nlm.nih.gov/nuccore/MN908947

    https://www.ncbi.nlm.nih.gov/nuccore/NC_004718.3

    with the leading numbers, spaces and newlines removed from each genome file.

    My tool (as well as unix diff with --minimal) uses the myers diff algorithm which just compute a mathematically minimal diff. More advanced algorithms exist for computing phylogenetic trees that take into account the biological likelihood of certain sequence change (deletions vs. additions vs. translations etc.).

  • diff is a special case of a more or less obligate preprocessing step for sequence alignment algorithms.

    You need to go further than diff to make meaningful phylogenetic comparisons, though!

    https://en.wikipedia.org/wiki/Longest_common_subsequence_pro... has much more.

  • FYI—sequence alignment algorithms are really just a specialized subset of diffing algorithms that take into account the properties of biological sequences. (For example, certain edits are more or less likely to happen.)

    https://en.wikipedia.org/wiki/Sequence_alignment

  • Definitely not a great approach.

    It's worth pointing toward the actual algorithms commonly used for local and global alignments (global is the one to use in the coronavirus genome case).

    Local alignments: Smith-Waterman [1]. Global alignments: Needleman–Wunsch [2]

    [1] https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorit... [2] https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algor...

  • undefined

  • I've been using a kmer approach, just a few lines in Python, and it's interesting comparing coronavirus to many different genomes.

    The neat thing about kmers is that it is alignment free, so can find similarities between genomes even if genes have been moved around.

    Interestingly, it finds HIV is quite close to cov compared to a random sampling of refseq viruses. Maybe there is credence to that Indian paper after all.

  • I read once that we share 99% of our DNA with Chimpanzees but we also share 95% of DNA with fruit flies. So if the viruses are 89% similar does that mean the difference between SARs and nCOV is bigger than the difference between a human and fruit fly? Who knows what this analysis might be telling us.

  • This is a good start. However when trying to build phylogenies and understand evolutionary relationshios there are many next steps beyond sequence identity.

  • Interesting, in light of a recent discussion about viruses having very high gene variability: https://news.ycombinator.com/item?id=22276105

  • It seems NCBI's index is not updated. You should try MN996532, which is also from bats and the sequence was uploaded Jan 29. It shows 96% similarity.

  • undefined