DNA Sequence Classification
using Machine Learning

Introduction

DNA sequencing is the process of determining the sequence of nucleotides (As, Ts, Cs, and Gs) in a piece of DNA.

The human genome contains about 3 billion base pairs that spell out the instructions for making and maintaining a human being.

In the DNA double helix, the four chemical bases always bond with the same partner to form "base pairs." Adenine (A) always pairs with thymine (T); cytosine (C) always pairs with guanine (G). This pairing is the basis for the mechanism by which DNA molecules are copied when cells divide, and the pairing also underlies the methods by which most DNA sequencing experiments are done.
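The pairing rule can be written down directly as a lookup table. The following is a minimal sketch; the function names are chosen here for illustration only.

```python
# Watson-Crick pairing rules: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base partner of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the partner strand, read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("ATCG"))          # TAGC
print(reverse_complement("ATCG"))  # CGAT
```

Because each base has exactly one partner, applying `complement` twice returns the original strand, which is what makes copying from a single strand possible.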

Since the completion of the Human Genome Project (1990-2003), technological improvements and automation have increased speed and lowered costs to the point where individual genes can be sequenced routinely. Some labs can sequence well over 100,000 billion bases per year, and an entire genome can be sequenced for just a few thousand dollars.

Methodology

Dataset

The dataset, compiled in 1990, consists of 106 instances described by 58 attributes. Every instance carries a binary label, either ‘+’ or ‘-’.
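Most classifiers expect numeric input, so symbolic data of this kind usually has to be encoded first. The sketch below shows one common option, one-hot encoding, under the assumption that the attributes include nucleotide letters. The toy rows, the column order, and the function names are hypothetical and are not taken from the dataset.

```python
BASES = "ACGT"  # fixed column order for the encoding (an assumption)


def one_hot(sequence):
    """Encode a nucleotide string as a flat 0/1 list, four entries per position."""
    encoded = []
    for base in sequence.upper():
        encoded.extend(1 if base == candidate else 0 for candidate in BASES)
    return encoded


def encode_label(label):
    """Map the '+' / '-' class labels to 1 / 0."""
    return 1 if label == "+" else 0


# Hypothetical toy rows of (label, sequence), for illustration only.
toy_rows = [("+", "ACGT"), ("-", "TTGA")]
X = [one_hot(sequence) for _, sequence in toy_rows]
y = [encode_label(label) for label, _ in toy_rows]

print(X[0])  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(y)     # [1, 0]
```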

Train-test split

The data is split into a training set and a test set so that the model can later be validated on data it has not seen. The test set is 25% of the original data, and the remaining 75% is used for training.
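A split of this kind can be sketched with the standard library alone. The 25% held-out fraction, the fixed seed, and the helper name are assumptions made for illustration; libraries such as scikit-learn provide an equivalent ready-made function.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle the sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# 106 instances, with 25% held out (assumed fraction).
train_idx, test_idx = split_indices(106, test_fraction=0.25)
print(len(train_idx), len(test_idx))  # 79 27
```

Shuffling before splitting matters when the rows are stored grouped by label, since an unshuffled split could leave one class out of the test set entirely.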

Model Selection

To find the model that performs best on this dataset, we train several different models, including Logistic Regression, SVM, Naïve Bayes, and Neural Networks.
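One way to organise such a comparison is to keep the candidate models in a dictionary and loop over them. The sketch below assumes scikit-learn is available and uses synthetic stand-in data rather than the DNA dataset. The hyperparameters, the feature count, and the 25% held-out fraction are illustrative choices, not settings from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption): 106 samples with binary labels.
X, y = make_classification(n_samples=106, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "Naive Bayes": GaussianNB(),
    "Neural Network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=2000, random_state=0
    ),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                  # train on the training split
    scores[name] = model.score(X_test, y_test)   # accuracy on the held-out split

# Print the candidates from best to worst test accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```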

Model Training

The models are trained one by one on the training data. For the iteratively fitted models, such as logistic regression and neural networks, the parameters are updated using gradient descent.
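To make the update rule concrete, here is a minimal sketch of batch gradient descent for logistic regression, written with NumPy. The learning rate, step count, toy data, and function names are assumptions for illustration. Library implementations typically use more refined optimisers.

```python
import numpy as np


def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_regression(X, y, learning_rate=0.1, n_steps=500):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(n_steps):
        probabilities = sigmoid(X @ weights + bias)
        error = probabilities - y  # gradient of the log loss w.r.t. the scores
        weights -= learning_rate * (X.T @ error) / n_samples
        bias -= learning_rate * error.mean()
    return weights, bias


# Hypothetical toy data: the label is 1 when the first feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(float)

weights, bias = fit_logistic_regression(X, y)
predictions = (sigmoid(X @ weights + bias) >= 0.5).astype(float)
print("training accuracy:", (predictions == y).mean())
```

Each step moves the parameters a small distance against the gradient of the loss, so the loss on the training data decreases as long as the learning rate is small enough.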

Model Classification

Model performance is compared using the accuracy score, which is the proportion of classifications the model gets right.
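The metric itself is simple enough to compute by hand. The sketch below uses made-up labels and a helper name chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(truth == guess for truth, guess in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 4 of the 5 predictions are correct.
y_true = ["+", "-", "+", "+", "-"]
y_pred = ["+", "-", "-", "+", "-"]
print(accuracy(y_true, y_pred))  # 0.8
```

Accuracy is easiest to interpret when the two classes are roughly balanced. If one class dominates, a model that always predicts the majority class can still score well.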


Coding Part