DNA sequence classification is a challenging problem for machine learning, and a highly relevant one for understanding how DNA relates to conditions such as disease. This post walks through setting up the data and using kernels to extract features. It is the first part of a three-part series.

Data Structure

DNA data is generally given in the form of sequences, built from four nucleotide bases:

  • Adenine
  • Cytosine
  • Guanine
  • Thymine

These bases are typically written together as a single string using only the first letter of each base, with no spacing (e.g. ACTGCCTGGGA). Occasionally an N appears in the sequence due to incomplete data. Imputing the missing base is not reasonable, so other methods are used to work around it. Keeping the N in place preserves the structure of the data and is therefore preferred to dropping the nucleotide base entirely.

Typically the goal is to classify these DNA sequences into particular categories. Before that can be done, the data usually needs to be transformed into a feature space that our ML algorithms can work with. There are two main approaches, along with many variants, which I describe in this post.

The first step is, of course, to import the data into your preferred structure. For strings I prefer working with NumPy arrays. This can be done by simply looking at the relevant information and importing it. I import my data as a data matrix \(A\) of shape (n_data, n_features), and I also convert the class labels from strings into integers. I typically place my code into functions or classes, so variables such as testSize and randomState below are parameters that can be changed as desired.

import numpy as np
from sklearn.model_selection import train_test_split

# load the CSV and transpose so each row of x_temp holds one column of the file
x_temp = np.transpose(np.loadtxt(csvpath, dtype=str, delimiter=","))

# labels from strings to integers
x_temp[0] = np.unique(x_temp[0], return_inverse=True)[1].tolist()
x_temp = np.char.strip(x_temp)
y_t = x_temp[0].astype(int)  # note: np.int is deprecated, the builtin int works

# train test split
X_train, X_test, y_train, y_test = train_test_split(
    x_temp[2], y_t, test_size=testSize, random_state=randomState
)

Querying the resulting X_test ndarray (a NumPy array of shape (2068,)) gives an output like this:

['TCCTTGACCTGGGTTCCCCCTCTCCTGCAGAACGATTCCCTGATGAGGCAGATGCGGGAA'
 'TTGATAACATGACATTTTCCTTTTCTACAGAATGAAACAGTAGAAGTCATCTCAGAAATG'
 ...
 'CATGCCTTGAATTTCTTTTCTGCACGACAGGTCTGCCAGCTTACATTTACCCAAACTGTC'
 'GATTCTCTTCAGCCAATCTTCATTGCTCAAGTATGACTTTAATCTTCCTTACAACTAGGT']

and the y_test ndarray (also of shape (2068,)) looks as follows:

[1 1 1 0 1 0 0 2 0 2 2 ... 1 0 2 1 1 1 2 0 0 2 2 0]

This shows that there are actually three classes here, so multi-class classification is needed. We can either reduce the problem to binary classifiers using One-vs-One or One-vs-Rest strategies, or use a method that has been extended from the binary case. Here either approach is likely to work, but for more complex datasets with more class labels it is often preferable to find an extended method that generalizes well. A sketch of both reduction strategies follows below.
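Both strategies are one line each in scikit-learn. This is only a sketch: it assumes a numeric feature matrix such as the featuresTrain/featuresTest arrays built in the feature extraction section below, since a linear SVM cannot consume raw strings, and LinearSVC is just one possible base classifier.

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

# each wrapper turns the binary LinearSVC into a multi-class classifier
ovr = OneVsRestClassifier(LinearSVC()).fit(featuresTrain, y_train)
ovo = OneVsOneClassifier(LinearSVC()).fit(featuresTrain, y_train)
print(ovr.score(featuresTest, y_test), ovo.score(featuresTest, y_test))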

Feature Extraction

Now that we have the DNA sequence strings as well as the class labels, feature extraction is needed for some of our ML models to work at all, since many ML models are designed to operate on vector spaces containing the data. There are also ML models that can take sequences in directly and work on them; those will be covered later.

Kernels

The motivation behind using kernels is that they allow us to map our data into an inner product space which, when properly configured, lets us use linear classifiers to solve nonlinear problems.

Let us define a kernel as \(K(x,y) = \langle f(x),f(y)\rangle\), a mapping on a domain \(D\) with \(K: D \times D \to \mathbb{R}\) satisfying the following conditions:

  • symmetric
  • continuous
  • positive semi-definite

A few examples of kernels tested within this project are demonstrated below:

  • linear: \(K(x,y) = x^Ty \quad x,y \in \mathbb{R}^d\)
  • gaussian: \(K(x,y) = e^{-\frac{\|x-y\|^2}{2\sigma^2}} \quad x,y\in \mathbb{R}^d, \sigma > 0\)
  • polynomial: \(K(x,y) = (x^Ty + c)^p \quad x,y \in \mathbb{R}^d, c \ge 0\)
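For reference, all three can be written in a few lines of NumPy. This is a sketch on plain vectors; sigma, c, and p are free parameters.

import numpy as np

def linear_kernel(x, y):
    return x @ y

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def polynomial_kernel(x, y, c=1.0, p=2):
    return (x @ y + c) ** p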

I won't go through all the mathematical details, but a key point is that kernels are very useful for mapping nonlinear data into spaces where it becomes linearly separable. They are of particular use for SVMs (Support Vector Machines), since an SVM is trained by solving a dual quadratic programming problem that depends on the data only through inner products \(\langle x_i,x_j\rangle\), which can be replaced by \(K(x_i,x_j)\) and evaluated directly in the feature space without ever computing the mapping explicitly.
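Concretely, scikit-learn's SVC accepts a precomputed Gram matrix via kernel="precomputed". The snippet below is a sketch on toy data using the linear kernel; the same pattern works for any kernel function.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(20, 5)), rng.normal(size=(8, 5))
y_tr = np.arange(20) % 2  # toy binary labels

K_tr = X_tr @ X_tr.T  # K(x_i, x_j) over all training pairs
K_te = X_te @ X_tr.T  # rows index test points, columns index training points

clf = SVC(kernel="precomputed").fit(K_tr, y_tr)
print(clf.predict(K_te))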

Spectrum Kernel

The Spectrum Kernel was first introduced by Leslie et al. in 2002 and has since been commonly used for extracting features from DNA and RNA sequences in a computationally inexpensive manner. It is a string kernel defined over an alphabet \(\mathcal{A}\) with \(|\mathcal{A}| = l\). For a fixed length \(k \ge 1\), counting all substrings \(a\) of length \(k\) defines a feature map from the input space to \(\mathbb{R}^{l^k}\), \(\Phi_k(x)=(\phi_a(x))_{a \in \mathcal{A}^k}\), where \(\phi_a(x)\) is the number of times \(a\) appears in the string \(x\). This gives the \(k\)-Spectrum Kernel, also widely known as the kMers Kernel, which can be represented as:

\[ K_k(x,y)=\langle \Phi_k(x),\Phi_k(y)\rangle\]
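A tiny worked example with \(k=2\) makes the feature map concrete; here collections.Counter stands in for the (sparse) feature vectors.

from collections import Counter

def spectrum_counts(s, k):
    # count every overlapping k-length substring of s
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

phi_x = spectrum_counts("ACGAC", 2)  # Counter({'AC': 2, 'CG': 1, 'GA': 1})
phi_y = spectrum_counts("GACGA", 2)  # Counter({'GA': 2, 'AC': 1, 'CG': 1})

# K_2(x, y) = sum over shared 2-mers of the count products
print(sum(phi_x[a] * phi_y[a] for a in phi_x))  # 2*1 + 1*1 + 1*2 = 5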

The kMers Kernel can be implemented rather simply using the Python libraries itertools and regex. The idea is as follows:

  • generate all possible \(k\)-length strings (k-mers) over ['A', 'C', 'G', 'T']
  • iterate through these k-mers and the data strings
  • find all (overlapping) occurrences of each k-mer in each data string
  • sum up the occurrences in a new matrix indexed by data string and k-mer

This can be simply implemented in Python as shown below.

import itertools
import numpy as np
import regex as re  # third-party regex module; the stdlib re cannot find overlapping matches

# function to find all occurrences of a substring in the text, overlapping allowed
def count_overlapping(sub, text):
    return len(re.findall(sub, text, overlapped=True))

# generate all possible k-mers
k = 3  # example k-mer length; a free parameter
bases = ["A", "T", "G", "C"]
comb = ["".join(p) for p in itertools.product(bases, repeat=k)]
comb = np.asarray(comb)
featuresTrain = np.zeros(shape=(len(X_train), len(comb)), dtype=int)
featuresTest = np.zeros(shape=(len(X_test), len(comb)), dtype=int)

# saves the features, X_train only
for m in range(len(X_train)):
    s = str(X_train[m])
    for n in range(len(comb)):
        featuresTrain[m, n] = count_overlapping(comb[n], s)
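Once the count matrices are filled in (the same loop is assumed to have also been run over X_test), the kernel values from the definition above are just inner products of feature rows:

# Gram matrix of the k-Spectrum Kernel: K[i, j] = <phi_k(x_i), phi_k(x_j)>
K_train = featuresTrain @ featuresTrain.T

# cross-kernel between test and training sequences, e.g. for a
# precomputed-kernel SVM at prediction time
K_test = featuresTest @ featuresTrain.T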

The kMers Kernel can be adjusted to account for gaps, mismatches, deletions, insertions, etc. with simple modifications to the code shown above. This is done via the fuzzy-matching features of regular expressions in the regex module.

# allow up to `mismatch` substitutions per match; the (?:...) group scopes
# the regex module's fuzzy constraint to the whole k-mer
def count_overlapping_mismatch(sub, text, mismatch):
    pattern = "(?:" + str(sub) + "){s<=" + str(mismatch) + "}"
    return len(re.findall(pattern, text, overlapped=True))

# allow up to `max_gap` insertions per match
def count_overlapping_gapped(sub, text, max_gap):
    pattern = "(?:" + str(sub) + "){i<=" + str(max_gap) + "}"
    return len(re.findall(pattern, text, overlapped=True))

# allow up to `mismatch + max_gap` total errors of any kind
def count_overlapping_error_sum(sub, text, mismatch, max_gap):
    error_sum = mismatch + max_gap
    pattern = "(?:" + str(sub) + "){e<=" + str(error_sum) + "}"
    return len(re.findall(pattern, text, overlapped=True))
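A quick way to see how the fuzzy counts behave is to compare them against the exact count on a short string; the exact numbers depend on the regex module's fuzzy-matching semantics, so treat this as exploratory.

text = "ACGTACGT"
print(count_overlapping("ACG", text))              # exact matches only
print(count_overlapping_mismatch("ACG", text, 1))  # up to 1 substitution
print(count_overlapping_gapped("ACG", text, 1))    # up to 1 insertion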

Valued Occurrence Kernel

Another simple kernel that can be used to transform the data is the Valued Occurrence Kernel:

\[ K_n(x,y)=\langle \phi_n(x),\phi_n(y)\rangle\]

where \(\phi_n(x)\) represents the number of times each nucleotide appears in the data string, scaled by its assigned value. This is even more trivial to implement than the kMers Kernel.

def AssignValue(X_train, X_test, A=1, C=2, T=3, G=4, N=0):
    # one weighted-count column per symbol, in the order A, C, T, G, N
    xTrainValues = np.zeros(shape=(len(X_train), 5), dtype=float)
    for m in range(len(xTrainValues)):
        s = str(X_train[m])
        xTrainValues[m, 0] = len(re.findall("A", s)) * A
        xTrainValues[m, 1] = len(re.findall("C", s)) * C
        xTrainValues[m, 2] = len(re.findall("T", s)) * T
        xTrainValues[m, 3] = len(re.findall("G", s)) * G
        xTrainValues[m, 4] = len(re.findall("N", s)) * N
...

When the values of the nucleotide bases are all set to 1 and N is set to 0, this is identical to the kMers Kernel with \(k=1\). The kernel is not very effective without prior knowledge, but if something were known in advance about nucleotide base proportions, it could be more effective given how cheap it is to compute.
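That equivalence is easy to sanity-check on a single sequence; here is a minimal sketch, with every base weighted 1 and N weighted 0.

s = "ACGTNACC"
spectrum_1 = [s.count(b) for b in "ACTG"]  # k = 1 spectrum features: raw base counts
valued = [s.count(b) * w for b, w in zip("ACTGN", (1, 1, 1, 1, 0))]
print(spectrum_1)  # [2, 3, 1, 1]
print(valued)      # [2, 3, 1, 1, 0]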

Resources

  • C. Leslie, E. Eskin, and W. S. Noble. The Spectrum Kernel: A String Kernel for SVM Protein Classification. Proceedings of the Pacific Symposium on Biocomputing, 564-575, 2002.