DNA sequence classification is a challenging problem for machine learning, and also a very relevant one for determining how DNA relates to various conditions such as diseases. This post goes through the process of setting up a pipeline to determine optimal hyperparameters in an automated fashion. This is the second part in a three-part series.

Transformed Data

The data coming into the ML classifiers has been transformed by either the Spectrum or the Valued Occurrence kernel. The issue with using these kernels is that it becomes difficult to empirically determine which of the values that affect them are optimal. It is also possible that a differently transformed data set works better with different classifier hyperparameters. The solution is to test various transformed data sets together with various hyperparameters by setting up a proper pipeline.
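To make the Spectrum idea concrete, here is a minimal sketch (the sequence "ATAT" and the counting loop are purely illustrative, not the implementation used later): each sequence becomes a vector of overlapping k-mer counts.

import itertools

seq = "ATAT"  # toy sequence, purely for illustration
k = 2
kmers = ["".join(p) for p in itertools.product("ATGC", repeat=k)]
# Count overlapping occurrences by sliding a window over the sequence.
counts = {km: sum(seq[i:i + k] == km for i in range(len(seq) - k + 1)) for km in kmers}
print({km: c for km, c in counts.items() if c})  # {'AT': 2, 'TA': 1}; the other 14 2-mers are 0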

Classifiers

Given that this data is high dimensional and non-linear, it becomes apparent that whichever classifier is selected must be able to handle this kind of problem. One option is the Support Vector Classifier (SVC), Scikit-learn's implementation of the Support Vector Machine. There are other options, but given my current time constraints and the classifier I truly want to use (a Neural Network), I will only approach this problem right now with the SVC.

Pipeline Set-up

The code in this pipeline is designed to work with Scikit-learn. Pipelines simplify the exploration process by letting the researcher tune the parameters of the data transformer and the classifier together in a single automated search, rather than hand-tuning each stage separately. The example pipeline I have set up is below. Notice that the transformData class is of my own creation and implements both the Spectrum and Valued Occurrence kernels. It is also possible to use the transformed data in conjunction with a classifier other than the SVC. In this example I've used the linear, polynomial, and gaussian ("rbf") kernels for the SVC. One caveat worth knowing: GridSearchCV treats each dictionary in a param_grid list as an independent sub-grid, so transformer and classifier parameters must appear in the same dictionary to be searched jointly, which is how the grid below is organized.

pipeline = Pipeline([
    ('spectrum', transformData()),
    ('svm', SVC())
])
param_grid = [
    # Each dict below is one independent sub-grid: GridSearchCV takes the
    # Cartesian product within a dict, so the transformer and SVM settings
    # are listed together.  gamma only applies to the rbf and poly kernels.
    {
        "spectrum__transformation": ["kMer"],
        "spectrum__k": [2, 3, 4, 5],
        "spectrum__gap": [0, 1],
        "svm__kernel": ["linear"],
        "svm__C": [0.0001, 0.001, 0.01, 0.1, 1, 10],
    },
    {
        "spectrum__transformation": ["kMer"],
        "spectrum__k": [2, 3, 4, 5],
        "spectrum__gap": [0, 1],
        "svm__kernel": ["rbf"],
        "svm__C": [0.0001, 0.001, 0.01, 0.1, 1, 10],
        "svm__gamma": [2, 3],
    },
    {
        "spectrum__transformation": ["kMer"],
        "spectrum__k": [2, 3, 4, 5],
        "spectrum__gap": [0, 1],
        "svm__kernel": ["poly"],
        "svm__C": [0.0001, 0.001, 0.01, 0.1, 1, 10],
        "svm__gamma": [2, 3],
        "svm__degree": [2, 3, 4],
    },
    {
        "spectrum__transformation": ["AssignValue"],
        "spectrum__A": [1, 2],
        "spectrum__C": [1, 2],
        "spectrum__T": [1, 2],
        "spectrum__G": [1, 2],
        "svm__kernel": ["linear"],
        "svm__C": [0.0001, 0.001, 0.01, 0.1, 1, 10],
    },
    {
        "spectrum__transformation": ["AssignValue"],
        "spectrum__A": [1, 2],
        "spectrum__C": [1, 2],
        "spectrum__T": [1, 2],
        "spectrum__G": [1, 2],
        "svm__kernel": ["rbf"],
        "svm__C": [0.0001, 0.001, 0.01, 0.1, 1, 10],
        "svm__gamma": [2, 3],
    },
    {
        "spectrum__transformation": ["AssignValue"],
        "spectrum__A": [1, 2],
        "spectrum__C": [1, 2],
        "spectrum__T": [1, 2],
        "spectrum__G": [1, 2],
        "svm__kernel": ["poly"],
        "svm__C": [0.0001, 0.001, 0.01, 0.1, 1, 10],
        "svm__gamma": [2, 3],
        "svm__degree": [2, 3, 4],
    },
]
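Before running anything, it can be useful to check how large the search actually is. A quick sketch using Scikit-learn's ParameterGrid, which accepts the same list-of-dicts format as GridSearchCV:

from sklearn.model_selection import ParameterGrid

# Each dict expands to its own Cartesian product; the total number of
# candidates is the sum over all sub-grids (each is then fit once per fold).
print(len(ParameterGrid(param_grid)))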

In order for a class to work with the Scikit-learn pipeline, specific guidelines must be followed. The class must inherit from BaseEstimator and another class, in this case TransformerMixin, unless one wants to go through the process of ensuring compatibility on one's own. The __init__ definition must accept every parameter one wants to tune within the pipeline and store each under an attribute of the same name, since GridSearchCV changes them through set_params. The fit function generally just returns self in Scikit-learn when, as here, the transformer learns nothing from the data. The transform function is what outputs the transformed data; in my case it dispatches to the specific kernel functions I've created within the class, depending on the transformation attribute.

import itertools

import numpy as np
import regex  # third-party module; unlike re, it supports overlapped and fuzzy matching
from sklearn.base import BaseEstimator, TransformerMixin


class transformData(BaseEstimator, TransformerMixin):

    def __init__(self, transformation="kMer", k=2, gap=0, mismatch=0, A=1, C=1, T=1, G=1, N=0):
        # Every tunable parameter is stored under its own name so that
        # GridSearchCV can read and set it via get_params/set_params.
        self.transformation = transformation
        self.k = k
        self.gap = gap
        self.mismatch = mismatch
        self.A = A
        self.C = C
        self.T = T
        self.G = G
        self.N = N

    def fit(self, X, y=None):
        # The transformation is stateless, so there is nothing to learn.
        return self

    def kMer(self, X, k=2, gap=0, mismatch=0):
        def count_overlapping(sub, text):
            return len(regex.findall(sub, text, overlapped=True))

        def count_overlapping_mismatch(sub, text, mismatch):
            # {s<=n} permits up to n substitutions; the non-capturing group
            # makes the fuzzy constraint apply to the whole k-mer.
            pattern = "(?:" + sub + "){s<=" + str(mismatch) + "}"
            return len(regex.findall(pattern, text, overlapped=True))

        def count_overlapping_gapped(sub, text, max_gap):
            # {i<=n} permits up to n inserted characters (gaps).
            pattern = "(?:" + sub + "){i<=" + str(max_gap) + "}"
            return len(regex.findall(pattern, text, overlapped=True))

        def count_overlapping_error_sum(sub, text, mismatch, max_gap):
            # {e<=n} permits up to n total errors of any kind.
            error_sum = mismatch + max_gap
            pattern = "(?:" + sub + "){e<=" + str(error_sum) + "}"
            return len(regex.findall(pattern, text, overlapped=True))

        bases = ["A", "T", "G", "C"]
        comb = np.asarray(["".join(p) for p in itertools.product(bases, repeat=k)])
        Features = np.zeros(shape=(len(X), comb.shape[0]), dtype=int)
        # Count every possible k-mer in every sequence.
        for m in range(len(X)):
            s = str(X[m])
            for n in range(comb.shape[0]):
                if mismatch == 0 and gap == 0:
                    Features[m, n] = count_overlapping(comb[n], s)
                elif mismatch > 0 and gap == 0:
                    Features[m, n] = count_overlapping_mismatch(comb[n], s, mismatch)
                elif mismatch == 0 and gap > 0:
                    Features[m, n] = count_overlapping_gapped(comb[n], s, gap)
                else:
                    Features[m, n] = count_overlapping_error_sum(comb[n], s, mismatch, gap)

        if mismatch > 0:
            # Fuzzy matching can make several k-mers produce identical
            # columns; collapse the duplicates.
            Features = np.unique(Features, axis=1)

        return Features

    def AssignValue(self, X, A, C, T, G, N):
        # One weighted occurrence count per base, including the unknown base N.
        Features = np.zeros(shape=(len(X), 5), dtype=float)
        for m in range(len(X)):
            s = str(X[m])
            Features[m, 0] = s.count("A") * A
            Features[m, 1] = s.count("C") * C
            Features[m, 2] = s.count("T") * T
            Features[m, 3] = s.count("G") * G
            Features[m, 4] = s.count("N") * N

        return Features

    def transform(self, X, y=None):
        # Dispatch on the attributes set through __init__/set_params rather
        # than on per-call arguments, so GridSearchCV's settings take effect.
        if self.transformation == "kMer":
            return self.kMer(X, k=self.k, gap=self.gap, mismatch=self.mismatch)
        elif self.transformation == "AssignValue":
            return self.AssignValue(X, self.A, self.C, self.T, self.G, self.N)
        raise ValueError("unknown transformation: " + str(self.transformation))
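One note on the counting helpers above: the standard library re module supports neither overlapped matching nor the fuzzy {s<=n} syntax, which is why the class imports the third-party regex module instead. A small illustration (the toy strings are purely for demonstration):

import regex

# Overlapped matching counts every start position; "AAAA" contains three
# overlapping occurrences of "AA", where re.findall would report only two.
print(regex.findall("AA", "AAAA", overlapped=True))

# The fuzzy constraint {s<=1} also counts 2-mers within one substitution
# of "AT", e.g. "AC" and "GT" inside "ACGT".
print(regex.findall("(?:AT){s<=1}", "ACGT", overlapped=True))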

An example of using this class to transform data directly, outside of the pipeline, is as follows. Note that parameters are now set when the object is constructed rather than passed to transform:

transformer = transformData(k=3)  # k overrides the default value of 2
Features = transformer.fit_transform(X_train)  # X_train as produced by prepareData() below

With k = 3 the resulting feature matrix has 4^3 = 64 columns, one per possible 3-mer (fewer if mismatch > 0 collapses duplicate columns).

Continuing on with the pipeline, the pipeline and parameter grid can be passed to the Scikit-learn class GridSearchCV. The following code will output the parameters which performed the best on the training set; it is also possible to see how well all the different combinations performed. Note that GridSearchCV runs every combination specified in the parameter grid, multiplied by the number of cross-validation folds, and is computationally expensive, so in practice it often pays to narrow one's parameter ranges beforehand if possible.

model = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=2, n_jobs=-1)
X_train, y_train, X_test, y_test = prepareData()
model.fit(X_train, y_train)
print("Best parameters found by GridSearchCV: ")
print(model.best_params_)
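Beyond best_params_, the fitted GridSearchCV object exposes every combination's cross-validation scores through its cv_results_ attribute, and (since refit defaults to True) the best pipeline can be scored directly on the held-out test set. A minimal sketch, assuming pandas is installed:

import pandas as pd

# One row per parameter combination, including mean and per-fold CV scores.
results = pd.DataFrame(model.cv_results_)
print(results.sort_values("rank_test_score")[["params", "mean_test_score"]].head())

# The best estimator was refit on the full training set, so it can be
# evaluated on the untouched test split.
print(model.score(X_test, y_test))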

Conclusion

This post showed a simple approach to optimizing parameters for both the classifier and the data transformation within a single automated pipeline.