DNA sequence classification is a challenging problem for machine learning, and also a very relevant one for determining how DNA relates with various conditions such as diseases. This post goes through the process of setting up a pipeline to determine optimal hyperparameters in an automated fashion. This is the second part in a three part series.
Transformed Data
The data coming into the ML classifiers used has been transformed by either the Spectrum or Valued Occurrence Kernel. The issue with using these kernels is it becomes difficult to empirically determine which values that affect the kernels are optimal. It also is possible that a different transformed set can work better with different classifier hyperparameters. The solution to this is to test various transformed data sets with various hyperparameters via setting up a proper pipeline.
Classifiers
Given that this data is high dimensional and non-linear; it becomes apparent that whichever classifier selected must be able to handle this kind of problem. One option is the Support Vector Classifier which is an extension of the Support Vector Machine. There are other options, but given my current time constraints and the classifier I truly want to use (a Neural Network), I will only approach this problem right now with the SVC.
Pipeline Set-up
The code in this pipeline is designed to work with Scikit-learn. Pipelines are something which enable us to simplify the exploration process by allowing the researcher to skip the process of tuning the parameters for the data transformer and the classifier. The example pipeline which I have set up is below. Notice that the transformData class is of my own creation and involves the implementation of both the Spectrum and Valued Occurrences kernels. It is also possible to use the transformed data and the SVC in conjunction with another classifier. In this example I've used the linear, polynomial, and gaussian ("rbf") kernels.
pipeline = Pipeline([
('spectrum', transformData()),
('svm', SVC())
])
param_grid = [
{
"spectrum__transformation": ["kMer"],
"spectrum__k":[2,3,4,5],
"spectrum__gap":[0,1]
},
{
"spectrum__transformation": ["AssignValue"],
"spectrum__A":[1,2],
"spectrum__C":[1,2],
"spectrum__T":[1,2],
"spectrum__G":[1,2]
},
{
"svm__kernel": ["rbf"],
"svm__C": (.0001,.001,.01,.1,1,10),
"svm__gamma": (2,3),
},
{
"svm__kernel": ["linear"],
"svm__C": (.0001,.001,.01,.1,1,10),
"svm__gamma": (2,3),
},
{
"svm__kernel": ["poly"],
"svm__C": (.0001,.001,.01,.1,1,10),
"svm__gamma": (2,3),
"svm__degree": (2,3,4)
}
]
In order for a class to work with the Scikit-learn pipeline specific guidelines must be followed. The class must inherit from the BaseEstimator class and another class, in this case TransformerMixin, unless one wants to go through the process of making sure things are all compatible on ones own. The init definition is required for all parameters that one wants to change within the pipeline. The fit function generally returns self in Scikit-learn. The transform function is what outputs the transformed data and in my case it calls the specific kernel functions I've created within the class dependent upon the input.
class transformData(BaseEstimator, TransformerMixin):
def __init__(self, transformation = "kMer", k=2, gap=0, mismatch=0, A=1, C=1, T=1, G=1, N=0):
self.transformation = transformation
self.k = k
self.gap = gap
self.mismatch = mismatch
self.A = A
self.C = C
self.T = T
self.G = G
self.N = N
def fit(self, X, y=None):
return self
def kMer(self, X, y = None, k = 2, prop = 1, gap = 0, mismatch = 0):
def count_overlapping(sub, text):
return len(re.findall(sub, text, overlapping=True))
def count_overlapping_mismatch(sub, text, mismatch):
regex = str(sub) + "{s<=" + str(mismatch) + "}"
return len(re.findall(regex, text, overlapping=True))
def count_overlapping_gapped(sub, text, max_gap):
regex = str(sub) + "{i<=" + str(max_gap) + "}"
return len(re.findall(regex, text, overlapping=True))
def count_overlapping_error_sum(sub, text, mismatch, max_gap):
error_sum = mismatch + max_gap
regex = str(sub) + "{e<=" + str(error_sum) + "}"
return len(re.findall(regex, text, overlapping=True))
bases = ["A", "T", "G", "C"]
comb = ["".join(p) for p in itertools.product(bases, repeat=k)]
comb = np.asarray(comb)
Features = np.zeros(shape=(len(X), comb.shape[0]), dtype=int)
# saves the features
for m in range(0, X.shape[0]):
s = str(X[m])
for n in range(0, comb.shape[0]):
if mismatch == 0 and gap == 0:
Features[m, n] = count_overlapping(comb[n], s)
elif mismatch > 0 and gap == 0:
Features[m, n] = count_overlapping_mismatch(comb[n], s, mismatch)
elif mismatch == 0 and gap > 0:
Features[m, n] = count_overlapping_gapped(comb[n], s, mismatch)
elif mismatch > 0 and gap > 0:
Features[m, n] = count_overlapping_error_sum(
comb[n], s, mismatch, gap
)
if mismatch > 0:
Features = np.unique(Features, axis = 1)
return Features
def AssignValue(self, X, A, C, T, G, N):
Features = np.zeros(shape=(len(X), 5), dtype=float)
for m in range(0, len(Features)):
s = str(X[m])
Features[m, 0] = len(re.findall("A", s)) * A
Features[m, 2] = len(re.findall("T", s)) * T
Features[m, 3] = len(re.findall("G", s)) * G
Features[m, 1] = len(re.findall("C", s)) * C
Features[m, 4] = len(re.findall("N", s)) * N
return Features
def transform(self, X, transformation = "kMer", k = 2, gap = 0, mismatch = 0, A = 1, C = 1, T = 1, G = 1, N = 0):
if transformation == "kMer":
Features = self.kMer(X, k, gap, mismatch)
elif transformation == "value":
Features = self.AssignValue(X, A, C, T, G,N)
return Features
An example of calling this function to transform the data is as follows:
X_test = transformData()
X_test.transform(k=3) #k changes the default value specified
Continuing on with the pipeline, the pipeline and parameter grid can be called from the Scikit-learn function GridSearchCV. The following code will output the parameters which performed the best on the training set. It is also possible to see how well all the different combinations performed. Note that using GridSearchCV requires running every combination specified in the parameter grid and is computationally expensive, so in practice it often pays to narrow ones parameter range beforehand if possible.
model = GridSearchCV(estimator = pipeline, param_grid=param_grid, cv=2, n_jobs=-1)
X_train, y_train, X_test, y_test = prepareData()
model.fit(X_train,y_train)
print("Best parameters found by GridSearchCV: ")
print(model.best_params_)
Conclusion
This post simply showed a simple manner in parameter optimization for both the classifier and the data transformation.