Maize Genetics Cooperation Newsletter vol 81 2007
FCIA-UNR
FCA-UNC
A number of statistical methods based on molecular data are currently available for assigning new inbreds to heterotic groups in maize (Zea mays L.), with variable results (Reif et al., Crop Sci. 45:1-7, 2005; dos Santos Diaz et al., Genet. Mol. Res. 3:356-368, 2004). We conjecture that the main flaw of traditional statistical models is that they do not capture the non-linear relation between parental data and progeny performance (Tollenaar et al., Crop Sci. 44:2086-2094, 2004). In contrast, experimental results show that such non-linearity can be easily captured by supervised machine learning models, i.e., by multiclassifiers (Witten and Frank, Data Mining: Practical machine learning tools with Java implementations, Morgan Kaufmann, San Francisco, 2000).
The field data analyzed in this study were taken from Nestares et al. (Pesq. Agropec. Bras. 34:1399-1406, 1999). Briefly, our investigation involved 26 inbred lines (all lines but one, B73, were orange flint germplasm developed by INTA from twelve different sources: synthetics, composites, landraces, etc.) from a total of 48 evaluated for their combining ability with four testers: sB73 and sMo17 from the Reid x Lancaster pattern, and HP3 and P5L2 from the local orange flint pattern. The 48 lines were grouped according to their combining ability with the tester populations into 4 heterotic groups (H1-H4) using the SAS-Fastclus procedure (Nestares et al., 1999). The 26 lines were characterized using 21 SSR (simple sequence repeat) markers evenly distributed in the genome (Morales Yokobori et al., MNL 79:36, 2005).
A dataset comprising 42 attributes, corresponding to the 21 SSR (2 alleles of each locus per line), was generated. This dataset contains 26 instances (26 lines) and 4 classes defined by the four heterotic groups (H1 = 4 instances, H2 = 8 instances, H3 = 6 instances and H4 = 8 instances). Finally, we considered six standard multiclassifiers provided by the Java WEKA library (Witten and Frank, 2000): Naïve Bayes, Support Vector Machines with Radial Basis Function kernel, one-against-all (SVM-RBF), Decision Trees (J48 and Random Forest), AdaBoost Decision Stumps and Multilayer Perceptron. Classifier performance was evaluated by 3-, 5- and 10-fold Cross Validation (3-CV, 5-CV and 10-CV); all classifiers were run with WEKA's default values. Results are presented in Table 1.
Table 1. 3-, 5- and 10-fold CV error on the heterosis dataset using multiclass classifiers.
Multiclassifier | 3-CV error | 5-CV error | 10-CV error
Naive Bayes | 0.654 | 0.692 | 0.769
SVM-RBF | 0.654 | 0.769 | 0.769
Decision Tree (J48) | 0.808 | 0.769 | 0.769
Decision Tree (random forest) | 0.731 | 0.846 | 0.769
AdaBoost-Decision Stump | 0.731 | 0.610 | 0.770
Multilayer Perceptron | 0.770 | 0.770 | 0.692
Error Expected by Chance | 0.774 | 0.774 | 0.774
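For orientation when reading Table 1, the short sketch below computes three common chance-level baselines from the class sizes given in the text, and converts an error rate into an approximate count of misclassified lines. The source does not state how its chance error was derived, so these baseline definitions and the helper name `misclassified` are assumptions offered only for comparison.

```python
# Class sizes reported in the text.
CLASS_SIZES = {"H1": 4, "H2": 8, "H3": 6, "H4": 8}

n_instances = sum(CLASS_SIZES.values())
proportions = [size / n_instances for size in CLASS_SIZES.values()]

# Guess one of the four classes uniformly at random.
uniform_guess_error = 1 - 1 / len(CLASS_SIZES)

# Always predict one of the largest classes.
majority_error = 1 - max(proportions)

# Guess each class with probability equal to its observed frequency.
proportional_guess_error = 1 - sum(p * p for p in proportions)


def misclassified(error_rate, total=n_instances):
    """Approximate number of misclassified instances behind an error rate."""
    return round(error_rate * total)


print(f"uniform guess:      {uniform_guess_error:.3f}")
print(f"majority class:     {majority_error:.3f}")
print(f"proportional guess: {proportional_guess_error:.3f}")
print(f"an error of 0.654 is about {misclassified(0.654)} of {n_instances} lines")
```

With only 26 instances, one additional misclassified line changes the error by about 0.038, so small differences between classifiers in Table 1 amount to one or two lines.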
Although our classification results are preliminary, they suggest the usefulness of a molecular-based, machine learning approach for solving general heterotic group assignment problems. However, we must consider the effect of population structure (highly divergent parents), which affects linkage disequilibrium between DNA markers and genes involved in the expression of target traits (Charcosset and Essioux, Theor. Appl. Genet. 89:336-343, 1994). In addition, and based on previous work, we hypothesize that further application of feature selection methods, i.e., the selection of highly discriminant molecular markers, might improve heterotic group assignment. This hypothesis is supported by the observed similarity between classification problems involving microsatellite markers and those involving microarray data. In both cases, missing and noisy features might be present in scarce data samples. This type of classification noise can be limited by feature selection methods, so that the resulting data sets can be safely handled by binary-based, Coding Theory-inspired multiclassifiers (Ornella et al., VIII Argentine Symposium on Artificial Intelligence, Mendoza, Argentina, 2006).
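As one possible illustration of selecting highly discriminant markers, the sketch below ranks nominal attributes by information gain with respect to the class label. The note does not say which feature selection method is intended, so information gain is an assumption here, and the marker names, allele codes and labels are hypothetical toy values, not data from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on a nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four lines, two groups, two marker attributes.
toy_labels = ["H1", "H1", "H2", "H2"]
toy_markers = {
    "marker_a": ["x", "x", "y", "y"],  # separates the two groups perfectly
    "marker_b": ["x", "y", "x", "y"],  # carries no information about the group
}

ranking = sorted(
    toy_markers,
    key=lambda name: information_gain(toy_markers[name], toy_labels),
    reverse=True,
)
for name in ranking:
    print(name, round(information_gain(toy_markers[name], toy_labels), 3))
```

Information gain tends to favour attributes with many distinct values, which is relevant for multi-allelic SSR loci; a normalized criterion such as gain ratio is a common alternative.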
Please Note: Notes submitted to the Maize Genetics Cooperation
Newsletter may be cited only with consent of authors.