Maize Genetics Cooperation Newsletter vol 81 2007

 

Rosario, Argentina

FCIA-UNR

Cordoba, Argentina

FCA-UNC

 

A machine learning approach for heterotic performance prediction of maize (Zea mays L.) based on molecular marker data

--Ornella, L; Balzarini, M; Tapia, E

 

          A number of statistical methods based on molecular data are currently available for assigning new inbreds to heterotic groups in maize (Zea mays L.), with variable results (Reif et al., Crop Sci. 45:1-7, 2005; dos Santos Diaz et al., Genet. Mol. Res. 3:356-368, 2004).  We conjecture that the main flaw of traditional statistical models is that they do not capture the non-linear relation between parental data and progeny performance (Tollenaar et al., Crop Sci. 44:2086-2094, 2004).  In contrast, experimental results show that such non-linearity can be readily captured by supervised machine learning models, i.e., by multiclassifiers (Witten and Frank, Data Mining:  Practical machine learning tools with Java implementations, Morgan Kaufmann, San Francisco, 2000).

          The field data analyzed in this study were taken from Nestares et al. (Pesq. Agropec. Bras. 34:1399-1406, 1999).  Briefly, our investigation involved 26 inbred lines (all lines but one, B73, were orange flint germplasm developed by INTA from twelve different sources: synthetics, composites, landraces, etc.).  These 26 lines came from a total of 48 lines evaluated for their combining ability with four testers: sB73 and sMo17 from the Reid x Lancaster pattern, and HP3 and P5L2 from the local orange flint pattern.  The 48 lines were grouped according to their combining ability with the tester populations into 4 heterotic groups (H1-H4) using the SAS-Fastclus procedure (Nestares et al., 1999).  The 26 lines were characterized using 21 SSR (simple sequence repeat) markers evenly distributed across the genome (Morales Yokobori et al., MNL 79:36, 2005).

          A dataset comprising 42 attributes, corresponding to the 21 SSR (2 alleles of each locus per line), was generated.  This dataset contains 26 instances (26 lines) and 4 classes defined by the four heterotic groups (H1 = 4 instances, H2 = 8 instances, H3 = 6 instances and H4 = 8 instances).  Finally, we considered six standard multiclassifiers provided by the Java WEKA library (Witten and Frank, 2000): Naïve Bayes, Support Vector Machines with Radial Basis Function kernel, one against all (SVM-RBF), Decision Trees (J48 and random forest), AdaBoost Decision Stumps and Multilayer Perceptron.  Classifiers' performances were evaluated by 3-, 5- and 10-fold Cross Validation (3-CV, 5-CV and 10-CV); all classifiers were run with WEKA's default values.  Results are presented in Table 1.
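The k-fold cross-validation procedure described above can be sketched as follows.  This is a minimal illustration in Python rather than WEKA, under several assumptions that are not in the original note: the folds are stratified by class, the classifier is a 1-nearest-neighbour rule with Hamming distance (it is not one of the six classifiers evaluated here), and the 42 allele attributes are random placeholder values, not the real SSR data.  Only the class sizes (4, 8, 6 and 8) are taken from the text.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split instance indices into k folds, spreading each class across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal class members round-robin so every fold gets a similar class mix.
        for idx in members:
            folds[position % k].append(idx)
            position += 1
    return folds


def hamming(a, b):
    """Number of attributes (alleles) on which two lines differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_1nn(train_X, train_y, query):
    """Stand-in classifier (assumption): label of the closest training line."""
    best = min(range(len(train_X)), key=lambda i: hamming(train_X[i], query))
    return train_y[best]


def cv_error(X, y, k, seed=0):
    """Fraction of instances misclassified when each fold is held out once."""
    errors = 0
    for fold in stratified_folds(y, k, seed):
        held_out = set(fold)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        errors += sum(predict_1nn(train_X, train_y, X[i]) != y[i] for i in fold)
    return errors / len(y)


# Class sizes from the text; allele codes below are random placeholders.
labels = ["H1"] * 4 + ["H2"] * 8 + ["H3"] * 6 + ["H4"] * 8
placeholder_rng = random.Random(1)
X = [[placeholder_rng.randint(1, 5) for _ in range(42)] for _ in labels]

for k in (3, 5, 10):
    print(f"{k}-CV error: {cv_error(X, labels, k):.3f}")
```

Because the attribute values are random, the printed errors carry no biological meaning; the sketch only shows how each of the 26 lines is predicted exactly once by a model trained on the remaining folds.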

 

Table 1.  3-, 5- and 10-fold CV error on the heterosis dataset using multiclass classifiers.

 

Multiclassifier                 3-CV error   5-CV error   10-CV error
Naive Bayes                     0.654        0.692        0.769
SVM-RBF                         0.654        0.769        0.769
Decision Tree (J48)             0.808        0.769        0.769
Decision Tree (random forest)   0.731        0.846        0.769
AdaBoost-Decision Stump         0.731        0.610        0.770
Multilayer Perceptron           0.770        0.770        0.692
Error Expected by Chance        0.774        0.774        0.774
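The note does not state how the chance error was computed.  As a point of reference only, the sketch below derives three common chance baselines from the class sizes given in the text (H1 = 4, H2 = 8, H3 = 6, H4 = 8).  The choice of baselines is an assumption and is not presented as the authors' calculation.

```python
# Class sizes as reported in the text.
class_sizes = {"H1": 4, "H2": 8, "H3": 6, "H4": 8}
total = sum(class_sizes.values())  # 26 lines

# Guessing each of the four groups with equal probability.
uniform_error = 1 - 1 / len(class_sizes)

# Always predicting the largest group (8 of 26 lines).
majority_error = 1 - max(class_sizes.values()) / total

# Guessing groups at random in proportion to their frequencies.
proportional_error = 1 - sum((n / total) ** 2 for n in class_sizes.values())

print(f"uniform guess:      {uniform_error:.3f}")       # 0.750
print(f"majority class:     {majority_error:.3f}")      # 0.692
print(f"proportional guess: {proportional_error:.3f}")  # 0.734
```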

 

          Although our classification results are preliminary, they suggest the usefulness of a molecular-based machine learning approach for solving general heterotic group assignment problems.  However, we must consider the effect of population structure (highly divergent parents), which affects linkage disequilibrium between DNA markers and genes involved in the expression of target traits (Charcosset and Essioux, Theor. Appl. Genet. 89:336-343, 1994).  In addition, and based on previous work, we hypothesize that further application of feature selection methods, i.e., the selection of highly discriminant molecular markers, might improve heterotic group assignment.  This hypothesis is supported by the observed similarity between classification problems involving microsatellite markers and those involving microarray data.  In both cases, missing and noisy features might be present in scarce data samples.  This type of classification noise can be limited by feature selection methods, so that the resulting data sets can be safely handled by binary-based, Coding Theory-inspired multiclassifiers (Ornella et al., VIII Argentine Symposium on Artificial Intelligence, Mendoza, Argentina, 2006).
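One simple way to rank markers by how well they discriminate among heterotic groups is sketched below.  It assumes information gain as the selection criterion, which the note does not specify, and it uses a tiny hypothetical data set (six lines, three allele attributes) invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Reduction in class entropy after splitting on one categorical attribute."""
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(X, labels):
    """Return (gain, attribute index) pairs, most discriminant attribute first."""
    n_attr = len(X[0])
    scores = [(information_gain([row[j] for row in X], labels), j) for j in range(n_attr)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical allele codes: attribute 0 separates the groups perfectly,
# attribute 1 is constant, attribute 2 is only weakly informative.
X_demo = [
    ["a", "x", "p"],
    ["a", "x", "q"],
    ["b", "x", "p"],
    ["b", "x", "q"],
    ["c", "x", "p"],
    ["c", "x", "p"],
]
y_demo = ["H1", "H1", "H2", "H2", "H3", "H3"]

for gain, j in rank_attributes(X_demo, y_demo):
    print(f"attribute {j}: gain = {gain:.3f}")
```

With only 26 lines, any such ranking would need to be computed inside each cross-validation training fold; selecting markers on the full data set before cross-validation gives optimistically biased error estimates.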

 

Please Note: Notes submitted to the Maize Genetics Cooperation Newsletter may be cited only with consent of authors.