Statistical Methods for Assessing Individual Oocyte Viability Through Gene Expression Profiles

STATISTICAL METHODS FOR ASSESSING INDIVIDUAL OOCYTE VIABILITY THROUGH GENE EXPRESSION PROFILES


INTRODUCTION
Oocytes are the precursor cells to what we often think of as the female egg cell.
They operate and grow in essentially the same way within swine as they do in humans (Thomas, 2016). One main reason that one would be interested in studying oocytes is that they are the key to studying life, reproduction, and all other pertinent matters. Without access to oocytes, researchers in the fields of cloning, fertility treatments, and others would not be able to do their work. Currently, there are numerous methods of deriving (growing) oocytes, each with its benefits. In vivo (IVV) derived oocytes are held as the gold standard for viability, and other known origination methods are sub-par by comparison. However, alternatively derived oocytes do have many traits that are desirable. The in vitro maturation (IVM) method is one such method of oocyte derivation, as it allows the researchers to have access to the oocyte from early on. There are however, many problems associated with this method. From past studies, it is shown that the viability rate associated with IVV derived oocytes is upwards of 95%, however for IVM oocytes (and other similar methods) the viability rate is roughly only 25% (Dr. S. Clay Isom, personal communication). Similar low viability concerns arise with the SCNT (somatic cell nuclear transfer, or traditional cloning) origination method. With such a low viability rate it is only natural to try to alleviate the frustration and waste that comes from focusing time, resources and energy on oocytes that are simply not viable.
But how can one know if an oocyte is viable prior to investing in it? In general, one cannot. However, methods can be developed that test individual oocytes for viability, establishing a viability-optimized general gene expression profile. As laboratory originated oocytes can be treated with a wash that promotes growth and expression patterns in chosen directions (Dr. S. Clay Isom, personal communication), it is apparent that establishing methods to test individual oocyte viability and thereby derive a general viable profile for oocytes would, in theory, increase the viability of non-IVV oocytes. Kwon et al. (2015) took embryos that were obtained from three different origination methods (IVV, IVM, and SCNT) and examined the gene expression profiles for 15 different genes of interest. In their project, a Weighted Root Mean Squared Deviation (wRMSD) was calculated based on expression level deviation from the mean expression level of the origination group. The average wRMSD for each origination group was calculated and compared. They found that the difference between the IVV group and the SCNT group was greater than the difference between the IVV group and the IVM group. These conclusions were relevant for comparing methods of origination, however we desire to compare individual oocyte viability.
Within our project, the goal has been to establish origin-independent methods of evaluating individual oocyte viability, based simply on the observed gene expression profiles. As the process of collecting the observed gene expression profiles of oocytes is destructive, making the oocytes non-viable (and thereby making it impossible to tell if the oocyte would have been viable or not, simply by observing), there are a number of assumptions that one must make. First, assume that IVV-derived oocytes are the "gold standard" for oocyte viability. This assumption does not seem to be erroneous, as historically over 95% of all IVV-derived oocytes have been observed to be viable (Dr. S. Clay Isom, personal communication). The second assumption that is made is that the closer an oocyte's gene expression profile gets to the mean of the IVV gene expression profiles, the more likely the oocyte is to be viable. Third, assume that the IVV oocyte gene expression profiles are a representative sample of all IVV oocyte gene expression profiles, and likewise for the IVM oocytes within the data.
Dr. S. Clay Isom provided the project dataset in the form of an Excel spreadsheet, which contained the gene expression profiles for 29 IVV derived oocytes as well as 29 IVM derived oocytes. Each of the gene expression profiles was expressed in the form of a log2 fold-change on 67 genes of interest compared to a common "housekeeping" gene.
These specific 67 genes were selected, as they are affiliated with early cell growth, regulation and viability (Dr. S. Clay Isom, personal communication). Overall there were few values that were missing (about 4.4%), but computationally if the case arose that a value was missing, that value was omitted from gene level calculations in the following methods. Once again, actual viability of these individual sample oocytes is unknown, as they yielded the "ultimate sacrifice" for science, though from the assumptions, greater similarity to IVV is treated as more likely to be viable.
In this MS project, an adaption (on the single oocyte level) of the Kwon et al. (2015) wRMSD method is summarized in Section 2, along with three other novel methods -a Distance Kernel P-value method in Section 3, a Tolerance Interval method in Section 4, and a Decision Tree method in Section 5. Each of these methods is then compared for accuracy via simulation in Sections 6 and 7, and the methods' performance is discussed in Section 8.

WEIGHTED ROOT MEAN SQUARED DEVIATION KERNEL DENSITY P-VALUE
We use an application of the weighted root mean squared deviation (wRMSD) approach proposed by Kwon et al (2015). Here, we compute the wRMSD from the center of the oocyte viability class. In the more practical case that viability status is unknown, we would then compute the wRMSD from the center of the oocyte maturation class (center of IVV if using an IVV oocyte, or IVM if using an IVM oocyte). Again, we operate under the assumption that IVV is considered viable. Kwon et al. (2015) said "We considered each gene expression to be an independent event; therefore, we combined all of the expression measurements of each (gene) sample in the calculation of the wRMSD.
To minimize the bias from a measurement error of a gene expression profile with a low coefficient of variation (CV), the deviation of each gene expression level from the mean was weighted with the CV of the gene in the group." Within the wRMSD calculation there are three main parts: the reference expression level (mean expression level for the gene within group), the expression level of the specified gene within the oocyte of interest, and the weighting coefficient. The weighting coefficient was further made up of "the proportion of the CV for the expression level of the i th gene to the sum of CV for those of all genes in the group" (Kwon, et al., 2015): In Equation 1, wRMSD is defined for a given oocyte. Here, E i refers to the expression level of the i th gene in the oocyte, Em i refers to the reference expression level for the oocyte (i.e. mean of the IVV if the oocyte is IVV or IVM if the oocyte is in the IVM group) of the i th gene, and w i refers to the weight of the mean squared deviation of the gene expression, defined as follows: That is, the weight (w i ) in Equation 1 is defined as the proportion of the Coefficient of Variation (CV) for the expression level of the i th gene (across all oocytes in the group -IVV or IVM) to the sum of the CV for those of all genes in the group.
We set up the null hypothesis that an observed oocyte (with an accompanying wRMSD) came from the viable (or IVV) class, with the respective alternative hypothesis that it did not. P-values for likelihood of belonging to the "viable" class were then computed for the individual oocytes by comparing the oocyte's observed wRMSD to the kernel density of the viable (or IVV) wRMSD distribution, and computing an upper tail area (see Figure 1). These p-values were then compared to an alpha 0.05 level for determining if there was enough evidence to reject our null hypothesis (that the observed oocyte was viable). This Kernel Density p-value portion was not utilized in the Kwon et al. paper, however it is a reasonable and necessary application of their published method that allows for individual oocyte classification and comparison between results from this and other methods. Figure 1. The wRMSD distribution of IVV and IVM oocytes with calculated kernel density for the IVV wRMSD distribution overlaid on both histograms. An example IVM oocyte has been selected and its wRMSD has been found to be 1.19. The wRMSD of the oocyte is then compared to the kernel density of the wRMSD for the IVV group; we can then set up a hypothesis test with a null hypothesis that the observed oocyte comes from the IVV group. By computing an upper tail area from the kernel density, we can observe a p-value for our example oocyte equal to 0.052. This p-value is larger than our 0.05 cutoff, so we would classify the observed oocyte as viable (by our assumption that the more closely related to the IVV group that an oocyte is, the more likely it is to be viable).

DISTANCE KERNEL DENSITY P-VALUE METHOD
We considered also a distance measurement method as a modification to the wRMSD approach presented by Kwon et al (2015). It utilizes only a subset of the available genes; those that a limma eBayes approach has determined are differentially expressed (between IVV and IVM groups) at an alpha 0.05 level (Ritchie, 2015). Briefly, the limma eBayes approach performs a modified t-test on the expression level of each gene, testing for differential expression between two conditions (IVV and IVM here).
The gene expression profile of all IVV (or viable) oocytes is computed, and the mean expression is taken for each gene in order to form an IVV group mean gene expression profile. Using only the subset of differentially expressed genes, the distance measure for each oocyte from the mean of the IVV group gene expression profile is then computed.
Here, distance is measured as an adaption to the Kwon et al. wRMSD approach: (2) In Equation 2, we only use information from differentially expressed genes. W i is the weight for the specified gene i, and is defined as before, based on CVs. Em i represents the mean of the IVV (or viable) group for gene i. E i represents the expression level of the i th differentially expressed gene for the specified oocyte.
The distance kernel density is then calculated for the IVV (or viable) oocytes, based upon the calculated IVV distances. We then can use this kernel density distribution to set up a hypothesis test for each of the individual oocytes, using the following null and alternative hypotheses: H 0 -The oocyte distance comes from the IVV (viable) for the hypothesis test, we compute the upper tail probability above the distance observed in the oocyte of interest (see Figure 2). This upper tail area is our observed p-value. Figure 2. The distance distribution of IVV and IVM oocytes with calculated kernel density for the IVV distance distribution overlaid on both histograms. The same example IVM oocyte has been selected as in Figure 1, and its calculated distance has been found to be 9.78. The distance of the oocyte is then compared to the kernel density of the distances for the IVV group; we can then set up a hypothesis test with a null hypothesis that the observed oocyte comes from the IVV group. By computing an upper tail area from the kernel density, we can observe a p-value for our example oocyte equal to 0.019. This p-value is smaller than our 0.05 cutoff, so we would classify the observed oocyte as non-viable (by our assumption that the more closely related to the IVV group that an oocyte is, the more likely it is to be viable).

TOLERANCE INTERVAL METHOD
A tolerance interval is a numerical interval that is calculated to provide limits wherein at least a specified proportion of a sampled population falls with an indicated level of confidence. Oftentimes, tolerance intervals are constructed and applied in areas such as quality control or manufacturing to establish that certain product standards are being met by the overall bulk of the products. "More specifically, a 100×p%/100×(1−α) tolerance interval provides limits within which at least a certain proportion (p) of the population falls with a given level of confidence (1−α)" (Young, 2010). Tolerance intervals are based on the sampled data; however they allow us to say something about the population distribution. "A tolerance interval differs from a confidence interval in that the former encloses a proportion of the entire population distribution, while the latter is constructed to contain the value of a population parameter" (Millsap, 1988). Frequently, tolerance intervals are based on a specified distribution of the data, and more often than not, that distribution is assumed to be normal. However, we can also make the tolerance interval more general by taking a non-parametric approach (Wilks, 1941). In Wilks' paper, he proves that there is a systematic way of calculating a confidence interval for an unknown data distribution; a tolerance interval can be calculated for a given population coverage proportion, level of confidence and minimum sample size, that guarantees at least the given population coverage proportion. This approach is outlined as follows: • Let a be the average value which p is to have, where p is the proportion of the population to be included in the interval (the mean coverage).
• Draw a sample of size n from the population subject to the constraint that [(1-a)(n +1)]/2 = r, a positive integer.
• Order the sampled data according to increasing magnitude from x 1 to x n • Let L 1 = x n-r+1 , the upper tolerance limit of our 1-sided non-parametric tolerance interval As noted above, we need a minimum sample size value to ensure proper coverage of the population with the specified level of confidence. Within the framework of this project, we utilized the principles of this method to create a 95% non-parametric onesided tolerance interval for 95% coverage of the IVV distance distribution. The distribution of distances for the Isom data IVV oocytes is right skewed (see Figure 3), so we used an application of the non-parametric tolerance interval method. For a non-parametric approach, the calculation of the tolerance interval can be different than when we specify a distribution for the data. As we do not restrict our interval to a specific distribution, we require a larger sample size in order to maintain the same level of confidence and coverage for our tolerance interval than that of a specified distribution (i.e. Gaussian, Weibull, etc). The calculations for the minimum required sample size of a non-parametric tolerance interval are shown below (NIST, 2012): The above equation is an approximation for the minimum sample size n, needed for a non-parametric tolerance interval with confidence level γ, and population coverage proportion p. We also note that we are calling a specified value from a chi-squared distribution with 4 degrees of freedom.
For our project, the minimum required sample size for a 95% confidence and at least 95% coverage was n=94, so we ended up needing a sample size of 100. In the simulations of Section 6 below, we use n=100 oocytes, and we proceed with the construction of the tolerance interval here solely for demonstration purposes (though the actual sample size in the project dataset is n=58, so the actual coverage in this demonstration example is likely less than 95%). For each oocyte in the project, the IVV distance distribution was computed using the same distance function as previously discussed. That distance function again only looks at genes that are differentially expressed in the Isom dataset, as indicated by a limma eBayes approach, using a significance level of 0.05 as a cutoff for the FDR-adjusted p-values. Those oocytes that individually have a distance statistic outside of the constructed tolerance interval for the IVV oocytes (i.e. their distance was greater than that of the upper tolerance bound) were selected as being nonviable (see Figure 4).  Figure 4. The distance distribution of IVV. The upper limit of the 95% confidence, 95% coverage non-parametric tolerance interval for the IVV group was calculated to be 9.63. The same example IVM oocyte as in Figures 1 and 2 has been selected and its distance was calculated to be 9.78. The distance of the oocyte is then compared to the upper limit of the tolerance interval for the IVV group. As the observed distance of 9.78 is larger than the upper tolerance limit of 9.63, we would classify this oocyte as non-viable, as it was classified as being less closely related to the IVV group (by our assumption that the more closely related to the IVV group that an oocyte is, the more likely it is to be viable).

CLASSIFICATION TREE METHOD
Classification trees or decision trees are a graphical representation of a set of rules used to classify data into categories. They are appropriate to use when one has a 2 or more level categorical variable as an outcome, with one or more variables as predictors. In general, we would use a classification tree to predict the class or outcome level of a number of observations within a dataset, based on the observed values of the predictor variables. In R, the default index for choosing the best split in the data (for classification) is the Gini index. The Gini index is a measure of impurity of a node (or a whole tree).
The Gini impurity measure can be calculated in the following way (Kingsford, 2008): Within the Gini calculation, we are trying to classify items into m classes using a set of training items E. Let p i (i= 1,…,m) be the fraction of the items of E that belong to class i. Thus, the Gini index reaches an optimal value of zero when the set E contains items from only one class. Construction of a decision tree using this index can be broken down into a number of simple steps: 1. The tree algorithm uses a training data set to build the tree, in other words, one needs to know the true outcomes in order to build a classification tree that can be used on other datasets drawn from the same population.
2. The algorithm then uses the predictor variables to make the most optimal splits in the data. An optimal split is defined as using a variable to split the data from one group into two that provides the greatest differentiation between the different classes, separating them from one another as well as possible.
3. The algorithm uses the most important variables to make splits in the data. If the observations in a specified branch of the tree contain a diverse group of classes, then the algorithm finds "the best" rule based on a single variable/feature to split that branch into two smaller branches. The quality of the split is again measured based upon a reduction in the Gini impurity measure.
4. Only if every observation in the "branch" of the tree is from the same class, does the tree form a terminal node or leaf. This process of actually using a tree to classify objects was aptly summarized in the following way: "In order to classify an object, we start at the root of the tree, evaluate the test, and take the branch appropriate to the outcome. The process continues until a leaf is encountered, at which time the object is asserted to belong to the class named by the leaf" (Quinlan, 1986).
Within the framework of this project, the outcome or class variable is oocyte viability. A key advantage to a decision tree is that it provides a visual rule for distinguishing between viable and non-viable classifications, based again on the most differentially expressed genes (as was the case with our previous methods using the limma eBayes method). Again, the genes are the variables on which the tree is split, and while all genes have a chance of being chosen initially, the tree will chose the genes to split on that give the best split as calculated by the Gini Index (see Figure 5). 6. SIMULATIONS In order to adequately compare the four previously summarized methods from Sections 2 through 5, simulations were conducted, in which datasets were generated and results were gathered for the four methods. As the goal was to determine how the methods compared to one another, the simulation was constructed as a function in R (see Appendix II) with the following five variables that could be altered to yield different situations for the simulated data: • n, the number of oocytes to be simulated • Pi, the mixing proportion for the mixture distribution of the IVM oocytes (i.e. the degree of similarity between the viable IVM group and the IVV group) • δ, the magnitude of differential expression for genes that are differentially expressed between IVM and IVV groups • WhichGenes, a given list of genes to be differentially expressed in the simulation • ProbViableIVM, the prior probability that any given IVM oocyte is viable At the start of a simulation, the function starts with the first of the n oocytes (for our project, n=100), and decides whether it will be generated as an IVV or IVM oocyte.
This process is done randomly using a binomial generator with probability of IVV equal to 0.5. After maturation type is decided, we determine if the oocyte will be simulated as a viable oocyte or not. For the sake of the simulation, all IVV oocytes were given the viable class, while the probability that an IVM oocyte is deemed viable depends on the simulation variable "ProbViableIVM" (binomial distribution with probability of viable being equal to "ProbViableIVM"). After each of the oocytes has been assigned a maturation and viability type, the individual gene expression profiles for each oocyte are generated.
Within the process of gene expression profile generation, the simulation function takes into account specific genes that the user wants to make sure are differentially expressed between viable and non-viable classes. The list of these gene names is to be supplied to the function by the user in the simulation argument "WhichGenes" (for our project, these were derived from the Isom data). For the genes identified as being differentially expressed, we then determine the degree of differential expression. Each named gene has its individual degree of differential expression, δ i , which is generated randomly from a uniform distribution, with minimum equal to zero, and maximum equal to the absolute value of the user supplied variable δ (in our case, this is calculated as the median of the log fold-change for the differentially expressed genes from the Isom data).
If the gene of interest is not included in the list of genes to be differentially expressed, δ i is simply set to zero. Next, for each gene within the oocyte, we generate an expression value in the following manner: Degree of Dif f erential Expression f or Gene i = i ⇠ U (0, ), The expression value for gene i of the oocyte is simply a random value from a standard normal distribution if the oocyte is an IVV oocyte. If the oocyte is a non-viable IVM oocyte, then the expression value for gene i is a random value from a normal distribution with mean equal to δ i and standard deviation equal to one. If the oocyte is a viable IVM oocyte, then the expression level for gene i is a mixture of the two aforementioned distributions, with the mixing proportion, Pi, determining the degree of the mixture (i.e. the probability it will be more similar to the IVV group, see Figure 6).  In addition to gene expression data generation, the functions created for the simulations also compile results from our primary measures of interest for model evaluation, as averaged over 500 simulations: proportion correctly classified (PCC), sensitivity and specificity for each of the models. In this context, we define proportion correctly classified as the count of those oocytes that were classified as either being viable or non-viable when it was truly their respective viability status, divided by the total number of oocytes that were classified within the simulation (in this case 100). This was repeated 500 times and then averaged across each simulation level combination. A similar process was done for sensitivity and specificity. Sensitivity was defined as the count of correctly classified viable oocytes divided by the total number of viable oocytes, and specificity was defined as the count of correctly classified non-viable oocytes divided by the total number of non-viable oocytes.

On examining Figures 7 and 8 (as well as the full results summarized in
Appendix I), we notice that in general, the tolerance method approach tends to outclass the other methods when looking at the proportion correctly classified (PCC). A similar result is seen in Figures 9 and 10 when sensitivity is used as our measured outcome. In both outcomes, the tolerance interval approach scores roughly two to three percentage points higher on average (after averaging across 500 simulations at each mixing proportion by probability viable IVM combination) than the relatable wRMSD and distance kernel methods. In the aforementioned areas, the disparity between tolerance intervals and that of a decision tree are even more pronounced, as the decision tree looks to be simply lack-luster.
Also, in reviewing Figure 7 and Figure 8, we note that it appears that the variable Pi (mixing proportion) seems to have very little effect on the method outcomes. We do see some slight changes in the slopes of the lines within the classification tree method in Figure 8 and Figure 9; however, nothing of major consequence that could accurately be attributed to the variable Pi.    Something to note is that, while it appears that the classification tree is greatly outclassed by these other three methods (specifically tolerance intervals), it does have situations where it outperforms all others, and in some ways, these situations may actually be quite valuable. In looking at the specificity of the methods (i.e. the ability of the methods to correctly classify non-viable oocytes) in Figure 11, classification trees actually outperform the other methods. This is an important feature, as part of our main goal was to establish a method that could identify both viable and non-viable oocytes. In Figure 11, it appears that as the mixing proportion (Pi) increases in the decision tree method, the harder time it has correctly classifying the non-viable samples. This would make sense, as we bring the two distributions closer together (see Figure 6), such that the IVV oocytes are appearing more similar to the IVM oocytes, and so the harder it would be to accurately separate them. In Figure 12, we observe yet another interesting phenomenon. As the probability increases that any given IVM oocyte will be viable, it makes it increasingly difficult to correctly classify the non-viable samples, as it is simply more likely that any given oocyte will be viable than non-viable at that point. As the probability of any given IVM oocyte being viable increases, it creates more of dispersion between the group sample sizes, and one would expect a drop in the care that is given to oocytes of the lesser class.
However, as we see in Figure 12, it appears as though the distance kernel and the tolerance interval methods actually get slightly better at classifying the non-viable IVM oocytes for the specificity measure, as the probability of viable IVM increases to its highest value.

DISCUSSION
Currently in the fields of cloning and embryology, researchers constantly run into complications and unplanned procedural failures due to the effects of low viability amongst oocytes derived in a laboratory setting. Non-viability of samples is a costly and unwanted outcome that is worth evaluating and attempting to mitigate. Past focus has been in evaluating origination methods and determining their 'closeness' to one another; specifically, the method's proximity to IVV derived oocytes, the gold standard. While this may be a decent approach at evaluating origination methods, we claim that we can use alternate approaches to evaluate, and thereby later improve the quality of individual oocytes within a given method. With recent advances in gene expression technology, nutrient washes can be created in order to encourage the oocytes to express specific genes within a given origination method, thereby improving the viability and quality of the samples, while still retaining the flexibility of using the origination method of choice.
While the goal was to identify methods that were best to use overall, we notice that we actually have a few different situations represented in the data, and as a result we get different "top methods" depending on the situation at hand. As noted above in Figures   7 and 9, when looking at PCC and sensitivity, tolerance intervals seemed to always be the method that performed the best, however the kernel density p-value approach wasn't far behind, with our modification of the published Kwon wRMSD method following shortly behind that. However, one reason that these methods could be performing so well in these result measurements (PCC and sensitivity), is that there is an imbalance of sample sizes.
The viable oocyte group is consistently larger than the non-viable group, simply due to the nature of the probabilities, as all IVV oocytes are simulated as being viable, while an increasing number of IVM oocytes are counted as viable as the variable ProbIVMViable starts to increase. So, as a reiteration, as the variable ProbIVMViable increases, the disparity between the viable and non-viable group sizes starts to increase (note, when ProbIVMViable is set to 0.90, nearly all oocytes are in the viable group). It appears as though the classification tree method is more balanced and more immune to the effects of this imbalance in the data, as it appears to perform roughly equally across all groups, though it does still get swayed some as ProbIVMViable reaches higher values (see At this point, we acknowledge the possible limitations of these simulations. First, we note that we have made a number of assumptions that we cannot fully prove with the resources at hand. First, we assume that oocytes that have gene expression profiles that are more like those of IVV oocytes are going to be more viable. Second, we assume the center of the observed IVV distribution to be the optimal gene expression profile. Third, we assume that the data provided to us from Dr. Isom is a representative sample of all IVV and IVM oocytes (viable and non-viable). Fourth, assumptions were made about the relative rates of viability in the different origination method groups based upon conversations with Dr. Isom, as he was our expert in the field. While these are a fair amount of assumptions, we did take precautions as to try to minimize their effects. The methods used were built or conceived in a way that they operate independent of oocyte origination method, they require no distributional assumptions of the oocyte data, and as demonstrated, do not require a complete balance of group sizes (though it may have yielded slightly different results had we forced this on the generated data).