There is no general consensus on the best therapeutic approach, as strong evidence is lacking given the rarity of the disease, although multimodality treatment with chemotherapy, surgery and radiotherapy appears to represent optimal management. The median survival (MS) of patients diagnosed with DSRCT was 16 months in this study, slightly lower than that reported previously [6]. Comparatively, it is clear that the management in our centres took a more conservative approach than others, as evidenced by the less frequent use of radiotherapy, surgery and myeloablative chemotherapy with stem cell transplantation. In a review by Hassan et al. of 12 patients with intra-abdominal DSRCT (all of whom had received multi-agent chemotherapy), those who underwent surgical resection had a longer MS of 34 months compared to 14 months for those who had biopsy alone [9]. In our study, the MS observed for patients who had resection of their abdominal or pelvic tumours was 47 months, compared to 16 months for those who did not. Moreover, for patients with metastatic intra-abdominal DSRCT, palliative radiotherapy for locoregional disease control appeared to confer a survival advantage (MS of 47 vs 14 months in those who did not have radiotherapy). Although patients with localised abdominal or pelvic disease who underwent surgery appear to have a similar MS (i.e. 47 months) to those with metastatic disease who received palliative radiotherapy, the two groups are by no means comparable, and surgery is still indicated in resectable DSRCT. In our series, the only patient with abdominal disease who has been cured (disease-free 10 years from diagnosis) received chemotherapy and surgical resection. Hence, a more aggressive multimodality treatment approach would seem to be indicated to prolong survival, although larger prospective trials with quality-of-life measures would be necessary to confirm this; such trials are difficult to perform in so rare a disease.
Descriptive statistics were used to summarise the data: mean with standard deviation for quantitative variables, and frequency with percentage for qualitative variables. The chi-square test and the independent-samples t test were performed to assess differences in age, weight, height, BMI and total body fat percentage between breast cancer patients and controls. Binary logistic regression was performed to estimate odds ratios (ORs) and to examine the predictive effect of each factor on breast cancer risk. All statistical assessments were two-sided and considered significant at p-value < 0.05.
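The tests described above can be sketched in a few lines with SciPy. This is a minimal illustration on synthetic data; the ages, contingency table, and regression coefficient below are invented for the example, not taken from the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
cases = rng.normal(52, 8, 100)      # illustrative ages of patients
controls = rng.normal(48, 8, 100)   # illustrative ages of controls

# Independent-samples t test for a quantitative variable (e.g. age)
t_stat, p_val = stats.ttest_ind(cases, controls)

# Chi-square test for a qualitative variable, as a 2x2 contingency table
table = np.array([[40, 60], [25, 75]])  # invented counts
chi2, p_chi, dof, _ = stats.chi2_contingency(table)

# An OR from binary logistic regression is exp(coefficient)
beta = 0.47                          # illustrative coefficient
odds_ratio = float(np.exp(beta))
print(round(odds_ratio, 2))          # exp(0.47) rounds to 1.6
```

In a full analysis the coefficient would come from fitting the regression (e.g. with `statsmodels`) rather than being supplied by hand; the point here is only the OR = exp(beta) relationship.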
Genotyping arrays preceded WGS and were the standard assay for variant calling and genome-wide association studies (GWAS). Batch effects are well studied in the context of genotyping arrays [5,6,7] and often can be addressed using widely adopted quality control (QC) measures [8]. Standard QC of SNP array data involves excluding samples with high missingness, testing for differences in allelic frequencies between known batches, removing related individuals, and correcting for population structure, and possibly batch effects, via principal components analysis (PCA) [8, 9]. QC strategies proposed for whole-exome sequencing (WES) include empirically derived variant filtering [10] and methods for removing batch effects in copy number variation calling [11, 12]. These algorithms rely on read depth and either singular value decomposition (SVD), PCA, or a reference panel to normalize read depth and remove batch effects [11,12,13].
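Two of the standard array-QC steps listed above (sample-missingness filtering and PCA on the genotype matrix) can be sketched as follows. This is a toy example, assuming a samples-by-SNPs matrix of 0/1/2 allele counts with NaN for missing calls; the 5 % missingness threshold and all data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(50, 200)).astype(float)  # 50 samples x 200 SNPs
geno[rng.random(geno.shape) < 0.02] = np.nan             # sprinkle missing calls

# 1. Exclude samples with high missingness (illustrative 5 % cutoff)
sample_miss = np.isnan(geno).mean(axis=1)
geno = geno[sample_miss <= 0.05]

# 2. PCA on mean-imputed, centred genotypes to reveal population
#    structure (and possibly batch effects)
filled = np.where(np.isnan(geno), np.nanmean(geno, axis=0), geno)
centred = filled - filled.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
pcs = u[:, :2] * s[:2]           # top two principal components per sample
print(pcs.shape)
```

Real pipelines (e.g. PLINK-based workflows) add per-site allele-frequency tests between batches and relatedness pruning on top of this.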
Given that no standardized algorithms or heuristics currently exist to identify or address batch effects in WGS, they have generally been handled by adopting stringent QC measures. The Type 2 Diabetes Consortium [19] applied a series of filters to a dataset that included WGS and WES data: setting sites with GATK genotype quality less than 20 to missing, and eliminating any site with greater than 10% missingness within each ethnicity, deviation from Hardy-Weinberg equilibrium (HWE), or differential call rate between cases and controls. This filtering eliminated 9.9% of SNPs and 90.8% of indels. Similarly, the UK10K consortium [20] removed any site found significant in an association study with sequencing center as the phenotype; this, alongside additional QC measures, resulted in the removal of 76.9% of variants [21]. Removing repetitive regions of the genome (~53% of the genome) [22] or restricting to established high-confidence regions such as Genome in a Bottle (removing ~26% of the genome) [23] are similarly stringent approaches.
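The first two consortium-style filters can be illustrated on toy arrays. The GQ < 20 and 10 % thresholds come from the text; the data, their distribution, and the samples-by-sites layout are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_sites = 100, 500
gq = rng.normal(45, 15, size=(n_samples, n_sites))       # simulated GATK GQ
geno = rng.integers(0, 3, size=(n_samples, n_sites)).astype(float)

# 1. Set genotypes with genotype quality below 20 to missing
geno[gq < 20] = np.nan

# 2. Eliminate any site with greater than 10 % missingness
site_miss = np.isnan(geno).mean(axis=0)
kept = geno[:, site_miss <= 0.10]
print(kept.shape[1], "of", n_sites, "sites pass")
```

The remaining consortium filters (HWE deviation, differential call rate) are per-site hypothesis tests applied to the surviving columns in the same fashion.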
Large-scale WGS efforts are thriving; however, few guidelines exist for determining whether a dataset has batch effects and, if so, what methods will reduce their impact. We address both deficiencies and introduce new software (R package genotypeeval; see Methods for additional details and a web link) that can help identify batch effects. We demonstrate how to identify a detectable batch effect in WGS data via summary metrics computed from genotype calls, their quality values, read depths, and genomic annotations, followed by a PCA of these metrics. We describe our strategy to eliminate unconfirmed genome-wide significant associations (UGAs), which are likely enriched for spurious associations induced by batch effects. Our aim was to develop filters that remove sites impacted by a detectable batch effect with high specificity, so as not to eliminate a large number of variants genome-wide. The filters we developed do not remove all UGAs impacted by batch effects and come at the cost of a 12.5% reduction in power; however, when applied in conjunction with standard quality control measures (see Methods), they can substantially mitigate the impact of batch effects.
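The metrics-then-PCA detection idea can be sketched as follows: compute per-sample summary metrics, standardize them, and look for batch separation along the leading principal component. The metric names echo those used later in the text, but all values here are simulated with a deliberate batch shift.

```python
import numpy as np

rng = np.random.default_rng(3)
# columns: Ti/Tv, het-call rate, fraction confirmed in 1000 Genomes
batch1 = rng.normal([2.10, 0.60, 0.96], 0.01, size=(30, 3))
batch2 = rng.normal([2.05, 0.63, 0.92], 0.01, size=(30, 3))  # shifted batch
metrics = np.vstack([batch1, batch2])

# Standardize each metric, then take PCs via SVD
z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)
u, s, vt = np.linalg.svd(z - z.mean(axis=0), full_matrices=False)
pc1 = u[:, 0] * s[0]

# A clean separation of the two batches along PC1 flags a batch effect
print(f"PC1 batch separation: {abs(pc1[:30].mean() - pc1[30:].mean()):.1f}")
```

In the paper's setting the metrics come from genotype calls, quality values, read depths, and annotations rather than a simulation, but the decomposition step is the same.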
We next explored in detail the six quality metrics used in our PCA decomposition (Table 1, Additional file 1: Figure S2, Additional file 2: Table S2). While read depth and GATK genotype quality (GQ) were comparable between the two groups (Table 1, Additional file 2: Table S2), metrics based on transition-transversion ratio (Ti/Tv), heterozygous calls, and percent of variants confirmed in 1000 genomes (%1000 g) showed highly statistically significant differences (Table 1, Additional file 2: Table S2).
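Two of these metrics can be computed directly from a sample's variant calls. A minimal sketch, assuming calls are (ref, alt, genotype) triples with genotype coded 1 for heterozygous and 2 for homozygous alternate; the call list is invented:

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions
transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv(calls):
    """Transition/transversion ratio over SNV calls."""
    ti = sum(1 for ref, alt, _ in calls if (ref, alt) in transitions)
    tv = sum(1 for ref, alt, _ in calls if (ref, alt) not in transitions)
    return ti / tv if tv else float("inf")

def het_rate(calls):
    """Fraction of variant calls that are heterozygous (genotype == 1)."""
    return sum(1 for *_, gt in calls if gt == 1) / len(calls)

calls = [("A", "G", 1), ("C", "T", 2), ("A", "C", 1), ("G", "A", 1), ("T", "G", 2)]
print(ti_tv(calls), het_rate(calls))  # -> 1.5 0.6
```

The %1000 g metric is analogous: the fraction of a sample's variants whose positions and alleles match the 1000 Genomes call set.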
To test the hypothesis that only particularly difficult-to-sequence regions of the genome were subject to batch effects, we computed our metrics after removing repeat-masked regions [22] (53.02% of the genome), segmental duplications [37] (13.65%), self-chain regions [37] (6.02%), centromeres (2.01%), the ENCODE blacklist [38] (0.39%), or low-complexity regions (0.21%). PCA plots of our quality metrics re-computed after filtering out these difficult-to-assay regions still clearly revealed detectable batch effects (Additional file 1: Figure S3). We again examined the metrics underlying the PCA plot by performing a Wilcoxon rank-sum test comparing group 1 and group 2 post-filtering (Additional file 1: Figure S4, Additional file 2: Table S2). Removing all repeat-masked regions narrowed the difference in %1000 g between groups from 4% to 1.8%; however, the difference remained statistically significant.
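The group comparison above is a standard Wilcoxon rank-sum test, available in SciPy. The %1000 g values below are simulated with an artificial batch shift; only the test itself reflects the paper's procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
group1 = rng.normal(0.96, 0.005, 40)   # simulated %1000 g, group 1
group2 = rng.normal(0.94, 0.005, 40)   # group 2, shifted by a batch effect

# Nonparametric two-sample comparison of the metric between groups
stat, p = stats.ranksums(group1, group2)
print(p < 0.05)   # a small p-value flags a residual batch difference
```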
Batch effects in WGS data are not well understood and, perhaps because of this, we were not able to find an existing method, or develop a novel one, that removed all sites impacted by batch effects without reducing the power to detect true associations. While we focused on creating targeted filters that remove a small percentage of the genome, in practice these need to be used in conjunction with standard quality control measures (for example, removing sites out of Hardy-Weinberg equilibrium), which can result in very stringent filtering. In the case of a severe batch effect, such as the chemistry change present in the RA Batch GWAS, more stringent filtering was necessary even after applying standard quality control and our proposed filters, as almost 40,000 UGAs remained after filtering. Fully addressing batch effects will require disentangling the impact of changes in sequencing chemistry and bioinformatics processing on association analysis.
Frey [148] develops a general class of distribution-free statistical intervals based on ranked set samples. Vock and Balakrishnan [149] study nonparametric RSS prediction intervals. Hartlaub and Wolfe [150] propose an RSS test procedure designed to detect umbrella alternatives in the k-sample setting, and Magel and Qin [151] study a competitor to the Hartlaub and Wolfe procedure. Özturk et al. [152] use simultaneous one-sample sign confidence intervals for population medians to develop a k-sample RSS test procedure designed to detect simple-tree alternatives. Özturk and Balakrishnan [153] propose an exact RSS control-versus-treatment test procedure. Chen et al. [154] extend the application of RSS methodology to ordered categorical variables with the goal of estimating the probabilities of all of the categories; they use ordinal logistic regression to aid in the ranking of the ordinal variable of interest and propose an optimal allocation scheme. Özturk [155] explores the adaptation of rank regression methodology to RSS data, and Liu et al. [156] study the use of the empirical likelihood in the context of ranked set sampling. Gaur et al. [157] consider an RSS approach to the multiple-sample scale problem.
Muttlak and McDonald [158, 159] utilize the RSS scheme in conjunction with size-biased probability of selection, and Muttlak and McDonald [160] propose using a two-stage sampling plan with line-intercept sampling in the first stage and RSS in the second stage. Nematollahi et al. [161] employ ranked set sampling in the second stage of a two-stage cluster sampling design. Al-Saleh and Samawi [162] and Frey [163] present results about inclusion probabilities for population elements under RSS designs, and Gökpinar and Özdemir [164] use these inclusion probabilities to construct a Horvitz-Thompson RSS estimator for the population mean in a finite population setting. Samawi [165], Al-Saleh and Samawi [166], and Al-Nasser and Al-Talib [167] incorporate the RSS approach to obtain more efficient Monte Carlo methods. Barabesi and Pisani [168] consider the use of RSS in replications of designs such as plot sampling or line-intercept sampling, and Barabesi and Pisani [169] continue their work with a study of steady-state RSS for replicated environmental sampling plans. Barabesi and Marcheselli [170] investigate the use of auxiliary variables in design-based ranked set sampling, and Chen and Shen [171] approach RSS as a two-layer process with multiple concomitant variables. Muttlak and Al-Sabah [172], Al-Nasser and Al-Rawwash [173], and Al-Omari and Al-Nasser [174] incorporate RSS in statistical quality control. Mode et al. [175] study the general use of incorporating prior knowledge in environmental sampling, including RSS. Ridout and Cobby [176] look at RSS under the condition of non-random selection of sets. Samawi and Muttlak [177] use RSS to estimate a ratio. Patil et al. [178], Norris et al. [179], and Ridout [180] all explore the use of RSS when interest lies in making inferences about multiple characteristics. Ahmed et al. [181] and Muttlak et al. [182] explore the role of RSS in Stein-type estimation and shrinkage estimation, respectively. Modarres et al. [183] investigate the use of resampling techniques with RSS data.
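For readers unfamiliar with the scheme these papers build on, balanced ranked set sampling can be sketched as follows: in each cycle, draw m sets of m units, rank each set (here by the measured value itself, as a stand-in for cheap judgement ranking), and measure only the i-th ranked unit of the i-th set. The population, m, and cycle count below are illustrative.

```python
import numpy as np

def rss_sample(population, m, cycles, rng):
    """One balanced ranked set sample of size m * cycles."""
    out = []
    for _ in range(cycles):
        for i in range(m):
            # draw a set of m units and keep its (i+1)-th order statistic
            s = rng.choice(population, size=m, replace=False)
            out.append(np.sort(s)[i])
    return np.array(out)

rng = np.random.default_rng(5)
pop = rng.normal(10.0, 2.0, 10_000)
est = rss_sample(pop, m=3, cycles=30, rng=rng).mean()  # RSS mean estimator
print(f"RSS estimate {est:.2f} vs population mean {pop.mean():.2f}")
```

The RSS sample mean is unbiased for the population mean and, under perfect ranking, has smaller variance than a simple random sample of the same size, which is what motivates its use in the quality-control and environmental applications surveyed above.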