Project description: In contrast to its common definition and calculation procedure, interpretations of the p-value differ among statisticians. Since the p-value is the basis of various methodologies, this divergence has led to distinct test methodologies as well as differing opinions on evaluating test results, producing a chaotic situation. Here the origin of the divergence is traced to differences in Pr(H0 = true), which supplies a prior probability in the definition of the p-value. The effects of differences in this prior probability on the character of p-values are investigated by comparing microarray data and random numbers as subjects. The summarized expression levels of the genes are presented in the matrix files (linked below as supplementary files). A Student's t-test was applied between the two groups (0 h and 14 d); the resulting p-values are presented in the matrix files.
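A minimal sketch of the per-gene test described above, assuming a hypothetical genes × samples expression matrix with illustrative group labels (none of the names below come from the actual matrix files):

```r
set.seed(1)
expr <- matrix(rnorm(200 * 6), nrow = 200,
               dimnames = list(paste0("gene", 1:200),
                               c("h0_1", "h0_2", "h0_3",
                                 "d14_1", "d14_2", "d14_3")))
group <- factor(c("0h", "0h", "0h", "14d", "14d", "14d"))

# Student's t-test (equal variances) per gene, as named in the description
pvals <- apply(expr, 1, function(x)
  t.test(x[group == "0h"], x[group == "14d"], var.equal = TRUE)$p.value)

head(pvals)                # one p-value per gene
hist(pvals, breaks = 20)   # roughly uniform for pure random numbers
```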
Project description: Background: Up to now, microarray data have mostly been assessed in the context of only one or a few parameters characterizing the experimental conditions under study. More explicit experiment annotations, however, are highly useful for interpreting microarray data when available in a statistically accessible format. Results: We provide means to preprocess these additional data and to extract relevant traits corresponding to the transcription patterns under study. We found correspondence analysis particularly well suited for mapping such extracted traits. It visualizes associations both among and between the traits, the hereby annotated experiments, and the genes, revealing how they are all interrelated. Here, we apply our methods to the systematic interpretation of radioactive (single-channel) and two-channel data, stemming from model organisms such as yeast and Drosophila up to complex human cancer samples. Inclusion of technical parameters allows for identification of artifacts and flaws in experimental design. Conclusion: Biological and clinical traits can act as landmarks in transcription space, systematically mapping the variance of large datasets from the predominant changes down toward intricate details.
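For illustration, correspondence analysis of a small trait-by-experiment table can be run with MASS::corresp; the table and all names here are invented, and this is a sketch of the mapping idea rather than the authors' pipeline:

```r
library(MASS)

set.seed(2)
# rows = extracted traits, columns = annotated experiments
tab <- matrix(rpois(5 * 8, lambda = 4), nrow = 5,
              dimnames = list(paste0("trait", 1:5), paste0("exp", 1:8)))

ca <- corresp(tab, nf = 2)   # two-dimensional correspondence analysis
biplot(ca)                   # traits and experiments in one shared map
```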
Project description: Inverse probability weighting can be used to correct for missing data. New estimators for the weights in the nonmonotone setting were introduced in 2018: the unconstrained maximum likelihood estimator (UMLE) and the constrained Bayesian estimator (CBE), an alternative if UMLE fails to converge. In this work we describe and illustrate these estimators, and examine their performance in simulation and in an applied example estimating the effect of anemia on spontaneous preterm birth in the Zambia Preterm Birth Prevention Study. We compare performance with multiple imputation (MI) and focus on the setting of an observational study where inverse probability of treatment weights are used to address confounding. In simulation, weighting was less statistically efficient at the smallest sample size and lowest exposure prevalence examined (n = 1500 and 15%, respectively), but in other scenarios the statistical performance of weighting and MI was similar. Weighting had better computational efficiency, taking on average 0.4 and 0.05 times the time required for MI in R and SAS, respectively. UMLE was easy to implement in commonly used software, and convergence failure occurred just twice in >200,000 simulated cohorts, making implementation of CBE unnecessary. In conclusion, weighting is an alternative to MI for nonmonotone missingness, though MI performed as well as or better in terms of bias and statistical efficiency. Weighting's superior computational efficiency may make it preferable with large sample sizes or when using resampling algorithms. As the validity of weighting and MI relies on correct specification of different models, both approaches could be implemented to check agreement of results.
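As a simplified illustration of the weighting idea (a single missingness pattern, not the nonmonotone UMLE/CBE estimators themselves), one can model the probability of being observed and reweight complete cases; all variable names and the data-generating model below are hypothetical:

```r
set.seed(3)
n <- 1500
x <- rnorm(n)                           # covariate related to missingness
y <- rbinom(n, 1, plogis(-1 + 0.5 * x)) # outcome of interest
r <- rbinom(n, 1, plogis(1 + 0.8 * x))  # 1 = observed, 0 = missing
y_obs <- ifelse(r == 1, y, NA)

# Model the probability of being observed, then weight complete cases
fit_r <- glm(r ~ x, family = binomial)
w <- 1 / predict(fit_r, type = "response")

weighted.mean(y_obs[r == 1], w[r == 1])  # IPW estimate of E[Y]
mean(y)                                  # full-data benchmark
```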
Project description: Objective: To develop and validate Medicare claims-based approaches for identifying abnormal screening mammography interpretation. Data sources: Mammography data and linked Medicare claims for 387,709 mammograms performed from 1999 to 2005 within the Breast Cancer Surveillance Consortium (BCSC). Study design: Split-sample validation of algorithms based on claims for breast imaging or biopsy following screening mammography. Data extraction methods: Medicare claims and BCSC mammography data were pooled at a central Statistical Coordinating Center. Principal findings: Presence of claims for subsequent imaging or biopsy had sensitivity of 74.9 percent (95 percent confidence interval [CI], 74.1-75.6) and specificity of 99.4 percent (95 percent CI, 99.4-99.5). A classification and regression tree improved sensitivity to 82.5 percent (95 percent CI, 81.9-83.2) but decreased specificity (96.6 percent; 95 percent CI, 96.6-96.8). Conclusions: Medicare claims may be a feasible data source for research or quality improvement efforts addressing high rates of abnormal screening mammography.
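A sketch of the classification-tree approach on hypothetical binary claims indicators, assuming the rpart package; the data, variable names, and coefficients are invented and do not reproduce the BCSC/Medicare algorithms:

```r
library(rpart)

set.seed(4)
n <- 5000
d <- data.frame(
  imaging_claim = rbinom(n, 1, 0.10),  # subsequent breast imaging claim
  biopsy_claim  = rbinom(n, 1, 0.03)   # subsequent biopsy claim
)
# abnormal interpretation made more likely when follow-up claims exist
p <- plogis(-4 + 5 * d$imaging_claim + 4 * d$biopsy_claim)
d$abnormal <- factor(rbinom(n, 1, p))

train <- d[1:2500, ]
test  <- d[2501:n, ]
tree <- rpart(abnormal ~ imaging_claim + biopsy_claim,
              data = train, method = "class")
pred <- predict(tree, test, type = "class")

tab <- table(pred, test$abnormal)
c(sensitivity = tab["1", "1"] / sum(test$abnormal == "1"),
  specificity = tab["0", "0"] / sum(test$abnormal == "0"))
```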
Project description: Motivation: Studying the interplay between gene expression and metabolite levels can yield important information on the physiology of stress responses and adaptation strategies. Performing transcriptomics and metabolomics in parallel during time-series experiments represents a systematic way to gain such information. Several combined profiling datasets have been added to the public domain, and they form a valuable resource for hypothesis-generating studies. Unfortunately, detecting coresponses between transcript levels and metabolite abundances is non-trivial: they cannot be assumed to overlap directly with underlying biochemical pathways, and they may be subject to time delays and obscured by considerable noise. Results: Our aim was to predict pathway comemberships between metabolites and genes based on their coresponses to applied stress. We found that in the presence of strong noise and time-shifted responses, a hidden Markov model-based similarity outperforms the simpler Pearson correlation, but performs comparably or worse in their absence. Therefore, we propose a supervised method that applies pathway information to summarize similarity statistics into a consensus statistic that is more informative than any of the single measures. Using four combined profiling datasets, we show that comembership between metabolites and genes can be predicted for numerous KEGG pathways; this opens opportunities for the detection of transcriptionally regulated pathways and novel metabolically related genes. Availability: A command-line software tool is available at http://www.cin.ufpe.br/~igcf/Metabolites. Contact: henning@psc.riken.jp; igcf@cin.ufpe.br
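A toy example of why time-shifted responses defeat a plain Pearson correlation (this illustrates the problem, not the paper's hidden Markov model-based similarity):

```r
set.seed(5)
tp <- 1:30
signal <- sin(tp / 3)
transcript <- signal + rnorm(30, sd = 0.3)
metabolite <- c(rep(0, 2), head(signal, -2)) + rnorm(30, sd = 0.3)  # lags by 2

cor(transcript, metabolite)               # attenuated at lag 0
ccf(transcript, metabolite, lag.max = 5)  # peak at a nonzero lag
```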
Project description: For a long time, NMR chemical shifts have been used to identify protein secondary structures. Currently, this is accomplished by comparing the observed ¹Hα, ¹³Cα, ¹³Cβ, or ¹³C′ chemical shifts with the random-coil values. Here, we present a new protocol, based on the joint probability of each of the three secondary-structural types (β-strand, α-helix, and random coil) derived from chemical-shift data, to identify the secondary structure. In combination with empirical smoothing filters/functions, this protocol shows significant improvements in the accuracy and confidence of identification. Updated chemical-shift statistics are reported, on the basis of which the reliability of using chemical shifts to identify protein secondary structure is evaluated for each nucleus. The reliability varies greatly among the 20 amino acids but, on average, follows the order ¹³Cα > ¹³C′ > ¹Hα > ¹³Cβ > ¹⁵N > ¹Hᴺ for distinguishing an α-helix from a random coil, and ¹Hα > ¹³Cβ > ¹Hᴺ ≈ ¹³Cα ≈ ¹³C′ ≈ ¹⁵N for distinguishing a β-strand from a random coil. Amide ¹⁵N and ¹Hᴺ chemical shifts, which are generally excluded from such applications, were in fact found to be helpful in distinguishing a β-strand from a random coil. In addition, the chemical-shift statistics are compared with those reported previously, and the results are discussed. A Java user-interface program has been developed to make the entire procedure fully automated and is available via http://ccsr3150-p3.stanford.edu.
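A conceptual sketch of the joint-probability idea: class-conditional Gaussian densities per nucleus are multiplied under a naive independence assumption and normalized. The means and standard deviations below are invented placeholders, not the paper's updated chemical-shift statistics:

```r
classes <- c("helix", "strand", "coil")

# hypothetical class-conditional means and SDs (ppm) for two nuclei;
# a real application would use the paper's updated statistics
stats <- list(
  Ca = list(mean = c(helix = 57.9, strand = 54.7, coil = 56.0),
            sd   = c(helix = 1.5,  strand = 1.6,  coil = 1.8)),
  Ha = list(mean = c(helix = 4.05, strand = 4.75, coil = 4.35),
            sd   = c(helix = 0.25, strand = 0.30, coil = 0.30))
)

obs <- c(Ca = 58.2, Ha = 4.00)   # observed shifts for one residue (invented)

lik <- sapply(classes, function(cl)
  prod(sapply(names(obs), function(nuc)
    dnorm(obs[nuc], stats[[nuc]]$mean[cl], stats[[nuc]]$sd[cl]))))

lik / sum(lik)   # joint probabilities of helix / strand / coil
```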
Project description: Background: Flow cytometry analysis is the method of choice for the differential diagnosis of hematologic disorders. It is typically performed by a trained hematopathologist through visual examination of bidimensional plots, making the analysis time-consuming and sometimes too subjective. Here, a pilot study applying genetic algorithms to flow cytometry data from normal and acute myeloid leukemia subjects is described. Subjects and methods: Initially, Flow Cytometry Standard files from 316 normal and 43 acute myeloid leukemia subjects were transformed into multidimensional FITS image metafiles. Training was performed by introducing FITS metafiles from 4 normal and 4 acute myeloid leukemia subjects into the artificial intelligence system. Results: Two mathematical algorithms, termed 018330 and 025886, were generated. When tested against a cohort of 312 normal and 39 acute myeloid leukemia subjects, the two algorithms combined showed high discriminatory power, with an area under the receiver operating characteristic (ROC) curve of 0.912. Conclusions: The present results suggest that machine learning systems hold great promise for the interpretation of hematological flow cytometry data.
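A sketch of the evaluation metric only: the area under the ROC curve computed via the Mann-Whitney rank formulation on hypothetical classifier scores (the algorithms 018330 and 025886 themselves are not public):

```r
set.seed(7)
scores_aml    <- rnorm(39, mean = 1.4)   # hypothetical scores, AML subjects
scores_normal <- rnorm(312, mean = 0)    # hypothetical scores, normals

auc <- function(pos, neg) {
  r <- rank(c(pos, neg))                 # pooled ranks (ties averaged)
  (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
}
auc(scores_aml, scores_normal)           # high by construction
```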
Project description: Background: Most machine learning approaches only provide a classification for binary responses. However, probabilities are required for risk estimation using individual patient characteristics. It has been shown recently that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem. Objectives: The aim of this paper is to show how random forests and nearest neighbors can be used for consistent estimation of individual probabilities. Methods: Two random forest algorithms and two nearest-neighbor algorithms are described in detail for the estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors, and other learning machines in detail. We conduct a simulation study to illustrate the validity of the methods. We exemplify the algorithms by analyzing two well-known datasets on the diagnosis of appendicitis and the diagnosis of diabetes in Pima Indians. Results: Simulations demonstrate the validity of the method. With the real-data applications, we show the accuracy and practicality of this approach. We provide sample code from R packages in which the probability estimation is already available, meaning that all calculations can be performed using existing software. Conclusions: Random forest algorithms as well as nearest-neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses. Freely available implementations exist in R and may be used in applications.
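A minimal sketch of the probability-machine idea using the randomForest package: a regression forest fit to a numeric 0/1 response yields direct estimates of Pr(Y = 1 | X). The simulated data below stand in for the appendicitis and Pima Indians examples:

```r
library(randomForest)

set.seed(8)
n <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
p_true <- plogis(-0.5 + 1.2 * x1 - 0.8 * x2)
y <- rbinom(n, 1, p_true)                # numeric 0/1, NOT a factor

rf <- randomForest(x = data.frame(x1, x2), y = y)  # regression forest
p_hat <- predict(rf)                     # out-of-bag estimates of Pr(Y=1|X)

cor(p_hat, p_true)                       # tracks the true probabilities
```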
Project description: Antiretroviral treatment history and past HIV-1 genotypes have been shown to be useful predictors of the success of antiretroviral therapy. However, this information may be unavailable or inaccurate, particularly for patients with multiple treatment lines who often attend different clinics. We trained statistical models for predicting drug exposure from the current HIV-1 genotype. These models were trained on 63,742 HIV-1 nucleotide sequences derived from patients with known therapeutic histories, and on 6,836 genotype-phenotype pairs (GPPs). Mean performance for predicting drug exposure on two test sets was 0.78 and 0.76 (ROC-AUC), respectively. The mean correlation with phenotypic resistance in GPPs was 0.51 (PhenoSense) and 0.46 (Antivirogram). Performance in predicting therapy success on two test sets, based on genetic susceptibility scores, was 0.71 and 0.63 (ROC-AUC), respectively. Compared to geno2pheno[resistance], our novel models display similar or superior performance. Our models are freely available on the internet via www.geno2pheno.org. They can be used for inferring which drug compounds an HIV-1-infected patient has previously used, for predicting drug resistance, and for selecting an optimal antiretroviral therapy. Our data-driven models can be periodically retrained without expert intervention as clinical HIV-1 databases are updated, and they therefore reduce our dependency on hard-to-obtain GPPs.
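A toy sketch of the modeling task: predicting prior exposure to a drug from binary mutation indicators via logistic regression. The published models, their features, and training data differ, and all names below are merely illustrative:

```r
set.seed(9)
n <- 2000
geno <- data.frame(M184V = rbinom(n, 1, 0.3),   # mutation names are
                   K65R  = rbinom(n, 1, 0.1),   # illustrative only
                   T215Y = rbinom(n, 1, 0.2))
# exposure made more likely in the presence of resistance mutations
exposed <- rbinom(n, 1, plogis(-1 + 2 * geno$M184V + 1.5 * geno$K65R))

fit <- glm(exposed ~ ., data = cbind(geno, exposed), family = binomial)
score <- predict(fit, type = "response")

# ROC-AUC via the Mann-Whitney rank formulation
r <- rank(score)
n1 <- sum(exposed); n0 <- n - n1
(sum(r[exposed == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
```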