Effect of single nucleotide polymorphisms on Affymetrix match-mismatch probe pairs.
ABSTRACT: Microarrays provide a means of studying the expression levels of tens of thousands of genes by providing one or more oligonucleotide probes for each transcript studied. Affymetrix® GeneChip™ platforms historically pair each 25-base perfect match (PM) probe with a mismatch (MM) probe that differs by a complementary base at the 13th position, in order to quantify and deflate the effects of cross-hybridization. Analytical routines for these arrays take into account the difference in expression levels of MM and PM probes to determine which ones are useful for further study. If a single nucleotide polymorphism (SNP) occurs at the 13th base, a probe with a higher MM expression level may be incorrectly omitted. In order to examine SNP effects on PM and MM expression levels, known human SNPs from dbSNP were mapped to probe sets within the Affymetrix® HG-U133A platform. Probe sets containing one or more probe pairs with a single SNP at the 13th position were extracted. A set of twelve microarray experiments was analyzed for the PM and MM expression levels of these probe sets. Over 6,000,000 human SNPs and their flanking regions were extracted from dbSNP. These sequences were aligned against each of the 247,965 probe pair sequences from the Affymetrix® HG-U133A platform. A total of 915 probe sets containing a single probe sequence with a SNP mapped to the 13th base were extracted. In a subset of 166 probe sets, the SNP results in the complementary base. Comparison of gene expression levels for the SNP and non-SNP PM and MM probes does not yield a significant difference using chi-squared analysis. Thus, omission of probes with MM expression levels higher than PM expression levels does not appear to result in a loss of information concerning SNPs for these regions.
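The probe-pair logic described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names are hypothetical, and it assumes that SNP alleles are given in the same strand orientation as the probe sequence. It shows how an MM probe is derived from a 25-base PM probe and when a SNP at the 13th base would turn the MM probe into the true perfect match.

```python
# Watson-Crick complements used to build the mismatch base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
MIDDLE = 12  # zero-based index of the 13th base in a 25-mer


def mismatch_probe(pm):
    """Return the MM probe: the PM sequence with its 13th base complemented."""
    if len(pm) != 25:
        raise ValueError("expected a 25-base probe")
    return pm[:MIDDLE] + COMPLEMENT[pm[MIDDLE]] + pm[MIDDLE + 1:]


def snp_makes_mm_match(pm, snp_offset, alt_allele):
    """True if a SNP at zero-based snp_offset, with alternative allele
    alt_allele (assumed to be on the probe strand), makes the MM probe
    identical to the probe sequence implied by the variant allele."""
    return snp_offset == MIDDLE and alt_allele == COMPLEMENT[pm[MIDDLE]]
```

Under these assumptions, only a SNP whose alternative allele is the complement of the PM middle base (the "complementary base SNPs" case) makes the MM probe a perfect match; any other allele at position 13 mismatches both probes.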
Project description:BACKGROUND: Affymetrix gene expression arrays incorporate paired perfect match (PM) and mismatch (MM) probes to distinguish true signals from those arising from cross-hybridization events. A MM signal often shows greater intensity than a PM signal; we propose that one underlying cause is the presence of allelic variants arising from single nucleotide polymorphisms (SNPs). To annotate and characterize SNP contributions to anomalous probe binding behavior we have developed a software tool called AffyMAPSDetector. RESULTS: AffyMAPSDetector can be used to describe any Affymetrix expression GeneChip with respect to SNPs. When AffyMAPSDetector was run on GeneChip HG-U95Av2 against dbSNP-build-123, we found 7286 probes (belonging to 2,582 probesets) containing SNPs, out of which 325 probes contained at least one SNP at position 13. Against dbSNP-build-126, 8758 probes (belonging to 3,002 probesets) contained SNPs, of which 409 probes contained at least one SNP at position 13. Therefore, depending on the expressed allele, the MM probe can sometimes be the transcript complement. This information was used to characterize probe measurements reported in a published, well-replicated lung adenocarcinoma study. The total intensity distributions showed that the SNP-containing probes had a larger negative mean intensity difference (PM-MM) and greater range of the difference than did probes without SNPs. In the sample replicates, SNP-containing probes with reproducible intensity ratios were identified, allowing selection of SNP probesets that yielded unique sample signatures. At the gene expression level, use of the (MM-PM) value for SNP-containing probes resulted in different Presence/Absence calls for some genes. Such a change in status of the genes has the clear potential for influencing downstream clustering and classification results. 
CONCLUSION: Output from this tool characterizes SNP-containing probes on GeneChip microarrays, thus improving our understanding of factors contributing to expression measurements. The pattern of SNP binding examined so far indicates distinct behavior of the SNP-containing probes and has the potential to help us identify new SNPs. Knowing which probes contain SNPs provides flexibility in determining whether to include or exclude them from gene-expression intensity calculations; selected sets of SNP-containing probes produce sample-unique signatures. AffyMAPSDetector information is available at http://www.binf.gmu.edu/weller/BMC_bioinformatics/AffyMapsDetector/index.html.
Project description:BACKGROUND: Affymetrix GeneChip microarrays are popular platforms for expression profiling in two types of studies: detection of differential expression computed by p-values of t-test and estimation of fold change between analyzed groups. There are many different preprocessing algorithms for summarizing Affymetrix data. The main goal of these methods is to remove effects of non-specific hybridization, and to optimally combine information from multiple probes annotated to the same transcript. The methods are benchmarked by comparison with reference methods, such as quantitative reverse-transcription PCR (qRT-PCR). RESULTS: We present a comprehensive analysis of agreement between Affymetrix GeneChip and qRT-PCR results. We analyzed the influence of filtering by fraction Present calls introduced by J.N. McClintick and H.J. Edenberg (2006) and 2 mapping procedures: updated probe sets definitions proposed by Dai et al. (2005) and our "naive mapping" method. Because of evolution of genome sequence annotations since the time when microarrays were designed, we also studied the effect of the annotation release date. These comparisons were prepared for 6 popular preprocessing algorithms (MAS5, PLIER, RMA, GC-RMA, MBEI, and MBEImm) in the 2 above-mentioned types of studies. We used data sets from 6 independent biological experiments. As a measure of reproducibility of microarray and qRT-PCR values, we used linear and rank correlation coefficients. CONCLUSIONS: We show that filtering by fraction Present calls increased correlations for all 6 preprocessing algorithms. We observed the difference in performance of PM-MM and PM-only methods: using MM probes increased correlations in fold change studies, but PM-only methods proved to perform better in detection of differential expression. We recommend using GC-RMA for detection of differential expression and PLIER for estimation of fold change. 
The use of the more recent annotation improves the results in both types of studies, encouraging re-analysis of old data.
Project description:BACKGROUND: Microarray technology is a high-throughput method for measuring the expression levels of thousands of genes simultaneously. The observed intensities include a non-specific binding component, which is a major disadvantage of microarray data. The Affymetrix GeneChip includes mismatch (MM) probes with the intention of measuring non-specific binding, but opinions differ regarding the usefulness of MM measures. It should be noted that not all observed intensities are associated with expressed genes; many are associated with unexpressed genes, whose measured values reflect mere noise due to non-specific binding, cross-hybridization, or stray signals. The implicit assumption that all genes are expressed leads to poor performance of microarray data analyses. We assume two functional states of a gene - expressed or unexpressed - and propose a robust method to estimate gene expression states using an order relationship between perfect match (PM) and MM measures. RESULTS: An indicator, 'probability of a gene being expressed', was obtained using the number of probe pairs within a probe set where the PM measure exceeds the MM measure. We examined the validity of the proposed indicator using Human Genome U95 data sets provided by Affymetrix. The usefulness of 'probability of a gene being expressed' is illustrated through an exploration of candidate genes involved in neuroblastoma prognosis. We identified the candidate genes for which expression states differed (unexpressed or expressed) when compared between two outcomes. The validity of this result was subsequently confirmed by quantitative RT-PCR. CONCLUSION: The proposed qualitative evaluation, 'probability of a gene being expressed', is a useful indicator for improving microarray data analysis. It is useful for reducing the number of false discoveries. 
Expression states - expressed or unexpressed - correspond to the most fundamental gene function 'On' and 'Off', which can lead to biologically meaningful results.
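The idea of using the count of probe pairs with PM greater than MM can be sketched as a simple sign-test calculation. This is an illustration only, not the indicator defined by the authors: the function names are hypothetical, and the null assumption that PM exceeds MM with probability 0.5 for an unexpressed gene is ours.

```python
from math import comb


def count_pm_gt_mm(pm_values, mm_values):
    """Number of probe pairs in a probe set where PM exceeds MM."""
    return sum(p > m for p, m in zip(pm_values, mm_values))


def sign_test_tail(k, n, p0=0.5):
    """P(X >= k) for X ~ Binomial(n, p0).

    Assumption: for an unexpressed gene, PM > MM occurs with
    probability p0 independently in each of the n probe pairs.
    """
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))
```

A small tail probability means that so many PM > MM pairs would be unlikely for an unexpressed gene; a value near 1 is consistent with noise.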
Project description:BACKGROUND: Affymetrix GeneChips are characterized by probe pairs, a perfect match (PM) and a mismatch (MM) probe differing by a single nucleotide. Most data preprocessing algorithms neglect MM signals, as it was shown that MMs cannot be used as estimators of non-specific hybridization as originally proposed by Affymetrix. The aim of this paper is to study in detail, across a large number of experiments, the behavior of the average PM/MM ratio. This is taken as an indicator of the quality of the hybridization and, when compared between different chip series, of the quality of the chip design. RESULTS: About 250 different GeneChip hybridizations performed at the VIB Microarray Facility for Homo sapiens, Drosophila melanogaster, and Arabidopsis thaliana were analyzed. The investigation of such a large set of data from the same source minimizes systematic experimental variations that may arise from differences in protocols or from different laboratories. The PM/MM ratios are derived theoretically from thermodynamic laws and a link is made with the sequence of the PM and MM probes, more specifically with their central nucleotide triplets. CONCLUSION: The PM/MM ratios subdivided according to the different central nucleotide triplets follow qualitatively those deduced from the hybridization free energies in solution. It is also shown that the PM and MM histograms are related by a simple scale transformation, in agreement with what is to be expected from hybridization thermodynamics. Different quantitative behavior is observed on the chips for the different organisms analyzed, suggesting that some organism chips have superior probe design compared to others.
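Grouping PM/MM ratios by central nucleotide triplet might look like the following minimal sketch. The input layout (tuples of PM sequence, PM intensity, MM intensity) and the function name are assumptions made for illustration, and the arithmetic mean is used here only as one plausible reading of "average PM/MM ratio".

```python
from collections import defaultdict


def mean_ratio_by_central_triplet(probe_pairs):
    """Average PM/MM ratio per central triplet of the PM sequence.

    probe_pairs: iterable of (pm_sequence, pm_intensity, mm_intensity).
    """
    ratios = defaultdict(list)
    for seq, pm, mm in probe_pairs:
        if len(seq) != 25 or mm <= 0:
            continue  # skip malformed probes and unusable MM values
        triplet = seq[11:14]  # bases 12-14, centred on the 13th base
        ratios[triplet].append(pm / mm)
    return {t: sum(v) / len(v) for t, v in ratios.items()}
```

The resulting per-triplet averages could then be compared against values predicted from hybridization free energies, which this sketch does not attempt to compute.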
Project description:Affymetrix GeneChip high-density oligonucleotide arrays are widely used in biological and medical research because of production reproducibility, which facilitates the comparison of results between experiment runs. In order to obtain high-level classification and cluster analysis that can be trusted, it is important to perform various pre-processing steps on the probe-level data to control for variability in sample processing and array hybridization. Many proposed preprocessing methods are parametric, in that they assume that the background noise generated by microarray data is a random sample from a statistical distribution, typically a normal distribution. The quality of the final results depends on the validity of such assumptions. We propose a Distribution Free Convolution Model (DFCM) to circumvent observed deficiencies in meeting and validating distribution assumptions of parametric methods. Knowledge of array structure and the biological function of the probes indicate that the intensities of mismatched (MM) probes that correspond to the smallest perfect match (PM) intensities can be used to estimate the background noise. Specifically, we obtain the smallest q2 percent of the MM intensities that are associated with the lowest q1 percent PM intensities, and use these intensities to estimate background. Using the Affymetrix Latin Square spike-in experiments, we show that the background noise generated by microarray experiments typically is not well modeled by a single overall normal distribution. We further show that the signal is not exponentially distributed, as is also commonly assumed. Therefore, DFCM has better sensitivity and specificity, as measured by ROC curves and area under the curve (AUC) than MAS 5.0, RMA, RMA with no background correction (RMA-noBG), GCRMA, PLIER, and dChip (MBEI) for preprocessing of Affymetrix microarray data. These results hold for two spike-in data sets and one real data set that were analyzed. 
Comparisons with other methods on two spike-in data sets and one real data set show that our nonparametric methods are a superior alternative for background correction of Affymetrix data.
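The selection step described for DFCM (the smallest q2 percent of the MM intensities associated with the lowest q1 percent of PM intensities) can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: the function name, the rounding of percentages to counts, and the choice to return the raw subset are ours, and the abstract does not say how background is then estimated from these values.

```python
import math


def dfcm_background_subset(pm, mm, q1, q2):
    """Return the MM intensities used for background estimation.

    pm, mm: equal-length lists of paired PM and MM intensities.
    q1, q2: percentages (0-100).
    Assumption: percentages are rounded up to whole probe counts.
    """
    if len(pm) != len(mm):
        raise ValueError("pm and mm must be paired")
    # Indices of the probe pairs with the lowest q1 percent of PM values.
    n1 = max(1, math.ceil(len(pm) * q1 / 100))
    lowest_pm_idx = sorted(range(len(pm)), key=lambda i: pm[i])[:n1]
    # Among their MM partners, keep the smallest q2 percent.
    mm_selected = sorted(mm[i] for i in lowest_pm_idx)
    n2 = max(1, math.ceil(len(mm_selected) * q2 / 100))
    return mm_selected[:n2]
```

A summary of the returned values (for example their mean) could serve as a background level, but which summary DFCM actually uses is not given in the abstract.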
Project description:BACKGROUND: The availability of a recently published large-scale spike-in microarray dataset helps us to understand the influence of probe sequence in non-specific binding (NSB) signal and enables the benchmarking of several models for the estimation of NSB. In a typical microarray experiment using Affymetrix whole genome chips, 30% to 50% of the probes will apparently have absent target transcripts and show only NSB signal, and these probes can have significant repercussions for normalization and the statistical analysis of the data if NSB is not estimated correctly. RESULTS: We have found that the MAS5 perfect match-mismatch (PM-MM) model is a poor model for estimation of NSB, and that the Naef and Zhang sequence-based models can reasonably estimate NSB. In general, using the GC robust multi-array average, which uses Naef binding affinities, to calculate NSB (GC-NSB) outperforms other methods for detecting differential expression. However, there is an intensity dependence of the best performing methods for generating probeset expression values. At low intensity, methods using GC-NSB outperform other methods, but at medium intensity, MAS5 PM-MM methods perform best, and at high intensity, MAS5 PM-MM and Zhang's position-dependent nearest-neighbor (PDNN) methods perform best. CONCLUSION: A combined statistical analysis using the MAS5 PM-MM, GC-NSB and PDNN methods to generate probeset values results in an improved ability to detect differential expression and estimates of false discovery rates compared with the individual methods. Additional improvements in detecting differential expression can be achieved by a strict elimination of empty probesets before normalization. 
However, there are still large gaps in our understanding of the Affymetrix GeneChip technology, and additional large-scale datasets, in which the concentration of each transcript is known, need to be produced before better models of specific binding can be created.
Project description:BACKGROUND: High-density oligonucleotide arrays are widely used for analysis of genome-wide expression and genetic variation. Affymetrix GeneChips - common high-density oligonucleotide arrays - contain perfect match (PM) and mismatch (MM) probes generated by changing a single nucleotide of the PMs, to estimate cross-hybridization. However, a fraction of MM probes exhibit larger signal intensities than PMs, when the difference in the amount of target specific hybridization between PM and MM probes is smaller than the variance in the amount of cross-hybridization. Thus, pairs of PM and MM probes with greater specificity for single nucleotide mismatches are desirable for accurate analysis. RESULTS: To investigate the specificity for single nucleotide mismatches, we designed a custom array with probes of different length (14- to 25-mer) tethered to the surface of the array and all possible single nucleotide mismatches, and hybridized artificially synthesized 25-mer oligodeoxyribonucleotides as targets in bulk solution to avoid the effects of cross-hybridization. The results indicated the finite availability of target molecules as the probe length increases. Due to this effect, the sequence specificity of the longer probes decreases, and this was also confirmed even under the usual background conditions for transcriptome analysis. CONCLUSION: Our study suggests that the optimal probe length for specificity is 19-21-mer. This conclusion will assist in improvement of microarray design for both transcriptome analysis and mutation screening.
Project description:The present study deals with genome-wide identification of single-nucleotide polymorphism (SNP) markers related to powdery mildew (PM) resistance in two pepper varieties. Capsicum baccatum (PRH1, a PM-resistant line) and Capsicum annuum (Saengryeg, a PM-susceptible line) were resequenced to develop SNP markers. A total of 6,213,009 and 6,840,889 SNPs were discovered for PRH1 and Saengryeg, respectively. Among the SNPs, the majority were classified as homozygous-type SNPs, particularly in the resistant line. Moreover, the SNPs were differentially distributed among the chromosomes in both the resistant and susceptible lines. In total, 4,887,031 polymorphic SNP loci were identified between the two lines and 306,871 high-resolution melting (HRM) marker primer sets were designed. In order to understand the SNPs associated with the vital genes involved in disease resistance and stress-associated processes, chromosome-wise gene ontology analysis was performed. The results revealed that SNPs related to disease resistance genes were predominantly distributed on chromosome 4. In addition, 6281 SNPs associated with 46 resistance genes were identified. Of the two lines, PRH1 contained the largest number of polymorphic SNPs related to NBS-LRR genes. The SNP markers were validated using an HRM assay in 45 F4 populations and correlated with the phenotypic disease index.
Project description:Microarrays have been used extensively to analyze the expression profiles for thousands of genes in parallel. Most of the widely used methods for analyzing Affymetrix Genechip microarray data, including RMA, GCRMA and Model Based Expression Index (MBEI), summarize probe signal intensity data to generate a single measure of expression for each transcript on the array. In contrast, other methods are applied directly to probe intensities, negating the need for a summarization step. In this study, we used the Affymetrix rat genome Genechip to explore variability in probe response patterns within transcripts. We considered a number of possible sources of variability in probe sets including probe location within the transcript, middle base pair of the probe sequence, probe overlap, sequence homology and affinity. Although affinity, middle base pair and probe location effects may be seen at the gross array level, these factors only account for a small proportion of the variation observed at the gene level. A BLAST search and the presence of probe by treatment interactions for selected differentially expressed genes showed high sequence homology for many probes to non-target genes. We suggest that examination and modeling of probe level intensities can be used to guide researchers in refining their conclusions regarding differentially expressed genes. We discuss implications for probe sequence selection for confirmatory analysis using real time PCR.
Project description:Affymetrix SNP arrays have been widely used for single-nucleotide polymorphism (SNP) genotype calling and DNA copy number variation inference. Although numerous methods have achieved high accuracy in these fields, most studies have paid little attention to the modeling of hybridization of probes to off-target allele sequences, which can affect the accuracy greatly. In this study, we address this issue and demonstrate that hybridization with mismatch nucleotides (HWMMN) occurs in all SNP probe-sets and has a critical effect on the estimation of allelic concentrations (ACs). We study sequence binding through binding free energy and then binding affinity, and develop a probe intensity composite representation (PICR) model. The PICR model allows the estimation of ACs at a given SNP through statistical regression. Furthermore, we demonstrate with cell-line data of known true copy numbers that the PICR model can achieve reasonable accuracy in copy number estimation at a single SNP locus, by using the ratio of the estimated AC of each sample to that of the reference sample, and can reveal subtle genotype structure of SNPs at abnormal loci. We also demonstrate with HapMap data that the PICR model yields accurate SNP genotype calls consistently across samples, laboratories and even across array platforms.