Introduction to OMiCC
OMiCC has two major components: 1) a database containing pre-processed gene expression data sets from NCBI GEO, together with sample groups and comparison group pairs created by the user community and annotated using MeSH terms, and user-defined compendia (for definitions, refer to the Terminology used in OMiCC section); 2) a web interface that interacts with the OMiCC database and enables users to create and (meta-) analyze gene expression compendia across studies and microarray platforms. Analysis results and raw data can also be exported for further analyses outside of OMiCC.
An important feature of OMiCC is that it allows users to contribute to the community by sharing meta-data essential for data collation, reuse and (meta-) analysis across studies. Thus, users of OMiCC can reuse sample groupings and pairings created by other users to construct cross-study data sets. We envision that as more users take advantage of OMiCC to perform biological hypothesis generation and discovery, more users will create and share meta-data, data compendia, and analyses with the biomedical research community. OMiCC can also serve as a didactic tool to help educate biomedical students and researchers to perform hands-on exploration of publicly available data and thus introduce them to data-driven and meta-analysis approaches.
Currently OMiCC contains more than 24,000 human and mouse studies drawn from GEO. Gene expression data from these studies have been normalized and those with raw expression data have undergone our own internal quality control analysis. Up to three types of expression data are made available to OMiCC users depending on the microarray platform and data availability in GEO: 1) for Affymetrix platforms, RMA normalized data derived from the raw CEL files using the Affymetrix power tools package (APT); 2) GEO series matrix data sets (i.e., GEO user submitted versions of the data); and 3) quantile normalized versions of GEO series matrix data. OMiCC provides probe-to-gene (HUGO gene symbols) mappings for more than 1900 GEO platforms (GPLs) covering over 90% of the GEO samples available at the time of data import (see Status page for updated numbers).
For further details and to cite OMiCC, please refer to:
Shah N*, Guo Y*, Wendelsdorf KV*, Lu Y, Sparks R, and Tsang JS. "A crowdsourcing approach for reusing and meta-analyzing gene expression data across studies and platforms". Nature Biotechnology 2016 http://dx.doi.org/10.1038/nbt.3603 (*equal contributing authors)
Here we provide concise definitions for the key terms used throughout the OMiCC site. For further details about related concepts and procedures, please refer to the sections in the Tutorial below.
A GEO study is made up of individual samples; each sample is a unique gene expression profile and has its own GSM sample ID. A sample group is a collection of samples (gene expression profiles) from a single study. The OMiCC user decides which samples in a GEO study to group together in a sample group. Two sample groups are required to create a comparison group pair (CGP; see below). In the example below, one sample group is made up of all the specimens collected before an influenza challenge. A second sample group is made up of all the specimens collected 12 hours after an influenza challenge. Users must annotate sample groups using at least two MeSH terms. These annotations facilitate sample group and CGP searches. Sample groups are stored on the OMiCC server and a user can elect to share sample groups with other users by making them public.
Comparison group pair (CGP)
A CGP is two sample groups that one wants to compare. For example, in the introductory example below, we want to compare the gene expression profiles of healthy subjects before and after influenza challenge. Multiple CGPs can be made from a single study. For example, if one wants to compare gene expression between healthy men and women, one could make a CGP with one sample group containing the specimens from the women prior to influenza challenge, and the second sample group containing the specimens from the men prior to influenza challenge. CGPs are also stored on the OMiCC server and can be made public to share with other users.
A compendium consists of a collection of CGPs. By using either all or a selected subset of CGPs within a compendium, one can: 1) compute differential expression profiles (DEPs) and obtain a list of differentially expressed genes for individual CGPs using a statistical significance cutoff threshold; multiple DEPs can also be merged into a single gene/probe-by-CGPs matrix file for further analysis outside of OMiCC; 2) perform meta-analysis.
Differential expression profile (DEP)
A differential expression profile (DEP) consists of the gene expression differences for all genes between the two sample groups in a CGP (see figure below). OMiCC can also compute statistics for each gene to identify statistically significant differentially expressed genes using the following analysis methods: a) Limma (default), b) Wilcoxon test, and c) Student’s t-test.
Meta-analysis is a method of extracting statistically coherent signals in differential expression from multiple CGPs, even when CGPs are obtained from different studies or generated using distinct technology platforms. Meta-analysis is performed using the RankProd meta-analysis package from R.
Once an analysis is performed, the results and the parameters used for the analysis are stored and available for the user to view, download and share with the OMiCC community.
OMiCC Tutorial
Note: In addition to using the context-dependent tooltips, please check out the "Take a Tour" feature available on a number of key OMiCC pages to obtain a quick overview of the functionalities and workflow of the page you are on.
Several functionalities, such as creating CGPs and performing analyses, are only available to registered OMiCC users. Registration is free and quick, and can be completed on the registration page.
OMiCC is a platform for generating biological hypotheses using publicly available gene expression data. Users search the OMiCC database for studies that contain gene expression data, identify groups of samples from that data, and use computational analysis tools to determine differences in gene expression between the two groups. The researcher can share these groups publicly, which allows other researchers to use them for subsequent analyses.
Illustrative example: Let’s say that one wants to explore how gene expression as measured in peripheral blood changes during acute infection with influenza. One type of study that might be useful to answer this question is an "influenza challenge." During an influenza challenge, subjects are directly infected with influenza virus and various measurements are taken post-infection.
First search OMiCC for potentially useful studies using the keywords "influenza challenge". Two studies (GSE61754 and GSE30550) seem very applicable. Save these search results ("influenza challenge") to your study list for future reference. You may also save just those two studies by selecting them from the list prior to saving the study.
Next you want to create sample groups. In addition to sample information provided by OMiCC/GEO, it is recommended that you refer to the primary publications associated with the data. For GSE61754, the publication shows that 22 individuals were challenged with H3N2 influenza. However, 11 had received a study vaccine prior to the challenge, while the other 11 were healthy (unvaccinated) subjects. For this example, we want to focus on healthy control subjects. Which subjects are healthy (unvaccinated) controls may not be annotated in GEO/OMiCC. In this case, you will have to refer to the Series Matrix File in GEO to identify which subjects were unvaccinated healthy controls (see video).
Once you know which subjects to include in the study, click on the GEO ID for GSE61754. You will see the list of the samples in the study. Make a sample group of all the samples collected before the influenza challenge. Now make a sample group of all the samples collected 12 hours after the influenza challenge.
The next step is to create a comparison group pair containing the sample group with the baseline samples and the sample group with the samples collected 12 hours after influenza infection. Because these samples were collected on the same individuals before and after infection, they should be paired. After pairing, add the CGP to the compendium of your choice.
These steps are illustrated in the following video:
This process can be repeated with study GSE30550. According to the manuscript, all 17 adults were healthy and can be included in the study. However, note that these subjects had blood sampled both 24 hours prior to the influenza challenge and just prior to the influenza challenge. In order to match the conditions with study GSE61754, we should select the samples from hour 00, rather than baseline (defined in the manuscript as 24 hours prior to the challenge).
Once you have made the correct CGP containing sample groups at baseline and 12 hours after infection, you can evaluate differential gene expression between the two sample groups that make up the CGP. If you want to examine more than one CGP, OMiCC can compute a DEP by CGP matrix for further analysis with advanced bioinformatics tools such as GenePattern. You can also perform a meta-analysis by selecting CGPs from the two studies.
Your sample groups and CGPs can be shared with the OMiCC community by making them "public." This will allow others to find your CGPs if they search, for example, by your user name, GEO study ID, or MeSH terms used to annotate the sample groups (please refer to the search section below).
In addition to creating CGPs de novo and adding them to a compendium for downstream analyses, one can also add CGPs to a compendium by searching for public CGPs authored by other OMiCC users (see video below). For example, the CGPs from GSE61754 and GSE30550 (baseline vs. 12 hours after influenza challenge) can be retrieved via OMiCC’s search interface for publicly available CGPs. After identification, they can be added to a user’s compendium for use in future analyses.
As an initial proxy for gauging the utility of a publicly available CGP, one can inspect its usage in OMiCC in terms of the number of compendia the CGP has been used in. We also encourage users in the OMiCC community to provide professional profiles, e.g., information regarding professional expertise and publications, which can be helpful for the community to assess whether CGPs authored by a user fall within her/his domain of professional expertise.
We provide four search categories: 1) search GEO studies; 2) search publicly available and the user’s private sample groups; 3) search publicly available and the user’s private comparison group pairs (CGPs); and 4) search publicly available and the user's private compendia.
- Search GEO studies
- GEO studies can be searched using keywords extracted from fields associated with study records and can be filtered using technology platforms (e.g., restricting to human or mouse only, or to specific platforms by expanding on the “Filter on Platform” option). Popular platforms are highlighted in bold.
- GEO studies in the OMiCC database can be browsed by clicking "Search" without any keywords (or by clicking the "Browse All" button).
- To search within the studies that have at least one publicly available CGP, check the option "Studies with public Comparison Group Pairs."
- Selected studies can be saved to assess at a later time with the "Save to my study list" functionality.
- Advanced search can be performed using various fields, e.g., PubMed ID of the publication associated with the data set.
- Search sample groups
- Available sample groups with annotations can be searched using keywords extracted from sample group annotations and metadata (e.g., owner ID).
- By selecting the "Public Groups" option, only the sample groups that are public are displayed (not showing the user’s private sample groups).
- Advanced search can be performed using various fields, e.g., the user ID of the owner of the group or the ID or GEO microarray platform ID.
- Search comparison group pairs (CGPs)
- Available CGPs can be searched using keywords extracted from the CGP title, owner ID, or MeSH terms used to annotate the sample groups.
- By selecting the "Public CGPs" option, only the CGPs that are public are displayed (i.e., it will not list the user’s private CGPs).
- Selected CGPs can be added to a compendium by clicking the “Add to the Compendium” button.
- Advanced search can be performed using various fields.
- Search compendia
- Search for available compendia using keywords in the following fields (Advanced Search): compendia name, description, or owner ID (user account).
- When the "Public Compendia Only" option is selected, only compendia that are marked as public by the owner are displayed. A user's own private compendia will not be displayed.
- The CGPs within a selected compendium can be added to the user's own compendium.
- To add only specific CGPs from a selected compendium, click the + button in front of the compendium name, and then check the specific CGPs to add.
- Add CGPs to the user's own compendium by clicking the green button "Add to the Compendium".
In order to create a sample group, the user must be logged in.
- Once on a study page, select the samples to be grouped. The filter may be used to identify samples of interest. For example, if you would like to limit your search to samples that were collected 12 hours after an influenza challenge, type the words "12 hours" (no quotation marks) in the filter. Samples whose annotations contain this phrase (12 hours) will be displayed. Wildcard searches are not supported in the filter. A minimum of 3 samples per sample group is required (multiple rows/samples can be selected by one click using the shift key).
- After appropriate samples are selected, click on "Create a Sample Group" button. This will open a popup window titled "New Sample Group".
- In order to create a sample group successfully, a minimum of two (2) annotations for the group need to be entered. We provide a controlled vocabulary (MeSH) to assist with annotating sample groups. MeSH annotation suggestions appear as the user types in the boxes. While we strongly encourage the use of MeSH terms, free text is allowed if the appropriate annotation term is unavailable. Once annotations are entered in the appropriate annotation categories, click the "Add Annotations" button.
- Sample group name is automatically generated based on the annotations. However, it can be changed by clicking the "Update" button and then the "Save" button, which will appear once "Update" is clicked.
- To remove any unwanted annotations, simply click on the "x" next to the annotation term.
- We provide a copy/paste functionality to allow copying of annotation tags across sample groups. This feature lets a user copy annotation terms from one sample group and paste them into another sample group.
- The sample group can then be created by clicking "Save" at the bottom of the page.
WARNING: Once a sample group is saved and made public, the sample group cannot be edited. However, if the sample group is not being used by any other OMiCC user, the owner of the sample group can change the status back to private and edit it.
In order to create a CGP, the user must be logged in.
- Once on a study page, click "Next" to go to the page for creating CGPs. At least two (2) sample groups are needed to create CGP(s).
- To create a comparison group pair (CGP), the user must specify the two constituent sample groups. Select a sample group from the displayed list of existing sample groups.
- Click on "Add to Comparison Group Pair" button and assign the selected sample group as "Condition 1" or "Condition 2 (reference)". After both the groups are assigned, a popup window will appear. Note that by convention, OMiCC treats the sample group labeled as "Condition 2" as the reference group for downstream differential expression analysis and DEP matrix generation—e.g., a positive change in gene expression means that the transcript level is higher in Condition 1 compared to Condition 2 (the "reference") and vice versa for negatively changed transcripts.
- (Optional) Samples in the two groups of a CGP can be paired (e.g., samples obtained before and after a perturbation from the same subject should be paired). If the samples are paired, a paired analysis is performed for the CGP downstream (e.g., paired t test). If the samples appear in the right order to be paired, the user can click on the "Set Default Sample Pair Index" button to automatically create the pair index in the order the samples appear. Note that pairing can only occur when the CGP is private.
- The pair index can be manually entered/changed.
- The pair index for the samples must be saved by clicking "Save Sample Pair Index" button.
- The CGP is created in the system once the "Close" button is clicked.
- After creating a CGP, the next step is to add CGP(s) to a compendium for performing analysis.
- Select CGP(s) to add to a compendium.
- (Optional) Create a new compendium by clicking on the "Create New Compendium" button.
- Select an existing compendium from the dropdown list of compendia and click "Add to Compendium" to add selected CGPs.
- After adding CGP(s) to a compendium, three options are provided: a) to go back to the search study page, b) create more sample groups in the current study, and c) go to the compendium to perform analysis on the CGP(s) in the compendium. The user can choose to ignore these options and maneuver using the tool bar on top of the page.
A compendium contains user-selected CGPs that can come from different GEO studies and microarray platforms. There are two main types of analyses that one can perform on the CGPs: a) compute differential expression profiles (DEPs), and b) meta-analysis. In the "Compendium" page there are four tabs: 1) the "CGPs" tab; 2) the "Compute Differential Expression Profiles (DEPs)" tab; 3) the "Meta-analysis" tab; and 4) the "Analysis Results" tab.
On all tabs (except for the "Analysis Results" tab), the number of samples in the sample groups within each CGP is displayed ("C1" and "C2" columns) alongside the number of probes and genes. Once CGPs are selected (by toggling the checkbox next to each CGP), OMiCC only performs analyses on the genes common to all selected CGPs. Thus, the inclusion of CGPs covering relatively few genes compared to other CGPs can result in the analysis of a small number of genes. We suggest that the user first assess the probe/gene coverage of a CGP to help decide whether to include the CGP in a joint analysis (e.g., meta-analysis) with other CGPs. When working with multiple CGPs that use the same platform, one has the option of performing analyses at the individual probe or gene level (a gene is typically covered by multiple probes). When working across platforms, analyses can only be done at the gene level.
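The effect of restricting an analysis to common genes can be illustrated with a minimal sketch. This is not OMiCC's code: the function name `build_dep_matrix`, the CGP names, and the statistic values are all made up for illustration, and only the Python standard library is used.

```python
from typing import Dict, List


def build_dep_matrix(deps: Dict[str, Dict[str, float]]) -> Dict[str, List[float]]:
    """Merge per-CGP statistics into a gene-by-CGP matrix over the common genes.

    `deps` maps a (hypothetical) CGP name to a {gene: statistic} dictionary.
    """
    if not deps:
        return {}
    cgp_names = list(deps)
    # Keep only the genes that are measured in every selected CGP.
    common = set(deps[cgp_names[0]])
    for name in cgp_names[1:]:
        common &= set(deps[name])
    # One row per common gene, one column per CGP (in the order given).
    return {gene: [deps[name][gene] for name in cgp_names] for gene in sorted(common)}


# Made-up statistics: the second CGP's platform does not cover ACTB.
deps = {
    "cgp_a": {"IFI27": 5.1, "RSAD2": 4.2, "ACTB": 0.1},
    "cgp_b": {"IFI27": 3.9, "RSAD2": 3.5},
}
matrix = build_dep_matrix(deps)
```

In this toy case ACTB is dropped from the joint matrix because one CGP lacks it, which is why a CGP with poor gene coverage can shrink a joint analysis.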
Up to three types of expression data can be used for performing analyses, depending on the platform and on data availability when study data were imported into OMiCC from GEO: 1) for Affymetrix platforms, RMA normalized data derived from the raw CEL files using the Affymetrix power tools package (labeled "RMA Normalized Data" in OMiCC when available); 2) the GEO series matrix data, i.e., GEO user-submitted data (labeled "GEO Data" when available); and 3) quantile normalized versions of (2) (labeled "Normalized GEO Data" when available). Quantile normalization is a widely used statistical approach for data normalization; OMiCC performed it during data import from GEO whenever the user-submitted series matrix data were available. We provide (3) in case the user-supplied series matrix data have not been normalized.
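For readers unfamiliar with quantile normalization, the following is a minimal sketch of the textbook procedure. It is an illustration under simplifying assumptions (no missing values, ties broken by position) and is not taken from OMiCC's import pipeline; the function name and example values are hypothetical.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of samples (each a list of expression values).

    Every sample is forced onto the same distribution: the value at rank r
    in each sample is replaced by the mean of the rank-r values across samples.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the r-th smallest value across all samples.
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    result = []
    for col in columns:
        # Indices of this sample's values, from smallest to largest.
        order = sorted(range(n), key=lambda i: col[i])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result.append(normalized)
    return result


# Two made-up samples measured on the same three probes.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(samples)
```

After normalization both samples contain the same set of values (1.5, 3.5, 5.5), assigned according to each probe's rank within its sample.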
- The "CGPs" tab lists the CGPs in the compendium. A specific subset of CGPs can be selected and made public by clicking the "Make Public" button. Meta-data, CGP information and raw GEO expression data (from the series matrix file) of selected CGPs can be downloaded by clicking the "Export Raw Data On" button, through which the user can select whether the expression data should be exported in probe or gene space. When the selected CGPs originate from studies that used different platforms, raw data can only be exported in gene space. In a "GEO Data" (i.e., data in the series matrix file from GEO)
- OMiCC offers a "one-click" approach to compute DEPs and associated statistics with default settings. The default settings consist of the following configurations: a) using Limma with BH multiple-testing correction to perform differential expression analysis, b) using "Normalized GEO Data" (see above), and c) using t-statistics to generate a gene-by-CGP data matrix (i.e., the "DEP" matrix). CGPs assayed on platforms with missing probe-to-gene mapping information or whose "Normalized GEO Data" is unavailable cannot be analyzed using this "one-click" approach. Also note that the quality and usability of the "Normalized GEO Data" are questionable for some studies. For these studies, please refer to the warning message displayed on the study page for details.
- The "Compute Differential Expression Profiles (DEPs)" tab provides more flexibility by allowing the user to choose 1) the statistical method for generating differential-expression gene signatures; 2) the statistic to use in the gene-by-CGP matrix; 3) the data source to use (e.g., "RMA Normalized Data" vs. "Normalized GEO Data"); and 4) whether to perform analysis in gene or probe space (see Analysis Methods below for further details). The user can select CGPs by filtering on organism, studies, or microarray platform. A DEP analysis run is initiated by clicking "Compute DEP(s)".
- The "Meta-analysis" tab provides an interface for performing meta-analysis on selected CGPs. As in DEP analysis, the user can select the type of data matrix to use for each CGP in the meta-analysis. Here, unique for the "Meta-analysis" tab, the user can select the reference condition for each CGP so that the comparison direction is aligned and thus biologically consistent across CGPs. For example, consider a CGP comprising Crohn’s Disease samples as the non-reference ("Condition 1") vs. those from healthy subjects as a reference ("Condition 2" in OMiCC is by default the reference condition for DEP analysis), while another CGP has healthy samples as the non-reference (i.e., "Condition 1"). Meta-analyzing the two together would require that the reference conditions be standardized to either the healthy or the disease sample groups.
- The "Analysis Results" tab provides an interface for accessing results from analyses performed. Each analysis is tracked using a unique identifier. By selecting the specific analysis "Run ID", the user can view, download and share results (see the section below on "Results from OMiCC analysis" for details).
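The need to align reference conditions before a meta-analysis follows from simple arithmetic: swapping the reference and non-reference groups flips the sign of every log fold change. The sketch below only illustrates that arithmetic; it is not how OMiCC implements the reference selector, and the function name, gene labels, and values are hypothetical.

```python
def align_direction(log_fold_changes, swap_reference):
    """Return log fold changes after optionally swapping the reference group.

    Since log(a / b) == -log(b / a), swapping Condition 1 and Condition 2
    negates every log fold change.
    """
    return {gene: (-value if swap_reference else value)
            for gene, value in log_fold_changes.items()}


# Made-up example: this CGP was built as healthy (Condition 1) vs. disease
# (Condition 2), the opposite direction of the other CGPs in the analysis.
healthy_vs_disease = {"GENE_A": -0.9, "GENE_B": -0.5}
disease_vs_healthy = align_direction(healthy_vs_disease, swap_reference=True)
```

Once the signs agree across CGPs, a positive value consistently means "higher in disease", so the CGPs can be meaningfully combined.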
Here we provide brief descriptions of the analysis methods and parameters that can be customized on the "Compute Differential Expression Profiles (DEPs)" and "Meta-analysis" tabs. All paired CGPs are analyzed using paired tests for determining the statistical significance of differential expression. Further details and examples can be found in the references cited below.
We encourage beginning users to start by using the default settings, e.g., via the one-click "Compute DEPs with Default Settings" button approach (see above). When in doubt, we recommend that users consult with their bioinformatics or biostatistics colleagues.
Limma for testing statistical significance of differential expression
The default method used to assess the statistical significance of expression differences between two sample groups in a CGP is Limma. It is one of the most widely used software packages for analyzing gene expression data. Using this method, a linear model for each probe/gene is fitted. Unlike traditional tests (e.g., t-test; see below), Limma uses an Empirical Bayesian approach and takes advantage of information across all samples to increase robustness for estimating the underlying parameters even when the sample sizes are relatively small. In general, Limma can be more sensitive than traditional tests such as the ones discussed below. For further details, please consult the Limma page and paper.
In addition to Limma, users can also use the following alternatives to assess the statistical significance of differential expression between the sample groups within CGPs:
Mann-Whitney U test (also known as Wilcoxon rank-sum test)
This is a classical non-parametric test (i.e., one that does not make any explicit assumptions about the underlying distribution of the data) that assesses whether samples from the two groups were drawn from the same population. The advantage of this test is that it works well for data distributed in any manner. However, its sensitivity (or statistical power) is often more limited than that of parametric tests such as the Student’s t-test when the data are distributed in a Gaussian (or normal) manner. Please check out the following resources for further information: Resource 1 and Resource 2.
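As a minimal sketch of what this test measures, the U statistic can be computed by counting, over all cross-group pairs of samples, how often a value in one group exceeds a value in the other. The code below computes only the statistic (no p-value), uses made-up expression values, and is not OMiCC's implementation; in practice a statistics library would normally be used.

```python
def mann_whitney_u(group1, group2):
    """Return (U1, U2) for two groups of expression values.

    U1 counts the pairs (x in group1, y in group2) with x > y; ties count 0.5.
    U1 + U2 always equals len(group1) * len(group2).
    """
    u1 = 0.0
    for x in group1:
        for y in group2:
            if x > y:
                u1 += 1.0
            elif x == y:
                u1 += 0.5
    u2 = len(group1) * len(group2) - u1
    return u1, u2


# Made-up expression values for one gene in the two sample groups of a CGP.
condition_1 = [7.1, 8.3, 9.0]
condition_2 = [5.2, 6.4, 7.5]
u1, u2 = mann_whitney_u(condition_1, condition_2)
```

Because only the ordering of values matters, the statistic is unchanged by any monotonic transformation of the data, which is what makes the test distribution-free.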
Student’s t-test
This is a classical test to assess whether the difference in mean between the two sample groups is statistically significant. This test assumes that the data are normally distributed and the groups have equal variances. Please see this page for more information.
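To make the difference between the unpaired and paired versions of the test concrete, here is a minimal sketch of both t statistics using only the Python standard library. It computes the statistics only (no p-values), the example values are made up, and it is not OMiCC's implementation.

```python
import math
from statistics import mean, stdev, variance


def student_t(condition_1, condition_2):
    """Two-sample Student's t statistic assuming equal variances (pooled)."""
    n1, n2 = len(condition_1), len(condition_2)
    pooled = ((n1 - 1) * variance(condition_1) + (n2 - 1) * variance(condition_2)) / (n1 + n2 - 2)
    return (mean(condition_1) - mean(condition_2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def paired_t(condition_1, condition_2):
    """Paired t statistic: a one-sample test on the within-pair differences."""
    diffs = [x - y for x, y in zip(condition_1, condition_2)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Made-up values for one gene; position i in both lists is the same subject.
after = [2.0, 4.0, 6.0]    # Condition 1
before = [1.0, 2.0, 3.0]   # Condition 2 (reference)
t_unpaired = student_t(after, before)
t_paired = paired_t(after, before)
```

With these numbers the paired statistic (about 3.46) is larger than the unpaired one (about 1.55), because pairing removes the subject-to-subject variation from the comparison.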
Multiple Testing Correction
Given that multiple genes are tested for differential expression between the two sample groups within a CGP, one needs to adjust the p values of the individual tests (per gene/probe) to account for multiple hypothesis testing. Please see this lecture for an introduction to multiple testing. To intuitively understand why adjusting for multiple testing is needed, imagine a situation where none of the genes are differentially expressed in a CGP because samples in each sample group were randomly drawn from a single population of samples. Due to random sampling fluctuations, however, for some of the genes the difference in the mean expression level between the sample groups can be substantial and therefore they could have significant p values. Such random, but apparently "significant" results are bound to appear when one assesses many genes.
A number of statistical approaches have been devised to account for multiple testing in order to control for the number of false positives. One of the most widely used approaches is to estimate and control for the false discovery rate (FDR). The FDR is defined as the expected fraction of false positives among the genes called significant at a given cutoff. For example, if 100 genes were deemed differentially expressed at an FDR cutoff of 0.2, we would expect about 20 of these 100 genes to be false positives (i.e., they are not truly differentially expressed). For the statistical methods supported by OMiCC above, by default we use an FDR approach called "BH" (first proposed by Benjamini and Hochberg in their 1995 paper; see references below). We also support several other multiple testing correction methods to calculate adjusted P-values, including Holm (1979), Hochberg (1988), Hommel (1988), Benjamini & Yekutieli (2001), as well as Bonferroni.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57, 289–300
Holm, S. (1979). A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics 6, 65–70
Benjamini, Y., and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29, 1165–1188
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800–802
Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 75, 383–386
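The BH adjustment itself is short enough to sketch. The function below follows the standard Benjamini-Hochberg step-up procedure: each p-value is multiplied by the number of tests divided by its rank, and the result is made monotone from the largest p-value downward. It is an illustration with made-up p-values, not OMiCC's code.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = bh_adjust(raw)
significant = [i for i, p in enumerate(adjusted) if p < 0.05]
```

In this example three genes have raw p-values below 0.05, but only the first remains below 0.05 after adjustment.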
RankProd for meta-analysis
OMiCC uses the RankProd package, which implements a non-parametric test that detects genes or probes that are consistently different between the non-reference and reference sample groups across a given collection of CGPs. Please refer to the reference below for more details about RankProd.
Hong, F., Breitling, R., McEntee, C.W., Wittner, B.S., Nemhauser, J.L., and Chory, J. (2006). RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics 22, 2825–2827
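The core idea behind the rank product can be sketched in a few lines: rank the genes by fold change within each CGP, then take the geometric mean of each gene's ranks across CGPs. The sketch below is a deliberately simplified illustration for up-regulation only. It does not reproduce the RankProd package, which additionally estimates p-values and pfp; the function name and values are hypothetical.

```python
import math


def rank_products(fold_changes):
    """Geometric mean of per-CGP ranks for each gene (rank 1 = most up-regulated).

    `fold_changes` maps a gene to a list of log fold changes, one per CGP.
    """
    genes = list(fold_changes)
    n_cgps = len(fold_changes[genes[0]])
    log_sum = {gene: 0.0 for gene in genes}
    for j in range(n_cgps):
        # Rank genes within CGP j, largest fold change first.
        ordered = sorted(genes, key=lambda g: fold_changes[g][j], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            log_sum[gene] += math.log(rank)
    return {gene: math.exp(log_sum[gene] / n_cgps) for gene in genes}


# Made-up log fold changes for three genes in two CGPs.
fold_changes = {
    "GENE_A": [2.0, 1.5],
    "GENE_B": [0.1, 1.8],
    "GENE_C": [-1.0, -0.5],
}
rp = rank_products(fold_changes)
```

A small rank product means the gene is near the top of the list in every CGP. Because only ranks are used, CGPs from different platforms can be combined without putting their expression values on a common scale.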
Results from DEP analysis and meta-analysis can be accessed by clicking on the analysis run ID in the "Analysis Results" tab.
- Differential Expression Profiles (DEPs) Analysis:
- For each CGP, the following files are available to view/download:
- Differential Expression Profile (DEPs): A file containing analysis result statistics for each gene or probe from condition 1 vs. condition 2 (the reference condition) comparison. The statistics returned depend on the analysis method used. For example, for the Limma method, the values include log(fold change), average expression, t-statistics, P value, adjusted p value and B statistics.
- Differentially Expressed Gene List: A list of differentially expressed (DE) genes or probes determined based on a user-provided statistical cutoff, or by default, genes with an adjusted P-value of less than 0.05. This list can be used to perform gene-set enrichment analysis using web tools such as DAVID.
- Heatmap (100 most DE genes/probes): A heatmap of the top 100 differentially expressed genes or probes showing all samples from sample groups in conditions 1 and 2. The values shown in the heatmap reflect the data the user selected to perform the DEP analysis; if the "one-click" approach is used, by default the "Normalized GEO Data" is shown.
- GenePattern Input Files: Two files (GCT and CLS formatted) are provided containing gene expression values and sample grouping information for downstream analysis outside of OMiCC. This format is supported by tools such as GenePattern, Gene Set Enrichment Analysis and Integrative Genomics Viewer.
- Gene-E Launcher: A Java Web Start file that launches the GENE-E tool with expression data from the CGP pre-loaded. GENE-E requires Java 7 installed locally. Please consult your local IT personnel if help is needed.
- Download CGP Information: A file containing CGP information such as GEO sample IDs of condition 1 and condition 2 sample groups together with sample group annotations.
- The following are available if two or more CGPs are included in the run:
- DEP x CGP matrix file: In addition to the individual DEP files, we provide a merged matrix of DEPs (using the user-selected statistics such as t-statistics (default)) with genes or probes as rows and CGPs as columns.
- Heatmap (500 most varying genes/probes): A clustered heatmap of the 500 most varying genes or probes across CGPs. The value displayed is the user-selected statistics for the DEP x CGP matrix discussed above. OMiCC uses the pheatmap package in R to create the heatmap. The default clustering method (complete linkage) and distance measure (euclidean) are used.
- Number of Significant DE Genes per CGP Bar Plot: A bar plot showing the number of statistically significant differentially expressed genes for each CGP.
- For each CGP, the following files are available to view/download:
- Meta-analysis results file: A file containing analysis results generated using the RankProd package in R. RankProd identifies up-regulated and down-regulated genes. The values include:
  - pfp: Estimated percentage of false positive predictions (pfp) up to the position of each gene per direction of change (i.e., up-regulated and down-regulated genes). According to the RankProd paper, pfp is similar to the false discovery rate (FDR).
  - Pval: Estimated p-value for being up-regulated or down-regulated for each gene.
  - Fc.avg: Log fold change of average expression in condition 1 over average expression in condition 2.
  - Fc (per CGP): Log fold change for each CGP (as labeled by the name of the CGP).
- Differentially Expressed Gene List: Lists of differentially up-regulated or down-regulated genes or probes based on the user-provided adjusted p-value cutoff (here the pfp is used). A cutoff of 0.05 is used by default.
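To give a feel for the statistic behind the meta-analysis results, the following is a simplified Python sketch of the rank product idea: within each CGP, genes are ranked by log fold change, and the ranks are combined across CGPs by a geometric mean. This is not the RankProd package's implementation; in particular, the permutation-based estimation of p-values and pfp is omitted, ties are broken by input order, and the gene names and fold changes are invented.

```python
import math

def rank_descending(values):
    """Rank 1 = largest value; ties are broken by input order (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def rank_product_up(log_fc_by_cgp):
    """Rank product for up-regulation.

    log_fc_by_cgp maps a CGP name to {gene: log fold change}.
    Only genes present in every CGP are scored.
    """
    genes = sorted(set.intersection(*(set(fc) for fc in log_fc_by_cgp.values())))
    k = len(log_fc_by_cgp)
    log_sum = {g: 0.0 for g in genes}
    for fc in log_fc_by_cgp.values():
        ranks = rank_descending([fc[g] for g in genes])
        for g, r in zip(genes, ranks):
            log_sum[g] += math.log(r)  # sum of logs avoids overflow of the product
    return {g: math.exp(s / k) for g, s in log_sum.items()}

# Invented log fold changes for three hypothetical CGPs.
toy = {
    "cgp1": {"A": 2.0, "B": 0.5, "C": -1.0},
    "cgp2": {"A": 1.5, "B": -0.2, "C": 0.1},
    "cgp3": {"A": 0.9, "B": 1.1, "C": -0.5},
}
rp = rank_product_up(toy)
for gene in sorted(rp, key=rp.get):
    print(gene, round(rp[gene], 3))
```

A small rank product means the gene is consistently near the top of the list across CGPs. Negating the fold changes before ranking gives the corresponding score for down-regulation.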
As detailed above, conducting a study to generate and test hypotheses in OMiCC involves constructing new CGPs and/or utilizing public CGPs to build a compendium, followed by performing statistical analyses on the CGPs. A key step for constructing new CGPs involves the identification of suitable studies based on the biological goals and technical parameters such as the microarray platform used to generate the expression data. We recommend users consult the following excellent review paper on this and related issues. Note that many of the steps discussed in this review, such as extracting data from studies, annotating samples, resolving relationships between probes and genes, are already automated or enabled by OMiCC.
Ramasamy, A., Mondry, A., Holmes, C.C., and Altman, D.G. (2008). Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Medicine 5, e184
Aside from biological considerations, there are often no clear quantitative criteria for deciding which study or data set should be included in a (meta-) analysis. As a starting point, we recommend that users inspect the gene coverage of the platform (the number of genes/probes is displayed on various pages in OMiCC; see above) and check whether the platform is widely used, to help determine whether the study is worthy of consideration. Unless it is biologically highly relevant, a study that uses a rarely used platform, or one covering too few genes, is generally less desirable. Below, separately for human and mouse, we list the most-used platforms in OMiCC based on the number of studies that use each platform. Note that some platforms are not used by many studies on their own, but they belong to a platform family that ranks highly by popularity.
Human platforms:

| Platform Family | Platform ID | Platform Name | Count of Samples | Count of Studies |
|---|---|---|---|---|
| Illumina WG | GPL6884 | Illumina HumanWG-6 v3.0 expression beadchip | 6798 | 228 |
| Illumina HT | GPL10558 | Illumina HumanHT-12 V4.0 expression beadchip | 28520 | 811 |
| Illumina HT | GPL6947 | Illumina HumanHT-12 V3.0 expression beadchip | 19992 | 409 |
| Agilent 44k | GPL4133 | Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Feature Number version) | 11510 | 583 |
| Agilent 44k | GPL6480 | Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version) | 13596 | 563 |
| Affy U95 v2 | GPL8300 | [HG_U95Av2] Affymetrix Human Genome U95 Version 2 Array | 5241 | 289 |
| Affy U133 2.0 | GPL570 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 101923 | 3691 |
| Affy U133 2.0 | GPL571 | [HG-U133A_2] Affymetrix Human Genome U133A 2.0 Array | 11814 | 463 |
| Affy U133 2.0 | GPL11670 | Affymetrix Human Genome U133 Plus 2.0 Array [Hs133P_Hs_ENTREZG.cdf] | 214 | 12 |
| Affy U133 2.0 | GPL6791 | Affymetrix GeneChip Human Genome U133 Plus 2.0 Array [CDF: Hs_ENTREZG_10] | 76 | 7 |
| Affy U133 | GPL96 | [HG-U133A] Affymetrix Human Genome U133A Array | 35373 | 1009 |
| Affy Gene ST | GPL5175 | [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array [transcript (gene) version] | 11164 | 275 |
| Affy Gene ST | GPL6244 | [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version] | 20936 | 1016 |
Mouse platforms:

| Platform Family | Platform ID | Platform Name | Count of Samples | Count of Studies |
|---|---|---|---|---|
| Affy 430 2.0 | GPL1261 | [Mouse430_2] Affymetrix Mouse Genome 430 2.0 Array | 42138 | 3172 |
| Affy 430 2.0 | GPL339 | [MOE430A] Affymetrix Mouse Expression 430A Array | 5101 | 404 |
| Affy 430 2.0 | GPL8321 | [Mouse430A_2] Affymetrix Mouse Genome 430A 2.0 Array | 5589 | 394 |
| Affy Gene ST | GPL6246 | [MoGene-1_0-st] Affymetrix Mouse Gene 1.0 ST Array [transcript (gene) version] | 16579 | 1290 |
| Affy U74 | GPL81 | [MG_U74Av2] Affymetrix Murine Genome U74A Version 2 Array | 6598 | 512 |
| Agilent 44k | GPL4134 | Agilent-014868 Whole Mouse Genome Microarray 4x44K G4122F (Feature Number version) | 4469 | 407 |
| Agilent 44k | GPL7202 | Agilent-014868 Whole Mouse Genome Microarray 4x44K G4122F (Probe Name version) | 6267 | 376 |
| Illumina MouseRef | GPL6885 | Illumina MouseRef-8 v2.0 expression beadchip | 8926 | 457 |
| Illumina WG | GPL6887 | Illumina MouseWG-6 v2.0 expression beadchip | 8465 | 595 |
A useful approach enabled by OMiCC is meta-analysis of CGPs constructed from different studies and technology platforms. In general, meta-analysis of similar biological phenotypes across independent studies can offer increased statistical power and precision, potentially leading to more robust outcomes that are less prone to platform and study biases. The example meta-analyses listed below, which combine gene expression or genetic data across studies, have demonstrated that despite heterogeneity in techniques, platforms, and cohort characteristics, coherent signals and experimentally verifiable hypotheses can be extracted using standard data normalization and analysis approaches. We also provide additional references below that discuss methodological and analysis considerations for meta-analysis of gene expression data.
In general, we recommend conducting meta-analysis using CGPs constructed from at least three independent studies. One might also consider constructing independent "training" and "validation" data compendia when a sufficient number of independent data sets/studies are available. Using this approach, the user generates hypotheses and performs initial explorations in OMiCC using the training compendium only. Once clear hypotheses emerge from the training compendium, they can be further assessed using the validation compendium. This approach can mitigate data "over-fitting" (e.g., results obtained via iterative rounds of experimenting with different analysis parameters), because such less robust results are unlikely to be replicated in the validation compendium.
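One way to plan such a training/validation design before building the compendia is sketched below in Python. The function assigns whole studies, rather than individual CGPs, to one compendium or the other so that the two remain independent. The function name, the 30% validation fraction, and the CGP and study labels are all hypothetical; OMiCC itself does not provide this step, and the compendia would still be assembled by hand in the web interface.

```python
import random

def split_by_study(cgp_to_study, validation_fraction=0.3, seed=0):
    """Split CGPs into training and validation lists, keeping each study whole.

    cgp_to_study maps a CGP name to the study it was built from.
    """
    studies = sorted(set(cgp_to_study.values()))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(studies)
    n_val = max(1, round(len(studies) * validation_fraction))
    val_studies = set(studies[:n_val])
    training = sorted(c for c, s in cgp_to_study.items() if s not in val_studies)
    validation = sorted(c for c, s in cgp_to_study.items() if s in val_studies)
    return training, validation

# Hypothetical CGPs drawn from six hypothetical studies.
cgps = {
    "cgp_1": "study_A", "cgp_2": "study_A",
    "cgp_3": "study_B", "cgp_4": "study_C",
    "cgp_5": "study_D", "cgp_6": "study_E",
    "cgp_7": "study_F",
}
training, validation = split_by_study(cgps)
print("training:", training)
print("validation:", validation)
```

Splitting at the study level matters because two CGPs from the same study share samples or batch characteristics, so placing one in each compendium would make the validation less independent.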
Example meta-analysis applications:
Granlund, A. van B. et al. Whole genome gene expression meta-analysis of inflammatory bowel disease colon mucosa demonstrates lack of major differences between Crohn’s disease and ulcerative colitis. PLoS One 8, e56818 (2013).
Khatri, P. et al. A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation. J. Exp. Med. 210, 2205–2221 (2013).
Chen, R. et al. A Meta-analysis of Lung Cancer Gene Expression Identifies PTK7 as a Survival Gene in Lung Adenocarcinoma. Cancer Res. 74, 2892–902 (2014).
Teslovich, T. M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–713 (2010).
Dudley, J. T. et al. Computational repositioning of the anticonvulsant topiramate for inflammatory bowel disease. Sci. Transl. Med. 3, 96ra76 (2011).
Sweeney, T. E., Shidham, A., Wong, H. R. & Khatri, P. A comprehensive time-course–based multicohort analysis of sepsis and sterile inflammation reveals a robust diagnostic gene set. Sci. Transl. Med. 7, 287ra71 (2015).
Sirota, M. et al. Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci. Transl. Med. 3, 96ra77 (2011).
Method and analysis considerations:
Ramasamy, A., Mondry, A., Holmes, C. C. & Altman, D. G. Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Med. 5, e184 (2008).
Hughey, J. J. & Butte, A. J. Robust meta-analysis of gene expression using the elastic net. Nucleic Acids Res. gkv229 (2015). doi:10.1093/nar/gkv229
Dudley, J. T., Tibshirani, R., Deshpande, T. & Butte, A. J. Disease signatures are robust across tissues and experiments. Mol. Syst. Biol. 5, (2009).
Sweeney, T. E., Haynes, W. A., Vallania, F., Ioannidis, J. P. & Khatri, P. Methods to increase reproducibility in differential gene expression via meta-analysis. Nucleic Acids Res. gkw797 (2016).
General Tips and Frequently Asked Questions
- When working in OMiCC, it can be useful to open two web browser screens: one with the Tutorial and GEO tabs, the other for working with OMiCC.
- When looking for sample group annotations, good sources of information include the original manuscript describing the data, text in the GEO record, and information on the sample group page in OMiCC.
- How do I delete a sample group I created?
Answer: On the study page, a list of sample groups appears on the right-hand side. Click the "detail" button for the sample group of interest. A sample group must be made private before it can be deleted. If the sample group is part of another user’s CGP or compendium, it cannot be deleted. If the sample group or CGP is not used by another OMiCC user, it can be made private and then deleted or modified.
- How do I save a selected number of studies?
Answer: After searching for GEO studies, a small subset can be saved to your study list by clicking the box adjacent to those studies of interest and selecting "Save to my study list".
- How do I obtain another user’s sample group?
Answer: If you identify a sample group that you would like to add to your compendium, click on the associated GEO ID. This will take you to the study page where the sample groups are listed on the right-hand side. You must next create a CGP with the sample group of interest; sample groups can only be added to a compendium as part of a CGP. One can then save the CGP to a designated compendium.
- Why are some CGPs unable to be used in generation of a DEP or meta-analysis?
Answer: OMiCC uses the gene symbol as a common link across studies. If the probe-to-gene map is not available, there is no way to generate a cross-study analysis.
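The role of the probe-to-gene map can be illustrated with a short Python sketch. Probe-level values from each study are collapsed to gene symbols, and only genes present in every study can enter a cross-study analysis. Averaging probes that map to the same gene is an assumption made here for illustration; the rule OMiCC actually uses is not described in this section, and all probe IDs and values are invented.

```python
def collapse_to_genes(probe_values, probe_to_gene):
    """Average probe-level values per gene symbol; unmapped probes are dropped."""
    buckets = {}
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene:
            buckets.setdefault(gene, []).append(value)
    return {gene: sum(vals) / len(vals) for gene, vals in buckets.items()}

def shared_genes(*gene_tables):
    """Gene symbols present in every study-level table."""
    return set.intersection(*(set(t) for t in gene_tables))

# Study 1: platform with a complete map (two probes for TP53).
study1 = collapse_to_genes(
    {"p1": 1.0, "p2": 3.0, "p3": 0.5},
    {"p1": "TP53", "p2": "TP53", "p3": "MYC"},
)
# Study 2: platform where only one probe is mapped.
study2 = collapse_to_genes({"x1": 1.5, "x2": -0.5}, {"x1": "TP53"})
# Study 3: platform with no probe-to-gene map at all.
study3 = collapse_to_genes({"q1": 0.2, "q2": 0.9}, {})

print(shared_genes(study1, study2))
print(shared_genes(study1, study2, study3))
```

With no map available, the third study contributes no gene symbols, so the set of genes shared across all three studies is empty and nothing remains to compare.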
- Why do I get an error when I open my text file in Excel, indicating that my text file is a SYLK file?
Answer: This error occurs because the text file begins with the term "ID". You can ignore this error. Click OK and proceed to inspect the file.
- I see that many genes have the same adjusted p-values (FDRs); are these correct?
Answer: Yes, this can happen depending on the multiple testing correction approach used. BH, for example, can often generate such results. Please refer to the discussion and references above.
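To see why BH produces ties, the following Python sketch implements the standard Benjamini-Hochberg step-up adjustment on a small set of invented p-values. Each raw p-value is scaled by m/rank, and the adjusted values are then forced to be non-decreasing in the raw p-value by taking a running minimum from the largest p-value downward. Whenever that minimum takes over, neighboring genes receive exactly the same adjusted value. This is a generic illustration, not the code OMiCC runs.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, carrying the minimum along.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Invented raw p-values for five genes.
raw = [0.001, 0.02, 0.021, 0.022, 0.5]
adj = bh_adjust(raw)
for p, q in zip(raw, adj):
    print(p, q)
```

In this example the three middle genes have distinct raw p-values but share one adjusted value, because the scaled value of the gene ranked fourth is smaller than the scaled values of the genes ranked second and third.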
- How do I export all the sample groups and CGP information, together with the raw data from a compendium, for analysis outside of OMiCC?
Answer: Open the compendium, go to the “CGPs” tab, select all or a subset of the CGPs, click “Export Raw Data On”, then choose to export in either probe or gene space. The output is in SOFT format, the standard format used in GEO series matrix files and for uploading data to GEO. The exported expression data originate from the series matrix file in GEO.