AffySTExpressionFileCreator (v1) BETA

This module is currently in beta release. The module and/or documentation may be incomplete.

Creates a GCT file from a set of CEL files from Affymetrix ST arrays.

Author: David Eby, Broad Institute

Contact:

gp-help@broadinstitute.org

Algorithm Version:

Summary

Please note that version 0.14 is currently available only in beta on servers hosted by the GenePattern Team. We are working to release updates that will be available for use on all platforms. Feel free to contact us with any questions.

This module creates a gene expression dataset from a set of CEL files for Affymetrix ST arrays.  It is similar to ExpressionFileCreator, which operates on CEL files from the older 3' biased IVT-based Affymetrix arrays.  The conversion is done using the Robust Multi-array Average (RMA) algorithm as provided by the 'oligo' package in Bioconductor.  The result is a matrix containing one intensity value per probe set per sample in the GCT file format. 
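
As an illustration of the output layout, here is a minimal Python sketch that reads a GCT file into sample names and per-probe-set rows. It assumes the standard GCT 1.2 layout (a "#1.2" version line, a dimensions line, a header line starting with Name and Description, then one tab-delimited row per probe set). The function name `read_gct` and the probe set IDs in the example are hypothetical and are not part of the module.

```python
import io


def read_gct(handle):
    """Parse GCT 1.2 text into (sample_names, rows).

    Each row is (name, description, [one value per sample]).
    """
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected GCT version line: {version!r}")
    # Second line: number of data rows, then number of samples.
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]  # skip the Name and Description columns
    rows = []
    for line in handle:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        rows.append((fields[0], fields[1], [float(v) for v in fields[2:]]))
    if len(rows) != n_rows or len(samples) != n_cols:
        raise ValueError("dimensions do not match the GCT header")
    return samples, rows


# Hypothetical two-probe-set, two-sample dataset.
example = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tsampleA\tsampleB\n"
    "probe_1\t7157 // NM_000546 // TP53\t5.5\t6.0\n"
    "probe_2\tNA // NA // NA\t7.5\t8.25\n"
)
samples, rows = read_gct(io.StringIO(example))
```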

Note that the RMA algorithm will log-transform the data during processing.  This may affect downstream processing by other modules, some of which will produce erroneous results with log-transformed data unless adjustments are made.  For example, the ComparativeMarkerSelection module has a parameter that must be set for it to accept and adjust for log-transformed data.
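
If a downstream tool expects values on the original intensity scale, the log transform can be reversed. The sketch below assumes the values are base-2 logarithms, which is the usual convention for RMA output; confirm this for your own data before relying on it. The function names are hypothetical.

```python
import math


def to_linear(log2_values):
    """Convert log2-scale values back to the linear intensity scale."""
    return [2.0 ** v for v in log2_values]


def to_log2(linear_values):
    """Convert positive linear-scale intensities to log2 scale."""
    return [math.log2(v) for v in linear_values]


linear = to_linear([0.0, 1.0, 10.0])
```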

Multiple CEL files can be uploaded directly to the input file parameter for processing.  The parameter also accepts CELs packaged as a ZIP or TAR bundle or supplied as a directory input if your GenePattern server is configured to allow it.  You can provide multiple ZIPs, TARs, or directory inputs as well, or mix all of these forms.  The CEL files can be compressed in GZ format and the TAR bundles can be in GZ, XZ, or BZ2 format.  Any directory inputs will be recursively searched for CEL files (uncompressed or in GZ format) to include in the dataset; ZIPs and TARs in these inputs will not be included, however.
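
The directory search rule described above can be sketched in Python as follows. This is not the module's own code. It assumes that CEL files are recognized by a ".cel" or ".cel.gz" suffix and that matching is case-insensitive; the function name `find_cel_files` is hypothetical.

```python
import os

# Assumed suffixes for uncompressed and GZ-compressed CEL files.
CEL_SUFFIXES = (".cel", ".cel.gz")


def find_cel_files(root):
    """Recursively collect CEL files under root.

    ZIP and TAR bundles found inside the directory are skipped
    rather than unpacked, matching the behavior described above.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a repeatable order
        for name in sorted(filenames):
            if name.lower().endswith(CEL_SUFFIXES):
                found.append(os.path.join(dirpath, name))
    return found
```

Sorting the directory and file names makes the result repeatable, which the module itself does not promise.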

You can supply an optional CLM file listing the CEL files to be included in the dataset, their order, their phenotypic categories, and their alternate sample names.  Any files submitted for a job but not listed in the CLM file will not be included in the dataset.  The column order of the dataset will match the order of the CLM listing.  If no CLM file is provided, the CEL file names will be used as sample names and the column order will match the module's processing order.  That processing order can be somewhat unpredictable, so a CLM file is recommended if column order is important.
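
The selection and ordering rules for a CLM file can be sketched in Python as follows. The sketch assumes each CLM line holds a scan name, a sample name, and a class separated by tabs, and that a scan name matches a CEL file once any ".gz" and ".CEL" extensions are removed. That matching rule is an assumption and the module may match names differently. All function names are hypothetical.

```python
def parse_clm(text):
    """Return a list of (scan, sample, class) tuples from CLM text."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        scan, sample, label = line.split("\t")[:3]
        entries.append((scan.strip(), sample.strip(), label.strip()))
    return entries


def scan_key(file_name):
    """Strip a trailing .gz and then a trailing .cel (assumed matching rule)."""
    name = file_name
    if name.lower().endswith(".gz"):
        name = name[:-3]
    if name.lower().endswith(".cel"):
        name = name[:-4]
    return name


def order_by_clm(cel_files, entries):
    """Keep only the CEL files listed in the CLM, in CLM order."""
    by_key = {scan_key(f): f for f in cel_files}
    ordered = []
    for scan, sample, label in entries:
        key = scan_key(scan)
        if key in by_key:
            ordered.append((by_key[key], sample, label))
    return ordered


clm_text = "b\tTumor_1\ttumor\na\tNormal_1\tnormal\n"
selected = order_by_clm(["c.CEL", "a.CEL.gz", "b.CEL"], parse_clm(clm_text))
```

In this example "c.CEL" is dropped because the CLM does not list it, and the remaining files follow the CLM order rather than the input order.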

The default behavior is to normalize and background correct the dataset upon extraction, but the normalize and background correct parameters can be set to 'no' if raw data extraction is desired.
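
The normalize parameter is described below as quantile normalization. For readers unfamiliar with the idea, here is a small pure-Python sketch of the technique. It is an illustration only, not the implementation used by the 'oligo' package, and it breaks ties by position rather than averaging them. The function name is hypothetical.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of samples (equal-length lists of values).

    Every sample ends up with the same set of values: the mean of the
    k-th smallest value across samples, assigned back by rank.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)
    ]
    result = []
    for col in columns:
        # Indices of this sample's values from smallest to largest.
        order = sorted(range(n), key=lambda i: col[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = reference[rank]
        result.append(normalized)
    return result


normalized = quantile_normalize([[2.0, 4.0, 6.0], [1.0, 5.0, 3.0]])
```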
 
Also by default, the dataset will be annotated with gene identifiers for each probeset; set annotate probes to 'no' if you don't want these included.  Where available, the Entrez Gene number, RefSeq ID, and gene symbol will be provided in the "Description" column of the dataset, in the format "[EntrezGene number] // [RefSeq ID] // [gene symbol]".  The text "NA" is given instead if any of these are missing.  These annotations come from the Bioconductor bundle for the array set being analyzed.  Unfortunately, no annotation information is available for organisms other than Human, Mouse, or Rat.
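
The "Description" format above is easy to split back into its parts. The Python sketch below assumes exactly three fields separated by "//" and treats the text "NA" as a missing value. The function name and dictionary keys are hypothetical.

```python
def parse_description(description):
    """Split '[EntrezGene number] // [RefSeq ID] // [gene symbol]' into a dict."""
    parts = [p.strip() for p in description.split("//")]
    if len(parts) != 3:
        raise ValueError(f"expected three '//'-separated fields: {description!r}")
    # "NA" marks a missing annotation; represent it as None.
    entrez, refseq, symbol = (None if p == "NA" else p for p in parts)
    return {"entrez_gene": entrez, "refseq": refseq, "symbol": symbol}


annotated = parse_description("7157 // NM_000546 // TP53")
missing = parse_description("NA // NA // NA")
```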
 
If you have Human, Mouse, or Rat CEL files, you may wish to use the ReannotateGCT module with the CHIP files from the GSEA project to reannotate your GCT files, as these have been reviewed and curated by GSEA staff to work better with GSEA and MSigDB than the Affymetrix-provided annotations.
 
Lastly, the module is capable of producing a set of plots in multiple formats that may be useful for QC purposes.  See the Output Files section below for more details.  Plot generation is turned off by default as it can be quite time-consuming.

References

Carvalho BS and Irizarry RA (2010). "A Framework for Oligonucleotide Microarray Preprocessing." Bioinformatics. ISSN 1367-4803.

Carvalho BS and Irizarry RA (2014). "Package 'oligo'" documentation from Bioconductor 2.14.

Parameters

Name                  Description
input file *          One or more Affymetrix ST CEL files either uploaded directly, packaged into a ZIP or TAR bundle, or supplied through a directory input.  The CEL files can be in GZ format and the TAR can be in GZ, XZ, or BZ2 format.  The parameter will accept multiple inputs in any of these forms.
normalize *           Whether to normalize data using quantile normalization.
background correct *  Whether to perform background correction.
clm file              A tab-delimited text file containing one scan, sample, and class per line.
annotate probes *     Whether to annotate probes with gene identifiers (Entrez Gene number, RefSeq ID, and gene symbol) in the "Description" column.
output file base *    The base name of the output file(s). File extensions will be added automatically.

* - required

Input Files

  1. input.file
    One or more Affymetrix ST CEL files.  These can be supplied as individual CEL files, in a ZIP or TAR bundle, or in a directory.  The CEL files can be in GZ format and the TAR can be in GZ, XZ, or BZ2 format.  Note that the CEL file names must be unique, ignoring any compression format extensions.  Also note that all CEL files must be of the same array type.
  2. clm.file
    An optional CLM file listing the CEL files to be included in the dataset, their order, their phenotypic categories, and their alternate sample names.  Any files submitted for a job but not listed in the CLM file will not be included in the dataset.  The column order of the dataset will match the order of the CLM listing.  If no CLM file is provided, the CEL file names will be used as sample names and the column order will match the module's processing order.  That processing order can be somewhat unpredictable, so a CLM file is recommended if column order is important.
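
Because CEL file names must be unique once compression extensions are ignored, it can help to check a set of inputs before submitting a job. The Python sketch below is one way to do that. It assumes ".gz" is the only compression extension to strip and that names are compared case-sensitively; both are assumptions, and the function names are hypothetical.

```python
import os
from collections import defaultdict


def base_cel_name(path):
    """File name with any directory and a trailing .gz removed."""
    name = os.path.basename(path)
    if name.lower().endswith(".gz"):
        name = name[:-3]
    return name


def find_name_collisions(paths):
    """Map each duplicated CEL name to the input paths that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[base_cel_name(path)].append(path)
    return {name: ps for name, ps in groups.items() if len(ps) > 1}


collisions = find_name_collisions(["run1/a.CEL", "run2/a.CEL.gz", "run1/b.CEL"])
```

An empty result means no two inputs share a name under these assumptions.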

Output Files

  1. <output.file.base>.gct
    The expression dataset in GCT format.
  2. <output.file.base>.cls
    A categorical label CLS file, listing the categories of all the samples in the dataset as determined by the input CLM file.
  3. <output.file.base>.QC.Density_histogram.pdf (or .png or .svg)
    A histogram plot of the density estimates for each sample.  This may be useful for QC purposes.
  4. <output.file.base>.QC.Boxplot.pdf (or .png or .svg)
    A boxplot of the observed intensities for each sample.  This may be useful for QC purposes.
  5. <output.file.base>.QC.[sample name]_MAplot.pdf (or .png or .svg)
    A plot of log ratio vs. average intensity (M vs. A, or MA) for each sample versus a reference array.  The Wikipedia entry on MA plots gives some background.
  6. <output.file.base>.QC.[sample name]_Cel_image.pdf (or .png or .svg)
    A pseudo-image of the array for each sample, based on the observed intensities.  This may be useful for QC purposes.
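
To make the MA plot quantities concrete, the Python sketch below computes M (the log ratio) and A (the average log intensity) for one sample against a reference. It assumes the inputs are already on a log2 scale and builds the reference as the per-probe-set median across samples. How the module chooses its reference array is not stated here, so the median reference is an assumption. The function names are hypothetical.

```python
from statistics import median


def median_reference(samples):
    """Per-probe-set median across samples (assumed reference array)."""
    return [median(values) for values in zip(*samples)]


def ma_values(log2_sample, log2_reference):
    """Return (M, A): log ratio and average log intensity per probe set."""
    m = [s - r for s, r in zip(log2_sample, log2_reference)]
    a = [(s + r) / 2 for s, r in zip(log2_sample, log2_reference)]
    return m, a


# Three samples, two probe sets each, on a log2 scale.
samples = [[8.0, 10.0], [6.0, 10.0], [7.0, 12.0]]
reference = median_reference(samples)
m, a = ma_values(samples[0], reference)
```

Points with M far from zero mark probe sets where the sample differs most from the reference.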

Example Data

[Yet to be posted]

Requirements

Requires R 3.1.3 and a set of R package dependencies from CRAN and Bioconductor.  R 3.1.3 must be installed and configured by the GenePattern administrator before this module can be installed [Instructions yet to be posted.  Will link to an updated version of our Admin Guide on the subject].  The package dependencies will be automatically installed when the module is installed.

Platform Dependencies

Task Type:
Preprocess & Utilities

CPU Type:

Operating System:
any

Language:
R

Version Comments

Version  Release Date  Description
0.14     2015-10-22    Updated to make use of the R package installer.