GP3: GenePix Post-Processing Script
This script performs a number of steps to correct, filter and normalize
raw microarray data from GenePix image analysis software (i.e., GPR files).
Often these steps are done manually or not at all. This can lead to
significant errors associated with ratio measurements in two-color fluorescence-based
cDNA microarray assays. This script automates the procedure to increase
throughput and reduce error. It is also customizable so that analyses
can be tailored to individual needs.
Why do I need to correct, filter and normalize my microarray data?
Spots (i.e. genes on a microarray) may be manually flagged for anomalies,
or may be below or above detection levels in one or both channels of
fluor detection (i.e. Cy3 and Cy5). This can often lead to invalid or
undefined ratios. These spots need to be removed from the analysis to
reduce the noise associated with the data. Valid spots also need to
be corrected to remove the signal intensity associated with background
signal. Furthermore, systematic and experimental biases can exist between
two fluor-labeled cDNA populations being compared in a two-color fluorescence-based
cDNA microarray assay. This results in inaccurate quantitation of relative
differences in gene expression. The error is often associated with:
- differences in the efficiency of incorporation of fluor-labeled
nucleotides into cDNA
- differences in the emission characteristics of the fluors
- differences in RNA loading, RNA quality and sample handling.
Consequently, signal intensity values for each fluor are often normalized,
or transformed, by a correction factor. This mathematical correction attempts
to remove systematic biases in fluor characteristics, as well as correcting
for differences in RNA loading and sample handling such that accurate
ratios can be calculated. A detailed description of the calculation
algorithm implemented by the script is provided on a separate page.
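The filtering and background-correction steps described above can be illustrated with a minimal Python sketch. It is not GP3's Perl code: the function name `correct_spot`, the rule that a negative flag value marks a flagged spot, and the sample numbers are all assumptions made for illustration. The thresholds GP3 actually applies are controlled by its own switches.

```python
import math

def correct_spot(f635, b635, f532, b532, flag=0):
    """Background-correct one spot and return its log2 ratio, or None if invalid.

    Assumption (hypothetical rule): a negative flag marks a flagged spot.
    """
    if flag < 0:
        return None  # flagged spot: excluded from the analysis
    s635 = f635 - b635  # background-corrected signal, 635 channel
    s532 = f532 - b532  # background-corrected signal, 532 channel
    if s635 <= 0 or s532 <= 0:
        return None  # the ratio or its logarithm would be invalid or undefined
    return math.log2(s635 / s532)

print(correct_spot(1100, 100, 600, 100))             # 1000 / 500 -> 1.0
print(correct_spot(90, 100, 600, 100))               # signal below background -> None
print(correct_spot(1100, 100, 600, 100, flag=-100))  # flagged spot -> None
```

Returning `None` for invalid spots keeps them out of later averages instead of letting an undefined ratio propagate.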
System Requirements
The script requires Perl in order to run. ActivePerl for Win32 is
available from ActiveState. The script also requires an additional Perl
module, Statistics-Descriptive, which is available from www.cpan.org.
Download
The script is available for both Win32 and Linux. Select the appropriate
link for your operating system.
NOTE: GP3 was written and tested under Win32.
Changes were made to make it run under Linux, but thorough testing has
not been done to verify this.
Notes
This version of GP3 is known to have issues with various versions of
the GenePix software resulting in a "log0 error" similar to
the following:
cannot take log of 0 at gp3.pl line 458 <INPUT> LINE 28.
This error is thrown under the following circumstances:
- The number of control lines in the GPR file is not 26. GP3 expects
the data header on line 27 and the data to begin on line 28.
- The header titles in your file reference the wavelengths incorrectly
(GP3 expects F532, B532, F635, B635).
To fix the first circumstance: change line 180 (my $control_lines =
26) to the appropriate number.
To fix the second circumstance: change the code to reflect your
wavelengths (use Find/Replace in your text editor), or change
the wavelengths in your GPR file.
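Before editing gp3.pl, it can help to check which of the two circumstances applies to a given file. The following Python sketch is one way to do that and is not part of GP3: the function names, the rule used to recognize the column-title row (a tab-separated line containing both "Block" and "ID"), and the tiny sample file are assumptions for illustration. The relation "control lines = header line number minus 1" is inferred from the description above (26 control lines, header on line 27).

```python
EXPECTED = ("F532", "B532", "F635", "B635")  # wavelengths GP3 expects

def find_header_line(lines):
    """Return the 1-based line number of the column-title row, or None."""
    for number, line in enumerate(lines, start=1):
        fields = [f.strip().strip('"') for f in line.rstrip("\r\n").split("\t")]
        if "Block" in fields and "ID" in fields:
            return number
    return None

def missing_wavelengths(header_line):
    """Return the expected wavelength labels that appear in no column title."""
    return [w for w in EXPECTED if w not in header_line]

# Made-up, heavily abbreviated file: the column titles are on line 3.
sample = [
    "ATF\t1.0\n",
    '"Type=GenePix Results"\n',
    '"Block"\t"Column"\t"Row"\t"Name"\t"ID"\t"F650 Median"\t"B650 Median"'
    '\t"F532 Median"\t"B532 Median"\n',
]
header_at = find_header_line(sample)
print(header_at - 1)                               # candidate $control_lines value: 2
print(missing_wavelengths(sample[header_at - 1]))  # ['F635', 'B635']
```

For a real file, read the lines with `open(path).readlines()` and compare the first result with 26 and the second with an empty list.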
Running the Script
Input File Format
At the moment, the script only processes GPR files generated by GenePix.
In calculating the values for the summary file, the script uses the
value in the ID column of the GPR file to determine the number of replicate
spots. To ensure that the replicate spots are correctly accounted for,
they must have exactly the same ID value.
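The effect of the exact-match requirement can be seen in a short Python sketch. It is only an illustration and assumes plain string equality on the ID value; how GP3 itself compares IDs internally is not documented here, and the IDs below are made up.

```python
from collections import Counter

def count_replicates(ids):
    """Count spots per ID using exact string equality."""
    return Counter(ids)

# Hypothetical ID values: three spellings that look alike but differ.
ids = ["GENE1", "GENE1", "gene1", "GENE1 ", "GENE2"]
counts = count_replicates(ids)
print(counts["GENE1"])  # 2: the lower-case and trailing-space variants are separate IDs
print(len(counts))      # 4 distinct IDs
```

Under this assumption, stray whitespace or inconsistent capitalization in the ID column splits one gene into several groups with fewer replicates each.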
The script is executed from the command line like most Perl scripts.
Typing "gp3" at the command prompt will bring up the help
message for the script that illustrates how to run it. Command-line
switches define how the program is run, and define what file or directory
is to be processed.
-h: help function
-f: single file mode
-d: batch mode; operates on an entire directory and processes only the GPR files in it
-b: user-defined baseline value. Signals that fall below the calculated threshold level are raised to this value (default = calculated threshold value)
-z: z-score normalization flag (default = global normalization)
-p: percentile of mean signal across the array used for the normalization factor (default = 90)
-t: threshold value (default = 3)
-s: suppress the output and summary files
Exactly one of -f or -d must be given. The script will not run if
both switches are set.
The remaining switches are optional. If they are not present, the
script uses the default values.
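The switch rules above (exactly one of -f or -d, everything else optional with defaults) can be modeled with Python's argparse. This is a sketch of the rules as documented, not GP3's own option handling; the `dest` names and the choice of `float` for numeric values are assumptions.

```python
import argparse

def build_parser():
    """Model the documented gp3 switches; -h is supplied by argparse itself."""
    parser = argparse.ArgumentParser(prog="gp3")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-f", dest="file", help="single file mode")
    mode.add_argument("-d", dest="directory", help="batch mode (GPR files only)")
    parser.add_argument("-b", dest="baseline", type=float, default=None,
                        help="baseline value (None = calculated threshold value)")
    parser.add_argument("-z", dest="zscore", action="store_true",
                        help="z-score normalization instead of global")
    parser.add_argument("-p", dest="percentile", type=float, default=90)
    parser.add_argument("-t", dest="threshold", type=float, default=3)
    parser.add_argument("-s", dest="suppress", action="store_true",
                        help="suppress the output and summary files")
    return parser

args = build_parser().parse_args(
    ["-d", r"e:\data\gpr_files", "-z", "-p", "75", "-t2"])
print(args.zscore, args.percentile, args.threshold, args.baseline)
# True 75.0 2.0 None
```

Passing both -f and -d, or neither, makes the parser exit with an error, which mirrors the documented behavior.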
Examples:
To process all files in the directory e:\data\gpr_files:
gp3 -d e:\data\gpr_files -z -p 75 -t2
This processes all GPR files in the directory e:\data\gpr_files using
z-score normalization, with a 75% trimmed mean, a threshold value
of 2, and the default baseline value.
To process a single file called array1.gpr:
gp3 -f e:\data\gpr_files\array1.gpr -b 1000
This processes the file array1.gpr using the default options (global
normalization, 90% trimmed mean and threshold value of 3) with a baseline
value of 1000.
Output Files Generated by the Script
Examples of the files created by the script are included with the
download. The control file is a simple text file, while the output
and summary files are comma-separated-value files that can
be easily viewed using a spreadsheet program such as Microsoft Excel.
Control File
Inputfilename_control.txt - this file describes
and summarizes the results of the experiment. The first 26 lines consist
of the header information from the gpr file.
Output File
Inputfilename_output.txt - this file contains the original
contents of the gpr file appended with flags, log2 corrected signal
intensities, normalized signal intensities, ratios, log2 ratios (for
both all and valid ratios), geometric mean signal intensities of both
channels and others as listed below. Fields 1 to 43 are taken directly
from the GenePix GPR file and are unchanged. Descriptions of these fields
can be found in the GenePix user manual.
| Column | Description | Column | Description |
|---|---|---|---|
| 1 | Block | 29 | Median of Ratios |
| 2 | Column | 30 | Mean of Ratios |
| 3 | Row | 31 | Ratios SD |
| 4 | Name | 32 | Rgn Ratio |
| 5 | ID | 33 | Rgn R² |
| 6 | X | 34 | F Pixels |
| 7 | Y | 35 | B Pixels |
| 8 | Dia | 36 | Sum of Medians |
| 9 | F635 Median | 37 | Sum of Means |
| 10 | F635 Mean | 38 | Log Ratio |
| 11 | F635 SD | 39 | F635 Median - B635 |
| 12 | B635 Median | 40 | F532 Median - B532 |
| 13 | B635 Mean | 41 | F635 Mean - B635 |
| 14 | B635 SD | 42 | F532 Mean - B532 |
| 15 | % > B635+1SD | 43 | Flags |
| 16 | % > B635+2SD | 44 | Signal_threshold_flag_F635 |
| 17 | F635 % Sat. | 45 | Signal_threshold_flag_F532 |
| 18 | F532 Median | 46 | Signal_sat_flag_F636 |
| 19 | F532 Mean | 47 | Signal_sat_flag_F532 |
| 20 | F532 SD | 48 | |
| 21 | B532 Median | 49 | |
| 22 | B532 Mean | 50 | F636 Normalized_valid: Ni1 for valid spots only |
| 23 | B532 SD | 51 | F532 Normalized_valid: Ni2 for valid spots only |
| 24 | % > B532+1SD | 52 | Ratio(F636/F532)_valid: Ri = Ni1 / Ni2 for valid spots only |
| 25 | % > B532+2SD | 53 | Log2Ratio(F636/F532)_valid: Ri' = log2Ri for valid spots only |
| 26 | F532 % Sat | 54 | Mean Signal Intensity: Geometric mean (Gi) of F635 Median and F532 Median |
| 27 | Ratio of Medians | 55 | Log2 Mean Signal Intensity: Log2(Gi) |
| 28 | Ratio of Means | 56 | Percent Signal Intensity: 100 * (Gi / 65535): Percent of maximum |
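The formulas given for columns 52 to 56 can be checked with a short Python sketch. The function and key names are hypothetical and the input numbers are made up; only the formulas themselves (Ri = Ni1 / Ni2, log2 Ri, the geometric mean Gi of the two median signals, log2 Gi, and 100 * Gi / 65535) come from the table above.

```python
import math

MAX_SIGNAL = 65535  # maximum value used in the column 56 formula

def derived_columns(n1, n2, f635_median, f532_median):
    """Compute the column 52-56 values for one valid spot."""
    ratio = n1 / n2                               # Ri = Ni1 / Ni2
    g = math.sqrt(f635_median * f532_median)      # geometric mean Gi
    return {
        "ratio": ratio,
        "log2_ratio": math.log2(ratio),           # Ri' = log2 Ri
        "mean_signal": g,
        "log2_mean_signal": math.log2(g),
        "percent_signal": 100 * g / MAX_SIGNAL,   # percent of maximum
    }

row = derived_columns(n1=2000.0, n2=500.0, f635_median=4096, f532_median=1024)
print(row["ratio"], row["log2_ratio"])              # 4.0 2.0
print(row["mean_signal"], row["log2_mean_signal"])  # 2048.0 11.0
```

The geometric mean is the natural average here because it corresponds to the arithmetic mean of the two log2 intensities.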
Summary File
Inputfilename_summary.txt - this file reports statistical
information related to replicated spots for a gene on a microarray.
The following table provides a description of the columns in the summary
file.
| Column | Description |
|---|---|
| 1 | Identification: ID field from the output file |
| 2 | Name: Name field from the output file |
| 3 | Average F636 Normalized: Average of replicate F636 Normalized values |
| 4 | SD F636 Normalized: Standard deviation of replicate F636 Normalized values |
| 5 | CV% F636 Normalized: Coefficient of variation of replicate F636 Normalized values |
| 6 | Average F532 Normalized: Average of replicate F532 Normalized values |
| 7 | SD F532 Normalized: Standard deviation of replicate F532 Normalized values |
| 8 | CV% F532 Normalized: Coefficient of variation of replicate F532 Normalized values |
| 9 | Average Log2Ratio: Arithmetic average of log2 ratio values |
| 10 | SD Log2Ratio: Standard deviation of log2 ratio values |
| 11 | N: Number of replicate spots (based on the identification) |
| 12 | Ratio(F636/F532): Inverse transformation of average log2Ratio |
| 13 | Mean Signal Intensity: Geometric mean (Gi) of F635 Median and F532 Median |
| 14 | Log2 Mean Signal Intensity: Log2(Gi) |
| 15 | Percent Signal Intensity: 100 * (Gi / 65535): Percent of maximum signal |
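The replicate statistics in the summary file can be sketched in Python as follows. This is an illustration rather than GP3's implementation: the function names are hypothetical, the sample values are made up, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text above does not say which form the script uses.

```python
import statistics

def summarize(values):
    """Return (average, sample SD, CV%) for replicate values."""
    mean = statistics.mean(values)
    # Assumption: sample SD; a single replicate has no spread to measure.
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    cv = 100 * sd / mean if mean else float("nan")
    return mean, sd, cv

def summary_ratio(log2_ratios):
    """Average the log2 ratios, then transform back to a ratio."""
    avg = statistics.mean(log2_ratios)
    return avg, 2 ** avg

# Two replicate spots with ratios 2 and 8, i.e. log2 ratios 1 and 3.
avg, ratio = summary_ratio([1.0, 3.0])
print(avg, ratio)  # 2.0 4.0

mean, sd, cv = summarize([90.0, 110.0])
print(mean)        # 100.0
```

Averaging in log2 space and transforming back gives 4 for ratios of 2 and 8, whereas averaging the raw ratios would give 5; the log-space average treats up- and down-regulation symmetrically.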