GP3: GenePix Post-Processing Script


This script performs a number of steps to correct, filter and normalize raw microarray data from GenePix image analysis software (i.e., GPR files). Often these steps are done manually or not at all, which can lead to significant errors in ratio measurements in two-color fluorescence-based cDNA microarray assays. The script automates the procedure to increase throughput and reduce error. It is also customizable so that analyses can be tailored to individual needs.

Why do I need to correct, filter and normalize my microarray data?


Spots (i.e., genes on a microarray) may be manually flagged for anomalies, or may fall below or above the detection limits in one or both fluor detection channels (i.e., Cy3 and Cy5). Such spots often yield invalid or undefined ratios and need to be removed from the analysis to reduce noise in the data. Valid spots also need to be corrected to remove the contribution of background signal. Furthermore, systematic and experimental biases can exist between the two fluor-labeled cDNA populations being compared in a two-color fluorescence-based cDNA microarray assay, which results in inaccurate quantitation of relative differences in gene expression. The error is often associated with:

  • differences in the efficiency of incorporation of fluor-labeled nucleotides into cDNA
  • differences in the emission characteristics of the fluors
  • and differences in RNA loading, RNA quality and sample handling.
Consequently, signal intensity values for each fluor are often normalized, or transformed, by a correction factor. This mathematical correction attempts to remove systematic biases in fluor characteristics and to correct for differences in RNA loading and sample handling, so that accurate ratios can be calculated. A detailed description of the calculation algorithm implemented by the script is available separately.
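As a rough illustration of the idea, the following Python sketch subtracts background from each channel and then rescales one channel by a single correction factor so that the two channel means agree. This is one common form of global normalization and is not taken from GP3; the function names and the intensity values are hypothetical.

```python
from statistics import mean


def background_correct(foreground, background):
    """Subtract the local background from each spot's foreground signal."""
    return [f - b for f, b in zip(foreground, background)]


def global_normalize(ch1, ch2):
    """Scale ch2 by one correction factor so that its mean equals the mean of ch1."""
    factor = mean(ch1) / mean(ch2)
    return ch1, [value * factor for value in ch2], factor


# Hypothetical foreground and background medians for four spots.
f635 = [1200.0, 5400.0, 300.0, 9800.0]
b635 = [100.0, 120.0, 90.0, 110.0]
f532 = [700.0, 2900.0, 250.0, 5100.0]
b532 = [80.0, 100.0, 70.0, 90.0]

c635 = background_correct(f635, b635)
c532 = background_correct(f532, b532)
n635, n532, factor = global_normalize(c635, c532)
print(f"correction factor applied to the 532 channel: {factor:.3f}")
```

After scaling, a spot whose two normalized values are equal has a ratio of 1, so remaining differences are less likely to reflect an overall imbalance between the channels.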

System Requirements


The script requires Perl in order to run. ActivePerl for Win32 is available from ActiveState. The script also requires an additional Perl module, Statistics-Descriptive, which is available from www.cpan.org.

Download


The script is available for both Win32 and Linux. Select the appropriate link for your operating system.

NOTE: GP3 was written and tested under Win32. Changes were made to make it run under Linux, but thorough testing has not been done to verify this.

Notes


This version of GP3 is known to have issues with various versions of the GenePix software resulting in a "log0 error" similar to the following:

cannot take log of 0 at gp3.pl line 458 <INPUT> LINE 28.

This error is thrown under the following circumstances:

  1. The number of control lines in the GPR file is not 26. GP3 assumes 26 control lines, so that the column header is on line 27 and the data begin on line 28.
  2. The header titles in your file do not reference the wavelengths that GP3 expects (F532, B532, F635, B635).

To fix the first circumstance: change line 180 (my $control_lines = 26) to the appropriate number.

To fix the second circumstance: change the code to reflect your wavelengths (use Find/Replace in a text editor), or change the wavelengths in your GPR file.
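Before editing the script, it can help to check which of the two circumstances applies. The Python sketch below is not part of GP3. It assumes that the column-title line is the first line whose tab-separated fields include "Block", "Column" and "Row", counts the lines that precede it, and reports which of the expected wavelength prefixes are absent. The function name and the sample lines are hypothetical.

```python
EXPECTED_PREFIXES = ("F532", "B532", "F635", "B635")


def inspect_gpr_header(lines):
    """Return (number of lines before the column titles, missing prefixes)."""
    for index, line in enumerate(lines):
        # GPR column titles are tab-separated and usually quoted.
        titles = [field.strip().strip('"') for field in line.split("\t")]
        if {"Block", "Column", "Row"} <= set(titles):
            missing = [prefix for prefix in EXPECTED_PREFIXES
                       if not any(title.startswith(prefix) for title in titles)]
            return index, missing
    raise ValueError("no column-title line found")


# Hypothetical, heavily shortened file with a non-standard second wavelength.
sample = [
    "ATF\t1.0\n",
    "2\t7\n",
    '"Type=GenePix Results"\n',
    '"Comment=hypothetical header record"\n',
    '"Block"\t"Column"\t"Row"\t"F635 Median"\t"B635 Median"\t"F550 Median"\t"B550 Median"\n',
    "1\t1\t1\t1200\t100\n",
]
control_lines, missing = inspect_gpr_header(sample)
print(control_lines, missing)
```

For a real file, an open file handle can be passed in place of the list. A first value other than 26 points to the first circumstance, and a non-empty second value points to the second.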

Running the Script


Input File Format
At the moment, the script only processes GPR files generated by GenePix. In calculating the values for the summary file, the script uses the value in the ID column of the GPR file to determine the number of replicate spots. To ensure that the replicate spots are correctly accounted for, they must have exactly the same ID value.

The script is executed from the command line like most Perl scripts. Typing "gp3" at the command prompt will bring up the help message for the script that illustrates how to run it. Command-line switches define how the program is run, and define what file or directory is to be processed.

-h: help function
-f: single file mode
-d: batch mode; operates on an entire directory and processes only GPR files in the directory
-b: user defined baseline value. Signals that fall below the calculated threshold level will be increased to this value (default = calculated threshold value)
-z: z-score normalization flag (default = global normalization)
-p: percentile of mean signal across array for normalization factor (default = 90)
-t: threshold value (default = 3)
-s: suppress the output and summary files

Exactly one of -f or -d must be given; the script will not run if both switches are set.
The remaining switches are optional. If they are not present, the script uses the default values.
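The rules for the switches can be summarized in a short Python sketch using argparse. It only mirrors the interface described above and says nothing about how GP3 itself parses its arguments; the program name gp3-sketch and the use of None to mean "use the calculated threshold value" are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented GP3 switches (illustration only)."""
    parser = argparse.ArgumentParser(prog="gp3-sketch")
    # Exactly one of -f or -d must be supplied.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-f", metavar="FILE", help="single file mode")
    mode.add_argument("-d", metavar="DIR", help="batch mode: all GPR files in DIR")
    parser.add_argument("-b", type=float, default=None,
                        help="baseline value (None here stands for the calculated threshold)")
    parser.add_argument("-z", action="store_true", help="z-score normalization")
    parser.add_argument("-p", type=float, default=90, help="percentile (default 90)")
    parser.add_argument("-t", type=float, default=3, help="threshold value (default 3)")
    parser.add_argument("-s", action="store_true", help="suppress output and summary files")
    return parser


args = build_parser().parse_args(["-d", r"e:\data\gpr_files", "-z", "-p", "75", "-t2"])
print(args)
```

Supplying both -f and -d, or neither, makes this parser exit with a usage error, which corresponds to the rule stated above.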

Examples:

To process all files in the directory e:\data\gpr_files:

gp3 -d e:\data\gpr_files -z -p 75 -t2

This processes all GPR files in the directory e:\data\gpr_files using a z-score normalization, with a 75% trimmed mean, a threshold value of 2, and the default baseline value.

To process a single file called array1.gpr:

gp3 -f e:\data\gpr_files\array1.gpr -b 1000

This processes the file array1.gpr using the default options (global normalization, 90% trimmed mean and threshold value of 3) with a baseline value of 1000.

Output Files Generated by the Script


Examples of the files created by the script are included with the download. The control file is a plain text file, while the output and summary files are comma-separated-value files that can be viewed easily in a spreadsheet program such as Microsoft Excel.

Control File
Inputfilename_control.txt - this file describes and summarizes the results of the experiment. The first 26 lines consist of the header information from the gpr file.

Output File
Inputfilename_output.txt - this file contains the original contents of the GPR file, to which the script appends flags, log2 corrected signal intensities, normalized signal intensities, ratios, log2 ratios (for both all and valid ratios), geometric mean signal intensities of both channels and other values as listed below. Fields 1 to 43 are taken directly from the GenePix GPR file and are unchanged. Descriptions for these fields can be found in the GenePix user manual.

Column  Description         Column  Description
1       Block               29      Median of Ratios
2       Column              30      Mean of Ratios
3       Row                 31      Ratios SD
4       Name                32      Rgn Ratio
5       ID                  33      Rgn R²
6       X                   34      F Pixels
7       Y                   35      B Pixels
8       Dia                 36      Sum of Medians
9       F635 Median         37      Sum of Means
10      F635 Mean           38      Log Ratio
11      F635 SD             39      F635 Median - B635
12      B635 Median         40      F532 Median - B532
13      B635 Mean           41      F635 Mean - B635
14      B635 SD             42      F532 Mean - B532
15      % > B635+1SD        43      Flags
16      % > B635+2SD        44      Signal_threshold_flag_F635
17      F635 % Sat.         45      Signal_threshold_flag_F532
18      F532 Median         46      Signal_sat_flag_F636
19      F532 Mean           47      Signal_sat_flag_F532
20      F532 SD
21      B532 Median
22      B532 Mean           50      F636 Normalized_valid: Ni1 for valid spots only
23      B532 SD             51      F532 Normalized_valid: Ni2 for valid spots only
24      % > B532+1SD        52      Ratio(F636/F532)_valid: Ri = Ni1 / Ni2 for valid spots only
25      % > B532+2SD        53      Log2Ratio(F636/F532)_valid: Ri' = log2Ri for valid spots only
26      F532 % Sat.         54      Mean Signal Intensity: Geometric mean (Gi) of F635 Median and F532 Median
27      Ratio of Medians    55      Log2 Mean Signal Intensity: Log2(Gi)
28      Ratio of Means      56      Percent Signal Intensity: 100 * (Gi / 65535): Percent of maximum signal
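The formulas given for fields 52 to 56 can be worked through for a single valid spot. The Python sketch below applies them as written in the table. It is not taken from GP3; the function name and the input values are hypothetical.

```python
import math

MAX_SIGNAL = 65535  # maximum signal used in the Percent Signal Intensity formula


def derived_values(n1, n2, f635_median, f532_median):
    """Compute the ratio and signal-intensity fields for one valid spot."""
    if min(n1, n2, f635_median, f532_median) <= 0:
        # A logarithm is undefined for zero or negative values.
        raise ValueError("all inputs must be positive")
    ratio = n1 / n2                             # Ri = Ni1 / Ni2
    gi = math.sqrt(f635_median * f532_median)   # geometric mean Gi
    return {
        "ratio": ratio,
        "log2_ratio": math.log2(ratio),         # Ri' = log2(Ri)
        "mean_signal": gi,
        "log2_mean_signal": math.log2(gi),      # Log2(Gi)
        "percent_signal": 100 * gi / MAX_SIGNAL,
    }


# Hypothetical normalized intensities and raw medians for one spot.
values = derived_values(4000.0, 1000.0, 4096.0, 1024.0)
print(values)
```

With these inputs the ratio is 4, its log2 is 2, and Gi is 2048, which is a little over 3% of the maximum signal.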

Summary File
Inputfilename_summary.txt - this file reports statistical information related to replicated spots for a gene on a microarray. The following table provides a description of the columns in the summary file.


Column  Description
1       Identification: ID field from the output file
2       Name: Name field from the output file
3       Average F636 Normalized: Average of replicate F636 Normalized values
4       SD F636 Normalized: Standard deviation of replicate F636 Normalized values
5       CV% F636 Normalized: Coefficient of variation of replicate F636 Normalized values
6       Average F532 Normalized: Average of replicate F532 Normalized values
7       SD F532 Normalized: Standard deviation of replicate F532 Normalized values
8       CV% F532 Normalized: Coefficient of variation of replicate F532 Normalized values
9       Average Log2Ratio: Arithmetic average of log2 ratio values
10      SD Log2Ratio: Standard deviation of log2 ratio values
11      N: Number of replicate spots (based on the identification)
12      Ratio(F636/F532): Inverse transformation of average log2Ratio
13      Mean Signal Intensity: Geometric mean (Gi) of F635 Median and F532 Median
14      Log2 Mean Signal Intensity: Log2(Gi)
15      Percent Signal Intensity: 100 * (Gi / 65535): Percent of maximum signal
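The replicate statistics in columns 3 to 12 can be illustrated with a short Python sketch that groups spots by their ID value. It is not GP3's code. It assumes the sample standard deviation (n - 1 denominator), reports a standard deviation of 0 for an ID with a single spot, and uses hypothetical function names and data.

```python
from collections import defaultdict
from statistics import mean, stdev


def channel_stats(values):
    """Return (average, sample SD, CV%) for one channel's replicate values."""
    average = mean(values)
    sd = stdev(values) if len(values) > 1 else 0.0
    return average, sd, 100 * sd / average


def summarize(spots):
    """Group (id, n1, n2, log2_ratio) tuples by ID and summarize the replicates."""
    groups = defaultdict(list)
    for spot_id, n1, n2, log2_ratio in spots:
        groups[spot_id].append((n1, n2, log2_ratio))
    summary = {}
    for spot_id, rows in groups.items():
        n1_values, n2_values, log2_values = (list(col) for col in zip(*rows))
        average_log2 = mean(log2_values)
        summary[spot_id] = {
            "n1": channel_stats(n1_values),
            "n2": channel_stats(n2_values),
            "average_log2_ratio": average_log2,
            "sd_log2_ratio": stdev(log2_values) if len(rows) > 1 else 0.0,
            "n": len(rows),
            # Inverse transformation of the average log2 ratio.
            "ratio": 2 ** average_log2,
        }
    return summary


# Hypothetical normalized values: two replicates of geneA, one spot of geneB.
spots = [
    ("geneA", 4000.0, 1000.0, 2.0),
    ("geneA", 2000.0, 1000.0, 1.0),
    ("geneB", 500.0, 500.0, 0.0),
]
summary = summarize(spots)
print(summary["geneA"])
```

Because replicates are matched on the ID value alone, two spots are combined only if their IDs are exactly the same, as noted in the input file format section.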