Biology Education
The
Aspire Q&A program includes many biology-related quizzes:
Aspire Q&A: Create
and use quizzes to quickly test your knowledge of a subject. Includes quizzes
that cover physics, chemistry and biology. Or, create your own with the included
easy-to-use quiz editor. Click
here to get it.
Here are instructions on performing microarray analysis (this is a high-level tutorial intended for graduate students in the biological sciences):
Step-by-Step Tutorial for analyzing Affymetrix
expression microarrays
in the R statistical environment:
This tutorial is intended for Microsoft Windows users, but
will work on any platform with small modifications (installing packages is
not done using ZIP files on other platforms):
Installing the R statistical
environment:
Go to http://www.r-project.org
and download the latest version of R. This tutorial is based on version 2.4.0.
On the R website select the ‘Cran Mirror’s link
and select the download site closest to you. Select the version of R for your
operating system. Select the ‘base’ link and then select and run the install
executable (something like R-2.4.0-win32.exe) and complete the installation
process.
Installing Bioconductor:
To install Bioconductor, which is a set of extra
packages required for microarray analysis, follow
these steps:
source("http://www.bioconductor.org/getBioC.R")
getBioC()
If you need a microarray data set to practice on:
You can download microarray data from this site:
http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi
In the section marked ‘RNA interference model of RPS19 deficiency…’ download
the file ‘DBA_cel_files.zip’ and unzip the files
(Using a utility such as WinZip). For this tutorial the unzipped files are
located in the 'c:\MicroarrayData\' directory.
Bioconductor needs to know how to group your
microarray files by phenotype, so you need to create
a text file with information on each CEL file and the phenotype group to which
it belongs. In your data directory, create a text file called ‘Phenotypes.txt’
with this info:
Name FileName
Target
1 CL20030502101AA Control
2 CL20030506102AA Control
3 CL20030502103AA Control
4 CL2004050501AA EpoSCF10Days
5 CL2004050502AA EpoSCF10Days
6 CL2004050503AA EpoSCF10Days
In
this tutorial we are only going to use six of the CEL files in the data set,
for simplicity sake. Here I have the control erythrocytes at time zero in
one group and those treated with erythropoietin and stem cell factor for 10
days in the second group.
Inside the R environment type in these commands:
library(affy)
This sets the working/default directory (strings are in the C-programming style, so you need two backslashes to denote a single backslash):
setwd("c:\\MicroarrayData\\")
This reads in the phenotype data (the text file we created above):
pd <- read.AnnotatedDataFrame( "RNATargets8.txt", header=TRUE, sep="", row.names=1 )
This reads in the CEL files and links them to our phenotype data (type in all on one line):
rawAffyData
<- ReadAffy("CL20030502101AA.CEL", "CL20030506102AA.CEL",
"CL20030502103AA.CEL",
"CL2004050501AA.CEL",
"CL2004050502AA.CEL",
"CL2004050503AA.CEL",phenoData=pd,verbose=TRUE)
Each CEL file contains cell expression data for one expression
microarray chip. This file contains intensity levels
that are matched to each oligo probe on the chip.
Viewing your CEL data:
To view a picture of the cel file data type
in this command (this will display the first CEL file’s data in the data set):
image(rawAffyData[,1])
Viewing the CEL image may uncover defects in the chip or
processing of the chip if there obvious smears or large patterns. These chips,
though, look fine.
To plot one CEL file’s data against another use this command:
plot(log(intensity(rawAffyData[,2])),log(intensity(rawAffyData[,3])))
This plots the log-transformed intensity levels of each oligo for both chips (in this case CL20030506102AA and chip CL20030502103AA) against
each other. If the intensity levels are exactly identical on both chips you’ll
see a straight, thin diagonal line. The more the chips deviate the thicker and more dispersed the plotted data will
get.
This command will plot every paired combination of the specified chips against each other:
mva.pairs(pm(rawAffyData)[,c(1,2,3)])
This boxplot will show the ranges of your data across chips:
boxplot(rawAffyData)
Analyzing your data
using the RMA method:
Type in this on the command line:
eset <- expresso(rawAffyData, bgcorrect.method="rma", normalize.method="quantiles", pmcorrect.method="pmonly", summary.method="medianpolish")
This uses the 'expresso' function
which allows you to perform several analysis operations at once. It allows
you to mix different background correction, normalization, perfect match correction,
and summarization methods. Here we’re using the basic RMA method, which uses
‘rma’ background correction, ‘quantiles’ normalization, ‘pmonly’
perfect match correction, and ‘medianpolish’ summarization
functions.
The eset variable now contains the analysis results. To extract the result to a tab-delimited text file use these commands:
sink("ExpressionVals.txt")
write.table(exprs(eset),sep="\t")
sink()
You can now open/import ExpressionVals.txt into a spreadsheet
program (like Excel) to view it. Note that the column headers that identify
the chips are shifted over, so you will have to manually shift them over one
to the right. The row headers are the names of the oligo
probes. The values are the normalized expression levels for the specified
oligo for each of the microarray
chips.
To perform a t-test on the data (to determine which oligos had a significant change between the control group and the treated group) you can use these commands:
First, create a t-test function by copying and pasting this entire block into R (*note: this is a non-paired t-test):
myTTest <- function(x) {
# Split the row of expression values based
on which phenotype category they are in.
# note: This assumes there are exactly two
phenotypes, no more, no less.
valsByPhenotype <- split(x, eset$Target)
# note: Assuming the variances are equal
in the t-test.
# Do a t-test comparison between the expression
values of the two phenotypes.
result <- t.test(x=valsByPhenotype[[1]],y=valsByPhenotype[[2]],var.equal=TRUE)$p.value
# Return the p-value
result.
return(result)
}
Then you can apply the t-test function to your entire data set like this (this will apply the myTTest function to each row of expression values):
pValues <- esApply(eset,1,myTTest)
Then write out the t-test values to a text file:
sink("TTestVals.txt")
write.table(pValues,sep="\t")
sink()
The 'TTestVals.txt' file will be in the same order as the
'ExpressionVals.txt' file created earlier.
Performing Global
False Discovery Rate Analysis:
Global false discovery rate analysis will try to estimate the number of
false positives in your analysis results. One way to perform this using Bioconductor
is to use the 'Twilight' R module. To install this module, first download
the zip file by going to the bioconductor site ( http://bioconductor.org ) and select 'Browse
Packages' and then 'Software'. Find ‘twilight’ in the list and download the
latest install ZIP file. Then select the 'Packages'-'Install package(s) from
local zip files' option from the menu.
Once you’ve installed the package you can load the library:
library(twilight)
To perform the global false discovery rate analysis, first create the phenotype pattern in an array like this (indicating the first 3 chips are in the first phenotype group and the next 3 chips are in the second phenotype group):
id <- c(1,1,1,2,2,2)
Then perform the analysis – this command uses the top 5% as a cutoff (*note: this has a 'paired=TRUE' option if your data is paired):
pval <- twilight.pval(exprs(eset),id,quant.ci=0.95)
You can then just type in the variable name pval to see a summary, including the ‘Estimated percentage of non-induced genes’:
pval
Then export the p-value results to a file:
sink("fdrResults.txt")
write.table(pval$result,sep="\t")
sink()