Friday, March 22, 2013

Great Science : It’s All About the Story

During a recent meeting I had the pleasure of talking with Iain Patten, a professional science writer in Europe.  Iain travels throughout Europe teaching scientists how to write high-impact papers.  We were very like-minded about the scientific process.  I wanted to share some of the information from our discussion here.
 ----------------------------------------------------------
Great, high impact science papers tell an important new story with broad implications.  The story involves effort and planning from day one.  Before you start experiments, read and ask big questions.
·      What are the potential stories that could be told depending on the answers to your questions?
·      What would be new and exciting about those answers and where could they lead experimentally?
·      What are the different ways of telling the story?  Which ones are likely to be exciting and which are not?

You should have a clear idea of the potential stories when you start your project.  Planning, drafting and making figures with real data and imagined data must be a weekly activity.  DO NOT gather data mindlessly and then try to write a paper with everything in year 5.  Most people do this to some degree.

·      Make pretend figures of the result(s) that would fit with your storyline and see what they might look like if they were really convincing.  This way you won’t forget controls, you will have a clearer sense of the statistics you will use, and you will think through the potential outcomes. 
·      Constantly draft ‘pretend’ figures of results describing alternative outcomes and think about where they might take you.  This way you know where you could be headed and you will be prepared.  This will help alleviate the pressure to make the experimental result fit your hypothesis.
·      Constantly reevaluating your story will reduce wasted effort/experiments, help you maintain focus and set goals, and prevent you from overlooking holes in your story or failing to complete experiments.
·      Remember that the story does not have to fit the chronological order of the experiments.  Rearrange the work to tell the story.
·      Plan, draft, make figures, identify the best experiments and storylines, do the experiments…repeat.

High-impact scientists are journalists who gather evidence and assemble it into a compelling story that changes how people think.  You must work on your story constantly.  It will light the path in front of you and the paper will be written as you go.

·      When putting the final paper together, all that matters is the data that tells the most compelling, clear and well-supported story.  You must be prepared to leave things out and do more work that further supports the best story.  This stuff happens near the end and it is the toughest to accept when you are close to finishing.
·      Your conclusions must be supported by the data.  Overstatement is the kiss of death with editors.
·      As provided to me by an editor from Nature Medicine, a great story often has the following components:
o   Novel, surprising, entertaining and broad implications/impact
o   Strong mechanistic insights
o   In vivo relevance
o   Functional manipulations
o   Necessity and sufficiency
o   Elegance
·      Be clear about the core message of the paper.  If you can’t state the core message in 10 words or fewer then you haven’t found it.
o   Each subheading and associated section of the paper builds logically toward the core message
o   Each paragraph of a section carries one message and transitions logically into the next, and all of them work to support the core message of the section.
o   Plan – draft – plan – draft…and repeat to get an elegant and clear message and structure for the story.
·      Beware of stories that just connect existing dots – the impact and advance tends to be small.

Think carefully about the journal that fits your story best.  A preliminary inquiry can save time.  You must clearly (and without exaggeration) explain how your article fits with the journal’s scope and what the core message is and why it is novel and important.
o   The best reviewers are often experts outside of your field.  They can judge the technical aspects of the work and assess the impact without the internal biases of the field.
o   Editors recommend looking at the journal editorial board for good reviewers to suggest.  These people have a reputation for solid, fair reviews.  Exclude competitors.

Finally, what are you reading for?  It is good to be an expert, but what you are really looking for is the following:
o   New and exciting plot lines for your story.
o   Gems of information that help you support and tell your story.
o   New and exciting plot lines for future stories (untold stories in waiting).
o   New and elegant techniques and approaches that will help you tell a better story.
o   Literature that will confound your story and contradict your interpretations/results.
o   Evidence that your story is novel and high impact.

Good luck!!!  Have lots of ideas and chisel out the best stuff.

Sunday, February 17, 2013

The Epigenome Cometh

The factors that contribute to the development of common diseases have been challenging to define.  Epigenetic mechanisms may play a role, and the field is hopeful that epigenome-wide association studies (EWAS) will yield new insights.  However, EWAS face challenges that genome-wide association studies do not.  First, the epigenome has a dizzying array of components involving different forms of methylated DNA, numerous histone modifications and various non-coding RNAs.  Second, these components assemble in a highly cell-type-specific manner.  Finally, some elements of the epigenome change in response to disease, making it challenging to find epigenetic signatures with a causal role.  Nonetheless, the first signs that EWAS have potential are upon us.

            A recent study by Liu and colleagues undertook an EWAS of rheumatoid arthritis (RA) to uncover DNA methylation changes that interact with genetic factors to mediate disease risk.  The authors note that RA is an ideal test case for EWAS because the cell types involved (leukocytes) are well defined and easily isolated.  In addition, the disease state can be ascertained by measuring anti-citrullinated protein (CP) antibodies.  The authors performed a genome-wide DNA methylation analysis of whole blood from 354 RA patients and 335 healthy controls for which genome-wide SNP and CP antibody data were also available.  They first corrected for cellular heterogeneity in their blood samples by normalizing the data against available DNA methylation signatures for the major blood cell types.  Second, they used a clever series of conditional correlation analyses involving genotype, methylation and phenotype data to filter out differentially methylated positions (DMPs) that are not likely to be causally related to RA.  Remarkably, this revealed significant associations between a set of SNPs and DMPs located in the MHC gene cluster, which has previously been linked to RA.  In a final step, the authors used a causal inference test to define 9 DMPs that mediate the genetic risk for RA through interactions with 264 SNPs in the MHC region and one SNP-DMP pair outside of the MHC region.

            This study not only reveals the importance of understanding the relationships between genetic and epigenetic factors in common diseases, but also establishes a clear methodology to overcome many of the issues inherent to EWAS.



Liu et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nature Biotechnology, published online January 2013.

Thursday, January 24, 2013

Retrieve SNPs in Promoters and Flanking Sequence for Many Genes

This post presents R code to retrieve SNPs in promoters for a list of genes.  You provide your list of genes in the "genes" variable, and the code uses biomaRt (Mus musculus gene dataset) to get the transcriptional start site (TSS) of each transcript of those genes.  It then uses the TSS information to search each promoter region (defined here as [-1000, +200 bp] relative to the TSS) for SNPs, again through biomaRt.  My code looks for SNPs distinguishing the C57BL/6J and CAST/EiJ mouse strains, but this can easily be altered in the getSnps function by changing "cast_eij" (including "sp$cast_eij") to another strain name (use listAttributes(snpmart) in biomaRt to see the options).
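
The promoter window described above depends on which strand the transcript is on.  As a minimal sketch (it runs offline and does not call biomaRt), the two helper functions below are hypothetical names introduced only for illustration, and the coordinates are made up.  The sketch assumes Ensembl-style coordinates, where start <= end on both strands and strand is coded as 1 or -1.

```r
# Hypothetical helper (not part of biomaRt): pick the TSS from
# Ensembl-style coordinates, where start <= end and strand is 1 or -1.
tss_from_transcript <- function(tx_start, tx_end, strand) {
    ifelse(strand == 1, tx_start, tx_end)
}

# Hypothetical helper: genomic start/end of the [-upstream, +downstream]
# window around the TSS. On the minus strand, "upstream" lies at higher
# genomic coordinates, so the offsets swap sides.
promoter_window <- function(tss, strand, upstream = 1000, downstream = 200) {
    data.frame(
        start = ifelse(strand == 1, tss - upstream, tss - downstream),
        end   = ifelse(strand == 1, tss + downstream, tss + upstream)
    )
}

# Made-up coordinates for illustration only
tx <- data.frame(tx_start = c(5000, 20000),
                 tx_end   = c(9000, 26000),
                 strand   = c(1, -1))
tx$tss <- tss_from_transcript(tx$tx_start, tx$tx_end, tx$strand)
win <- promoter_window(tx$tss, tx$strand)
print(cbind(tx, win))
```

For the plus-strand row the window runs from 4000 to 5200; for the minus-strand row it runs from 25800 to 27000.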

You can view the promoters found to have SNPs in the "dataSnps" variable.  Select the transcript SNP set you are interested in and run the final "getflank" function to get the surrounding sequence using the mouse genome data provided in the BSgenome library.  The code retrieves 100 bp of flanking sequence on each side of each SNP site; this can be adjusted with the "offset" variable in the "getflank" function.  The "final" output is given row names and written to a tab-delimited file that opens easily in Excel.
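
The flanking-sequence step can be sketched on a toy string without loading BSgenome.  The function name flank_snp and the sequence below are made up for illustration; substr() stands in here for the genome lookup.

```r
# Hypothetical helper: bracket a SNP between its flanking bases.
# 'sequence' is a plain character string standing in for a chromosome.
flank_snp <- function(sequence, position, ref, alt, offset = 100) {
    left_start <- max(1, position - offset)   # do not run off the 5' end
    left  <- substr(sequence, left_start, position - 1)
    right <- substr(sequence, position + 1, position + offset)
    paste0(left, "[", ref, " > ", alt, "]", right)
}

toy <- "ACGTACGTAC"   # made-up 10 bp sequence; position 5 is an A
out <- flank_snp(toy, position = 5, ref = "A", alt = "G", offset = 3)
print(out)            # [1] "CGT[A > G]CGT"
```

The SNP base itself is replaced by the bracketed alleles, so the output holds offset bases on each side whenever the position is far enough from the ends of the sequence.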

This is a quick way to get SNP data in promoter regions for genes of interest.


library(biomaRt)
library(BSgenome.Mmusculus.UCSC.mm10)
#setwd("")  #fill in the directory for the output file, then uncomment

#######################INPUT GENES OF INTEREST####
#use official MGI symbols (e.g. Igf2r, Rasgrf1)
genes = c("Igf2","H19", "Igf2r", "Rasgrf1", "Magel2")


#######################GET TSS#############################
#get transcriptional start sites for all genes of interest

ensembl = useMart("ensembl")
ensembl = useDataset("mmusculus_gene_ensembl", mart=ensembl)
#Ensembl reports transcript_start <= transcript_end on both strands,
#so the strand attribute is needed to tell which end is the TSS
tss <- getBM(attributes=c('ensembl_gene_id', 'ensembl_transcript_id', 'chromosome_name','transcript_start', 'transcript_end','mgi_symbol', 'strand'), filters = c("mgi_symbol"), values=genes, mart=ensembl)

########################GET SNPS IN PROMOTER###########################
#get snps in promoter region : promoter region is defined as [-1000,+200] relative to transcriptional start site

snpmart = useMart("snp")
snpmart = useDataset("mmusculus_snp", mart=snpmart)

getSnps <- function(x){
    #x is one row of tss, coerced to a character vector by apply()
    id = x[1]
    txid = x[2]
    chr = trimws(x[3]) #apply() can pad values with spaces
    name = x[6]
    strand = as.numeric(x[7])
    if ( strand == -1 ) {
        #minus strand: TSS is transcript_end, upstream is at higher coordinates
        txstart = as.numeric(x[5])
        from = txstart - 200
        to = txstart + 1000
    } else {
        #plus strand: TSS is transcript_start
        txstart = as.numeric(x[4])
        from = txstart - 1000
        to = txstart + 200
    }
    getBM(attributes=c('refsnp_id', 'chr_name', 'chrom_start', 'allele', 'c57bl_6j', 'cast_eij'), filters=c("chr_name", "chrom_start", "chrom_end"), values=list( chr, from, to ), mart = snpmart)->sp
    #keep SNPs with a called base in CAST/EiJ
    sp[sp$cast_eij %in% c("A","T","G","C"), ]->dataSnps
    if ( nrow(dataSnps) > 0 ) {
        #txstart holds the TSS position used for the window
        cbind(id,txid,name,dataSnps, txstart)->results
        return(results)
    }
}

snps<-apply(tss, 1, getSnps)
dataSnps <- snps[!sapply(snps, is.null)]#All SNP Info with NULLS removed

###########################GET FLANKING SEQUENCE FOR EACH SNP###########################
#Get sequence for snps

dataSnps #view output 

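dataSnps[[19]] -> d #select SNP set for transcript of interest (19 is an example index)

#SNP coordinates must be on the same assembly as the BSgenome package (mm10 = GRCm38)
getflank <- function(x) {
    #x is one row of d: id, txid, name, refsnp_id, chr_name, chrom_start, allele, c57bl_6j, cast_eij, txstart
    id = x[1]
    txid = x[2]
    name = x[3]
    position = as.numeric(x[6])
    alleles = paste("[",x[7]," > ",x[9],"]", sep="") #dbSNP allele string > CAST/EiJ allele
    chr = paste("chr", trimws(x[5]), sep="") #UCSC-style chromosome name
    txstart = x[10]
    offset = 100 #bp of flanking sequence on each side
    leftflank  <- as.character(getSeq(BSgenome.Mmusculus.UCSC.mm10,chr,position-offset,position-1))
    rightflank <- as.character(getSeq(BSgenome.Mmusculus.UCSC.mm10,chr,position+1,position+offset))
    paste(leftflank,alleles,rightflank,sep="")->seq
    cbind(id, txid, name, txstart, position, alleles, chr, seq)->out
    return(out)
}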


final <- apply(d, 1, getflank)
r<- c("id","txid", "name", "txstart", "snp position", "Ref > Alt", "chr", "Seq")
rownames(final) <- r
write.table(final, file = "SNPSandFlankingSEQinPROMOTERS_Gene.txt", sep="\t")