Friday, March 22, 2013

Great Science: It’s All About the Story

During a recent meeting I had the pleasure of talking with Iain Patten, a professional science writer in Europe.  Iain travels throughout Europe teaching scientists how to write high-impact papers.  We were very like-minded about the scientific process, and I want to share some of the ideas from our discussion here.
Great, high-impact science papers tell an important new story with broad implications.  The story involves effort and planning from day one.  Before you start experiments, read and ask big questions.
·      What are the potential stories that could be told depending on the answers to your questions?
·      What would be new and exciting about those answers and where could they lead experimentally?
·      What are the different ways of telling the story?  Which ones are likely to be exciting and which are not?

You should have a clear idea of the potential stories when you start your project.  Planning, drafting and making figures with real data and imagined data must be a weekly activity.  DO NOT gather data mindlessly and then try to write a paper with everything in year 5.  Most people do this to some degree.

·      Make pretend figures of the result(s) that would fit with your storyline and see what they might look like if they were really convincing.  This way you won’t forget controls, you will have a clearer sense of the statistics you will use, and you will think through the potential outcomes.
·      Constantly draft ‘pretend’ figures of results describing alternative outcomes and think about where they might take you.  This way you know where you could be headed and you will be prepared.  This will help alleviate the pressure to make the experimental result fit your hypothesis.
·      Constantly reevaluating your story will reduce wasted effort and experiments, help you maintain focus and set goals, and prevent you from overlooking holes in your story or failing to complete experiments.
·      Remember that the story does not have to fit the chronological order of the experiments.  Rearrange the work to tell the story.
·      Plan, draft, make figures, see the best experiments and storylines, do the experiments…repeat.

High-impact scientists are journalists who gather evidence and assemble it into a compelling story that changes how people think.  You must work on your story constantly.  It will light the path in front of you and the paper will be written as you go.

·      When putting the final paper together, all that matters is the data that tells the most compelling, clear and well-supported story.  You must be prepared to leave things out and do more work that further supports the best story.  This stuff happens near the end and it is the toughest to accept when you are close to finishing.
·      Your conclusions must be supported by the data.  Overstatement is the kiss of death with editors.
·      As provided to me by an editor from Nature Medicine, a great story often has the following components:
o   Novel, surprising, entertaining and broad implications/impact
o   Strong mechanistic insights
o   In vivo relevance
o   Functional manipulations
o   Necessity and sufficiency
o   Elegance
·      Be clear about the core message of the paper.  If you can’t state the core message in 10 words or fewer, then you haven’t found the core message.
o   Each subheading and associated section of the paper builds logically toward the core message.
o   Each paragraph of a section carries one message and leads logically into the next paragraph, and all of them work to support the core message of the section.
o   Plan – draft – plan – draft…and repeat to get an elegant and clear message and structure for the story.
·      Beware of stories that just connect existing dots – the impact and advance tends to be small.

Think carefully about the journal that fits your story best.  A preliminary inquiry can save time.  You must clearly (and without exaggeration) explain how your article fits with the journal’s scope and what the core message is and why it is novel and important.
o   The best reviewers are often experts outside of your field.  They can judge the technical aspects of the work and assess the impact without the internal biases of the field.
o   Editors recommend looking at the journal editorial board for good reviewers to suggest.  These people have a reputation for solid, fair reviews.  Exclude competitors.

Finally, what are you reading for?  It is good to be an expert, but what you are really looking for is the following:
o   New and exciting plot lines for your story.
o   Gems of information that help you support and tell your story.
o   New and exciting plot lines for future stories (untold stories in waiting).
o   New and elegant techniques and approaches that will help you tell a better story.
o   Literature that will confound your story and contradict your interpretations/results.
o   Evidence that your story is novel and high impact.

Good luck!!!  Have lots of ideas and chisel out the best stuff.

Sunday, February 17, 2013

The Epigenome Cometh

The factors that contribute to the development of common diseases have been challenging to define.  Epigenetic mechanisms may play a role, and the field is hopeful that epigenome-wide association studies (EWAS) will provide new insights.  However, EWAS face challenges that genome-wide association studies (GWAS) do not.  First, the epigenome has a dizzying array of components involving different forms of methylated DNA, numerous histone modifications and various non-coding RNAs.  Second, these components assemble in a highly cell-type specific manner.  Finally, some elements of the epigenome change in response to disease, making it challenging to find epigenetic signatures with a causal role.  Nonetheless, the first signs that EWAS have potential are upon us.

A recent study by Liu and colleagues undertook an EWAS of rheumatoid arthritis (RA) to uncover DNA methylation changes that interact with genetic factors to mediate disease risk.  The authors note that RA is an ideal test case for EWAS because the cell types involved (leukocytes) are well defined and easily isolated.  In addition, the disease state can be ascertained by measures of anti-citrullinated protein (CP) antibodies.  The authors performed a genome-wide DNA methylation analysis of whole blood from 354 rheumatoid arthritis patients and 335 healthy controls for which genome-wide SNP and CP antibody data were also available.  They first corrected for cellular heterogeneity in their blood samples by effectively normalizing the data using available DNA methylation signatures for major blood cell types.  Second, the authors used a clever series of conditional correlation analyses involving genotype, methylation and phenotype data to filter out differentially methylated positions (DMPs) that are not likely to be causally related to RA.  Remarkably, this revealed significant associations between a set of SNPs and DMPs located in the MHC gene cluster, which has previously been linked to rheumatoid arthritis.  In a final step, the authors used a causal inference test to define 9 DMPs that mediate the genetic risk for RA through interactions with 264 SNPs in the MHC region and one SNP-DMP pair outside of the MHC region.
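
To make the cell-heterogeneity point concrete, here is a minimal sketch in R.  It is not the authors' pipeline: they used reference methylation signatures for the major blood cell types, whereas this toy example simply assumes the fraction of one cell type is already known for each sample.  All numbers and variable names (status, cell_frac, methylation) are invented for illustration.

```r
# Simulated data (all values invented): methylation at one CpG depends only on
# the fraction of one cell type, and cases carry more of that cell type.
status    <- rep(c(0, 1), each = 6)                  # 0 = control, 1 = case
cell_frac <- c(0.30, 0.35, 0.40, 0.45, 0.50, 0.55,   # controls
               0.50, 0.55, 0.60, 0.65, 0.70, 0.75)   # cases
methylation <- 0.20 + 0.50 * cell_frac               # no direct effect of status

# Without adjustment, the CpG looks associated with disease status
unadjusted <- coef(lm(methylation ~ status))[["status"]]

# Including cell composition as a covariate removes the spurious association
adjusted <- coef(lm(methylation ~ status + cell_frac))[["status"]]

print(c(unadjusted = unadjusted, adjusted = adjusted))
```

In this toy data set the apparent case/control difference comes entirely from cell composition, so the status coefficient drops to essentially zero once the cell fraction is included in the model.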

This study not only reveals the importance of understanding the relationships between genetic and epigenetic factors in common diseases, but also establishes a clear methodology to overcome many of the issues inherent to EWAS.

Liu et al.  Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nature Biotechnology, published online January 2013

Thursday, January 24, 2013

Retrieve SNPs in Promoters and Flanking Sequence for Many Genes

This post presents R code to retrieve SNPs in promoters for a list of genes.  You provide your list of genes in the "genes" variable, and the code uses biomaRt (mus_musculus) to get the transcriptional start site (TSS) of each transcript for those genes.  It then uses the TSS information to search the promoter regions (defined here as [-1000, +200 bp] relative to the TSS) for SNPs for each gene, again using biomaRt.  My code looks for SNPs distinguishing the C57BL/6J and CAST/EiJ mouse strains, but this can easily be changed in the getSnps function by replacing "cast_eij" (in the attribute list and in "sp$cast_eij") with another strain name (use listAttributes(snpmart) in biomaRt to see the options).

You can view the promoters found to have SNPs in the "dataSnps" variable.  Select the transcript SNP set you are interested in and run the final "getflank" function to get the surrounding sequence using the mouse genome data provided in the BSgenome library.  The code retrieves 100 bp of flanking sequence on each side of each SNP site; this can be adjusted with the "offset" variable in the "getflank" function.  The "final" output is given row names and written to a tab-delimited file that opens easily in Excel.

This is a quick way to get SNP data in promoter regions for genes of interest.


#######################INPUT GENES OF INTEREST####
#use official MGI symbols (e.g. Igf2r, Rasgrf1) so the mgi_symbol filter can match them
genes = c("Igf2","H19", "Igf2r", "Rasgrf1", "Magel2")

#######################GET TSS#############################
#get transcriptional start sites for all genes of interest
#note: mart, dataset and attribute names are the ones used when this post was written;
#check listMarts(), listDatasets() and listAttributes() if a query fails

library(biomaRt)

ensembl = useMart("ensembl")
ensembl = useDataset("mmusculus_gene_ensembl", mart=ensembl)
tss <- getBM(attributes=c('ensembl_gene_id', 'ensembl_transcript_id', 'chromosome_name','transcript_start', 'transcript_end','strand','mgi_symbol'), filters = c("mgi_symbol"), values=genes, mart=ensembl)

########################GET SNPS IN PROMOTER###########################
#get snps in promoter region : promoter region is defined as [-1000,+200] relative to transcriptional start site

snpmart = useMart("snp")
snpmart = useDataset("mmusculus_snp", mart=snpmart)

getSnps <- function(x){
    #x is one row of tss as a named character vector; apply pads values, so trim whitespace
    x = trimws(x)
    id = x[["ensembl_gene_id"]]
    txid = x[["ensembl_transcript_id"]]
    name = x[["mgi_symbol"]]
    chr = x[["chromosome_name"]]
    #Ensembl reports transcript_start < transcript_end on both strands,
    #so on the minus strand the TSS is transcript_end and the window is mirrored
    if ( as.numeric(x[["strand"]]) < 0 ) {
        txstart = as.numeric(x[["transcript_end"]])
        from = txstart - 200
        to = txstart + 1000
    } else {
        txstart = as.numeric(x[["transcript_start"]])
        from = txstart - 1000
        to = txstart + 200
    }
    getBM(attributes=c('refsnp_id', 'chr_name', 'chrom_start', 'allele', 'c57bl_6j', 'cast_eij'), filters=c("chr_name", "chrom_start", "chrom_end"), values=list( chr, from, to ), mart = snpmart)->sp
    #keep only SNPs with a called base in the CAST/EiJ strain
    sp[sp$cast_eij %in% c("A","T","G","C"), ]->dataSnps
    if ( nrow(dataSnps) > 0 ) {
        cbind(id,txid,name,dataSnps, txstart)->results
        return(results)
    }
    NULL #no SNPs found in this promoter
}
#simplify = FALSE (R >= 4.1) always returns a list with one element per transcript
snps<-apply(tss, 1, getSnps, simplify = FALSE)
dataSnps <- snps[!sapply(snps, is.null)]#All SNP Info with NULLS removed

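###########################GET FLANKING SEQUENCE FOR EACH SNP###########################
#Get sequence for snps
#note: the BSgenome build must match the assembly used by the SNP mart

library(BSgenome.Mmusculus.UCSC.mm10)

dataSnps #view output 

dataSnps[[19]] -> d #select SNP set for transcript of interest (change the index to pick another transcript)

getflank <- function(x) {
    #x is one row of d as a named character vector; apply pads values, so trim whitespace
    x = trimws(x)
    id = x[["id"]]
    txid = x[["txid"]]
    name = x[["name"]]
    position = as.numeric(x[["chrom_start"]])
    alleles = paste("[",x[["allele"]]," > ",x[["cast_eij"]],"]", sep="")
    chr = paste("chr", x[["chr_name"]], sep="")
    txstart = x[["txstart"]]
    offset = 100
    leftflank  <- getSeq(BSgenome.Mmusculus.UCSC.mm10,chr,position-offset,position-1)
    rightflank <- getSeq(BSgenome.Mmusculus.UCSC.mm10,chr,position+1,position+offset)
    #join left flank, bracketed alleles and right flank into one string
    seq = paste(as.character(leftflank), alleles, as.character(rightflank), sep="")
    c(id, txid, name, txstart, position, alleles, chr, seq)
}

final <- apply(d, 1, getflank) #one column per SNP, one row per field
r<- c("id","txid", "name", "txstart", "snp position", "Ref > Alt", "chr", "Seq")
rownames(final) <- r
write.table(final, file = "SNPSandFlankingSEQinPROMOTERS_Gene.txt", sep="\t")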
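
The two coordinate calculations the script relies on can be tried out without querying Ensembl.  The sketch below is only an illustration: promoter_window and flank_string are hypothetical helper names that are not part of the script or of biomaRt, and the toy sequence is made up.  It assumes strand is coded as 1 or -1 and that the TSS is given in genomic coordinates.

```r
# Hypothetical helper: promoter window [-upstream, +downstream] around a TSS.
# On the minus strand "upstream" lies at higher genomic coordinates.
promoter_window <- function(tss, strand, upstream = 1000, downstream = 200) {
  stopifnot(all(strand %in% c(1, -1)))
  data.frame(
    start = ifelse(strand == 1, tss - upstream, tss - downstream),
    end   = ifelse(strand == 1, tss + downstream, tss + upstream)
  )
}

# Hypothetical helper: flanking sequence around a SNP in a plain character
# string, with the alleles shown in brackets at the SNP position.
flank_string <- function(sequence, position, ref, alt, offset = 100) {
  left  <- substr(sequence, max(1, position - offset), position - 1)
  right <- substr(sequence, position + 1, min(nchar(sequence), position + offset))
  paste0(left, "[", ref, " > ", alt, "]", right)
}

# Same TSS coordinate on the plus and on the minus strand
w <- promoter_window(tss = c(5000, 5000), strand = c(1, -1))
print(w)

# Toy sequence with a SNP at position 5 and 3 bp of flank on each side
f <- flank_string("ACGTACGTAC", position = 5, ref = "A", alt = "G", offset = 3)
print(f)
```

With these inputs the plus-strand window is 4000 to 5200, the minus-strand window is 4800 to 6000, and the flank string is "CGT[A > G]CGT".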

Monday, December 17, 2012

New Approaches to Gene Co-expression Network Analysis

An important new paper on the methodology for doing gene co-expression network analysis was recently published in PLoS ONE by Kumari et al. (2012).  The paper is entitled "Evaluation of Gene Association Methods for Coexpression Network Construction and Biological Knowledge Discovery".

The authors perform a comparative analysis of several different approaches for constructing co-expression networks.  Their abstract is reproduced below.

Background: Constructing coexpression networks and performing network analysis using large-scale gene expression data sets is an effective way to uncover new biological knowledge; however, the methods used for gene association in constructing these coexpression networks have not been thoroughly evaluated. Since different methods lead to structurally different coexpression networks and provide different information, selecting the optimal gene association method is critical.

Methods and Results: In this study, we compared eight gene association methods – Spearman rank correlation, Weighted Rank Correlation, Kendall, Hoeffding’s D measure, Theil-Sen, Rank Theil-Sen, Distance Covariance, and Pearson – and focused on their true knowledge discovery rates in associating pathway genes and construction coordination networks of regulatory genes. We also examined the behaviors of different methods to microarray data with different properties, and whether the biological processes affect the efficiency of different methods.

Conclusions: We found that the Spearman, Hoeffding and Kendall methods are effective in identifying coexpressed pathway genes, whereas the Theil-sen, Rank Theil-Sen, Spearman, and Weighted Rank methods perform well in identifying coordinated transcription factors that control the same biological processes and traits. Surprisingly, the widely used Pearson method is generally less efficient, and so is the Distance Covariance method that can find gene pairs of multiple relationships. Some analyses we did clearly show Pearson and Distance Covariance methods have distinct behaviors as compared to all other six methods. The efficiencies of different methods vary with the data properties to some degree and are largely contingent upon the biological processes, which necessitates the pre-analysis to identify the best performing method for gene association and coexpression network construction.
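
To see why the choice of association measure can change a network, here is a minimal sketch in R using only the base cor() function.  It covers just three of the eight methods in the paper (Pearson, Spearman, Kendall), the expression values are made up, and the 0.99 threshold is arbitrary; it is an illustration, not the authors' evaluation procedure.

```r
# Toy expression matrix: rows are samples, columns are genes (values made up)
samples <- 1:10
expr <- cbind(
  geneA = samples,
  geneB = exp(samples / 2),                  # monotonic, non-linear in geneA
  geneC = c(5, 3, 8, 1, 9, 2, 7, 4, 10, 6)   # no consistent trend with geneA
)

# Gene-by-gene association matrix for each method
methods <- c("pearson", "spearman", "kendall")
assoc <- lapply(setNames(methods, methods), function(m) cor(expr, method = m))

# Turn each association matrix into an adjacency matrix with a hard threshold
threshold <- 0.99   # arbitrary cut-off for illustration
adjacency <- lapply(assoc, function(a) {
  adj <- abs(a) >= threshold
  diag(adj) <- FALSE   # ignore self-associations
  adj
})

# Association between geneA and geneB under each method
ab <- sapply(assoc, function(a) a["geneA", "geneB"])
print(round(ab, 3))
```

Because geneB is a monotonic but non-linear function of geneA, the rank-based measures score the pair as perfectly associated while Pearson does not, so the geneA-geneB edge is present in the Spearman and Kendall networks and absent from the Pearson network at this threshold.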

Sunday, December 9, 2012

Dr. Coni Horndli Receives Prestigious Swiss Fellowship!!

Coni Horndli has been awarded a prestigious fellowship from the Swiss National Science Foundation.  Dr. Horndli is a postdoctoral fellow in the Gregg Lab developing novel approaches to study genetic and epigenetic pathways in the brain that modulate complex feeding and foraging behaviors.  She has a particular interest in molecular mechanisms that influence stress and anxiety.  Dr. Horndli's work is anticipated to transform our understanding of the mechanisms in the brain that contribute to susceptibility to eating disorders, stress and anxiety-related disorders, and depression.

Wednesday, December 5, 2012

IGF2:IGF2R Evolution

By:  Dr. Coni Horndli

An Exon Splice Enhancer Primes IGF2:IGF2R Binding Site Structure and Function Evolution

Christopher Williams, Hans-Jürgen Hoppe, Dellel Rezgui, Madeleine Strickland, Briony E. Forbes, Frank Grutzner, Susana Frago, Rosamund Z. Ellis, Pakorn Wattana-Amorn, Stuart N. Prince, Oliver J. Zaccheo, Catherine M. Nolan, Andrew J. Mungall, E. Yvonne Jones, Matthew P. Crump, A. Bassim Hassan

Placental development and genomic imprinting coevolved with parental conflict over resource distribution to mammalian offspring. The imprinted genes IGF2 and IGF2R code for the growth promoter insulin-like growth factor 2 (IGF2) and its inhibitor, mannose 6-phosphate (M6P)/IGF2 receptor (IGF2R), respectively. M6P/IGF2R of birds and fish do not recognize IGF2. In monotremes, which lack imprinting, IGF2 specifically bound M6P/IGF2R via a hydrophobic CD loop. We show that the DNA coding the CD loop in monotremes functions as an exon splice enhancer (ESE) and that structural evolution of binding site loops (AB, HI, FG) improved therian IGF2 affinity. We propose that ESE evolution led to the fortuitous acquisition of IGF2 binding by M6P/IGF2R that drew IGF2R into parental conflict; subsequent imprinting may then have accelerated affinity maturation.


This report, published by Matthew Crump’s and Bassim Hassan’s groups this week in Science, analyses the molecular changes during evolution that led to high-affinity binding of IGF2R to IGF2 in mammals but not in birds and reptiles. IGF2 and IGF2R are two of the roughly 100 canonically imprinted genes found in mammals, with IGF2 expressed only from the paternal allele and IGF2R only from the maternal allele. In mice, deletion of the maternal IGF2R gene results in overly large offspring, while deletion of the paternal IGF2 gene results in dwarf offspring. In humans, IGF2 is imprinted but its receptor is not. Activation of the maternal IGF2 allele causes Beckwith-Wiedemann syndrome, which is characterized by large body size at birth and an increased risk for childhood cancer. The reciprocal expression of IGF2 and IGF2R underscores the parental conflict over the distribution of resources to offspring. This hypothesis is based on the theory that mothers want to distribute their resources equally among all their current and future offspring, while fathers favor maternal investment in the current offspring.
This study relates the appearance of IGF2:IGF2R binding to the emergence of monoallelic expression of the two genes. Specifically, Williams et al. show that binding is already present in monotremes, which lack imprinting, while imprinting is found only in theria, such as rodents, kangaroos and opossums. The authors therefore hypothesize that the evolution of IGF2 and IGF2R imprinting was facilitated by the appearance of their molecular binding, and that imprinting may in turn have accelerated selection for improved regulation of IGF2 through IGF2R.
This report thoroughly describes the structural changes that led to IGF2:IGF2R complex formation but falls short of explaining how IGF2:IGF2R binding facilitates genomic imprinting of these two genes.