QC workflow for Proteomics Data
The RMD_RUNS function log10-transforms peptide peak intensities (that is, peptide abundance data) and determines whether any LC-MS analyses in a peptide data set are statistical outliers. The statistical analysis summarizes each LC-MS run as a vector of q=5 summary statistics that describe the peptide abundance distribution for that run; the resulting N x q matrix (one row per run) is then analyzed using robust PCA to compute a robust estimate of the covariance matrix, which is used in the calculation of a robust Mahalanobis distance. [Matzke MM, et al., 2011. "Improved quality control processing of peptide-centric LC-MS proteomics data." Bioinformatics 2011 Oct 15;27(20):2866-72]
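As a rough illustration of the outlier screen described above, the following Python sketch summarizes each run and scores it with a robust distance. It is not the RMD_RUNS implementation. The five statistics used here (fraction missing, median, MAD, skewness, kurtosis), the function names, and the cutoff are all assumptions made for the example, and a diagonal median/MAD scale stands in for the robust-PCA covariance estimate, so correlations between the statistics are ignored.

```python
import math
from statistics import median


def summarize_run(intensities):
    """Summarize one LC-MS run as 5 statistics of its log10 abundances.

    `intensities` holds raw peak intensities; None or non-positive values
    are treated as missing. The choice of statistics is an assumption.
    """
    observed = [math.log10(x) for x in intensities if x is not None and x > 0]
    n = len(observed)
    frac_missing = 1.0 - n / len(intensities)
    med = median(observed)
    mad = median(abs(x - med) for x in observed)
    mean = sum(observed) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in observed) / n)
    skew = sum(((x - mean) / sd) ** 3 for x in observed) / n
    kurt = sum(((x - mean) / sd) ** 4 for x in observed) / n
    return [frac_missing, med, mad, skew, kurt]


def robust_distances(stats_matrix):
    """Squared robust distance of each run from the robust center.

    Simplification: a diagonal covariance built from per-column median
    and MAD replaces the robust-PCA covariance estimate.
    """
    q = len(stats_matrix[0])
    centers, scales = [], []
    for j in range(q):
        column = [row[j] for row in stats_matrix]
        center = median(column)
        # 1.4826 makes the MAD consistent with the standard deviation
        # under normality.
        scale = 1.4826 * median(abs(v - center) for v in column)
        centers.append(center)
        scales.append(scale if scale > 0 else 1e-12)
    return [
        sum(((row[j] - centers[j]) / scales[j]) ** 2 for j in range(q))
        for row in stats_matrix
    ]


def flag_outliers(stats_matrix, cutoff=20.515):
    """Return indices of runs whose squared distance exceeds `cutoff`.

    The default is an assumed cutoff: roughly the 0.999 quantile of a
    chi-square distribution with 5 degrees of freedom.
    """
    distances = robust_distances(stats_matrix)
    return [i for i, d in enumerate(distances) if d > cutoff]
```

In use, `summarize_run` would be applied to every run to build the N x q matrix, and `flag_outliers` would then return the row indices of candidate outlier runs for manual review.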
SPANS is an approach for evaluating normalization strategies, including the peptide selection component associated with the derivation of normalization values. Our approach evaluates the effect of normalization on the between-group variance structure in order to identify the most appropriate normalization methods, those that improve the structure of the data without introducing bias into the normalized peak intensities. The SPANS protocol was implemented in Java, and all statistical methods were performed using MATLAB® 2011a. [Webb-Robertson BJM, et al., 2011. "A Statistical Protocol for the Selection of Appropriate LC-MS Proteomics Peptide Dataset Normalizations." Proteomics 2011;11(24):4736-41]
Quantification of sequence abundance in RNA-Seq experiments is often conflated by protocol-specific sequence bias. The exact sources of the bias are unknown, but may be influenced by PCR amplification, or differing primer affinities and mixtures, for example. The result is decreased accuracy in many applications, such as de novo gene annotation and transcript quantification.
We developed a new method to measure and correct for these influences using a simple graphical model. Our model does not rely on existing gene annotations, and model selection is performed automatically, making it applicable with few assumptions. We evaluated our method on several data sets and by multiple criteria, demonstrating that it effectively decreases bias and increases uniformity. Additionally, our theoretical and empirical results show that the method is unlikely to have any effect on unbiased data, suggesting it can be applied with little risk of spurious adjustment.
The method is described in Jones, D.C., W.L. Ruzzo, X. Peng, and M.G. Katze, “A new approach to bias correction in RNA-Seq,” Bioinformatics 2012; doi: 10.1093/bioinformatics/bts055. It is implemented in the seqbias R/Bioconductor package, which is freely available under the LGPL license from http://bioconductor.org.
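To make the idea of measuring and correcting sequence bias concrete, the Python sketch below compares nucleotide frequencies around read starts (foreground) with frequencies at background positions and divides read counts by the resulting ratio. This is a deliberately simplified stand-in, not the seqbias model: it treats positions as independent rather than learning a graphical model, and the function names, the pseudocount, and the foreground/background framing are assumptions made for the example.

```python
from collections import Counter

BASES = "ACGT"


def position_frequencies(seqs, pseudocount=1.0):
    """Per-position nucleotide frequencies for equal-length sequences.

    The pseudocount (an assumed smoothing choice) avoids zero frequencies.
    """
    length = len(seqs[0])
    total = len(seqs) + pseudocount * len(BASES)
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs


def bias_ratio(window, fg, bg):
    """Estimated over-representation of read starts in this sequence context.

    Positions are treated as independent, which is a simplification.
    """
    ratio = 1.0
    for i, base in enumerate(window):
        ratio *= fg[i][base] / bg[i][base]
    return ratio


def corrected_count(count, window, fg, bg):
    """Down-weight (or up-weight) a read count by its estimated bias."""
    return count / bias_ratio(window, fg, bg)
```

When the foreground and background frequencies are identical, every ratio is 1 and counts pass through unchanged, which mirrors the property noted above that unbiased data should not be adjusted.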
Compression of next-generation sequencing reads aided by highly-efficient de novo assembly
We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information. Availability: Quip is freely available under the 3-clause BSD license from http://cs.washington.edu/homes/dcjones/quip [Nucleic Acids Res. 2012 Aug 16]
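The two ideas above, a probabilistic k-mer structure and storing reads as contig positions, can be sketched in a few lines of Python. This is an illustration only and not Quip's code: the text does not name the data structure, so a plain Bloom filter is assumed here, and the class, function names, and record layout are hypothetical. Reads are matched to contigs exactly, whereas a real compressor would also have to handle mismatches.

```python
import hashlib


class KmerBloomFilter:
    """Approximate k-mer set: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, kmer):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, kmer):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(kmer)
        )


def kmers(seq, k):
    """Yield every overlapping k-mer of `seq`."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def encode_read(read, contigs):
    """Store a read as (contig index, offset, length) when it matches exactly;
    otherwise fall back to the literal sequence."""
    for idx, contig in enumerate(contigs):
        offset = contig.find(read)
        if offset >= 0:
            return ("pos", idx, offset, len(read))
    return ("raw", read)


def decode_read(record, contigs):
    """Invert encode_read, recovering the original read losslessly."""
    if record[0] == "pos":
        _, idx, offset, length = record
        return contigs[idx][offset:offset + length]
    return record[1]
```

The filter uses a fixed number of bits regardless of k-mer length, which is the kind of memory saving a probabilistic structure offers over an exact k-mer table, and the position records are much smaller than the reads they replace whenever reads fall inside an assembled contig.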