Reposted from: http://www.bio-info-trainee.com/1327.html
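With so much expression, copy number variation, and mutation data collected for cancer cell lines, it would be a waste to let it sit and gather dust!
These data can be put to many uses, yet a Google search turns up surprisingly few papers that cite them. I picked a few and worked through how each one uses the data; the focus, naturally, is the mRNA expression data.
This paper processes the CCLE data in eight steps; a competent bioinformatics analyst should be able to reproduce the whole workflow.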
step1: Affymetrix U133 Plus2 DNA microarray gene expressions of 27 gastric cancer cell lines (Kato-III, IM95, SNU-620, SNU-16, OCUM-1, NUGC-4, 2313287, HUG1N, MKN45, NCIN87, KE39, AGS, SNU-5, SNU-216, NUGC-3, NUGC-2, MKN74, MKN7, RERFGC1B, GCIY, KE97, Fu97, SH10TC, MKN1, SNU-1, Hs746 T, HGC27) were downloaded from Cancer Cell Line Encyclopedia (CCLE) [16] in March 2013.
step2: Robust Multi-array Average (RMA) normalization was performed. Principal component analysis plot show no obvious batch effect.
step3: The normalized data is then collapsed by taking the probe sets with highest gene expression.
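The first three steps produce the mRNA expression matrix for the 27 gastric cancer cell lines: download the CEL files, normalize them with RMA, and, for genes covered by several probe sets, keep the probe set with the highest expression.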
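The collapsing in step 3 can be sketched as follows. This is a minimal sketch, not the paper's code: it assumes "highest gene expression" means the highest mean across samples, and the names `collapse_to_genes`, `probe_to_gene`, the probe set IDs and the values are all hypothetical toy inputs.

```python
import pandas as pd


def collapse_to_genes(expr: pd.DataFrame, probe_to_gene: pd.Series) -> pd.DataFrame:
    """Keep, for each gene, the probe set with the highest mean expression.

    expr: probe sets x samples matrix of normalized (e.g. RMA) values.
    probe_to_gene: probe set ID -> gene symbol (hypothetical annotation).
    """
    # Drop probe sets that have no gene annotation.
    genes = probe_to_gene.reindex(expr.index).dropna()
    expr = expr.loc[genes.index]
    # Mean expression of every probe set across all samples (assumption).
    mean_expr = expr.mean(axis=1)
    # For each gene, the ID of the probe set with the largest mean.
    best_probe = mean_expr.groupby(genes).idxmax()
    collapsed = expr.loc[best_probe.values]
    collapsed.index = best_probe.index
    return collapsed


# Toy data: two probe sets for TP53, one for KRAS (values are made up).
expr = pd.DataFrame(
    {"AGS": [5.0, 9.0, 7.0], "MKN45": [6.0, 8.0, 7.5]},
    index=["p1", "p2", "p3"],
)
probe_to_gene = pd.Series({"p1": "TP53", "p2": "TP53", "p3": "KRAS"})
gene_expr = collapse_to_genes(expr, probe_to_gene)
```

Here `gene_expr` has one row per gene; TP53 is represented by probe set `p2`, whose mean is higher than that of `p1`.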
step4:Unsupervised hierarchical clustering (1-Spearman distance, average linkage) was performed on the cell lines using the aCGH data.
Putative driver genes of which copy number aberrations correlated to mRNA gene expression were identified to determine subtypes or clusters that are driven by different mechanisms. This was done using Mann Whitney U-test with p<0.05, and Spearman Correlation Coefficient test with Rho >0.6.
step5: We then performed consensus clustering [17] on the gene expression data of the 27 gastric cancer cell lines from CCLE using these putative driver genes. We selected k = 2 as it gives sufficiently stable similarity matrix.
step6: In order to assign new samples to this integrative cluster, significance analysis of microarray (SAM) [18] with threshold q<2.0 was used to generate subtype signature based on the mRNA expression data of the 1762 genes from the 27 gastric cancer cell lines in CCLE.
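In short: first cluster the cell lines on the aCGH (copy number) data and identify the putative driver genes, then cluster again on the expression of those genes to split the cell lines into two groups, and finally run SAM on the two groups to find differentially expressed genes.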
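Two pieces of step 4 can be sketched briefly: hierarchical clustering with 1 - Spearman distance and average linkage, and the Spearman filter (Rho > 0.6) linking copy number to expression. This is a rough sketch under assumptions, not the authors' pipeline: the function names, gene labels and toy matrices are hypothetical, and the Mann Whitney U-test and the consensus clustering of step 5 are not shown.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import rankdata, spearmanr


def cluster_samples(data, n_clusters):
    """Cluster the columns (samples) of a features x samples matrix."""
    data = np.asarray(data, dtype=float)
    # Spearman correlation is the Pearson correlation of the ranks.
    ranks = rankdata(data, axis=0)
    rho = np.corrcoef(ranks, rowvar=False)
    # 1 - Spearman distance; clip tiny negative values from rounding.
    dist = np.clip(1.0 - rho, 0.0, 2.0)
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def putative_drivers(copy_number, expression, genes, rho_cutoff=0.6):
    """Genes whose copy number correlates with expression across samples."""
    selected = []
    for gene, cn_row, expr_row in zip(genes, copy_number, expression):
        rho, _ = spearmanr(cn_row, expr_row)
        if rho > rho_cutoff:
            selected.append(gene)
    return selected


# Toy aCGH-like matrix: 6 features x 4 samples (made-up values).
acgh = np.array([
    [1, 1, 6, 5],
    [2, 3, 5, 6],
    [3, 2, 4, 4],
    [4, 4, 3, 3],
    [5, 6, 2, 1],
    [6, 5, 1, 2],
])
labels = cluster_samples(acgh, n_clusters=2)

# Toy copy number and expression for two hypothetical genes, 5 samples.
genes = ["GENE_A", "GENE_B"]
copy_number = np.array([[0.1, 0.5, 1.2, 2.0, 2.5],
                        [0.3, 0.1, -0.1, 0.2, 0.0]])
expression = np.array([[4.0, 5.1, 6.3, 7.9, 8.8],
                       [7.0, 6.5, 7.2, 6.1, 6.9]])
drivers = putative_drivers(copy_number, expression, genes)
```

In the toy data the first two samples and the last two samples fall into separate clusters, and only `GENE_A` passes the correlation cutoff.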
step7: ssGSEA (single sample GSEA) was used to estimate pathway activities of the gastric cancer cell line in the Molecular Signature Database v3.1 (Msigdb v3.1) [19], [20]. The pathway activities are represented in enrichment scores which were rank normalized to [0.0, 1.0].
step8:SAM analysis was performed with threshold q<0.2, and fold change >2.0 (for up-regulated pathways), or <0.5 (for down-regulated pathways) to obtain subtype-specific pathways from the 27 gastric cell lines in CCLE.
Both gene set enrichment analysis and hypergeometric enrichment analysis are used here; see the paper itself for the results.
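The rank normalization in step 7 and the thresholds in step 8 can be illustrated with a small sketch. It assumes that rank normalization means ranking the scores and rescaling the ranks linearly to [0, 1], and that q-values and fold changes have already been computed elsewhere (SAM itself is not implemented here). The function names, column names, pathway labels and numbers are hypothetical.

```python
import pandas as pd


def rank_normalize(scores: pd.Series) -> pd.Series:
    """Map scores to [0, 1] by rank; tied scores share the average rank."""
    if len(scores) < 2:
        return pd.Series(0.0, index=scores.index)
    ranks = scores.rank(method="average")
    return (ranks - 1.0) / (len(scores) - 1.0)


def classify_pathways(results: pd.DataFrame, q_cutoff=0.2,
                      fc_up=2.0, fc_down=0.5) -> pd.Series:
    """Label each pathway 'up', 'down' or 'none' using the step 8 cutoffs."""
    significant = results["q_value"] < q_cutoff
    direction = pd.Series("none", index=results.index)
    direction[significant & (results["fold_change"] > fc_up)] = "up"
    direction[significant & (results["fold_change"] < fc_down)] = "down"
    return direction


# Made-up enrichment scores of one pathway in three cell lines.
scores = pd.Series({"AGS": 0.3, "MKN45": -0.2, "SNU-1": 0.9})
normalized = rank_normalize(scores)

# Made-up test results for three hypothetical pathways.
results = pd.DataFrame(
    {"q_value": [0.01, 0.05, 0.5], "fold_change": [2.5, 0.4, 3.0]},
    index=["pathway_A", "pathway_B", "pathway_C"],
)
direction = classify_pathways(results)
```

`pathway_C` has a large fold change but fails the q-value cutoff, so it is labelled `none`.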
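The next paper uses CCLE for just one thing: a boxplot of a single gene's expression across different cancer types.
The data behind the figure can be retrieved with GEOquery, and so can the sample classification, so producing such a boxplot is very simple.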
Further, the Cancer Cell Line Encyclopedia (CCLE) database demonstrated that of 1062 cell lines representing 37 distinct cancer types, glioma cell lines express the highest levels of STK17A
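The numbers behind such a boxplot come down to grouping one gene's expression by cancer type. The sketch below assumes the expression values and the cell line annotation are already loaded; the cell line names and values are invented toy data, not real CCLE measurements, and `summarize_by_type` is a hypothetical helper.

```python
import pandas as pd


def summarize_by_type(expr: pd.Series, cancer_type: pd.Series) -> pd.DataFrame:
    """Per cancer type summary of one gene, sorted by median expression."""
    summary = expr.groupby(cancer_type).agg(["min", "median", "max", "count"])
    return summary.sort_values("median", ascending=False)


# Toy expression of one gene (e.g. STK17A) in six hypothetical cell lines.
lines = ["line1", "line2", "line3", "line4", "line5", "line6"]
expr = pd.Series([9.0, 10.0, 6.0, 7.0, 7.5, 8.5], index=lines)
cancer_type = pd.Series(
    ["glioma", "glioma", "gastric", "gastric", "ovary", "ovary"], index=lines
)
summary = summarize_by_type(expr, cancer_type)
```

The first row of `summary` is the cancer type with the highest median, which is the comparison the boxplot makes visually.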
The next paper is simpler still: it clusters the expression matrix directly:
The 5,000 most variable genes were used for unsupervised clustering of cell lines by mRNA expression data. Cell lines are colour-coded (vertical bars) according to the reported tissue of origin (a PDF version that can be enlarged at high resolution is in Supplementary Information, Supplementary Fig. S4); horizontal labels at bottom indicate the dominating tissue types within the respective branches of the dendrogram. Most ovarian cancer cell lines (magenta) cluster together, interspersed with endometrial cell lines. However, some ovarian cancer cell lines cluster with other tissue types (*). Top right panels: neighbourhoods (1) of the top cell lines in our analysis, (2) of cell line IGROV1, and (3) of cell line A2780. For the ovarian cancer cell lines in these enlarged areas, the histological subtype as assigned in the original publication is indicated by coloured letters.
Just take the whole expression matrix, select the 5,000 most variable genes, and cluster on them; that yields a similar figure.
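A minimal sketch of that recipe follows. The excerpt does not state the distance or linkage used, so correlation distance with average linkage is an assumption here, as are the function names and the tiny toy matrix (3 genes are kept instead of 5,000).

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage


def top_variable_genes(expr: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep the n genes (rows) with the largest variance across cell lines."""
    variances = expr.var(axis=1)
    return expr.loc[variances.nlargest(n).index]


def cluster_cell_lines(expr: pd.DataFrame, n_clusters: int) -> pd.Series:
    """Hierarchically cluster the cell lines (columns) of a gene matrix."""
    # linkage clusters rows, so transpose to put cell lines in rows.
    tree = linkage(expr.T.to_numpy(), method="average", metric="correlation")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return pd.Series(labels, index=expr.columns)


# Toy matrix: 5 genes x 4 hypothetical cell lines (made-up values).
expr = pd.DataFrame(
    [[1.0, 2.0, 9.0, 8.0],
     [2.0, 1.0, 8.0, 9.0],
     [9.0, 8.0, 1.0, 2.0],
     [5.0, 5.0, 5.1, 5.0],
     [3.0, 3.1, 3.0, 3.0]],
    index=["g1", "g2", "g3", "g4", "g5"],
    columns=["line1", "line2", "line3", "line4"],
)
top = top_variable_genes(expr, n=3)
clusters = cluster_cell_lines(top, n_clusters=2)
```

On real data, `n` would be 5000 and the resulting tree would be drawn as a dendrogram with cell lines coloured by tissue of origin.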