Introduction

This tutorial will show you how to use the ppi subset of crisscrosslinker functions in the context of the yeast dataset from the paper Architecture of the RNA polymerase II–TFIIF complex revealed by cross-linking and mass spectrometry (Chen et. al 2010).

This tutorial will cover the basics including:

Install crisscrosslinkeR

If you have not done so already, please install crisscrosslinkeR:

library(devtools)
install_github('egmg726/crisscrosslinker')

Load Libraries

First, load the crisscrosslinker package into your R session.

library(crisscrosslinker)
library(Biostrings)
library(XML)
library(ggplot2)
library(reshape2)

Download Data from PRIDE (Optional)

Many of the publically available mass spectrometry datasets are available from PRIDE. You can download mass spectrometry .raw files from here and then process it for crisscrosslinker using the open-access software pLink.

See the prideR package if you want to search through the accession numbers using your R console.

Load Data

You will then need to load the data into your R session.

The data used for this tutorial was processed using pLink2. crisscrosslinker is designed to handle output from both versions of pLink (1 and 2). At this time, this is the only type of data that will be read by the program.

The uploaded .raw files to PRIDE can be accessed here. The plink2-processed data and its associated FASTA file will be used for input. All of these files are available for download via crisscrosslinkeR.

You can access the plink2-processed data using the following code:

#Get the file path for the files needed
file_path <- system.file("extdata/RNAPolII_ppi",package = 'crisscrosslinker',mustWork = TRUE)

#Names of files used for the experiment. All experiment files must be within the same list for ppi.loadData()
file_names <- c('Sample_Cross-linked_peptides.csv','Sample_loop-linked_peptides.csv')

#Read FASTA file used for analysis (should be the same one used for )
fasta_file <- seqinr::read.fasta(paste0(file_path,'/Sample.fasta'))

ppi.data <- ppi.loadData(file_names,fasta_file,file_directory = file_path)

Currently, the program assumes all files used for the experiments are within the same directory and will automatically get the current working directory using getwd() or using the file_directory variable within ppi.loadData.

Combine the Data

Once the data has been loaded from the files into your R envionment, the ppi.combineData function will create a table to show the frequency of binding sites within your loaded datasets.

xlink.df <- ppi.combineData(ppi.data,fasta_file)

This will create a table with the calculated frequencies. We can then check out how the data looks:

head(xlink.df)
##                                                    seq
## 1 AAAALGLFNEEGLESTGEDFLKK(22)-GFPGPPGMKGPAGMPGFPGMK(9)
## 2            ADPTTTVNAKVGAENDGDSSLFLR(10)-FLSGSWDKR(8)
## 3                          ADPVALYLGLKR(11)-LSSDEKR(6)
## 4                                AEICLKR(6)-TVTAKVR(5)
## 5                        AGDIQLQYGWKVER(11)-GNLMGKR(6)
## 6                    AGDIQLQYGWKVER(11)-VGQQYSSAPLR(1)
##                                                                     pro freq
## 1 sp|P32494|NGG1_YEAST(258)-gi|CON_IPI00731432.2|SWISS-PROT:P04258(251)    1
## 2                  sp|P35177|SPT7_YEAST(1284)-sp|P38915|SPT8_YEAST(329)    1
## 3                 sp|P20434|RPAB1_YEAST(191)-sp|P20434|RPAB1_YEAST(161)    1
## 4                   sp|P35177|SPT7_YEAST(751)-sp|P35177|SPT7_YEAST(743)    1
## 5                   sp|P04050|RPB1_YEAST(431)-sp|P04050|RPB1_YEAST(343)    1
## 6                     sp|P04050|RPB1_YEAST(431)-sp|P04050|RPB1_YEAST(2)    1
##   freq_color                            files pro_pos1 pro_pos2
## 1        red Sample_Cross-linked_peptides.csv      258      251
## 2        red Sample_Cross-linked_peptides.csv     1284      329
## 3        red Sample_Cross-linked_peptides.csv      191      161
## 4        red Sample_Cross-linked_peptides.csv      751      743
## 5        red Sample_Cross-linked_peptides.csv      431      343
## 6        red Sample_Cross-linked_peptides.csv      431        2
##               pro_name1                              pro_name2
## 1  sp|P32494|NGG1_YEAST gi|CON_IPI00731432.2|SWISS-PROT:P04258
## 2  sp|P35177|SPT7_YEAST                   sp|P38915|SPT8_YEAST
## 3 sp|P20434|RPAB1_YEAST                  sp|P20434|RPAB1_YEAST
## 4  sp|P35177|SPT7_YEAST                   sp|P35177|SPT7_YEAST
## 5  sp|P04050|RPB1_YEAST                   sp|P04050|RPB1_YEAST
## 6  sp|P04050|RPB1_YEAST                   sp|P04050|RPB1_YEAST
##                   pep_seq1              pep_seq2 pep_pos1 pep_pos2        score
## 1  AAAALGLFNEEGLESTGEDFLKK GFPGPPGMKGPAGMPGFPGMK       22        9 2.609783e-01
## 2 ADPTTTVNAKVGAENDGDSSLFLR             FLSGSWDKR       10        8 6.387972e-03
## 3             ADPVALYLGLKR               LSSDEKR       11        6 1.722353e-14
## 4                  AEICLKR               TVTAKVR        6        5 5.247864e-12
## 5           AGDIQLQYGWKVER               GNLMGKR       11        6 3.948047e-12
## 6           AGDIQLQYGWKVER           VGQQYSSAPLR       11        1 5.510616e-15

Visualize Results with xiNET

In this particular dataset, there are contaminants we do not want to visualize. We will first filter them out.

xlink.df.filtered <- xlink.df[(!grepl('CON',xlink.df$pro)),]

From here, we can create a file that can be used as input for xiNET (Combe et. al 2015) and its successor, xiVIEW

xlink.df.xinet <- ppi.xinet(xlink.df.filtered,xlink_viewer_csv_file_name='xinet_input.csv')
head(xlink.df.xinet)
##   PepPos1 PepPos2                  PepSeq1     PepSeq2 LinkPos1 LinkPos2
## 1    1275     322 ADPTTTVNAKVGAENDGDSSLFLR   FLSGSWDKR       10        8
## 2     181     156             ADPVALYLGLKR     LSSDEKR       11        6
## 3     746     739                  AEICLKR     TVTAKVR        6        5
## 4     421     338           AGDIQLQYGWKVER     GNLMGKR       11        6
## 5     421       2           AGDIQLQYGWKVER VGQQYSSAPLR       11        1
## 6      48      29               AKIGGLNDPR    AISVAKIR        2        6
##                Protein1              Protein2        Score
## 1  sp|P35177|SPT7_YEAST  sp|P38915|SPT8_YEAST 6.387972e-03
## 2 sp|P20434|RPAB1_YEAST sp|P20434|RPAB1_YEAST 1.722353e-14
## 3  sp|P35177|SPT7_YEAST  sp|P35177|SPT7_YEAST 5.247864e-12
## 4  sp|P04050|RPB1_YEAST  sp|P04050|RPB1_YEAST 3.948047e-12
## 5  sp|P04050|RPB1_YEAST  sp|P04050|RPB1_YEAST 5.510616e-15
## 6  sp|P04050|RPB1_YEAST  sp|P04050|RPB1_YEAST 7.147918e-12

This file can be directly inputted into xiNET along with the accompanying FASTA file.

This allows you to visualize your data in a web-based platform after appropriate filtration and data aggregation steps have been performed.

Match PDB Structures to your Crosslinked Sites

For this dataset, we want to map all of the proteins within the same PDB file. We have chosen 1WCM for this purpose since it used in the publication as well.

ppi.alignIDs <- read.csv(paste0(file_path,'/Sample_alignIDs.csv'))
ppi.alignIDs
##              ProteinName UniProtID    PDB
## 1  sp|P20434|RPAB1_YEAST    P20434 1WCM_E
## 2   sp|P04050|RPB1_YEAST    P04050 1WCM_A
## 3  sp|P38902|RPB11_YEAST    P38902 1WCM_K
## 4  sp|P40422|RPAB4_YEAST    P40422 1WCM_L
## 5   sp|P08518|RPB2_YEAST    P08518 1WCM_B
## 6   sp|P16370|RPB3_YEAST    P16370 1WCM_C
## 7   sp|P20433|RPB4_YEAST    P20433 1WCM_D
## 8  sp|P20435|RPAB2_YEAST    P20435 1WCM_F
## 9   sp|P34087|RPB7_YEAST    P34087 1WCM_G
## 10 sp|P22139|RPAB5_YEAST    P22139 1WCM_J
## 11  sp|P27999|RPB9_YEAST    P27999 1WCM_I

We are going to further filter our dataset to only those proteins which show up in this particular dataframe.

xlink.df.filtered2 <- xlink.df.filtered[xlink.df.filtered$pro_name1 %in% ppi.alignIDs$ProteinName,]
xlink.df.filtered2 <- xlink.df.filtered2[xlink.df.filtered2$pro_name2 %in% ppi.alignIDs$ProteinName,]

##Note about Sequence Alignments

Sequence alignments are done using the PDB mapping function through RCSB to ensure accuracy.

Since many PDB files contain large gaps, if your FASTA file is not the original UniProt sequence but is similar in sequence, doing a direct sequence alignment may not produce the best results.It is often better to align it to the canonical UniProt sequence and then use the RCSB API to ensure that the sequences are being mapped accurately based on their database.

Alignment to PDB

xlink.df.pdb <- ppi.matchPDB2(xlink.df.filtered2,fasta_file = fasta_file, alignIDs = ppi.alignIDs,uniprot2pdb = FALSE)

As you can see, 3 additional columns have been added:

We can use the dist column to look at the distibution of distances within our dataset and see if they’re reasonable given the crosslinker used (in this case BS3).

qplot(na.omit(xlink.df.pdb$dist),xlab='Distance',ylab='Frequency')

However, we may want see if this distribution is random since it somewhat resembles a normal distribution. We can check this by using the function ppi.distAnalysis.

set.seed(224)
ppi.distAnalysis(xlink.df.pdb = xlink.df.pdb)
##   HEADER    TRANSCRIPTION                           17-NOV-04   1WCM

Based on this plot, it appears that the experimental distances are not random and skew right, consistent with a short crosslinker like BS3 which is approximately 11.3 Angstroms.

We can further customize the plots if we want certain aesthetics, such as for a publication. This can be done by turning off the plot variable within the ppi.distAnalysis function.

set.seed(224)
dist.hist.df.mega <- ppi.distAnalysis(xlink.df.pdb = xlink.df.pdb, plot=FALSE)
##   HEADER    TRANSCRIPTION                           17-NOV-04   1WCM

dist.hist.df.mega is a list containing a data.frame for each PDB file that was matched to the sequences chosen.

We can then use another color palette that we want to use for our histogram.

Since we only have one PDB structure, 1WCM, we can just select it from the list to use.

ggplot(melt(dist.hist.df.mega[['1WCM']]), aes(value, fill = variable)) +
  geom_histogram(position = "dodge") +
  labs(x=expression(paste('Distance (',ring(A),')')),y='Frequency',title='Distance Analysis for 1WCM') +
  scale_fill_manual(values=c("steelblue","forestgreen") )+
  theme_classic() + theme(legend.title = element_blank())

Note: the control distribution will look slightly different each time it is run due to the random sampling of the lysines unless set.seed is set before ppi.distAnalysis.

Visualizing using PyMOL (3D)

First, we will omit all rows which contain an NA, which do not have a valid distance in them.

xlink.df.pdb <- na.omit(xlink.df.pdb)

Using Default Settings

We can first check how the sites look on a structure by writing a file that can be read by PyMOL.

ppi.pymol(xlink.df = xlink.df.pdb, write_file = TRUE,file.name='xlink_default.pml')

Using Custom Colors

If we do not want to use the colors automatically generated by ppi.combineData, we can customize this to better fit our desired aesthetic.

ppi.pymol(xlink.df = xlink.df.pdb, colors = '#f38bd1',color_by = 'freq',write_file = TRUE,file.name='xlink_pink.pml')

Coloring by Distance

We can also color the crosslinking sites by distance.

ppi.pymol(xlink.df = xlink.df.pdb, color_by='dist', write_file = TRUE,file.name='xlink_dist.pml')

We are currently working on adding more customization options within the ppi.pymol function for coloring by dist. However, you can also color by distance yourself by using the following code:

dist.colors <- c("blue", "green", "orange","red") #colors desired for each of the intervals
dist.int <- c(-Inf,10,15,20,Inf) #intervals for each of the different colors

xlink.df.pdb$freq_color <- cut(xlink.df.pdb$dist, dist.int, dist.colors)

head(xlink.df.pdb[c('freq_color','dist')])
##   freq_color     dist
## 3      green 13.31661
## 5        red 26.13151
## 6     orange 15.94826
## 7      green 13.67619
## 8        red 24.95538
## 9     orange 16.43422

Then use the ppi.pymol function with default settings to produce your .pml file.

Session Info

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.1 (2019-07-05)
##  os       macOS Mojave 10.14.3        
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Australia/Melbourne         
##  date     2019-12-12                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package          * version    date       lib
##  ade4               1.7-13     2018-08-31 [1]
##  assertthat         0.2.1      2019-03-21 [1]
##  backports          1.1.5      2019-10-02 [1]
##  bio3d              2.3-4      2018-04-03 [1]
##  BiocGenerics     * 0.30.0     2019-05-02 [1]
##  Biostrings       * 2.52.0     2019-05-02 [1]
##  bitops             1.0-6      2013-08-17 [1]
##  callr              3.3.2      2019-09-22 [1]
##  cli                1.1.0      2019-03-19 [1]
##  colorspace         1.4-1      2019-03-18 [1]
##  crayon             1.3.4      2017-09-16 [1]
##  crisscrosslinker * 0.9        2019-12-12 [1]
##  curl               4.2        2019-09-24 [1]
##  desc               1.2.0      2018-05-01 [1]
##  devtools           2.2.1      2019-09-24 [1]
##  digest             0.6.22     2019-10-21 [1]
##  dplyr              0.8.3      2019-07-04 [1]
##  ellipsis           0.3.0      2019-09-20 [1]
##  evaluate           0.14       2019-05-28 [1]
##  fs                 1.3.1      2019-05-06 [1]
##  gdtools          * 0.2.1      2019-10-14 [1]
##  ggplot2          * 3.2.1      2019-08-10 [1]
##  glue               1.3.1      2019-03-12 [1]
##  gridExtra          2.3        2017-09-09 [1]
##  gtable             0.3.0      2019-03-25 [1]
##  htmltools          0.4.0      2019-10-04 [1]
##  httr               1.4.1      2019-08-05 [1]
##  IRanges          * 2.18.3     2019-09-24 [1]
##  jsonlite           1.6        2018-12-07 [1]
##  knitr              1.26       2019-11-12 [1]
##  labeling           0.3        2014-08-23 [1]
##  lattice            0.20-38    2018-11-04 [1]
##  lava               1.6.6      2019-08-01 [1]
##  lazyeval           0.2.2      2019-03-15 [1]
##  magrittr           1.5        2014-11-22 [1]
##  MASS               7.3-51.4   2019-03-31 [1]
##  Matrix             1.2-17     2019-03-22 [1]
##  memoise            1.1.0      2017-04-21 [1]
##  munsell            0.5.0      2018-06-12 [1]
##  openxlsx           4.1.3      2019-11-07 [1]
##  pillar             1.4.2      2019-06-29 [1]
##  pkgbuild           1.0.6      2019-10-09 [1]
##  pkgconfig          2.0.3      2019-09-22 [1]
##  pkgload            1.0.2      2018-10-29 [1]
##  plyr               1.8.4      2016-06-08 [1]
##  prettyunits        1.0.2      2015-07-13 [1]
##  processx           3.4.1      2019-07-18 [1]
##  prodlim            2019.10.13 2019-11-03 [1]
##  ps                 1.3.0      2018-12-21 [1]
##  purrr              0.3.3      2019-10-18 [1]
##  R6                 2.4.1      2019-11-12 [1]
##  RColorBrewer       1.1-2      2014-12-07 [1]
##  Rcpp               1.0.3      2019-11-08 [1]
##  RCurl              1.95-4.12  2019-03-04 [1]
##  remotes            2.1.0      2019-06-24 [1]
##  reshape2         * 1.4.3      2017-12-11 [1]
##  rlang              0.4.1      2019-10-24 [1]
##  rmarkdown          1.17       2019-11-13 [1]
##  rprojroot          1.3-2      2018-01-03 [1]
##  S4Vectors        * 0.22.1     2019-09-09 [1]
##  scales             1.0.0      2018-08-09 [1]
##  seqinr             3.6-1      2019-09-07 [1]
##  sessioninfo        1.1.1      2018-11-05 [1]
##  stringi            1.4.3      2019-03-12 [1]
##  stringr            1.4.0      2019-02-10 [1]
##  survival           3.1-7      2019-11-09 [1]
##  svglite            1.2.2      2019-05-17 [1]
##  systemfonts        0.1.1      2019-07-01 [1]
##  testthat           2.3.0      2019-11-05 [1]
##  tibble             2.1.3      2019-06-06 [1]
##  tidyselect         0.2.5      2018-10-11 [1]
##  usethis            1.5.1      2019-07-04 [1]
##  viridis            0.5.1      2018-03-29 [1]
##  viridisLite        0.3.0      2018-02-01 [1]
##  withr              2.1.2      2018-03-15 [1]
##  xfun               0.11       2019-11-12 [1]
##  XML              * 3.98-1.20  2019-06-06 [1]
##  xml2               1.2.2      2019-08-09 [1]
##  XVector          * 0.24.0     2019-05-02 [1]
##  yaml               2.2.0      2018-07-25 [1]
##  zip                2.0.4      2019-09-01 [1]
##  zlibbioc           1.30.0     2019-05-02 [1]
##  source                                   
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  Bioconductor                             
##  Bioconductor                             
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  Github (egmg726/crisscrosslinker@616171c)
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.1)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  Bioconductor                             
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.1)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.1)                           
##  CRAN (R 3.6.1)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  Bioconductor                             
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  Bioconductor                             
##  CRAN (R 3.6.0)                           
##  CRAN (R 3.6.0)                           
##  Bioconductor                             
## 
## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

References

Chen, Z. A. et al. Architecture of the RNA polymerase II-TFIIF complex revealed by cross-linking and mass spectrometry. The EMBO journal 29, 717-726, doi:10.1038/emboj.2009.401 (2010).

Combe, C. W., Fischer, L. & Rappsilber, J. xiNET: Cross-link Network Maps With Residue Resolution. Molecular & Cellular Proteomics : MCP 14, 1137-1147, doi:10.1074/mcp.O114.042259 (2015).