Title: | Regulatory DNA Element Prediction |
---|---|
Description: | Predicting regulatory DNA elements based on epigenomic signatures. This package is more of a set of building blocks than a direct solution. REPTILE regulatory prediction pipeline is built on this R package. See <https://github.com/yupenghe/REPTILE> for more information. |
Authors: | Yupeng He |
Maintainer: | Yupeng He <[email protected]> |
License: | BSD_2_clause + file LICENSE |
Version: | 1.0 |
Built: | 2024-11-14 02:42:22 UTC |
Source: | https://github.com/cran/REPTILE |
Predicting DNA regulatory elements based on epigenomic signatures. This package is more of a set of building blocks than a direct solution. REPTILE regulatory prediction pipeline is built on this R package. Please check the url below for details:
https://github.com/yupenghe/REPTILE
Accurate enhancer identification is critical for understanding the spatiotemporal transcriptional regulation during development as well as the functional impact of disease-related non-coding genetic variants. REPTILE is a algorithm to identify the precise location of enhancers by integrating histone modification data and base-resolution DNA methylation profiles.
REPTILE was designed based on three observations: 1) regions that are differentially methylated (or differentially methylated regions, DMRs) across diverse cell and tissue types strongly overlap with enhancers. 2) With base-resolution DNA methylation data, the boundaries of DMRs can be accurately defined, circumventing the difficulty of determining enhancer boundaries. 3) DMR size is often smaller (~500bp) than known enhancers, known negative regions (regions with no observable enhancer activity) and genomic windows used in enhancer prediction (~2kb), all of which we termed as "query regions". Together with the association between transcription factor binding and DNA methylation level, DMRs may serve as high-resolution enhancer candidates and capture the local epigenomic patterns that would otherwise be averaged/washed out in analysis focusing on the query regions.
Running REPTILE involves four major steps. First, to identify DMRs, we compared the methylomes of target sample (where putative enhancers will be generated) and several other samples with different cell/tissue types (as reference). In the next step, input files for REPTILE are prepared, which store the information of query regions, DMRs and the epigenomic data. Taking these inputs, REPTILE represents each DMR or query region as a feature vector, where each element corresponds to either intensity or intensity deviation of one epigenetic mark. Intensity deviation is defined as the intensity in target sample subtracted by the mean intensity in reference samples (i.e. reference epigenome) and it captures the tissue-specificity of each epigenetic mark. In the third step, based on the feature vectors of known enhancers and negative regions as well as the feature vectors of the DMRs within them, we trained an enhancer model, containing two random forest classifiers, which respectively predict enhancer activities of query regions and DMRs. In the last step, REPTILE uses the enhancer model to calculate enhancer confidence scores for DMRs and query regions, based on which the final predictions are made.
The two key concepts on REPTILE are:
Query regions - known enhancers, known negative regions and genomic windows used for enhancer prediction
DMRs - differentially methylated regions
In REPTILE, DMRs are used as high-resolution candidates to capture the fine epigenomic signatures in query regions.
Yupeng He
Maintainer: Yupeng He <[email protected]>
He, Yupeng et al., REPTILE: Regulatory Element Prediction based on TIssue-specific Local Epigenetic marks, in preparation
library("REPTILE") data("rsd") ## Training (needs a few minutes and ~1.8 Gb memory) reptile.model <- reptile_train(rsd$training_data$region_epimark, rsd$training_data$region_label, rsd$training_data$DMR_epimark, rsd$training_data$DMR_label, ntree=50) ## Prediction ## - REPTILE pred <- reptile_predict(reptile.model, rsd$test_data$region_epimark, rsd$test_data$DMR_epimark) ## - Random guessing pred_guess = runif(length(pred$D)) names(pred_guess) = names(pred$D) ## Evaluation res_reptile <- reptile_eval_prediction(pred$D, rsd$test_data$region_label) res_guess <- reptile_eval_prediction(pred_guess, rsd$test_data$region_label) ## - Print AUROC and AUPR cat(paste0("REPTILE\n", " AUROC = ",round(res_reptile$AUROC,digit=3), "\n", " AUPR = ",round(res_reptile$AUPR,digit=3)) ,"\n") cat(paste0("Random guessing\n", " AUROC = ",round(res_guess$AUROC,digit=3), "\n", " AUPR = ",round(res_guess$AUPR,digit=3)) ,"\n")
library("REPTILE") data("rsd") ## Training (needs a few minutes and ~1.8 Gb memory) reptile.model <- reptile_train(rsd$training_data$region_epimark, rsd$training_data$region_label, rsd$training_data$DMR_epimark, rsd$training_data$DMR_label, ntree=50) ## Prediction ## - REPTILE pred <- reptile_predict(reptile.model, rsd$test_data$region_epimark, rsd$test_data$DMR_epimark) ## - Random guessing pred_guess = runif(length(pred$D)) names(pred_guess) = names(pred$D) ## Evaluation res_reptile <- reptile_eval_prediction(pred$D, rsd$test_data$region_label) res_guess <- reptile_eval_prediction(pred_guess, rsd$test_data$region_label) ## - Print AUROC and AUPR cat(paste0("REPTILE\n", " AUROC = ",round(res_reptile$AUROC,digit=3), "\n", " AUPR = ",round(res_reptile$AUPR,digit=3)) ,"\n") cat(paste0("Random guessing\n", " AUROC = ",round(res_guess$AUROC,digit=3), "\n", " AUPR = ",round(res_guess$AUPR,digit=3)) ,"\n")
Internal function used to calculate the intensity deviation features. It is based on the epigenomic signatures of a given region in target sample, where prediction will be generated, and reference samples. Intensity deviation is defined as the intensity in target sample subtracted by the mean intensity in reference samples (i.e. reference epigenome) and it captures the tissue-specificity of each epigenetic mark.
calculate_epimark_deviation(data_info, x, query_sample, ref_sample = NULL)
calculate_epimark_deviation(data_info, x, query_sample, ref_sample = NULL)
data_info |
data.frame instance generated by reading data information file specifying the samples and marks used in the analysis. The data.frame includes at least two columns named "sample" and "mark", corresponding to the samples and marks included. |
x |
data.frame instance generated by reading epimark file. The first four columns of the data.frame are "chr", "start", "end" and "id" of each region in the epimark file. The rest columns contain values of epigenetic marks in samples as specified in data_info and column names are under MARK_SAMPLE format, such as "H3K4me1_mESC". |
query_sample |
name of the target sample |
ref_sample |
a vector of names of the reference sample(s) |
data.frame instance containing intensity deviation values of each mark
Yupeng He [email protected]
Internal function used to parsing options for "REPTILE_compute_score.R" script in the REPTILE enhancer prediction pipeline:
https://github.com/yupenghe/REPTILE
get_option_parser_compute_score()
get_option_parser_compute_score()
An instance of the OptionParser class.
Yupeng He [email protected]
Internal function used to parsing options for "REPTILE_evaluate_prediction.R" script in the REPTILE enhancer prediction pipeline:
https://github.com/yupenghe/REPTILE
get_option_parser_evaluation()
get_option_parser_evaluation()
An instance of the OptionParser class.
Yupeng He [email protected]
Internal function used to parsing options for "REPTILE_train.R" script in the REPTILE enhancer prediction pipeline:
https://github.com/yupenghe/REPTILE
get_option_parser_training()
get_option_parser_training()
An instance of the OptionParser class.
Yupeng He [email protected]
Function to read epimark file from disk and generate data.frame instance. It is used to read epigenomic data from file on disk and generate the input data.frame instance to fuel the model training, prediction and other following steps. Epimark file is a tab-separated file with a header. The first four columns are "chr", "start", "end" and "id", specifying the chromosome, start, end and id of regions. Each of the remaining columns contain values of one epigenetic mark in one sample (condition, cell or tissue type, etc) and the column name follows "MARK_SAMPLE" format, such as "H3K4me1_mESC".
read_epigenomic_data(data_info, epimark_file, query_sample, ref_sample = NULL, incl_dev = T)
read_epigenomic_data(data_info, epimark_file, query_sample, ref_sample = NULL, incl_dev = T)
data_info |
data.frame generated by reading data information file specifying the samples and marks used in the analysis. The data.frame includes at least two columns named "sample" and "mark", corresponding to the samples and marks included. |
epimark_file |
name of epimark file |
query_sample |
name of the target sample |
ref_sample |
a vector of names of the reference sample(s) |
incl_dev |
logical value indicates whether to calculate the intensity deviation feature. Intensity deviation is defined as the intensity in target sample subtracted by the mean intensity in reference samples (i.e. reference epigenome) and it captures the tissue-specificity of each epigenetic mark. |
data.frame instance containing intensity and intensity deviation values of each mark for each region
Yupeng He [email protected]
Function to read epimark file from disk and generate data.frame instance. It is used to read epigenomic data from file on disk and generate the input data.frame instance to fuel the model training, prediction and other following steps. Label file is a tab-separated file with a header. The first column contains the id of each region. The second or more columns specify whether a certain region is enhancer (1) or not (0) in a specific sample. Each of these columns corresponds to one sample and the name of the column is the sample name.
read_label(label_file, query_sample)
read_label(label_file, query_sample)
label_file |
name of label file on disk |
query_sample |
name(s) of sample(s), in which you would like to have label information |
an data.frame instance containing label of each region in query samples The possible values and their meanings of a label are:
NA
- unknwon (will be ignored)
0
- not enhancer
1
- enhancer
Yupeng He [email protected]
Function used to evaluate the predictions by comparing
enhancer scores from reptile_predict
or
reptile_predict_genome_wide
and the correct labels. Area under
the Receiver Operating Characteristic (ROC) curve (AUROC) and Area
under the Precision-Recall curve (AUPR) will be calculated.
reptile_eval_prediction(predictions,annotations)
reptile_eval_prediction(predictions,annotations)
predictions |
vector of enhancer scores for regions. The name of each value (score) corresponds to the id of the region. |
annotations |
vector of labels for regions with the same length as predictions.
The name of each value (label) corresponds to the id of the
region. Only two values are allowed in |
A list containing two numbers
AUROC |
Area under the Receiver Operating Characteristic (ROC) curve |
AUPR |
Area under the Precision-Recall curve |
Yupeng He [email protected]
reptile_predict
,
reptile_predict_genome_wide
library("REPTILE") data("rsd") ## Training rsd_model <- reptile_train(rsd$training_data$region_epimark, rsd$training_data$region_label, rsd$training_data$DMR_epimark, rsd$training_data$DMR_label, ntree=50) ## Prediction ## - REPTILE pred <- reptile_predict(rsd_model, rsd$test_data$region_epimark, rsd$test_data$DMR_epimark) ## - Random guessing pred_guess = runif(length(pred$D)) names(pred_guess) = names(pred$D) ## Evaluation res_reptile <- reptile_eval_prediction(pred$D, rsd$test_data$region_label) res_guess <- reptile_eval_prediction(pred_guess, rsd$test_data$region_label) ## - Print AUROC and AUPR cat(paste0("REPTILE\n", " AUROC = ",round(res_reptile$AUROC,digit=3), "\n", " AUPR = ",round(res_reptile$AUPR,digit=3)) ,"\n") cat(paste0("Random guessing\n", " AUROC = ",round(res_guess$AUROC,digit=3), "\n", " AUPR = ",round(res_guess$AUPR,digit=3)) ,"\n")
library("REPTILE") data("rsd") ## Training rsd_model <- reptile_train(rsd$training_data$region_epimark, rsd$training_data$region_label, rsd$training_data$DMR_epimark, rsd$training_data$DMR_label, ntree=50) ## Prediction ## - REPTILE pred <- reptile_predict(rsd_model, rsd$test_data$region_epimark, rsd$test_data$DMR_epimark) ## - Random guessing pred_guess = runif(length(pred$D)) names(pred_guess) = names(pred$D) ## Evaluation res_reptile <- reptile_eval_prediction(pred$D, rsd$test_data$region_label) res_guess <- reptile_eval_prediction(pred_guess, rsd$test_data$region_label) ## - Print AUROC and AUPR cat(paste0("REPTILE\n", " AUROC = ",round(res_reptile$AUROC,digit=3), "\n", " AUPR = ",round(res_reptile$AUPR,digit=3)) ,"\n") cat(paste0("Random guessing\n", " AUROC = ",round(res_guess$AUROC,digit=3), "\n", " AUPR = ",round(res_guess$AUPR,digit=3)) ,"\n")
Predicting enhancer activities of query regions based on the enhancer
model from reptile_train
in training step. This function
calculates the combined enhancer score for each query region (given
region) as the maximum among the score of whole query region and the
scores of DMRs within it. This function is for generating genome-wide
enhancer predictions.
reptile_predict(reptile_model, epimark_region, epimark_DMR = NULL, family = "randomForest")
reptile_predict(reptile_model, epimark_region, epimark_DMR = NULL, family = "randomForest")
reptile_model |
Enhancer model from |
epimark_region |
data.frame instance from read_epigenomic_data, which containing intensity and intensity deviation values of each mark for each query region |
epimark_DMR |
data.frame instance from read_epigenomic_data, which containing intensity and intensity deviation values of each mark for each DMR |
family |
classifier family used in the enhancer model Default: RandomForest Classifiers available: - RandomForest: random forest - Logistic: logistic regression |
A list containing three vectors
D |
Combined enhancer score of each query region |
R |
Enhancer score of each query region |
DMR |
Enhancer score of each DMR |
Yupeng He [email protected]
library("REPTILE") data("rsd") ## Training rsd_model <- reptile_train(rsd$training_data$region_epimark, rsd$training_data$region_label, rsd$training_data$DMR_epimark, rsd$training_data$DMR_label, ntree=50) ## Prediction ## - REPTILE pred <- reptile_predict(rsd_model, rsd$test_data$region_epimark, rsd$test_data$DMR_epimark) ## - Random guessing pred_guess = runif(length(pred$D)) names(pred_guess) = names(pred$D) ## Evaluation res_reptile <- reptile_eval_prediction(pred$D, rsd$test_data$region_label) res_guess <- reptile_eval_prediction(pred_guess, rsd$test_data$region_label) ## - Print AUROC and AUPR cat(paste0("REPTILE\n", " AUROC = ",round(res_reptile$AUROC,digit=3), "\n", " AUPR = ",round(res_reptile$AUPR,digit=3)) ,"\n") cat(paste0("Random guessing\n", " AUROC = ",round(res_guess$AUROC,digit=3), "\n", " AUPR = ",round(res_guess$AUPR,digit=3)) ,"\n")
library("REPTILE") data("rsd") ## Training rsd_model <- reptile_train(rsd$training_data$region_epimark, rsd$training_data$region_label, rsd$training_data$DMR_epimark, rsd$training_data$DMR_label, ntree=50) ## Prediction ## - REPTILE pred <- reptile_predict(rsd_model, rsd$test_data$region_epimark, rsd$test_data$DMR_epimark) ## - Random guessing pred_guess = runif(length(pred$D)) names(pred_guess) = names(pred$D) ## Evaluation res_reptile <- reptile_eval_prediction(pred$D, rsd$test_data$region_label) res_guess <- reptile_eval_prediction(pred_guess, rsd$test_data$region_label) ## - Print AUROC and AUPR cat(paste0("REPTILE\n", " AUROC = ",round(res_reptile$AUROC,digit=3), "\n", " AUPR = ",round(res_reptile$AUPR,digit=3)) ,"\n") cat(paste0("Random guessing\n", " AUROC = ",round(res_guess$AUROC,digit=3), "\n", " AUPR = ",round(res_guess$AUPR,digit=3)) ,"\n")
Predicting enhancer activities of query regions based on the enhancer
model from reptile_train
in training step. This function
calculates the enhancer scores of DMRs and query regions. It does not
try to generate combined enhancer scores.
reptile_predict_genome_wide(reptile_model, epimark_region, epimark_DMR = NULL, family = "randomForest")
reptile_predict_genome_wide(reptile_model, epimark_region, epimark_DMR = NULL, family = "randomForest")
reptile_model |
Enhancer model from |
epimark_region |
data.frame instance from read_epigenomic_data, which containing intensity and intensity deviation values of each mark for each query region |
epimark_DMR |
data.frame instance from read_epigenomic_data, which containing intensity and intensity deviation values of each mark for each DMR |
family |
classifier family used in the enhancer model Default: RandomForest Classifiers available: - RandomForest: random forest - Logistic: logistic regression |
A list containing two vectors
R |
Enhancer score of each query region |
DMR |
Enhancer score of each DMR |
Yupeng He [email protected]
library("REPTILE") data("rsd") ## Training rsd_model <- reptile_train(rsd$training_data$region_epimark, rsd$training_data$region_label, rsd$training_data$DMR_epimark, rsd$training_data$DMR_label, ntree=50) ## Prediction ## - REPTILE pred <- reptile_predict(rsd_model, rsd$test_data$region_epimark, rsd$test_data$DMR_epimark) ## - Random guessing pred_guess = runif(length(pred$D)) names(pred_guess) = names(pred$D) ## Evaluation res_reptile <- reptile_eval_prediction(pred$D, rsd$test_data$region_label) res_guess <- reptile_eval_prediction(pred_guess, rsd$test_data$region_label) ## - Print AUROC and AUPR cat(paste0("REPTILE\n", " AUROC = ",round(res_reptile$AUROC,digit=3), "\n", " AUPR = ",round(res_reptile$AUPR,digit=3)) ,"\n") cat(paste0("Random guessing\n", " AUROC = ",round(res_guess$AUROC,digit=3), "\n", " AUPR = ",round(res_guess$AUPR,digit=3)) ,"\n")
library("REPTILE") data("rsd") ## Training rsd_model <- reptile_train(rsd$training_data$region_epimark, rsd$training_data$region_label, rsd$training_data$DMR_epimark, rsd$training_data$DMR_label, ntree=50) ## Prediction ## - REPTILE pred <- reptile_predict(rsd_model, rsd$test_data$region_epimark, rsd$test_data$DMR_epimark) ## - Random guessing pred_guess = runif(length(pred$D)) names(pred_guess) = names(pred$D) ## Evaluation res_reptile <- reptile_eval_prediction(pred$D, rsd$test_data$region_label) res_guess <- reptile_eval_prediction(pred_guess, rsd$test_data$region_label) ## - Print AUROC and AUPR cat(paste0("REPTILE\n", " AUROC = ",round(res_reptile$AUROC,digit=3), "\n", " AUPR = ",round(res_reptile$AUPR,digit=3)) ,"\n") cat(paste0("Random guessing\n", " AUROC = ",round(res_guess$AUROC,digit=3), "\n", " AUPR = ",round(res_guess$AUPR,digit=3)) ,"\n")
Internal function used to predict the enhancer activity of either DMRs or query regions.
reptile_predict_one_mode(reptile_classifier, epimark, family)
reptile_predict_one_mode(reptile_classifier, epimark, family)
reptile_classifier |
An object of class |
epimark |
data.frame instance from read_epigenomic_data, which containing intensity and intensity deviation values of each mark for each query region |
family |
classifier family used in the enhancer model Default: RandomForest Classifiers available: - RandomForest: random forest - Logistic: logistic regression |
A vecotr of enhancer score of each query region or DMR
Yupeng He [email protected]
reptile_predict
,
reptile_predict_genome_wide
Learn a REPTILE enhancer model based on epigenomic signature of known enhancers.
reptile_train(epimark_region, label_region, epimark_DMR = NULL, label_DMR = NULL, family = "randomForest", ntree = 2000, nodesize = 1)
reptile_train(epimark_region, label_region, epimark_DMR = NULL, label_DMR = NULL, family = "randomForest", ntree = 2000, nodesize = 1)
epimark_region |
data.frame instance from read_epigenomic_data, which containing intensity and intensity deviation values of each mark for each query region. |
label_region |
factor instance from read_label, containing the label of each query region. The possible values and their meanings of a label are: 0 (not enhancer), 1 (enhancer) and NA (unknwon and it will be ignored). |
epimark_DMR |
data.frame instance from read_epigenomic_data, which containing intensity and intensity deviation values of each mark for each DMR. If either this value or label_DMR is NULL, the output enhancer model will not inlclude a classifier for predicting the enhancer activities of DMRs. Default: NULL |
label_DMR |
factor instance from read_label, containing the label of each DMR. The possible values and their meanings of a label are: 0 (not enhancer), 1 (enhancer) and NA (unknwon and it will be ignored). If either this value or label_DMR is NULL, the output enhancer model will not inlclude a classifier for predicting the enhancer activities of DMRs. Default: NULL |
family |
classifier family used in the enhancer model Default: RandomForest Classifiers available: - RandomForest: random forest - Logistic: logistic regression |
ntree |
Number of tree to be constructed in the random forest model. See the function randomForest() in "randomForest" package for more information. Default: 2000 |
nodesize |
Minimum size of terminal nodes. See the function randomForest() in "randomForest" package for more information. Default: 1 |
A list containing two objects of class randomForest
.
D |
Classifier for DMRs. It is an |
R |
Classifier for query regions. It is an |
Yupeng He [email protected]
Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.
A. Liaw and M. Wiener (2002), Classification and Regression by randomForest, R News 2(3), 18–22.
read_epigenomic_data
,
read_label
,
reptile_predict
library("REPTILE") data("rsd") ## Training rsd_model <- reptile_train(rsd$training_data$region_epimark, rsd$training_data$region_label, rsd$training_data$DMR_epimark, rsd$training_data$DMR_label, ntree=5) print(rsd_model$D) print(rsd_model$R)
library("REPTILE") data("rsd") ## Training rsd_model <- reptile_train(rsd$training_data$region_epimark, rsd$training_data$region_label, rsd$training_data$DMR_epimark, rsd$training_data$DMR_label, ntree=5) print(rsd_model$D) print(rsd_model$R)
Internal function to learn a random forest classifier
reptile_train_one_mode(epimark, label, family, ntree, nodesize)
reptile_train_one_mode(epimark, label, family, ntree, nodesize)
epimark |
|
label |
|
family |
Classifier family used in the enhancer model Default: RandomForest Classifiers available: - RandomForest: random forest - Logistic: logistic regression |
ntree |
Number of tree to be constructed in the random forest model. See the function randomForest() in "randomForest" package for more information. Default: 2000 |
nodesize |
Minimum size of terminal nodes. See the function randomForest() in "randomForest" package for more information. Default: 1 |
An randomForest
object or glm
object when family
is set to be "Logistic".
Yupeng He [email protected]
Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.
A. Liaw and M. Wiener (2002), Classification and Regression by randomForest, R News 2(3), 18–22.
sample data for testing REPTILE training, prediction and evaluation.
data(rsd)
data(rsd)
A list containing two lists.
training_data
is the data used for training REPTILE enhancer
model. This list has four elements: region_epimark
,
DMR_epimark
, region_label
and DMR_label
. The
former two store the epigenomic signatures of query regions and
DMRs. The latter two label which a certain query region or DMR is
enhancer (1
) or negative instance (0
)
test_data
is for training REPTILE enhancer model and it
has four elements: region_epimark
, DMR_epimark
and
region_label
. The former two store the
epigenomic signatures of query regions and DMRs. The
region_label
indicates whether a certain query region or DMR is
enhancer (1
) or negative instance (0
)
Yupeng He [email protected]
training_data
was based on the EP300 binding sites (positives),
promoters (negatives) and randomly chosen genomic loci (negatives) in
mouse embryonic stem cells.
The test_data
data was constructed based on in vivo
validated mouse sequences from VISTA enhancer browser (Oct 24th,
2015). The labels indicate the activity in mouse heart tissues from
E11.5 embryo.
See the papers included in References for details.
He, Yupeng et al., REPTILE: Regulatory Element Prediction based on TIssue-specific Local Epigenetic marks, in preparation
Visel, Axel et al. (2007), VISTA Enhancer Browser - a database of tissue-specific human enhancers Nucleic acids research 35. suppl 1 http://enhancer.lbl.gov/
## Visualizing rsd data library("REPTILE") data(rsd) ## Epigenomic signature of query region grouped by labels ind_pos = rsd$training_data$region_label == 1 pos_region = rsd$training_data$region_epimark[ind_pos,] neg_region = rsd$training_data$region_epimark[!ind_pos,] ## Epigenomic signature of DMRs grouped by labels ind_pos = rsd$training_data$DMR_label == 1 pos_DMR = rsd$training_data$DMR_epimark[ind_pos,] neg_DMR = rsd$training_data$DMR_epimark[!ind_pos,] ## Prepare the data format required for plotting n = ncol(rsd$training_data$DMR_epimark) ## Number of features feature_data_DMR = list() feature_data_region = list() for(i in 1:n){ feature_data_DMR <- append(feature_data_DMR, list(neg_DMR[,i],pos_DMR[,i], NA,NA)) feature_data_region <- append(feature_data_region, list(neg_region[,i],pos_region[,i], NA,NA)) } ## Plot the feature distribution par(mar=c(4,8,4,4)) ## - query region b <- boxplot(feature_data_region, xlab = "feature value", notch=TRUE,outline=FALSE,yaxt='n', xlim = c(1,n*4-2),ylim=c(-7,7), horizontal=TRUE, col=c(rgb(65,105,225,max=255),rgb(250,128,114,max=255)), main = "Feature value distribution in query regions" ) text(par("usr")[1]-0.2, seq(1.5,n*4-2,by=4), labels=gsub("_","-",colnames(rsd$training_data$region_epimark)), xpd = TRUE,adj=1) legend(-8,4*n+4,c("negative","enhancer"),ncol=2, fill = c(rgb(250,128,114,max=255),rgb(65,105,225,max=255)), xpd=TRUE,bty='n') ## - DMR b <- boxplot(feature_data_DMR, xlab = "feature value", notch=TRUE,outline=FALSE,yaxt='n', xlim = c(1,n*4-2),ylim=c(-7,7), horizontal=TRUE, col=c(rgb(65,105,225,max=255),rgb(250,128,114,max=255)), main = "Feature value distribution in DMRs" ) text(par("usr")[1]-0.2, seq(1.5,n*4-2,by=4), labels=gsub("_","-",colnames(rsd$training_data$DMR_epimark)), xpd = TRUE,adj=1) legend(-8,4*n+4,c("negative","enhancer"),ncol=2, fill = c(rgb(250,128,114,max=255),rgb(65,105,225,max=255)), xpd=TRUE,bty='n')
## Visualizing rsd data library("REPTILE") data(rsd) ## Epigenomic signature of query region grouped by labels ind_pos = rsd$training_data$region_label == 1 pos_region = rsd$training_data$region_epimark[ind_pos,] neg_region = rsd$training_data$region_epimark[!ind_pos,] ## Epigenomic signature of DMRs grouped by labels ind_pos = rsd$training_data$DMR_label == 1 pos_DMR = rsd$training_data$DMR_epimark[ind_pos,] neg_DMR = rsd$training_data$DMR_epimark[!ind_pos,] ## Prepare the data format required for plotting n = ncol(rsd$training_data$DMR_epimark) ## Number of features feature_data_DMR = list() feature_data_region = list() for(i in 1:n){ feature_data_DMR <- append(feature_data_DMR, list(neg_DMR[,i],pos_DMR[,i], NA,NA)) feature_data_region <- append(feature_data_region, list(neg_region[,i],pos_region[,i], NA,NA)) } ## Plot the feature distribution par(mar=c(4,8,4,4)) ## - query region b <- boxplot(feature_data_region, xlab = "feature value", notch=TRUE,outline=FALSE,yaxt='n', xlim = c(1,n*4-2),ylim=c(-7,7), horizontal=TRUE, col=c(rgb(65,105,225,max=255),rgb(250,128,114,max=255)), main = "Feature value distribution in query regions" ) text(par("usr")[1]-0.2, seq(1.5,n*4-2,by=4), labels=gsub("_","-",colnames(rsd$training_data$region_epimark)), xpd = TRUE,adj=1) legend(-8,4*n+4,c("negative","enhancer"),ncol=2, fill = c(rgb(250,128,114,max=255),rgb(65,105,225,max=255)), xpd=TRUE,bty='n') ## - DMR b <- boxplot(feature_data_DMR, xlab = "feature value", notch=TRUE,outline=FALSE,yaxt='n', xlim = c(1,n*4-2),ylim=c(-7,7), horizontal=TRUE, col=c(rgb(65,105,225,max=255),rgb(250,128,114,max=255)), main = "Feature value distribution in DMRs" ) text(par("usr")[1]-0.2, seq(1.5,n*4-2,by=4), labels=gsub("_","-",colnames(rsd$training_data$DMR_epimark)), xpd = TRUE,adj=1) legend(-8,4*n+4,c("negative","enhancer"),ncol=2, fill = c(rgb(250,128,114,max=255),rgb(65,105,225,max=255)), xpd=TRUE,bty='n')