| Title: | 'ggplot2'-Based Tools for Visualising DNA Sequences and Modifications |
|---|---|
| Description: | Uses 'ggplot2' to visualise either (a) a single DNA/RNA sequence split across multiple lines, (b) multiple DNA/RNA sequences, each occupying a whole line, or (c) base modifications such as DNA methylation called by modified bases models in Dorado or Guppy. Functions starting with visualise_<>() are the main plotting functions, and functions starting with extract_and_sort_<>() are key helper functions for reading files and reformatting data. Source code is available at <https://github.com/ejade42/ggDNAvis>, a full non-expert user guide is available at <https://ejade42.github.io/ggDNAvis/>, and an interactive web-app version of the software is available at <https://ejade42.github.io/ggDNAvis/articles/interactive_app.html>. |
| Authors: | Evelyn Jade [aut, cre, cph] (ORCID: <https://orcid.org/0009-0003-7761-5425>) |
| Maintainer: | Evelyn Jade <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.0.9001 |
| Built: | 2026-06-03 03:30:37 UTC |
| Source: | https://github.com/ejade42/ggdnavis |
(ggDNAvis helper) This function takes an argument name, a named list of arguments
(presumably being iterated over for a particular validation check),
and a message. Using rlang::abort(), it prints an error message of the form:
    Argument '<argument_name>' <message>
    Current value: <argument_value>
    Current class: <class(argument_value)>
If the argument value is a named item (i.e. names(arguments_list[[argument_name]])
is not NULL), or if force_names is TRUE, then the form will be:
    Argument '<argument_name>' <message>
    Current value: <argument_value>
    Current names: <argument_names>
    Current class: <class(argument_value)>
bad_arg(argument_name, arguments_list, message, class = "argument_value_or_type", force_names = FALSE)
argument_name |
|
arguments_list |
|
message |
|
class |
|
force_names |
|
Nothing, but causes an error exit via rlang::abort()
    ## Obviously this error-message function causes an error,
    ## so needs to be wrapped in try() for these examples

    ## Standard use
    positive_args <- list(number = -1)
    try(bad_arg("number", positive_args, "must be positive"))

    ## Automatically detects named item and prints names
    named <- list(x = c("first item" = 1, "second item" = 7))
    try(bad_arg("x", named, "is not acceptable"))

    ## Can force name printing
    try(bad_arg("number", positive_args, "must be positive", force_names = TRUE))
(ggDNAvis helper) This function takes a single base and numerically
encodes it for visualisation via rasterise_matrix().
Encoding: A = 1, C = 2, G = 3, T/U = 4.
convert_base_to_number(base)
base |
|
integer. The corresponding number.
    convert_base_to_number("A")
    convert_base_to_number("c")
    convert_base_to_number("g")
    convert_base_to_number("T")
    convert_base_to_number("u")
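The encoding described above can be sketched as a simple lookup in base R. This is only an illustration: encode_base_sketch() is a hypothetical name and not part of ggDNAvis, case-insensitivity is inferred from the lower-case examples, and rejecting any other character is an assumption about behaviour that the text does not describe.

```r
# Hypothetical base-R sketch of the A/C/G/T(U) -> 1/2/3/4 encoding.
# Not the package's implementation.
encode_base_sketch <- function(base) {
  lookup <- c(A = 1L, C = 2L, G = 3L, T = 4L, U = 4L)
  code <- lookup[toupper(base)]
  # Assumption: anything that is not A/C/G/T/U is rejected
  if (is.na(code)) stop("unrecognised base: ", base)
  unname(code)
}

# Encode a whole sequence one base at a time
sequence_codes <- vapply(
  strsplit("ATCG", "")[[1]],
  encode_base_sketch,
  integer(1),
  USE.NAMES = FALSE
)
print(sequence_codes)
```

Treating T and U identically means DNA and RNA sequences map onto the same four numeric values.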
(write_modified_fastq() helper) This function takes a vector of modified base locations as absolute indices
(i.e. a 1 would mean the first base in the sequence has been assessed for
modification; a 15 would mean the 15th base has), and converts it to a vector
in the format of the SAM/BAM MM tags. The MM tag defines a particular target base (e.g.
C for methylation), and then stores the number of skipped instances of that base
between sites where modification was assessed. In practice, this often means counting the
number of non-CpG Cs in between CpG Cs. In a GGC repeat, this should be a bunch of 0s
as every C is in a CpG, but unique sequence will have many non-CpG Cs.
This function is reversed by convert_MM_vector_to_locations().
convert_locations_to_MM_vector(sequence, locations, target_base = "C")
sequence |
|
locations |
|
target_base |
|
integer vector. A component of a SAM MM tag, representing the number of skipped target bases in between each assessed base.
    convert_locations_to_MM_vector(
        "GGCGGCGGCGGC",
        locations = c(3, 6, 9, 12),
        target_base = "C"
    )

    convert_locations_to_MM_vector(
        "GGCGGCGGCGGC",
        locations = c(1, 4, 7, 10),
        target_base = "G"
    )

    convert_locations_to_MM_vector(
        "GGCGGCGGCGGC",
        locations = c(1, 2, 4, 5, 7, 8, 10, 11),
        target_base = "G"
    )
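To make the skip-counting idea concrete, here is a minimal base-R sketch of one way to derive the skip counts. The name locations_to_skips_sketch() is hypothetical and this is not the package's code; it assumes the locations are sorted in ascending order and that each one points at an instance of the target base.

```r
# Hypothetical sketch: absolute locations -> MM-style skip counts.
locations_to_skips_sketch <- function(sequence, locations, target_base = "C") {
  bases <- strsplit(sequence, "")[[1]]
  # Indices of every instance of the target base
  target_positions <- which(bases == target_base)
  # Rank of each assessed location among all target-base instances
  ranks <- match(locations, target_positions)
  if (anyNA(ranks)) stop("every location must point at a target base")
  # Gap between consecutive ranks, minus one, is the number skipped
  diff(c(0L, ranks)) - 1L
}

print(locations_to_skips_sketch("GGCGGCGGCGGC", c(3, 6, 9, 12), "C"))
print(locations_to_skips_sketch("GGCGGCGGCGGC", c(1, 4, 7, 10), "G"))
```

In the second call, only every other G is assessed, so each assessed G after the first is preceded by one skipped G.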
This function takes a sequence, a SAM-style vector of number of potential
target bases to skip in between each target base that was actually assessed,
and a target base type (defaults to "C" as 5-methylcytosine is most common).
It identifies the indices/locations of all instances of the target base within the
sequence, and then goes along the vector of these indices, skipping them if requested
by skips.
For example, the sequence "GGCGGCGGCGGC" with target "C" and skips c(0, 0, 1)
would identify that the indices where "C" occurs are c(3, 6, 9, 12). It would then
take the first index, the second index, skip one, and take the fourth index i.e.
return c(3, 6, 12). If instead the skips were given as c(0, 2) it would take the
first index, skip two, and take the fourth index i.e. return c(3, 12). If the skips
were given as c(1, 1) it would skip one, take the second index, skip one, and take
the fourth index i.e. return c(6, 12).
The length of skips corresponds to the number of indices/locations that will be returned
(i.e. the length of the returned locations vector).
Ideally the length of skips plus the sum of skips (i.e. the number returned plus the
total number skipped) is the same or less than the number of possible locations. If it is
the same, then the last possible location will be taken; if it is less then some number of
possible locations at the end will be skipped.
Important: if the length of skips plus the sum of skips is greater than the number
of possible locations (instances of the target base within the sequence), then the total
number of taken or skipped locations will be greater than the number of available locations.
In this case, the returned vector will contain NA after the available locations have run out.
In the example above, skips = c(0, 0, 0, 0, 0) would return c(3, 6, 9, 12, NA), and
skips = c(0, 2, 0) would return c(3, 12, NA).
Therefore, if the target base is totally absent from the sequence (e.g. target "A" in
"GGCGGCGGCGGC"), then any non-zero length of skips will return the same length of NAs e.g.
skips = c(0) would return NA, and skips = c(0, 1, 0) would return c(NA, NA, NA).
If skips has length zero, it will return numeric(0).
This function is reversed by convert_locations_to_MM_vector().
convert_MM_vector_to_locations(sequence, skips, target_base = "C")
sequence |
|
skips |
|
target_base |
|
integer vector. All of the base indices at which methylation/modification information was processed. Will all be instances of the target base.
    convert_MM_vector_to_locations(
        "GGCGGCGGCGGC",
        skips = c(0, 0, 0, 0),
        target_base = "C"
    )

    convert_MM_vector_to_locations(
        "GGCGGCGGCGGC",
        skips = c(1, 1, 1, 1),
        target_base = "G"
    )

    convert_MM_vector_to_locations(
        "GGCGGCGGCGGC",
        skips = c(0, 0, 2, 1, 0),
        target_base = "G"
    )
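The skipping behaviour described above, including the NA padding when the skips run past the available target bases, can be reproduced with a cumulative sum in base R. This is a hedged sketch rather than the package's implementation, and skips_to_locations_sketch() is a hypothetical name.

```r
# Hypothetical sketch: MM-style skip counts -> absolute locations.
skips_to_locations_sketch <- function(sequence, skips, target_base = "C") {
  bases <- strsplit(sequence, "")[[1]]
  # Indices of every instance of the target base
  target_positions <- which(bases == target_base)
  # Each entry consumes its skipped instances plus the one that is taken,
  # so the cumulative sum gives the rank of each taken instance.
  # Indexing past the end of target_positions yields NA in R.
  target_positions[cumsum(skips + 1)]
}

print(skips_to_locations_sketch("GGCGGCGGCGGC", c(0, 0, 1)))
print(skips_to_locations_sketch("GGCGGCGGCGGC", c(0, 2, 0)))
```

R's out-of-range indexing returns NA, so the padding described in the text falls out without any special-casing in this sketch.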
(visualise_methylation() helper) Takes modification locations (indices along the read signifying bases at which
modification probability was assessed) and modification probabilities (the probability
of modification at each assessed location, as an integer from 0 to 255), as comma-separated
strings (e.g. "1,5,25") produced from numerical vectors via vector_to_string().
Outputs a numerical vector giving the modification probability for each base along the read:
-2 for indices outside the sequence, -1 for bases where modification was not assessed,
and the probability from 0-255 for bases where modification was assessed.
convert_modification_to_number_vector(modification_locations_str, modification_probabilities_str, max_length, sequence_length)
modification_locations_str |
|
modification_probabilities_str |
|
max_length |
|
sequence_length |
|
numeric vector. A vector of length max_length indicating the probability of methylation at each index along the read: -2 for indices outside the sequence, -1 where methylation was not assessed, and probability from 0-255 where methylation was assessed.
    convert_modification_to_number_vector(
        modification_locations_str = "3,6,9,12",
        modification_probabilities_str = "100,200,50,150",
        max_length = 15,
        sequence_length = 13
    )
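As an illustration of the -2 / -1 / 0-255 encoding in the description, the conversion can be sketched in base R as follows. The function name modification_to_numbers_sketch() is hypothetical, and the treatment of indices beyond sequence_length as "outside the sequence" is an assumption drawn from the argument names, not confirmed package behaviour.

```r
# Hypothetical sketch of the per-base probability encoding.
modification_to_numbers_sketch <- function(locations_str, probabilities_str,
                                           max_length, sequence_length) {
  locations <- as.integer(strsplit(locations_str, ",")[[1]])
  probabilities <- as.integer(strsplit(probabilities_str, ",")[[1]])

  # Start with every index marked as "not assessed"
  out <- rep(-1L, max_length)
  # Assumption: indices past the end of the sequence are marked -2
  if (sequence_length < max_length) {
    out[(sequence_length + 1):max_length] <- -2L
  }
  # Fill in the probability at each assessed location
  out[locations] <- probabilities
  out
}

encoded <- modification_to_numbers_sketch("3,6,9,12", "100,200,50,150",
                                          max_length = 15, sequence_length = 13)
print(encoded)
```

Using negative sentinel values keeps a genuine probability of 0 distinguishable from "not assessed".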
(ggDNAvis helper) This function takes a sequence and encodes it as a vector
of numbers for visualisation via rasterise_matrix().
Encoding: A = 1, C = 2, G = 3, T/U = 4.
convert_sequence_to_numbers(sequence, length = NA)
sequence |
|
length |
|
integer vector. The numerical encoding of the input sequence, cut/padded to the desired length.
    convert_sequence_to_numbers("ATCGATCG")
    convert_sequence_to_numbers("ATCGATCG", length = NA)
    convert_sequence_to_numbers("ATCGATCG", length = 4)
    convert_sequence_to_numbers("ATCGATCG", length = 10)
(ggDNAvis helper) This function takes a vector of sequences (e.g. input to visualise_many_sequences()
or visualise_methylation(), or vector split from input to visualise_single_sequence()).
It converts it into a matrix e.g. c("GGCGGC", "", "ACGT", "") would become:
    G  G  C  G  G  C
    NA NA NA NA NA NA
    A  C  G  T  NA NA
    NA NA NA NA NA NA
The resulting matrix can then be rasterised into a coordinate-value dataframe via rasterise_matrix().
convert_sequences_to_matrix(sequences, line_length = NA, blank_value = NA)
sequences |
|
line_length |
|
blank_value |
|
matrix. A matrix of the sequences with one line per sequence, ready for rasterisation via rasterise_matrix().
    convert_sequences_to_matrix(
        sequences = c("GGCGGC", "", "ACGT", "")
    )

    convert_sequences_to_matrix(
        sequences = c("GGCGGC", "", "ACGT", ""),
        line_length = 10,
        blank_value = "X"
    )
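The padding behaviour in the matrix example can be sketched in base R by splitting each sequence into characters and filling the remainder of each row with the blank value. The name sequences_to_matrix_sketch() is hypothetical; defaulting line_length to the longest sequence and truncating over-long sequences are assumptions made for this sketch only.

```r
# Hypothetical sketch: character vector of sequences -> padded matrix.
sequences_to_matrix_sketch <- function(sequences, line_length = NA, blank_value = NA) {
  # Assumption: with no line_length given, use the longest sequence
  if (is.na(line_length)) line_length <- max(nchar(sequences))

  rows <- lapply(sequences, function(s) {
    chars <- strsplit(s, "")[[1]]
    # Assumption: sequences longer than line_length are truncated
    chars <- chars[seq_len(min(length(chars), line_length))]
    c(chars, rep(blank_value, line_length - length(chars)))
  })

  # Stack the equal-length rows into one matrix, one row per sequence
  do.call(rbind, rows)
}

m <- sequences_to_matrix_sketch(c("GGCGGC", "", "ACGT", ""))
print(m)
```

Empty strings become rows consisting entirely of the blank value, which is how they act as spacer lines.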
(ggDNAvis helper) Takes a character vector of sequences (which are allowed to be empty "" to
act as a spacing line) and rasterises it into a dataframe that ggplot can read.
create_image_data(sequences)
sequences |
|
dataframe. Rasterised dataframe representation of the sequences, readable by ggplot2::ggplot().
    create_image_data(c("ATCG", "", "GGCGGC", ""))
(ggDNAvis debug helper) Takes a numeric vector, and prints it to the console separated by ", ".
This allows the output to be copy-pasted into a vector within an R script.
Used for taking vector outputs and then writing them as literals within a script.
E.g. when given input 1:5, prints 1, 2, 3, 4, 5, which can be directly copy-pasted
within c() to input that vector. Printing normally via print(1:5) instead prints
[1] 1 2 3 4 5, which is not valid vector input so can't be copy-pasted directly.
See debug_join_vector_str() for the equivalent for character/string vectors.
debug_join_vector_num(vector)
vector |
|
None (invisible NULL) - uses cat() to output directly to console.
    debug_join_vector_num(1:5)
(ggDNAvis debug helper) Takes a character/string vector, and prints it to the console separated by ", ".
This allows the output to be copy-pasted into a vector within an R script.
Used for taking vector outputs and then writing them as literals within a script.
E.g. when given input strsplit("ABCD", split = "")[[1]], prints "A", "B", "C", "D",
which can be directly copy-pasted within c() to input that vector.
Printing normally via print(strsplit("ABCD", split = "")[[1]]) instead prints
[1] "A" "B" "C" "D", which is not valid vector input so can't be copy-pasted directly.
See debug_join_vector_num() for the equivalent for numeric vectors.
debug_join_vector_str(vector)
vector |
|
None (invisible NULL) - uses cat() to output directly to console.
    debug_join_vector_str(c("A", "B", "C", "D"))
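The copy-pasteable output format described for the two debug helpers can be produced with paste() and cat() in base R. The helper names below are hypothetical stand-ins used for illustration, not the package's functions.

```r
# Hypothetical sketches of the two formatting steps.
format_num_sketch <- function(vector) paste(vector, collapse = ", ")
format_str_sketch <- function(vector) paste0('"', vector, '"', collapse = ", ")

# cat() writes the text without the [1] prefix that print() would add
cat(format_num_sketch(1:5), "\n")
# prints: 1, 2, 3, 4, 5

cat(format_str_sketch(c("A", "B", "C", "D")), "\n")
# prints: "A", "B", "C", "D"
```

Either output can be pasted between c( and ) to recreate the original vector as a literal.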
A collection of made-up sequences in the style of long reads over a repeat region
(e.g. NOTCH2NLC), with meta-data describing the participant each read is from and
the family each participant is from. Can be used in visualise_many_sequences(),
visualise_methylation(), and helper functions to visualise these sequences.
Generation code is available at data-raw/example_many_sequences.R
example_many_sequences
A dataframe with 23 rows and 10 columns:
family: Participant family
individual: Participant ID
read: Unique read ID
sequence: DNA sequence of the read
sequence_length: Length (nucleotides) of the read
quality: FASTQ quality scores for the read. Each character represents a score from 0 to 40 - see fastq_quality_scores. These values are made up via pmin(pmax(round(rnorm(n, mean = 20, sd = 10)), 0), 40) i.e. sampled from a normal distribution with mean 20 and standard deviation 10, rounded to integers, and clamped to between 0 and 40 (inclusive) - see example_many_sequences.R
methylation_locations: Indices along the read (starting at 1) at which methylation probability was assessed i.e. CpG sites. Stored as a single character value per read, condensed from a numeric vector via vector_to_string().
methylation_probabilities: Probability of methylation (8-bit integer i.e. 0-255) for each assessed base. Stored as a single character value per read, condensed from a numeric vector via vector_to_string(). These values are made up via round(runif(n, min = 0, max = 255)) - see example_many_sequences.R
hydroxymethylation_locations: Indices along the read (starting at 1) at which hydroxymethylation probability was assessed i.e. CpG sites. Stored as a single character value per read, condensed from a numeric vector via vector_to_string().
hydroxymethylation_probabilities: Probability of hydroxymethylation (8-bit integer i.e. 0-255) for each assessed base. Stored as a single character value per read, condensed from a numeric vector via vector_to_string(). These values are made up via round(runif(n, min = 0, max = 255 - this_base_methylation_probability)) such that the summed methylation and hydroxymethylation probability never exceeds 255 (100%) - see example_many_sequences.R
    example_many_sequences
extract_methylation_from_dataframe() is an alias for extract_and_sort_methylation() - see aliases.
This function takes a dataframe that contains methylation information in the form of locations (indices along the read signifying bases at which modification probability was assessed) and probabilities (the probability of modification at each assessed location, as an integer from 0 to 255).
Each observation/row in the dataframe represents one sequence (e.g. a Nanopore read).
In the locations and probabilities column, each sequence (row) has many numbers associated.
These are stored as one string per observation e.g. "3,6,9,12", with the column representing
a character vector of such strings (e.g. c("3,6,9,12", "1,2,3,4")).
This function calls extract_and_sort_sequences() on the relevant columns and returns
a list of vectors stored in $locations, $probabilities, $sequences, and $lengths.
These can then be used as input for visualise_methylation().
Default arguments are set up to work with the included example_many_sequences data.
    extract_and_sort_methylation(
        modification_data,
        ...,
        locations_colname = "methylation_locations",
        probabilities_colname = "methylation_probabilities",
        sequences_colname = "sequence",
        lengths_colname = "sequence_length",
        grouping_levels = c(family = 8, individual = 2),
        sort_by = "sequence_length",
        desc_sort = TRUE
    )
modification_data |
|
... |
Used to recognise aliases e.g. American spellings or common misspellings - see aliases. If any American spellings do not work, please make a bug report at https://github.com/ejade42/ggDNAvis/issues. |
locations_colname |
|
probabilities_colname |
|
sequences_colname |
|
lengths_colname |
|
grouping_levels |
|
sort_by |
|
desc_sort |
|
list, containing $locations (character vector), $probabilities (character vector), $sequences (character vector), and $lengths (integer vector).
    ## See documentation for extract_and_sort_sequences()
    ## for more examples of changing sorting/grouping
    extract_and_sort_methylation(
        example_many_sequences,
        locations_colname = "methylation_locations",
        probabilities_colname = "methylation_probabilities",
        sequences_colname = "sequence",
        lengths_colname = "sequence_length",
        grouping_levels = c("family" = 8, "individual" = 2),
        sort_by = "sequence_length",
        desc_sort = TRUE
    )

    extract_and_sort_methylation(
        example_many_sequences,
        locations_colname = "hydroxymethylation_locations",
        probabilities_colname = "hydroxymethylation_probabilities",
        sequences_colname = "sequence",
        lengths_colname = "sequence_length",
        grouping_levels = c("family" = 8, "individual" = 2),
        sort_by = "sequence_length",
        desc_sort = TRUE
    )
extract_sequences_from_dataframe() is an alias for extract_and_sort_sequences() - see aliases.
This function takes a dataframe that contains sequences and metadata,
recursively splits it into multiple levels of groups defined by grouping_levels,
and adds breaks between each level of group as defined by grouping_levels.
Within each lowest-level group, reads are sorted by sort_by, with order determined
by desc_sort.
Default values are set up to work with the included dataset example_many_sequences.
The returned sequences vector is ideal input for visualise_many_sequences().
Also called by extract_methylation_from_dataframe() to produce input for visualise_methylation().
    extract_and_sort_sequences(
        sequence_dataframe,
        sequence_variable = "sequence",
        grouping_levels = c(family = 8, individual = 2),
        sort_by = "sequence_length",
        desc_sort = TRUE
    )
sequence_dataframe |
|
sequence_variable |
|
grouping_levels |
|
sort_by |
|
desc_sort |
|
character vector. The sequences ordered and grouped as specified, with blank sequences ("") inserted as spacers as specified.
    extract_and_sort_sequences(
        example_many_sequences,
        sequence_variable = "sequence",
        grouping_levels = c("family" = 8, "individual" = 2),
        sort_by = "sequence_length",
        desc_sort = TRUE
    )

    extract_and_sort_sequences(
        example_many_sequences,
        sequence_variable = "sequence",
        grouping_levels = c("family" = 3),
        sort_by = "sequence_length",
        desc_sort = FALSE
    )

    extract_and_sort_sequences(
        example_many_sequences,
        sequence_variable = "sequence",
        grouping_levels = NA,
        sort_by = "sequence_length",
        desc_sort = TRUE
    )

    extract_and_sort_sequences(
        example_many_sequences,
        sequence_variable = "sequence",
        grouping_levels = c("family" = 8, "individual" = 2),
        sort_by = NA
    )

    extract_and_sort_sequences(
        example_many_sequences,
        sequence_variable = "sequence",
        grouping_levels = NA,
        sort_by = NA
    )

    extract_and_sort_sequences(
        example_many_sequences,
        sequence_variable = "quality",
        grouping_levels = c("individual" = 3),
        sort_by = "quality",
        desc_sort = FALSE
    )
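To illustrate the group-then-sort-then-space idea on a small scale, the following base-R sketch handles a single grouping level on a made-up toy dataframe. It is not the package's implementation: group_and_sort_sketch() and the toy data are hypothetical, and the sketch assumes that the number attached to a grouping level is the count of blank spacer entries placed between consecutive groups.

```r
# Made-up toy data for illustration only
toy <- data.frame(
  family = c("F1", "F1", "F2", "F2", "F2"),
  sequence = c("GGC", "GGCGGC", "ACGT", "AC", "ACGTAC"),
  stringsAsFactors = FALSE
)
toy$sequence_length <- nchar(toy$sequence)

# Hypothetical single-level sketch: sort within groups, blanks between groups
group_and_sort_sketch <- function(df, group_col, n_blank, desc = TRUE) {
  out <- character(0)
  groups <- unique(df[[group_col]])
  for (i in seq_along(groups)) {
    rows <- df[df[[group_col]] == groups[i], ]
    rows <- rows[order(rows$sequence_length, decreasing = desc), ]
    out <- c(out, rows$sequence)
    # Assumption: n_blank empty strings separate consecutive groups
    if (i < length(groups)) out <- c(out, rep("", n_blank))
  }
  out
}

result <- group_and_sort_sketch(toy, "family", n_blank = 2)
print(result)
```

A multi-level version would apply the same step recursively inside each group, using the spacer count for the next level down.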
A vector of the characters used to indicate quality scores from 0 to 40
in the FASTQ format. These scores are related to the error probability $P$
via $Q = -10 \log_{10}(P)$, so a Q-score of 10 (represented by "+") means
the error probability is 0.1, a Q-score of 20 ("5") means the error probability
is 0.01, and a Q-score of 30 ("?") means the error probability is 0.001.
The character representations store Q-scores in one byte each by using ASCII encodings,
where the Q-score for a character is its ASCII code minus 33 (e.g. A has an ASCII
code of 65 and represents a Q-score of 32).
This vector contains the characters in order but starting with a score of 0, meaning
the character at index $i$ represents a Q-score of $i - 1$ e.g. the first
character ("!") represents a score of 0; the eleventh character ("+")
represents a score of 10.
The full set of possible score representations, in order and presented as a single
string, is !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI.
Generation code is available at data-raw/fastq_quality_scores.R
fastq_quality_scores
A character vector of length 41
The vector c("!", '"', "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ":", ";", "<", "=", ">", "?", "@", "A", "B", "C", "D", "E", "F", "G", "H", "I")
    fastq_quality_scores
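The relationship between Q-scores, their ASCII characters, and error probabilities can be checked directly in base R. The snippet below is an illustrative sketch: the object and function names are hypothetical, and it rebuilds an equivalent character vector rather than reading the packaged dataset.

```r
# Rebuild the 41 quality characters from ASCII codes 33 to 73
quality_chars <- intToUtf8(33 + 0:40, multiple = TRUE)

# Q-score -> character, and character(s) -> Q-score(s)
q_to_char <- function(q) intToUtf8(q + 33)
char_to_q <- function(chars) utf8ToInt(chars) - 33

# Q-score -> error probability, rearranged from Q = -10 * log10(P)
q_to_error_probability <- function(q) 10^(-q / 10)

print(paste(quality_chars, collapse = ""))
print(char_to_q("!+5?"))            # Q-scores 0, 10, 20, 30
print(q_to_error_probability(c(10, 20, 30)))
```

Because index 1 holds the character for a score of 0, a score q is found at position q + 1 of the vector.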
(ggDNAvis helper) This function takes two times (class "POSIXct") and formats
the difference between them nicely, with a certain number
of numerical characters printed.
Note that if the time difference, rounded to a whole
number of seconds (e.g. 1234 seconds), requires more space than
the number of characters allocated (e.g. 3 characters), then
the output will go beyond the specified number of characters.
However, this would require an exceptionally slow-running function.
In normal monitoring use for monitor(),
<1 second steps should be nearly universal, and <0.01 second
steps are very common.
format_time_diff(new_time, old_time, characters_to_print = 4)
new_time |
|
old_time |
|
characters_to_print |
|
character. The formatted time difference in seconds.
    ## POSIXct time is a very large number of seconds
    newer <- 1000000001
    older <- 1000000000
    format_time_diff(newer, older, 4)

    newer <- 1000000456.45645
    older <- 1000000000
    format_time_diff(newer, older, 4)
    format_time_diff(newer, older, 3)
    format_time_diff(newer, older, 2)

    newer <- 1000000000.011
    older <- 1000000000
    format_time_diff(newer, older, 4)
    format_time_diff(newer, older, 3)
    format_time_diff(newer, older, 2)
ggDNAvis aliases
As of v1.0.0, ggDNAvis supports function and argument aliases.
The code is entirely written with British spellings (e.g. visualise_methylation_colour_scale()),
but should also accept American spellings (e.g. visualize_methylation_color_scale()).
If any American spellings don't work, I most likely overlooked them and can easily fix,
so please submit a bug report by creating a github issue
(https://github.com/ejade42/ggDNAvis/issues).
All four major visualise_ functions have aliases to also accept visualize_ (e.g. visualize_methylation_color_scale() for visualise_methylation_colour_scale()).
As of v1.0.0, extract_methylation_from_dataframe() has been renamed extract_and_sort_methylation()
for consistency with extract_and_sort_sequences(). To preserve compatibility and ensure consistency,
both functions now accept either name formulation:
extract_and_sort_sequences() (extract_sequences_from_dataframe())
extract_and_sort_methylation() (extract_methylation_from_dataframe())
The builtin dataset sequence_colour_palettes, like all colour arguments, also accepts
color or col in place of colour.
The interactive shinyapp can be called via ggDNAvis_shinyapp() or ggDNAvis_shiny().
Additionally, the three rasterise_ helper functions also accept rasterize_ (e.g. rasterize_matrix() for rasterise_matrix()).
All arguments should have aliases configured. In particular, any _colour arguments
should also accept _color or _col.
When more than one equivalent argument is provided, the 'canonical' (British) argument
takes precedence, and will produce a warning message explaining this. For colours, _colour
takes precedence over _color, which itself takes precedence over _col.
I have also tried to provide aliases for common argument misspellings. In particular,
index_annotation_full_line also accepts any of index_annotations_full_lines,
index_annotation_full_lines, or index_annotations_full_line.
Likewise, index_annotations_above also accepts index_annotation_above.
    d <- extract_methylation_from_dataframe(example_many_sequences)

    ## The resulting low colour will be green
    visualise_methylation(
        d$locations,
        d$probabilities,
        d$sequences,
        index_annotation_lines = NA,
        outline_linewidth = 0,
        high_colour = "white",
        low_colour = "green",
        low_color = "orange",
        low_col = "purple"
    )

    ## The resulting low colour will be orange
    visualise_methylation(
        d$locations,
        d$probabilities,
        d$sequences,
        index_annotation_lines = NA,
        outline_linewidth = 0,
        high_colour = "white",
        low_color = "orange",
        low_col = "purple"
    )

    ## The resulting low colour will be purple
    visualise_methylation(
        d$locations,
        d$probabilities,
        d$sequences,
        index_annotation_lines = NA,
        outline_linewidth = 0,
        high_colour = "white",
        low_col = "purple"
    )
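The precedence rule (_colour over _color over _col) can be mimicked in a few lines of base R. This is a hypothetical sketch of one way such alias resolution could work, not a description of how ggDNAvis implements it; resolve_low_colour_sketch() is an invented name, and whether a warning is also raised when only _color and _col are both supplied is not stated in the text, so this sketch stays silent in that case.

```r
# Hypothetical sketch of alias precedence for a single colour argument.
# Placing ... first forces exact matching of low_colour, so low_col and
# low_color are not partially matched to it and instead land in ...
resolve_low_colour_sketch <- function(..., low_colour = NULL) {
  extras <- list(...)
  # Aliases listed in decreasing order of precedence
  supplied_aliases <- intersect(c("low_color", "low_col"), names(extras))

  if (!is.null(low_colour)) {
    if (length(supplied_aliases) > 0) {
      warning("'low_colour' takes precedence over: ",
              paste(supplied_aliases, collapse = ", "))
    }
    return(low_colour)
  }
  if (length(supplied_aliases) > 0) {
    return(extras[[supplied_aliases[1]]])
  }
  stop("no low colour supplied")
}

print(resolve_low_colour_sketch(low_color = "orange", low_col = "purple"))
print(resolve_low_colour_sketch(low_col = "purple"))
```

The position of ... matters: if low_colour came before it, R's partial argument matching would bind low_col to low_colour and the aliases could not be told apart.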
ggDNAvis shinyapp
ggDNAvis_shiny() is an alias for ggDNAvis_shinyapp() - see aliases.
The ggDNAvis shinyapp is an interactive frontend for the ggDNAvis functions.
Arguments can be configured via text/numerical/colour/checkbox entry rather than
on the command line. In the future it will be hosted online, but is
currently accessible only by running the shinyapp locally.
This function checks 'suggests' packages are present
(not needed for main package, but needed for the shinyapp)
and then runs the shinyapp in the inst/shinyapp directory.
ggDNAvis_shinyapp(themer = FALSE, return = FALSE)
themer |
|
return |
|
Nothing, or the shiny app object
    ## Not run:
    ## Run normally
    ggDNAvis_shinyapp()
    ggDNAvis_shinyapp(themer = FALSE, return = FALSE)

    ## Run with theme picker (dev)
    ggDNAvis_shinyapp(themer = TRUE)

    ## Run, returning object (dev)
    ggDNAvis_shinyapp(return = TRUE)
    ## End(Not run)
(visualise_many_sequences() helper) This function takes a vector (e.g. the output of extract_and_sort_sequences()) and
inserts a specified "blank" value at the specified indices.
If insert_before is TRUE then the blank value will be inserted before each
specified index, whereas if insert_before is FALSE then the blank value
will be inserted after each specified index.
insert_at_indices(original_vector, insertion_indices, insert_before = TRUE, insert = "", vert = NA)
original_vector |
|
insertion_indices |
|
insert_before |
|
insert |
|
vert |
|
vector. The original vector but with the insert value added before/after each specified index.
    insert_at_indices(c("A", "B", "C", "D", "E"), c(2, 4))

    insert_at_indices(
        c("A", "B", "C", "D", "E"),
        c(2, 4),
        insert_before = TRUE,
        insert = 0
    )

    insert_at_indices(
        c("A", "B", "C", "D", "E"),
        c(2, 4),
        insert_before = FALSE,
        insert = 0
    )

    insert_at_indices(
        original_vector = c("A", "B", "C", "D", "E"),
        insertion_indices = c(1, 4, 6),
        insert_before = TRUE,
        insert = c("X", "Y")
    )

    insert_at_indices(
        list("A", "B", "C", "D", "E"),
        c(2, 4),
        insert = TRUE
    )

    insert_at_indices(
        list("A", "B", "C", "D", "E"),
        c(2, 4),
        insert_before = FALSE,
        insert = list(TRUE, 7)
    )

    insert_at_indices(
        NA,
        c(1, 2),
        FALSE
    )

    insert_at_indices(
        c("A", "B", "C", "D", "E"),
        integer(0)
    )
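The before/after insertion behaviour can be sketched with base R's append(). This is an illustrative sketch under stated assumptions, not the package's code: insert_blanks_sketch() is a hypothetical name, indices are assumed to refer to positions in the original vector, and out-of-range indices are not handled.

```r
# Hypothetical sketch: insert a value before or after each given index.
insert_blanks_sketch <- function(x, indices, insert = "", before = TRUE) {
  # Work from the highest index down so that earlier positions
  # are not shifted by insertions made later in the vector
  for (i in sort(indices, decreasing = TRUE)) {
    x <- append(x, insert, after = if (before) i - 1 else i)
  }
  x
}

print(insert_blanks_sketch(c("A", "B", "C", "D", "E"), c(2, 4)))
print(insert_blanks_sketch(c("A", "B", "C", "D", "E"), c(2, 4), before = FALSE))
```

Processing indices in descending order is the key design choice here, since it lets every index be interpreted against the original, unmodified vector.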
Merge a dataframe of sequence and quality data (as produced by
read_fastq() from an unmodified FASTQ file) with a dataframe of
metadata, reverse-complementing sequences if required such that all
reads are now in the forward direction.
merge_methylation_with_metadata() is the equivalent function for
working with FASTQs that contain DNA modification information.
FASTQ dataframe must contain columns of "read" (unique read ID),
"sequence" (DNA sequence), and "quality" (FASTQ quality score).
Other columns are allowed but not required, and will be preserved unaltered
in the merged data.
Metadata dataframe must contain "read" (unique read ID) and "direction"
(read direction, either "forward" or "reverse" for each read) columns,
and can contain any other columns with arbitrary information for each read.
Columns that might be useful include participant ID and family designations
so that each read can be associated with its participant and family.
Important: A key feature of this function is that it uses the direction
column from the metadata to identify which rows are reverse reads. These reverse
reads will then be reverse-complemented and have quality scores reversed
such that all reads are in the forward direction, ideal for consistent analysis or
visualisation. The output columns are "forward_sequence" and "forward_quality".
Calls reverse_sequence_if_needed() and reverse_quality_if_needed()
to implement the reversing - see documentation for these functions for more details.
merge_fastq_with_metadata(fastq_data, metadata, reverse_complement_mode = "DNA")
fastq_data |
|
metadata |
|
reverse_complement_mode |
|
dataframe. A merged dataframe containing all columns from the input dataframes, as well as forward versions of sequences and qualities.
## Locate files
fastq_file <- system.file("extdata", "example_many_sequences_raw.fastq", package = "ggDNAvis")
metadata_file <- system.file("extdata", "example_many_sequences_metadata.csv", package = "ggDNAvis")
## Read files
fastq_data <- read_fastq(fastq_file)
metadata <- read.csv(metadata_file)
## Merge data (including reversing if needed)
merge_fastq_with_metadata(fastq_data, metadata)
## Merge data reversing but not complementing sequences
merge_fastq_with_metadata(fastq_data, metadata, reverse_complement_mode = "reverse_only")
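To make the merge-then-reverse idea concrete, here is a rough base R sketch on toy data. The helper names `reverse_string()` and `revcomp_dna()` and the toy values are assumptions for illustration. Only the column names ("read", "sequence", "quality", "direction", "forward_sequence", "forward_quality") come from the description above, and the package's own implementation may differ.

```r
## Toy inputs with the required columns (values are made up)
fastq_data <- data.frame(
    read     = c("read_1", "read_2"),
    sequence = c("GGACT", "AACG"),
    quality  = c("ABCDE", "WXYZ"),
    stringsAsFactors = FALSE
)
metadata <- data.frame(
    read      = c("read_1", "read_2"),
    direction = c("forward", "reverse"),
    stringsAsFactors = FALSE
)

## Hypothetical helpers, not the package's functions
reverse_string <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")
revcomp_dna    <- function(s) reverse_string(chartr("ACGT", "TGCA", s))

## Join on the shared read ID, then flip only the reverse reads
merged     <- merge(metadata, fastq_data, by = "read")
is_reverse <- tolower(merged$direction) == "reverse"

merged$forward_sequence <- ifelse(
    is_reverse,
    vapply(merged$sequence, revcomp_dna, character(1), USE.NAMES = FALSE),
    merged$sequence
)
merged$forward_quality <- ifelse(
    is_reverse,
    vapply(merged$quality, reverse_string, character(1), USE.NAMES = FALSE),
    merged$quality
)
merged
```

Note that quality strings are only reversed, never complemented, since each quality character belongs to a base position rather than to a base identity.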
Merge a dataframe of methylation/modification data (as produced by
read_modified_fastq()) with a dataframe of metadata, reversing
sequence and modification information if required such that all information
is now in the forward direction.
merge_fastq_with_metadata() is the equivalent function for working with
unmodified FASTQs (sequence and quality only).
Methylation/modification dataframe must contain columns of "read" (unique read ID),
"sequence" (DNA sequence), "quality" (FASTQ quality score), "sequence_length"
(read length), "modification_types" (a comma-separated string of SAMtools modification
headers produced via vector_to_string() e.g. "C+h?,C+m?"), and,
for each modification type, a column of comma-separated strings of modification
locations (e.g. "3,6,9,12") and a column of comma-separated strings of
modification probabilities (e.g. "255,0,64,128"). See read_modified_fastq()
for more information on how this dataframe is formatted and produced.
Other columns are allowed but not required, and will be preserved unaltered
in the merged data.
Metadata dataframe must contain "read" (unique read ID) and "direction"
(read direction, either "forward" or "reverse" for each read) columns,
and can contain any other columns with arbitrary information for each read.
Columns that might be useful include participant ID and family designations
so that each read can be associated with its participant and family.
Important: A key feature of this function is that it uses the direction
column from the metadata to identify which rows are reverse reads. These reverse
reads will then be reverse-complemented and have modification information reversed
such that all reads are in the forward direction, ideal for consistent analysis or
visualisation. The output columns are "forward_sequence", "forward_quality",
"forward_<modification_type>_locations", and "forward_<modification_type>_probabilities".
Calls reverse_sequence_if_needed(), reverse_quality_if_needed(),
reverse_locations_if_needed(), and reverse_probabilities_if_needed()
to implement the reversing - see documentation for these functions for more details.
If wanting to write reversed sequences to FASTQ via write_modified_fastq(), locations
must be symmetric (e.g. CpG) and offset must be set to 1. Asymmetric locations are impossible
to write to modified FASTQ once reversed because then e.g. cytosine methylation will be assessed
at guanines, which SAMtools can't account for. Symmetrically reversing CpGs via
reversed_location_offset = 1 is the only way to fix this.
PLEASE READ THE reverse_locations_if_needed() DOCUMENTATION TO UNDERSTAND THE CHOICE OF OFFSET!
merge_methylation_with_metadata(methylation_data, metadata, reversed_location_offset = 0, reverse_complement_mode = "DNA")
methylation_data |
|
metadata |
|
reversed_location_offset |
|
reverse_complement_mode |
|
dataframe. A merged dataframe containing all columns from the input dataframes, as well as forward versions of sequences, qualities, modification locations, and modification probabilities (with separate locations and probabilities columns created for each modification type in the modification data).
## Locate files
modified_fastq_file <- system.file("extdata", "example_many_sequences_raw_modified.fastq", package = "ggDNAvis")
metadata_file <- system.file("extdata", "example_many_sequences_metadata.csv", package = "ggDNAvis")
## Read files
methylation_data <- read_modified_fastq(modified_fastq_file)
metadata <- read.csv(metadata_file)
## Merge data (including reversing if needed)
merge_methylation_with_metadata(methylation_data, metadata, reversed_location_offset = 0)
## Merge data with offset = 1
merge_methylation_with_metadata(methylation_data, metadata, reversed_location_offset = 1)
## Merge data with offset = 1 but without complementing
merge_methylation_with_metadata(methylation_data, metadata, reversed_location_offset = 1, reverse_complement_mode = "reverse_only")
(ggDNAvis helper) This function is meant to be called frequently throughout
a main function, and if verbose performance monitoring is enabled
then it will print the elapsed time since (a) initialisation via
monitor_start() and (b) since the last step was recorded via
this function.
monitor(monitor_performance, start_time, previous_time, message)
monitor_performance |
|
start_time |
|
previous_time |
|
message |
|
POSIXct. The time at which the function was called, via Sys.time().
## Initialise monitoring
start_time <- monitor_start(TRUE, "my_cool_function")
## Step 1
monitor_time <- monitor(TRUE, start_time, start_time, "performing step 1")
x <- 2 + 2
## Step 2
monitor_time <- monitor(TRUE, start_time, monitor_time, "performing step 2")
y <- 10.5^6 %% 345789
## Step 3
monitor_time <- monitor(TRUE, start_time, monitor_time, "performing step 3")
z <- y / x^2
## Conclude monitoring
monitor_time <- monitor(TRUE, start_time, monitor_time, "done")
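The pattern of reporting both the total elapsed time and the time since the previous step can be sketched with base R alone. `log_step()` is a hypothetical stand-in written for illustration, and its message format is an assumption rather than what monitor() actually prints.

```r
## Hypothetical stand-in for the monitoring pattern; not the package's code
log_step <- function(enabled, start_time, previous_time, message) {
    now <- Sys.time()
    if (enabled) {
        ## Elapsed seconds since initialisation and since the previous step
        total <- as.numeric(difftime(now, start_time, units = "secs"))
        step  <- as.numeric(difftime(now, previous_time, units = "secs"))
        cat(sprintf("%s (%.3f s since start, %.3f s since last step)\n",
                    message, total, step))
    }
    ## Return the current time so the caller can pass it in as
    ## previous_time on the next call
    now
}

start_time <- Sys.time()
step_time  <- log_step(TRUE, start_time, start_time, "performing step 1")
step_time  <- log_step(TRUE, start_time, step_time, "performing step 2")
```

Returning the timestamp rather than storing it internally keeps the function free of hidden state, which matches how the examples above thread monitor_time through successive calls.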
(ggDNAvis helper) This function takes a bool of whether verbose performance monitoring is on, as well as the name of the calling function, prints a monitoring initialisation message (if desired), and returns the start time.
Later monitoring steps are performed by monitor().
monitor_start(monitor_performance, function_name)
monitor_performance |
|
function_name |
|
POSIXct. The time at which the function was initialised, via Sys.time().
## Initialise monitoring
start_time <- monitor_start(TRUE, "my_cool_function")
## Step 1
monitor_time <- monitor(TRUE, start_time, start_time, "performing step 1")
x <- 2 + 2
## Step 2
monitor_time <- monitor(TRUE, start_time, monitor_time, "performing step 2")
y <- 10.5^6 %% 345789
## Step 3
monitor_time <- monitor(TRUE, start_time, monitor_time, "performing step 3")
z <- y / x^2
## Conclude monitoring
monitor_time <- monitor(TRUE, start_time, monitor_time, "done")
(ggDNAvis helper) rasterize_index_annotations() is an alias for rasterise_index_annotations().
This function is called by
visualise_many_sequences(), visualise_methylation(), and visualise_single_sequence()
to create the x/y position data for placing the index annotations on the graph.
Its arguments are either intermediate variables produced by the visualisation functions,
or arguments of the visualisation functions directly passed through.
Returns a dataframe with x_position, y_position, and value columns, where the
values are the index annotations.
rasterise_index_annotations(new_sequences_vector, original_sequences_vector, index_annotation_lines, index_annotation_interval = 15, index_annotation_full_line = TRUE, index_annotations_above = TRUE, index_annotation_vertical_position = 1/3, index_annotation_always_first_base = TRUE, index_annotation_always_last_base = TRUE, sum_indices = FALSE, spacing = NA, offset_start = 0)
new_sequences_vector |
|
original_sequences_vector |
|
index_annotation_lines |
|
index_annotation_interval |
|
index_annotation_full_line |
|
index_annotations_above |
|
index_annotation_vertical_position |
|
index_annotation_always_first_base |
|
index_annotation_always_last_base |
|
sum_indices |
|
spacing |
|
offset_start |
|
dataframe. A dataframe with columns x, y, and value, with one observation per annotation number that needs to be drawn onto the ggplot.
## Set up arguments (e.g. from visualise_many_sequences() call)
sequences_data <- example_many_sequences
index_annotation_lines <- c(1, 23, 37)
index_annotation_interval <- 10
index_annotations_above <- TRUE
index_annotation_full_line <- FALSE
index_annotation_vertical_position <- 1/3
## Create sequences vector
sequences <- extract_and_sort_sequences(example_many_sequences, grouping_levels = c("family" = 8, "individual" = 2))
sequences
## Insert blank rows as needed
new_sequences <- insert_at_indices(sequences, insertion_indices = index_annotation_lines, insert_before = index_annotations_above, insert = "", vert = index_annotation_vertical_position)
new_sequences
## Create annotation dataframe
rasterise_index_annotations(
    new_sequences_vector = new_sequences,
    original_sequences_vector = sequences,
    index_annotation_lines = index_annotation_lines,
    index_annotation_interval = 10,
    index_annotation_full_line = index_annotation_full_line,
    index_annotations_above = index_annotations_above,
    index_annotation_vertical_position = index_annotation_vertical_position,
    index_annotation_always_first_base = TRUE,
    index_annotation_always_last_base = TRUE,
    sum_indices = FALSE,
    spacing = NA, ## infer from vertical position
    offset_start = 0
)
(ggDNAvis helper) rasterize_matrix() is an alias for rasterise_matrix().
This function takes a matrix and rasterises it to a dataframe of x and y coordinates, such that the matrix occupies the space from (0, 0) to (1, 1) and each element of the matrix represents a rectangle with width 1/ncol(matrix) and height 1/nrow(matrix). The "layer" column of the dataframe is simply the value of each element of the matrix.
rasterise_matrix(image_matrix, drop_na = TRUE)
image_matrix |
|
drop_na |
|
dataframe. A dataframe containing x and y coordinates for the centre of a rectangle per element of the matrix, such that the whole matrix occupies the space from (0, 0) to (1, 1). Additionally contains a layer column storing the value of each element of the matrix.
## Create numerical matrix
example_matrix <- matrix(1:16, ncol = 4, nrow = 4, byrow = TRUE)
## View
example_matrix
## Rasterise
rasterise_matrix(example_matrix)
## Create character matrix
example_matrix <- matrix(c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), nrow = 2, ncol = 5, byrow = TRUE)
## View
example_matrix
## Rasterise
rasterise_matrix(example_matrix)
## Create realistic DNA matrix
dna_matrix <- matrix(
    c(0, 0, 0, 0, 0, 0, 0, 0,
      3, 3, 2, 3, 3, 2, 4, 4,
      0, 0, 0, 0, 0, 0, 0, 0,
      4, 1, 4, 1, 0, 0, 0, 0),
    nrow = 4, ncol = 8, byrow = TRUE
)
## View
dna_matrix
## Rasterise
rasterise_matrix(dna_matrix)
## Create matrix with missing values
incomplete_matrix <- matrix(c(1, 2, 3, NA, 5, NA, 7, 8), nrow = 2, ncol = 4, byrow = TRUE)
## View
incomplete_matrix
## Rasterise, dropping NAs (default)
rasterise_matrix(incomplete_matrix, drop_na = TRUE)
## Rasterise, keeping NAs
rasterise_matrix(incomplete_matrix, drop_na = FALSE)
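The coordinate arithmetic described above (rectangles of width 1/ncol and height 1/nrow filling the unit square) can be sketched as follows. `tile_centres()` is a hypothetical name. The sketch assumes the first matrix row is drawn at the top of the plot, and it does not implement drop_na or reproduce the real function's row ordering.

```r
## Hypothetical sketch of the centre-coordinate arithmetic; not the package's code
tile_centres <- function(m) {
    n_row <- nrow(m)
    n_col <- ncol(m)
    data.frame(
        ## Centre of column j is (j - 0.5) / ncol
        x = (as.vector(col(m)) - 0.5) / n_col,
        ## Assumption: row 1 sits at the top, so y decreases with row index
        y = 1 - (as.vector(row(m)) - 0.5) / n_row,
        layer = as.vector(m)
    )
}

centres <- tile_centres(matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE))
centres
```

For a 2 x 2 matrix this places the four centres at 0.25 and 0.75 on each axis, so the tiles exactly cover the square from (0, 0) to (1, 1).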
(visualise_methylation() helper) This function takes the locations/probabilities/sequences input to visualise_methylation(),
as well as the scaling and rounding to apply to the probability text,
and produces a dataframe of the x and y coordinates to draw each probability at
(i.e. inside the coloured box for each assessed base)
and the probability text to draw inside each box.
rasterise_probabilities(modification_locations, modification_probabilities, sequences, sequence_text_scaling = c(-0.5, 256), sequence_text_rounding = 2)
modification_locations |
|
modification_probabilities |
|
sequences |
|
sequence_text_scaling |
|
sequence_text_rounding |
|
dataframe. Dataframe of x, y, and value (i.e. probability to draw).
d <- extract_and_sort_methylation(example_many_sequences)
## Unscaled i.e. integers
rasterise_probabilities(d$locations, d$probabilities, d$sequences, sequence_text_scaling = c(0, 1), sequence_text_rounding = 0)
## Scaled to 0-1, 3 dp
rasterize_probabilities(d$locations, d$probabilities, d$sequences, sequence_text_scaling = c(-0.5, 256), sequence_text_rounding = 3)
## Default (i.e. scaled to 0-1, 2 dp)
rasterise_probabilities(d$locations, d$probabilities, d$sequences)
This function simply reads a FASTQ file into a dataframe containing
columns for read ID, sequence, and quality scores.
Optionally also contains a column of sequence lengths.
See fastq_quality_scores for an explanation of quality.
Resulting dataframe can be written back to FASTQ via write_fastq().
To read/write a modified FASTQ containing modification information
(SAM/BAM MM and ML tags) in the header lines, use
read_modified_fastq() and write_modified_fastq().
read_fastq(filename = file.choose(), calculate_length = TRUE, strip_at = TRUE)
filename |
|
calculate_length |
|
strip_at |
|
dataframe. A dataframe with read, sequence, quality, and optionally sequence_length columns.
## Locate file
fastq_file <- system.file("extdata", "example_many_sequences_raw.fastq", package = "ggDNAvis")
## View file
for (i in 1:16) {
    cat(readLines(fastq_file)[i], "\n")
}
## Read file to dataframe
read_fastq(fastq_file, calculate_length = FALSE)
read_fastq(fastq_file, calculate_length = TRUE)
This function reads a modified FASTQ file (e.g. created by samtools fastq -T MM,ML
from a BAM basecalled with a modification-capable model in Dorado or Guppy) to a dataframe.
By default, the dataframe contains columns for unique read id (read), sequence (sequence),
sequence length (sequence_length), quality (quality), comma-separated (via vector_to_string())
modification types present in each read (modification_types), and for each modification type,
a column of comma-separated modification locations (<type>_locations) and
a column of comma-separated modification probabilities (<type>_probabilities).
Modification locations are the indices along the read at which modification was assessed
e.g. a 3 indicates that the third base in the read was assessed for modifications of the given type.
Modification probabilities are the probability that the given modification is present, given as
an integer from 0-255 where integer $N$ represents the probability space from
$\frac{N}{256}$ to $\frac{N+1}{256}$.
To extract the numbers from these columns as numeric vectors to analyse, use string_to_vector() e.g.
list_of_locations <- lapply(test_01$`C+h?_locations`, string_to_vector). Be aware that the SAM
modification types often contain special characters, meaning the colname may need to be enclosed in
backticks as in this example. Alternatively, use extract_methylation_from_dataframe() to
create a list of locations, probabilities, and lengths ready for visualisation in
visualise_methylation(). This works with any modification type extracted in this function,
just provide the relevant colname when calling extract_methylation_from_dataframe().
Optionally (by specifying debug = TRUE), the dataframe will also contain columns of
the raw MM and ML tags (<MM/ML>_raw) and of the same tags with the initial label
trimmed out (<MM/ML>_tags). This is not recommended in most situations but may help
with debugging unexpected issues as it contains the raw data exactly from the FASTQ.
Dataframes produced by this function can be written back to modified FASTQ via write_modified_fastq().
read_modified_fastq(filename = file.choose(), strip_at = TRUE, debug = FALSE)
filename |
|
strip_at |
|
debug |
|
dataframe. Dataframe of modification information, as described above.
Sequences can be visualised with visualise_many_sequences() and modification information can be visualised with visualise_methylation() (despite the name, any type of information can be visualised as long as it has locations and probabilities columns).
Can be written back to FASTQ via write_modified_fastq().
## Locate file
modified_fastq_file <- system.file("extdata", "example_many_sequences_raw_modified.fastq", package = "ggDNAvis")
## View file
for (i in 1:16) {
    cat(readLines(modified_fastq_file)[i], "\n")
}
## Read file to dataframe
read_modified_fastq(modified_fastq_file, debug = FALSE)
read_modified_fastq(modified_fastq_file, debug = TRUE)
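As a small worked example of the 0-255 probability encoding, the sketch below decodes one comma-separated probability string into the interval each integer stands for. It assumes the standard SAM `ML` convention, in which integer N covers the range N/256 to (N+1)/256, and it uses base `strsplit()` rather than the package's string_to_vector() so that it runs without ggDNAvis installed.

```r
## One comma-separated probabilities entry, as stored in a <type>_probabilities column
probabilities <- "255,0,64,128"

## Split into integers (base R stand-in for string_to_vector())
values <- as.integer(strsplit(probabilities, ",")[[1]])

## Interval of probability space represented by each integer,
## assuming the SAM ML convention of 256 equal-width bins
bounds <- data.frame(
    value = values,
    lower = values / 256,
    upper = (values + 1) / 256
)
bounds
```

Under this convention a stored value of 255 covers the top bin ending at exactly 1, and 0 covers the bottom bin starting at exactly 0.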
(ggDNAvis helper) See the aliases page for a general explanation of how aliases are used in ggDNAvis.
This function takes the name and value for the 'primary' form of an argument
(generally British spellings in ggDNAvis), the name of an alternative
'alias' form, the dots (unrecognised argument) environment, and the default value of the 'primary' argument.
If the alias has not been used (i.e. the alias is not present in the dots env) or if the 'primary' value has been changed from the default, then the 'primary' value will be returned. (Note that if the alias is present in the dots env and the 'primary' value has been changed from the default, then the updated 'primary' value 'wins' and is returned, but with a warning that explains that both values were set and the 'alias' has been discarded).
If the alias has been used (i.e. the alias is present in the dots env) and the 'primary' value is the default, then the 'alias' value will be returned.
This function is most often used when called by resolve_alias_map().
resolve_alias(primary_name, primary_val, primary_default, alias_name, dots_env)
primary_name |
|
primary_val |
|
primary_default |
|
alias_name |
|
dots_env |
|
value. Either primary_val or alias_val, depending on the logic above.
low_colour <- "blue" ## e.g. default value from function call
dots_env <- list2env(list(low_color = "pink")) ## e.g. low_color = "pink" set in function call
low_colour <- resolve_alias("low_colour", low_colour, "blue", "low_color", dots_env)
low_colour ## check to see what value was stored
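The precedence rules described above can be written out as a short decision function. `pick_value()` is a hypothetical sketch of that logic only. The warning text is invented for illustration, and whether the real function removes the alias from the environment afterwards is not modelled here.

```r
## Hypothetical sketch of the primary-vs-alias decision; not the package's code
pick_value <- function(primary_name, primary_val, primary_default, alias_name, dots_env) {
    alias_used      <- exists(alias_name, envir = dots_env, inherits = FALSE)
    primary_changed <- !identical(primary_val, primary_default)

    ## Alias absent: the primary value is always returned
    if (!alias_used) {
        return(primary_val)
    }
    ## Both set: the primary value wins, with a warning
    if (primary_changed) {
        warning(sprintf("Both '%s' and '%s' were set; using '%s'.",
                        primary_name, alias_name, primary_name))
        return(primary_val)
    }
    ## Alias set and primary left at its default: the alias value is returned
    get(alias_name, envir = dots_env, inherits = FALSE)
}

dots_env <- list2env(list(low_color = "pink"))
from_alias   <- pick_value("low_colour", "blue", "blue", "low_color", dots_env)
from_primary <- suppressWarnings(
    pick_value("low_colour", "green", "blue", "low_color", dots_env)
)
from_alias
from_primary
```

One consequence of comparing against the default, visible in this sketch, is that explicitly passing the default value for the primary argument is indistinguishable from not passing it at all.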
(ggDNAvis helper) See the aliases page for a general explanation of how aliases are used in ggDNAvis.
This function takes an alias map and the environment constructed from non-formal
arguments (...) to the calling function, and optionally an environment to function inside,
and works through the aliases provided in the map via resolve_alias().
If any arguments were given that aren't in the alias map an error is raised.
resolve_alias_map(alias_map, dots_env, target_env = parent.frame())
alias_map |
|
dots_env |
|
target_env |
|
Nothing (variables are modified within the target_env).
## Alias map (from within function code)
alias_map <- list(
    low_colour = list(default = "blue", aliases = c("low_color", "low_col")),
    high_colour = list(default = "red", aliases = c("high_color", "high_col"))
)
## Default values (would come from formal arguments)
low_colour = "blue" ## default
high_colour = "green" ## changed from default
## Extra arguments provided by name
dots_env <- list2env(list("low_col" = "black", "low_color" = "white", "high_color" = "orange"))
## Process
resolve_alias_map(alias_map, dots_env)
## See values
print(low_colour)
print(high_colour)
(ggDNAvis helper) This function takes a string/character representing a DNA/RNA sequence and returns
the reverse complement. Either DNA (A/C/G/T) or RNA (A/C/G/U) input is accepted.
By default, output is DNA (so A is reverse-complemented to T), but it can be set
to output RNA (so A is reverse-complemented to U).
Alternatively, if output_mode is set to "reverse_only" then the sequence will be
reversed as-is without being complemented. Note that this also skips sequence validation,
meaning any string can be reversed even if it contains non-A/C/G/T/U characters.
reverse_complement(sequence, output_mode = "DNA")
sequence |
|
output_mode |
|
character. The reverse-complement of the input sequence.
reverse_complement("ATGCTAG")
reverse_complement("ATGCTAG", output_mode = "reverse_only")
reverse_complement("UUAUUAGC", output_mode = "RNA")
reverse_complement("AcGtU", output_mode = "DNA")
reverse_complement("aCgTU", output_mode = "RNA")
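For readers who want to see the three output modes side by side, here is a minimal base R sketch. `revcomp_sketch()` is a hypothetical name. It upper-cases input before complementing, which is an assumption about case handling, and it performs none of the validation the real function applies in "DNA" and "RNA" modes.

```r
## Hypothetical sketch of the three output modes; not the package's code
revcomp_sketch <- function(sequence, output_mode = "DNA") {
    reversed <- paste(rev(strsplit(sequence, "")[[1]]), collapse = "")
    ## "reverse_only" returns the reversed string untouched
    if (output_mode == "reverse_only") {
        return(reversed)
    }
    ## A complements to T for DNA output or U for RNA output;
    ## both T and U in the input complement to A
    complements <- if (output_mode == "RNA") "UGCAA" else "TGCAA"
    chartr("ACGTU", complements, toupper(reversed))
}

dna_out     <- revcomp_sketch("ATGCTAG")
reverse_out <- revcomp_sketch("ATGCTAG", output_mode = "reverse_only")
rna_out     <- revcomp_sketch("UUAUUAGC", output_mode = "RNA")
c(dna_out, reverse_out, rna_out)
```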
(merge_methylation_with_metadata() helper) This function takes a vector of condensed modification locations/indices (e.g.
c("3,6,9,12", "1,4,7,10")), a vector of directions (which must all be either
"forward" or "reverse", not case-sensitive), and a vector of sequence lengths
(integers).
Returns a vector of condensed locations where reads that were originally forward
are unchanged, and reads that were originally reverse are flipped to now be forward.
Optionally, a numerical offset can be set. If this is left at 0 (the default value),
then a CpG assessed for methylation would be reverse-complemented to a CG with the
modification information ascribed to the G (as the G is at the location where the actual
modified C was on the other strand). However, setting the offset to 1 would shift all
of the modification indices by 1 such that the modification is now ascribed to the C of the
reverse-strand CG. This is beneficial for visualising the modifications as it ensures consistency
between originally-forward and originally-reverse strands by making the modification score associated
with each CpG site always be located at the C, but may be misleading for quantitative analysis.
Setting the offset to anything other than 0 or 1 should work but may be biologically misleading,
so produces a warning.
Called by merge_methylation_with_metadata() to create a forward dataset, alongside
reverse_sequence_if_needed(), reverse_quality_if_needed(), and reverse_probabilities_if_needed().
Example:
Forward sequence, with indices of Cs in CpGs numbered:
C C C A G G C G G C G G C G A C C G A
            7     10    13      17
(length = 19, locations = "7,10,13,17", CpGs = 7-8, 10-11, 13-14, 17-18)
Reverse sequence, with indices of C in CpGs numbered:
T C G G T C G C C G C C G C C T G G G
  2       6     9     12
(length = 19, locations = "2,6,9,12", CpGs = 2-3, 6-7, 9-10, 12-13)
As CG reverse-complements to itself, each CpG site has a 1:1 correspondence with
a CpG site in the reverse strand. Many methylation calling models assess C-methylation
at the C of each CpG. To map the locations from C to C, we take 19 - <index> such that
"7,10,13,17" becomes "12,9,6,2" and "2,6,9,12" becomes "17,13,10,7".
The symmetry of CpGs means mapping from C to C is also symmetric.
This is achieved by setting offset = 1, as mapping C to C involves shifting position by 1.
Conversely, to map the locations from C to G (i.e. preserving the actual location of each
modification, which is required if assessed locations are non-symmetric/don't reverse-complement
to themselves like CpGs do), we take 20 - <index> such that
"7,10,13,17" becomes "13,10,7,3" i.e. the indices of the Gs in CpGs in the reverse
sequence. Likewise "2,6,9,12" becomes "18,14,11,8" i.e. the indices of the Gs in CpGs in
the forward sequence.
This is achieved by setting offset = 0, as mapping C to G preserves the actual original position
at which each modification was assessed, but changes the base to its complement.
In general, new locations are calculated as (<length> + 1 - <offset>) - <index>.
Of course, output locations are reversed before returning so that they all
return in ascending order, as is standard for all location vectors/strings.
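The general formula can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's implementation: flip_locations() is a hypothetical helper that handles a single read and does no input validation.

```r
# Hypothetical helper (not part of ggDNAvis): map one comma-joined string of
# locations onto the opposite strand's coordinates.
flip_locations <- function(locations, length, offset = 0) {
    index <- as.numeric(strsplit(locations, ",", fixed = TRUE)[[1]])
    # New location = (<length> + 1 - <offset>) - <index>
    flipped <- (length + 1 - offset) - index
    # Reverse so that the output is in ascending order
    paste(rev(flipped), collapse = ",")
}

flip_locations("7,10,13,17", length = 19, offset = 1)  # "2,6,9,12"  (C to C)
flip_locations("7,10,13,17", length = 19, offset = 0)  # "3,7,10,13" (C to G)
```

With offset = 1 the result lands on the Cs of the opposite strand's CpGs, and with offset = 0 it lands on the Gs, matching the worked example above.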
If wanting to write reversed sequences to FASTQ via write_modified_fastq(), locations
must be symmetric (e.g. CpG) and offset must be set to 1. Asymmetric locations are impossible
to write to modified FASTQ once reversed because then e.g. cytosine methylation will be assessed
at guanines, which SAMtools can't account for. Symmetrically reversing CpGs via offset = 1 is
the only way to fix this.
reverse_locations_if_needed(
    locations_vector,
    direction_vector,
    length_vector,
    offset = 0
)
locations_vector |
|
direction_vector |
|
length_vector |
|
offset |
|
character vector. A vector of all forward versions of the input locations vector.
reverse_locations_if_needed(
    locations_vector = c("7,10,13,17", "2,6,9,12"),
    direction_vector = c("forward", "reverse"),
    length_vector = c(19, 19),
    offset = 0
)
reverse_locations_if_needed(
    locations_vector = c("7,10,13,17", "2,6,9,12"),
    direction_vector = c("forward", "reverse"),
    length_vector = c(19, 19),
    offset = 1
)
(merge_methylation_with_metadata() helper)
This function takes a vector of condensed modification probabilities
(e.g. c("128,0,63,255", "3,78,1")) and a vector of directions (which
must all be either "forward" or "reverse", not case-sensitive),
and returns a vector of condensed modification probabilities where those
that were originally forward are unchanged, and those that were originally
reverse are flipped to now be forward.
Called by merge_methylation_with_metadata() to create a forward dataset, alongside
reverse_sequence_if_needed(), reverse_quality_if_needed(), and reverse_locations_if_needed().
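As a rough sketch of what "flipping" means here, the base R below reverses the order of the comma-separated values for reads marked as reverse. It assumes that flipping is a simple order reversal; flip_probabilities() is a hypothetical name, not the package function.

```r
# Hypothetical helper (not part of ggDNAvis): reverse the order of the
# comma-joined values for reverse reads, leave forward reads untouched.
flip_probabilities <- function(probabilities, direction) {
    is_reverse <- tolower(direction) == "reverse"
    flipped <- vapply(
        strsplit(probabilities, ",", fixed = TRUE),
        function(values) paste(rev(values), collapse = ","),
        character(1)
    )
    ifelse(is_reverse, flipped, probabilities)
}

flip_probabilities(c("100,200,50", "100,200,50"), c("forward", "reverse"))
# "100,200,50" "50,200,100"
```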
reverse_probabilities_if_needed(probabilities_vector, direction_vector)
probabilities_vector |
|
direction_vector |
|
character vector. A vector of all forward versions of the input probabilities vector.
reverse_probabilities_if_needed(
    probabilities_vector = c("100,200,50", "100,200,50"),
    direction_vector = c("forward", "reverse")
)
(merge_methylation_with_metadata() helper)
This function takes a vector of FASTQ qualities and a vector of directions
(which must all be either "forward" or "reverse", not case-sensitive)
and returns a vector of forward qualities.
Qualities of reads that were forward to begin with are unchanged,
while qualities of reads that were reverse are now flipped
to give the corresponding forward quality scores.
Called by merge_methylation_with_metadata() to create a forward dataset,
alongside reverse_sequence_if_needed(), reverse_locations_if_needed(),
and reverse_probabilities_if_needed().
reverse_quality_if_needed(quality_vector, direction_vector)
quality_vector |
|
direction_vector |
|
character vector. A vector of all forward versions of the input quality vector.
reverse_quality_if_needed(
    quality_vector = c("#^$&$*", "#^$&$*"),
    direction_vector = c("reverse", "forward")
)
(merge_methylation_with_metadata() helper)
This function takes a vector of DNA/RNA sequences and a vector of directions
(which must all be either "forward" or "reverse", not case-sensitive)
and returns a vector of forward DNA/RNA sequences.
Sequences in the vector that were forward to begin with are unchanged,
while sequences that were reverse are reverse-complemented via reverse_complement()
to produce the forward sequence.
Called by merge_methylation_with_metadata() to create a forward dataset, alongside
reverse_quality_if_needed(), reverse_locations_if_needed() and reverse_probabilities_if_needed().
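The reverse-complement step can be illustrated with base R alone. This is a sketch for uppercase DNA only, with hypothetical helper names; it does not reproduce reverse_complement() or the output_mode options.

```r
# Hypothetical helpers (not part of ggDNAvis), uppercase DNA only.
revcomp_dna <- function(sequence) {
    # Reverse each sequence character by character
    reversed <- vapply(
        strsplit(sequence, ""),
        function(bases) paste(rev(bases), collapse = ""),
        character(1)
    )
    # Swap each base for its complement
    chartr("ACGT", "TGCA", reversed)
}

forward_sequences <- function(sequences, direction) {
    is_reverse <- tolower(direction) == "reverse"
    ifelse(is_reverse, revcomp_dna(sequences), sequences)
}

forward_sequences(c("TAAGGC", "TAAGGC"), c("reverse", "forward"))
# "GCCTTA" "TAAGGC"
```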
reverse_sequence_if_needed(
    sequence_vector,
    direction_vector,
    output_mode = "DNA"
)
sequence_vector |
|
direction_vector |
|
output_mode |
|
character vector. A vector of all forward versions of the input sequence vector.
reverse_sequence_if_needed(
    sequence_vector = c("TAAGGC", "TAAGGC"),
    direction_vector = c("reverse", "forward")
)
reverse_sequence_if_needed(
    sequence_vector = c("UAAGGC", "UAAGGC"),
    direction_vector = c("reverse", "forward"),
    output_mode = "RNA"
)
reverse_sequence_if_needed(
    sequence_vector = c("TAAGGC", "TAAGGC"),
    direction_vector = c("reverse", "forward"),
    output_mode = "reverse_only"
)
sequence_color_palettes and sequence_col_palettes
are aliases for sequence_colour_palettes - see aliases.
A collection of colour palettes for use with visualise_single_sequence()
and visualise_many_sequences(). Each is a character vector of 4 colours,
corresponding to A, C, G, and T/U in that order.
To use inside the visualisation functions, set
sequence_colours = sequence_colour_palettes$<palette_name>
Generation code is available at data-raw/sequence_colour_palettes.R
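Because each palette is ordered A, C, G, T/U, a per-base colour lookup can be sketched as below. The colour values are copied from the first palette listed in this entry; the base_colours lookup is a hypothetical illustration and is not how the visualisation functions are documented to work internally.

```r
# Four colours in A, C, G, T order (values copied from this entry)
palette <- c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF")

# Hypothetical lookup: name each colour by its base
base_colours <- setNames(palette, c("A", "C", "G", "T"))

# Colour each base of a sequence
bases <- strsplit("GATTACA", "")[[1]]
sequence_fill <- unname(base_colours[bases])
sequence_fill[1:2]  # "#00BFC4" "#F8766D"
```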
sequence_colour_palettes
sequence_colour_palettes: A list of 6 length-4 character vectors
ggplot_style: The shades of red, green, blue, and purple that ggplot2::ggplot() uses by default for a 4-way discrete colour scheme.
Values: c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF")
bright_pale: Bright yellow, green, blue, and red in lighter pastel-like tones.
Values: c("#FFDD00", "#40C000", "#00A0FF", "#FF4E4E")
bright_pale2: Bright yellow, green, blue, and red in lighter pastel-like tones. The green (for C) is slightly lighter than bright_pale.
Values: c("#FFDD00", "#30EC00", "#00A0FF", "#FF4E4E")
bright_deep: Bright orange, green, blue, and red in darker, richer tones.
Values: c("#FFAA00", "#00BC00", "#0000DC", "#FF1E1E")
sanger: Green, blue, black, and red similar to a traditional Sanger sequencing readout.
Values: c("#00B200", "#0000FF", "#000000", "#FF0000")
accessible: Light green, dark green, dark blue, and light blue as suggested by colorbrewer2.org for a 4-qualitative-category colourblind-safe palette.
Values: c("#B2DF8A", "#33A02C", "#1F78B4", "#A6CEE3")
## ggplot_style:
visualise_single_sequence(
    "ACGT",
    sequence_colours = sequence_colour_palettes$ggplot_style,
    index_annotation_interval = 0
)

## bright_pale:
visualise_single_sequence(
    "ACGT",
    sequence_colours = sequence_colour_palettes$bright_pale,
    index_annotation_interval = 0
)

## bright_pale2:
visualise_single_sequence(
    "ACGT",
    sequence_colours = sequence_colour_palettes$bright_pale2,
    index_annotation_interval = 0
)

## bright_deep:
visualise_single_sequence(
    "ACGT",
    sequence_colours = sequence_colour_palettes$bright_deep,
    sequence_text_colour = "white",
    index_annotation_interval = 0
)

## sanger:
visualise_single_sequence(
    "ACGT",
    sequence_colours = sequence_colour_palettes$sanger,
    sequence_text_colour = "white",
    index_annotation_interval = 0
)

## accessible:
visualise_single_sequence(
    "ACGT",
    sequence_colours = sequence_colour_palettes$accessible,
    sequence_text_colour = "black",
    index_annotation_interval = 0
)
","-joined string back to a vector (generic ggDNAvis helper)Takes a string (character) produced by vector_to_string() and recreates the vector.
Note that if a vector of multiple strings is input (e.g. c("1,2,3", "9,8,7")) the output
will be a single concatenated vector (e.g. c(1, 2, 3, 9, 8, 7)).
If the desired output is a list of vectors, try lapply() e.g.
lapply(c("1,2,3", "9,8,7"), string_to_vector) returns list(c(1, 2, 3), c(9, 8, 7)).
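The concatenation behaviour follows naturally from how such a splitter can be built in base R. The sketch below uses a hypothetical split_joined() helper for the numeric case only; it is an illustration, not the package's implementation.

```r
# Hypothetical helper (not part of ggDNAvis): numeric case only.
split_joined <- function(string, sep = ",") {
    # strsplit() returns one list element per input string;
    # unlist() then flattens them into a single vector
    as.numeric(unlist(strsplit(string, sep, fixed = TRUE)))
}

split_joined("1,2,3")                       # 1 2 3
split_joined(c("1,2,3", "9,8,7"))           # 1 2 3 9 8 7 (concatenated)
lapply(c("1,2,3", "9,8,7"), split_joined)   # list of two vectors
```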
string_to_vector(string, type = "numeric", sep = ",")
string |
|
type |
|
sep |
|
<type> vector. The resulting vector (e.g. c(1, 2, 3)).
## String to numeric vector (default)
string_to_vector("1,2,3,4")
string_to_vector("1,2,3,4", type = "numeric")
string_to_vector("1;2;3;4", sep = ";")

## String to character vector
string_to_vector("A,B,C,D", type = "character")

## String to logical vector
string_to_vector("TRUE FALSE TRUE", type = "logical", sep = " ")

## By default, vector inputs are concatenated
string_to_vector(c("1,2,3", "4,5,6"))

## To create a list of vector outputs, use lapply()
lapply(c("1,2,3", "4,5,6"), string_to_vector)
@ from character vector
This function removes a single leading @ character
from each element of a character vector when present.
This is intended to deal with SAMtools > FASTQ translation
often prefixing read IDs with an "@", which can result
in read ID mismatches and failed metadata merges.
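The described behaviour (remove at most one leading "@" per element) matches what an anchored, non-global substitution does in base R. The helper name below is hypothetical and the sketch is not taken from the package source.

```r
# Hypothetical helper (not part of ggDNAvis).
# sub() replaces only the first match, and "^@" only matches at the start,
# so at most one leading "@" is removed from each element.
strip_one_at <- function(x) sub("^@", "", x)

strip_one_at(c("read_1", "@read_2", "@@read_3", "", NA))
# "read_1" "read_2" "@read_3" "" NA
```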
strip_leading_at(string)
string |
|
character vector. The same string but with one "@" removed from each element that started with one.
strip_leading_at(c("read_1", "@read_2", "@@read_3", "", NA, NULL))
(ggDNAvis helper)
Takes a vector and condenses it into a single string by joining items with ",".
Reversed by string_to_vector().
vector_to_string(vector, sep = ",")
vector |
|
sep |
|
character. The same vector but as a comma-separated string (e.g. "1,2,3").
vector_to_string(c(1, 2, 3, 4))
vector_to_string(c("These", "are", "some", "words"))
vector_to_string(3:5, sep = ";")
visualize_many_sequences() is an alias for visualise_many_sequences() - see aliases.
This function takes a vector of DNA/RNA sequences (each sequence can be
any length and they can be different lengths), and plots each sequence
as base-coloured squares along a single line. Setting filename allows direct
export of a png image with the correct dimensions to make every base a perfect
square. Empty strings ("") within the vector can be utilised as blank spacing
lines. Colours and pixels per square when exported are configurable.
visualise_many_sequences(
    sequences_vector,
    ...,
    sequence_colours = sequence_colour_palettes$ggplot_style,
    background_colour = "white",
    margin = 0.5,
    sequence_text_colour = "black",
    sequence_text_size = 16,
    index_annotation_lines = c(1),
    index_annotation_colour = "darkred",
    index_annotation_size = 12.5,
    index_annotation_interval = 15,
    index_annotations_above = TRUE,
    index_annotation_vertical_position = 1/3,
    index_annotation_full_line = TRUE,
    index_annotation_always_first_base = TRUE,
    index_annotation_always_last_base = TRUE,
    outline_colour = "black",
    outline_linewidth = 3,
    outline_join = "mitre",
    return = TRUE,
    filename = NA,
    force_raster = FALSE,
    render_device = ragg::agg_png,
    pixels_per_base = 100,
    monitor_performance = FALSE
)
sequences_vector |
|
... |
Used to recognise aliases e.g. American spellings or common misspellings - see aliases. If any American spellings do not work, please make a bug report at https://github.com/ejade42/ggDNAvis/issues. |
sequence_colours |
|
background_colour |
|
margin |
|
sequence_text_colour |
|
sequence_text_size |
|
index_annotation_lines |
|
index_annotation_colour |
|
index_annotation_size |
|
index_annotation_interval |
|
index_annotations_above |
|
index_annotation_vertical_position |
|
index_annotation_full_line |
|
index_annotation_always_first_base |
|
index_annotation_always_last_base |
|
outline_colour |
|
outline_linewidth |
|
outline_join |
|
return |
|
filename |
|
force_raster |
|
render_device |
|
pixels_per_base |
|
monitor_performance |
|
A ggplot object containing the full visualisation, or invisible(NULL) if return = FALSE. It is often more useful to use filename = "myfilename.png", because then the visualisation is exported at the correct aspect ratio.
## Create sequences vector
sequences <- extract_and_sort_sequences(example_many_sequences)

## Visualise example_many_sequences with all defaults
## This looks ugly because it isn't at the right scale/aspect ratio
visualise_many_sequences(sequences)

## Export with all defaults rather than returning
visualise_many_sequences(
    sequences,
    filename = "example_vms_01.png",
    return = FALSE
)
## View exported image
image <- png::readPNG("example_vms_01.png")
unlink("example_vms_01.png")
grid::grid.newpage()
grid::grid.raster(image)

## Export while customising appearance
visualise_many_sequences(
    sequences,
    filename = "example_vms_02.png",
    return = FALSE,
    sequence_colours = sequence_colour_palettes$bright_pale,
    sequence_text_colour = "white",
    index_annotation_interval = 3,
    index_annotation_lines = 1:51,
    index_annotation_full_line = FALSE,
    index_annotation_always_first_base = FALSE,
    index_annotation_always_last_base = FALSE,
    background_colour = "lightgrey",
    outline_linewidth = 0,
    margin = 0
)
## View exported image
image <- png::readPNG("example_vms_02.png")
unlink("example_vms_02.png")
grid::grid.newpage()
grid::grid.raster(image)
visualize_methylation() is an alias for visualise_methylation() - see aliases.
This function takes vectors of modification locations, modification probabilities,
and sequence lengths (e.g. created by extract_and_sort_methylation()) and
visualises the probability of methylation (or other modification) across each read.
Assumes that the three main input vectors are of equal length and represent sequences
(e.g. Nanopore reads), where locations are the indices along each read at which modification
was assessed, probabilities are the probability of modification at each assessed site, and
lengths are the lengths of each sequence.
For each sequence, renders non-assessed (e.g. non-CpG) bases as other_bases_colour, renders
background (including after the end of the sequence) as background_colour, and renders assessed
bases on a linear scale from low_colour to high_colour.
Clamping means that the endpoints of the colour gradient can be set some distance into the probability
space e.g. with Nanopore > SAM probability values from 0-255, the default is to render 0 as fully blue
(#0000FF), 255 as fully red (#FF0000), and values in between linearly interpolated. However, clamping with
low_clamp = 100 and high_clamp = 200 would set all probabilities up to 100 as fully blue,
all probabilities 200 and above as fully red, and linearly interpolate only over the 100-200 range.
A separate scalebar plot showing the colours corresponding to each probability, with any/no clamping values,
can be produced via visualise_methylation_colour_scale().
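The clamping rule can be sketched with base R's grDevices. The helper below is hypothetical and assumes plain linear interpolation in RGB space between the two endpoint colours, which may differ in detail from how the package computes its gradient.

```r
# Hypothetical helper (not part of ggDNAvis): colour for each probability
# under clamping, assuming linear RGB interpolation.
clamped_colour <- function(probability, low_clamp = 0, high_clamp = 255,
                           low_colour = "blue", high_colour = "red") {
    # Position along the gradient, clamped to the range [0, 1]
    position <- (probability - low_clamp) / (high_clamp - low_clamp)
    position <- pmin(pmax(position, 0), 1)
    # Interpolate between the two endpoint colours
    ramp <- grDevices::colorRamp(c(low_colour, high_colour))
    grDevices::rgb(ramp(position), maxColorValue = 255)
}

clamped_colour(c(0, 100, 200, 255), low_clamp = 100, high_clamp = 200)
# "#0000FF" "#0000FF" "#FF0000" "#FF0000"
```

Probabilities at or below low_clamp all map to the low colour, those at or above high_clamp all map to the high colour, and only values in between are interpolated.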
visualise_methylation(
    modification_locations,
    modification_probabilities,
    sequences,
    ...,
    low_colour = "blue",
    high_colour = "red",
    low_clamp = 0,
    high_clamp = 255,
    background_colour = "white",
    other_bases_colour = "grey",
    sequence_text_type = "none",
    sequence_text_scaling = c(-0.5, 256),
    sequence_text_rounding = 2,
    sequence_text_colour = "black",
    sequence_text_size = 16,
    index_annotation_lines = c(1),
    index_annotation_colour = "darkred",
    index_annotation_size = 12.5,
    index_annotation_interval = 15,
    index_annotations_above = TRUE,
    index_annotation_vertical_position = 1/3,
    index_annotation_full_line = TRUE,
    index_annotation_always_first_base = TRUE,
    index_annotation_always_last_base = TRUE,
    outline_colour = "black",
    outline_linewidth = 3,
    outline_join = "mitre",
    modified_bases_outline_colour = NA,
    modified_bases_outline_linewidth = NA,
    modified_bases_outline_join = NA,
    other_bases_outline_colour = NA,
    other_bases_outline_linewidth = NA,
    other_bases_outline_join = NA,
    margin = 0.5,
    return = TRUE,
    filename = NA,
    force_raster = FALSE,
    render_device = ragg::agg_png,
    pixels_per_base = 100,
    monitor_performance = FALSE
)
modification_locations |
|
modification_probabilities |
|
sequences |
|
... |
Used to recognise aliases e.g. American spellings or common misspellings - see aliases. If any American spellings do not work, please make a bug report at https://github.com/ejade42/ggDNAvis/issues. |
low_colour |
|
high_colour |
|
low_clamp |
|
high_clamp |
|
background_colour |
|
other_bases_colour |
|
sequence_text_type |
|
sequence_text_scaling |
|
sequence_text_rounding |
|
sequence_text_colour |
|
sequence_text_size |
|
index_annotation_lines |
|
index_annotation_colour |
|
index_annotation_size |
|
index_annotation_interval |
|
index_annotations_above |
|
index_annotation_vertical_position |
|
index_annotation_full_line |
|
index_annotation_always_first_base |
|
index_annotation_always_last_base |
|
outline_colour |
|
outline_linewidth |
|
outline_join |
|
modified_bases_outline_colour |
|
modified_bases_outline_linewidth |
|
modified_bases_outline_join |
|
other_bases_outline_colour |
|
other_bases_outline_linewidth |
|
other_bases_outline_join |
|
margin |
|
return |
|
filename |
|
force_raster |
|
render_device |
|
pixels_per_base |
|
monitor_performance |
|
A ggplot object containing the full visualisation, or invisible(NULL) if return = FALSE. It is often more useful to use filename = "myfilename.png", because then the visualisation is exported at the correct aspect ratio.
## Extract info from dataframe
methylation_info <- extract_and_sort_methylation(example_many_sequences)

## Visualise example_many_sequences with all defaults
## This looks ugly because it isn't at the right scale/aspect ratio
visualise_methylation(
    methylation_info$locations,
    methylation_info$probabilities,
    methylation_info$sequences
)

## Export with all defaults rather than returning
visualise_methylation(
    methylation_info$locations,
    methylation_info$probabilities,
    methylation_info$sequences,
    filename = "example_vm_01.png",
    return = FALSE
)
## View exported image
image <- png::readPNG("example_vm_01.png")
unlink("example_vm_01.png")
grid::grid.newpage()
grid::grid.raster(image)

## Export with customisation
visualise_methylation(
    methylation_info$locations,
    methylation_info$probabilities,
    methylation_info$sequences,
    filename = "example_vm_02.png",
    return = FALSE,
    low_colour = "white",
    high_colour = "black",
    low_clamp = 0.3*255,
    high_clamp = 0.7*255,
    index_annotation_lines = c(1, 23, 37),
    index_annotation_interval = 3,
    index_annotation_full_line = FALSE,
    index_annotation_always_first_base = TRUE,
    index_annotation_always_last_base = TRUE,
    other_bases_colour = "lightblue1",
    other_bases_outline_linewidth = 1,
    other_bases_outline_colour = "grey",
    modified_bases_outline_linewidth = 3,
    modified_bases_outline_colour = "black",
    margin = 0.3
)
## View exported image
image <- png::readPNG("example_vm_02.png")
unlink("example_vm_02.png")
grid::grid.newpage()
grid::grid.raster(image)

## Export with customisation, viewing sequences
visualise_methylation(
    methylation_info$locations,
    methylation_info$probabilities,
    methylation_info$sequences,
    filename = "example_vm_03.png",
    return = FALSE,
    low_colour = "white",
    high_colour = "black",
    low_clamp = 0.3*255,
    high_clamp = 0.7*255,
    sequence_text_type = "sequence",
    sequence_text_colour = "red",
    index_annotation_lines = c(1, 23, 37),
    index_annotation_interval = 3,
    index_annotation_full_line = FALSE,
    index_annotation_always_first_base = TRUE,
    index_annotation_always_last_base = TRUE,
    other_bases_colour = "lightblue1",
    other_bases_outline_linewidth = 1,
    other_bases_outline_colour = "grey",
    modified_bases_outline_linewidth = 3,
    modified_bases_outline_colour = "black",
    margin = 0.3
)
## View exported image
image <- png::readPNG("example_vm_03.png")
unlink("example_vm_03.png")
grid::grid.newpage()
grid::grid.raster(image)

## Export with customisation, viewing probabilities
visualise_methylation(
    methylation_info$locations,
    methylation_info$probabilities,
    methylation_info$sequences,
    filename = "example_vm_04.png",
    return = FALSE,
    low_colour = "cyan",
    high_colour = "yellow",
    low_clamp = 0.3*255,
    high_clamp = 0.7*255,
    sequence_text_type = "probability",
    sequence_text_size = 10,
    sequence_text_colour = "black",
    index_annotation_lines = c(1, 23, 37),
    index_annotation_interval = 3,
    index_annotation_full_line = FALSE,
    index_annotation_always_first_base = TRUE,
    index_annotation_always_last_base = TRUE,
    other_bases_colour = "lightgreen",
    other_bases_outline_linewidth = 1,
    other_bases_outline_colour = "grey",
    modified_bases_outline_linewidth = 3,
    modified_bases_outline_colour = "black",
    margin = 0.3
)
## View exported image
image <- png::readPNG("example_vm_04.png")
unlink("example_vm_04.png")
grid::grid.newpage()
grid::grid.raster(image)

## Export with customisation, viewing probability integers
visualise_methylation(
    methylation_info$locations,
    methylation_info$probabilities,
    methylation_info$sequences,
    filename = "example_vm_05.png",
    return = FALSE,
    low_colour = "blue",
    high_colour = "red",
    low_clamp = 0.3*255,
    high_clamp = 0.7*255,
    sequence_text_type = "probability",
    sequence_text_scaling = c(0, 1),
    sequence_text_rounding = 0,
    sequence_text_size = 10,
    sequence_text_colour = "white",
    index_annotation_lines = c(1, 23, 37),
    index_annotation_interval = 3,
    index_annotation_full_line = FALSE,
    index_annotation_always_first_base = TRUE,
    index_annotation_always_last_base = TRUE,
    other_bases_outline_linewidth = 1,
    other_bases_outline_colour = "grey",
    modified_bases_outline_linewidth = 3,
    modified_bases_outline_colour = "black",
    margin = 0.3
)
## View exported image
image <- png::readPNG("example_vm_05.png")
unlink("example_vm_05.png")
grid::grid.newpage()
grid::grid.raster(image)
visualize_methylation_color_scale() is an alias for visualise_methylation_colour_scale() - see aliases.
This function creates a scalebar showing the colouring scheme based on methylation
probability that is used in visualise_methylation(). Showing this is particularly
important when the colour range is clamped via low_clamp and high_clamp (e.g.
setting that all values below 100 are fully blue (#0000FF), all values above 200 are
fully red (#FF0000), and colour interpolation occurs only in the range 100-200, rather
than across the whole range 0-255). If clamping is off (default), then 0 is fully blue,
255 is fully red, and all values are linearly interpolated. NB: colours are configurable
but default to blue = low modification probability and red = high modification probability.
visualise_methylation_colour_scale(
    low_colour = "blue",
    high_colour = "red",
    low_clamp = 0,
    high_clamp = 255,
    ...,
    full_range = c(0, 255),
    precision = 10^3,
    background_colour = "white",
    axis_location = "bottom",
    axis_title = NULL,
    do_axis_ticks = TRUE,
    outline_colour = "black",
    outline_linewidth = 1,
    monitor_performance = FALSE
)
low_colour |
|
high_colour |
|
low_clamp |
|
high_clamp |
|
... |
Used to recognise aliases e.g. American spellings or common misspellings - see aliases. If any American spellings do not work, please make a bug report at https://github.com/ejade42/ggDNAvis/issues. |
full_range |
|
precision |
|
background_colour |
|
axis_location |
|
axis_title |
|
do_axis_ticks |
|
outline_colour |
|
outline_linewidth |
|
monitor_performance |
|
ggplot of the scalebar.
Unlike the other visualise_<> functions in this package, this function does not directly export a png, because there are no squares that need to be rendered at a precise aspect ratio. The plot can simply be saved with ggplot2::ggsave() using any sensible combination of height and width.
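A minimal sketch of saving the scalebar with ggplot2::ggsave() might look as follows. The output path, width, height, and dpi are arbitrary illustrative choices rather than recommendations from the package.

```r
library(ggDNAvis)

## Build the scalebar as an ordinary ggplot object
scalebar <- visualise_methylation_colour_scale(
    low_clamp = 100,
    high_clamp = 200,
    axis_title = "Methylation probability"
)

## Assumption: a wide, short image suits a horizontal scalebar.
## The file name and dimensions here are purely illustrative.
output_file <- file.path(tempdir(), "methylation_scalebar.png")
ggplot2::ggsave(output_file, plot = scalebar, width = 6, height = 1.5, dpi = 300)
```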
## Defaults match defaults of visualise_methylation()
visualise_methylation_colour_scale()

## Use clamping and change colours
visualise_methylation_colour_scale(
    low_colour = "white",
    high_colour = "black",
    low_clamp = 0.3*255,
    high_clamp = 0.7*255,
    full_range = c(0, 255),
    background_colour = "lightblue1",
    axis_location = "bottom",
    axis_title = "Methylation probability"
)

## Lower precision = colour banding
visualise_methylation_colour_scale(
    precision = 10,
    do_axis_ticks = FALSE
)

## Left axis
visualise_methylation_colour_scale(
    precision = 100,
    axis_location = "WEST",
    axis_title = "vertical probability"
)
visualize_single_sequence() is an alias for visualise_single_sequence() - see aliases.
This function takes a DNA/RNA sequence and returns a ggplot visualising it, with the option to directly export a png image with appropriate dimensions. Colours, line wrapping, index annotation interval, and pixels per square when exported are configurable.
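To give a feel for what line wrapping means here, the following base R sketch splits a sequence into chunks of at most line_wrapping bases. wrap_sequence() is a hypothetical helper written for illustration only; the package's internal handling may differ.

```r
## Hypothetical helper, NOT part of ggDNAvis.
## Assumes a single, non-empty character string.
wrap_sequence <- function(sequence, line_wrapping = 75) {
    starts <- seq(1, nchar(sequence), by = line_wrapping)
    ## substring() truncates at the end of the string,
    ## so the final line may be shorter than line_wrapping
    substring(sequence, starts, starts + line_wrapping - 1)
}

sequence <- paste(c(rep("GGC", 72), rep("GGAGGAGGCGGC", 15)), collapse = "")
wrapped <- wrap_sequence(sequence, line_wrapping = 75)

nchar(sequence)   # total number of bases
nchar(wrapped)    # number of bases on each line
```

For this 396-base sequence, wrapping at 75 gives five full lines and a final line of 21 bases.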
visualise_single_sequence(
    sequence,
    ...,
    sequence_colours = sequence_colour_palettes$ggplot_style,
    background_colour = "white",
    line_wrapping = 75,
    spacing = 1,
    margin = 0.5,
    sequence_text_colour = "black",
    sequence_text_size = 16,
    index_annotation_colour = "darkred",
    index_annotation_size = 12.5,
    index_annotation_interval = 15,
    index_annotations_above = TRUE,
    index_annotation_vertical_position = 1/3,
    index_annotation_always_first_base = TRUE,
    index_annotation_always_last_base = TRUE,
    outline_colour = "black",
    outline_linewidth = 3,
    outline_join = "mitre",
    return = TRUE,
    filename = NA,
    force_raster = FALSE,
    render_device = ragg::agg_png,
    pixels_per_base = 100,
    monitor_performance = FALSE
)
sequence |
|
... |
Used to recognise aliases e.g. American spellings or common misspellings - see aliases. If any American spellings do not work, please make a bug report at https://github.com/ejade42/ggDNAvis/issues. |
sequence_colours |
|
background_colour |
|
line_wrapping |
|
spacing |
|
margin |
|
sequence_text_colour |
|
sequence_text_size |
|
index_annotation_colour |
|
index_annotation_size |
|
index_annotation_interval |
|
index_annotations_above |
|
index_annotation_vertical_position |
|
index_annotation_always_first_base |
|
index_annotation_always_last_base |
|
outline_colour |
|
outline_linewidth |
|
outline_join |
|
return |
|
filename |
|
force_raster |
|
render_device |
|
pixels_per_base |
|
monitor_performance |
|
A ggplot object containing the full visualisation, or invisible(NULL) if return = FALSE. It is often more useful to use filename = "myfilename.png", because then the visualisation is exported at the correct aspect ratio.
## Create sequence to visualise
sequence <- paste(c(rep("GGC", 72), rep("GGAGGAGGCGGC", 15)), collapse = "")

## Visualise with all defaults
## This looks ugly because it isn't at the right scale/aspect ratio
visualise_single_sequence(sequence)

## Export with all defaults rather than returning
visualise_single_sequence(
    sequence,
    filename = "example_vss_01.png",
    return = FALSE
)

## View exported image
image <- png::readPNG("example_vss_01.png")
unlink("example_vss_01.png")
grid::grid.newpage()
grid::grid.raster(image)

## Export while customising appearance
visualise_single_sequence(
    sequence,
    filename = "example_vss_02.png",
    return = FALSE,
    sequence_colours = sequence_colour_palettes$bright_pale,
    sequence_text_colour = "white",
    background_colour = "lightgrey",
    line_wrapping = 60,
    spacing = 2,
    outline_linewidth = 0,
    index_annotations_above = FALSE,
    index_annotation_always_first_base = FALSE,
    index_annotation_always_last_base = FALSE,
    margin = 0
)

## View exported image
image <- png::readPNG("example_vss_02.png")
unlink("example_vss_02.png")
grid::grid.newpage()
grid::grid.raster(image)
This function simply writes a FASTQ file from a dataframe containing
columns for read ID, sequence, and quality scores.
See fastq_quality_scores for an explanation of quality.
Said dataframe can be produced from FASTQ via read_fastq().
To read/write a modified FASTQ containing modification information
(SAM/BAM MM and ML tags) in the header lines, use
read_modified_fastq() and write_modified_fastq().
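As a rough picture of the dataframe-to-FASTQ mapping, the base R sketch below builds the four lines of each FASTQ record from read ID, sequence, and quality columns. It is an illustration under stated assumptions, not the package's code: the reads data is made up, and missing qualities are filled per read here, whereas write_fastq() fills qualities with "B" when quality_colname = NA.

```r
## Hypothetical stand-in data; column names follow the defaults of write_fastq()
reads <- data.frame(
    read     = c("read_1", "read_2"),
    sequence = c("GGCGGC", "GGAGGC"),
    quality  = c("IIIIII", NA)
)

## Assumption for illustration: fill any missing quality string
## with "B", one character per base
missing <- is.na(reads$quality)
reads$quality[missing] <- strrep("B", nchar(reads$sequence[missing]))

## Each FASTQ record is four lines: @ID, sequence, "+", quality.
## rbind() stacks the fields as rows; as.vector() reads the matrix
## column by column, i.e. record by record.
fastq_lines <- as.vector(rbind(
    paste0("@", reads$read),
    reads$sequence,
    "+",
    reads$quality
))
fastq_lines
```

The resulting character vector could then be passed to writeLines() to produce a file.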
write_fastq(
    dataframe,
    filename = NA,
    ...,
    read_id_colname = "read",
    sequence_colname = "sequence",
    quality_colname = "quality",
    return = FALSE
)
dataframe |
Dataframe containing sequence information to write back to FASTQ. Must have columns for unique read ID and DNA sequence. Should also have a column for quality, unless wanting to fill in qualities with "B" (by setting quality_colname = NA). |
filename |
|
... |
Used to recognise aliases e.g. American spellings or common misspellings - see aliases. If any American spellings do not work, please make a bug report at https://github.com/ejade42/ggDNAvis/issues. |
read_id_colname |
|
sequence_colname |
|
quality_colname |
|
return |
|
character vector. The resulting FASTQ file as a character vector of its constituent lines (or invisible(NULL) if return is FALSE). This is probably mostly useful for debugging, as setting filename within this function directly writes to FASTQ via writeLines(). Therefore, defaults to returning invisible(NULL).
## Write to FASTQ (using filename = NA, return = TRUE
## to view as char vector rather than writing to file)
write_fastq(
    example_many_sequences,
    filename = NA,
    read_id_colname = "read",
    sequence_colname = "sequence",
    quality_colname = "quality",
    return = TRUE
)

## quality_colname = NA fills in quality with "B"
write_fastq(
    example_many_sequences,
    filename = NA,
    read_id_colname = "read",
    sequence_colname = "sequence",
    quality_colname = NA,
    return = TRUE
)
This function takes a dataframe containing DNA modification information
(e.g. produced by read_modified_fastq()) and writes it back to modified
FASTQ, equivalent to what would be produced via samtools fastq -T MM,ML.
Arguments give the names of columns within the dataframe from which to read.
If multiple types of modification have been assessed (e.g. both methylation
and hydroxymethylation), then multiple colnames must be provided for locations
and probabilities, and multiple prefixes (e.g. "C+h?") must be provided.
IMPORTANT: These three vectors must all be the same length, and the modification
types must be in a consistent order (e.g. if writing hydroxymethylation and methylation
in that order, must do H then M in all three vectors and never vice versa).
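One way to guard against mismatched vectors before calling the function is a quick pre-flight check, sketched below. This is a suggestion rather than something the package requires; the pairing table is a hypothetical aid for eyeballing the order.

```r
locations_colnames     <- c("hydroxymethylation_locations", "methylation_locations")
probabilities_colnames <- c("hydroxymethylation_probabilities", "methylation_probabilities")
modification_prefixes  <- c("C+h?", "C+m?")

## All three vectors must be the same length
stopifnot(
    length(locations_colnames) == length(probabilities_colnames),
    length(locations_colnames) == length(modification_prefixes)
)

## Laying the vectors out side by side makes ordering mistakes easy to
## spot: each row should describe a single modification type
pairing <- data.frame(
    prefix        = modification_prefixes,
    locations     = locations_colnames,
    probabilities = probabilities_colnames
)
pairing
```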
If quality isn't known (e.g. there was a FASTA step at some point in the pipeline),
the quality_colname argument can be set to NA to fill in quality scores with "B". This
is the same behaviour as SAMtools v1.21 when converting FASTA to SAM/BAM then FASTQ.
I don't really know why SAMtools decided the default quality should be "B" but there
was probably a reason so I have stuck with that.
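For reference, the quality score that "B" stands for can be decoded in base R. This sketch assumes the standard Phred+33 (Sanger) encoding; see fastq_quality_scores for the package's own explanation of quality.

```r
## Assumption: standard Phred+33 encoding, where the quality score is
## the character's ASCII code minus 33
phred_score <- utf8ToInt("B") - 33
phred_score

## Corresponding per-base error probability: 10^(-Q/10)
error_probability <- 10^(-phred_score / 10)
error_probability
```

Under that assumption, "B" corresponds to a Phred score of 33, i.e. an error probability of roughly 0.0005 per base.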
Default arguments are set up to work with the included example_many_sequences data.
write_modified_fastq(
    dataframe,
    filename = NA,
    ...,
    read_id_colname = "read",
    sequence_colname = "sequence",
    quality_colname = "quality",
    locations_colnames = c("hydroxymethylation_locations", "methylation_locations"),
    probabilities_colnames = c("hydroxymethylation_probabilities", "methylation_probabilities"),
    modification_prefixes = c("C+h?", "C+m?"),
    include_blank_tags = TRUE,
    return = FALSE
)
dataframe |
|
filename |
|
... |
Used to recognise aliases e.g. American spellings or common misspellings - see aliases. If any American spellings do not work, please make a bug report at https://github.com/ejade42/ggDNAvis/issues. |
read_id_colname |
|
sequence_colname |
|
quality_colname |
|
locations_colnames |
|
probabilities_colnames |
|
modification_prefixes |
|
include_blank_tags |
|
return |
|
character vector. The resulting modified FASTQ file as a character vector of its constituent lines (or invisible(NULL) if return is FALSE). This is probably mostly useful for debugging, as setting filename within this function directly writes to FASTQ via writeLines(). Therefore, defaults to returning invisible(NULL).
## Write to FASTQ (using filename = NA, return = TRUE
## to view as char vector rather than writing to file)
write_modified_fastq(
    example_many_sequences,
    filename = NA,
    read_id_colname = "read",
    sequence_colname = "sequence",
    quality_colname = "quality",
    locations_colnames = c("hydroxymethylation_locations", "methylation_locations"),
    probabilities_colnames = c("hydroxymethylation_probabilities", "methylation_probabilities"),
    modification_prefixes = c("C+h?", "C+m?"),
    return = TRUE
)

## Write methylation only, and fill in qualities with "B"
write_modified_fastq(
    example_many_sequences,
    filename = NA,
    read_id_colname = "read",
    sequence_colname = "sequence",
    quality_colname = NA,
    locations_colnames = c("methylation_locations"),
    probabilities_colnames = c("methylation_probabilities"),
    modification_prefixes = c("C+m?"),
    return = TRUE
)