Title: | Miscellaneous Functions for Bioinformatics and Bayesian Statistics |
---|---|
Description: | A hodgepodge of hopefully helpful functions. Two of these perform shrinkage estimation: one using a simple weighted method where the user can specify the degree of shrinkage required, and one using James-Stein shrinkage estimation for the case of unequal variances. |
Authors: | Andrew McKenzie [aut, cre] |
Maintainer: | Andrew McKenzie <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1.9000 |
Built: | 2025-02-12 04:10:51 UTC |
Source: | https://github.com/andymckenzie/bayesbio |
To be used in MLE computation of the James-Stein shrinkage factor.
a_hat_mle(stat, vars, a_hat)
stat |
Input statistics to be shrinkage estimated. |
vars |
Corresponding variances of equal length. |
a_hat |
Shrinkage intensity to be estimated. |
The likelihood of the function given the parameters.
http://projecteuclid.org/euclid.ss/1331729986
By default, the base R function duplicated marks only the second and later occurrences of a value in a vector as TRUE. This function marks every occurrence of a duplicated value as TRUE, including the first.
allDups(x)
x |
The input vector. |
A logical vector.
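The behaviour described above can be reproduced in base R by combining a forward and a backward pass of duplicated. The following is a minimal sketch under that assumption, not the package's source; the helper name all_dups_sketch is hypothetical.

```r
# Hypothetical helper illustrating the idea; not taken from the package source.
all_dups_sketch <- function(x) {
  # duplicated() flags repeats after the first occurrence;
  # fromLast = TRUE flags repeats before the last occurrence.
  duplicated(x) | duplicated(x, fromLast = TRUE)
}

x <- c("a", "b", "a", "c", "b")
duplicated(x)       # FALSE FALSE  TRUE FALSE  TRUE
all_dups_sketch(x)  #  TRUE  TRUE  TRUE FALSE  TRUE
```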
cbind does not handle vectors of unequal length well (shorter vectors are recycled); this function allows vectors of unequal length to be combined, filling the missing entries with NAs.
cbindFill(...)
... |
A set of vectors separated by commas. |
A matrix that combines the inputted vectors.
http://r.789695.n4.nabble.com/How-to-join-matrices-of-different-row-length-from-a-list-td3177212.html; http://stackoverflow.com/a/7962286/560791
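One way to get NA padding in base R is to index each vector out to the longest length, since indexing past the end of a vector returns NA. This is a sketch of that approach only; cbind_fill_sketch is a hypothetical name and the package may implement the padding differently.

```r
# Hypothetical helper; pads every input vector with NA to the longest length.
cbind_fill_sketch <- function(...) {
  vecs <- list(...)
  n <- max(lengths(vecs))
  # v[seq_len(n)] returns NA for positions beyond length(v)
  padded <- lapply(vecs, function(v) v[seq_len(n)])
  do.call(cbind, padded)
}

m <- cbind_fill_sketch(1:3, 1:2, 1)
m  # 3 x 3 matrix; the second column ends in NA, the third has two NAs
```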
Takes a data frame with a diagnosis column and a number of covariate columns and reports, for each diagnosis group, the percentage of samples at specified covariate levels and/or the mean +/- sd of quantitative variables, for each covariate desired. Although it was designed for generating sample summary tables in bioinformatics experiments, and the terminology reflects this, it can be used more generally as well.
covariatesTable(df, dg_col, percent_cols = NULL, quant_cols = NULL, percent_col_cats = NULL, group_names = NULL, row_names = NULL)
df |
The data frame containing the columns to be extracted, both diagnosis and covariates. |
dg_col |
Column specifying the diagnosis column name, which is used to split the table. Levels of this value will be used to generate |
percent_cols |
Character vector of column names specifying the |
percent_col_cats |
Character vector specifying the values for which the percentage should be calculated for each percent column. |
group_names |
Optional character vector specifying the groups within the dg_col, which will be used to order the resulting table. |
row_names |
Optional character vector specifying what the rownames of the resulting table should be. |
A table summarizing the covariates.
Creates character strings by choosing letters at random and makes them unique, where necessary by appending numbers to the end using make.unique.
createStrings(number, length, upper = FALSE)
number |
Specifies the number of character strings that should be created. |
length |
Specifies the length of each character string in letters. |
upper |
Logical specifying whether the character strings should be uppercase. Default = FALSE, so the character strings are all lowercase. |
http://stackoverflow.com/a/1439541/560791
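The description above can be sketched in base R with sample, paste, and make.unique. The helper name create_strings_sketch is hypothetical, and the details (sampling letters with replacement) are assumptions rather than the package's confirmed behaviour.

```r
# Hypothetical helper; argument names mirror the documented signature.
create_strings_sketch <- function(number, length, upper = FALSE) {
  pool <- if (upper) LETTERS else letters
  strs <- vapply(seq_len(number), function(i) {
    # Draw `length` letters at random and glue them into one string
    paste(sample(pool, length, replace = TRUE), collapse = "")
  }, character(1))
  # make.unique appends .1, .2, ... to any repeated strings
  make.unique(strs)
}

set.seed(1)
ids <- create_strings_sketch(5, 4)
short_ids <- create_strings_sketch(100, 1)  # collisions are certain here
```

With only 26 one-letter strings available, the second call relies on make.unique to keep all 100 results distinct.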
This function takes a data frame and creates a horizontal (by default) bar plot from it while ordering the values.
ggHorizBar(data_df, dataCol, namesCol, labelsCol, decreasing = TRUE)
data_df |
Data frame with columns to specify the data values, the row names, and the fill colors of each of the bars. |
dataCol |
The column name that specifies the values to be plotted. |
namesCol |
The column name that specifies the corresponding name for each of the bars to be plotted. |
labelsCol |
The column name that specifies the groups of the labels. |
decreasing |
Logical specifying whether the values in dataCol should be in decreasing order. |
A ggplot2 object, which can be plotted via the plot() function or saved via the ggsave() function.
This function compares the elements in two character vectors to find the Jaccard index, i.e. the number of elements in the intersection of the two sets divided by the number of elements in their union.
jaccardSets(set1, set2)
set1 |
Character vector. |
set2 |
Character vector. |
A number (one-element numeric vector) specifying the Jaccard index from comparing the two sets.
https://en.wikipedia.org/wiki/Jaccard_index
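The Jaccard index is the size of the intersection over the size of the union. A minimal base R sketch of that definition follows; jaccard_sketch is a hypothetical name, not the package function.

```r
# Hypothetical helper computing |A intersect B| / |A union B|.
jaccard_sketch <- function(set1, set2) {
  length(intersect(set1, set2)) / length(union(set1, set2))
}

# Two shared elements ("b", "c") out of four distinct elements overall
jaccard_sketch(c("a", "b", "c"), c("b", "c", "d"))  # 0.5
```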
The replaced values will be lost following the operation of this function.
makeMatSym(mat, replaceUpper = TRUE)
mat |
The matrix to be made symmetric. |
replaceUpper |
Whether the upper triangle of the matrix should be replaced by the lower triangle. Default = TRUE; if FALSE, the lower triangle of the matrix is replaced by the upper triangle. |
A matrix that has been made symmetric.
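Copying one triangle over the other can be done in base R with upper.tri, lower.tri, and the transpose. The sketch below assumes a square matrix and uses the hypothetical name make_sym_sketch; it illustrates the documented behaviour rather than reproducing the package source.

```r
# Hypothetical helper; assumes `mat` is square.
make_sym_sketch <- function(mat, replaceUpper = TRUE) {
  if (replaceUpper) {
    # t(mat)[i, j] equals mat[j, i], so this copies the lower triangle upward
    mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
  } else {
    mat[lower.tri(mat)] <- t(mat)[lower.tri(mat)]
  }
  mat
}

m <- matrix(1:9, nrow = 3)
s_upper <- make_sym_sketch(m)                        # upper triangle overwritten
s_lower <- make_sym_sketch(m, replaceUpper = FALSE)  # lower triangle overwritten
```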
An extension to gsub that handles vectors of patterns and replacements, avoiding recursion problems associated with overlap at the expense of computation time.
mgsub(pattern, replacement, x, ...)
pattern |
Character vector of patterns to match. |
replacement |
Character vector of replacements for each pattern. |
x |
Character vector in which the gsub should be performed. |
... |
Additional arguments to grep. |
http://stackoverflow.com/a/15254254/560791
Takes two data frames, each with a time/date column in date-time or date format (i.e., able to be compared using the function difftime), finds the row of df2 that minimizes the absolute time difference for each row of df1, and merges the corresponding rows of df2 into df1 for downstream processing.
nearestTime(df1, df2, timeCol1, timeCol2)
df1 |
Data frame containing the dates for which the differences between the other data frame's date column should be minimized for each row. |
df2 |
Data frame containing the dates which should be compared to, as well as other values that should be merged to df1 per minimized date time. |
timeCol1 |
Character vector specifying the date/time column in df1. |
timeCol2 |
Character vector specifying the date/time column in df2. |
A merged data frame that minimizes datetime differences.
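The nearest-time matching described above can be sketched with difftime and which.min. The helper nearest_time_sketch and the example column names t1, t2, and value are hypothetical, and ties are resolved by taking the first minimum, which is an assumption rather than documented behaviour.

```r
# Hypothetical helper; for each row of df1, pick the row of df2 closest in time.
nearest_time_sketch <- function(df1, df2, timeCol1, timeCol2) {
  idx <- vapply(seq_len(nrow(df1)), function(i) {
    gaps <- abs(as.numeric(difftime(df2[[timeCol2]], df1[[timeCol1]][i],
                                    units = "secs")))
    which.min(gaps)  # first minimum wins in case of ties
  }, integer(1))
  matched <- df2[idx, , drop = FALSE]
  rownames(matched) <- NULL
  cbind(df1, matched)
}

df1 <- data.frame(t1 = as.Date(c("2020-01-01", "2020-03-01")))
df2 <- data.frame(t2 = as.Date(c("2020-01-05", "2020-02-25", "2020-06-01")),
                  value = c(10, 20, 30))
merged <- nearest_time_sketch(df1, df2, "t1", "t2")
merged$value  # 10 20
```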
Takes two data frames, each with a time/date column in date-time or date format (i.e., able to be compared using the function difftime), finds the row of df2 that minimizes the absolute time difference for each row of df1, and merges the corresponding rows of df2 into df1 for downstream processing.
nearestTimeandID(df1, df2, timeCol1, timeCol2, IDcol)
df1 |
Data frame containing the dates for which the differences between the other data frame's date column should be minimized for each row. |
df2 |
Data frame containing the dates which should be compared to, as well as other values that should be merged to df1 per minimized date time. |
timeCol1 |
Character vector specifying the date/time column in df1. |
timeCol2 |
Character vector specifying the date/time column in df2. |
IDcol |
Character vector specifying the ID column. IDs must be unique by row in df1. Multiple rows per ID are allowed in df2 (and expected at least for some IDs, as that is the point of the function). |
A merged data frame that minimizes datetime differences.
This function recapitulates p.adjust but allows the number of hypothesis tests n to be less than the number of p-values p. Statistical properties of the p-value adjustments may not hold.
p.adjust.nlp(p, method = p.adjust.methods, n = length(p))
p |
Numeric vector of p-values. |
method |
Correction method. |
n |
Number of comparisons to be made. |
http://stackoverflow.com/a/30110186/560791
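stats::p.adjust stops with an error when n is smaller than length(p). The sketch below shows what relaxing that check looks like for two methods only (Bonferroni and Benjamini-Hochberg), following the same arithmetic as stats::p.adjust. The name p_adjust_nlp_sketch is hypothetical, NA handling is omitted, and the package function may cover more methods.

```r
# Hypothetical helper; covers only "bonferroni" and "BH", and assumes no NAs in p.
p_adjust_nlp_sketch <- function(p, method = c("bonferroni", "BH"), n = length(p)) {
  method <- match.arg(method)
  if (method == "bonferroni") {
    return(pmin(1, n * p))
  }
  # Benjamini-Hochberg step-up, with n decoupled from length(p)
  lp <- length(p)
  o <- order(p, decreasing = TRUE)
  ro <- order(o)
  pmin(1, cummin(n / (lp:1) * p[o]))[ro]
}

p <- c(0.01, 0.02, 0.03, 0.04)
p_adjust_nlp_sketch(p, "bonferroni", n = 2)  # 0.02 0.04 0.06 0.08
```

With the default n = length(p), the sketch agrees with stats::p.adjust for these two methods.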
Perform PubMed queries on the intersections of two character vectors. This function is a wrapper to RISmed::EUtilsSummary with type = 'esearch', db = 'pubmed'.
pubmedQuery(rowTerms, colTerms, sleepTime = 0.01, ...)
rowTerms |
Character vector of terms that should make up the rows of the resulting mention count data frame. |
colTerms |
Character vector of terms for the columns. |
sleepTime |
How much time (in seconds) to sleep between successive PubMed queries. If you set this too low, PubMed may shut down your connection to prevent overloading their servers. |
... |
Additional arguments to RISmed::EUtilsSummary |
A data frame of the number of mentions for each combination of terms.
Finds the standard error of a numeric vector (i.e., the standard deviation divided by the square root of the sample size); by default, removes NAs prior to calculation.
std_error(x, na.rm = TRUE)
x |
The numeric vector whose standard error should be calculated. |
na.rm |
Logical; TRUE indicates that NAs should be removed from the vector prior to calculating the standard error, and vice versa for FALSE. |
A one-element numeric vector giving the standard error.
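The definition above (standard deviation divided by the square root of the sample size) takes only a few lines of base R. This sketch uses the hypothetical name std_error_sketch and counts only non-missing values in the sample size when na.rm is TRUE.

```r
# Hypothetical helper implementing sd(x) / sqrt(n).
std_error_sketch <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

se <- std_error_sketch(c(1, 2, 3, 4, NA))             # NA dropped, n = 4
se_na <- std_error_sketch(c(1, 2, NA), na.rm = FALSE) # NA propagates
```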
Takes a matrix and adds values to the entries that are one above the diagonal (i.e., the superdiagonal) and the entries that are one below the diagonal (i.e., the subdiagonal).
subsupDiag(matrix, x)
matrix |
Matrix whose super- and sub-diagonal values should be replaced. |
x |
Numeric vector used to replace values in the matrix. If the inputted vector is not of the same length as both the super- and sub-diagonals of the matrix, then short vector recycling will occur (e.g., x can be one value to replace all of the super- and sub-diagonals of the matrix with that one value). |
The original matrix with the values added.
http://stackoverflow.com/a/9885186/560791
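The super- and sub-diagonals can be addressed in base R by comparing row and column indices. The sketch below follows the argument description and replaces the existing values; whether the package replaces or adds to them is not confirmed here, and sub_sup_diag_sketch is a hypothetical name.

```r
# Hypothetical helper; overwrites the entries directly above and below the diagonal.
sub_sup_diag_sketch <- function(mat, x) {
  d <- row(mat) - col(mat)
  mat[d == 1] <- x    # subdiagonal, filled top-left to bottom-right
  mat[d == -1] <- x   # superdiagonal, same order; x is recycled if shorter
  mat
}

m <- diag(3)
r <- sub_sup_diag_sketch(m, 9)  # a single value fills both off-diagonals
```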
Traditional JS shrinkage estimation assumes equal variances for each of the data points, while this algorithm extends JS shrinkage estimation to entries with different variances.
unequalVarShrink(stat, vars, verbose = TRUE)
stat |
Input statistics to be shrinkage estimated. |
vars |
Corresponding variances of equal length. |
verbose |
Whether information about the algorithm should be reported. |
A data frame containing the shrinkage estimated statistics.
http://projecteuclid.org/euclid.ss/1331729986
Shrinks values towards the mean (of the sample or of the overall cohort) to a degree inversely related to the confidence you assign to each observation.
weightedShrink(x, n, m = NULL, meanVal = NULL)
x |
Numeric vector of values to be shrunken towards the mean. |
n |
Numeric vector with corresponding entries to x, specifying the number of observations used to calculate x, or some other confidence weight to associate with x. |
m |
Number specifying weight of the shrinkage estimation, relative to the number of observations in the input vector n. Defaults to the minimum of n, but this is an arbitrary value and should be explored to find an optimal value for your use case. |
meanVal |
Number specifying the overall mean towards which the values should be shrunken. Defaults to NULL, in which case it is calculated as the (non-weighted) arithmetic mean of the values in the inputted vector x. |
A numeric vector with shrunken data values.
http://math.stackexchange.com/a/41513
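One common form of this kind of weighted shrinkage is the weighted average (n * x + m * meanVal) / (n + m), so that observations with large n stay close to their own value. The sketch below assumes that formula; it is not taken from the package source, and weighted_shrink_sketch is a hypothetical name. The defaults for m and meanVal follow the argument descriptions above.

```r
# Hypothetical helper; the shrinkage formula is an assumption.
weighted_shrink_sketch <- function(x, n, m = NULL, meanVal = NULL) {
  if (is.null(m)) m <- min(n)               # documented default
  if (is.null(meanVal)) meanVal <- mean(x)  # unweighted mean of x
  (n * x + m * meanVal) / (n + m)
}

# The low-confidence value (n = 1) moves halfway to the mean of 5;
# the high-confidence value (n = 9) barely moves.
shrunk <- weighted_shrink_sketch(x = c(0, 10), n = c(1, 9))
shrunk  # 2.5 9.5
```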
A wrapper function for write.table that has the same options as read.delim.
write.delim(df, file, row.names = FALSE, col.names = TRUE, sep = "\t", quote = FALSE, ...)
df |
Data frame to be written. |
file |
Full or relative path to file to be written. |
row.names |
Logical indicating whether to include row names. |
col.names |
Logical indicating whether to include column names. |
sep |
Delimiter to separate fields in the resulting file. Default is tab separation. |
quote |
Logical indicating whether to put quotes around the resulting values. |
... |
Additional arguments to write.table. |
None; side-effect is to write to a file.
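A wrapper of this kind passes tab-delimited defaults through to write.table. The sketch below mirrors the documented signature under the hypothetical name write_delim_sketch and round-trips a small data frame through a temporary file with read.delim.

```r
# Hypothetical wrapper; defaults copied from the documented usage line.
write_delim_sketch <- function(df, file, row.names = FALSE, col.names = TRUE,
                               sep = "\t", quote = FALSE, ...) {
  write.table(df, file, row.names = row.names, col.names = col.names,
              sep = sep, quote = quote, ...)
}

tmp <- tempfile(fileext = ".txt")
d <- data.frame(gene = c("A", "B"), score = c(1.5, 2))
write_delim_sketch(d, tmp)
back <- read.delim(tmp)  # reads the tab-delimited file back in
```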