This vignette showcases the functionality of the metLinkR R package, the code for which can be found at our GitHub site.
MetLinkR is a package designed to harmonize and align metabolite identifiers across multiple input studies. Outputs are arranged to provide a number of ways to explore alignment results that will meet the different needs of various users. MetLinkR uses RefMet and RaMP-DB for metabolite name standardization and synonym searching, respectively. It is a relatively simple package with just one exported function meant for the user. Because meta-analyses of metabolomic studies can include tens of studies with thousands of metabolites, metLinkR performs queries in parallel, which requires simple setup to take advantage of.
Next we showcase the use of the main function of the package, `harmonizeInputSheets()`. This step also requires the creation of a CSV manifest that contains metadata regarding the supplied metabolite IDs. Lastly, we provide an overview of the outputs provided by a typical metLinkR run.
MetLinkR automatically performs parallel computations using the `parallel` and `doParallel` packages, but requires that the user supply the number of cores that can be reserved for the task. This can be done with the `detectCores()` function:
require(parallel)
available_cores <- parallel::detectCores()
## For maximum speed, we reserve all cores minus one, so the machine can still perform some background tasks
n_cores <- available_cores - 1
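On some platforms `detectCores()` can return `NA`, and on a single-core machine subtracting one would leave zero workers. A minimal defensive sketch of the same calculation, using only base R and the `parallel` package, might look like this:

```r
library(parallel)

# detectCores() may return NA when the core count cannot be determined,
# so fall back to a single core in that case.
available_cores <- parallel::detectCores()
if (is.na(available_cores)) available_cores <- 1L

# Reserve all cores minus one, but never drop below one worker.
n_cores <- max(1L, available_cores - 1L)
```

The resulting `n_cores` can be passed on unchanged wherever a core count is needed.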
Next we need to build a csv manifest to describe the metadata provided for our metabolite entities in our inputs. The example manifest for this vignette is as follows:
FileNames | ShortFileName | HMDB | Metabolite_Name | PubChem_CID | KEGG | LIPIDMAPS | chebi |
---|---|---|---|---|---|---|---|
example_data/SamplefromCOMETS.csv | inputfile | HMDB_ID | metabolite_name | PUBCHEM | NA | NA | NA |
example_data/2019_Metabolon_Metadata.csv | VickyFile | HMDB | BIOCHEMICAL.NAME | PUBCHEM | NA | NA | NA |
example_data/Metabolon_Annotations_Serum_hmdbformatted.csv | JohnFile | HMDB | CHEMICAL_NAME | PUBCHEM | NA | NA | NA |
example_data/CSVModifiedBroadfilefromVicky20182019.csv | broadfileVicky | HMDB_ID | Metabolite | COMP_ID | NA | NA | NA |
example_data/Broad_2022Aug_annotations.csv | broadfileEwy | hmdbId | name | pubChemId | NA | NA | NA |
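A manifest like the one above can be written by hand, but it can also be assembled in R. The sketch below rebuilds the first two rows of the example table with `data.frame()` and writes them with `write.csv()`. It assumes that each cell holds the name of the column in the corresponding input file that carries that identifier type, and that missing identifier types are recorded as `NA`. The output file name is a placeholder chosen for illustration.

```r
# First two rows of the example manifest; column names follow the table above.
manifest <- data.frame(
  FileNames       = c("example_data/SamplefromCOMETS.csv",
                      "example_data/2019_Metabolon_Metadata.csv"),
  ShortFileName   = c("inputfile", "VickyFile"),
  HMDB            = c("HMDB_ID", "HMDB"),
  Metabolite_Name = c("metabolite_name", "BIOCHEMICAL.NAME"),
  PubChem_CID     = c("PUBCHEM", "PUBCHEM"),
  KEGG            = NA_character_,   # identifier type not supplied
  LIPIDMAPS       = NA_character_,
  chebi           = NA_character_,
  stringsAsFactors = FALSE
)

# Placeholder location; write.csv() stores missing values as the string NA.
manifest_path <- file.path(tempdir(), "example_manifest.csv")
write.csv(manifest, manifest_path, row.names = FALSE)

# Read the file back to confirm the layout before handing it to metLinkR.
manifest_check <- read.csv(manifest_path, stringsAsFactors = FALSE)
```

Reading the file back is a cheap way to catch a misspelled column header before starting a long run.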
MetLinkR performs an exhaustive search of all identifier types supplied, meaning that all supplied IDs are queried to ensure the highest possible mapping rate. ID types are prioritized for harmonization purposes in the following order, from highest to lowest: HMDB, KEGG, LIPIDMAPS, ChEBI, common name, PubChem.
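To make the idea of a priority order concrete, the sketch below uses the order that this vignette gives for the `majority_vote` option (HMDB > KEGG > LIPIDMAPS > Chebi > Common Name > PubChem). The helper `pick_by_priority()` and the candidate values are hypothetical and only illustrate the selection rule. They are not metLinkR's internal code or real RefMet output.

```r
# Priority order, written with the manifest column names used in this vignette.
id_priority <- c("HMDB", "KEGG", "LIPIDMAPS", "chebi",
                 "Metabolite_Name", "PubChem_CID")

# Hypothetical helper: `candidates` is a named character vector with one
# standardized name per ID type (NA where that ID type found no match).
# Returns the candidate from the highest-priority ID type that matched.
pick_by_priority <- function(candidates, priority = id_priority) {
  ordered <- candidates[intersect(priority, names(candidates))]
  matched <- ordered[!is.na(ordered)]
  if (length(matched) == 0) return(NA_character_)
  unname(matched[1])
}

# Toy input: the HMDB lookup failed, so the KEGG-derived name wins over PubChem.
candidates <- c(PubChem_CID = "name_from_pubchem",
                HMDB        = NA,
                KEGG        = "name_from_kegg")
chosen <- pick_by_priority(candidates)
```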
With the manifest created and our input data files properly organized, we’re ready to run metLinkR.
The main exported function for metLinkR is `harmonizeInputSheets`, which takes six parameters as input:

- `inputcsv`: the file/pathname for the identifier manifest we constructed
- `n_cores`: the number of cores for parallelization
- `mapping_library_format`: one of the main outputs of a metLinkR analysis is the mapping library, which contains the associations from metabolite species in each input file to a consensus name generated by RefMet. This parameter specifies whether this library should be returned in “long”, “wide”, or “both” formats.
- `remove_parentheses_for_synonym_search`: a common issue when parsing common names is that some platforms include redundant or secondary information in parentheses following a primary ID. This optional filter removes this information after attempting a primary match.
- `use_metabolon_parsers`: removes special characters such as “*” and “(1)” or “(2)”, which mark ambiguous and duplicate features from the Metabolon platform, respectively
- `majority_vote`: when multiple IDs are supplied for a metabolite, either (1) use the most common RefMet output among options, or (2) prioritize options in the following order: HMDB > KEGG > LIPIDMAPS > ChEBI > Common Name > PubChem

`harmonizeInputSheets` will keep you apprised of its progress as it runs, and will print the mapping rates and the run time to the console. There is no need to save this information, as it is also present in the output files.
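The name-cleaning and voting options described above can be illustrated with a few lines of base R. The three helpers below are hypothetical sketches written for this vignette. They show the kind of transformation each option describes and are not the functions metLinkR uses internally. The input strings are toy values.

```r
# Hypothetical helper: remove "*" characters and a trailing "(1)" or "(2)",
# the Metabolon-style markers mentioned for use_metabolon_parsers.
strip_metabolon_markers <- function(x) {
  x <- gsub("*", "", x, fixed = TRUE)
  x <- sub("\\s*\\([12]\\)\\s*$", "", x)
  trimws(x)
}

# Hypothetical helper: drop one trailing parenthetical, as described for
# remove_parentheses_for_synonym_search. Parentheses inside a name are kept.
strip_trailing_parentheses <- function(x) {
  trimws(sub("\\s*\\([^()]*\\)\\s*$", "", x))
}

# Hypothetical helper: return the most frequent non-missing candidate.
# Ties are resolved in favor of the candidate that appears first.
majority_vote_pick <- function(candidates) {
  candidates <- candidates[!is.na(candidates)]
  if (length(candidates) == 0) return(NA_character_)
  counts <- table(factor(candidates, levels = unique(candidates)))
  names(counts)[which.max(counts)]
}

cleaned_a <- strip_metabolon_markers("toy metabolite* (2)")
cleaned_b <- strip_trailing_parentheses("toy metabolite (platform note)")
winner    <- majority_vote_pick(c("name_a", "name_b", "name_b", NA))
```

Anchoring the patterns at the end of the string is a deliberate choice in this sketch, since many systematic chemical names contain parentheses in the middle that should not be removed.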
require("metLinkR")
setwd(paste0(path.package("metLinkR"),"/extdata"))
n_cores <- parallel::detectCores() - 1
metLinkR_output <- harmonizeInputSheets(inputcsv=system.file("extdata/HarmInputFiles.csv", package="metLinkR"),
                                        n_cores = n_cores,
                                        mapping_library_format="both",
                                        remove_parentheses_for_synonym_search = TRUE,
                                        use_metabolon_parsers = TRUE,
                                        majority_vote = TRUE)
## (1/5) Imported files
## Loading RaMP-DB version 2.5.4 from cache.
## Loading RaMP-DB version 2.5.4 from cache.
## Loading RaMP-DB version 2.5.4 from cache.
## Loading RaMP-DB version 2.5.4 from cache.
## Loading RaMP-DB version 2.5.4 from cache.
## (2/5) Performed initial RefMet mapping
## (3/5) Found RaMP synonyms for unmapped inputs
## (4/5) Queried RaMP synonyms in RefMet
## MetLinkR achieved the following mapping rates:
## inputfile: 95.9%
## VickyFile: 74.3%
## JohnFile: 72.1%
## broadfileVicky: 55.6%
## broadfileEwy: 73.2%
## Global mapping rate: 75.1 %
## Loading RaMP-DB version 2.5.4 from cache.
##
##
## processing file: metLinkR_report.Rmd
## 1/11
## 2/11 [setup]
## 3/11
## 4/11 [unnamed-chunk-6]
## 5/11
## 6/11 [unnamed-chunk-7]
## 7/11
## 8/11 [unnamed-chunk-8]
## 9/11
## 10/11 [unnamed-chunk-9]
## 11/11
## output file: metLinkR_report.knit.md
## /usr/local/bin/pandoc +RTS -K512m -RTS metLinkR_report.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output metLinkR_report.html --lua-filter /Users/pattac/Library/R/x86_64/4.3/library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /Users/pattac/Library/R/x86_64/4.3/library/rmarkdown/rmarkdown/lua/latex-div.lua --self-contained --variable bs3=TRUE --section-divs --table-of-contents --toc-depth 3 --variable toc_float=1 --variable toc_selectors=h1,h2,h3 --variable toc_collapsed=1 --variable toc_smooth_scroll=1 --variable toc_print=1 --template /Users/pattac/Library/R/x86_64/4.3/library/rmarkdown/rmd/h/default.html --highlight-style tango --number-sections --variable theme=cerulean --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /var/folders/7z/x3znsmd13nj_ky3wsbt3_sw4zzmgyq/T//Rtmpug4Q2O/rmarkdown-str1411829e669b9.html
##
## Output created: metLinkR_report.html
## [1] "(5/5) Wrote output files to metLinkR_output/"
MetLinkR creates a new subdirectory called “metLinkR_output” in the current working directory and writes its outputs there. Set the working directory before the run so that this new directory ends up in a memorable and convenient location.
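Because the output location follows the working directory, it can help to set that directory explicitly before the run. Note that the example call in this vignette sets the working directory to the package's `extdata` folder, so its outputs land inside the installed package. The sketch below uses a placeholder project folder under `tempdir()`, which you would replace with a location of your own.

```r
# Placeholder project folder; replace with a directory of your choice.
project_dir <- file.path(tempdir(), "metLinkR_project")
dir.create(project_dir, showWarnings = FALSE, recursive = TRUE)

# setwd() returns the previous working directory, so it can be restored later.
old_wd <- setwd(project_dir)

# ... run harmonizeInputSheets() here ...

# Outputs are expected in this subdirectory of the working directory.
expected_output_dir <- file.path(getwd(), "metLinkR_output")

# Restore the original working directory.
setwd(old_wd)
```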
MetLinkR creates four basic outputs:
For questions, support, or bug fixes for metLinkR, please see our GitHub repository or contact the author directly at andrew.patt@nih.gov.
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.7.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] metLinkR_0.0.0.9000
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2 dplyr_1.1.4
## [5] blob_1.2.4 filelock_1.0.3 fastmap_1.2.0 BiocFileCache_2.13.0
## [9] promises_1.3.0 digest_0.6.37 mime_0.12 lifecycle_1.0.4
## [13] ellipsis_0.3.2 processx_3.8.4 RSQLite_2.3.7 magrittr_2.0.3
## [17] compiler_4.3.2 rlang_1.1.5 sass_0.4.9 tools_4.3.2
## [21] yaml_2.3.10 data.table_1.16.0 knitr_1.49 labeling_0.4.3
## [25] htmlwidgets_1.6.4 bit_4.0.5 pkgbuild_1.4.4 curl_6.2.1
## [29] plyr_1.8.9 xml2_1.3.7 pkgload_1.3.4 miniUI_0.1.1.1
## [33] withr_3.0.2 purrr_1.0.4 desc_1.4.3 grid_4.3.2
## [37] urlchecker_1.0.1 profvis_0.3.8 xtable_1.8-4 colorspace_2.1-1
## [41] ggplot2_3.5.1 scales_1.3.0 iterators_1.0.14 cli_3.6.4.9000
## [45] UpSetR_1.4.0 rmarkdown_2.29 crayon_1.5.3 generics_0.1.3
## [49] remotes_2.5.0 xlsx_0.6.5 rstudioapi_0.17.1 httr_1.4.7
## [53] sessioninfo_1.2.2 DBI_1.2.3 cachem_1.1.0 stringr_1.5.1
## [57] vctrs_0.6.5 devtools_2.4.5 jsonlite_1.9.1 callr_3.7.6
## [61] bit64_4.0.5 systemfonts_1.1.0 foreach_1.5.2 tidyr_1.3.1
## [65] jquerylib_0.1.4 glue_1.8.0 codetools_0.2-19 ps_1.7.7
## [69] gtable_0.3.6 stringi_1.8.4 rJava_1.0-11 later_1.3.2
## [73] munsell_0.5.1 tibble_3.2.1 pillar_1.10.1 xlsxjars_0.6.1
## [77] htmltools_0.5.8.1 R6_2.6.1 dbplyr_2.5.0 doParallel_1.0.17
## [81] evaluate_1.0.3 shiny_1.9.1 kableExtra_1.4.0 memoise_2.0.1
## [85] httpuv_1.6.15 bslib_0.9.0 Rcpp_1.0.14 RaMP_3.0.2
## [89] gridExtra_2.3 svglite_2.1.3 xfun_0.51 fs_1.6.5
## [93] usethis_2.2.3 pkgconfig_2.0.3