Introduction

This vignette showcases the functionality of the metLinkR R package, the code for which can be found at our GitHub site.

MetLinkR is a package designed to harmonize and align metabolite identifiers across multiple input studies. Its outputs are arranged to offer several ways of exploring the alignment results, to suit the needs of different users. MetLinkR uses RefMet for metabolite name standardization and RaMP-DB for synonym searching. It is a relatively simple package with a single exported, user-facing function. Because meta-analyses of metabolomic studies can include tens of studies with thousands of metabolites, metLinkR performs queries in parallel, which requires a small amount of setup.

Below, we first set up parallel computing and then showcase the main function of the package, harmonizeInputSheets(). This step also requires a CSV manifest containing metadata about the supplied metabolite IDs. Lastly, we give an overview of the outputs of a typical metLinkR run.

Setting up parallel computing

MetLinkR automatically performs parallel computations using the parallel and doParallel packages, but requires that the user supply the number of cores that can be reserved for the task. This can be done with the detectCores() function:

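library(parallel)
available_cores <- parallel::detectCores()
## For maximum speed, we reserve all cores minus one, so the machine can still perform some background tasks.
## detectCores() can return NA, and a single-core machine would give 0, so we fall back to 1 in both cases.
n_cores <- max(1, available_cores - 1, na.rm = TRUE)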
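
To give a sense of what the reserved cores are used for, the following is a minimal sketch of a parallel map using only the base parallel package. It is an illustration, not metLinkR's internal code: the toy task (squaring numbers) and the cap of two workers are assumptions made to keep the example small.

```r
library(parallel)

## Reserve all cores but one, falling back to 1 if detection fails
n_cores <- max(1L, detectCores() - 1L, na.rm = TRUE)

## Start a small cluster (capped at 2 workers for this toy example)
cl <- makeCluster(min(n_cores, 2L))

## Distribute a toy task across the workers
squares <- parLapply(cl, 1:4, function(x) x^2)

## Always release the workers when finished
stopCluster(cl)

unlist(squares)
```

When running metLinkR itself, you only need to pass n_cores to the function; cluster setup and teardown are handled for you.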

Building identifier manifest

Next, we need to build a CSV manifest that describes which columns in each input file hold the metabolite identifiers. The example manifest for this vignette is as follows:

FileNames	ShortFileName	HMDB	Metabolite_Name	PubChem_CID	KEGG	LIPIDMAPS	chebi
example_data/SamplefromCOMETS.csv	inputfile	HMDB_ID	metabolite_name	PUBCHEM	NA	NA	NA
example_data/2019_Metabolon_Metadata.csv	VickyFile	HMDB	BIOCHEMICAL.NAME	PUBCHEM	NA	NA	NA
example_data/Metabolon_Annotations_Serum_hmdbformatted.csv	JohnFile	HMDB	CHEMICAL_NAME	PUBCHEM	NA	NA	NA
example_data/CSVModifiedBroadfilefromVicky20182019.csv	broadfileVicky	HMDB_ID	Metabolite	COMP_ID	NA	NA	NA
example_data/Broad_2022Aug_annotations.csv	broadfileEwy	hmdbId	name	pubChemId	NA	NA	NA

The first column, “FileNames”, provides the path to each of the five supplied input datasets.
The second, “ShortFileName”, is an arbitrary nickname provided by the user that will be used for graphics and other outputs.
The remaining six columns correspond to the different identifier types accepted by MetLinkR: HMDB ID, common name, PubChem ID, KEGG ID, LIPIDMAPS ID, and Chebi ID. Each entry in these columns is the name of the column that holds that identifier type in the corresponding input file. NA is acceptable when an identifier type is not present in a particular input. There are no restrictions on which identifiers must be supplied; an identifier type does not need to be present in every input file to be incorporated into the mappings.
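
A manifest like this can also be assembled in R rather than by hand. The sketch below rebuilds two rows of the example manifest and writes them to a temporary file. The output path is hypothetical, and writing a comma-separated file with write.csv() is an assumption based on the manifest being described as a csv.

```r
## Two rows of the example manifest; column names match the header shown above
manifest <- data.frame(
  FileNames       = c("example_data/SamplefromCOMETS.csv",
                      "example_data/Broad_2022Aug_annotations.csv"),
  ShortFileName   = c("inputfile", "broadfileEwy"),
  HMDB            = c("HMDB_ID", "hmdbId"),
  Metabolite_Name = c("metabolite_name", "name"),
  PubChem_CID     = c("PUBCHEM", "pubChemId"),
  KEGG            = NA_character_,  # identifier type absent from these inputs
  LIPIDMAPS       = NA_character_,
  chebi           = NA_character_,
  stringsAsFactors = FALSE
)

## Hypothetical location; in practice, save the manifest next to your input files
manifest_path <- file.path(tempdir(), "HarmInputFiles.csv")
write.csv(manifest, manifest_path, row.names = FALSE)

## Read it back to confirm the layout
read.csv(manifest_path)
```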

MetLinkR performs an exhaustive search of all identifier types supplied, meaning that all supplied IDs are queried to ensure the highest possible mapping rate. ID types are prioritized for harmonization purposes in the following order, from highest to lowest:

HMDB
KEGG
LIPIDMAPS
Chebi
Common Name
PubChem
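
The priority order above can be pictured with a small toy function. This is a sketch of the idea only, not metLinkR's implementation; pick_by_priority() and the candidate names are hypothetical.

```r
## Priority order as listed above, highest first
id_priority <- c("HMDB", "KEGG", "LIPIDMAPS", "Chebi", "Common Name", "PubChem")

## candidates: named character vector, one standardized name per identifier
## type (NA where that identifier type produced no mapping)
pick_by_priority <- function(candidates) {
  for (id_type in id_priority) {
    hit <- candidates[id_type]
    if (!is.na(hit)) return(unname(hit))
  }
  NA_character_
}

## HMDB gave no mapping, so the KEGG-derived name wins over PubChem
pick_by_priority(c(PubChem = "Glucose", KEGG = "D-Glucose", HMDB = NA))
```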

With the manifest created and our input data files properly organized, we’re ready to run metLinkR.

Running harmonization

The main exported function for metLinkR is harmonizeInputSheets, which takes six parameters as input:

inputcsv: the file/pathname for the identifier manifest we constructed
n_cores: the number of cores for parallelization
mapping_library_format: one of the main outputs of a metLinkR analysis is the mapping library, which contains the associations between metabolite species from each input file and a consensus name generated by RefMet. This parameter specifies whether this library should be returned in “long”, “wide”, or “both” formats.
remove_parentheses_for_synonym_search: We found a common issue with parsing common names was that some platforms would include redundant or secondary information in parentheses following a primary ID. This optional filter removes this information after attempting a primary match.
use_metabolon_parsers: removes special characters such as “*” and “(1)” or “(2)” when dealing with ambiguous or duplicate features from the Metabolon platform, respectively
majority_vote: When multiple IDs are supplied for a metabolite, either (1) use the most common RefMet output among options, or (2) prioritize options in the following order: HMDB > KEGG > LIPIDMAPS > Chebi > Common Name > PubChem
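
To make the name-cleaning and voting options more concrete, here is a toy sketch of each idea. None of this is metLinkR's actual code: the helper names, regular expressions, and example metabolite names are all assumptions chosen for illustration.

```r
## Idea behind use_metabolon_parsers: drop "*" and a trailing "(1)" or "(2)"
clean_metabolon <- function(x) {
  x <- gsub("*", "", x, fixed = TRUE)
  x <- gsub("\\s*\\([12]\\)\\s*$", "", x)
  trimws(x)
}

## Idea behind remove_parentheses_for_synonym_search: drop a trailing
## parenthetical after the primary name
strip_parentheses <- function(x) {
  trimws(gsub("\\s*\\([^()]*\\)\\s*$", "", x))
}

## Idea behind a majority vote: keep the most frequent non-missing candidate.
## Ties fall to the alphabetically first name, an artifact of this sketch.
pick_by_majority <- function(candidates) {
  votes <- table(candidates[!is.na(candidates)])
  if (length(votes) == 0) return(NA_character_)
  names(votes)[which.max(votes)]
}

clean_metabolon(c("glycine*", "palmitoylcarnitine (2)"))
strip_parentheses("carnitine (C0)")
pick_by_majority(c(HMDB = "Glucose", KEGG = "Glucose", PubChem = "D-Glucose"))
```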

harmonizeInputSheets will keep you apprised of its progress as it runs, and will print mapping-rate statistics and the run time to the console. Don’t worry about saving this information; it will also be present in the output files.

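library(metLinkR)
## Work from the package's extdata directory; the example manifest uses relative paths (example_data/...)
setwd(system.file("extdata", package = "metLinkR"))
n_cores <- max(1, parallel::detectCores() - 1, na.rm = TRUE)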

metLinkR_output <- harmonizeInputSheets(inputcsv=system.file("extdata/HarmInputFiles.csv", package="metLinkR"),
                                        n_cores = n_cores,
                                        mapping_library_format="both",
                                        remove_parentheses_for_synonym_search = TRUE,
                                        use_metabolon_parsers = TRUE,
                                        majority_vote = TRUE)

## (1/5) Imported files

## Loading RaMP-DB version 2.5.4 from cache.
## Loading RaMP-DB version 2.5.4 from cache.
## Loading RaMP-DB version 2.5.4 from cache.
## Loading RaMP-DB version 2.5.4 from cache.
## Loading RaMP-DB version 2.5.4 from cache.

## (2/5) Performed initial RefMet mapping

## (3/5) Found RaMP synonyms for unmapped inputs

## (4/5) Queried RaMP synonyms in RefMet

## MetLinkR achieved the following mapping rates:
## inputfile: 95.9%
## VickyFile: 74.3%
## JohnFile: 72.1%
## broadfileVicky: 55.6%
## broadfileEwy: 73.2%
## Global mapping rate:  75.1 %

## Loading RaMP-DB version 2.5.4 from cache.

## 
## 
## processing file: metLinkR_report.Rmd

## 1/11                  
## 2/11 [setup]          
## 3/11                  
## 4/11 [unnamed-chunk-6]

## 5/11                  
## 6/11 [unnamed-chunk-7]

## 7/11                  
## 8/11 [unnamed-chunk-8]
## 9/11                  
## 10/11 [unnamed-chunk-9]

## 11/11

## output file: metLinkR_report.knit.md

## /usr/local/bin/pandoc +RTS -K512m -RTS metLinkR_report.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output metLinkR_report.html --lua-filter /Users/pattac/Library/R/x86_64/4.3/library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /Users/pattac/Library/R/x86_64/4.3/library/rmarkdown/rmarkdown/lua/latex-div.lua --self-contained --variable bs3=TRUE --section-divs --table-of-contents --toc-depth 3 --variable toc_float=1 --variable toc_selectors=h1,h2,h3 --variable toc_collapsed=1 --variable toc_smooth_scroll=1 --variable toc_print=1 --template /Users/pattac/Library/R/x86_64/4.3/library/rmarkdown/rmd/h/default.html --highlight-style tango --number-sections --variable theme=cerulean --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /var/folders/7z/x3znsmd13nj_ky3wsbt3_sw4zzmgyq/T//Rtmpug4Q2O/rmarkdown-str1411829e669b9.html

## 
## Output created: metLinkR_report.html

## [1] "(5/5) Wrote output files to metLinkR_output/"

Exploring outputs

MetLinkR creates a new subdirectory in the current working directory called “metLinkR_output”, to which the various outputs are written. Make sure that this new directory ends up in a memorable and desirable location.
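
A quick way to see what was written is to list the contents of that directory. The helper below is a small hypothetical convenience function, not part of metLinkR; it makes no assumptions about the names of the files inside.

```r
## List everything under <base_dir>/metLinkR_output, or return an empty
## vector (with a message) if no run has written there yet
list_metlinkr_outputs <- function(base_dir = getwd()) {
  out_dir <- file.path(base_dir, "metLinkR_output")
  if (!dir.exists(out_dir)) {
    message("No metLinkR_output directory found in ", base_dir)
    return(character(0))
  }
  list.files(out_dir, recursive = TRUE)
}

list_metlinkr_outputs()
```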

MetLinkR creates four basic outputs:

Mapping library: As described above, the file that contains the mappings between the consensus metabolite species names and the entries as they appear in the input files. It can be written in long format, wide format, or both.
Annotated input files: the input files are returned with an added column that contains the consensus name for the metabolite in that row.
Report: A report (metLinkR_report.html in the run shown above) is generated that details the mapping rates by file, lists the identifier types that were used to achieve the mappings, and gives a breakdown of the ClassyFire chemical superclass mappings that were associated with the identified metabolites.
Text log: the text log contains information on runtime, the date/time the run was generated, a copy of the identifier manifest, and R session info for the run.
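
The difference between the long and wide mapping library formats can be illustrated with a toy table. The column names and metabolite entries below are invented for the example and will not match the real output files; base R's reshape() is used only to show the two layouts.

```r
## Long format: one row per (consensus name, input file) pair
long <- data.frame(
  refmet_name = c("Glucose", "Glucose", "Alanine"),
  file        = c("inputfile", "VickyFile", "inputfile"),
  input_name  = c("glucose", "D-Glucose", "alanine"),
  stringsAsFactors = FALSE
)

## Wide format: one row per consensus name, one column per input file
wide <- reshape(long, idvar = "refmet_name", timevar = "file",
                direction = "wide")
wide
```

In the wide layout, a metabolite that is missing from a given input file shows up as NA in that file's column, which makes gaps between studies easy to spot.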

For questions, support, or bug fixes for metLinkR, please see our GitHub repository or contact the author directly at andrew.patt@nih.gov.

Session Info

sessionInfo()

## R version 4.3.2 (2023-10-31)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.7.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] metLinkR_0.0.0.9000
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     viridisLite_0.4.2    farver_2.1.2         dplyr_1.1.4         
##  [5] blob_1.2.4           filelock_1.0.3       fastmap_1.2.0        BiocFileCache_2.13.0
##  [9] promises_1.3.0       digest_0.6.37        mime_0.12            lifecycle_1.0.4     
## [13] ellipsis_0.3.2       processx_3.8.4       RSQLite_2.3.7        magrittr_2.0.3      
## [17] compiler_4.3.2       rlang_1.1.5          sass_0.4.9           tools_4.3.2         
## [21] yaml_2.3.10          data.table_1.16.0    knitr_1.49           labeling_0.4.3      
## [25] htmlwidgets_1.6.4    bit_4.0.5            pkgbuild_1.4.4       curl_6.2.1          
## [29] plyr_1.8.9           xml2_1.3.7           pkgload_1.3.4        miniUI_0.1.1.1      
## [33] withr_3.0.2          purrr_1.0.4          desc_1.4.3           grid_4.3.2          
## [37] urlchecker_1.0.1     profvis_0.3.8        xtable_1.8-4         colorspace_2.1-1    
## [41] ggplot2_3.5.1        scales_1.3.0         iterators_1.0.14     cli_3.6.4.9000      
## [45] UpSetR_1.4.0         rmarkdown_2.29       crayon_1.5.3         generics_0.1.3      
## [49] remotes_2.5.0        xlsx_0.6.5           rstudioapi_0.17.1    httr_1.4.7          
## [53] sessioninfo_1.2.2    DBI_1.2.3            cachem_1.1.0         stringr_1.5.1       
## [57] vctrs_0.6.5          devtools_2.4.5       jsonlite_1.9.1       callr_3.7.6         
## [61] bit64_4.0.5          systemfonts_1.1.0    foreach_1.5.2        tidyr_1.3.1         
## [65] jquerylib_0.1.4      glue_1.8.0           codetools_0.2-19     ps_1.7.7            
## [69] gtable_0.3.6         stringi_1.8.4        rJava_1.0-11         later_1.3.2         
## [73] munsell_0.5.1        tibble_3.2.1         pillar_1.10.1        xlsxjars_0.6.1      
## [77] htmltools_0.5.8.1    R6_2.6.1             dbplyr_2.5.0         doParallel_1.0.17   
## [81] evaluate_1.0.3       shiny_1.9.1          kableExtra_1.4.0     memoise_2.0.1       
## [85] httpuv_1.6.15        bslib_0.9.0          Rcpp_1.0.14          RaMP_3.0.2          
## [89] gridExtra_2.3        svglite_2.1.3        xfun_0.51            fs_1.6.5            
## [93] usethis_2.2.3        pkgconfig_2.0.3

MetLinkR Vignette

Andrew Patt