1. Introduction

The PubChemR package is designed for R users who need to interact with the PubChem database, a free resource from the National Center for Biotechnology Information (NCBI). PubChem is a key repository of chemical and biological data, including information on chemical structures, identifiers, chemical and physical properties, biological activities, patents, health, safety, toxicity data, and much more.

This package simplifies the process of accessing and manipulating this vast array of data directly from R, making it a valuable resource for chemists, biologists, bioinformaticians, and researchers in related fields. In this vignette, we will explore the various functionalities offered by the PubChemR package. Each function is designed to allow users to efficiently retrieve specific types of data from PubChem. We will cover how to install and load the package, provide detailed descriptions of each function, and demonstrate their usage with practical examples.

2. Installation

The PubChemR package can be installed either from the Comprehensive R Archive Network (CRAN) or directly from its GitHub repository, so users can choose between the stable CRAN release and the latest development version, which may include newer features and fixes.

Installing from CRAN

For most users, installing PubChemR from CRAN is the recommended method as it ensures a stable and tested version of the package. You can install it using the standard R package installation command:

install.packages("PubChemR")

This command will download and install the PubChemR package along with any dependencies it requires. Once installed, you can load the package in your R session as follows:

library(PubChemR)
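In scripts that are shared or run on several machines, it can be convenient to install the package only when it is missing. The sketch below uses base R's requireNamespace() for that check. The helper name ensure_installed is our own and is not part of PubChemR.

```r
# Hypothetical helper (not part of PubChemR): make sure a package is available.
# Returns TRUE when the package can be loaded, FALSE otherwise.
ensure_installed <- function(pkg, install = TRUE) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(TRUE)
  }
  if (!install) {
    return(FALSE)
  }
  install.packages(pkg)  # downloads from the configured CRAN mirror
  requireNamespace(pkg, quietly = TRUE)
}

# Typical use at the top of an analysis script:
# if (ensure_installed("PubChemR")) library(PubChemR)
```

Passing install = FALSE turns the helper into a pure availability check, which is useful on systems where packages must not be installed automatically.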

Installing the Development Version from GitHub

For users who are interested in the latest features and updates that might not yet be available on CRAN, the development version of PubChemR can be installed from GitHub. This version is likely to include recent enhancements and bug fixes but may also be less stable than the CRAN release.

To install the development version, you will first need to install the devtools package, which provides functions to install packages directly from GitHub and other sources. You can install devtools from CRAN using:

install.packages("devtools")

Once devtools is installed, you can install the development version of PubChemR using:

devtools::install_github("selcukorkmaz/PubChemR")

This command downloads and installs the package from the specified GitHub repository. After installation, load the package as usual:

library(PubChemR)

3. Implementation

The PubChemR package offers a suite of functions designed to interact with the PubChem database, allowing users to retrieve and manipulate chemical data efficiently. Below is an overview of the main functions provided by the package:

3.1. Retrieving AIDs with get_aids()

The get_aids function is designed to retrieve Assay IDs (AIDs) from the PubChem database. This function is useful for accessing detailed assay data related to specific compounds or substances, which is crucial in fields such as pharmacology, biochemistry, and molecular biology.

The function supports a range of identifiers including integers (e.g., CID and SID) and strings (e.g., name, SMILES, InChIKey and formula). Users can specify the namespace and domain for the query, as well as the type of search to be performed (e.g., substructure, superstructure, similarity, identity).

Here are the main parameters of the function:

  • identifier: A vector of positive integers (e.g. cid, sid) or identifier strings (name, smiles, inchikey, formula).
  • namespace: Specifies the type of identifier provided.
  • domain: Specifies the domain of the query.
  • searchtype: Specifies the type of search to be performed.
  • options: Additional arguments.
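To make the roles of domain, namespace, and identifier more concrete, the sketch below assembles the kind of PUG REST URL that such a query corresponds to. It is an illustration only: build_aids_url is a hypothetical helper, and we assume, rather than document, that PubChemR issues a request of this shape internally. The layout follows PubChem's public PUG REST convention of domain/namespace/identifiers/operation/output.

```r
# Hypothetical helper, not part of PubChemR: build a PUG REST URL for an AID lookup.
build_aids_url <- function(identifier, namespace = "cid", domain = "compound") {
  base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
  # Numeric IDs are written without scientific notation (e.g. never "1e+05").
  ids <- if (is.numeric(identifier)) {
    format(identifier, scientific = FALSE, trim = TRUE)
  } else {
    as.character(identifier)
  }
  # Several numeric IDs are joined with commas in a single request.
  paste(base, domain, namespace, paste(ids, collapse = ","), "aids", "JSON", sep = "/")
}

build_aids_url(c(2244, 2519, 3672))
```

The comma-joined form shown here applies to numeric identifiers such as CIDs and SIDs. String identifiers such as names are better sent one at a time.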

Retrieving AIDs by CID

In this example, we retrieve AIDs for the compounds with CIDs (Compound IDs) 2244 (aspirin), 2519 (caffeine), and 3672 (ibuprofen):

aids_by_cid <- get_aids(
  identifier = c(2244, 2519, 3672),
  namespace = "cid",
  domain = "compound"
)

aids_by_cid
#> 
#>  Assay IDs (AIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: CID
#>   - Identifier: 2244, 2519, ... and 1 more.
#> 
#>  NOTE: run AIDs(...) to extract Assays ID data. See ?AIDs for help.

The above code retrieves AIDs for the compounds with CIDs 2244, 2519 and 3672. The output shows the request details including the domain (Compound), namespace (Compound ID), and identifier (2244, 2519, … and 1 more). This provides a summary of the query performed.

To retrieve the AIDs associated with these compounds, we use the AIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

aids <- AIDs(object = aids_by_cid, .to.data.frame = TRUE)
aids
#> # A tibble: 8,931 × 2
#>      CID   AID
#>    <dbl> <dbl>
#>  1  2244     1
#>  2  2244     3
#>  3  2244     9
#>  4  2244    15
#>  5  2244    19
#>  6  2244    21
#>  7  2244    23
#>  8  2244    25
#>  9  2244    29
#> 10  2244    31
#> # ℹ 8,921 more rows

The output is a tibble (data frame) with two columns: CID and AID. The CID column contains the compound IDs (2244, 2519 and 3672), and the AID column contains the Assay IDs.

table(aids$CID)
#> 
#> 2244 2519 3672 
#> 3240 2362 3329

There are 8,931 rows in total: 3,240 assays related to aspirin, 2,362 related to caffeine, and 3,329 related to ibuprofen.
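The per-compound counts come straight from the CID column, so the same table can answer follow-up questions, such as which assays tested more than one of the compounds. The sketch below uses a small hand-made data frame in place of the real result. The first three rows are taken from the output above, while the rows for CIDs 2519 and 3672 are invented for illustration.

```r
# Toy stand-in for the tibble returned by AIDs(); rows 4-6 are invented.
aids_demo <- data.frame(
  CID = c(2244, 2244, 2244, 2519, 2519, 3672),
  AID = c(1, 3, 9, 1, 15, 3)
)

# Number of assays per compound (same idea as table(aids$CID)).
assays_per_cid <- table(aids_demo$CID)

# Assays that appear for more than one compound.
aid_freq <- table(aids_demo$AID)
shared_aids <- as.numeric(names(aid_freq)[as.vector(aid_freq) > 1])

assays_per_cid
shared_aids
```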

Retrieving AIDs by SID

In this example, we retrieve Assay IDs for the substances with SIDs (Substance IDs) 103414350 and 103204295:

aids_by_sid <- get_aids(
  identifier = c(103414350, 103204295),
  namespace = "sid",
  domain = "substance"
)

aids_by_sid
#> 
#>  Assay IDs (AIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Substance
#>   - Namespace: SID
#>   - Identifier: 103414350, 103204295
#> 
#>  NOTE: run AIDs(...) to extract Assays ID data. See ?AIDs for help.

The above code retrieves Assay IDs for the substances with SIDs (Substance IDs) 103414350 and 103204295. The output shows the request details including the domain (Substance), namespace (SID), and identifiers (103414350, 103204295). This provides a summary of the query performed.

To retrieve the Assay IDs associated with the SIDs 103414350 and 103204295, we use the AIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

AIDs(object = aids_by_sid, .to.data.frame = TRUE)
#> # A tibble: 8 × 2
#>         SID    AID
#>       <dbl>  <dbl>
#> 1 103414350   7810
#> 2 103414350   7815
#> 3 103414350   7816
#> 4 103414350   7820
#> 5 103414350  18990
#> 6 103204295   8712
#> 7 103204295   9506
#> 8 103204295 151808

The output is a tibble (data frame) with two columns: SID and AID. The SID column contains the substance IDs (103414350 and 103204295), and the AID column contains the Assay IDs. There are 8 rows in total, with 5 assays related to 103414350 and 3 assays related to 103204295.
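When the per-substance grouping matters more than the flat table, the two-column result can be regrouped with base R. The sketch below rebuilds the eight rows shown above as a plain data frame and splits the AIDs by SID. It is a stand-in for the real object, and we make no claim that it matches the list layout returned when .to.data.frame is FALSE.

```r
# Rows copied from the output above.
aids_by_sid_df <- data.frame(
  SID = c(rep(103414350, 5), rep(103204295, 3)),
  AID = c(7810, 7815, 7816, 7820, 18990, 8712, 9506, 151808)
)

# One vector of AIDs per substance, named by SID.
aids_per_sid <- split(aids_by_sid_df$AID, aids_by_sid_df$SID)

# Number of assays per substance.
lengths(aids_per_sid)
```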

Retrieving AIDs by Name

In this example, we retrieve Assay IDs for the compounds with the names paracetamol, naproxen, and diclofenac:

aids_by_name <- get_aids(
  identifier = c("paracetamol", "naproxen", "diclofenac"),
  namespace = "name",
  domain = "compound"
)

aids_by_name
#> 
#>  Assay IDs (AIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: Name
#>   - Identifier: paracetamol, naproxen, ... and 1 more.
#> 
#>  NOTE: run AIDs(...) to extract Assays ID data. See ?AIDs for help.

The output shows the request details including the domain (Compound), namespace (Name), and identifiers (paracetamol, naproxen, … and 1 more). This provides a summary of the query performed.

To retrieve the Assay IDs associated with the compound names, we use the AIDs function on the result:

aids <- AIDs(object = aids_by_name, .to.data.frame = TRUE)
aids
#> # A tibble: 5,281 × 3
#>    NAME          CID   AID
#>    <chr>       <dbl> <dbl>
#>  1 paracetamol  1983   155
#>  2 paracetamol  1983   157
#>  3 paracetamol  1983   161
#>  4 paracetamol  1983   165
#>  5 paracetamol  1983   167
#>  6 paracetamol  1983   175
#>  7 paracetamol  1983   248
#>  8 paracetamol  1983   357
#>  9 paracetamol  1983   377
#> 10 paracetamol  1983   410
#> # ℹ 5,271 more rows

The output is a tibble with three columns: NAME, CID and AID. The NAME column includes compound names, the CID column contains the compound IDs, and the AID column contains the assay IDs.

table(aids$NAME)
#> 
#>  diclofenac    naproxen paracetamol 
#>        1593        1586        2102

There are 5,281 rows in total: 1,593 assays related to diclofenac, 1,586 related to naproxen, and 2,102 related to paracetamol.

Retrieving AIDs by SMILES

In this example, we retrieve Assay IDs (AIDs) for aspirin using its SMILES representation:

aids_by_smiles <- get_aids(
  identifier = "CC(=O)OC1=CC=CC=C1C(=O)O",
  namespace = "smiles",
  domain = "compound"
)

aids_by_smiles
#> 
#>  Assay IDs (AIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: SMILES
#>   - Identifier: CC(=O)OC1=CC=CC=C1C(=O)O
#> 
#>  NOTE: run AIDs(...) to extract Assays ID data. See ?AIDs for help.

The above code retrieves AIDs for aspirin with the SMILES notation CC(=O)OC1=CC=CC=C1C(=O)O. The domain is set to compound and the namespace is set to smiles to indicate that the identifier is a SMILES string.

To extract the AIDs associated with the SMILES representation, we use the AIDs function on the result:

AIDs(object = aids_by_smiles, .to.data.frame = TRUE)
#> # A tibble: 3,240 × 3
#>    SMILES                     CID   AID
#>    <chr>                    <dbl> <dbl>
#>  1 CC(=O)OC1=CC=CC=C1C(=O)O  2244     1
#>  2 CC(=O)OC1=CC=CC=C1C(=O)O  2244     3
#>  3 CC(=O)OC1=CC=CC=C1C(=O)O  2244     9
#>  4 CC(=O)OC1=CC=CC=C1C(=O)O  2244    15
#>  5 CC(=O)OC1=CC=CC=C1C(=O)O  2244    19
#>  6 CC(=O)OC1=CC=CC=C1C(=O)O  2244    21
#>  7 CC(=O)OC1=CC=CC=C1C(=O)O  2244    23
#>  8 CC(=O)OC1=CC=CC=C1C(=O)O  2244    25
#>  9 CC(=O)OC1=CC=CC=C1C(=O)O  2244    29
#> 10 CC(=O)OC1=CC=CC=C1C(=O)O  2244    31
#> # ℹ 3,230 more rows

The output is a tibble with three columns: SMILES, CID and AID. The SMILES column includes SMILES representation of aspirin, the CID column contains the compound ID of aspirin, and the AID column contains the related assay IDs.
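SMILES strings contain characters such as parentheses and equals signs that have special meaning in URLs. PubChemR takes the raw string, as in the call above. If you ever pass a SMILES string to a web service yourself, percent-encoding it first avoids ambiguity. The following is a minimal sketch using base R's utils::URLencode(). It is our own illustration, not a description of PubChemR's internals.

```r
smiles <- "CC(=O)OC1=CC=CC=C1C(=O)O"

# reserved = TRUE also encodes characters like "(", ")" and "=".
encoded <- utils::URLencode(smiles, reserved = TRUE)

# Decoding restores the original string, so nothing is lost.
decoded <- utils::URLdecode(encoded)

encoded
identical(decoded, smiles)
```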

Retrieving AIDs by InChIKey

In this example, we retrieve Assay IDs for the compound with InChIKey (International Chemical Identifier Key) GALPCCIBXQLXSH-UHFFFAOYSA-N:

aids_by_inchikey <- get_aids(
  identifier = "GALPCCIBXQLXSH-UHFFFAOYSA-N",
  namespace = "inchikey",
  domain = "compound"
)

aids_by_inchikey
#> 
#>  Assay IDs (AIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: INCHI_Key
#>   - Identifier: GALPCCIBXQLXSH-UHFFFAOYSA-N
#> 
#>  NOTE: run AIDs(...) to extract Assays ID data. See ?AIDs for help.

The above code retrieves Assay IDs for the compound with InChIKey GALPCCIBXQLXSH-UHFFFAOYSA-N. The output shows the request details including the domain (Compound), namespace (INCHI Key), and identifier (GALPCCIBXQLXSH-UHFFFAOYSA-N). This provides a summary of the query performed.

To retrieve the Assay IDs associated with the InChIKey, we use the AIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

AIDs(object = aids_by_inchikey, .to.data.frame = TRUE)
#> # A tibble: 5 × 3
#>   INCHIKEY                         CID   AID
#>   <chr>                          <dbl> <dbl>
#> 1 GALPCCIBXQLXSH-UHFFFAOYSA-N 44375542  7810
#> 2 GALPCCIBXQLXSH-UHFFFAOYSA-N 44375542  7815
#> 3 GALPCCIBXQLXSH-UHFFFAOYSA-N 44375542  7816
#> 4 GALPCCIBXQLXSH-UHFFFAOYSA-N 44375542  7820
#> 5 GALPCCIBXQLXSH-UHFFFAOYSA-N 44375542 18990

The output is a tibble (data frame) with three columns: INCHIKEY, CID, and AID. The INCHIKEY column contains the InChIKey (GALPCCIBXQLXSH-UHFFFAOYSA-N in this case), the CID column contains the compound ID (44375542), and the AID column contains the Assay IDs. This tibble format makes it easy to analyze and manipulate the data in R. There are 5 rows in total, indicating the assays related to the compound.
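A standard InChIKey is a fixed-length, 27-character string: 14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, and a final uppercase letter. Checking this shape before sending a query can catch copy-and-paste errors early. The helper below is a hypothetical convenience function, not part of PubChemR, and it checks only the format, not whether the key exists in PubChem.

```r
# Hypothetical helper: TRUE for strings shaped like a standard InChIKey.
is_inchikey <- function(x) {
  grepl("^[A-Z]{14}-[A-Z]{10}-[A-Z]$", x)
}

is_inchikey(c(
  "GALPCCIBXQLXSH-UHFFFAOYSA-N",  # key used in this example
  "GALPCCIBXQLXSH-UHFFFAOYSA",    # truncated key
  "galpccibxqlxsh-uhfffaoysa-n"   # wrong case
))
```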

Retrieving AIDs by Formula

In this example, we retrieve Assay IDs for compounds with the molecular formula C15H12N2O2:

aids_by_formula <- get_aids(
  identifier = "C15H12N2O2",
  namespace = "formula",
  domain = "compound"
)

aids_by_formula
#> 
#>  Assay IDs (AIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: Formula
#>   - Identifier: C15H12N2O2
#> 
#>  NOTE: run AIDs(...) to extract Assays ID data. See ?AIDs for help.

The above code retrieves Assay IDs for compounds with the molecular formula C15H12N2O2. The output shows the request details including the domain (Compound), namespace (Formula), and identifier (C15H12N2O2). This provides a summary of the query performed.

To retrieve the Assay IDs associated with this formula, we use the AIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

AIDs(object = aids_by_formula, .to.data.frame = TRUE)
#> # A tibble: 50,172 × 3
#>    FORMULA      CID     AID
#>    <chr>      <dbl>   <dbl>
#>  1 C15H12N2O2  1775  625220
#>  2 C15H12N2O2  1775 1094227
#>  3 C15H12N2O2  1775 1149315
#>  4 C15H12N2O2  1775  255686
#>  5 C15H12N2O2  1775  504845
#>  6 C15H12N2O2  1775    2313
#>  7 C15H12N2O2  1775 1096248
#>  8 C15H12N2O2  1775  136087
#>  9 C15H12N2O2  1775 1731409
#> 10 C15H12N2O2  1775  683946
#> # ℹ 50,162 more rows

The output is a tibble (data frame) with three columns: FORMULA, CID, and AID. The FORMULA column contains the molecular formula (C15H12N2O2), the CID column contains the compound ID, and the AID column contains the Assay IDs. This tibble format makes it easy to analyze and manipulate the data in R. There are 50,172 rows in total, covering the assays related to all compounds with the specified molecular formula.
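A molecular formula records only which elements are present and how many atoms of each, not how they are connected, which is why a single formula can match many compounds. The sketch below splits a simple formula into element counts with base R regular expressions. parse_formula is a hypothetical helper written for illustration. It handles plain formulas like the one above and does not support charges, isotopes, or parentheses.

```r
# Hypothetical helper: split a plain molecular formula into element counts.
parse_formula <- function(formula) {
  # Each token is an element symbol followed by an optional count, e.g. "C15".
  tokens <- regmatches(formula, gregexpr("[A-Z][a-z]?[0-9]*", formula))[[1]]
  elements <- sub("[0-9]+$", "", tokens)
  digits <- sub("^[A-Za-z]+", "", tokens)
  # A missing count means a single atom.
  counts <- ifelse(digits == "", 1L, suppressWarnings(as.integer(digits)))
  stats::setNames(counts, elements)
}

parse_formula("C15H12N2O2")
```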

3.2. Retrieving CIDs with get_cids()

The get_cids function is designed to retrieve Compound IDs (CIDs) from the PubChem database. This function is particularly useful for users who need the unique identifiers assigned to chemical compounds within PubChem.

The function queries the PubChem database using various identifiers such as names, formulas, or other chemical identifiers. It then extracts the corresponding CIDs and returns them in a structured format. This makes it a versatile tool for researchers working with chemical data.

Here are the main parameters of the function:

  • identifier: A vector of identifiers for which CIDs are to be retrieved. These can be integers (e.g., CID, SID, AID) or strings (e.g., name, SMILES, InChIKey).
  • namespace: Specifies the type of identifier provided. It can be ‘cid’, ‘name’, ‘smiles’, ‘inchi’, etc.
  • domain: The domain of the query, typically ‘compound’.
  • searchtype: The type of search to be performed, such as ‘substructure’ or ‘similarity’.
  • options: Additional arguments passed to the internal get_json function.

Retrieving CIDs by Name

In this example, we retrieve Compound IDs for aspirin, caffeine, and ibuprofen by name. Note that the query below uses the spelling caffein, which PubChem resolves to the caffeine record (CID 2519):

cids_by_name <- get_cids(
  identifier = c("aspirin", "caffein", "ibuprofen"),
  namespace = "name",
  domain = "compound"
)

cids_by_name
#> 
#>  Compound IDs (CIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: Name
#>   - Identifier: aspirin, caffein, ... and 1 more.
#> 
#>  NOTE: run CIDs(...) to extract Compound ID data. See ?CIDs for help.

The above code retrieves Compound IDs for the compounds queried as aspirin, caffein, and ibuprofen. The output shows the request details including the domain (Compound), namespace (Name), and identifiers (aspirin, caffein, … and 1 more). This provides a summary of the query performed.

To retrieve the Compound IDs associated with the compound names, we use the CIDs function on the result:

CIDs(object = cids_by_name)
#> # A tibble: 3 × 2
#>   Name        CID
#>   <chr>     <dbl>
#> 1 aspirin    2244
#> 2 caffein    2519
#> 3 ibuprofen  3672

The CIDs function call on the result extracts the Compound IDs associated with the compound names. The output is a tibble with two columns: Name and CID. The Name column contains the compound names, and the CID column contains the Compound IDs. This tibble format makes it easy to handle and analyze the data in R.
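Because the result is an ordinary two-column table, it doubles as a name-to-CID lookup. The sketch below recreates the three rows shown above as a plain data frame, so that it runs without a network connection, and turns them into a named vector.

```r
# Rows copied from the output above.
cids_demo <- data.frame(
  Name = c("aspirin", "caffein", "ibuprofen"),
  CID = c(2244, 2519, 3672),
  stringsAsFactors = FALSE
)

# Named vector: names are the query strings, values are the CIDs.
cid_lookup <- stats::setNames(cids_demo$CID, cids_demo$Name)

cid_lookup[["ibuprofen"]]
unname(cid_lookup[c("aspirin", "caffein")])
```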

Retrieving CIDs by SMILES

In this example, we retrieve Compound IDs (CIDs) for a compound using its SMILES representation:

cids_by_smiles <- get_cids(
  identifier = "C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O",
  namespace = "smiles",
  domain = "compound"
)

cids_by_smiles
#> 
#>  Compound IDs (CIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: SMILES
#>   - Identifier: C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O
#> 
#>  NOTE: run CIDs(...) to extract Compound ID data. See ?CIDs for help.

The above code retrieves CIDs for the compound with the SMILES notation C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O. The domain is set to compound and the namespace is set to smiles to indicate that the identifier is a SMILES string.

To extract the CIDs associated with the SMILES representation, we use the CIDs function on the result:

CIDs(object = cids_by_smiles)
#> # A tibble: 1 × 2
#>   SMILES                                       CID
#>   <chr>                                      <dbl>
#> 1 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O  5793

The CIDs function call on the result extracts the CIDs associated with the SMILES notation C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O. The output is a tibble with two columns: SMILES and CID. The SMILES column contains the SMILES notation, and the CID column contains the Compound IDs. This output shows that the specified compound is associated with CID 5793.

Retrieving CIDs by InChIKey

In this example, we retrieve Compound IDs (CIDs) for a compound using its InChIKey:

cids_by_inchikey <- get_cids(
  identifier = "HEFNNWSXXWATRW-UHFFFAOYSA-N",
  namespace = "inchikey",
  domain = "compound"
)

cids_by_inchikey
#> 
#>  Compound IDs (CIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: INCHI_Key
#>   - Identifier: HEFNNWSXXWATRW-UHFFFAOYSA-N
#> 
#>  NOTE: run CIDs(...) to extract Compound ID data. See ?CIDs for help.

The above code retrieves CIDs for the compound with the InChIKey HEFNNWSXXWATRW-UHFFFAOYSA-N. The domain is set to compound and the namespace is set to inchikey to indicate that the identifier is an InChIKey.

To extract the CIDs associated with the InChIKey, we use the CIDs function on the result:

CIDs(object = cids_by_inchikey)
#> # A tibble: 1 × 2
#>   INCHI_Key                     CID
#>   <chr>                       <dbl>
#> 1 HEFNNWSXXWATRW-UHFFFAOYSA-N  3672

The CIDs function call on the result extracts the CIDs associated with the InChIKey HEFNNWSXXWATRW-UHFFFAOYSA-N. The output is a tibble with two columns: INCHI_Key and CID. The INCHI_Key column contains the InChIKey, and the CID column contains the Compound IDs. This output shows that the specified compound is associated with CID 3672.

Retrieving CIDs by Formula

In this example, we retrieve Compound IDs (CIDs) for compounds with the molecular formula C15H12N2O2:

cids_by_formula <- get_cids(
  identifier = "C15H12N2O2",
  namespace = "formula",
  domain = "compound"
)

cids_by_formula
#> 
#>  Compound IDs (CIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: Formula
#>   - Identifier: C15H12N2O2
#> 
#>  NOTE: run CIDs(...) to extract Compound ID data. See ?CIDs for help.

The above code retrieves Compound IDs for compounds with the molecular formula C15H12N2O2. The output shows the request details including the domain (Compound), namespace (Formula), and identifier (C15H12N2O2). This provides a summary of the query performed.

To retrieve the Compound IDs associated with this formula, we use the CIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

CIDs(object = cids_by_formula, .to.data.frame = TRUE)
#> # A tibble: 5,039 × 2
#>    Formula          CID
#>    <chr>          <dbl>
#>  1 C15H12N2O2      1775
#>  2 C15H12N2O2     34312
#>  3 C15H12N2O2      2555
#>  4 C15H12N2O2     14650
#>  5 C15H12N2O2    129274
#>  6 C15H12N2O2    135290
#>  7 C15H12N2O2    928446
#>  8 C15H12N2O2     70052
#>  9 C15H12N2O2 135430309
#> 10 C15H12N2O2  25113764
#> # ℹ 5,029 more rows

The output is a tibble (data frame) with two columns: Formula and CID. The Formula column contains the molecular formula (C15H12N2O2), and the CID column contains the Compound IDs. This tibble format makes it easy to analyze and manipulate the data in R. There are 5,039 rows in total, one for each compound matching the specified molecular formula.

3.3. Retrieving SIDs with get_sids()

The get_sids function is designed to retrieve Substance IDs (SIDs) from the PubChem database. This function is essential for users who need the unique identifiers assigned to specific chemical substances or mixtures in PubChem.

The get_sids function queries the PubChem database using various identifiers and extracts the corresponding SIDs. It is capable of handling multiple identifiers and returns a structured tibble (data frame) containing the SIDs along with the original identifiers. This makes it a versatile tool for researchers working with chemical data.

Here are the main parameters of the function:

  • identifier: A vector specifying the identifiers for which SIDs are to be retrieved. These can be numeric or character vectors.
  • namespace: Specifies the type of identifier provided, with ‘cid’ as the default.
  • domain: The domain of the query, typically ‘compound’.
  • searchtype: Specifies the type of search to be performed, if applicable.
  • options: Additional arguments passed to the internal get_json function.

Retrieving SIDs by CID

In this example, we retrieve Substance IDs (SIDs) for the compounds with CIDs (Compound IDs) 2244, 2519, and 3672:

sids_by_cid <- get_sids(
  identifier = c(2244, 2519, 3672),
  namespace = "cid",
  domain = "compound"
)

sids_by_cid
#> 
#>  Substance IDs (SIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: CID
#>   - Identifier: 2244, 2519, ... and 1 more.
#> 
#>  NOTE: run SIDs(...) to extract Substance ID data. See ?SIDs for help.

The above code retrieves Substance IDs for the compounds with CIDs (Compound IDs) 2244, 2519, and 3672. The output shows the request details including the domain (Compound), namespace (CID), and identifiers (2244, 2519, … and 1 more). This provides a summary of the query performed.

To retrieve the Substance IDs associated with these compound IDs, we use the SIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

sids <- SIDs(object = sids_by_cid, .to.data.frame = TRUE)
sids
#> # A tibble: 1,294 × 2
#>      CID     SID
#>    <dbl>   <dbl>
#>  1  2244    4594
#>  2  2244   87798
#>  3  2244  476106
#>  4  2244  602429
#>  5  2244  829042
#>  6  2244  832958
#>  7  2244  840714
#>  8  2244 3135921
#>  9  2244 5261264
#> 10  2244 7847177
#> # ℹ 1,284 more rows

The output is a tibble (data frame) with two columns: CID and SID. The CID column contains the compound IDs, and the SID column contains the Substance IDs.

table(sids$CID)

There are 1,294 rows in total, covering the substances linked to CIDs 2244, 2519, and 3672. The table() call above reports how many of these rows belong to each compound.

Retrieving SIDs by AID

In this example, we retrieve Substance IDs (SIDs) for the assay with AID (Assay ID) 1234:

sids_by_aids <- get_sids(
  identifier = "1234",
  namespace = "aid",
  domain = "assay"
)

sids_by_aids
#> 
#>  Substance IDs (SIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Assay
#>   - Namespace: AID
#>   - Identifier: 1234
#> 
#>  NOTE: run SIDs(...) to extract Substance ID data. See ?SIDs for help.

The above code retrieves Substance IDs for the assay with AID (Assay ID) 1234. The output shows the request details including the domain (Assay), namespace (Assay ID), and identifier (1234). This provides a summary of the query performed.

To retrieve the Substance IDs associated with the assay ID 1234, we use the SIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

SIDs(object = sids_by_aids, .to.data.frame = TRUE)
#> # A tibble: 61 × 2
#>    AID       SID
#>    <chr>   <dbl>
#>  1 1234   845167
#>  2 1234   845769
#>  3 1234   847359
#>  4 1234   857446
#>  5 1234   857769
#>  6 1234   859251
#>  7 1234   864576
#>  8 1234  3714272
#>  9 1234  4252106
#> 10 1234  4259196
#> # ℹ 51 more rows

The output is a tibble (data frame) with two columns: AID and SID. The AID column contains the assay ID (1234 in this case), and the SID column contains the Substance IDs. This tibble format makes it easy to analyze and manipulate the data in R. There are 61 rows in total, one for each substance related to the assay.
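Note that the AID column above is of type <chr>, whereas the ID columns in earlier examples were <dbl>. A plausible reason is that the identifier was passed as the string "1234" rather than the number 1234, although we have not verified this against the package source. If numeric IDs are needed downstream, for example to join with another table, the column can be converted explicitly. The sketch uses the first three rows of the output above as a stand-in.

```r
# First rows of the output above, with AID stored as character.
sids_demo <- data.frame(
  AID = c("1234", "1234", "1234"),
  SID = c(845167, 845769, 847359),
  stringsAsFactors = FALSE
)

# Convert the identifier column to numeric.
sids_demo$AID <- as.numeric(sids_demo$AID)

str(sids_demo)
```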

Retrieving SIDs by Name

In this example, we retrieve Substance IDs for the compound with the name aspirin:

sids <- get_sids(
  identifier = "aspirin",
  namespace = "name",
  domain = "compound"
)

sids
#> 
#>  Substance IDs (SIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: Name
#>   - Identifier: aspirin
#> 
#>  NOTE: run SIDs(...) to extract Substance ID data. See ?SIDs for help.

The above code retrieves Substance IDs for the compound named aspirin. The output shows the request details including the domain (Compound), namespace (Name), and identifier (aspirin). This provides a summary of the query performed.

To retrieve the Substance IDs associated with the compound name aspirin, we use the SIDs function on the result:

SIDs(object = sids)
#> # A tibble: 403 × 2
#>    Name        SID
#>    <chr>     <dbl>
#>  1 aspirin    4594
#>  2 aspirin   87798
#>  3 aspirin  476106
#>  4 aspirin  602429
#>  5 aspirin  829042
#>  6 aspirin  832958
#>  7 aspirin  840714
#>  8 aspirin 3135921
#>  9 aspirin 5261264
#> 10 aspirin 7847177
#> # ℹ 393 more rows

The SIDs function call on the result extracts the Substance IDs associated with the compound name aspirin. The output is a tibble with two columns: Name and SID. The Name column contains the compound name (aspirin in this case), and the SID column contains the Substance IDs. This tibble format makes it easy to handle and analyze the data in R. There are 403 rows in total, one for each substance related to the compound name aspirin.

Retrieving SIDs by SMILES

In this example, we retrieve Substance IDs (SIDs) for a compound using its SMILES representation:

sids_by_smiles <- get_sids(
  identifier = "C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O",
  namespace = "smiles",
  domain = "compound"
)

sids_by_smiles
#> 
#>  Substance IDs (SIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: SMILES
#>   - Identifier: C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O
#> 
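#>  NOTE: run SIDs(...) to extract Substance ID data. See ?SIDs for help.

To extract the SIDs associated with the SMILES representation, we use the SIDs function on the result:

SIDs(object = sids_by_smiles)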
#> # A tibble: 230 × 2
#>    SMILES                                          SID
#>    <chr>                                         <dbl>
#>  1 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O     3333
#>  2 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O   819111
#>  3 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O   823016
#>  4 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O   823057
#>  5 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O   833240
#>  6 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O   841535
#>  7 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O  7847077
#>  8 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O  8023353
#>  9 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O  8153564
#> 10 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O 14720288
#> # ℹ 220 more rows

Retrieving SIDs by InChIKey

In this example, we retrieve Substance IDs (SIDs) for a compound using its InChIKey:

sids_by_inchikey <- get_sids(
  identifier = "BPGDAMSIGCZZLK-UHFFFAOYSA-N",
  namespace = "inchikey",
  domain = "compound"
)

sids_by_inchikey
#> 
#>  Substance IDs (SIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: INCHI_Key
#>   - Identifier: BPGDAMSIGCZZLK-UHFFFAOYSA-N
#> 
#>  NOTE: run SIDs(...) to extract Substance ID data. See ?SIDs for help.

The above code retrieves SIDs for the compound with the InChIKey BPGDAMSIGCZZLK-UHFFFAOYSA-N. The domain is set to compound and the namespace is set to inchikey to indicate that the identifier is an InChIKey.

To extract the SIDs associated with the InChIKey, we use the SIDs function on the result:

SIDs(object = sids_by_inchikey)
#> # A tibble: 93 × 2
#>    INCHI_Key                        SID
#>    <chr>                          <dbl>
#>  1 BPGDAMSIGCZZLK-UHFFFAOYSA-N   106508
#>  2 BPGDAMSIGCZZLK-UHFFFAOYSA-N  6152946
#>  3 BPGDAMSIGCZZLK-UHFFFAOYSA-N  8159218
#>  4 BPGDAMSIGCZZLK-UHFFFAOYSA-N 10530904
#>  5 BPGDAMSIGCZZLK-UHFFFAOYSA-N 16165986
#>  6 BPGDAMSIGCZZLK-UHFFFAOYSA-N 36258367
#>  7 BPGDAMSIGCZZLK-UHFFFAOYSA-N 49834150
#>  8 BPGDAMSIGCZZLK-UHFFFAOYSA-N 49862448
#>  9 BPGDAMSIGCZZLK-UHFFFAOYSA-N 76795655
#> 10 BPGDAMSIGCZZLK-UHFFFAOYSA-N 91749770
#> # ℹ 83 more rows

The SIDs function call on the result extracts the SIDs associated with the InChIKey BPGDAMSIGCZZLK-UHFFFAOYSA-N. The output is a tibble with two columns: INCHI_Key and SID. The INCHI_Key column contains the InChIKey, and the SID column contains the Substance IDs. This output shows that the specified compound is associated with 93 substance entries, each represented by a SID.

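Retrieving SIDs by Formula

In this example, we retrieve Substance IDs (SIDs) for compounds with the molecular formula C15H12N2O2: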

sids_by_formula <- get_sids(
  identifier = "C15H12N2O2",
  namespace = "formula",
  domain = "compound"
)

sids_by_formula
#> 
#>  Substance IDs (SIDs) from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: Formula
#>   - Identifier: C15H12N2O2
#> 
#>  NOTE: run SIDs(...) to extract Substance ID data. See ?SIDs for help.

To extract the SIDs associated with this formula, we use the SIDs function on the result:

SIDs(object = sids_by_formula, .to.data.frame = TRUE)
#> # A tibble: 347 × 2
#>    Formula        SID
#>    <chr>        <dbl>
#>  1 C15H12N2O2    9647
#>  2 C15H12N2O2   74340
#>  3 C15H12N2O2  592179
#>  4 C15H12N2O2  596082
#>  5 C15H12N2O2  841957
#>  6 C15H12N2O2 3136997
#>  7 C15H12N2O2 4284342
#>  8 C15H12N2O2 5171921
#>  9 C15H12N2O2 7847578
#> 10 C15H12N2O2 7980312
#> # ℹ 337 more rows

3.4. Retrieving Assay Data with get_assays()

The get_assays function is designed to retrieve biological assay data from the PubChem database. This function is particularly useful for researchers and scientists who need descriptive information about various biological assays.

The function queries the PubChem database using specified identifiers and returns a list of assay data. It is capable of fetching various assay information, including experimental data, results, and methodologies.

Here are the main parameters of the function:

  • identifier: A vector of positive integers specifying the assay identifiers (AIDs) for which data are to be retrieved.
  • operation: The operation to be performed on the input records, defaulting to NULL. Expected operations: record, concise, aids, sids, cids, description, targets/<target type>, <doseresponse>, summary, classification.
  • options: Additional parameters for the query, currently not affecting the results.

Retrieving Assays by AIDs

In this example, we retrieve assay data for several specific AIDs:

assay_data <- get_assays(
  identifier = c(485314, 485341, 504466, 624202, 651820), 
  namespace = "aid"
)

assay_data
#> 
#>  An object of class 'PubChemInstanceList'
#> 
#>  Number of instances: 5
#>   - Domain: Assay
#>   - Namespace: AID
#>   - Identifier(s): 485314, 485341, ... and 3 more.
#> 
#>  * Run 'instance(...)' function to extract specific instances from the complete list, and
#>    'request_args(...)' to see all the requested instance identifiers.
#>  * See ?instance and ?request_args for details.

The above code retrieves assay data for multiple AIDs. The output shows the request details, including the domain (Assay), namespace (Assay ID), and identifiers. It also provides instructions on how to retrieve specific instances from the complete list and view all requested instance identifiers.

To view the request arguments:

request_args(object = assay_data)
#> $namespace
#> [1] "aid"
#> 
#> $identifier
#> [1] 485314 485341 504466 624202 651820
#> 
#> $domain
#> [1] "assay"
#> 
#> $operation
#> [1] "description"

To retrieve detailed information about a specific assay (e.g., 651820), you can use the instance function on the result:

aid_651820 <- instance(object = assay_data, .which = 651820)
aid_651820
#> 
#>  An object of class 'PubChemInstance'
#> 
#>  Request Details:  
#>   - Domain: Assay
#>   - Namespace: AID
#>   - Identifier: 651820
#> 
#>  Instance Details:  
#>   - aid (2): [<named numeric>] id, version
#>   - aid_source (1): [<named list>] db
#>   - name (1): [<unnamed character>] 
#>   - description (11): [<unnamed character>] 
#>   - protocol (1): [<unnamed character>] 
#>   - comment (4): [<unnamed character>] 
#>   - xref (1): [<unnamed list>] 
#>   - results (35): [<unnamed list>] 
#>   - revision (1): [<unnamed numeric>] 
#>   - target (1): [<unnamed list>] 
#>   - activity_outcome_method (1): [<unnamed numeric>] 
#>   - dr (1): [<unnamed list>] 
#>   - grant_number (1): [<unnamed character>] 
#>   - project_category (1): [<unnamed numeric>] 
#> 
#>  NOTE: Run getter function 'retrieve()' with element name above to extract data from corresponding list. 
#>        See ?retrieve for details.

The instance function call on the result extracts detailed information about the specific assay, including experimental data, results, and methodologies. This information is crucial for understanding the biological activity and properties of the compounds tested in the assay.

To extract specific details from the assay data, you can use the retrieve function with various slots:

retrieve(object = aid_651820, .slot = "aid", .to.data.frame = TRUE)
#> # A tibble: 2 × 3
#>   Identifier Name     Value
#>        <dbl> <chr>    <dbl>
#> 1     651820 id      651820
#> 2     651820 version      1

This code extracts the Assay ID and version of the assay, providing a concise summary of the assay’s unique identifier and its version in the PubChem database.

retrieve(object = aid_651820, .slot = "aid_source", .to.data.frame = TRUE)
#> # A tibble: 1 × 3
#>   Identifier name  source_id
#>        <dbl> <chr> <chr>    
#> 1     651820 NCGC  HCV100

This code retrieves the source information for the assay, including the name of the source and the source ID, which helps in identifying the origin of the assay data.

retrieve(object = aid_651820, .slot = "name", .to.data.frame = FALSE)
#> $Identifier
#> [1] 651820
#> 
#> [[2]]
#> [1] "qHTS Assay for Inhibitors of Hepatitis C Virus (HCV)"

This code extracts the name of the assay, providing a clear description of the assay’s purpose and target.

retrieve(object = aid_651820, .slot = "description", .to.data.frame = FALSE, .verbose = TRUE)
#> 
#>  PubChem Assay Details (description)
#> 
#>  Hepatitis C virus (HCV) infects about 200 million people in the world.  Many infected people progress to chronic liver disease including cirrhosis with a risk of developing liver cancer.  To date, there is no effective vaccine for hepatitis C.  Current therapy based on interferon is only effective in about half of the patients and is associated with significant adverse effects.  The fraction of people with HCV who can complete a successful treatment is estimated to be no more than 10 percent.  Recent development of direct-acting antivirals against HCV, such as protease and polymerase inhibitors, is promising but still requires combination with peginterferon and ribavirin for maximal efficacy.  In addition, these agents are associated with high rate of resistance and many have significant side effects. 
#>  
#>  Due to the lack of a culture system for infectious HCV, the search for new HCV drugs has been greatly hampered.  Cell-based screen for HCV inhibitors in use today is based on the HCV replicon system, which only targets the RNA replication step of the viral lifecycle and does not encompass viral entry, processing, assembly and secretion.  High-throughput screening (HTS) with an infectious HCV system would cover the complete spectrum of potentially druggable targets in all stages of HCV lifecycle, and would have more biological relevance than other cell-based assays.  Moreover, targeting several key processes in the viral life cycle may not only increase antiviral efficacy; more importantly, it may also reduce the capacity of the virus to develop resistance to the compound. 
#>  
#>  The goal of this project is to identify novel HCV inhibitors as new therapies for hepatitis C, using a highly sensitive and specific assay platform which is based on a HCV infectious cell culture system established in the laboratory and adapted for high-throughput HCV drug screen.
#>  
#>  NIH Chemical Genomics Center [NCGC]
#>  NIH Molecular Libraries Probe Centers Network [MLPCN]
#>  
#>  MLPCN Grant: MH095511
#>  Assay Submitter (PI): Jake Liang, NIDDK

This code retrieves the detailed description of the assay, including its purpose, the challenges addressed, and the methodology used. This is crucial for understanding the context and rationale behind the assay.

retrieve(object = aid_651820, .slot = "protocol", .to.data.frame = FALSE, .verbose = TRUE)
#> 
#>  PubChem Assay Details (protocol)
#> 
#>  The assay will start with plating 1,000 cells/well in 3 muL volume and culture for 4 h. Then 23 nL of compounds from the library collection will be added to each well, followed by adding 2.5 muL of HCVcc-Cre virus (~ 0.5 moi) and further cultured for 44 h before the luciferase assay. A volume of 4.5 muL luciferase substrates will be added to each well and the plates will be incubated at room temperature for 15 min. and then read for 15 sec. for the luciferase activity

This code retrieves the assay protocol: cell plating, compound addition, infection with the HCVcc-Cre virus, incubation times, and the luciferase readout. This level of detail is needed to replicate the experiment and to interpret its results.

retrieve(object = aid_651820, .slot = "comment", .to.data.frame = FALSE, .verbose = TRUE)
#> 
#>  PubChem Assay Details (comment)
#> 
#>  Compound Ranking:
#>  
#>  1. Compounds are first classified as having full titration curves, partial modulation, partial curve (weaker actives), single point activity (at highest concentration only), or inactive. See data field "Curve Description". For this assay, apparent inhibitors are ranked higher than compounds that showed apparent activation.
#>  2. For all inactive compounds, PUBCHEM_ACTIVITY_SCORE is 0. For all active compounds, a score range was given for each curve class type given above.  Active compounds have PUBCHEM_ACTIVITY_SCORE between 40 and 100.  Inconclusive compounds have PUBCHEM_ACTIVITY_SCORE between 1 and 39.  Fit_LogAC50 was used for determining relative score and was scaled to each curve class' score range.

This code retrieves the depositor's comments, which explain how compound activity is evaluated in the assay. In this specific case, they describe how compounds are ranked by curve class and how PUBCHEM_ACTIVITY_SCORE is assigned (0 for inactive, 1 to 39 for inconclusive, and 40 to 100 for active compounds), which helps in interpreting the assay results.
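The score ranges quoted in the comment output above can be turned into a small lookup. The following is a minimal sketch in base R, not part of PubChemR; the function name classify_activity_score and the example scores are hypothetical and are used only for illustration.

```r
# Map a PUBCHEM_ACTIVITY_SCORE to the outcome implied by the assay comment:
# 0 = inactive, 1-39 = inconclusive, 40-100 = active.
# (Hypothetical helper; not a PubChemR function.)
classify_activity_score <- function(score) {
  outcome <- rep(NA_character_, length(score))
  outcome[which(score == 0)] <- "inactive"
  outcome[which(score >= 1 & score <= 39)] <- "inconclusive"
  outcome[which(score >= 40 & score <= 100)] <- "active"
  outcome
}

# Made-up scores for illustration only (not real assay results)
example_scores <- c(0, 12, 39, 40, 85, 100)
data.frame(
  score = example_scores,
  outcome = classify_activity_score(example_scores)
)
```

Scores outside the documented ranges are left as NA rather than guessed.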

retrieve(object = aid_651820, .slot = "xref", .to.data.frame = FALSE)
#> $Identifier
#> [1] 651820
#> 
#> $xref
#>                     dburl 
#> "http://www.ncgc.nih.gov"

This code retrieves external cross-references for the assay. Cross-references can point to related resources; in this case the only entry is dburl, the URL of the depositing source (NCGC).

retrieve(object = aid_651820, .slot = "results", .to.data.frame = TRUE)
#> # A tibble: 74 × 8
#>    Identifier tid   name                 description     type  unit  ac    tc   
#>         <dbl> <chr> <chr>                <chr>           <chr> <chr> <chr> <chr>
#>  1     651820 1     Phenotype            Indicates type… 4     254   <NA>  <NA> 
#>  2     651820 2     Potency              Concentration … 1     5     TRUE  <NA> 
#>  3     651820 3     Efficacy             Maximal effica… 1     15    <NA>  <NA> 
#>  4     651820 4     Analysis Comment     Annotation/not… 4     254   <NA>  <NA> 
#>  5     651820 5     Activity_Score       Activity score. 2     254   <NA>  <NA> 
#>  6     651820 6     Curve_Description    A description … 4     254   <NA>  <NA> 
#>  7     651820 7     Fit_LogAC50          The logarithm … 1     254   <NA>  <NA> 
#>  8     651820 8     Fit_HillSlope        The Hill slope… 1     254   <NA>  <NA> 
#>  9     651820 9     Fit_R2               R^2 fit value … 1     254   <NA>  <NA> 
#> 10     651820 10    Fit_InfiniteActivity The asymptotic… 1     15    <NA>  <NA> 
#> # ℹ 64 more rows

This code retrieves a tibble that defines the result fields reported by the assay, such as Phenotype, Potency, Efficacy, and the curve-fit parameters. Each row gives a field's ID (tid), name, description, type, and unit. These definitions are essential for interpreting the per-compound results of the assay.
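Once the results slot is available as a data frame, ordinary subsetting can narrow it to the fields of interest. The sketch below assumes only base R and rebuilds the first ten rows shown above (tid and name columns only) by hand, so that it runs without a network connection; with a live session you would subset the tibble returned by retrieve() instead.

```r
# First ten result-field definitions copied from the tibble above
# (tid and name columns only)
result_fields <- data.frame(
  tid = as.character(1:10),
  name = c("Phenotype", "Potency", "Efficacy", "Analysis Comment",
           "Activity_Score", "Curve_Description", "Fit_LogAC50",
           "Fit_HillSlope", "Fit_R2", "Fit_InfiniteActivity"),
  stringsAsFactors = FALSE
)

# Keep only the curve-fit fields, whose names start with "Fit_"
fit_fields <- result_fields[startsWith(result_fields$name, "Fit_"), ]
fit_fields
```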

retrieve(object = aid_651820, .slot = "revision", .to.data.frame = FALSE)
#> $Identifier
#> [1] 651820
#> 
#> [[2]]
#> [1] 1

This code retrieves the revision number of the assay data, indicating the version of the data retrieved. This helps track changes and updates to the assay information over time.

retrieve(object = aid_651820, .slot = "activity_outcome_method", .to.data.frame = FALSE)
#> $Identifier
#> [1] 651820
#> 
#> [[2]]
#> [1] 2

This code retrieves the method used to determine the activity outcome of the compounds in the assay, returned here as a numeric code. This information is useful for understanding the criteria and process used to classify the compounds’ activity levels.

retrieve(object = aid_651820, .slot = "project_category", .to.data.frame = FALSE)
#> $Identifier
#> [1] 651820
#> 
#> [[2]]
#> [1] 2

This code retrieves the category of the project under which the assay was conducted. This helps in identifying the broader context and objectives of the research project associated with the assay.

3.5. Retrieving Compound Data with get_compounds()

The get_compounds function is designed to streamline the process of retrieving detailed compound data from the extensive PubChem database. This function is an invaluable tool for chemists, biologists, pharmacologists, and researchers who require comprehensive chemical compound information for their scientific investigations and analyses.

The function interfaces directly with the PubChem database, allowing users to query and retrieve a wide array of data on chemical compounds. Upon execution, the function returns a list containing detailed information about each queried compound. This information can encompass various aspects such as:

  • Chemical Structures: Detailed representations of the molecular structure of compounds.
  • Chemical Properties: Information on physical and chemical properties such as molecular weight, boiling point, melting point, solubility, and more.
  • Biological Activities: Data on the biological activities and effects of the compounds, including bioassay results.
  • Synonyms and Identifiers: A comprehensive list of alternative names and identifiers for the compounds.
  • Safety and Toxicity Information: Data on the safety and potential toxicity of the compounds.

Here are the main parameters of the function:

  • identifier: A vector specifying the compound identifiers. These identifiers can be either positive integers (such as CIDs, which are unique compound identifiers in PubChem) or identifier strings (such as chemical names, SMILES strings, InChI, etc.). This parameter allows for flexible input methods tailored to the specific needs of the user.
  • namespace: Specifies the type of identifier provided in the identifier parameter. Common values for this parameter include:
    • “cid” (Compound Identifier)
    • “name” (Chemical Name)
    • “smiles” (Simplified Molecular Input Line Entry System)
    • “inchi” (International Chemical Identifier)
    • “sdf” (Structure-Data File)
  • operation: An optional parameter specifying the operation to be performed on the input records. This can include operations such as filtering, sorting, or transforming the data based on specific criteria. By default, this parameter is set to NULL, indicating no additional operations are performed.
  • searchtype: An optional parameter that defines the type of search to be conducted. This can be used to refine and specify the search strategy, such as exact match, substructure search, or similarity search. By default, this parameter is set to NULL, indicating a general search.
  • options: A list of additional parameters that can be used to customize the query further. This can include options such as result limits, output formats, and other advanced settings to tailor the data retrieval process to specific requirements.

Retrieving Compounds by CIDs

In this example, we retrieve compound data for specific CIDs (Compound IDs) 2244 and 5245:

compound_data <- get_compounds(
  identifier = c(2244, 5245),
  namespace = "cid"
)

compound_data
#> 
#>  An object of class 'PubChemInstanceList'
#> 
#>  Number of instances: 2
#>   - Domain: Compound
#>   - Namespace: CID
#>   - Identifier(s): 2244, 5245
#> 
#>  * Run 'instance(...)' function to extract specific instances from the complete list, and
#>    'request_args(...)' to see all the requested instance identifiers.
#>  * See ?instance and ?request_args for details.

The above code retrieves compound data for the compounds with CIDs 2244 and 5245. The output shows the request details, including the domain (Compound), namespace (Compound ID), and identifiers. It also provides instructions on how to retrieve specific instances from the complete list and view all requested instance identifiers.

To view the request arguments:

request_args(object = compound_data)
#> $namespace
#> [1] "cid"
#> 
#> $identifier
#> [1] 2244 5245
#> 
#> $domain
#> [1] "compound"
#> 
#> $operation
#> NULL
#> 
#> $options
#> NULL
#> 
#> $searchtype
#> NULL

To retrieve detailed information about a specific compound, you can use the instance function on the result:

compound_2244 <- instance(object = compound_data, .which = 2244)
compound_2244
#> 
#>  An object of class 'PubChemInstance'
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: CID
#>   - Identifier: 2244
#> 
#>  Instance Details:  
#>   - id (1): [<named list>] id
#>   - atoms (2): [<named list>] aid, element
#>   - bonds (3): [<named list>] aid1, aid2, order
#>   - coords (1): [<unnamed list>] 
#>   - charge (1): [<unnamed numeric>] 
#>   - props (22): [<unnamed list>] 
#>   - count (10): [<named numeric>] heavy_atom, atom_chiral, atom_chiral_def, atom_chiral_undef, ...
#> 
#>  NOTE: Run getter function 'retrieve()' with element name above to extract data from corresponding list. 
#>        See ?retrieve for details.

The instance function call on the result extracts detailed information about the specific compound, including chemical structures, properties, and identifiers.

To retrieve specific data elements from the compound data, you can use the retrieve function with the relevant slots:

retrieve(object = compound_2244, .slot = "id", .to.data.frame = TRUE)
#> # A tibble: 1 × 2
#>   Identifier    id
#>        <dbl> <dbl>
#> 1       2244  2244

The retrieve function call with the id slot extracts the compound identifier (CID) for the specific compound. In this case, the CID is 2244, confirming the identity of the compound.

retrieve(object = compound_2244, .slot = "atoms", .to.data.frame = FALSE)
#> $Identifier
#> [1] 2244
#> 
#> $aid
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
#> 
#> $element
#>  [1] 8 8 8 8 6 6 6 6 6 6 6 6 6 1 1 1 1 1 1 1 1

The retrieve function call with the atoms slot extracts information about the atoms in the compound. The output includes two vectors: aid, representing the atom IDs, and element, representing the atomic numbers of the elements. For example, element 8 represents oxygen, and element 6 represents carbon.
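As a quick consistency check, the atomic numbers can be tallied to recover the molecular formula. This is a minimal base R sketch that copies the element vector from the output above; the lookup table symbols is an illustrative assumption that covers only the three elements present in this compound.

```r
# Atomic numbers copied from the `element` vector above (CID 2244)
element <- c(8, 8, 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1)

# Minimal lookup table covering only the elements that occur here
symbols <- c("1" = "H", "6" = "C", "8" = "O")

# Count atoms per element symbol
atom_counts <- table(symbols[as.character(element)])

# Assemble a formula string with carbon first, then hydrogen, then oxygen
ordered_counts <- atom_counts[c("C", "H", "O")]
formula <- paste0(names(ordered_counts), as.integer(ordered_counts), collapse = "")
formula

# Heavy atoms are all atoms other than hydrogen
heavy_atoms <- sum(element != 1)
heavy_atoms
```

The tally gives nine carbons, eight hydrogens, and four oxygens, with 13 heavy atoms.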

retrieve(object = compound_2244, .slot = "bonds", .to.data.frame = FALSE)
#> $Identifier
#> [1] 2244
#> 
#> $aid1
#>  [1]  1  1  2  2  3  4  5  5  6  6  7  7  8  8  9  9 10 12 13 13 13
#> 
#> $aid2
#>  [1]  5 12 11 21 11 12  6  7  8 11  9 14 10 15 10 16 17 13 18 19 20
#> 
#> $order
#>  [1] 1 1 1 1 2 2 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1

The retrieve function call with the bonds slot extracts information about the bonds in the compound. The output includes three vectors: aid1 and aid2 represent the atom IDs involved in each bond, and order represents the bond order (e.g., single, double bonds).
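The three vectors are parallel, so they can be combined into a bond table and summarised. The sketch below uses base R only and copies the values from the output above; the variable names bonds and valence are illustrative choices, not PubChemR objects.

```r
# Bond vectors copied from the output above (CID 2244)
aid1 <- c(1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 12, 13, 13, 13)
aid2 <- c(5, 12, 11, 21, 11, 12, 6, 7, 8, 11, 9, 14, 10, 15, 10, 16, 17, 13, 18, 19, 20)
bond_order <- c(1, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1)

# One row per bond
bonds <- data.frame(aid1 = aid1, aid2 = aid2, order = bond_order)

# Number of single (1) and double (2) bonds
table(bonds$order)

# Sum of bond orders around each atom; every bond contributes to both ends
valence <- tapply(c(bonds$order, bonds$order), c(bonds$aid1, bonds$aid2), sum)
valence
```

In this record atoms 5 to 13 are the carbons (atomic number 6 in the atoms slot), and each of them ends up with a bond-order sum of 4.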

retrieve(object = compound_2244, .slot = "coords", .to.data.frame = FALSE)
#> $Identifier
#> [1] 2244
#> 
#> $type
#> [1]   1   5 255
#> 
#> $aid
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
#> 
#> $conformers
#> $conformers[[1]]
#> $conformers[[1]]$x
#>  [1] 3.7320 6.3301 4.5981 2.8660 4.5981 5.4641 4.5981 6.3301 5.4641 6.3301
#> [11] 5.4641 2.8660 2.0000 4.0611 6.8671 5.4641 6.8671 2.3100 1.4631 1.6900
#> [21] 6.3301
#> 
#> $conformers[[1]]$y
#>  [1] -0.0600  1.4400  1.4400 -1.5600 -0.5600 -0.0600 -1.5600 -0.5600 -2.0600
#> [10] -1.5600  0.9400 -0.5600 -0.0600 -1.8700 -0.2500 -2.6800 -1.8700  0.4769
#> [19]  0.2500 -0.5969  2.0600
#> 
#> $conformers[[1]]$style
#> $conformers[[1]]$style$annotation
#> [1] 8 8 8 8 8 8
#> 
#> $conformers[[1]]$style$aid1
#> [1] 5 5 6 7 8 9
#> 
#> $conformers[[1]]$style$aid2
#> [1]  6  7  8  9 10 10

The retrieve function call with the coords slot extracts the coordinates of the atoms in the compound. The output includes details such as:

  • type: Represents the type of coordinates.
  • aid: Atom IDs for which the coordinates are provided.
  • conformers: Contains the conformer data, including x and y coordinates for each atom. Because only x and y values are returned, this describes the two-dimensional layout of the atoms used to depict the structure, not a 3D conformation.

retrieve(object = compound_2244, .slot = "props", .to.data.frame = TRUE)
#> # A tibble: 22 × 11
#>    Identifier label name  datatype release value implementation version software
#>         <dbl> <chr> <chr> <chr>    <chr>   <chr> <chr>          <chr>   <chr>   
#>  1       2244 Comp… Cano… 5        2021.1… 1     <NA>           <NA>    <NA>    
#>  2       2244 Comp… <NA>  7        2021.1… 212   E_COMPLEXITY   3.4.8.… Cactvs  
#>  3       2244 Count Hydr… 5        2021.1… 4     E_NHACCEPTORS  3.4.8.… Cactvs  
#>  4       2244 Count Hydr… 5        2021.1… 1     E_NHDONORS     3.4.8.… Cactvs  
#>  5       2244 Count Rota… 5        2021.1… 3     E_NROTBONDS    3.4.8.… Cactvs  
#>  6       2244 Fing… SubS… 16       2021.1… 0000… E_SCREEN       3.4.8.… Cactvs  
#>  7       2244 IUPA… Allo… 1        2021.1… 2-ac… <NA>           2.7.0   Lexiche…
#>  8       2244 IUPA… CAS-… 1        2021.1… 2-ac… <NA>           2.7.0   Lexiche…
#>  9       2244 IUPA… Mark… 1        2021.1… 2-ac… <NA>           2.7.0   Lexiche…
#> 10       2244 IUPA… Pref… 1        2021.1… 2-ac… <NA>           2.7.0   Lexiche…
#> # ℹ 12 more rows
#> # ℹ 2 more variables: source <chr>, parameters <chr>

The retrieve function call with the props slot extracts detailed properties of the compound, including information such as label, name, data type, release, value, implementation, version, software, and source. This comprehensive information covers various physical, chemical, and structural properties of the compound.

retrieve(object = compound_2244, .slot = "count", .to.data.frame = TRUE)
#> # A tibble: 10 × 3
#>    Identifier Name              Value
#>         <dbl> <chr>             <dbl>
#>  1       2244 heavy_atom           13
#>  2       2244 atom_chiral           0
#>  3       2244 atom_chiral_def       0
#>  4       2244 atom_chiral_undef     0
#>  5       2244 bond_chiral           0
#>  6       2244 bond_chiral_def       0
#>  7       2244 bond_chiral_undef     0
#>  8       2244 isotope_atom          0
#>  9       2244 covalent_unit         1
#> 10       2244 tautomers            -1

The retrieve function call with the count slot extracts various count metrics for the compound. The output is a tibble with three columns: Identifier, Name, and Value. The counts are:

  • heavy_atom: The number of heavy (non-hydrogen) atoms in the compound.
  • atom_chiral, atom_chiral_def, atom_chiral_undef: Counts of chiral atoms and their defined/undefined states.
  • bond_chiral, bond_chiral_def, bond_chiral_undef: Counts of chiral bonds and their defined/undefined states.
  • isotope_atom: The number of isotopic atoms.
  • covalent_unit: The number of covalent units in the compound.
  • tautomers: The number of tautomers.

These counts provide insights into the compound’s chemical complexity and stereochemistry, which are essential for understanding its reactivity and biological activity.

3.6. Retrieving Substance Data with get_substances()

The get_substances function retrieves substance data from the PubChem database based on a specified identifier and namespace. This function is crucial for obtaining detailed information about a substance, including its various identifiers, sources, synonyms, comments, cross-references, and compound details.

Here are the main parameters of the function:

  • identifier: A character or numeric vector specifying the identifiers for the request. This can be a substance ID (SID), name, or other supported identifier.
  • namespace: Specifies the namespace for the request. The default value is ‘sid’.
  • operation: Specifies the operation to be performed on the input records. The default value is NULL.
  • searchtype: Specifies the type of search to be performed. The default value is NULL.
  • options: Additional parameters for the query. These can be used to customize the search further.

Retrieving Substances by Name

In this example, we retrieve substance data for aspirin:

substance_data <- get_substances(
  identifier = "aspirin",   
  namespace = "name"
)

substance_data
#> 
#>  An object of class 'PubChemInstanceList'
#> 
#>  Number of instances: 1
#>   - Domain: Substance
#>   - Namespace: Name
#>   - Identifier(s): aspirin
#> 
#>  * Run 'instance(...)' function to extract specific instances from the complete list, and
#>    'request_args(...)' to see all the requested instance identifiers.
#>  * See ?instance and ?request_args for details.

The above code retrieves substance data for the identifier “aspirin”. The output indicates that the request details include the domain (Substance), namespace (Name), and identifier (aspirin). It also mentions that you can run the instance(…) function to extract specific instances and request_args(…) to see all requested instance identifiers.

To see the arguments used in the request, use the request_args function:

request_args(object = substance_data)
#> $namespace
#> [1] "name"
#> 
#> $identifier
#> [1] "aspirin"
#> 
#> $domain
#> [1] "substance"

This output shows the namespace (“name”), identifier (“aspirin”), and domain (“substance”) used in the request.

To extract specific substance data, we use the instance function with the specified identifier:

substance_aspirin <- instance(object = substance_data, .which = "aspirin")

substance_aspirin
#> 
#>  Substance Data from PubChem Database 
#> 
#>  Request Details:  
#>   - Domain: Substance
#>   - Namespace: Name
#>   - Identifier: aspirin
#> 
#>  Number of substances retrieved: 146
#> 
#>  Substances contain data within following slots; 
#>   - sid (2): [<named numeric>] id, version
#>   - source (1): [<named list>] db
#>   - synonyms (6): [<unnamed character>] 
#>   - comment (2): [<unnamed character>] 
#>   - xref (4): [<unnamed list>] 
#>   - compound (2): [<unnamed list>] 
#> 
#>  NOTE: Run getter function 'retrieve()' with element name above to extract data from corresponding list. 
#>        See ?retrieve for details.

The above output shows the request details for aspirin and indicates that 146 substances were retrieved. It lists the slots available for further data extraction. These slots include sid, source, synonyms, comment, xref, and compound.

To extract data from the sid slot as a data frame:

retrieve(object = substance_aspirin, .slot = "sid", .to.data.frame = TRUE)
#> # A tibble: 2 × 3
#>   Identifier Name    Value
#>   <chr>      <chr>   <dbl>
#> 1 aspirin    id       4594
#> 2 aspirin    version    10

This output shows the id and version for the substance “aspirin”. The id is 4594 and the version is 10.

To extract data from the source slot as a data frame:

retrieve(object = substance_aspirin, .slot = "source", .to.data.frame = TRUE)
#> # A tibble: 1 × 3
#>   Identifier name  source_id
#>   <chr>      <chr> <chr>    
#> 1 aspirin    KEGG  C01405

This output shows the source information for “aspirin”. The source is KEGG, and the source ID is C01405.

To extract data from the synonyms slot:

retrieve(object = substance_aspirin, .slot = "synonyms", .to.data.frame = FALSE)
#> $Identifier
#> [1] "aspirin"
#> 
#> [[2]]
#> [1] "2-Acetoxybenzenecarboxylic acid"
#> 
#> [[3]]
#> [1] "50-78-2"
#> 
#> [[4]]
#> [1] "Acetylsalicylate"
#> 
#> [[5]]
#> [1] "Acetylsalicylic acid"
#> 
#> [[6]]
#> [1] "Aspirin"
#> 
#> [[7]]
#> [1] "C01405"

This output lists the synonyms for “aspirin”. These include “2-Acetoxybenzenecarboxylic acid”, “50-78-2”, “Acetylsalicylate”, “Acetylsalicylic acid”, “Aspirin”, and “C01405”.
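Because the result is a list with one named element (Identifier) followed by unnamed entries, it is often convenient to flatten it to a character vector. The following base R sketch assumes the returned object is a plain list shaped exactly like the printed output, and rebuilds that list by hand so the example runs offline.

```r
# A list shaped like the output above: one named element (Identifier)
# followed by unnamed elements holding the synonyms
synonyms_raw <- list(
  Identifier = "aspirin",
  "2-Acetoxybenzenecarboxylic acid",
  "50-78-2",
  "Acetylsalicylate",
  "Acetylsalicylic acid",
  "Aspirin",
  "C01405"
)

# Drop the Identifier element and flatten the rest to a character vector
synonyms <- unlist(synonyms_raw[names(synonyms_raw) != "Identifier"],
                   use.names = FALSE)

# Pick out entries formatted like CAS Registry Numbers (digits-digits-digit)
cas_like <- grep("^[0-9]{2,7}-[0-9]{2}-[0-9]$", synonyms, value = TRUE)
cas_like
```

The pattern only checks the format of the string; it does not validate the CAS check digit.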

To extract data from the comment slot with verbosity:

retrieve(object = substance_aspirin, .slot = "comment", .to.data.frame = FALSE, .verbose = TRUE)
#> 
#>  PubChem Substance Details (comment)
#> 
#>  Same as: <a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=7847177">D00109</a>
#>  Is a reactant of enzyme EC: 3.1.1.55

This output shows comments related to “aspirin”. It indicates that “aspirin” is the same as D00109 and is a reactant of the enzyme EC: 3.1.1.55.

To extract data from the xref slot with verbosity:

retrieve(object = substance_aspirin, .slot = "xref", .to.data.frame = FALSE, .verbose = TRUE)
#> 
#>  PubChem Substance Details (xref)
#> 
#>  > Source: regid
#>     Value: C01405
#> 
#>  > Source: rn
#>     Value: 50-78-2
#> 
#>  > Source: dburl
#>     Value: http://www.genome.jp/kegg/
#> 
#>  > Source: sburl
#>     Value: http://www.genome.jp/dbget-bin/www_bget?cpd:C01405

This output shows cross-references for “aspirin”. It includes the source “regid” with value C01405, the source “rn” with value 50-78-2, the source “dburl” with the URL for the KEGG database, and the source “sburl” with a specific URL for the compound in the KEGG database.

To extract data from the compound slot:

retrieve(object = substance_aspirin, .slot = "compound", .to.data.frame = FALSE)
#> $Identifier
#> [1] "aspirin"
#> 
#> [[2]]
#> [[2]]$id
#> type 
#>    0 
#> 
#> [[2]]$atoms
#> [[2]]$atoms$aid
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13
#> 
#> [[2]]$atoms$element
#>  [1] 8 8 8 8 6 6 6 6 6 6 6 6 6
#> 
#> 
#> [[2]]$bonds
#> [[2]]$bonds$aid1
#>  [1]  1  1  2  3  4  5  5  5  6  8  9 10 11
#> 
#> [[2]]$bonds$aid2
#>  [1]  6 10  7  7 10  6  7  8  9 11 12 13 12
#> 
#> [[2]]$bonds$order
#>  [1] 1 1 2 1 2 1 1 2 2 1 1 1 2
#> 
#> 
#> [[2]]$coords
#> [[2]]$coords[[1]]
#> [[2]]$coords[[1]]$type
#> [1] 1 3
#> 
#> [[2]]$coords[[1]]$aid
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13
#> 
#> [[2]]$coords[[1]]$conformers
#> [[2]]$coords[[1]]$conformers[[1]]
#> [[2]]$coords[[1]]$conformers[[1]]$x
#>  [1] 22.7278 19.0863 21.5033 23.9396 20.2981 21.5226 20.2981 19.0928 21.5226
#> [10] 23.9396 19.0928 20.2981 25.1450
#> 
#> [[2]]$coords[[1]]$conformers[[1]]$y
#>  [1] -15.8040 -14.0004 -13.9940 -17.9642 -15.8105 -16.5029 -14.6927 -16.5029
#>  [9] -17.9133 -16.4964 -17.9133 -18.6250 -15.7977
#> 
#> 
#> 
#> 
#> 
#> [[2]]$charge
#> [1] 0
#> 
#> 
#> [[3]]
#> [[3]]$id
#> [[3]]$id$type
#> [1] 1
#> 
#> [[3]]$id$id
#>  cid 
#> 2244

This output shows detailed compound data for “aspirin”. It includes the atom IDs, elements, bond information, coordinates, and charge. Additionally, it provides an ID of the compound in PubChem (cid 2244).

Each section provides specific details about the substance “aspirin”, making it possible to analyze different aspects of the substance data from the PubChem database.

3.7. Retrieving Chemical Properties with get_properties()

The get_properties function facilitates the retrieval of specific chemical properties of compounds from the PubChem database. This function is essential for researchers and chemists who require detailed chemical information about various compounds.

The function queries the PubChem database using specified identifiers and returns a list or data frame containing the requested properties of each compound. These properties can include molecular weight, chemical formula, isomeric SMILES, and more, depending on the available data in PubChem and the properties requested. You may find the full list of properties at https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest#section=Compound-Property-Tables.

Here are the main parameters of the function:

  • properties: A character vector specifying the properties to be retrieved. This vector can include various chemical properties like mass, molecular formula, InChI, etc.

  • identifier: A vector of identifiers for the compounds. These identifiers can be either positive integers (such as CIDs, which are unique compound identifiers in PubChem) or identifier strings (such as chemical names, SMILES strings, InChI, etc.).

  • namespace: Specifies the type of identifier provided in the identifier parameter. The default value is ‘cid’. Common values for this parameter include cid, name, smiles, and inchi.

  • searchtype: An optional parameter that defines the type of search to be conducted. This can be used to refine and specify the search strategy, such as exact match, substructure search, or similarity search. By default, this parameter is set to NULL, indicating a general search.

  • options: Additional arguments for the query. These can be used to customize the search further, but by default, it is set to NULL.

  • propertyMatch: A list that specifies matching criteria for the properties. It includes:

    • .ignore.case: A logical value indicating whether to ignore case when matching property names. Default is FALSE.
    • type: Specifies the type of match to be performed, such as “contain”, “exact”, “all”. Default is “contain”.

Retrieving Properties by Compounds

In this example, we retrieve properties for the compounds “aspirin” and “ibuprofen”. The propertyMatch argument is used to specify matching criteria, such as ignoring case and using a “contain” type search. Therefore, this code retrieves the properties containing “mass”, “molecular”, and “inchi” for the compounds “aspirin” and “ibuprofen”, ignoring case sensitivity.

props <- get_properties(
  properties = c("mass", "molecular", "inchi"),
  identifier = c("aspirin", "ibuprofen"),
  namespace = "name",
  propertyMatch = list(
    .ignore.case = TRUE,
    type = "contain"
  )
)
props
#> 
#>  An object of class 'PubChemInstanceList'
#> 
#>  Number of instances: 2
#>   - Domain: Compound
#>   - Namespace: Name
#>   - Identifier(s): aspirin, ibuprofen
#> 
#>  * Run 'instance(...)' function to extract specific instances from the complete list, and
#>    'request_args(...)' to see all the requested instance identifiers.
#>  * See ?instance and ?request_args for details.

To extract specific details from the property data, use the retrieve function and select a compound with the .which argument:

retrieve(object = props, .which = "aspirin", .to.data.frame = TRUE)
#> # A tibble: 1 × 8
#>   Identifier   CID MolecularFormula MolecularWeight InChI     InChIKey ExactMass
#>   <chr>      <dbl> <chr>            <chr>           <chr>     <chr>    <chr>    
#> 1 aspirin     2244 C9H8O4           180.16          InChI=1S… BSYNRYM… 180.0422…
#> # ℹ 1 more variable: MonoisotopicMass <chr>

This code extracts the properties of aspirin, providing a detailed summary of its CID, molecular formula, molecular weight, InChI, InChIKey, exact mass, and monoisotopic mass.

retrieve(object = props, .which = "ibuprofen", .to.data.frame = FALSE)
#> $Identifier
#> [1] "ibuprofen"
#> 
#> $CID
#> [1] 3672
#> 
#> $MolecularFormula
#> [1] "C13H18O2"
#> 
#> $MolecularWeight
#> [1] "206.28"
#> 
#> $InChI
#> [1] "InChI=1S/C13H18O2/c1-9(2)8-11-4-6-12(7-5-11)10(3)13(14)15/h4-7,9-10H,8H2,1-3H3,(H,14,15)"
#> 
#> $InChIKey
#> [1] "HEFNNWSXXWATRW-UHFFFAOYSA-N"
#> 
#> $ExactMass
#> [1] "206.130679813"
#> 
#> $MonoisotopicMass
#> [1] "206.130679813"

This code extracts the properties of ibuprofen and displays them as a list. The properties include CID, molecular formula, molecular weight, InChI, InChIKey, exact mass, and monoisotopic mass.

retrieve(object = props, .to.data.frame = TRUE, .combine.all = TRUE)
#> # A tibble: 2 × 8
#>   Identifier   CID MolecularFormula MolecularWeight InChI     InChIKey ExactMass
#>   <chr>      <dbl> <chr>            <chr>           <chr>     <chr>    <chr>    
#> 1 aspirin     2244 C9H8O4           180.16          InChI=1S… BSYNRYM… 180.0422…
#> 2 ibuprofen   3672 C13H18O2         206.28          InChI=1S… HEFNNWS… 206.1306…
#> # ℹ 1 more variable: MonoisotopicMass <chr>

This code combines the properties of all retrieved compounds (aspirin and ibuprofen) into a single data frame, making it easier to compare their properties side by side.
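
Note that several columns in the printed output, such as MolecularWeight, are of type character. A minimal sketch of converting such a column before doing arithmetic is shown below. The data frame is built by hand from the values printed above so that the example runs without a network connection; the name props_df is only illustrative.

```r
# Values copied from the printed output above; constructed by hand so the
# sketch does not need to query PubChem.
props_df <- data.frame(
  Identifier       = c("aspirin", "ibuprofen"),
  CID              = c(2244, 3672),
  MolecularFormula = c("C9H8O4", "C13H18O2"),
  MolecularWeight  = c("180.16", "206.28"),
  stringsAsFactors = FALSE
)

# Convert the character column to numeric before comparing values
props_df$MolecularWeight <- as.numeric(props_df$MolecularWeight)

# Difference in molecular weight between ibuprofen and aspirin
weight_diff <- diff(props_df$MolecularWeight)

# Order compounds from heaviest to lightest
props_df[order(props_df$MolecularWeight, decreasing = TRUE), ]
```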

3.8. Retrieving Synonyms with get_synonyms()

The get_synonyms function is designed to retrieve synonyms for chemical compounds or substances from the PubChem database. It is particularly useful for obtaining various names and identifiers associated with a specific chemical entity.

The function queries the PubChem database for synonyms of a given identifier (such as a Compound ID or a chemical name) and returns a comprehensive list of alternative names and identifiers. This can include systematic names, trade names, registry numbers, and other forms of identification used in scientific literature and industry.

Here are the main parameters of the function:

  • identifier: The identifier for which synonyms are to be retrieved. This can be a numeric value (like a Compound ID) or a character string (like a chemical name).
  • namespace: Specifies the namespace for the query. Common values include:

    • ‘cid’ (Compound Identifier) [default]
    • ‘name’ (Chemical Name)

  • domain: Specifies the domain for the request. Typically, this is ‘compound’. The default value is ‘compound’.
  • searchtype: Specifies the type of search to be performed. The default value is NULL.
  • options: Additional arguments for customization of the request.

Retrieving Synonyms by Compound

In this example, we retrieve synonyms for the compound “aspirin”:

synonyms <- get_synonyms(
  identifier = "aspirin",
  namespace = "name"
)

synonyms
#> 
#>  Synonyms from PubChem Database
#> 
#>  Request Details:  
#>   - Domain: Compound
#>   - Namespace: Name
#>   - Identifier: aspirin
#> 
#>  NOTE: run 'synonyms(...)' to extract synonyms data. See ?synonyms for help.

The above code retrieves synonyms for the compound “aspirin” using its name as the identifier. The namespace is set to “name” to indicate that the identifier is a chemical name.

The returned object holds the synonyms for the compound “aspirin”. As the printed note indicates, they can be extracted with the synonyms() function. These synonyms include various names and identifiers associated with the compound in different contexts, such as:

  • Systematic names (e.g., “2-Acetoxybenzoic acid”)
  • Trade names (e.g., “Ecotrin”)
  • Registry numbers (e.g., “50-78-2”)
  • Other alternative names (e.g., “Acetosalin”, “Polopiryna”)

The retrieved synonyms provide a comprehensive view of the different names and identifiers that can be used to reference the same chemical entity in scientific literature and industry.
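
Once the synonyms are available as a character vector, ordinary base R string functions can separate registry numbers from names. The sketch below uses only the example synonyms listed above, typed in by hand, and assumes that registry numbers follow the usual CAS layout of digits-digits-digit. The variable names are illustrative, and the actual structure returned by the package may differ.

```r
# Example synonyms taken from the list above; entered by hand so the
# sketch runs without querying PubChem.
aspirin_synonyms <- c("2-Acetoxybenzoic acid", "Ecotrin", "50-78-2",
                      "Acetosalin", "Polopiryna")

# Assumed CAS-style layout: 2-7 digits, 2 digits, 1 check digit
cas_pattern <- "^[0-9]{2,7}-[0-9]{2}-[0-9]$"
is_registry <- grepl(cas_pattern, aspirin_synonyms)

registry_numbers <- aspirin_synonyms[is_registry]
other_names      <- aspirin_synonyms[!is_registry]

registry_numbers
other_names
```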

3.9. Retrieving List of Depositors with get_all_sources()

The get_all_sources function facilitates the retrieval of a list of all current depositors for substances or assays from the PubChem database. This function is particularly useful for users who need to identify and analyze the sources of chemical data.

The function queries the PubChem database to obtain a comprehensive list of sources (such as laboratories, companies, or research institutions) that have contributed substance or assay data. This information can be crucial for researchers and professionals who are tracking the origin of specific chemical data or assessing the diversity of data sources in PubChem.

Here is the main parameter of the function:

  • domain: Specifies the domain for which sources are to be retrieved. The domain can be either ‘substance’ or ‘assay’. The default value is ‘substance’.
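
The printed output in the example that follows suggests that get_all_sources() returns a plain character vector of depositor names, so standard string filtering should apply to it. The sketch below shows one such filter on a small hand-typed sample of names taken from that output. The sample vector and the choice of keywords are illustrative assumptions, not part of the package.

```r
# A few depositor names copied from the example output; entered by hand
# so the sketch runs without a network connection.
sample_sources <- c(
  "Abbott Labs",
  "Baker Lab, Chemistry Department, The University of North Carolina at Chapel Hill",
  "BindingDB",
  "Broad Institute",
  "Cayman Chemical"
)

# Keep names that look like academic or research institutions
# (the keyword list is an arbitrary choice for illustration)
academic <- sample_sources[grepl("University|Institute", sample_sources)]

academic
length(academic)
```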

Retrieving All Sources by Substances

In this example, we retrieve all sources for substances:

substance_sources <- get_all_sources(
  domain = "substance"
)

substance_sources
#>   [1] "001Chemical"                                                                                                                                           
#>   [2] "10X CHEM"                                                                                                                                              
#>   [3] "1st Scientific"                                                                                                                                        
#>   [4] "3A SpeedChemical Inc"                                                                                                                                  
#>   [5] "3B Scientific (Wuhan) Corp"                                                                                                                            
#>   [6] "3WAY PHARM INC"                                                                                                                                        
#>   [7] "4C Pharma Scientific Inc"                                                                                                                              
#>   [8] "A&J Pharmtech CO., LTD."                                                                                                                               
#>   [9] "A1 BioChem Labs"                                                                                                                                       
#>  [10] "A2B Chem"                                                                                                                                              
#>  [11] "A2Z Chemical"                                                                                                                                          
#>  [12] "AA BLOCKS"                                                                                                                                             
#>  [13] "AAA Chemistry"                                                                                                                                         
#>  [14] "Aaron Chemicals LLC"                                                                                                                                   
#>  [15] "AAT Bioquest"                                                                                                                                          
#>  [16] "AbaChemScene"                                                                                                                                          
#>  [17] "Abacipharm Corp"                                                                                                                                       
#>  [18] "ABBLIS Chemicals"                                                                                                                                      
#>  [19] "Abbott Labs"                                                                                                                                           
#>  [20] "abcr GmbH"                                                                                                                                             
#>  [21] "Abe Lab, University of Texas MD Anderson Cancer Center"                                                                                                
#>  [22] "ABI Chem"                                                                                                                                              
#>  [23] "AbMole Bioscience"                                                                                                                                     
#>  [24] "AbovChem LLC"                                                                                                                                          
#>  [25] "Abu Montakim Tareq, International Islamic University Chittagong"                                                                                       
#>  [26] "Acadechem"                                                                                                                                             
#>  [27] "Accela ChemBio Inc."                                                                                                                                   
#>  [28] "Ace Therapeutics"                                                                                                                                      
#>  [29] "Acemol"                                                                                                                                                
#>  [30] "Aceschem Inc"                                                                                                                                          
#>  [31] "Acesobio"                                                                                                                                              
#>  [32] "Achem-Block"                                                                                                                                           
#>  [33] "Achemica"                                                                                                                                              
#>  [34] "Achemo Scientific Limited"                                                                                                                             
#>  [35] "Achemtek"                                                                                                                                              
#>  [36] "Acmec Biochemical"                                                                                                                                     
#>  [37] "ACO Pharm Screening Compound"                                                                                                                          
#>  [38] "Acorn PharmaTech Product List"                                                                                                                         
#>  [39] "ACT Chemical"                                                                                                                                          
#>  [40] "Activate Scientific"                                                                                                                                   
#>  [41] "Active Biopharma"                                                                                                                                      
#>  [42] "Adooq BioScience"                                                                                                                                      
#>  [43] "Advanced Technology & Industrial Co., Ltd."                                                                                                            
#>  [44] "AEchem Scientific Corp., USA"                                                                                                                          
#>  [45] "Agios Pharmaceuticals"                                                                                                                                 
#>  [46] "AHH Chemical co.,ltd"                                                                                                                                  
#>  [47] "AIBioTech, LLC"                                                                                                                                        
#>  [48] "AK Scientific, Inc. (AKSCI)"                                                                                                                           
#>  [49] "AKos Consulting & Solutions"                                                                                                                           
#>  [50] "Aladdin"                                                                                                                                               
#>  [51] "Alagar Yadav, Karpagam University"                                                                                                                     
#>  [52] "Alcatraz Chemicals"                                                                                                                                    
#>  [53] "AlchemyPharm"                                                                                                                                          
#>  [54] "Alfa Chemistry"                                                                                                                                        
#>  [55] "AlfaChemInvent LLC"                                                                                                                                    
#>  [56] "Alichem"                                                                                                                                               
#>  [57] "Alinda Chemical Trade Company Ltd"                                                                                                                     
#>  [58] "ALKEMIX"                                                                                                                                               
#>  [59] "Allbio Pharm Co., Ltd"                                                                                                                                 
#>  [60] "Alomone Labs"                                                                                                                                          
#>  [61] "Alsachim"                                                                                                                                              
#>  [62] "Amadis Chemical"                                                                                                                                       
#>  [63] "Amatye"                                                                                                                                                
#>  [64] "Ambeed"                                                                                                                                                
#>  [65] "Ambinter"                                                                                                                                              
#>  [66] "Ambit Biosciences"                                                                                                                                     
#>  [67] "Amfluoro"                                                                                                                                              
#>  [68] "AmicBase - Antimicrobial Activities"                                                                                                                   
#>  [69] "Ampyridine Co.,Ltd"                                                                                                                                    
#>  [70] "AN PharmaTech"                                                                                                                                         
#>  [71] "Analytical Resources Core (ARC), Colorado State University (CSU)"                                                                                      
#>  [72] "Angayarkanni Lab, Department of Microbial Biotechnology, Bharathiar University"                                                                        
#>  [73] "Angel Pharmatech Ltd."                                                                                                                                 
#>  [74] "Angene Chemical"                                                                                                                                       
#>  [75] "Annker Organics"                                                                                                                                       
#>  [76] "Ansion Pharma"                                                                                                                                         
#>  [77] "Anten Chemical"                                                                                                                                        
#>  [78] "Anward"                                                                                                                                                
#>  [79] "AOBChem USA"                                                                                                                                           
#>  [80] "AOBIOUS INC"                                                                                                                                           
#>  [81] "Apeiron Synthesis"                                                                                                                                     
#>  [82] "ApexBio Technology"                                                                                                                                    
#>  [83] "Apexmol"                                                                                                                                               
#>  [84] "Apollo Scientific"                                                                                                                                     
#>  [85] "April Scientific Inc."                                                                                                                                 
#>  [86] "Aribo Reagent"                                                                                                                                         
#>  [87] "Ark Pharm, Inc."                                                                                                                                       
#>  [88] "Ark Pharma Scientific Limited"                                                                                                                         
#>  [89] "Aromalake Chemical"                                                                                                                                    
#>  [90] "Aromsyn catalogue"                                                                                                                                     
#>  [91] "Aronis"                                                                                                                                                
#>  [92] "Arromax Pharmatech Co., Ltd"                                                                                                                           
#>  [93] "ASAS Labor GmbH"                                                                                                                                       
#>  [94] "ASCA GmbH - Angewandte Synthesechemie Adlershof"                                                                                                       
#>  [95] "ASINEX"                                                                                                                                                
#>  [96] "Assembly Blocks Pvt. Ltd."                                                                                                                             
#>  [97] "AstaTech, Inc."                                                                                                                                        
#>  [98] "ATPase-Kinase Pharmacophores (AKP)"                                                                                                                    
#>  [99] "Aurora Fine Chemicals LLC"                                                                                                                             
#> [100] "Aurum Pharmatech LLC"                                                                                                                                  
#> [101] "AVA Biochem Switzerland"                                                                                                                               
#> [102] "AvaChem Scientific"                                                                                                                                    
#> [103] "Avanti Polar Lipids"                                                                                                                                   
#> [104] "Avantor Inc"                                                                                                                                           
#> [105] "AX Molecules Inc"                                                                                                                                      
#> [106] "Axispharm"                                                                                                                                             
#> [107] "Axon Medchem"                                                                                                                                          
#> [108] "AZEPINE"                                                                                                                                               
#> [109] "B&C Chemical"                                                                                                                                          
#> [110] "Baker Lab, Chemistry Department, The University of North Carolina at Chapel Hill"                                                                      
#> [111] "Bangyong Technology  Co., Ltd."                                                                                                                        
#> [112] "Bar-Sagi Lab, NYU School of Medicine"                                                                                                                  
#> [113] "Barrie Walker, BARK Information Services"                                                                                                              
#> [114] "Baynoe Chem"                                                                                                                                           
#> [115] "Be-Medicine"                                                                                                                                           
#> [116] "Beijing Advanced Technology Co, Ltd"                                                                                                                   
#> [117] "Belisle Laboratory, Department of Microbiology, Immunology and Pathology, Colorado State University"                                                   
#> [118] "Beltsville Human Nutrition Research Center, ARS, USDA"                                                                                                 
#> [119] "BenchChem"                                                                                                                                             
#> [120] "BePharm Ltd."                                                                                                                                          
#> [121] "BerrChemical"                                                                                                                                          
#> [122] "Bertin Pharma"                                                                                                                                         
#> [123] "Bestdo Inc"                                                                                                                                            
#> [124] "Bhaskar Lab, Department of Zoology, Sri Venkateswara University, Tirupati, Andhra Pradesh, INDIA"                                                      
#> [125] "Bic Biotech"                                                                                                                                           
#> [126] "BIDD"                                                                                                                                                  
#> [127] "BIND"                                                                                                                                                  
#> [128] "BindingDB"                                                                                                                                             
#> [129] "BioAustralis Fine Chemicals"                                                                                                                           
#> [130] "BioChemPartner"                                                                                                                                        
#> [131] "Biocore"                                                                                                                                               
#> [132] "BioCrick"                                                                                                                                              
#> [133] "BioCyc"                                                                                                                                                
#> [134] "Biological Magnetic Resonance Data Bank (BMRB)"                                                                                                        
#> [135] "Biomatrik Inc. (Monodispersed PEG Manufacturer)"                                                                                                       
#> [136] "Biopharma PEG Scientific Inc"                                                                                                                          
#> [137] "Bioprocess Technology Lab, Department of Microbiology, Bharathidasan University"                                                                       
#> [138] "Biopurify Phytochemicals"                                                                                                                              
#> [139] "Biorbyt"                                                                                                                                               
#> [140] "Biosynce Pharmatech"                                                                                                                                   
#> [141] "Biosynth"                                                                                                                                              
#> [142] "BLD Pharm"                                                                                                                                             
#> [143] "BOC Sciences"                                                                                                                                          
#> [144] "Boehringer Ingelheim - opnMe.com"                                                                                                                      
#> [145] "Boerchem"                                                                                                                                              
#> [146] "Bonglee Kim Lab, Department of Cancer Preventive Material Development, Kyung Hee University"                                                           
#> [147] "Boone Lab, Chemical Genomics, University of Toronto"                                                                                                   
#> [148] "Boroncore"                                                                                                                                             
#> [149] "Boronpharm"                                                                                                                                            
#> [150] "Bradner/Qi Labs at DFCI"                                                                                                                               
#> [151] "Brenntag Connect"                                                                                                                                      
#> [152] "Bright Pigments, Inc"                                                                                                                                  
#> [153] "Broad Institute"                                                                                                                                       
#> [154] "BroadPharm"                                                                                                                                            
#> [155] "Bu Lab, School of Pharmaceutical Sciences, Sun Yat-Sen University"                                                                                     
#> [156] "Buhrlage Lab, Dana-Farber Cancer Institute and Novartis Institutes for BioMedical Research (Cambridge, Mass)"                                          
#> [157] "Burek Lab, Department of Anaesthesiology, Intensive Care, Emergency and Pain Med, University Hospital Wuerzburg"                                       
#> [158] "Burnham Center for Chemical Genomics"                                                                                                                  
#> [159] "C. David Weaver Laboratory, Vanderbilt University"                                                                                                     
#> [160] "Calbiochem"                                                                                                                                            
#> [161] "California Peptide Research, Inc."                                                                                                                     
#> [162] "Cancer Functional Genomics, Wellcome Trust Sanger Institute"                                                                                           
#> [163] "Cancer Research UK Cambridge Research Institute"                                                                                                       
#> [164] "Cangzhou Enke Pharma Tech Co.,Ltd."                                                                                                                    
#> [165] "CAPOT"                                                                                                                                                 
#> [166] "Carbott PharmTech Inc."                                                                                                                                
#> [167] "Carcinogenic Potency Database (CPDB)"                                                                                                                  
#> [168] "Career Henan Chemical Co"                                                                                                                              
#> [169] "Cayman Chemical"                                                                                                                                       
#> [170] "CC_PMLSC"                                                                                                                                              
#> [171] "CCSbase"                                                                                                                                               
#> [172] "CD Biosynsis"                                                                                                                                          
#> [173] "CD Formulation"                                                                                                                                        
#> [174] "CEGChem"                                                                                                                                               
#> [175] "Center for Chemical Genomics, University of Michigan"                                                                                                  
#> [176] "Center for Natural Product Technologies at UIC (CENAPT)"                                                                                               
#> [177] "CF Plus Chemicals"                                                                                                                                     
#> [178] "ChangChem"                                                                                                                                             
#> [179] "Changzhou Highassay Chemical Co., Ltd"                                                                                                                 
#> [180] "Changzhou Naide Chemical"                                                                                                                              
#> [181] "ChEBI"                                                                                                                                                 
#> [182] "Chem-Impex International, Inc."                                                                                                                        
#> [183] "Chem-Space.com Database"                                                                                                                               
#> [184] "Chemaphor Chemical Services"                                                                                                                           
#> [185] "ChemBank"                                                                                                                                              
#> [186] "Chembase.cn"                                                                                                                                           
#> [187] "ChemBioBank"                                                                                                                                           
#> [188] "ChEMBL"                                                                                                                                                
#> [189] "ChemBlock"                                                                                                                                             
#> [190] "ChemBridge"                                                                                                                                            
#> [191] "Chemchart"                                                                                                                                             
#> [192] "ChemDB"                                                                                                                                                
#> [193] "ChemDiv"                                                                                                                                               
#> [194] "Chemenu Inc."                                                                                                                                          
#> [195] "ChemExper Chemical Directory"                                                                                                                          
#> [196] "ChemFaces"                                                                                                                                             
#> [197] "ChemFish Tokyo Co., Ltd."                                                                                                                              
#> [198] "Chemhere"                                                                                                                                              
#> [199] "Chemical Biology Department, Max Planck Institute of Molecular Physiology"                                                                             
#> [200] "Chemical Carcinogenesis Research Information System (CCRIS)"                                                                                           
#> [201] "chemical genetic matrix"                                                                                                                               
#> [202] "Chemical Probes Portal"                                                                                                                                
#> [203] "Chemical Synthesis Database"                                                                                                                           
#> [204] "ChemIDplus"                                                                                                                                            
#> [205] "Chemieliva Pharmaceutical Co., Ltd"                                                                                                                    
#> [206] "ChemieTek"                                                                                                                                             
#> [207] "Cheminformatics Friedrich-Schiller-University Jena"                                                                                                    
#> [208] "ChemLabIndex"                                                                                                                                          
#> [209] "ChemMol"                                                                                                                                               
#> [210] "Chemodex Ltd."                                                                                                                                         
#> [211] "Chemoproteomic Metabolic Pathway Resource, Scripps University"                                                                                         
#> [212] "Chemotion"                                                                                                                                             
#> [213] "ChemProbes"                                                                                                                                            
#> [214] "ChemShuttle"                                                                                                                                           
#> [215] "Chemsoon"                                                                                                                                              
#> [216] "ChemSpider"                                                                                                                                            
#> [217] "ChemTik"                                                                                                                                               
#> [218] "ChemWise"                                                                                                                                              
#> [219] "Chen Lab, School of Medicine, Emory University"                                                                                                        
#> [220] "CHESS fine organics"                                                                                                                                   
#> [221] "China MainChem Co., Ltd"                                                                                                                               
#> [222] "Chiralblock Biosciences"                                                                                                                               
#> [223] "CHIRALEN"                                                                                                                                              
#> [224] "Chirial Bio-material Co., Ltd."                                                                                                                        
#> [225] "Chiron AS"                                                                                                                                             
#> [226] "Chris Southan"                                                                                                                                         
#> [227] "Chung Lab, Department of Pediatrics, Emory University"                                                                                                 
#> [228] "Circadian Research, Kay Laboratory, University of California at San Diego (UCSD)"                                                                      
#> [229] "Ciulli Lab, Division of Biological Chemistry and Drug Discovery, University of Dundee"                                                                 
#> [230] "Clearsynth"                                                                                                                                            
#> [231] "Clinivex"                                                                                                                                              
#> [232] "CLRI (CSIR)"                                                                                                                                           
#> [233] "CMLD-BU"                                                                                                                                               
#> [234] "Collaborative Drug Discovery, Inc."                                                                                                                    
#> [235] "Columbia University Molecular Screening Center"                                                                                                        
#> [236] "Combi-Blocks"                                                                                                                                          
#> [237] "Comparative Toxicogenomics Database (CTD)"                                                                                                             
#> [238] "Compass Remediation Chemicals"                                                                                                                         
#> [239] "Cooke Chemical Co., Ltd"                                                                                                                               
#> [240] "CoreSyn"                                                                                                                                               
#> [241] "Corson Lab, School of Medicine, Indiana University"                                                                                                    
#> [242] "Cosutin Industrial"                                                                                                                                    
#> [243] "Creasyn Finechem"                                                                                                                                      
#> [244] "Creative Biogene"                                                                                                                                      
#> [245] "Creative Biolabs"                                                                                                                                      
#> [246] "Creative Enzymes"                                                                                                                                      
#> [247] "Creative Proteomics"                                                                                                                                   
#> [248] "CreativePeptides"                                                                                                                                      
#> [249] "Crooks Lab, College of Pharmacy, University of Arkansas for Medical Sciences"                                                                          
#> [250] "Crystallography Open Database (COD)"                                                                                                                   
#> [251] "CSNpharm"                                                                                                                                              
#> [252] "Cure First"                                                                                                                                            
#> [253] "cyandye llc"                                                                                                                                           
#> [254] "Cyclic PharmaTech"                                                                                                                                     
#> [255] "CYH Pharma"                                                                                                                                            
#> [256] "CymitQuimica"                                                                                                                                          
#> [257] "Dao Fu Chemical"                                                                                                                                       
#> [258] "DAOGE BIOPHARMA"                                                                                                                                       
#> [259] "Davey Lab, Department of Microbiology, NEIDL, Boston University"                                                                                       
#> [260] "Day Biochem"                                                                                                                                           
#> [261] "DC Chemicals"                                                                                                                                          
#> [262] "Debye Scientific Co., Ltd"                                                                                                                             
#> [263] "Denison Lab, Department of Environmental Toxicology, UC Davis"                                                                                         
#> [264] "Department of drug chemistry, Lithuanian University of Health Sciences"                                                                                
#> [265] "Department of Molecular Cell Biology, Weizmann Institute of Science"                                                                                   
#> [266] "Department of Pharmacy, LMU"                                                                                                                           
#> [267] "Derbyshire Lab, Chemistry Department, Duke University"                                                                                                 
#> [268] "DerMardirossian Lab, San Diego Biomedical Research Institute"                                                                                          
#> [269] "Dharmacon, a Horizon Discovery Group company"                                                                                                          
#> [270] "Diabetic Complications Screening"                                                                                                                      
#> [271] "DiRusso Lab, Biochemistry Department, University of Nebraska"                                                                                          
#> [272] "DiscoveryGate"                                                                                                                                         
#> [273] "Domainex"                                                                                                                                              
#> [