Text Analysis with the ctrialsgov Package

Taylor Arnold and Michael J Kane

Setup

To start, we will load the package and the sample dataset. The code below also works with the entire dataset, though it may run a bit slower.

library(ctrialsgov)
library(dplyr)
ctgov_load_sample()

Keywords in Context

The function ctgov_kwic highlights all of the occurrences of a term within its context (the few words before and after the term). For example, to show the occurrences of the term “bladder” in the titles of the interventional trials we can do this:

z <- ctgov_query(study_type = "Interventional")
ctgov_kwic("bladder", z$brief_title)
## ible Local Advanved |Bladder| Cancer
##  of Sulforaphane in |Bladder| Cancer Chemoprevent
##       Comparison of |Bladder| Filling vs. Non-Fil
## tment of Overactive |Bladder|/Urge Incontinence
## in the Detection of |Bladder| Cancer in the Surve
## A-F Betafood on Gall|bladder| and Liver Function.
## Non-Muscle Invasive |Bladder| Cancer
## iopathic Overactive |Bladder| With Urinary Incont
## tion for Overactive |Bladder|
## anced or Metastatic |Bladder| Cancer
##      Autologous Neo-|Bladder| Construct in Non-Ne
## urogenic Overactive |Bladder| and Urge Predominan
##  AMG 706 on the Gall|bladder| in Advanced Solid T
## Men With Overactive |Bladder|.

The function also has an option to attach a label to each occurrence, which is printed at the start of each row. Here we will print the NCT id for each trial:

z <- ctgov_query(study_type = "Interventional")
ctgov_kwic("bladder", z$brief_title, z$nct_id)
## [NCT04553939] ible Local Advanved |Bladder| Cancer
## [NCT03517995]  of Sulforaphane in |Bladder| Cancer Chemoprevent
## [NCT04210479]       Comparison of |Bladder| Filling vs. Non-Fil
## [NCT03535857] tment of Overactive |Bladder|/Urge Incontinence
## [NCT02560584] in the Detection of |Bladder| Cancer in the Surve
## [NCT01981343] A-F Betafood on Gall|bladder| and Liver Function.
## [NCT01625260] Non-Muscle Invasive |Bladder| Cancer
## [NCT00910845] iopathic Overactive |Bladder| With Urinary Incont
## [NCT00912314] tion for Overactive |Bladder|
## [NCT00635726] anced or Metastatic |Bladder| Cancer
## [NCT00594139]      Autologous Neo-|Bladder| Construct in Non-Ne
## [NCT00594139] urogenic Overactive |Bladder| and Urge Predominan
## [NCT00448786]  AMG 706 on the Gall|bladder| in Advanced Solid T
## [NCT00282932] Men With Overactive |Bladder|.

There are some other options that change the way the output is displayed. The default (shown above) prints the results using the cat function. Other options return the results as a character vector or a data frame, which is useful for further post-processing. There is also a flag use_color that prints the term in color rather than between pipes; it looks great in a terminal or RStudio but does not display correctly when knit to HTML.
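
To make the idea of a keyword in context concrete, here is a minimal base R sketch that returns the matches as a data frame. It is not the package's implementation: the helper kwic_sketch, its arguments, and the toy titles and ids are all hypothetical and serve only as illustration.

```r
# Hypothetical helper (not part of ctrialsgov): find each case-insensitive
# occurrence of `term` and return it with `width` characters of context.
kwic_sketch <- function(term, text, labels = seq_along(text), width = 20L) {
  # Literal (non-regex) search on lower-cased copies of the inputs.
  hits <- gregexpr(tolower(term), tolower(text), fixed = TRUE)
  rows <- lapply(seq_along(text), function(i) {
    start <- as.integer(hits[[i]])
    if (is.na(start[1]) || start[1] == -1L) return(NULL)  # no match here
    end <- start + nchar(term) - 1L
    data.frame(
      name  = labels[i],
      left  = substring(text[i], pmax(start - width, 1L), start - 1L),
      term  = substring(text[i], start, end),
      right = substring(text[i], end + 1L, end + width),
      stringsAsFactors = FALSE
    )
  })
  # Drop texts without a match and stack the remaining rows.
  do.call(rbind, Filter(Negate(is.null), rows))
}

# Toy input; the titles and ids are made up for illustration.
titles <- c("Overactive Bladder in Adults",
            "Gallbladder and Liver Function",
            "Knee Surgery")
ids <- c("A1", "A2", "A3")
res <- kwic_sketch("bladder", titles, ids, width = 10L)
print(res)
```

Returning a data frame rather than printing makes it easy to filter or join the matches with other trial fields afterwards.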

TF-IDF

We can use a technique called term frequency-inverse document frequency (TF-IDF) to determine the most important words in a collection of text fields. To implement this in R we will use the ctgov_tfidf function:

z <- ctrialsgov::ctgov_query()
tfidf <- ctgov_tfidf(z$description)
print(tfidf, n = 30)
## # A tibble: 3,074 × 2
##      doc terms                                               
##    <int> <chr>                                               
##  1     0 aortix|heightened|aki|providing|abdominal           
##  2     1 pollution|ms|air|viral|france                       
##  3     2 fmt|diversity|microbiota|weight|gut                 
##  4     3 nerves|landmarks|bony|guidance|knee                 
##  5     4 antihistamines|h1|inadequately|spontaneous|suffering
##  6     5 bandage|seroma|drain|categorical|variables          
##  7     6 vagal|mediterranean|nerve|diet|depression           
##  8     7 veterans|peer|whole|steps|structured                
##  9     8 suicidal|ideation|telehealth|engagement|counseling  
## 10     9 bct|ce|breast|structures|cancers                    
## 11    10 athletes|pathways|exercise|biomarkers|strenuous     
## 12    11 acetylsalicylic|vessels|artery|affects|acid         
## 13    12 mhealth|90|monitoring|organ|impact                  
## 14    13 cascade|sugar|glucose|sensor|doctors                
## 15    14 variant|b1351|cov2|sars|b16172                      
## 16    15 scenario|uncertainties|oncological|relating|real    
## 17    16 itch|epigenetic|mechanisms|chronic|antagonists      
## 18    17 9vhpv|1526|hiv|living|uninfected                    
## 19    18 dengue|fever|permeability|five|vascular             
## 20    19 influenza|icu|aspergillosis|eortc|pathogen          
## 21    20 cannabigerol|cbg|thc|appetite|stimulating           
## 22    21 antibiotics|decide|how|parent|prescribed            
## 23    22 counseling|education|behavior|behavioral|his        
## 24    23 intrauterine|adhesiolysis|leaf|film|named           
## 25    24 purifiers|cardiopulmonary|indicators|students|air   
## 26    25 donepezil|french|alzheimers|efficiency|authorities  
## 27    26 avelumab|checkpoint|breast|immune|aspirin           
## 28    27 dbs|ps|expectations|pd|preoperative                 
## 29    28 brentuximab|vedotin|classic|nivolumab|checkpoint    
## 30    29 wl|calorie|aas|ba|crc                               
## # … with 3,044 more rows

By default the terms are converted to lower case, but (particularly with acronyms) it may be better to preserve their capitalization. Here is how we can do that in this example:

tfidf <- ctgov_tfidf(z$description, tolower = FALSE)
print(tfidf, n = 30)
## # A tibble: 3,074 × 2
##      doc terms                                              
##    <int> <chr>                                              
##  1     0 heightened|AKI|providing|abdominal|System          
##  2     1 pollution|MS|air|viral|France                      
##  3     2 FMT|diversity|microbiota|weight|gut                
##  4     3 nerves|landmarks|bony|guidance|knee                
##  5     4 H1|inadequately|spontaneous|suffering|comparison   
##  6     5 seroma|drain|categorical|variables|regression      
##  7     6 Mediterranean|vagal|nerve|diet|depression          
##  8     7 Whole|Veterans|package|Health|mental               
##  9     8 suicidal|ideation|telehealth|engagement|counseling 
## 10     9 BCT|CE|breast|structures|cancers                   
## 11    10 athletes|pathways|exercise|biomarkers|strenuous    
## 12    11 Acetylsalicylic|Acid|vessels|artery|affects        
## 13    12 mHealth|90|monitoring|organ|impact                 
## 14    13 sugar|glucose|sensor|doctors|venous                
## 15    14 variant|CoV2|SARS|Beta|vaccine                     
## 16    15 scenario|uncertainties|oncological|relating|real   
## 17    16 itch|epigenetic|mechanisms|chronic|Similarly       
## 18    17 9vHPV|1526|HIV|living|uninfected                   
## 19    18 dengue|fever|permeability|five|vascular            
## 20    19 influenza|ICU|EORTC|aspergillosis|pathogen         
## 21    20 THC|appetite|stimulating|subjective|analgesic      
## 22    21 antibiotics|decide|how|prescribed|pneumonia        
## 23    22 counseling|education|behavioral|his|behavior       
## 24    23 intrauterine|film|named|adhesion|barrier           
## 25    24 purifiers|cardiopulmonary|air|students|indicators  
## 26    25 French|Alzheimers|efficiency|controversy|reimbursed
## 27    26 checkpoint|Immune|breast|immune|aspirin            
## 28    27 DBS|PS|expectations|PD|preoperative                
## 29    28 vedotin|brentuximab|nivolumab|classic|checkpoint   
## 30    29 WL|calorie|AAs|BA|CRC                              
## # … with 3,044 more rows

We can also refine the results by including fewer rare terms. The argument min_df specifies the minimal proportion of documents that must contain a term for it to be returned as a keyword; the upper bound can also be specified with the argument max_df.

tfidf <- ctgov_tfidf(z$description, min_df = 0.02, max_df = 0.2)
print(tfidf, n = 30)
## # A tibble: 3,072 × 2
##      doc terms                                                
##    <int> <chr>                                                
##  1     0 injury|support|cardiovascular|performance|feasibility
##  2     1 impact|care|health|risk|better                       
##  3     2 weight|loss|body|10|least                            
##  4     3 but|these|compare|two|which                          
##  5     4 adult|tolerability|chronic|placebo|participants      
##  6     5 outcome|multiple|analysis|evaluated|performed        
##  7     6 depression|function|assess|efficacy                  
##  8     7 support|health|level|care|primary                    
##  9     8 improved|support|high|intervention|care              
## 10     9 breast|out|being|performed|if                        
## 11    10 exercise|events|associated|compared|inflammation     
## 12    11 condition|heart|information|blood|when               
## 13    12 impact|days|objective|outcome|secondary              
## 14    13 levels|diabetes|blood|level|device                   
## 15    14 vaccine|novel|include|label|open                     
## 16    15 about|new|studies|free|benefit                       
## 17    16 chronic|examine|testing|disorder|while               
## 18    17 vaccine|women|among|those|dose                       
## 19    18 death|treat|syndrome|pilot|evaluation                
## 20    19 pulmonary|incidence|observational|multi|identify     
## 21    20 alone|combination|assess|effects                     
## 22    21 how|often|children|if|not                            
## 23    22 diseases|chronic|follow|it|changes                   
## 24    23 novel|aims|controlled|randomized|efficacy            
## 25    24 explore|aims|changes|function|health                 
## 26    25 non|disease|cognitive|approach|currently             
## 27    26 breast|immune|approximately|called|cancer            
## 28    27 postoperative|improvement|result|brain|specific      
## 29    28 treated|cells|may|cancer|ability                     
## 30    29 prevention|interventions|weight|its|no               
## # … with 3,042 more rows

Any number of text fields can be passed to the ctgov_tokens function; all of the fields for a specific trial are pasted together and treated as a single block of text.
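
For readers who want to see what the TF-IDF scores and the document-frequency bounds do, the following base R sketch computes them on three toy documents. It rests on several assumptions: the helper tfidf_sketch is hypothetical, the weighting is plain term count times log(1 / document proportion), and tokens are split on non-alphanumeric characters. The package's exact tokenization and weighting may differ.

```r
# Hypothetical helper (not the ctgov_tfidf implementation).
tfidf_sketch <- function(docs, min_df = 0, max_df = 1, nterms = 5L,
                         lower = TRUE) {
  if (lower) docs <- tolower(docs)
  # Split on runs of non-alphanumeric characters and drop empty tokens.
  tokens <- lapply(strsplit(docs, "[^[:alnum:]]+"), function(x) x[nzchar(x)])
  vocab <- sort(unique(unlist(tokens)))
  # Term counts: one row per document, one column per vocabulary term.
  tf <- do.call(rbind, lapply(tokens, function(x) {
    as.numeric(table(factor(x, levels = vocab)))
  }))
  colnames(tf) <- vocab
  # Proportion of documents that contain each term.
  df <- colMeans(tf > 0)
  keep <- df >= min_df & df <= max_df
  tf <- tf[, keep, drop = FALSE]
  # Weight counts by inverse document frequency.
  scores <- sweep(tf, 2, log(1 / df[keep]), `*`)
  kept_terms <- colnames(tf)
  # Keep the highest-scoring terms per document, joined with "|".
  terms <- vapply(seq_len(nrow(scores)), function(i) {
    s <- scores[i, ]
    top <- head(order(s, decreasing = TRUE), nterms)
    paste(kept_terms[top[s[top] > 0]], collapse = "|")
  }, character(1))
  data.frame(doc = seq_along(docs) - 1L, terms = terms,
             stringsAsFactors = FALSE)
}

# Toy documents, made up for illustration.
docs <- c("bladder cancer trial of a new drug",
          "overactive bladder trial in adults",
          "knee surgery trial with a new device")
res <- tfidf_sketch(docs, nterms = 3L)
# Raising min_df removes terms that occur in only one document.
res_common <- tfidf_sketch(docs, min_df = 0.5, nterms = 3L)
print(res)
print(res_common)
```

In this sketch a word that appears in every document, such as "trial", receives a weight of zero and is never returned, while raising min_df shifts the keywords towards more common vocabulary, as in the output above.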

Document Similarity

Finally, the package also provides a function for producing similarity scores based on the text fields of the studies. Here, we will produce a similarity matrix based on the description field of Interventional, Industry-sponsored, Phase 2 trials.

z <- ctgov_query(
  study_type = "Interventional", sponsor_type = "Industry", phase = "Phase 2"
)
scores <- ctgov_text_similarity(z$description, min_df = 0, max_df = 0.1)
dim(scores)
## [1] 147 147

The returned value is a square matrix with one row and one column for each clinical trial in the set. We can use these scores to find studies that are particularly close to one another in the words used within their descriptions. Here, for example, are the five studies with the highest similarity scores against the study in column 100:

index <- order(scores[,100], decreasing = TRUE)[1:5]
z$brief_title[index]
## [1] "AL-38583 Ophthalmic Solution for Allergic Conjunctivitis Associated Inflammation" 
## [2] "Safety and Efficacy of BRM421 for Dry Eye Syndrome"                               
## [3] "Phosphorylcholine PC-mAb Effects in Subjects With Elevated Lipoprotein a"         
## [4] "Safety, Clinical Tolerability and Immunogenicity of Increasing Doses of gpASIT+TM"
## [5] "Phase 2 Clinical Trial of CartiLife in the United States"

Further post-processing can be done with the similarity scores, such as spectral clustering and dimensionality reduction.
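
As a rough illustration of such post-processing, the sketch below applies a simple two-way spectral split and a two-dimensional spectral embedding using only base R. The matrix sim is a made-up stand-in for the scores matrix, and the sketch assumes the similarities are symmetric and non-negative; a real analysis would likely use a dedicated clustering package.

```r
# Toy symmetric similarity matrix standing in for `scores`
# (the values are invented for illustration).
sim <- matrix(c(
  1.0, 0.8, 0.7, 0.1, 0.2,
  0.8, 1.0, 0.6, 0.2, 0.1,
  0.7, 0.6, 1.0, 0.1, 0.1,
  0.1, 0.2, 0.1, 1.0, 0.9,
  0.2, 0.1, 0.1, 0.9, 1.0
), nrow = 5, byrow = TRUE)
n <- nrow(sim)

# Unnormalized graph Laplacian L = D - W, with similarities as edge weights.
lap <- diag(rowSums(sim)) - sim

# eigen() sorts eigenvalues in decreasing order, so the smallest come last.
ev <- eigen(lap, symmetric = TRUE)

# The sign of the eigenvector for the second-smallest eigenvalue (the
# Fiedler vector) splits the studies into two groups. The group labels
# themselves are arbitrary because the eigenvector's sign is arbitrary.
fiedler <- ev$vectors[, n - 1]
groups <- ifelse(fiedler > 0, 1L, 2L)

# The eigenvectors of the two smallest non-trivial eigenvalues give a
# two-dimensional embedding that can be plotted.
embedding <- ev$vectors[, c(n - 1, n - 2)]
print(groups)
```

With this toy matrix the first three studies fall in one group and the last two in the other, matching the block structure of the similarities.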