Medicines Side-effects and their Substitutes

This dataset contains comprehensive information on over 248,000 medical drugs from all manufacturers available worldwide. The data includes details such as drug names, active ingredients, therapeutic uses, dosage, side effects, and substitutes. The dataset aims to provide a useful resource for medical researchers, healthcare professionals, and drug manufacturers.

1 Importing Libraries

For manipulating and tidying data, the tidyverse package in R is hard to beat. tidyverse is a collection of R packages such as

dplyr and tidyr for manipulating data
ggplot2 for visualizing and rendering plots
lubridate for dealing with dates and time series
forcats for working with factors (categorical data)
readr for importing, reading, writing different file formats

Let's import the libraries.

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2 Importing data

Code

med_data <- read_csv("F:/r_language/quarto/blog/Data/250k Medicines Usage, Side Effects and Substitutes.csv",
  guess_max = 30000) # guess_max: number of rows readr inspects when guessing column types

We imported the .csv file and can see there are a total of 248218 rows and 58 columns, of which 1 column is numeric (dbl) and 57 columns are classified as character (chr). The guess_max argument sets how many rows readr inspects when guessing column types; raising it helps columns that are mostly empty near the top of the file get the right type.
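An alternative to raising guess_max is to state the column types up front, so readr does not have to guess at all. The sketch below is illustrative only: it reads a tiny inline CSV with made-up rows (not the real file) and assumes every column except id should be text, as the glimpse of this dataset suggests.

```r
library(readr)

# A made-up three-row CSV standing in for the real file (illustrative only).
toy_csv <- I("id,name,sideEffect3\n1,drug a,\n2,drug b,\n3,drug c,Rash\n")

# Declare the types explicitly: id as double, everything else as character.
toy <- read_csv(
  toy_csv,
  col_types = cols(.default = col_character(), id = col_double())
)

spec(toy)   # prints the column specification readr actually used
```

With explicit types, a column that is empty for thousands of rows and only later contains text is still read as character.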

2.1 Glancing data

Now let's take a sneak peek into the data.

Code

med_data %>% glimpse()

Rows: 248,218
Columns: 58
$ id                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ name                <chr> "augmentin 625 duo tablet", "azithral 500 tablet",…
$ substitute0         <chr> "Penciclav 500 mg/125 mg Tablet", "Zithrocare 500m…
$ substitute1         <chr> "Moxikind-CV 625 Tablet", "Azax 500 Tablet", "Ambr…
$ substitute2         <chr> "Moxiforce-CV 625 Tablet", "Zady 500 Tablet", "Zer…
$ substitute3         <chr> "Fightox 625 Tablet", "Cazithro 500mg Tablet", "Ca…
$ substitute4         <chr> "Novamox CV 625mg Tablet", "Trulimax 500mg Tablet"…
$ sideEffect0         <chr> "Vomiting", "Vomiting", "Nausea", "Headache", "Sle…
$ sideEffect1         <chr> "Nausea", "Nausea", "Vomiting", "Drowsiness", "Dry…
$ sideEffect2         <chr> "Diarrhea", "Abdominal pain", "Diarrhea", "Dizzine…
$ sideEffect3         <chr> NA, "Diarrhea", "Upset stomach", "Nausea", NA, "Sk…
$ sideEffect4         <chr> NA, NA, "Stomach pain", NA, NA, "Flu-like symptoms…
$ sideEffect5         <chr> NA, NA, "Allergic reaction", NA, NA, "Headache", N…
$ sideEffect6         <chr> NA, NA, "Dizziness", NA, NA, "Drowsiness", NA, NA,…
$ sideEffect7         <chr> NA, NA, "Headache", NA, NA, "Dizziness", NA, NA, N…
$ sideEffect8         <chr> NA, NA, "Rash", NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect9         <chr> NA, NA, "Hives", NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sideEffect10        <chr> NA, NA, "Tremors", NA, NA, NA, NA, NA, NA, NA, NA,…
$ sideEffect11        <chr> NA, NA, "Palpitations", NA, NA, NA, NA, NA, NA, NA…
$ sideEffect12        <chr> NA, NA, "Muscle cramp", NA, NA, NA, NA, NA, NA, NA…
$ sideEffect13        <chr> NA, NA, "Increased heart rate", NA, NA, NA, NA, NA…
$ sideEffect14        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect15        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect16        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect17        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect18        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect19        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect20        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect21        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect22        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect23        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect24        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect25        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect26        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect27        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect28        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect29        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect30        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect31        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect32        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect33        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect34        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect35        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect36        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect37        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect38        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect39        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect40        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideEffect41        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ use0                <chr> "Treatment of Bacterial infections", "Treatment of…
$ use1                <chr> NA, NA, NA, "Treatment of Allergic conditions", NA…
$ use2                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ use3                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ use4                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Chemical Class`    <chr> NA, "Macrolides", NA, "Diphenylmethane Derivative"…
$ `Habit Forming`     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N…
$ `Therapeutic Class` <chr> "ANTI INFECTIVES", "ANTI INFECTIVES", "RESPIRATORY…
$ `Action Class`      <chr> NA, "Macrolides", NA, "H1 Antihistaminics (second …

Code

# finding rows and columns of data
med_data %>% dim()

[1] 248218     58

A total of 248218 rows are present with 58 columns, in which:

id is a number that can act as a primary key
name is the name of the drug
substitute0 to substitute4 are alternative drugs that have the same use as the drug in the name column
sideEffect0 to sideEffect41 are side-effects caused by the drug
use0 to use4 are the conditions the drug is used to treat
Chemical Class is the chemical group of the medicine
Habit Forming indicates whether a drug is addictive or not
Therapeutic Class is the broad treatment area the drug is intended for (for example anti-infectives)
Action Class categorizes how a drug acts in the body

3 Cleaning the data

We can't be sure that the values in the columns are free of stray spaces, inconsistent letter case, or misplaced brackets, so let's standardize them.

Code

# lower-casing column names and replacing spaces in them with underscores,
# then lower-casing and trimming whitespace in every character column
med_data <- med_data %>%
  rename_with(~gsub(" ", "_", tolower(.x))) %>% 
  mutate(across(where(is.character), ~tolower(.))) %>% 
  mutate(across(where(is.character), ~trimws(.)))

# replacing all the '{' with '(' and '}' with ')'

med_data <- med_data %>% 
  mutate(chemical_class = str_replace_all(chemical_class, "\\{", "\\("),
         chemical_class = str_replace_all(chemical_class, "\\}", "\\)"))

3.1 Finding `NA`s and Duplicates

Let's look at the NAs and the duplicates in the data.

Code

# finding NA's in each columns
med_data %>% map(~sum(is.na(.))) %>% unlist()

               id              name       substitute0       substitute1 
                0                 0              9597             14351 
      substitute2       substitute3       substitute4       sideeffect0 
            17985             21362             24256                 0 
      sideeffect1       sideeffect2       sideeffect3       sideeffect4 
             9802             18718             40580             84658 
      sideeffect5       sideeffect6       sideeffect7       sideeffect8 
           116960            156361            180468            199712 
      sideeffect9      sideeffect10      sideeffect11      sideeffect12 
           210510            220944            227887            231936 
     sideeffect13      sideeffect14      sideeffect15      sideeffect16 
           233491            237799            240537            242209 
     sideeffect17      sideeffect18      sideeffect19      sideeffect20 
           242836            243703            244272            244995 
     sideeffect21      sideeffect22      sideeffect23      sideeffect24 
           245093            245170            245313            245495 
     sideeffect25      sideeffect26      sideeffect27      sideeffect28 
           246715            246715            246724            246724 
     sideeffect29      sideeffect30      sideeffect31      sideeffect32 
           246780            246889            246889            246890 
     sideeffect33      sideeffect34      sideeffect35      sideeffect36 
           247049            247052            248216            248216 
     sideeffect37      sideeffect38      sideeffect39      sideeffect40 
           248216            248216            248216            248216 
     sideeffect41              use0              use1              use2 
           248216                 0            174853            219911 
             use3              use4    chemical_class     habit_forming 
           240839            243247            110427                 0 
therapeutic_class      action_class 
               69            110182

Code

# finding duplicates
duplicated(med_data) %>% sum()

[1] 0

There are no fully duplicated rows, but there are a great many NAs, which is not helpful. Only 5 columns, i.e., id, name, sideEffect0, use0 and Habit Forming, have no empty values.
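Raw NA counts are easier to compare when expressed as a share of rows. A minimal sketch on a made-up tibble; the column names mirror the dataset, but the values are invented for illustration:

```r
library(dplyr)
library(tidyr)

# Invented rows; only the column names echo the real data.
toy <- tibble(
  id          = 1:4,
  sideeffect0 = c("nausea", "rash", "headache", "nausea"),
  sideeffect1 = c("vomiting", NA, NA, NA)
)

# mean(is.na(x)) is the fraction of missing values in a column.
na_share <- toy %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "na_share") %>%
  arrange(desc(na_share))

na_share
```

Sorting by the share puts the emptiest columns first, which makes it easy to decide which ones are worth keeping in the wide format.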

3.2 Finding unique values

Even though there are no NAs in the id and name columns, let's make sure there are no duplicates.

Code

# counting unique values in each column
med_data %>% map(n_distinct) %>% unlist()

               id              name       substitute0       substitute1 
           248218            222825             19374             16309 
      substitute2       substitute3       substitute4       sideeffect0 
            14289             12774             11689               326 
      sideeffect1       sideeffect2       sideeffect3       sideeffect4 
              335               352               363               359 
      sideeffect5       sideeffect6       sideeffect7       sideeffect8 
              325               299               275               254 
      sideeffect9      sideeffect10      sideeffect11      sideeffect12 
              232               212               182               174 
     sideeffect13      sideeffect14      sideeffect15      sideeffect16 
              145               121                95                78 
     sideeffect17      sideeffect18      sideeffect19      sideeffect20 
               66                52                42                36 
     sideeffect21      sideeffect22      sideeffect23      sideeffect24 
               30                25                18                14 
     sideeffect25      sideeffect26      sideeffect27      sideeffect28 
               11                11                10                10 
     sideeffect29      sideeffect30      sideeffect31      sideeffect32 
                9                 6                 6                 5 
     sideeffect33      sideeffect34      sideeffect35      sideeffect36 
                4                 3                 2                 2 
     sideeffect37      sideeffect38      sideeffect39      sideeffect40 
                2                 2                 2                 2 
     sideeffect41              use0              use1              use2 
                2               655               335               139 
             use3              use4    chemical_class     habit_forming 
               74                34               833                 2 
therapeutic_class      action_class 
               23               432

There are 248,218 ids but only 222,825 distinct drug names, so 25,393 rows carry a name that already appears elsewhere. Let's check how many rows are repeated once the id is ignored.

Code

# flagging rows that are identical to an earlier row once id is ignored
duplicated_values <- med_data %>% select(-id) %>% duplicated()

duplicated_values %>% sum()

[1] 24204
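To see which names repeat, rather than only how many rows do, one option is to count the names and keep those that occur more than once. A minimal sketch on invented data (the toy tibble is an assumption, not part of the dataset):

```r
library(dplyr)

# Invented rows for illustration.
toy <- tibble(
  id   = 1:5,
  name = c("drug a", "drug b", "drug a", "drug c", "drug a")
)

# Names that occur on more than one row, most frequent first.
repeated_names <- toy %>%
  count(name, sort = TRUE) %>%
  filter(n > 1)

repeated_names
```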

Let's remove the duplicates from the dataframe and create a dataset with unique rows.

Code

# using filter function to remove duplicates
med_data_unique <- med_data %>% filter(!duplicated(select(., -id)))

dim(med_data_unique)

[1] 224014     58
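The same de-duplication can also be sketched with distinct(), which keeps the first row of each group of matching rows. This is an alternative formulation on invented data, not the code used above, and it assumes dplyr 1.1.0 or later for pick():

```r
library(dplyr)

# Invented rows: ids 1 and 2 describe the same drug.
toy <- tibble(
  id   = 1:3,
  name = c("drug a", "drug a", "drug b"),
  use0 = c("pain", "pain", "fever")
)

# Uniqueness is judged on every column except id; .keep_all keeps id as well.
toy_unique <- toy %>% distinct(pick(-id), .keep_all = TRUE)

toy_unique$id
```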

4 Data Manipulation

4.1 Pivoting Data

Longer-format data is much easier for machines to read and work with than wider-format data, and it lets us drop NAs in the columns much more easily without losing data. It comes at a cost, though: long data is harder for humans to comprehend, and the number of rows can grow so large that it is no longer worth it.

We can pivot the data from the wide format into a narrow (long) format to make it easier to manipulate.
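As a small illustration of the idea, here is a sketch on a made-up two-row table. It uses the values_drop_na argument of pivot_longer() to discard the empty slots during the pivot; the toy values are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Invented wide table: drug 2 has only one substitute.
wide <- tibble(
  id          = c(1, 2),
  substitute0 = c("drug x", "drug y"),
  substitute1 = c("drug z", NA)
)

long <- wide %>%
  pivot_longer(
    cols = starts_with("substitute"),
    names_to = "sub_num",
    values_to = "substitute_drug",
    values_drop_na = TRUE   # drop the empty slots while pivoting
  )

long
```

Four wide cells become three long rows, because the one NA never makes it into the long table.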

Code

#|label: pivoting_substitute_drug

# pivoting data
med_data_sub <- med_data_unique %>% select(id:substitute4, use0:use4) %>% 
  pivot_longer(cols = starts_with("substitute"),
               names_to = "sub_num",
               values_to = "substitute_drug")
# counting NA's
med_data_sub %>% map(~sum(is.na(.))) %>% unlist()

             id            name            use0            use1            use2 
              0               0               0          786770          991525 
           use3            use4         sub_num substitute_drug 
        1086255         1097150               0           80639

Code

glimpse(med_data_sub)

Rows: 1,120,070
Columns: 9
$ id              <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, …
$ name            <chr> "augmentin 625 duo tablet", "augmentin 625 duo tablet"…
$ use0            <chr> "treatment of bacterial infections", "treatment of bac…
$ use1            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ use2            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ use3            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ use4            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sub_num         <chr> "substitute0", "substitute1", "substitute2", "substitu…
$ substitute_drug <chr> "penciclav 500 mg/125 mg tablet", "moxikind-cv 625 tab…

As you can see, the data now has only 9 columns and 1,120,070 rows.

Now let's pivot the uses of the drugs as well, so that the data is tidier and the duplicates and NA values are easier to remove.

Code

# pivoting data 
medi_use_pivot <- 
  med_data_sub %>% select(-sub_num) %>% 
  pivot_longer(cols = starts_with("use"),
               names_to = "use_num",
               values_to = "use") %>% 
  select(-use_num) %>% filter(!is.na(use))

# checking for NA values
medi_use_pivot %>% map(~sum(is.na(.))) %>% unlist()

             id            name substitute_drug             use 
              0               0          111133               0

Code

# checking for duplicates
medi_use_pivot %>% duplicated() %>% sum()

[1] 88192

Code

# removing duplicated data
med_use <- medi_use_pivot %>% filter(!duplicated(.))

# glimpse of data
glimpse(med_use)

Rows: 1,550,458
Columns: 4
$ id              <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, …
$ name            <chr> "augmentin 625 duo tablet", "augmentin 625 duo tablet"…
$ substitute_drug <chr> "penciclav 500 mg/125 mg tablet", "moxikind-cv 625 tab…
$ use             <chr> "treatment of bacterial infections", "treatment of bac…

We can use the same pivot method to convert the sideEffect columns into a longer format. I am doing this case by case rather than all in a single table, because a single table would be very long and full of NAs that are hard to filter. The separate tables can still be joined with the *_join functions on the id column, as it can act as a primary key.
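A minimal sketch of such a join, on two invented long tables keyed by id. Because each id can appear several times on both sides, the join is many-to-many; the relationship argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Invented long tables sharing the id key.
uses <- tibble(
  id  = c(1, 1, 2),
  use = c("bacterial infections", "fever", "allergy")
)
effects <- tibble(
  id           = c(1, 1, 2),
  side_effects = c("nausea", "rash", "drowsiness")
)

# Declaring the relationship makes the row multiplication explicit
# instead of triggering dplyr's many-to-many warning.
joined <- inner_join(uses, effects, by = "id", relationship = "many-to-many")

joined
```

Drug 1 contributes 2 x 2 = 4 rows and drug 2 contributes 1, so joined has 5 rows. On the full tables this multiplication is the reason to join only the pieces a question actually needs.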

Code

# pivoting data with side-effect columns

side_effect_med <- 
  med_data_unique %>% select(id, name, sideeffect0:sideeffect41) %>%
  pivot_longer(cols = starts_with("sideeffect"),
               names_to = "sideeffect_num",
               values_to = "side_effects") %>% 
  select(-sideeffect_num)

# counting NA's
side_effect_med %>% map(~sum(is.na(.))) %>% unlist()

          id         name side_effects 
           0            0      7950288

Code

# dropping NA's and duplicates
side_effect_med <- side_effect_med %>% drop_na() %>% 
  filter(!duplicated(.))

# finding duplicates
duplicated(side_effect_med) %>% sum()

[1] 0

Code

glimpse(side_effect_med)

Rows: 1,458,295
Columns: 3
$ id           <dbl> 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ name         <chr> "augmentin 625 duo tablet", "augmentin 625 duo tablet", "…
$ side_effects <chr> "vomiting", "nausea", "diarrhea", "vomiting", "nausea", "…

Now let's use the pivoted data to plot graphs.

5 Visualising with `ggplot2`

ggplot2 is one of the most versatile packages I have come across for visualizing data using the Grammar of Graphics.

5.1 Bar plots

Let's find out and plot which chemical class most of the drugs in the data belong to.

Code

chem_cl_top_10 <-  
  med_data_unique %>% select(name, chemical_class) %>% 
  count(chemical_class) %>% rename("number_of_meds" = n) %>% 
  slice_max(number_of_meds, n=10) %>%
  # missing classes are real NA values, not the string "NA"
  filter(!is.na(chemical_class)) %>% 
  mutate(chemical_class = str_to_sentence(chemical_class))

chem_cl_top_10 %>% 
  ggplot(aes(x = fct_reorder(chemical_class, number_of_meds),
             y = number_of_meds)) +
  geom_col(aes(fill = chemical_class)) +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.y = element_text(size = 10)) +
  labs(x = "Chemical Class", y= "Number of Medicines",
       title = "Most Common Chemical Class"
       ) +
  theme(plot.title = element_text(size = 20)) +
  scale_fill_brewer(palette = "Set1") + coord_flip()

From the graph we can see that most of the drugs in the data belong to the chemical class `{r} chem_cl_top_10[1,1]`, with `{r} chem_cl_top_10[1,2]` drugs, followed by `{r} chem_cl_top_10[2,1]` with `{r} chem_cl_top_10[2,2]`, and then `{r} chem_cl_top_10[3,1]`, `{r} chem_cl_top_10[4, 1]` and `{r} chem_cl_top_10[5, 1]`.

Now that we have a basic idea of the data, let's answer some questions.

6 Finding Answers to Specific Questions

Let's begin with the simple ones.

6.1 Addictive drugs

Let's find the habit-forming drugs in the dataset and which chemical classes they belong to.

Code

habit_forming_classes <- 
  med_data_unique %>% filter(habit_forming == "yes") %>% 
  select(name, chemical_class) %>%
  count(chemical_class, sort = TRUE)

habit_forming_classes

# A tibble: 17 × 2
   chemical_class                                   n
   <chr>                                        <int>
 1 <NA>                                          2532
 2 benzodiazepines derivative                    2025
 3 anisole derivative                             365
 4 imidazopyridine derivative                     194
 5 benzodiazepine derivative                       83
 6 barbituric acid derivative                      55
 7 diphenylmethane derivative                      49
 8 cyclopyrrolone derivative                       28
 9 phenanthrenes derivatives                       26
10 phenanthrenes derivative                        22
11 aralkylamine derivative                         21
12 benzomorphan derivatives                        20
13 phenylpiperidine derivatives                    12
14 pyrazolopyrimidine derivative                   11
15 ultrashort-acting barbituric acid derivative    11
16 amphetamines derivatives                         6
17 phenylheptylamines derivative                    1

We can see in the table that 2532 habit-forming drugs do not have their chemical class recorded, while the benzodiazepines derivative class has 2025 habit-forming drugs. Some classes also appear under two spellings (for example benzodiazepine derivative and benzodiazepines derivative), so their counts are split across rows.

6.2 No Substitute Drugs

Let's find the drugs that have no substitutes, no common side-effects, are not habit forming, and have several uses.

Code

med_data_unique %>%
  # drugs with no substitute
  filter(if_all(substitute0:substitute4, is.na) &
         # medicines whose only side-effect entry is "no common side effects seen"
         if_all(sideeffect1:sideeffect41, is.na) &
           sideeffect0 == "no common side effects seen" &
         # medicines with at least three listed uses
         if_all(use0:use2, ~!is.na(.)) &
         # not habit forming
         habit_forming == "no" &
         # chemical class, therapeutic class and action class are known
         !is.na(chemical_class) &
         !is.na(therapeutic_class) &
         !is.na(action_class)) %>% 
  head()

# A tibble: 1 × 58
     id name         substitute0 substitute1 substitute2 substitute3 substitute4
  <dbl> <chr>        <chr>       <chr>       <chr>       <chr>       <chr>      
1 33940 bentoform d… <NA>        <NA>        <NA>        <NA>        <NA>       
# ℹ 51 more variables: sideeffect0 <chr>, sideeffect1 <chr>, sideeffect2 <chr>,
#   sideeffect3 <chr>, sideeffect4 <chr>, sideeffect5 <chr>, sideeffect6 <chr>,
#   sideeffect7 <chr>, sideeffect8 <chr>, sideeffect9 <chr>,
#   sideeffect10 <chr>, sideeffect11 <chr>, sideeffect12 <chr>,
#   sideeffect13 <chr>, sideeffect14 <chr>, sideeffect15 <chr>,
#   sideeffect16 <chr>, sideeffect17 <chr>, sideeffect18 <chr>,
#   sideeffect19 <chr>, sideeffect20 <chr>, sideeffect21 <chr>, …

Among the 224014 drugs, only Bentoform Dental Gel has no common side-effects, no alternative drugs, at least three listed uses, is not habit forming, and has a known chemical, therapeutic and action class.

6.3 Most Popular Drug form

Let's find the most common form, i.e., tablet, tonic, etc., in the dataset.

Code

med_data_unique[2,2]

# A tibble: 1 × 1
  name               
  <chr>              
1 azithral 500 tablet

The medicine name usually ends with its form, but that might not be true for all of them. So let's do a string search to detect the form in which each medicine is sold or consumed; once the form is extracted into a separate column, we can find the most popular type.

Making a new dataframe by detecting strings in the name column

Code

med_form_df <- 
  med_data_unique %>%
  select(name) %>%
  mutate(med_type = case_when(
    # searching for specific type of medicine and making it a column;
    # the first matching pattern wins, so the order of the rules matters
    str_detect(name, "tablet") ~ "tablet",
    str_detect(name, "capsule") ~ "capsule",
    str_detect(name, "syrup") ~ "syrup",
    str_detect(name, "oral suspension") ~ "oral suspension",
    str_detect(name, "suspension") ~ "suspension",
    str_detect(name, "cream|lotion") ~ "cream",
    str_detect(name, "gel") ~ "gel",
    str_detect(name, "drop") ~ "drop",
    str_detect(name, "bar") ~ "bar",
    str_detect(name, "solution") ~ "solution",
    str_detect(name, "cap") ~ "caps",
    str_detect(name, "infusion") ~ "infusion",
    str_detect(name, "injection") ~ "injection",
    str_detect(name, "granules") ~ "granules",
    TRUE ~ "others"
  )) %>%
  count(med_type, sort = TRUE)

med_form_df

# A tibble: 15 × 2
   med_type             n
   <chr>            <int>
 1 tablet          135267
 2 injection        26643
 3 capsule          18966
 4 syrup            16003
 5 drop              5508
 6 cream             5502
 7 others            5319
 8 oral suspension   4373
 9 suspension        2510
10 gel               1731
11 solution          1014
12 infusion           767
13 caps               275
14 granules            78
15 bar                 58

As we can see, tablets are the most common form with 135267 medicines, followed by injections with 26643; capsules and syrups take third and fourth place respectively.
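The order of the rules in case_when() matters for these counts, because the first condition that is TRUE decides the label. A minimal sketch on three invented names shows why "oral suspension" has to be tested before "suspension":

```r
library(dplyr)
library(stringr)

# Invented names for illustration.
toy <- tibble(name = c("abc oral suspension", "xyz suspension", "pqr tablet"))

classified <- toy %>%
  mutate(
    # specific pattern listed first: both kinds of suspension are told apart
    specific_first = case_when(
      str_detect(name, "oral suspension") ~ "oral suspension",
      str_detect(name, "suspension")      ~ "suspension",
      TRUE                                ~ "others"
    ),
    # general pattern listed first: the "oral suspension" rule is never reached
    general_first = case_when(
      str_detect(name, "suspension")      ~ "suspension",
      str_detect(name, "oral suspension") ~ "oral suspension",
      TRUE                                ~ "others"
    )
  )

classified
```

The same reasoning applies to short patterns such as "bar" or "cap", which match anywhere inside a name. Anchoring them with word boundaries (for example "\\bbar\\b") would be one way to tighten them, at the cost of changing the counts shown above.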

6.4 Most Common Side-Effects

This is where pivoting the data comes in useful. We cannot easily find the most common side-effect from the wide table, as there are 42 side-effect columns with NAs scattered through them. By pivoting to the longer format, each side-effect gets its own row, which makes removing the NAs easy.

Code

side_effect_med %>% 
  filter(side_effects != "no common side effects seen") %>%
  count(side_effects, sort = TRUE) %>%  
  slice_max(n, n = 10) %>% 
  mutate(side_effects = str_to_title(side_effects)) %>% 
  ggplot(aes(x = fct_reorder(side_effects, n), y = n)) +
  geom_col(aes(fill = n)) +
  scale_y_continuous(
    labels = scales::number_format(scale = 1e-3, suffix = "K")
  ) +
  labs(x = "Side-effects", y = "Frequency", 
       title = "Most Common Side-effects") +
  theme(legend.position = "none")

6.5 Action class with most unique Side-effects

Let's find which action_class has the most unique side-effects in the data.

Code

action_class_sideeffects <-
  med_data_unique %>% select(name, sideeffect0:sideeffect41,
                           action_class) %>% 
  pivot_longer(cols = sideeffect0:sideeffect41,
               names_to = "sideeffect_num",
               values_to = "side_effect") %>%
  # removing medicine name and sideeffect_num
  select(-sideeffect_num, -name) %>%
  # removing duplicates so that only unique side-effect & action_class remain
  filter(!duplicated(.))
  
action_class_sideeffects %>% count(action_class, sort = TRUE) %>% 
  drop_na() %>% slice_max(n, n = 10)

# A tibble: 10 × 2
   action_class                        n
   <chr>                           <int>
 1 glucocorticoids                    88
 2 tyrosine kinase inhibitors         79
 3 vitamins                           73
 4 anticancer-others                  66
 5 antimetabolites                    66
 6 atypical antipsychotics            65
 7 sodium channel modulators (aed)    62
 8 alkaloids-cytotoxic agents         56
 9 alkylating agent                   53
10 quinolones/ fluroquinolones        53

Drugs in the glucocorticoids action class have 88 unique side-effects, followed by tyrosine kinase inhibitors with 79; vitamins and anticancer-others come next.
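One caveat about these counts: the pivot keeps the NA entries, so any action class with at least one empty side-effect slot also contributes an (action_class, NA) pair, and each figure may therefore be one higher than the true number of distinct side-effects. A minimal sketch on invented data of a variant that drops the NAs during the pivot (the toy values are assumptions, not taken from the dataset):

```r
library(dplyr)
library(tidyr)

# Invented rows: class a has two distinct side-effects, class b has one.
toy <- tibble(
  action_class = c("class a", "class a", "class b"),
  sideeffect0  = c("nausea", "nausea", "rash"),
  sideeffect1  = c("rash", NA, NA)
)

unique_effects <- toy %>%
  pivot_longer(
    cols = starts_with("sideeffect"),
    names_to = "sideeffect_num",
    values_to = "side_effect",
    values_drop_na = TRUE          # NA slots never reach the count
  ) %>%
  distinct(action_class, side_effect) %>%
  count(action_class, sort = TRUE)

unique_effects
```

Without values_drop_na, both toy classes would be counted one higher, because NA would be treated as just another side-effect.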
