Getting started with topic modeling with dog breed traits

By Brendan Graham in tidy tuesday topic modeling tidytext

February 1, 2022

Introduction

I first learned about topic modeling after watching this video of Julia Silge creating topic models for Spice Girls lyrics. After checking out this weeks Tidy Tuesday datasets I thought I might try to learn about topic modeling and to develop one using dog breed traits.

Explore the data

breed_traits <-
  readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv') %>%
  janitor::clean_names() %>%
  mutate(breed = str_squish(breed))

trait_description <-
  readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/trait_description.csv') %>%
  janitor::clean_names() %>%
  mutate(trait = str_squish(trait))

breed_rank_all <-
  readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv') %>%
  janitor::clean_names() %>%
  mutate(breed = str_squish(breed))

breed_traits

## # A tibble: 195 × 17
##    breed     affectionate_wit… good_with_young_… good_with_other… shedding_level
##    <chr>                 <dbl>             <dbl>            <dbl>          <dbl>
##  1 Retrieve…                 5                 5                5              4
##  2 French B…                 5                 5                4              3
##  3 German S…                 5                 5                3              4
##  4 Retrieve…                 5                 5                5              4
##  5 Bulldogs                  4                 3                3              3
##  6 Poodles                   5                 5                3              1
##  7 Beagles                   3                 5                5              3
##  8 Rottweil…                 5                 3                3              3
##  9 Pointers…                 5                 5                4              3
## 10 Dachshun…                 5                 3                4              2
## # … with 185 more rows, and 12 more variables: coat_grooming_frequency <dbl>,
## #   drooling_level <dbl>, coat_type <chr>, coat_length <chr>,
## #   openness_to_strangers <dbl>, playfulness_level <dbl>,
## #   watchdog_protective_nature <dbl>, adaptability_level <dbl>,
## #   trainability_level <dbl>, energy_level <dbl>, barking_level <dbl>,
## #   mental_stimulation_needs <dbl>

# trait_description <- 
#   trait_description %>% 
#   mutate(trait = stringr::str_to_lower(trait),
#          trait = stringr::str_replace_all(trait, ' ', '_'),
#          trait = stringr::str_replace_all(trait, '/', '_'))

trait_description

## # A tibble: 16 × 4
##    trait                      trait_1                    trait_5  description   
##    <chr>                      <chr>                      <chr>    <chr>         
##  1 Affectionate With Family   Independent                Lovey-D… How affection…
##  2 Good With Young Children   Not Recommended            Good Wi… A breed's lev…
##  3 Good With Other Dogs       Not Recommended            Good Wi… How generally…
##  4 Shedding Level             No Shedding                Hair Ev… How much fur …
##  5 Coat Grooming Frequency    Monthly                    Daily    How frequentl…
##  6 Drooling Level             Less Likely to Drool       Always … How drool-pro…
##  7 Coat Type                  -                          -        Canine coats …
##  8 Coat Length                -                          -        How long the …
##  9 Openness To Strangers      Reserved                   Everyon… How welcoming…
## 10 Playfulness Level          Only When You Want To Play Non-Stop How enthusias…
## 11 Watchdog/Protective Nature What's Mine Is Yours       Vigilant A breed's ten…
## 12 Adaptability Level         Lives For Routine          Highly … How easily a …
## 13 Trainability Level         Self-Willed                Eager t… How easy it w…
## 14 Energy Level               Couch Potato               High En… The amount of…
## 15 Barking Level              Only To Alert              Very Vo… How often thi…
## 16 Mental Stimulation Needs   Happy to Lounge            Needs a… How much ment…

breed_rank_all

## # A tibble: 195 × 11
##    breed       x2013_rank x2014_rank x2015_rank x2016_rank x2017_rank x2018_rank
##    <chr>            <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1 Retrievers…          1          1          1          1          1          1
##  2 French Bul…         11          9          6          6          4          4
##  3 German She…          2          2          2          2          2          2
##  4 Retrievers…          3          3          3          3          3          3
##  5 Bulldogs             5          4          4          4          5          5
##  6 Poodles              8          7          8          7          7          7
##  7 Beagles              4          5          5          5          6          6
##  8 Rottweilers          9         10          9          8          8          8
##  9 Pointers (…         13         12         11         11         10          9
## 10 Dachshunds          10         11         13         13         13         12
## # … with 185 more rows, and 4 more variables: x2019_rank <dbl>,
## #   x2020_rank <dbl>, links <chr>, image <chr>

Top and bottom 5 breeds based on avg rank 2012-2020

breed_rank_all %>%
  select(-c(links, image)) %>%
  pivot_longer(cols = starts_with("x"), values_to = 'rank') %>%
  group_by(breed) %>%
  summarise(mean_rank = mean(rank, na.rm = T),
            sd = sd(rank, na.rm = T)) %>%
  ungroup() %>%
  arrange(mean_rank) %>%
  mutate(rank = row_number()) %>% 
  filter(rank <=5 | rank >= nrow(.) - 4) %>% 
    add_table(rows = 10)

Between 2013 - 2020, Labrador Retrievers were never not #1. French Bulldogs ranking was are most improved from 9 to 2, while Yorkshire Terriers have largest drop from 6 to 10.

breed_rank_all %>%
  # select(-c(links, image)) %>%
  pivot_longer(cols = starts_with("x"), values_to = 'rank') %>%
  mutate(name = str_replace_all(name, "x", ""),
         name = str_replace_all(name, "_rank", ""),
         year = as.numeric(name)
  ) %>%
  select(-name) %>%
  group_by(year) %>%
  mutate(top_15 = ifelse(rank <= 10, 1, 0)) %>% 
  filter(top_15 == 1) %>%
  ungroup() %>%
  group_by(breed) %>%
  mutate(
    label = case_when(
      year == min(year) ~ paste(breed, rank),
      year == max(year) ~ paste(breed, rank),
      TRUE ~ NA_character_),
    image_url = case_when(
      year == min(year) ~ image,
      year == max(year) ~ image,
      TRUE ~ NA_character_)) %>%
  ggplot(aes(x = as.factor(year), y = rank, group = breed, label = label)) + 
  geom_point(show.legend = FALSE, aes(color = breed)) + 
  geom_line(show.legend = FALSE, aes(color = breed)) + 
  geom_image(aes(image = image_url), size=.05) + 
  geom_label_repel(nudge_y = 0.4) + 
  bg_theme() + 
  scale_y_continuous(breaks = seq(1, 10, 1)) + 
  labs(x = NULL, y = "rank", title = "Changes in Breed Ranks", subtitle = "top 10 breeds in each year only")

mean_rank <- 
  breed_rank_all %>%
  select(-c(links, image)) %>%
  pivot_longer(cols = starts_with("x"), values_to = 'rank') %>%
  group_by(breed) %>%
  summarise(mean_rank = mean(rank, na.rm = T),
            sd = sd(rank)) %>% 
  mutate(breed = trimws(breed),
         breed = as.character(breed)) %>%
  ungroup() %>%
  arrange(mean_rank) %>%
  mutate(rank = row_number()) %>%
  ungroup() %>%
  as_tibble()

traits <- 
  breed_traits %>%
  select(breed, coat_type, coat_length, everything()) %>% 
  ungroup() %>%
  as_tibble() %>%
  pivot_longer(cols = 4:ncol(.), names_to = "trait", values_to = 'trait_score') %>% 
  left_join(., trait_description, by = 'trait') %>%
  ungroup() %>%
  as_tibble() %>% 
  left_join(., mean_rank, by = 'breed')

traits %>% 
  add_table()

Topic Modeling

This dataset gave the opportunity to dive into Text Mining with R to learn more about text analysis and topic modeling.

Topic modeling is a method for unsupervised classification of such documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for.

Next the tidytext’s unnest_tokens() is used to separate the trait descriptions into words, then stop words and some other the common words are removed. This results in a dataframe with 1 word per trait per row. as a first step

trait_list <- 
  c("Affectionate With Family", "Trainability Level", "Barking Level", "Mental Stimulation Needs")

tidy_descripion <-
  trait_description %>%
  filter(trait %in% trait_list) %>%
  unnest_tokens(word, description) %>%
  anti_join(get_stopwords()) %>% 
  filter(!word %in% c("breed", "breeds", "breed's", "dogs"))
  
tidy_descripion %>%
  count(word, sort = TRUE)

## # A tibble: 76 × 2
##    word             n
##    <chr>        <int>
##  1 want             4
##  2 can              3
##  3 bark             2
##  4 dog              2
##  5 everyone         2
##  6 like             2
##  7 others           2
##  8 owner            2
##  9 projects         2
## 10 affectionate     1
## # … with 66 more rows

Next we need to make the data into a DocumentTermMatrix using cast_dtm()

desc_dtm <-
  tidy_descripion %>%
  count(trait, word) %>%
  tidytext::cast_dtm(trait, word, n)

dim(desc_dtm)

## [1]  4 76

This means there are 4 traits (i.e. documents) and 76 different tokens (i.e. terms or words) in our dataset for modeling. Then we use LDA() to create a 4 topic model, one for each of the 4 traits we filtered for above.

# Latent Dirichlet allocation algorithms for topic modelling
traits_lda <- 
  LDA(desc_dtm, k = 4, control = list(seed = 173))

traits_lda

## A LDA_VEM topic model with 4 topics.

we can examine per-topic-per-word probabilities, beta, where the model computes the probability of a given term being generated from a given topic. For example, the term “affectionate” has an almost zero probability of being generated from topics 1, 2, or 3, but it makes up about 5% of topic 4.

# per-topic-per-word probabilities.
trait_topics <- 
  tidy(traits_lda, matrix = "beta")

trait_topics

## # A tibble: 304 × 3
##    topic term              beta
##    <int> <chr>            <dbl>
##  1     1 affectionate 1.67e-146
##  2     2 affectionate 7.24e-146
##  3     3 affectionate 5.37e-146
##  4     4 affectionate 5.88e-  2
##  5     1 aloof        1.67e-146
##  6     2 aloof        7.24e-146
##  7     3 aloof        5.37e-146
##  8     4 aloof        5.88e-  2
##  9     1 best         1.67e-146
## 10     2 best         7.24e-146
## # … with 294 more rows

From the plot we can pretty clearly see that topic 1 is probably Mental Stimulation Needs, topic 2 is likely Trainability Level, topic 3 is Barking Level and topic 4 is probably Affectionate With Family

top_terms <- 
  trait_topics %>%
  ungroup() %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>% 
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(x = beta, y = term, fill = factor(topic))) +
  geom_col(show.legend = FALSE, alpha = .76) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() + 
  bg_theme() +
  scale_fill_npg()

We can then examine per-document-per-topic probabilities, gamma. This lets us take a look at topic probabilities of each trait. This aligns well with the plot above.

traits_gamma <- 
  tidy(traits_lda, matrix = "gamma")

traits_gamma

## # A tibble: 16 × 3
##    document                 topic    gamma
##    <chr>                    <int>    <dbl>
##  1 Affectionate With Family     1 0.000719
##  2 Barking Level                1 0.000582
##  3 Mental Stimulation Needs     1 0.999   
##  4 Trainability Level           1 0.000643
##  5 Affectionate With Family     2 0.000719
##  6 Barking Level                2 0.000582
##  7 Mental Stimulation Needs     2 0.000395
##  8 Trainability Level           2 0.998   
##  9 Affectionate With Family     3 0.000719
## 10 Barking Level                3 0.998   
## 11 Mental Stimulation Needs     3 0.000395
## 12 Trainability Level           3 0.000643
## 13 Affectionate With Family     4 0.998   
## 14 Barking Level                4 0.000582
## 15 Mental Stimulation Needs     4 0.000395
## 16 Trainability Level           4 0.000643

traits_gamma %>%
  ggplot(aes(x = factor(topic), y = gamma, color = factor(topic))) +
  geom_point(size = 3, show.legend = FALSE) +
  facet_wrap(~ document) +
  labs(x = "topic", y = expression(gamma)) + 
  bg_theme() +
  scale_color_npg()

Posted on:: February 1, 2022

Length:: 8 minute read, 1603 words

Categories:: tidy tuesday topic modeling tidytext

Tags:: tidy tuesday topic modeling tidytext

See Also:: Classifying Bigfoot Encounters; Creating a Hex Bin Map to Show Changes Pell Grants; Combining 'Traditional' and Text-Based Models to Board Game Ratings