Categorical variables • elicitr

library(elicitr)

Many of the concepts introduced in vignette("continuous_variables") also apply to categorical variables, and the function names are the same except that they carry the prefix cat instead of cont. However, the workflow for loading and analysing data collected during the elicitation of categorical variables differs in some respects. This vignette guides you through loading and analysing categorical data.

Datasets

There are three datasets included in the package for demonstration purposes: ?topic_1, ?topic_2, and ?topic_3:

topic_1
#> # A tibble: 120 × 5
#>    name            category   option   confidence estimate
#>    <chr>           <chr>      <chr>         <dbl>    <dbl>
#>  1 Derek Maclellan category_1 option_1         15     0.08
#>  2 Derek Maclellan category_2 option_1         15     0   
#>  3 Derek Maclellan category_3 option_1         15     0.85
#>  4 Derek Maclellan category_4 option_1         15     0.02
#>  5 Derek Maclellan category_5 option_1         15     0.05
#>  6 Derek Maclellan category_1 option_2         35     0.02
#>  7 Derek Maclellan category_2 option_2         35     0.11
#>  8 Derek Maclellan category_3 option_2         35     0.18
#>  9 Derek Maclellan category_4 option_2         35     0.02
#> 10 Derek Maclellan category_5 option_2         35     0.67
#> # ℹ 110 more rows

topic_2
#> # A tibble: 100 × 5
#>    name              category   option   confidence estimate
#>    <chr>             <chr>      <chr>         <dbl>    <dbl>
#>  1 Christopher Felix category_1 option_1        100     0.09
#>  2 Christopher Felix category_2 option_1        100     0.21
#>  3 Christopher Felix category_3 option_1        100     0.11
#>  4 Christopher Felix category_4 option_1        100     0.59
#>  5 Christopher Felix category_5 option_1        100     0   
#>  6 Christopher Felix category_1 option_2         20     0.09
#>  7 Christopher Felix category_2 option_2         20     0.05
#>  8 Christopher Felix category_3 option_2         20     0.33
#>  9 Christopher Felix category_4 option_2         20     0.24
#> 10 Christopher Felix category_5 option_2         20     0.29
#> # ℹ 90 more rows

topic_3
#> # A tibble: 90 × 5
#>    name            category   option   confidence estimate
#>    <chr>           <chr>      <chr>         <dbl>    <dbl>
#>  1 Derek Maclellan category_1 option_1         80     0.02
#>  2 Derek Maclellan category_2 option_1         80     0.02
#>  3 Derek Maclellan category_3 option_1         80     0.01
#>  4 Derek Maclellan category_4 option_1         80     0.87
#>  5 Derek Maclellan category_5 option_1         80     0.08
#>  6 Derek Maclellan category_1 option_2         50     0.11
#>  7 Derek Maclellan category_2 option_2         50     0.09
#>  8 Derek Maclellan category_3 option_2         50     0.17
#>  9 Derek Maclellan category_4 option_2         50     0.09
#> 10 Derek Maclellan category_5 option_2         50     0.54
#> # ℹ 80 more rows

In each dataset the first column contains the name of the expert and the second the categories of the categorical variable. Each category can have different options, saved in column three. The fourth column contains the expert’s confidence, and the fifth the expert’s estimate.
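In the rows printed above, the estimates behave like proportions that sum to 1 within each expert and option. The vignette does not state this as a requirement, so treat it as an observation about the printed rows only. A minimal base R sketch of the check, using the first ten rows of topic_1 copied from the printout:

```r
# First ten rows of topic_1, copied from the printout above
first_rows <- data.frame(
  name = "Derek Maclellan",
  category = rep(paste0("category_", 1:5), times = 2),
  option = rep(c("option_1", "option_2"), each = 5),
  confidence = rep(c(15, 35), each = 5),
  estimate = c(0.08, 0, 0.85, 0.02, 0.05,
               0.02, 0.11, 0.18, 0.02, 0.67)
)

# Sum the estimates within each expert and option
totals <- aggregate(estimate ~ name + option, data = first_rows, FUN = sum)
totals

# Compare with a tolerance, because the sums are floating-point numbers
sums_to_one <- abs(totals$estimate - 1) < 1e-8
sums_to_one
```

The same aggregate() call can be run on a full dataset to spot expert and option combinations whose estimates do not add up.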

Load data

We start by creating the ?elic_cat object with the function cat_start(). As for continuous variables, this object stores the metadata of the elicitation process:

my_categories <- c("category_1", "category_2", "category_3",
                   "category_4", "category_5")
my_options <- c("option_1", "option_2", "option_3", "option_4")
my_topics <- c("topic_1", "topic_2", "topic_3")
my_elicitation <- cat_start(categories = my_categories,
                            options = my_options,
                            experts = 6,
                            topics = my_topics)
#> ✔ <elic_cat> object for "Elicitation" correctly initialised
my_elicitation
#> 
#> ── Elicitation ──
#> 
#> • Categories: "category_1", "category_2", "category_3", "category_4", and
#> "category_5"
#> • Options: "option_1", "option_2", "option_3", and "option_4"
#> • Number of experts: 6
#> • Topics: "topic_1", "topic_2", and "topic_3"
#> • Data available for 0 topics

This elicitation process is for a categorical variable with five categories, estimated for four options and three topics by six experts.

As we did for continuous variables, we can load the data, here with the function cat_add_data():

my_elicitation <- cat_add_data(my_elicitation,
                               data_source = topic_1,
                               topic = "topic_1") |>
  cat_add_data(data_source = topic_2, topic = "topic_2") |>
  cat_add_data(data_source = topic_3, topic = "topic_3")
#> ✔ Data added to Topic "topic_1" from "data.frame"
#> ✔ Data added to Topic "topic_2" from "data.frame"
#> ✔ Data added to Topic "topic_3" from "data.frame"

Again, the metadata are used to validate the data. If the data are not consistent with the metadata, an error is raised. For example, if we try to load data with a category that is not defined in the metadata:

malformed_data <- topic_1
malformed_data[1, 2] <- "category_6"
cat_add_data(my_elicitation,
             data_source = malformed_data,
             topic = "topic_1")
#> Error in `cat_add_data()`:
#> ! The column with the name of the categories contains unexpected values:
#> ✖ The value "category_6" is not valid.
#> ℹ Check the metadata in the <elic_cat> object.
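The kind of check behind this error can be illustrated in base R. The sketch below is not the package's implementation. find_unexpected() is a hypothetical helper, introduced here only to show the idea of comparing a column against the allowed values stored in the metadata:

```r
# Hypothetical helper, not part of elicitr:
# return the values that are not in the allowed set
find_unexpected <- function(values, allowed) {
  setdiff(unique(values), allowed)
}

allowed_categories <- paste0("category_", 1:5)
good_column <- c("category_1", "category_2", "category_3")
bad_column <- c("category_6", "category_2", "category_3")

# No unexpected values: an empty character vector
find_unexpected(good_column, allowed_categories)

# One unexpected value, as in the malformed data above
find_unexpected(bad_column, allowed_categories)
```

Running a check like this before loading can help locate the offending rows when the error message lists only the invalid values.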

Get data

Data can be retrieved from the elic_cat object with the cat_get_data() function:

cat_get_data(my_elicitation, topic = "topic_1")
#> # A tibble: 120 × 5
#>    id      category   option   confidence estimate
#>    <chr>   <chr>      <chr>         <dbl>    <dbl>
#>  1 5ac97e0 category_1 option_1         15     0.08
#>  2 5ac97e0 category_2 option_1         15     0   
#>  3 5ac97e0 category_3 option_1         15     0.85
#>  4 5ac97e0 category_4 option_1         15     0.02
#>  5 5ac97e0 category_5 option_1         15     0.05
#>  6 5ac97e0 category_1 option_2         35     0.02
#>  7 5ac97e0 category_2 option_2         35     0.11
#>  8 5ac97e0 category_3 option_2         35     0.18
#>  9 5ac97e0 category_4 option_2         35     0.02
#> 10 5ac97e0 category_5 option_2         35     0.67
#> # ℹ 110 more rows

Notice that the name of the expert has been anonymised and stored in the column id. Data can also be retrieved for specific options only:

cat_get_data(my_elicitation, topic = "topic_2", option = "option_1")
#> # A tibble: 25 × 5
#>    id      category   option   confidence estimate
#>    <chr>   <chr>      <chr>         <dbl>    <dbl>
#>  1 e51202e category_1 option_1        100     0.09
#>  2 e51202e category_2 option_1        100     0.21
#>  3 e51202e category_3 option_1        100     0.11
#>  4 e51202e category_4 option_1        100     0.59
#>  5 e51202e category_5 option_1        100     0   
#>  6 e78cbf4 category_1 option_1         75     0.31
#>  7 e78cbf4 category_2 option_1         75     0.27
#>  8 e78cbf4 category_3 option_1         75     0.09
#>  9 e78cbf4 category_4 option_1         75     0.17
#> 10 e78cbf4 category_5 option_1         75     0.16
#> # ℹ 15 more rows

Data analysis

Unlike for continuous variables, there is not yet a function for plotting the raw data. However, we can plot the distribution of the sampled data.

Sample data

Data can be sampled using the function cat_sample_data() (see the function documentation for an explanation of the sampling methods). Here we draw 100 samples (n_votes = 100) for each expert and option, which gives 6 × 4 × 100 = 2,400 rows:

samp <- cat_sample_data(my_elicitation,
                        method = "basic",
                        topic = "topic_1",
                        n_votes = 100)
#> ✔ Data sampled successfully using "basic" method.
samp
#> # A tibble: 2,400 × 7
#>    id      option   category_1 category_2 category_3 category_4 category_5
#>    <chr>   <chr>         <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#>  1 5ac97e0 option_1   4.13e-14          0     0.996   5.48e-107   3.79e- 3
#>  2 5ac97e0 option_1   9.12e- 1          0     0.0881  8.28e- 20   6.40e- 8
#>  3 5ac97e0 option_1   4.57e- 4          0     1.00    6.44e- 15   8.17e-14
#>  4 5ac97e0 option_1   6.03e- 2          0     0.940   3.20e-  5   1.45e- 9
#>  5 5ac97e0 option_1   9.50e- 3          0     0.988   1.73e-  3   3.37e- 4
#>  6 5ac97e0 option_1   3.31e- 3          0     0.997   6.76e- 18   1.95e-31
#>  7 5ac97e0 option_1   4.81e- 6          0     1.00    1.28e- 14   2.11e-12
#>  8 5ac97e0 option_1   1.73e- 4          0     0.998   9.17e- 15   2.22e- 3
#>  9 5ac97e0 option_1   1.15e-19          0     1.00    5.78e- 15   4.62e- 6
#> 10 5ac97e0 option_1   2.59e- 6          0     1.00    1.36e- 30   1.20e- 8
#> # ℹ 2,390 more rows
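The output above is consistent with draws from a Dirichlet distribution parameterised by the expert's estimates: each row sums to approximately 1, and category_2, whose estimate is 0, stays at 0. That reading is an assumption here, not something this vignette confirms. The base R sketch below shows what such sampling could look like. rdirichlet_sketch() is a hypothetical helper and not part of elicitr:

```r
set.seed(1)

# Hypothetical helper, not part of elicitr: draw n vectors from a Dirichlet
# distribution by normalising independent gamma draws
rdirichlet_sketch <- function(n, alpha) {
  draws <- matrix(0, nrow = n, ncol = length(alpha))
  positive <- alpha > 0
  # Categories with alpha = 0 keep a proportion of exactly zero
  draws[, positive] <- matrix(
    rgamma(n * sum(positive), shape = rep(alpha[positive], each = n)),
    nrow = n
  )
  # Normalise each row so that the proportions sum to 1
  draws / rowSums(draws)
}

# Estimates of the first expert for option_1 in topic_1 (from the printout)
alpha <- c(0.08, 0, 0.85, 0.02, 0.05)
sampled <- rdirichlet_sketch(100, alpha)
colnames(sampled) <- paste0("category_", 1:5)
head(sampled)
```

With parameters this small, most of the probability mass sits near the corners of the simplex, which would explain why many sampled values are extremely close to 0 or 1.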

Sampled data can be summarised for any option:

summary(samp, option = "option_1")
#> # A tibble: 5 × 7
#>   Category         Min       Q1   Median   Mean       Q3   Max
#>   <chr>          <dbl>    <dbl>    <dbl>  <dbl>    <dbl> <dbl>
#> 1 category_1 7.88e- 29 1.63e- 5 7.04e- 3 0.132  0.142    0.999
#> 2 category_2 0         1.25e-73 4.31e-18 0.0498 0.000236 0.974
#> 3 category_3 2.44e- 21 2.65e- 3 1.35e- 1 0.343  0.701    1.00 
#> 4 category_4 2.69e-234 1.44e-22 4.07e- 4 0.153  0.167    0.999
#> 5 category_5 3.74e- 49 5.44e- 4 1.21e- 1 0.323  0.650    1.00
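A comparable per-category table can be built in base R by applying summary() to each column of a matrix of proportions. This is only a sketch of the idea. The values in props are made up for illustration, and the package's own summary method may compute or format its output differently:

```r
# Small illustrative matrix of sampled proportions (made-up values)
props <- matrix(
  c(0.10, 0.20, 0.70,
    0.30, 0.30, 0.40,
    0.50, 0.10, 0.40,
    0.20, 0.20, 0.60),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("category_1", "category_2", "category_3"))
)

# summary() on a numeric vector returns Min, Q1, Median, Mean, Q3, and Max;
# transpose so that each row describes one category
per_category <- t(apply(props, 2, summary))
per_category
```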

The sampled data can also be plotted as a violin plot:

plot(samp)

Violin plot of the sampled data for all options.