Getting started with the codyna package

This vignette demonstrates some basic usage of the codyna package. First, we load the package.

library("codyna")

We also load the engagement data included in the package (see ?engagement for further information).

data("engagement", package = "codyna")

Pattern Discovery

The codyna package provides an extensive set of features for discovering patterns in sequence data, such as n-grams, gapped patterns, and repeated sequences of the same state, through the function discover_patterns. The argument len specifies the pattern lengths to look for. Similarly, the argument gap specifies the gap sizes for gapped patterns.

discover_patterns(engagement, type = "ngram", len = 2:3)
#> # A tibble: 36 × 7
#>    pattern                       length frequency proportion count support  lift
#>  * <chr>                          <int>     <dbl>      <dbl> <dbl>   <dbl> <dbl>
#>  1 Active->Active                     2     10218     0.434    969   0.969 0.993
#>  2 Active->Active->Active             3      8386     0.372    931   0.931 0.965
#>  3 Disengaged->Disengaged             2      5186     0.220    811   0.811 1.03 
#>  4 Disengaged->Disengaged->Dise…      3      3925     0.174    706   0.706 1.01 
#>  5 Average->Average                   2      2774     0.118    789   0.789 0.840
#>  6 Average->Active                    2      1545     0.0656   853   0.853 0.891
#>  7 Average->Average->Average          3      1439     0.0638   545   0.545 0.599
#>  8 Average->Active->Active            3      1265     0.0561   806   0.806 0.852
#>  9 Disengaged->Average                2      1092     0.0464   709   0.709 0.824
#> 10 Active->Average                    2      1071     0.0455   695   0.695 0.726
#> # ℹ 26 more rows
discover_patterns(engagement, type = "gapped", gap = 1)
#> # A tibble: 9 × 7
#>   pattern                   length frequency proportion count support  lift
#> * <chr>                      <dbl>     <dbl>      <dbl> <dbl>   <dbl> <dbl>
#> 1 Active->*->Active              3      8718     0.387    934   0.934 0.957
#> 2 Disengaged->*->Disengaged      3      4063     0.180    722   0.722 0.916
#> 3 Average->*->Active             3      2129     0.0944   850   0.85  0.888
#> 4 Average->*->Average            3      1712     0.0759   611   0.611 0.651
#> 5 Active->*->Average             3      1534     0.0680   719   0.719 0.751
#> 6 Disengaged->*->Average         3      1412     0.0626   677   0.677 0.787
#> 7 Active->*->Disengaged          3      1126     0.0499   611   0.611 0.696
#> 8 Average->*->Disengaged         3      1021     0.0453   533   0.533 0.619
#> 9 Disengaged->*->Active          3       840     0.0372   529   0.529 0.603
discover_patterns(engagement, type = "repeated", len = 2:3)
#> # A tibble: 6 × 7
#>   pattern                        length frequency proportion count support  lift
#> * <chr>                           <int>     <dbl>      <dbl> <dbl>   <dbl> <dbl>
#> 1 Active->Active                      2     10218      0.562   969   0.969 0.993
#> 2 Active->Active->Active              3      8386      0.610   931   0.931 0.965
#> 3 Disengaged->Disengaged              2      5186      0.285   811   0.811 1.03 
#> 4 Disengaged->Disengaged->Disen…      3      3925      0.285   706   0.706 1.01 
#> 5 Average->Average                    2      2774      0.153   789   0.789 0.840
#> 6 Average->Average->Average           3      1439      0.105   545   0.545 0.599
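
To make the frequency, proportion, count, and support columns concrete, here is a minimal base R sketch on hypothetical toy sequences. It is not how codyna implements discover_patterns; the helper ngrams and the toy data are assumptions made purely for illustration, and the lift column is left out.

```r
# Hypothetical toy data: three short sequences over the states "A" and "B".
seqs <- list(
  c("A", "A", "B", "A"),
  c("B", "A", "A", "A"),
  c("B", "B", "B", "B")
)

# Illustrative helper (not part of codyna): all n-grams of length n in one
# sequence, written in the "state->state" notation used above.
ngrams <- function(s, n) {
  if (length(s) < n) return(character(0))
  starts <- seq_len(length(s) - n + 1)
  vapply(
    starts,
    function(i) paste(s[i:(i + n - 1)], collapse = "->"),
    character(1)
  )
}

per_seq <- lapply(seqs, ngrams, n = 2)

# frequency: occurrences across all sequences.
frequency <- table(unlist(per_seq))
# count: number of sequences that contain the pattern at least once.
count <- table(unlist(lapply(per_seq, unique)))

patterns <- names(frequency)
res <- data.frame(
  pattern = patterns,
  frequency = as.integer(frequency),
  proportion = as.numeric(frequency) / sum(frequency),
  count = as.integer(count[patterns]),
  support = as.integer(count[patterns]) / length(seqs)
)
res[order(-res$frequency), ]
```

In this toy example a pattern can be frequent without being widespread: "B->B" occurs three times, but only in one of the three sequences, so its support is 1/3.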

The returned data frames show the length of each pattern (length), the number of times it occurred across all sequences (frequency), its proportion among patterns of the same length (proportion), the number of sequences that contained the pattern (count), and the proportion of sequences that contained the pattern (support). The function discover_patterns can also be used to look for specific patterns, for example:

discover_patterns(engagement, pattern = "Active->*")
#> # A tibble: 3 × 7
#>   pattern            length frequency proportion count support  lift
#> * <chr>               <int>     <dbl>      <dbl> <dbl>   <dbl> <dbl>
#> 1 Active->Active          2     10218     0.859    969   0.969 0.993
#> 2 Active->Average         2      1071     0.0900   695   0.695 0.726
#> 3 Active->Disengaged      2       605     0.0509   508   0.508 0.579

Here, the wildcard * matches any state, i.e., we are looking for patterns that start with the Active state and the following state can be any state.
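
The wildcard semantics can be sketched in base R as an element-wise comparison between a pattern and a candidate n-gram. The helper matches_pattern is hypothetical and only mirrors the behavior described here; it is not taken from codyna.

```r
# Illustrative helper (not part of codyna): TRUE when every position of the
# pattern is either the wildcard "*" or equal to the corresponding state.
matches_pattern <- function(gram, pattern) {
  g <- strsplit(gram, "->", fixed = TRUE)[[1]]
  p <- strsplit(pattern, "->", fixed = TRUE)[[1]]
  length(g) == length(p) && all(p == "*" | p == g)
}

grams <- c(
  "Active->Active",
  "Active->Average",
  "Average->Active",
  "Disengaged->Active"
)

# Keep only the n-grams that start with the Active state.
matched <- grams[vapply(grams, matches_pattern, logical(1), pattern = "Active->*")]
matched
```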

We can also compute various sequence indices with the function sequence_indices:

sequence_indices(engagement)
#> # A tibble: 1,000 × 23
#>    valid_n valid_proportion unique_states mean_spell_duration max_spell_duration
#>      <int>            <dbl>         <int>               <dbl>              <dbl>
#>  1      23                1             3                3.83                 11
#>  2      23                1             3                3.29                 11
#>  3      24                1             3                3.43                  8
#>  4      24                1             3                4                     9
#>  5      24                1             3                3.43                 12
#>  6      23                1             3                5.75                 13
#>  7      23                1             3                2.88                  7
#>  8      23                1             3                3.29                  8
#>  9      23                1             3                2.88                  7
#> 10      24                1             3                8                    20
#> # ℹ 990 more rows
#> # ℹ 18 more variables: longitudinal_entropy <dbl>, simpson_diversity <dbl>,
#> #   self_loop_tendency <dbl>, transition_rate <dbl>,
#> #   transition_complexity <dbl>, initial_state_persistence <dbl>,
#> #   initial_state_proportion <dbl>, initial_state_influence_decay <dbl>,
#> #   cyclic_feedback_strength <dbl>, first_state <chr>, last_state <chr>,
#> #   dominant_state <chr>, dominant_proportion <dbl>, …
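
As a rough intuition for a few of these indices, the following base R sketch computes spell durations, the number of state changes, and a Shannon entropy for a single hypothetical sequence. The exact definitions used by sequence_indices (for example normalization or the logarithm base) may differ, so treat this as an assumption-laden illustration rather than a reimplementation.

```r
# Hypothetical single sequence of engagement states.
s <- c("Active", "Active", "Average", "Average", "Average",
       "Active", "Disengaged")

# A spell is a run of consecutive identical states; rle() finds the runs.
spells <- rle(s)
mean_spell <- mean(spells$lengths)  # average run length
max_spell <- max(spells$lengths)    # longest run

# Number of positions where the state differs from the previous one.
transitions <- sum(s[-1] != s[-length(s)])

# Shannon entropy (natural log) of the state distribution within the sequence.
p <- as.numeric(table(s)) / length(s)
entropy <- -sum(p * log(p))

c(mean_spell = mean_spell, max_spell = max_spell,
  transitions = transitions, entropy = entropy)
```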

Early Warning Signals

The codyna package provides methods for the detection of early warning signals (EWS). These methods have been adapted from the EWSmethods package with a focus on high performance. Instead of explicit rolling window calculations, codyna implements the measures using update formulas, resulting in up to a 1000-fold reduction in computation time in some instances. First, we prepare some simple time series data for analysis.

set.seed(123)
ts_data <- stats::arima.sim(list(order = c(1, 1, 0), ar = 0.6), n = 200)

Both rolling window and expanding window methods are supported.

ews_roll <- detect_warnings(ts_data, method = "rolling")
ews_exp <- detect_warnings(ts_data, method = "expanding")
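
The difference between explicit window calculations and update formulas can be illustrated with a rolling variance. The sketch below is a generic base R example using running sums; it is an assumption about the general idea, not codyna's actual implementation, and the window size w and the toy series x are hypothetical.

```r
# Hypothetical toy series and window size.
set.seed(1)
x <- cumsum(rnorm(200))
w <- 50
idx <- seq_len(length(x) - w + 1)

# Explicit rolling windows: recompute the variance from scratch each time.
roll_var_naive <- vapply(idx, function(i) var(x[i:(i + w - 1)]), numeric(1))

# Update-style computation: running sums of x and x^2 give the sums of any
# window as a difference of two cumulative totals.
s1 <- c(0, cumsum(x))
s2 <- c(0, cumsum(x^2))
sum1 <- s1[idx + w] - s1[idx]
sum2 <- s2[idx + w] - s2[idx]
roll_var_fast <- (sum2 - sum1^2 / w) / (w - 1)

# Expanding window: variance of x[1:k] for k = 2, ..., n from the same sums.
k <- 2:length(x)
exp_var <- (s2[k + 1] - s1[k + 1]^2 / k) / (k - 1)

# Both approaches agree up to floating point error.
all.equal(roll_var_naive, roll_var_fast)
```

The running-sum version does a constant amount of work per window instead of work proportional to the window size. Plain sums of squares can lose precision for series with a large mean, so a production implementation would likely prefer a numerically more stable update.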

The function detect_warnings returns an object of class ews, and the results can be easily visualized with the plot method of this class.

plot(ews_roll)

plot(ews_exp)

Regime Detection

One of the core features of codyna is regime detection for time series data. Various methods are included with a user-friendly interface and automated parameter selection based on sensitivity. We continue with the example time series data.

regimes <- detect_regimes(
  data = ts_data,
  method = "threshold",
  sensitivity = "medium"
)
regimes
#> # A tibble: 201 × 9
#>     value  time change    id type          magnitude confidence stability  score
#>  *  <dbl> <dbl> <lgl>  <int> <chr>             <dbl> <lgl>      <chr>      <dbl>
#>  1  0         1 FALSE      1 none               0    NA         Initial   NA    
#>  2  0.623     2 TRUE       2 threshold_me…      0.25 NA         Unstable   0.225
#>  3  0.441     3 FALSE      2 none               0    NA         Transiti…  0.35 
#>  4  2.12      4 FALSE      2 none               0    NA         Transiti…  0.475
#>  5  3.62      5 FALSE      2 none               0    NA         Transiti…  0.6  
#>  6  2.56      6 FALSE      2 none               0    NA         Transiti…  0.725
#>  7  2.62      7 FALSE      2 none               0    NA         Transiti…  0.6  
#>  8  2.19      8 FALSE      2 none               0    NA         Transiti…  0.475
#>  9  0.858     9 FALSE      2 none               0    NA         Transiti…  0.35 
#> 10 -0.158    10 FALSE      2 none               0    NA         Unstable   0.225
#> # ℹ 191 more rows

The columns value and time list the original time series values and time points. The column change shows when regime changes occur, and type describes the type of regime change (which depends on the applied method). The id column provides the regime identifiers. The column magnitude quantifies the magnitude of the regime shift, and confidence is a method-dependent measure of the likelihood of an actual regime shift. In addition, regime stability is described by the stability column, along with a stability score provided in the score column. The resulting object is of class regimes, which has a customized plot method for visualizing the stability of the regimes along the original time series data.

plot(regimes)
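
To illustrate how the change and id columns relate to each other, the following base R sketch applies a deliberately simple fixed-threshold rule to a hypothetical series. The rule and the cutoff of 1.5 are assumptions for illustration only; they are not the algorithm behind method = "threshold" in detect_regimes.

```r
# Hypothetical series with a clear high-level episode in the middle.
y <- c(0.1, -0.2, 0.3, 2.9, 3.2, 3.1, 2.8, 0.2, -0.1, 0.0)

# Illustrative rule: classify each point as above or below a fixed cutoff.
above <- y > 1.5

# A regime change is flagged where the classification differs from the
# previous time point; the first point can never be a change.
change <- c(FALSE, above[-1] != above[-length(above)])

# Regime identifiers increase by one at every flagged change.
id <- 1L + cumsum(change)

data.frame(time = seq_along(y), value = y, change = change, id = id)
```

As in the output of detect_regimes shown above, the identifier increases in the same row where change is TRUE.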