
Combines rows with the same values in start-stop data
Given a data.table-like object containing information in the start-stop format, simplify_start_stop() searches for consecutive intervals where values of specific covariates do not change and "simplifies" the dataset by combining these intervals into one interval. This may be useful to reduce RAM usage and computation time when dealing with large start-stop data.
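The idea of combining consecutive intervals can be illustrated with a small base-R sketch. This is not the code used by simplify_start_stop(); the helper merge_runs() is hypothetical and assumes columns named id, start and stop, contiguous intervals within each id, and no missing values.

```r
# Hypothetical illustration, not the implementation used by MatchTime.
merge_runs <- function(data, cols) {
  d <- as.data.frame(data)
  d <- d[order(d$id, d$start), , drop = FALSE]
  n <- nrow(d)

  # TRUE for every row (from the second on) whose id or covariate
  # values differ from those of the row directly above it
  changed <- Reduce(`|`, lapply(c("id", cols), function(v) {
    d[[v]][-1] != d[[v]][-n]
  }))

  # consecutive rows with unchanged values share one run number
  run <- cumsum(c(TRUE, changed))
  first <- !duplicated(run)
  last <- !duplicated(run, fromLast = TRUE)

  # keep the start of the first row and the stop of the last row of each run
  out <- d[first, c("id", "start", cols), drop = FALSE]
  out$stop <- d$stop[last]
  rownames(out) <- NULL
  out[c("id", "start", "stop", cols)]
}

toy <- data.frame(id    = c(1, 1, 1, 2, 2),
                  start = c(1, 20, 35, 1, 20),
                  stop  = c(20, 35, 120, 20, 35),
                  A     = c(0, 0, 1, 0, 0))
merged <- merge_runs(toy, cols = "A")
print(merged)
```

In this toy data the first two intervals of case 1 and both intervals of case 2 share the same value of A, so five rows are reduced to three.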
Arguments
- data
A data.table-like object including at least four columns: id (the case identifier), start (the beginning of the time-interval), stop (the end of the time-interval) and one or more arbitrary columns. May also be any object that can be coerced to a data.table, such as a data.frame or a tibble. Intervals should be right-open (i.e. coded as [start, stop)).
- id
A single character string specifying a column in data containing the case identifiers.
- start
A single character string specifying a column in data specifying the beginning of a time-interval. Defaults to "start".
- stop
A single character string specifying a column in data specifying the ending of a time-interval. Defaults to "stop".
- cols
A character vector specifying the columns that should be used to check whether the intervals are unique. If not specified, all columns other than id, start and stop will be used.
- remove_other_cols
Either TRUE or FALSE, specifying whether the columns not named in the cols argument (other than id, start, stop) should be removed from the output. Defaults to TRUE, because keeping these columns may be misleading. If set to FALSE, please remember that the values of these columns are not necessarily correct, since intervals have been combined without looking at their values first.
Details
The intervals defined by the start and stop columns are expected to be coded as [start, stop). Within each id, the value of start must therefore always be equal to the value of stop in the previous row. Intervals of length 0 are not supported and will produce an error message.
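These requirements can be checked before calling the function. The following base-R sketch is one possible way to do so; check_start_stop() is a hypothetical helper that is not part of MatchTime, and it assumes that the relevant columns contain no missing values.

```r
# Hypothetical helper, not part of MatchTime.
# Returns the rows that violate the expected [start, stop) coding.
check_start_stop <- function(data, id = "id", start = "start", stop = "stop") {
  d <- as.data.frame(data)
  d <- d[order(d[[id]], d[[start]]), , drop = FALSE]
  n <- nrow(d)

  # intervals of length 0 (or of negative length)
  zero_length <- d[[stop]] <= d[[start]]

  # TRUE if the row directly above belongs to the same case
  same_id <- c(FALSE, d[[id]][-1] == d[[id]][-n])

  # within a case, start should equal the stop of the previous row
  prev_stop <- c(NA, d[[stop]][-n])
  not_contiguous <- same_id & d[[start]] != prev_stop

  list(zero_length = d[zero_length, , drop = FALSE],
       not_contiguous = d[not_contiguous, , drop = FALSE])
}

good <- data.frame(id = c(1, 1, 2), start = c(1, 20, 1), stop = c(20, 35, 10))
bad <- data.frame(id = c(1, 1, 1), start = c(1, 25, 40), stop = c(20, 40, 40))

res_good <- check_start_stop(good)
res_bad <- check_start_stop(bad)
print(res_bad)
```

In the bad example, the second row starts at 25 although the previous interval ended at 20, and the third row has the same value for start and stop, so each element of the returned list contains one offending row.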
Note that if the input data contains events, users probably want to exclude these event columns from the cols argument. The reason is that the data may contain consecutive intervals that are indeed exactly the same, but refer to two separate events (because intervals always end when an event indicator is TRUE).
Examples
library(MatchTime)
library(data.table)
# get some fake example data
data1 <- data.table(id=1,
                    start=c(1, 20, 35, 120, 923, 1022, 2000, 3011),
                    stop=c(20, 35, 120, 923, 1022, 2000, 3011, 3013),
                    A=c(0, 0, 0, 1, 1, 0, 0, 0),
                    B=c(1, 0, 0, 1, 0, 0, 0, 0),
                    C=c(11, 0.2, 17.8, 2.1, 9.0001, 1.2, 33, 22))
data2 <- data.table(id=2,
                    start=c(1, 20, 35, 120, 923),
                    stop=c(20, 35, 120, 923, 1022),
                    A=c(0, 0, 1, 1, 1),
                    B=c(1, 0, 0, 1, 0),
                    C=c(11, 0.2, 17.8, 2.1, 9.0001)+1)
data <- rbind(data1, data2)
# simplify with regard to columns "A" and "B"
out <- simplify_start_stop(data, id="id", cols=c("A", "B"))
print(out)
#> Key: <id, start>
#> id start stop A B
#> <num> <num> <num> <num> <num>
#> 1: 1 1 20 0 1
#> 2: 1 20 120 0 0
#> 3: 1 120 923 1 1
#> 4: 1 923 1022 1 0
#> 5: 1 1022 3013 0 0
#> 6: 2 1 20 0 1
#> 7: 2 20 35 0 0
#> 8: 2 35 120 1 0
#> 9: 2 120 923 1 1
#> 10: 2 923 1022 1 0
# simplify with regard to column "A" only
out <- simplify_start_stop(data, id="id", cols="A")
print(out)
#> Key: <id, start>
#> id start stop A
#> <num> <num> <num> <num>
#> 1: 1 1 120 0
#> 2: 1 120 1022 1
#> 3: 1 1022 3013 0
#> 4: 2 1 35 0
#> 5: 2 35 1022 1
# calling it without specifying "cols" results in no changes,
# because C always changes over the defined intervals
out <- simplify_start_stop(data, id="id")
print(out)
#> Key: <id, start>
#> id start stop A B C
#> <num> <num> <num> <num> <num> <num>
#> 1: 1 1 20 0 1 11.0000
#> 2: 1 20 35 0 0 0.2000
#> 3: 1 35 120 0 0 17.8000
#> 4: 1 120 923 1 1 2.1000
#> 5: 1 923 1022 1 0 9.0001
#> 6: 1 1022 2000 0 0 1.2000
#> 7: 1 2000 3011 0 0 33.0000
#> 8: 1 3011 3013 0 0 22.0000
#> 9: 2 1 20 0 1 12.0000
#> 10: 2 20 35 0 0 1.2000
#> 11: 2 35 120 1 0 18.8000
#> 12: 2 120 923 1 1 3.1000
#> 13: 2 923 1022 1 0 10.0001