NEWS

data.table news and updates

If you are viewing this file on CRAN, please check the latest news on GitHub, where the formatting is also better.

data.table v1.18.2.1 22 Jan 2026

BUG FIXES

When fixing duplicate factor levels, setattr() no longer crashes upon encountering missing factor values, #7595. Thanks to @sindribaldur for the report and @aitap for the fix.
foverlaps() no longer crashes due to out-of-bounds access to list and integer vectors when y has no rows or the non-range part of the join fails, #7597. Thanks to @nextpagesoft for the report and @aitap for the fix.
The dynamic library now exports only R_init_data_table, preventing symbol name conflicts like hash_create with PostgreSQL, #7605. Thanks to @ced75 for the report and @aitap for the fix.

NOTES

Removed use of non-API ATTRIB, SET_ATTRIB, and findVar, #6180. Thanks @aitap for the continued assiduous work here, and @MichaelChirico for the easy fix to replace findVar with R_getVar.
Fixed compilation failure like “error: unknown type name ‘siginfo_t’” in v1.18.0 in some strict environments, e.g., FreeBSD, where the header file declaring the POSIX function waitid does not transitively include the header file defining the siginfo_t type, #7516. Thanks to @jszhao for the report and @aitap for the fix.
sum(<int64 column>) by group is correct with missing entries and GForce activated (#7571). Thanks to @rweberc for the report and @manmita for the fix. The issue was caused by a faulty early break that spilled between groups, and resulted in silently incorrect results!
set() now automatically pre-allocates new column slots if needed, similar to what := already does, #1831 #4100. Thanks to @zachokeeffe and @tyner for the report and @ben-schwen for the fix.

data.table v1.18.0 23 December 2025

BREAKING CHANGE

dcast() now errors when fun.aggregate returns length != 1 (consistent with documentation), regardless of fill, #6629. Previously, when fill was not NULL, dcast warned and returned an undefined result. This change has been planned since 1.16.0 (25 Aug 2024).
melt() returns an integer column for variable when measure.vars is a list of length=1, consistent with the documented behavior, #5209. Thanks to @tdhock for reporting. Any users who were relying on this behavior can change measure.vars=list("col_name") (output variable was column name, now is column index/integer) to measure.vars="col_name" (variable still is column name). This change has been planned since 1.16.0 (25 Aug 2024).
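A minimal sketch of the migration described above. The table DT and its column names are made up for illustration; only the character form of measure.vars has its result spelled out, since that is the form whose output is unchanged.

```r
library(data.table)

# Illustrative data; 'col_name' stands in for any single measure column.
DT = data.table(id = 1:2, col_name = c(10, 20))

# Character measure.vars: 'variable' carries the column name.
long_chr = melt(DT, id.vars = "id", measure.vars = "col_name")
as.character(long_chr$variable)

# List measure.vars of length 1: per the entry above, 'variable' now holds
# the column index rather than the column name.
long_lst = melt(DT, id.vars = "id", measure.vars = list("col_name"))
str(long_lst$variable)
```

Code that needs the column name in variable can switch to the character form shown first.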
Rolling functions frollmean and frollsum distinguish Inf/-Inf from NA to match the same rules as base R when algo="fast" (previously they were considered the same). If your input into those functions has Inf or -Inf then you will be affected by this change. As a result, the argument that controls the handling of NAs has been renamed from hasNA to has.nf (has non-finite). hasNA continues to work with a warning, for now.
```
## before
frollsum(c(1,2,3,Inf,5,6), 2)
#[1] NA  3  5 NA NA 11

## now
frollsum(c(1,2,3,Inf,5,6), 2)
#[1]  NA   3   5 Inf Inf  11
```
frollapply result is not coerced to numeric anymore. Users’ code could possibly break if it depends on forced coercion of input/output to numeric type.
```
## before
frollapply(c(F,T,F,F,F,T), 2, any)
#[1] NA  1  1  0  0  1

## now
frollapply(c(F,T,F,F,F,T), 2, any)
#[1]    NA  TRUE  TRUE FALSE FALSE  TRUE
```
Additionally, the argument names in frollapply have been renamed from x to X and n to N to avoid conflicts with common argument names that may be passed to ..., aligning with the base R API of lapply. x and n continue to work with a warning, for now.
Negative and missing values of n argument of adaptive rolling functions trigger an error.

NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES

data.table(x=1, <expr>), where <expr> is an expression resulting in a 1-column matrix without column names, will eventually have names x and V2, not x and V1, consistent with data.table(x=1, <expr>) where <expr> results in an atomic vector, for example data.table(x=1, cbind(1)) and data.table(x=1, 1) will both have columns named x and V2. In this release, the matrix case continues to be named V1, but the new behavior can be activated by setting options(datatable.old.matrix.autoname) to FALSE. See point 5 under Bug Fixes for more context; this change will provide more internal consistency as well as more consistency with data.frame().
The behavior of week() will be changed in a future release to calculate weeks sequentially (days 1-7 as week 1), which is a potential breaking change. For now, the current “legacy” behavior, where week numbers advance every 7th day of the year (e.g., day 7 starts week 2), remains the default, and a deprecation warning will be issued when the old and new behaviors differ. Users can control this behavior with the temporary option options(datatable.week = "..."):
- "sequential": Opt-in to the new, sequential behavior (no warning).
- "legacy": Continue using the legacy behavior but suppress the deprecation warning.
See #2611 for details. Thanks @MichaelChirico for the report and @venom1204 for the implementation.
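The difference between the two numbering schemes can be illustrated with plain integer arithmetic on the day of the year. This is only a sketch: week_legacy() and week_sequential() are hypothetical helpers, not data.table functions, and their formulas are inferred from the description above (legacy: day 7 starts week 2; sequential: days 1-7 form week 1).

```r
# Hypothetical helpers reproducing the two schemes from the day of the year.
week_legacy     = function(yday) yday %/% 7L + 1L         # day 7 starts week 2
week_sequential = function(yday) (yday - 1L) %/% 7L + 1L  # days 1-7 form week 1

yday = c(1L, 6L, 7L, 8L, 14L)
data.frame(yday, legacy = week_legacy(yday), sequential = week_sequential(yday))
# The two columns differ only where yday is a multiple of 7 (here, days 7 and 14).
```

Under these formulas, the deprecation warning would concern only dates whose day of the year is a multiple of 7.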

NEW FEATURES

New sort_by() method for data.tables, #6662. It uses forder() to improve upon the data.frame method and also matches DT[order(...)] behavior with respect to locale. Thanks @rikivillalba for the suggestion and PR.
```
DT = data.table(a=c(1L, 2L, 1L), b=c(3L, 1L, 2L))
sort_by(DT, ~a + b)
#    a b
# 1: 1 2
# 2: 1 3
# 3: 2 1
```
melt() now supports using patterns() with id.vars, #6867. Thanks to Toby Dylan Hocking for the suggestion and PR.
print.data.table() now shows column classes at the bottom of large tables when class=TRUE and col.names="auto" (default) for tables with more than 20 rows, #6902. This follows the same behavior as column names at the bottom, making it easier to see column types for large tables without scrolling back to the top. Thanks to @TimTaylor for the suggestion and @Mukulyadav2004 for the PR.
as.Date() method for IDate no longer coerces to double, #6922. Thanks @MichaelChirico for the report and PR. The only effect should be on overly-strict tests that assert Date objects have double storage, which is not in general true, especially from R 4.5.0.
as.data.table() is slightly more efficient at converting arrays to data.tables, #7019. Thanks @eliocamp.
between() gains the argument ignore_tzone=FALSE. Normally, a difference in time zone between lower and upper will produce an error, and a difference in time zone between x and either of the others will produce a message. Setting ignore_tzone=TRUE bypasses the checks, allowing both comparisons to proceed without error or message about time zones.

New helper function fctr as an extended version of factor(), #4837. Most notably, it supports (1) retaining input level ordering by default, i.e. levels=unique(x) as opposed to levels = sort(unique(x)); (2) rev= to reverse the levels; and (3) sort= to allow more feature parity with factor(). The choice of default is motivated by convenience in the common case when order of elements needs be preserved, for example when using dcast or adding a legend to a plot. This also matches the default sort ordering of groups in by=.

```
d = data.table(id1=rep(1:2, each=3L), id2=letters[c(4:3,5L,3:5)], v1=1:6)
dcast(d, id1 ~ factor(id2))
#      id1     c     d     e
# 1:     1     2     1     3
# 2:     2     4     5     6
dcast(d, id1 ~ fctr(id2))
#      id1     d     c     e
# 1:     1     1     2     3
# 2:     2     5     4     6
dcast(d, id1 ~ fctr(id2, sort=TRUE)) # same as factor()
#      id1     c     d     e
# 1:     1     2     1     3
# 2:     2     4     5     6
dcast(d, id1 ~ fctr(id2, rev=TRUE))
#      id1     e     c     d
# 1:     1     3     2     1
# 2:     2     6     4     5
```
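The level orderings shown above can be approximated in base R, which may help when reading code that cannot yet rely on fctr. This is a sketch of the orderings only, assuming (from the description and outputs above) that the default corresponds to levels=unique(x) and that rev=TRUE reverses that order.

```r
id2 = letters[c(4:3, 5L, 3:5)]  # "d" "c" "e" "c" "d" "e"

levels(factor(id2))                             # sorted: "c" "d" "e"
levels(factor(id2, levels = unique(id2)))       # order of appearance: "d" "c" "e"
levels(factor(id2, levels = rev(unique(id2))))  # reversed: "e" "c" "d"
```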

groupingsets() gets a new argument enclos for use together with the jj argument in functions wrapping groupingsets(), including the existing wrappers rollup() and cube(), #5560. When forwarding a j-expression as groupingsets(jj = substitute(j)), make sure to pass enclos = parent.frame() as well, so that the j-expression will be evaluated in the right context. This makes it possible for j to refer to variables outside the data.table. Thanks @sindribaldur for the report and @aitap for the fix.
isoweek() is much faster (e.g. 20x) by re-using an implementation from {base}, #5111. Thanks @MichaelChirico for the report and PR.
data.table() and as.data.table() with keep.rownames=TRUE now extract row names from named vectors, matching data.frame() behavior. Names from the first named vector in the input are used to create the row names column (default name "rn" or custom name via keep.rownames="column_name"), #1916. Thanks to @richierocks for the feature request and @Mukulyadav2004 for the implementation.
New frev(x) as a faster analogue to base::rev() for atomic vectors/lists, #5885. Twice as fast as base::rev() on large inputs, and faster with more threads. Thanks to Benjamin Schwendinger for suggesting and implementing.
New cbindlist() and setcbindlist() for concatenating a list of data.tables column-wise, evocative of the analogous do.call(rbind, l) <-> rbindlist(l), #2576. setcbindlist() does so without making any copies. Thanks @MichaelChirico for the FR, @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning.
```
l = list(
  data.table(id = 1:3, a = letters[1:3]),
  data.table(b = 4:6, c = 7:9)
)
cbindlist(l)
#    id a b c
# 1:  1 a 4 7
# 2:  2 b 5 8
# 3:  3 c 6 9
```

New mergelist() and setmergelist() similarly work a la Reduce() to recursively merge a list of data.tables, #599. Different join modes (left, inner, full, right, semi, anti, and cross) are supported through the how argument; duplicate handling goes through the mult argument. setmergelist() carefully avoids copies where one is not needed, e.g. in a 1:1 left join. Thanks Patrick Nicholson for the FR (in 2013!), @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning.

```
l = list(
  data.table(id = c(1L, 2L, 3L), x = c("a", "b", "c")),
  data.table(id = c(1L, 2L, 4L), y = c("d", "e", "f")),
  data.table(id = c(1L, 3L, 4L), z = c("g", "h", "i"))
)

# Recursive inner join
mergelist(l, on = "id", how = "inner")
#    id x y z
# 1:  1 a d g

# Recursive left join (the default 'how')
mergelist(l, on = "id", how = "left")
#    id x    y    z
# 1:  1 a    d    g
# 2:  2 b    e <NA>
# 3:  3 c <NA>    h
```

fcoalesce() and setcoalesce() gain nan argument to control whether NaN values should be treated as missing (nan=NA, the default) or non-missing (nan=NaN), #4567. This provides full compatibility with nafill() behavior. Thanks to @ethanbsmith for the feature request and @Mukulyadav2004 for the implementation.
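The two modes can be sketched in base R as follows. coalesce1() is a hypothetical helper written for illustration, not data.table's fcoalesce(); it only mirrors the documented distinction between treating NaN as missing and treating it as a regular value.

```r
# Hypothetical single-fill coalesce, for illustration only.
coalesce1 = function(x, fill, nan_is_missing = TRUE) {
  # is.na() is TRUE for both NA and NaN; is.nan() is TRUE for NaN only
  missing = if (nan_is_missing) is.na(x) else is.na(x) & !is.nan(x)
  x[missing] = fill
  x
}

x = c(1, NaN, NA)
coalesce1(x, 0)                          # 1 0 0    (corresponds to nan=NA)
coalesce1(x, 0, nan_is_missing = FALSE)  # 1 NaN 0  (corresponds to nan=NaN)
```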
New function isoyear() has been implemented as a complement to isoweek(), returning the ISO 8601 year corresponding to a given date, #7154. Thanks to @ben-schwen and @MichaelChirico for the suggestion and @venom1204 for the implementation.
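The ISO 8601 year differs from the calendar year only around New Year. The sketch below uses base R's "%G" and "%V" output formats as a stand-in, on the assumption that isoyear() and isoweek() agree with them because both follow ISO 8601; the dates are chosen for illustration.

```r
d = as.Date(c("2024-12-29", "2024-12-30", "2027-01-01"))
data.frame(
  date     = d,
  calendar = as.integer(format(d, "%Y")),
  iso_year = as.integer(format(d, "%G")),  # ISO 8601 week-based year
  iso_week = as.integer(format(d, "%V"))   # ISO 8601 week number
)
# 2024-12-30 is a Monday and already belongs to ISO year 2025;
# 2027-01-01 is a Friday and still belongs to ISO year 2026.
```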
Multiple improvements have been added to rolling functions. The request came from @gpierard, who needed a left-aligned, adaptive rolling max, #5438. At the time there was no frollmax function, adaptive rolling functions did not support align="left", and frollapply did not support adaptive=TRUE. The available alternatives were base R mapply or a self-join using max and grouping by=.EACHI. As a follow-up to his request, the following features have been added:
- new function frollmax, applies max over a rolling window.
- support for align="left" for adaptive rolling functions.
- support for adaptive=TRUE in frollapply.
- partial argument to trim window width to available observations rather than returning NA whenever window is not complete.
- give.names argument that can be used to automatically give the names based on the names of x and n.
- frollmean and frollsum no longer treat Inf and -Inf as NAs as it used to be for algo="fast" (breaking change).
- hasNA argument has been renamed to has.nf to convey that it is not only related to NA/NaN but other non-finite values (Inf/-Inf) as well.
Thanks to @jangorecki for the implementation and @MichaelChirico and others for work on splitting into smaller PRs and reviews. For a comprehensive description of all available features, see the ?froll manual.
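To make the semantics of a left-aligned adaptive window concrete, here is a small base R reference computation using the mapply approach mentioned above. The data are made up, and every window is chosen to fit inside x so that no incomplete windows arise.

```r
# Left-aligned adaptive rolling max: window i starts at position i
# and spans n[i] observations.
x = c(1, 5, 2, 4, 3)
n = c(2L, 1L, 3L, 2L, 1L)   # per-observation window widths; all fit within x

ref = mapply(function(from, len) max(x[from:(from + len - 1L)]), seq_along(x), n)
ref
# 5 5 4 4 3
```

Per the feature list above, frollmax(x, n, adaptive=TRUE, align="left") is the fast counterpart of this computation.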

Adaptive frollmax has been observed to be around 80 times faster than the second-fastest solution (a data.table self-join using max and grouping by=.EACHI). Note that an important factor in performance is the width of the rolling window. The code for the benchmark below has been taken from this SO answer.
```
set.seed(108)
setDTthreads(16)
x = data.table(
  value = cumsum(rnorm(1e6, 0.1)),
  end_window = 1:1e6 + sample(50:500, 1e6, TRUE),
  row = 1:1e6
)[, "end_window" := pmin(end_window, .N)
  ][, "len_window" := end_window-row+1L]
baser = function(x) x[, mapply(function(from, to) max(value[from:to]), row, end_window)]
sj = function(x) x[x, max(value), on=.(row >= row, row <= end_window), by=.EACHI]$V1
frmax = function(x) x[, frollmax(value, len_window, adaptive=TRUE, align="left", has.nf=FALSE)]
frapply = function(x) x[, frollapply(value, len_window, max, adaptive=TRUE, align="left")]
microbenchmark::microbenchmark(
  baser(x), sj(x), frmax(x), frapply(x),
  times=10, check="identical"
)
#Unit: milliseconds
#       expr        min         lq       mean     median         uq        max neval
#   baser(x) 3094.88357 3097.84966 3186.74832 3163.58050 3251.66753 3370.33785    10
#      sj(x) 2221.55456 2255.12083 2306.61382 2303.47883 2346.70293 2412.62975    10
#   frmax(x)   17.45124   24.16809   28.10062   28.58153   32.79802   34.83941    10
# frapply(x)  272.07830  316.47060  366.94771  396.23566  416.06699  421.38701    10
```
As of now, adaptive rolling max has no on-line implementation (algo="fast"); it uses a naive approach (algo="exact"). Therefore, a further speed-up is still possible if algo="fast" gets implemented.

Function frollapply has been completely rewritten. Thanks to @jangorecki for the implementation. Be sure to read the frollapply manual before using the function. The changes are as follows:

- all basic types are now supported on input/output, not only double. Users’ code could possibly break if it depends on forced coercion of input/output to double type.
- new argument by.column allowing to pass a multi-column subset of a data.table into a rolling function, closes #4887.

```
x = data.table(v1=rnorm(120), v2=rnorm(120))
f = function(x) coef(lm(v2 ~ v1, data=x))
frollapply(x, 4, f, by.column=FALSE)
#     (Intercept)         v1
#           <num>      <num>
#  1:          NA         NA
#  2:          NA         NA
#  3:          NA         NA
#  4: -0.04648236 -0.6349687
#  5:  0.09208733 -0.4964023
#---
#116: -0.21169439  0.7421358
#117: -0.19729119  0.4926939
#118: -0.04217896  0.0452713
#119:  0.22472549 -0.5245874
#120:  0.54540359 -0.1638333
```

- uses multiple CPU threads (on a decent OS); evaluation of UDF is inherently slow so this can be a great help.

```
x = rnorm(1e5)
n = 500
setDTthreads(1)
system.time(
  th1 <- frollapply(x, n, median, simplify=unlist)
)
#   user  system elapsed
#  3.078   0.005   3.084
setDTthreads(4)
system.time(
  th4 <- frollapply(x, n, median, simplify=unlist)
)
#   user  system elapsed
#  2.453   0.135   0.897
all.equal(th1, th4)
#[1] TRUE
```

New helper frolladapt to facilitate applying rolling functions over windows of fixed calendar-time width in irregularly-spaced data sets, thereby bypassing the need to “augment” such data with placeholder rows, #3241. Thanks to @jangorecki for implementation.

```
idx = as.Date("2025-09-05") + c(0,4,7,8,9,10,12,13,17)
dt = data.table(index=idx, value=seq_along(idx))
dt
#        index value
#       <Date> <int>
#1: 2025-09-05     1
#2: 2025-09-09     2
#3: 2025-09-12     3
#4: 2025-09-13     4
#5: 2025-09-14     5
#6: 2025-09-15     6
#7: 2025-09-17     7
#8: 2025-09-18     8
#9: 2025-09-22     9
dt[, c("rollmean3","rollmean3days") := list(
  frollmean(value, 3),
  frollmean(value, frolladapt(index, 3), adaptive=TRUE)
  )]
dt
#        index value rollmean3 rollmean3days
#       <Date> <int>     <num>         <num>
#1: 2025-09-05     1        NA            NA
#2: 2025-09-09     2        NA           2.0
#3: 2025-09-12     3         2           3.0
#4: 2025-09-13     4         3           3.5
#5: 2025-09-14     5         4           4.0
#6: 2025-09-15     6         5           5.0
#7: 2025-09-17     7         6           6.5
#8: 2025-09-18     8         7           7.5
#9: 2025-09-22     9         8           9.0
```
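The window widths behind the rollmean3days column can be reproduced with a short base R sketch. adapt_sketch() is a hypothetical helper, not data.table's frolladapt(): it counts the observations inside each 3-day calendar window, and its rule of returning NA when the window reaches back before the first date is an assumption inferred from the NA in the first row of the output above.

```r
# Hypothetical helper, for illustration only.
# For each date, count the observations falling in the n-day window ending
# on that date; windows reaching back before the first date are marked NA.
adapt_sketch = function(index, n) {
  vapply(seq_along(index), function(i) {
    start = index[i] - n + 1L              # first calendar day of the window
    if (start < index[1L]) return(NA_integer_)
    sum(index >= start & index <= index[i])
  }, integer(1L))
}

idx = as.Date("2025-09-05") + c(0, 4, 7, 8, 9, 10, 12, 13, 17)
adapt_sketch(idx, 3L)
# NA 1 1 2 3 3 2 2 1
```

These counts match the number of values averaged in each row of rollmean3days, e.g. row 5 averages the three values 3, 4 and 5.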

Other new rolling functions, frollmin, frollprod, frollmedian, frollvar and frollsd, have been implemented, resolving long-standing issue #2778. Thanks to @jangorecki for the implementation. The implementation of rolling median is based on a novel algorithm, “sort-median”, described by @suomela in his 2014 paper Median Filtering is Equivalent to Sorting. “sort-median” scales very well, not only with the size of the input vector but also with the size of the rolling window.

```
rollmedian = function(x, n) {
  ans = rep(NA_real_, nx<-length(x))
  if (n<=nx) for (i in n:nx) ans[i] = median(x[(i-n+1L):(i)])
  ans
}
library(data.table)
setDTthreads(8)
set.seed(108)
x = rnorm(1e5)

n = 100
system.time(rollmedian(x, n))
#   user  system elapsed
#  2.049   0.001   2.051
system.time(frollapply(x, n, median, simplify=unlist))
#   user  system elapsed
#  3.071   0.223   0.436
system.time(frollmedian(x, n))
#   user  system elapsed
#  0.013   0.000   0.004

n = 1000
system.time(rollmedian(x, n))
#   user  system elapsed
#  3.496   0.009   3.507
system.time(frollapply(x, n, median, simplify=unlist))
#   user  system elapsed
#  4.552   0.307   0.632
system.time(frollmedian(x, n))
#   user  system elapsed
#  0.015   0.000   0.004

n = 10000
system.time(rollmedian(x, n))
#   user  system elapsed
# 16.350   0.025  16.382
system.time(frollapply(x, n, median, simplify=unlist))
#   user  system elapsed
# 14.865   0.722   2.267
system.time(frollmedian(x, n))
#   user  system elapsed
#  0.028   0.000   0.005
```

fread() now supports the comment.char argument to skip trailing comments or comment-only lines, consistent with read.table(), #856. The default remains comment.char = "" (no comment parsing) for backward compatibility and performance, in contrast to read.table(comment.char = "#"). Thanks to @arunsrinivasan and many others for the suggestion and @ben-schwen for the implementation.

BUG FIXES

fread() no longer warns on certain systems on R 4.5.0+ where the file owner can’t be resolved, #6918. Thanks @ProfFancyPants for the report and PR.
Joins to extended data.frames, e.g. x[i, col := x.col1 + i.col2] where i is a tbl, can use the x. and i. prefix forms, #6998. Thanks @MichaelChirico for the bug and PR.
In fread(), out-of-sample type bumps now respect the integer64= selection, #7032.
In rare cases, data.table failed to expand ALTREP columns when assigning a full column by reference. This could result in the target column getting modified unintentionally if the next call to the data.table was a modification by reference of the source column. E.g. in DT[, b := as.character(a)] the string conversion gets deferred and subsequent modification of column a would also modify column b, #5400. Thanks to @aquasync for the report and Václav Tlapák for the PR.
data.table() function is now more aligned with data.frame() with respect to the names of the output when one of its inputs is a single-column matrix object, #4124, #3193, and #5367. Thanks @PavoDive for the report, @jangorecki for the PR, and @MichaelChirico for a follow-up for back-compatibility.
Including an ITime object as a named input to data.frame() respects the provided name, i.e. data.frame(a = as.ITime(...)) will have column a, #4673. Thanks @shrektan for the report and @MichaelChirico for the fix.
fread() now handles the na.strings argument for quoted text columns, making it possible to specify na.strings = '""' and read empty quoted strings as NAs, #6974. Thanks to @AngelFelizR for the report and @aitap for the PR.
A data.table with a column of class vctrs_list_of (from package {vctrs}) prints as expected, #5948. Before, they could be printed messily, e.g. printing every entry in a nested data.frame. Thanks @jesse-smith for the report, @DavisVaughan and @r2evans for contributing, and @MichaelChirico for the PR.
Fixed incorrect sorting of merges where the first column of a key is a factor with non-sort()-ed levels (e.g. factor(1:2, 2:1)) and it is joined to a character column, #5361. Thanks to @gbrunick for the report, Benjamin Schwendinger for the fix, and @MichaelChirico for a follow-up fix caught by revdep testing.
Spurious warnings from internal code in cube(), rollup(), and groupingsets() are no longer surfaced to the caller, #6964. Thanks @ferenci-tamas for the report and @venom1204 for the fix.
droplevels() works on 0-row data.tables, #7043. The result will have factor columns factor(character()), consistent with the data.frame method. Thanks @advieser for the report and @MichaelChirico for the fix.
print(..., col.names = 'none') now correctly adapts column widths to the data content, ignoring the original column names and producing a more compact output, #6882. Thanks to @brooksambrose for the report and @venom1204 for the PR.
Reference to .SD in ... arguments to lapply(), e.g. lapply(list_of_tables, `[`, j=.SD[1L]) is evaluated correctly, #2982. Thanks @franknarf1 for the report and @MichaelChirico for the fix.
Filling columns of class Date with POSIXct (and vice versa) using shift() now yields a clear, informative error message specifying the class mismatch, #5218. Thanks @ashbaldry for the report and @ben-schwen for the fix.
split.data.table() output list elements retain the S3 class of the generating data.table, e.g. in l=split(x, ...) if x has class my_class, so will l[[1]] and so on, #7105. Thanks @m-muecke for the bug report and @MichaelChirico for the fix.
between() is now more robust with integer64 arguments. Combining small integer x with certain large integer64 bounds no longer misinterprets the bounds as double; if a double bound cannot be losslessly converted into integer64 for comparison with integer64 x, an error is signalled instead of returning a wrong answer with a warning; #7164. Thanks @aitap for the bug report and the fix.
The result of t1 - t2, where one is an IDate and the other is a Date, is now consistent with the case where both are IDate or both are Date, #4749. Thanks @George9000 for the report and @MichaelChirico for the fix.
fwrite() now allows dec to be the same as sep for edge cases where only one will be written, e.g. 0-row or 1-column tables, #7227. Thanks @MichaelChirico for the report and @venom1204 for the fix.
Ellipsis elements like ..1 are correctly excluded when searching for variables in “up-a-level” syntax inside [, #5460. Thanks @ggrothendieck for the report and @MichaelChirico for the fix.
forderv could segfault on keys with long runs of identical bytes because the single-group branch tail-recursed radix-by-radix until the C stack ran out. This affected both integer/numeric sorting with many duplicate columns (#4300) and character sorting with long common prefixes (#7462). This is a major problem since sorting is extensively used in data.table. Thanks @quantitative-technologies and @DavisVaughan for the reports, and @ben-schwen for the fix.
[ now preserves existing key(s) when new columns are added before them, instead of incorrectly setting a new column as key, #7364. Thanks @czeildi for the bug report and the fix.
setDTthreads(percent=) and setDTthreads(threads=) now respect OMP_NUM_THREADS and omp_get_max_threads(), ensuring consistency with setDTthreads() (no arguments) when OpenMP environment variables are set, #7165. Previously, explicitly setting a thread count or percentage would ignore these OpenMP limits, potentially exceeding the user’s intended thread cap. Thanks to @bastistician for the report and @ben-schwen for the fix.
fread() auto-detects separators for single-column files consisting solely of quoted values (e.g. "this_that"\n"2025-01-01 00:00:01"), #7366. Thanks @arunsrinivasan for the report and @ben-schwen for the fix.
Rolling functions now ensure there is no nested parallelism. It could have happened for vectorized input and adaptive=TRUE, #7352. Thanks @jangorecki for the fix.
By-group operations on missing rows (e.g. foo[c(i, NA), bar, by=grp]) now avoid leaving in data from the previous groups, #7442. Thanks @aitap for the report and the fix.
Grouping by a factor with many groups is now fast again, fixing a timing regression introduced in #6890 where UTF-8 coercion and level remapping were performed unnecessarily, #7404. Thanks @ben-schwen for the report and fix.
dogroups() no longer reads beyond the resized end of over-allocated data.table list columns, #7486. While this didn’t crash in practice, it is now explicitly checked for in recent R versions (r89198+). Thanks @TimTaylor and @aitap for the report and @aitap for the fix.
rbindlist() now avoids the crash when working with many non-UTF-8 column names, #7452. Thanks @aitap for the report and the fix.

NOTES

The following in-progress deprecations have proceeded:
- Argument logicalAsInt to fwrite() has been removed.
- Argument autostart to fread() has been removed.
- Argument in.place to droplevels has been removed.
- It’s now an error to set datatable.nomatch, which has been warning since 1.15.0.
{data.table} now depends on R 3.4.0 (2017).
Changes to fread() output and errors:
- When the size of the file exceeds the size of the address space, fread() now signals an informative error instead of trying to map its size modulo the address space.
- On non-Windows systems, fread() now prints the reason why the file couldn’t be opened, which could also be due to it being too large to map.
- With verbose=TRUE, file sizes are now printed using correct binary SI prefixes (the sizes have always been reported as bytes denominated in powers of 2^10, so e.g. 1024*1024 bytes was reported as 1 MB where 1 MiB or 1.05 MB is correct).
The default format_list_item() method (and hence print.data.table()) annotates truncated list items with their length, #605. Thanks Matt Dowle for the original report (2012!) and @MichaelChirico for the fix.
A GitHub Actions workflow is now in place to warn the entire maintainer team, as well as any contributor following the GitHub repository, when the package is at risk of archival on CRAN, #7008. Thanks @tdhock for the original report and @Bisaloo and @TysonStanley for the fix.
Using a double vector in set()’s i= and/or j= no longer throws a warning about preferring integer, #6594. While it may improve efficiency to use integer, there’s no guarantee it’s an improvement and the difference is likely to be minimal. The coercion will still be reported under datatable.verbose=TRUE. For package/production use cases, static analyzers such as lintr::implicit_integer_linter() can also report when numeric literals should be rewritten as integer literals.
In rare situations a data.table object may lose its internal attribute that holds a self-reference. New helper function .selfref.ok() tests just that. It is only intended for technical use cases. See manual for examples.
Retain important information in the error message about the source of the error when i= fails, e.g. pointing to charToDate() failing in DT[date_col == "20250101"], #7444. Thanks @jan-swissre for the report and @MichaelChirico for the fix.
Internal use of declared non-API R functions SETLENGTH, TRUELENGTH, SET_TRUELENGTH, and SET_GROWABLE_BIT has been eliminated. Most usages have been migrated to R’s experimental resizable vectors API (thanks to @ltierney, introduced in R 4.6.0, backported for older R versions), #7451. Uses of TRUELENGTH for marking seen items during grouping and binding operations (aka free hash table trick) have been replaced with proper hash tables, #6694. The new hash table implementation uses linear probing with power of 2 tables and automatic resizing. Additionally, chmatch() now hashes the needle (x) instead of the haystack (table) when length(table) >> length(x), significantly improving performance for lookups into large tables. We’ve benchmarked the refactored code and find the performance satisfactory, but please do report any edge case performance regressions we may have missed. Thanks to @aitap, @ben-schwen, @jangorecki and @HughParsonage for implementation and reviews.

data.table v1.17.8 (6 July 2025)

Internal functions used to signal errors are now marked as non-returning, silencing a compiler warning about potentially unchecked allocation failure. Thanks to Prof. Brian D. Ripley for the report and @aitap for the fix, #7070.

data.table v1.17.6 (15 June 2025)

On a heavily loaded machine, a forder thread could try to perform a zero-length copy from a null pointer, which was de-facto harmless but is against the C standard and was caught by additional CRAN checks, #7051. Thanks to @helske for the report and @aitap for the PR.

data.table v1.17.4 (25 May 2025)

The C code now avoids passing invalid data pointers from 0-length vectors to memcpy(), which previously caused undefined behaviour. Thanks to Prof. Brian D. Ripley for the report and Michael Chirico for the fix, #6911.

data.table v1.17.2 (7 May 2025)

BUG FIXES

fwrite(compress="gzip") once again produces a gzip header when the column names are missing or disabled, #6852. Thanks @maxscheiber for the report and @aitap for the fix.
fread(keepLeadingZeros=TRUE) now correctly parses dates with components with leading zeros as dates instead of strings, #6851. Thanks @TurnaevEvgeny for the report and @ben-schwen for the fix.
as.data.table() on x avoids an infinite loop if the output of the corresponding as.data.frame() method has the same class as the input, #6874. Concretely, we had class(x) = c('foo', 'data.frame') and class(as.data.frame(x)) = c('foo', 'data.frame'), so as.data.frame.foo wound up getting called repeatedly. Thanks @matschmitz for the report and @ben-schwen for the fix.
By-reference sub-assignments to factor columns now match the levels in UTF-8, preventing their duplication when the same level exists in different encodings, #6886. Thanks @iagogv3 for the report and @aitap for the fix.
fwrite() now avoids a crash when translating strings into a different encoding, #6883. Thanks @filipemsc for the report and @aitap for the fix.
Custom binary operators from the lubridate package now work with objects of class IDate as with a Date subclass, #6839. Thanks @emallickhossain for the report and @aitap for the fix.
as.data.table() now properly handles keys: specifying keys sets them, omitting keys preserves existing ones, and setting key=NULL clears them, #6859. Thanks @brookslogan for the report and @Mukulyadav2004 for the fix.

NOTES

Continued work to remove non-API C functions, #6180. Thanks Ivan Krylov for the PRs and for writing a clear and concise guide about the R API: https://aitap.codeberg.page/R-api/.

data.table v1.17.0 (20 Feb 2025)

POTENTIALLY BREAKING CHANGES

In DT[, variable := value], when value is class POSIXlt, we automatically coerce it to class POSIXct instead, #1724. Thanks to @linzhp for the report, and Benjamin Schwendinger for the fix.
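A minimal sketch of the coercion described above. The table and the column name when are made up for illustration.

```r
library(data.table)

DT = data.table(id = 1:2)
lt = as.POSIXlt(c("2025-02-20 10:00:00", "2025-02-20 11:30:00"), tz = "UTC")

DT[, when := lt]   # the value is POSIXlt ...
class(DT$when)     # ... but, per the entry above, the stored column is POSIXct
```

Code that later relies on POSIXlt components such as $hour would need to convert the column back explicitly with as.POSIXlt().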

NEW FEATURES

New function rowwiseDT() for creating a data.table object “row-wise”, often convenient for readability of small, literally-defined tables. Thanks to @shrektan for the suggestion and PR and @tdeenes for the idea of the name= syntax. Inspired by tibble::tribble().

```
library(data.table)
rowwiseDT(
  a=,b=,c=,  d=,
  1, 2, "a", 2:3,
  3, 4, "b", list("e"),
  5, 6, "c", ~a+b
)
#>        a     b      c      d
#>    <num> <num> <char> <list>
#> 1:     1     2      a    2,3
#> 2:     3     4      b      e
#> 3:     5     6      c ~a + b
```

Limited support for subsetting or aggregating columns of type expression, #5596. Thanks to @tsp for the report, and @ben-schwen for the fix.

groupingsets.data.table(), cube.data.table(), and rollup.data.table() gain a label argument, which allows the user to specify a label for each grouping variable, to be included in the grouping variable column in the output in rows where the variable has been aggregated, #5351. Thanks to @markseeto for the request, @jangorecki and @markseeto for specifying the desired behaviour, and @markseeto for implementing.

DT = data.table(V1 = rep(c("a1", "a2"), each = 5),
                V2 = rep(rep(c("b1", "b2"), c(3, 2)), 2),
                V3 = rep(c("c1", "c2"), c(3, 7)),
                V4 = rep(1:2, c(6, 4)),
                V5 = rep(1:2, c(9, 1)),
                V6 = rep(c(1.1, 1.2), c(2, 8)))

# Call groupingsets() and specify a label for V1, a different label for the other character grouping
# variables, a label for the integer grouping variables, and a label for the numeric grouping variable.

groupingsets(DT, .N, by = c("V1", "V2", "V3", "V4", "V5", "V6"),
             sets = list(c("V1", "V2", "V3"), c("V1", "V4"), c("V4", "V6"), "V2", "V5", character()),
             label = list(V1 = "All values", character = "Total", integer = 999L, numeric = NaN))

#             V1     V2     V3    V4    V5    V6     N
#         <char> <char> <char> <int> <int> <num> <int>
#  1:         a1     b1     c1   999   999   NaN     3
#  2:         a1     b2     c2   999   999   NaN     2
#  3:         a2     b1     c2   999   999   NaN     3
#  4:         a2     b2     c2   999   999   NaN     2
#  5:         a1  Total  Total     1   999   NaN     5
#  6:         a2  Total  Total     1   999   NaN     1
#  7:         a2  Total  Total     2   999   NaN     4
#  8: All values  Total  Total     1   999   1.1     2
#  9: All values  Total  Total     1   999   1.2     4
# 10: All values  Total  Total     2   999   1.2     4
# 11: All values     b1  Total   999   999   NaN     6
# 12: All values     b2  Total   999   999   NaN     4
# 13: All values  Total  Total   999     1   NaN     9
# 14: All values  Total  Total   999     2   NaN     1
# 15: All values  Total  Total   999   999   NaN    10

patterns() in melt() combines correctly with user-defined cols=, which can be useful to specify a subset of columns to reshape without having to use a regex, for example patterns("2", cols=c("y1", "y2")) will only give y2 even if there are other columns in the input matching 2, #6498. Thanks to @hongyuanjia for the report, and to @tdhock for the PR.
setcolorder() gains skip_absent to ignore unrecognized columns (i.e. columns included in neworder but not present in the data), #6044, #6068. Default behavior (skip_absent=FALSE) remains unchanged, i.e. unrecognized columns result in an error. Thanks to @sluga for the suggestion and @sluga & @Nj221102 for the PRs.
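For instance (a minimal sketch; `not_a_column` is a deliberately missing name used for illustration):

```r
library(data.table)

DT = data.table(a = 1, b = 2, c = 3)

# "not_a_column" is not in DT; with skip_absent=TRUE it is ignored
# rather than raising an error
setcolorder(DT, c("c", "not_a_column", "a"), skip_absent = TRUE)
names(DT)
```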
fread() gains logicalYN argument to read columns consisting only of strings Y, N as logical (as opposed to character), #4563. The default is controlled by option datatable.logicalYN, itself defaulting to FALSE, for back-compatibility – some smaller tables (especially sharded tables) might inadvertently read a “true” string column as logical and cause bugs. This is particularly important for tables with a column named y or n – automatic header detection under logicalYN=TRUE will see these values in the first row as being “data” as opposed to column names. A parallel option was not included for fwrite() at this time – users looking for a compact representation of logical columns can still use fwrite(logical01=TRUE). We also opted for now to check only Y, N and not Yes/No/YES/NO.
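A sketch of the difference, assuming the option datatable.logicalYN is left at its default; the inline CSV text is made up for illustration.

```r
library(data.table)

csv = "id,flag\n1,Y\n2,N\n3,Y\n"

default_read = fread(text = csv)                    # flag is read as character
yn_read      = fread(text = csv, logicalYN = TRUE)  # flag is read as logical

class(default_read$flag)
class(yn_read$flag)
```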
fwrite() with compress="gzip" produces compatible gz files when composed of multiple independent chunks owing to parallelization, #6356. Earlier fwrite() versions could have issues with HTTP upload using Content-Encoding: gzip and Transfer-Encoding: chunked. Thanks to @oliverfoster for the report and @philippechataignon for the fix. Thanks also @aitap for pre-release testing that found some possible memory leaks in the initial fix.
fwrite() gains a new parameter compressLevel to control compression level for gzip, #5506. This parameter balances compression speed and total compression, and corresponds directly to the analogous command-line parameter, e.g. compressLevel=4 corresponds to passing -4; the default, 6, matches the command-line default, i.e. equivalent to passing -6. Thanks @mgarbuzov for the request and @philippechataignon for implementing.
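As a rough sketch of how the parameter might be used (file names come from tempfile(), and the data are invented; reading back goes through base R's gzfile() so no extra packages are assumed):

```r
library(data.table)

DT = data.table(x = rep(letters, 100L), y = seq_len(2600L))

fast  = tempfile(fileext = ".csv.gz")
small = tempfile(fileext = ".csv.gz")

fwrite(DT, fast,  compress = "gzip", compressLevel = 1L)  # favors speed
fwrite(DT, small, compress = "gzip", compressLevel = 9L)  # favors size

# Compare the resulting sizes, then read one file back with base R
file.size(fast)
file.size(small)
back = read.csv(gzfile(small))
```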

BUG FIXES

fwrite() respects dec=',' for timestamp columns (POSIXct or nanotime) with sub-second accuracy, #6446. Thanks @kav2k for pointing out the inconsistency and @MichaelChirico for the PR.
The data.table-only attribute $.internal.selfref is no longer set for data.frames, #5286. Thanks @OfekShilon for the report and fix.
Tagging/naming arguments of c() in j=c() should now more closely follow base R conventions for concatenation of named lists during grouping, #2311. Naming an lapply(.SD, FUN) call as an argument of c() in j will now always cause that tag to get prepended (with a single dot separator) to the resulting column names. Additionally, naming a list() call as an argument of c() in j will now always cause that tag to get prepended to any names specified within the list call. This bug only affected queries with (1) by= grouping (2) getOption("datatable.optimize") >= 1L and (3) lapply(.SD, FUN) in j.

While the names returned by data.table when j=c() will now mostly follow base R conventions for concatenating lists, note that names which are completely unspecified will still be named according to position in j (e.g. V1, V2), matching the typical behavior in j and data.table().

Thanks to @franknarf1 for reporting and @myoung3 for the PR.
```
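library(data.table)
mtcars = as.data.table(mtcars)  # by= and .SDcols need a data.table, not a plain data.frame

# tag 'mean' prepended to lapply()-named columns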
names(mtcars[, c(mean=lapply(.SD,sum)), by="cyl", .SDcols=c("am", "carb")])
# [1] "cyl" "mean.am" "mean.carb"

# tag 'mean' is prepended to the first named sublist, 'sum' to the second
names(mtcars[, c(mean=list(a=mean(hp), b=mean(wt)), sum=lapply(.SD, sum)), by="cyl", .SDcols=c("am", "carb")])
# [1] "cyl" "mean.a" "mean.b" "sum.am" "sum.carb"

# strict base naming would result in names c("", "b", "c") here
names(mtcars[, c(list(mean(hp), b=mean(wt)), c=list(mean(cyl)))])
# [1] "V1" "b" "c"
```
Queries like DT[, min(x):max(x)] now work as expected, i.e. the same as DT[, seq(min(x), max(x))] or with(DT, min(x):max(x)), #2069. Shorthand like DT[, a:b] meaning “select from columns a through b” still works. Thanks to @franknarf1 for reporting, @jangorecki for the fix, and @MichaelChirico for follow-ups ensuring back-compatibility.
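Both readings of `:` in j can be sketched side by side (illustrative tables, assuming v1.17.0 or later):

```r
library(data.table)

DT = data.table(x = c(3L, 1L, 5L))
rng = DT[, min(x):max(x)]   # an integer sequence from min(x) to max(x)

DT2 = data.table(a = 1, b = 2, c = 3)
sel = DT2[, a:b]            # still selects columns a through b
```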
fread() performance improves when specifying Date among colClasses, #6105. One implication of the change is that the column will be an IDate (which also inherits from Date), which may affect code strongly relying on the column class to be Date exactly; computations with IDate and Date columns should otherwise be the same. If you strongly prefer the Date class, run as.Date() explicitly following fread(). Thanks @scipima for the report and @MichaelChirico for the fix.
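A sketch of the class change and the suggested workaround; the inline CSV is made up for illustration.

```r
library(data.table)

x = fread(text = "d,v\n2025-01-01,1\n2025-01-02,2\n", colClasses = c(d = "Date"))
class(x$d)            # inherits from Date (via IDate)

# Convert explicitly if code downstream needs the plain Date class
d_date = as.Date(x$d)
class(d_date)
```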
dt[, col] now returns a copy of col also when it is a list column, as in any other case, #4877. Thanks to @tlapak for reporting and the PR.
rbindlist and rbind binding bit64::integer64 columns with character/complex/list columns now works, #5504. Thanks to @MichaelChirico for the request and @ben-schwen for the PR.
Fixed possible segfault in setDT(df); attr(df, key) <- value; set(df, ...), i.e. adding columns to an object with set() that was converted to data.table with setDT() and later had attributes added with attr<-, #6410. Thanks to @hongyuanjia for the report and @ben-schwen for the PR. Note that setattr() should be preferred for adding attributes to a data.table.
DT[1, on=NULL] now works for returning the first row, #6579. Thanks to @Kodiologist for the report and @tdhock for the PR.
tables() now returns the correct size for data.tables over 2GiB, #6607. Thanks to @vlulla for the report and the PR.
rbindlist(l, use.names=TRUE) can now handle different encodings for the column names in different entries of l, #5452. Thanks to @MEO265 for the report, and Benjamin Schwendinger for the fix.
Added a data.frame method for format_list_item() to fix error printing data.tables with columns containing 1-column data.frames, #6592. Thanks to @r2evans for the bug report and fix.
Auto-printing gets some substantial improvements:

Suppression in knitr documents is now done by implementing a method for knit_print instead of looking up the call stack, #6589. The old way was fragile and wound up broken by some implementation changes in {knitr}. Thanks to @jangorecki for the report #6509 and @aitap for the fix.
print() methods for S3 subclasses of data.table (e.g. an object of class c("my.table", "data.table", "data.frame")) no longer print where plain data.tables wouldn’t, e.g. myDT[, y := 2], #3029. The improved detection of auto-printing scenarios has the added benefit of allowing print in highly explicit statements like print(DT[, y := 2]), obviating our recommendation since v1.9.6 to append [] to signal “please print me”.

Joins of integer64 and double columns succeed when the double column has lossless integer64 representation, #4167 and #6625. Previously, this only worked when the double column had lossless 32-bit integer representation. Thanks @MichaelChirico for the reports and fix.
DT[order(...)] better matches base::order() behavior by (1) recognizing the method= argument (and erroring since this is not supported) and (2) accepting a vector of TRUE/FALSE in decreasing= as an alternative to using -a to convey “sort a decreasing”, #4456. Thanks @jangorecki for the FR and @MichaelChirico for the PR.
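For example, a vector in decreasing= can stand in for the -a idiom (a sketch with made-up data, assuming v1.17.0 or later):

```r
library(data.table)

DT = data.table(a = c(1, 2, 2, 1), b = c("d", "c", "a", "b"))

# a decreasing, b increasing
res_vec   = DT[order(a, b, decreasing = c(TRUE, FALSE))]
# the equivalent spelling using -a
res_minus = DT[order(-a, b)]
```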
Assignment with := to an S4 slot of an under-allocated data.table now works, #6704. Thanks @MichaelChirico for the report and fix.
as.data.table() method for data.frames (especially those with extended classes) is more consistent with as.data.frame() with respect to retention of attributes, #5699. Thanks @jangorecki for the report and fix.
Grouped queries on keyed tables no longer return an incorrectly keyed result if the ad hoc by= list has some function call (in particular, a function which happens to return a strictly decreasing function of the keys), e.g. by=.(a = rev(a)), #5583. Thanks @AbrJA for the report and @MichaelChirico for the fix.
An integer overflow in fread() with lines longer than 2^(31/2) bytes is prevented, #6729. The typical impact was no worse than a wrong initial allocation size, corrected later. Thanks to @TaikiSan21 for the report and @aitap for the fix.
Fixed a memory issue causing segfaults in forder, #6797. Thanks @dkutner for the report and @MichaelChirico for the fix.
setDT(get0('var')) now correctly modifies var by reference, consistent with the long-standing behavior of setDT(get('var')), #6864. Thanks to @rikivillalba for the report and @venom1204 for the fix.
fread() could fail to read Mac CSV files (with \r line endings) if the file contained any \n character, such as a final \r\n. This was fixed by detecting the predominant line ending in a sample of the file, #4186. Thanks to @MPagel for the report and @ben-schwen for the fix.
By-reference assignments (‘:=’) whose RHS calls a function that itself modifies the data.table by reference, e.g. foo=function(DT){modify(DT);return(1L)} followed by DT[,a:=foo(DT)], no longer return a malformed data.table. The cause was that the index of the targeted named column (“a”) could change during evaluation of the j expression, #6768. Thanks @AntonNM for the report and fix.

NOTES

There is a new vignette on joins! See vignette("datatable-joins"). Thanks to Angel Feliz for authoring it! Feedback welcome. This vignette has been highly requested since 2017: #2181.
Tests run again when some Suggests packages are missing, #6411. Thanks @aadler for the note and @MichaelChirico for the fix.
Some grouping operations run much faster under verbose=TRUE, #6286. Thanks @joshhwuu for the report and fix. This overhead was not present on Windows. As a rule, users should expect verbose=TRUE operations to run more slowly, as extra statistics might be calculated as part of the report; here was a case where the overhead was particularly high and the fix was particularly easy.
set() and := now provide some extra guidance for common incorrect approaches to assigning NULL to some rows of a list column. The correct way is to put list(list(NULL)) on the RHS of := (or .(.(NULL)) for short). Thanks to @MichaelChirico for the suggestion and @Nj221102 for the implementation.
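A minimal sketch of the recommended form, using a made-up list column `l`:

```r
library(data.table)

DT = data.table(id = 1:3, l = list(1:2, "a", 3))

# Outer list(): one column on the RHS; inner list(NULL): the NULL entry itself
DT[2L, l := list(list(NULL))]
DT$l
```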
Improved the error message when trying to write code like DT[, ":="(a := b, c := d)] (which should be DT[, ":="(a = b, c = d)]), #5296. Thanks @MichaelChirico for the suggestion & fix.
measurev() was implemented and documented in v1.15.0, for use within melt(), and it is now exported (dependent packages can now use without a NOTE from CRAN check).
The dcast() and melt() generics no longer attempt to redirect to {reshape2} methods when passed non-data.tables. If you’re still using {reshape2}, you must use namespace-qualification: reshape2::dcast(), reshape2::melt(). We have been warning about the deprecation since v1.12.4 (2019). Please note that {reshape2} is retired.
showProgress in [ is disabled for “trivial” grouping (.NGRP==1L), #6668. Thanks @MichaelChirico for the request and @joshhwuu for the PR.
key<-, marked as deprecated since 2012 and unusable since v1.15.0, has been fully removed.
The following in-progress deprecations have proceeded:

Using fwrite(logicalAsInt=) has been upgraded from a warning (since v1.15.0) to an error. It will be removed in the next release.
Using fread(autostart=) has been upgraded to an error. It has been warning since v1.11.0 (6 years ago). The argument will be removed in the next release.
Using droplevels(in.place=TRUE) (warning since v1.16.0) has been upgraded from warning to error. The argument will be removed in the next release.
Use of := and with=FALSE in [ has been upgraded from warning (since v1.15.0) to error. Long ago (before 2014), this was needed when, e.g., assigning to a vector of column names defined outside the table, but with=FALSE is no longer needed to do so: DT[, (cols) := ...] works fine.
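The replacement for the with=FALSE idiom looks like this (the column names in `cols` are illustrative):

```r
library(data.table)

DT = data.table(x = 1:3, y = 4:6)
cols = c("x_sq", "y_sq")   # target names defined outside the table

# Parentheses tell := to evaluate cols rather than create a column called "cols"
DT[, (cols) := list(x^2, y^2)]
names(DT)
```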

Better handling of multibyte characters in print(), added in 1.16.0, has the side effect of possibly ignoring invisible characters like \n or \t for the purposes of counting width for datatable.prettyprint.char. That’s because we switched from substring(), which is explicitly discouraged for the purposes of truncating strings, to strtrim(), which has platform-dependent behavior for whether invisible characters count towards string width.

data.table v1.16.4 4 December 2024

BUG FIXES

Joins on multiple columns, such as x[y, on=c("x1==y1", "x2==y1")], could fail during implicit type coercions if x1 and x2 had different but still compatible types, #6602. This was particularly unexpected when columns x1, x2, and y1 were all of the same class, e.g. Date, but differed in their underlying storage types. Thanks to Benjamin Schwendinger for the report and the fix.

data.table v1.16.2 (9 October 2024)

BUG FIXES

Using print.data.table() with character truncation using datatable.prettyprint.char no longer errors with NA entries, #6441. Thanks to @r2evans for the bug report, and @joshhwuu for the fix.
Fixed a segfault in fcase(), #6448. Thanks @ethanbsmith for reporting with reprex, @aitap for finding the root cause, and @MichaelChirico for the PR.
fread() automatically detects timestamps with sub-second accuracy again, #6440. This was a regression due to interference with new dec='auto' support. Thanks @kav2k for the concise report and @MichaelChirico for the fix.
Using a namespace-qualified call on the RHS of by=, e.g. DT[,.N,by=base::mget(v)], works again, fixing #6493. Thanks to @mmoisse for the report and @MichaelChirico for the fix.
Restore some join operations on x and i (e.g. an anti-join x[!i]) where i is an extended data.frame, but not a data.table (e.g. a tbl), #6501. Thanks @MichaelChirico for the report and PR.

NOTES

Fixed a typo in the NEWS for the last release – that’s version 1.16.0, not 1.6.0; apologies. Thanks @r2evans for flagging, #6443.
Continued work to remove non-API C functions, #6180. Thanks Ivan Krylov for the PR and for writing a clear and concise guide about the R API: https://aitap.codeberg.page/R-api/.
data.table again properly detects OpenMP support when built using gcc on macOS, #6409. Thanks @barracuda156 for the report and @kevinushey for the fix.
The translations submitted for 1.16.0 are now actually shipped with the package – our deepest apologies to the translators for the omission. We have added a CI check to ensure that the .mo binaries which get shipped with the package are always up-to-date.

data.table v1.16.0 (25 August 2024)

BREAKING CHANGES

droplevels(in.place=TRUE) is deprecated in favor of calling setdroplevels(), #6014. Given the associated risks/pain points, we strongly prefer all in-place/by-reference behavior within data.table come from functions set* (and :=) to make it as clear as possible that inputs are mutable. See below and ?setdroplevels for more.
`[.data.table` is un-exported again. This was exported to support an experimental feature (DT() functional form of [) that never made it to release, but we forgot to claw back this export in the NAMESPACE; sorry about that. We didn’t find anyone calling the method directly (which is inadvisable to begin with).

NEW FEATURES

We continue to consider user feedback to prioritize development. See #3189 for the current list of most-requested issues. In this release we add five highly-requested features:
1. Using dt[, names(.SD) := lapply(.SD, fx)] now works to update all columns, #795. Of course this also works when .SD is only a subset of the columns: dt[, names(.SD) := lapply(.SD, fx), .SDcols = is.numeric]. Thanks to @brodieG for the report, 20 or so others for chiming in, and @ColeMiller1 for PR.
2. fread() now supports automatic detection of dec (as either . or ,, the latter being common in many places in Europe, Africa, and South America); this behavior is now the default, i.e. dec='auto', #2431. Thanks @mattdowle for the original issue, 50 or more others for expressing support, and @MichaelChirico for the fix.
3. fcase() supports vectors in default= (so the default can vary by row) and default= is now lazily evaluated, #4258. Thanks @sindribaldur for the feature request, @shrektan for doing most of the implementation, and @MichaelChirico for sewing things up. Thanks also to @DavisVaughan for some design guidance before release to remove an extraneous feature, #6352.
4. [.data.table gains argument showProgress, allowing users to toggle progress printing for slow “group by” operations, #3060. The progress bar reports information such as the number of groups processed, total groups, total time elapsed and estimated time until completion. This feature doesn’t apply to GForce-optimized operations. Thanks to @eatonya and @zachmayer for filing FRs, and to everyone else that up-voted/chimed in on the issue. Thanks to @joshhwuu for the PR.
5. rbindlist(l, use.names=TRUE) and rbind() now work correctly on columns with different class attributes across the inputs for certain classes such as Date, IDate, ITime, POSIXct and AsIs with matched columns of similar classes, e.g., rbind(data.table(d = Sys.Date()), data.table(d = as.IDate(Sys.Date()-1))). The conversion is done automatically and the class attribute of the final column is determined by the first class attribute encountered in the binding list, #5309, #4934, #5391.
rbindlist(l, ignore.attr=TRUE) and rbind() also gain argument ignore.attr (default FALSE) to manually deactivate the safety net preventing binding columns with different column classes, #3911, #5542. Thanks to @dcaseykc, @fox34, @adrian-quintario, @berg-michael, @arunsrinivasan, @statquant, @pkress, @jrausch12, @therosko, @OfekShilon, @iMissile, @tdhock for the requests and @ben-schwen for the PR.
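Two of the numbered features above, updating columns through names(.SD) := and a row-varying default= in fcase(), can be sketched as follows. The data and the labels "low", "high", and "mid:" are invented for illustration.

```r
library(data.table)

# 1. Update every numeric column in place
DT = data.table(g = c("a", "b"), x = c(1, 2), y = c(10, 20))
DT[, names(.SD) := lapply(.SD, function(v) v * 2), .SDcols = is.numeric]

# 3. default= may be a vector, so the fallback can differ by row
x = c(1, 5, 10)
lab = fcase(x < 3, "low",
            x > 8, "high",
            default = paste0("mid:", x))
```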
print.data.table() shows empty (NULL) list column entries as [NULL] for emphasis. Previously they would just print nothing (indistinguishable from an empty string). Part of #4198. Thanks @sritchie73 for the proposal and fix.
```
data.table(a=list(NULL, ""))
#         a
#    <list>
# 1: [NULL]
# 2:
```
.datatable.aware = FALSE works correctly to signal it’s not safe to call data.table methods, #5654. Thanks @dvg-p4 for the request and PR. See vignette("datatable-importing") for more on this feature.
The split() method for data.tables is more consistent with that for base methods:
1. f can be a formula, #5392, mirroring the same in base::split.data.frame since R 4.1.0 (May 2021). Thanks to @XiangyunHuang for the request, and @ben-schwen for the PR.
2. sep= is recognized when splitting with by=, just like the default and data.frame methods #5417. Thanks @MichaelChirico for the request and PR.
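Both additions can be sketched on a small made-up table:

```r
library(data.table)

DT = data.table(g = c("a", "a", "b"), h = c("x", "y", "x"), v = 1:3)

# sep= controls how the by= values are pasted into list names
by_names = names(split(DT, by = c("g", "h"), sep = "_"))

# f given as a formula, as in base::split.data.frame
f_names = names(split(DT, ~ g))
```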
Namespace-qualifying data.table::shift(), data.table::first(), or data.table::last() will not deactivate GForce, #5942. Thanks @MichaelChirico for the proposal and fix. Namespace-qualifying other calls like stats::sum(), base::prod(), etc., continue to work as an escape valve to avoid GForce, e.g. to ensure S3 method dispatch.
transpose() gains list.cols= argument (default FALSE), #5639. Use this to return output with list columns and avoid type promotion (an exception is factor columns which are promoted to character for consistency between list.cols=TRUE and list.cols=FALSE). This is convenient for creating a row-major representation of a table. Thanks to @MLopez-Ibanez for the request, and @ben-schwen for the PR.
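A sketch contrasting the two modes on a table with mixed column types (illustrative data):

```r
library(data.table)

DT = data.table(a = 1:2, b = c("x", "y"))

promoted = transpose(DT)                    # values promoted to a common type
as_lists = transpose(DT, list.cols = TRUE)  # each output column is a list

class(promoted$V1)
as_lists$V1   # one list entry per input column, original types kept
```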
fread()’s fill argument now also accepts an integer in addition to boolean values – an upper bound on the number of columns in the file. fread always guesses the number of columns based on reading a sample of rows in the file. When fill=TRUE, fread() stops reading and ignores subsequent rows when this estimate winds up too low, e.g. when the sampled rows happen to exclude some rows that are even wider, #2691, #4130, #3436, #1812 and #5378. The suggestion for fill to allow a manual estimate of the number of columns instead comes from #2727. Using fill=Inf reads the full file for estimating the number of columns. Thanks to @jangorecki, @christellacaze, @Yiguan, @alexdthomas, @ibombonato, @Befrancesco, @TobiasGold for reporting/requesting, and @ben-schwen for the PR.
Computations in j can return a matrix or array if it is one-dimensional, e.g. a row or column vector, when j is a list of columns during grouping, #783. Previously a matrix could be provided in DT[, expr, by] form, but not DT[, list(expr), by] form; this resolves that inconsistency. It is still an error to return a “true” array, e.g. a 2x3 matrix.
measure() helper for melt() now supports user-specified cols argument, which can be useful to specify a subset of columns to reshape without having to use a regex, #5063. Thanks to @UweBlock and @Henrik-P for reporting, and @tdhock for the PR.
setDT() is faster for data with many columns, #5426. Thanks @MichaelChirico for reporting and fixing the issue.
dcast() gains value.var.in.dots, value.var.in.LHSdots and value.var.in.RHSdots arguments, #5824. This allows the value.var variable(s) in dcast() to be represented by ... in the formula (if not otherwise mentioned). Thanks to @iago-pssjd for the report and PR.
fread() loads .bgz files directly, #5461. Thanks to @TMRHarrison for the request with proposed fix, and @ben-schwen for the PR.
New setdroplevels() as a by-reference version of the droplevels() method, which returns a copy of its input, #6014. Thanks @MichaelChirico for the suggestion and implementation.
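The copy-versus-reference distinction can be sketched like this (made-up factor column `f`):

```r
library(data.table)

DT = data.table(f = factor(c("a", "b", "c")), v = 1:3)[v < 3L]
levels_before = levels(DT$f)       # unused level "c" is still present

copy_dropped = droplevels(DT)      # returns a copy; DT is untouched
levels_after_copy = levels(DT$f)

setdroplevels(DT)                  # modifies DT by reference
levels_after_set = levels(DT$f)
```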
dcast(fill=NULL) only computes default fill value if necessary, which eliminates some previous warnings which were potentially confusing (for example, when fun.aggregate=min or max, warning was “NAs introduced by coercion to integer range”), #5512, #5390. Thanks to @tdhock for the report and fix.
patterns() helper for .SDcols now accepts arguments ignore.case, perl, fixed, and useBytes, which are passed to grep(), #5387. Thanks to @iago-pssjd for the feature request, and @tdhock for the implementation.
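For example, with made-up column names that differ only in case:

```r
library(data.table)

DT = data.table(Alpha = 1, alpha2 = 2, beta = 3)

# ignore.case is forwarded to grep(), so both Alpha and alpha2 match
sel = DT[, .SD, .SDcols = patterns("^alpha", ignore.case = TRUE)]
names(sel)
```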
print() method for data.tables:
1. Now handles combination multibyte characters correctly when truncating wide string entries, #5096. Thanks to @MichaelChirico for the report and @joshhwuu for the fix.
2. Prints the indicator --- in every value column when truncation is needed and row.names = FALSE instead of adding a blank column where the rownames would have been just to include ---, #4083. Thanks @MichaelChirico for the report and @joshhwuu for the fix.
3. Honors na.print, as seen in print.default, allowing for string replacement of NA values when printing. Thanks @HughParsonage for the report and @joshhwuu for the fix.
4. Gains new argument show.indices (with corresponding option datatable.show.indices) that allows the user to print a data.table’s indices as columns without having to modify the data.table itself. Thanks @MichaelChirico for the report and @joshhwuu for the PR.
5. Displays integer64 columns correctly by loading {bit64} if needed, #6224. Thanks @renkun-ken for the report and @MichaelChirico for the fix.

BUG FIXES

unique() returns a copy when nrow(x) <= 1 instead of a mutable alias, #5932. This is consistent with existing unique() behavior when the input has no duplicates but more than one row. Thanks to @brookslogan for the report and @dshemetov for the fix.
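The practical consequence is that by-reference changes to the result no longer leak into the input, as in this sketch (assuming v1.16.0 or later):

```r
library(data.table)

DT = data.table(a = 1)
u = unique(DT)

# u is a copy, so adding a column by reference leaves DT alone
u[, b := 2]
names(DT)
```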
dcast() handles coercion of fill to integer64 correctly, #4561. Thanks to @emallickhossain for the bug report and @MichaelChirico for the fix.
Optimized shift() per group produces the right results when simultaneously subsetting, for example, DT[i==1L, shift(x), by=group], #5962. Thanks to @renkun-ken for the report and @ben-schwen for the fix.
fwrite(x, row.names=TRUE) with x a matrix writes row.names when present, not row numbers, #5315. Thanks to @Liripo for the report, and @ben-schwen for the fix.
Adding a list column to an empty data.table works consistently with other column types, #5738. Thanks to Benjamin Schwendinger for the report and the fix.
In DT[,j,by], by retains its attributes (e.g. class) when j is GForce optimized, #5567. Thanks to @danwwilson for the report, and @ben-schwen for the PR.
dt[,,by=año] (i.e., using a column name containing a non-ASCII character in by as a plain symbol) no longer errors with “object ‘año’ not found”, #4708. Thanks @pfv07 for the report, and @MichaelChirico for the fix. Also thanks to @aitap for suggesting an improvement to the corresponding test, #6339.
Fixed some memory management issues in the C routines backing melt(), froll(), and GForce mean(), as identified by rchk. Thanks Tomas Kalibera and the CRAN team for setting up the rchk system, and @MichaelChirico for the fix.
data.table’s all.equal() method now dispatches to each column’s own all.equal() method as appropriate, #4543. Thanks @MichaelChirico for the report and fix. Note that this had two noteworthy changes to data.table’s own test suite that might affect you:
1. Comparisons of POSIXct columns compare absolute, not relative differences, meaning that millisecond-scale differences might trigger a “not equal” report that was hidden before.
2. Comparisons of integer64 columns could be totally wrong since they were being compared on the basis of their representation as doubles, not long integers.
The former might be a matter of preference requiring you to specify a different tolerance=, while the latter was clearly a bug.
rbindlist() and shift() could lead to a protection stack overflow when applied to a list containing many nested lists exceeding the pointer protection stack size, #4536. Thanks to @ProfFancyPants for reporting, and @ben-schwen (rbindlist) and @MichaelChirico (shift) for the fix.
fread(x, colClasses="POSIXct") now also works for columns containing only NA values, #6208. Thanks to @markus-schaffer for the report, and @ben-schwen for the fix.
fread() is more careful about detecting that a file is compressed in bzip2 format, #6304. In particular, we also check the 4th byte of the file is a digit; in rare cases, a legitimate uncompressed CSV file could match ‘BZh’ as the first 3 bytes. We think an uncompressed CSV file matching ‘BZh[1-9]’ is all the more rare and unlikely to be encountered in “real” examples. Other formats (zip, gzip) are friendly enough to use non-printable characters in their magic numbers. Thanks @grainnemcguire for the report and @MichaelChirico for the fix.
Selecting the key column like DT[, .(key1, key2)] will retain the key without a performance penalty, #4498. Thanks to @user9439449 on StackOverflow for the report and @MichaelChirico for the fix.
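A quick check of the retained key, on an illustrative table with key columns named `k1` and `k2`:

```r
library(data.table)

DT = data.table(k1 = c(2L, 1L), k2 = c("b", "a"), v = 1:2)
setkey(DT, k1, k2)

sel = DT[, .(k1, k2)]
key(sel)
```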
Passing functions programmatically with env= doesn’t produce an opaque error, e.g. DT[, f(b), env = list(f=sum)], #6026. Note that it’s much better to pass functions like f="sum" instead. Thanks to @MichaelChirico for the bug report and fix.
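The recommended spelling passes the function by name, as in this sketch (`f` and `b` are placeholders):

```r
library(data.table)

DT = data.table(b = 1:3)

# "sum" is substituted for f as a name, so j becomes sum(b)
total = DT[, f(b), env = list(f = "sum")]
total
```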

NOTES

transform() method for data.table sped up substantially when creating new columns on large tables. Thanks to @OfekShilon for the report and PR. The implemented solution was proposed by @ColeMiller1.
The documentation for the fill argument in rbind() and rbindlist() now notes the expected behaviour for missing list columns when fill=TRUE, namely to use NULL (not NA), #4198. Thanks @sritchie73 for the proposal and fix.
data.table now depends on R 3.3.0 (2016) instead of 3.1.0 (2014). Recent versions of R have good features that we would gradually like to incorporate, and we see next to no usage of these very old versions of R. We originally attempted to bump only to R 3.2.0 in this release, but our vignette engine {knitr} requiring 3.3.0 and R CMD check lacking an --ignore-vignettes option until 3.3.0 essentially forced our hands.
Erroneous assignment calls in [ with a trailing comma (e.g. DT[, `:=`(a = 1, b = 2,)]) get a friendlier error since this situation is common during refactoring and easy to miss visually. Thanks @MichaelChirico for the fix.
Input files are now kept open during mmap() when running under Emscripten, emscripten-core/emscripten#20459. This avoids an error in fread() when running in WebAssembly, #5969. Thanks to @maek-ies for the report and @georgestagg for the PR.
dcast() improves behavior when fun.aggregate is not provided by the user and the default of length() is used:
1. This now triggers a warning, not a message, since relying on this default often signals unexpected duplicates in the data, #5386. The warning is classed as dt_missing_fun_aggregate_warning, allowing for more targeted handling in user code. Thanks @MichaelChirico for the suggestion and @Nj221102 for the fix.
2. The warning itself does better explaining the behavior and suggesting alternatives, #5217. Thanks @MichaelChirico for the suggestion and @Nj221102 for the fix.
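Because the warning carries its own class, user code can handle it selectively. The following is one possible pattern, not a prescribed one; the data and the `caught` flag are illustrative.

```r
library(data.table)

# id 1 has two rows for k == "a", so dcast() must aggregate
DT = data.table(id = c(1L, 1L, 2L), k = c("a", "a", "b"), v = 1:3)

caught = FALSE
wide = withCallingHandlers(
  dcast(DT, id ~ k, value.var = "v"),   # no fun.aggregate: length() is used
  dt_missing_fun_aggregate_warning = function(w) {
    caught <<- TRUE                     # record it, then silence only this warning
    invokeRestart("muffleWarning")
  }
)
```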
Updated a test relying on operator > working for comparing language objects to a string, which will be deprecated by R, #5977; no user-facing effect. Thanks to R-core for continuously improving the language.
Improved OpenMP detection when building from source on Mac, #4348. Thanks @jameshester and @kevinushey for the request and @kevinushey for the PR, @jameslamb for the advice and @s-u of R-core for ensuring CRAN machines are configured to support the expected setup.
test.data.table() runs more robustly:
1. In sessions where the digits or warn options are not their defaults (7 and 0, respectively), #5285. Thanks @OfekShilon for the report and suggested fix and @MichaelChirico for the PR.
2. In locales where letters != sort(letters), e.g. Latvian, #3502. Thanks @minemR for the report and @MichaelChirico for the fix.
3. Initialises the numeric rounding value to 0 using setNumericRounding(0) to avoid failed tests if the user has set a different value, #6082. The user’s value is restored on exit. Thanks to @MichaelChirico for the report and for describing the solution, and @markseeto for implementing.
To enable this, setNumericRounding() now invisibly returns the old rounding value instead of NULL, which is consistent with similar behavior by setwd(), options(), etc. Thanks @MichaelChirico for the report and @joshhwuu for the fix.
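The returned value makes a save-and-restore pattern straightforward. `with_rounding()` below is a hypothetical helper written for this sketch, not part of data.table.

```r
library(data.table)

# Hypothetical helper: evaluate code under a temporary rounding setting
with_rounding = function(bytes, code) {
  old = setNumericRounding(bytes)   # returns the previous value invisibly
  on.exit(setNumericRounding(old))  # restore it however we leave
  force(code)
}

before = getNumericRounding()
inside = with_rounding(2L, getNumericRounding())
after  = getNumericRounding()
```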
The measure() and patterns() helpers for [ and melt() are now exported to ensure consistency with other non-standard evaluation (NSE) exports like .N and :=. This change addresses #5604, allowing package developers to import these names and avoid R CMD check NOTEs about undefined variables. Thanks to @MichaelChirico and @ylelkes for their suggestions, and to @Nj221102 for the implementation.

We plan to export similar placeholders for . and J in roughly one year (e.g. data.table 1.18.0), but excluded them from this release to avoid back-compatibility issues. Specifically, some packages doing import(plyr) and import(data.table), and/or with those packages in Depends, will error when data.table starts exporting . (and similarly for a potential conflict with rJava::J()). We discourage using data.table (or any package, really) in Depends; blanket import() of a package is also generally best avoided. See vignette("datatable-importing").
fwrite() header rows are no longer quoted automatically when na argument is given, #2964. Thanks @jangorecki for the report and @joshhwuu for the fix.
Removed a warning about the now totally-obsolete option datatable.CJ.names, as discussed in previous releases.
Refactored some non-API calls in the package C code, #6180. There should be no user-visible change. Thanks to various R users, R core, and especially Luke Tierney for pushing to have a clearer definition of “API” for R and for offering clear documentation and suggested workarounds. Thanks @MichaelChirico and @TysonStanley for implementing changes for this release; more will follow.
C code is more unified in how failures to allocate memory (malloc()/calloc()) are handled, #1115. No OOM issues were reported, as these regions of code typically request relatively small blocks of memory, but it is good to handle memory pressure consistently. Thanks @elfring for the report and @MichaelChirico for the clean-up effort and future-proofing linter.
The internal routine for finding sort order (forder()) will now re-use any existing index. A similar optimization was already present in R code, but this has now been pushed to C and covers a wider range of use cases and collects more statistics about its input (e.g. whether any infinite entries were found), opening the possibility for more optimizations in other functions.

Functions setindex() (and setindexv()) will now compute groups’ positions as well. setindex() also collects the extra statistics alluded to above.

Finding sort order in other routines (for example subset d2[id==1L]) does not include those extra statistics so as not to impose a slowdown.
```
d2 = data.table(id=2:1, v2=1:2)
setindexv(d2, "id")
str(attr(attr(d2, "index"), "__id"))
# int [1:2] 2 1
# - attr(*, "starts")= int [1:2] 1 2
# - attr(*, "maxgrpn")= int 1
# - attr(*, "anyna")= int 0
# - attr(*, "anyinfnan")= int 0
# - attr(*, "anynotascii")= int 0
# - attr(*, "anynotutf8")= int 0

d2 = data.table(id=2:1, v2=1:2)
invisible(d2[id==1L])
str(attr(attr(d2, "index"), "__id"))
# int [1:2] 2 1
```
This feature also enables re-use of sort index during joins, in cases where one of the calls to find sort order is made from C code.
```
d1 = data.table(id=1:2, v1=1:2)
d2 = data.table(id=2:1, v2=1:2)
setindexv(d2, "id")
d1[d2, on="id", verbose=TRUE]
#...
#Starting bmerge ...
#forderReuseSorting: using existing index: __id
#forderReuseSorting: opt=2, took 0.000s
#...
```
This feature resolves #4387, #2947, #4380, and #1321. Thanks to @jangorecki, @jan-glx, and @MichaelChirico for the reports and @jangorecki for implementing.
set() now adds new columns even if no rows are updated, #5409. This behavior is now consistent with :=, thanks to @mb706 for the report and @joshhwuu for the fix.
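A small sketch of the two forms side by side, using a made-up one-column table. The `:=` result is long-standing behavior; the `set()` call is expected to match it only in versions that include this fix.

```r
library(data.table)

DT1 = data.table(a = 1:3)
DT1[a > 10L, b := 1L]                          # no row matches, b is still added (all NA)

DT2 = data.table(a = 1:3)
set(DT2, i = integer(), j = "b", value = 1L)   # no rows to update

names(DT1)
names(DT2)   # per the item above, expected to include "b" as well
```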
The internal init() function in the fread.c module has been marked as static, #6328. This obviates name collisions, and the resulting segfaults, with other libraries visible to the R process that might expose the same symbol name. This was observed in Cray HPE environments where the libsci library providing LAPACK to R already has an init symbol. Thanks to @rtobar for the report and fix.
?melt has long documented that the returned variable column should contain integer column indices when measure.vars is a list, but when the list length is 1, variable is actually a character column name, which is inconsistent with the documentation, #5209. To increase consistency in the next release, we plan to change variable to integer, so users who were relying on this behavior should change measure.vars=list("col_name") (variable currently is a column name but will be a column index/integer after this planned change) to measure.vars="col_name" (variable is column name before and after the planned change). For now, relying on this undocumented behavior throws a new warning.
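For illustration, the recommended character form on a hypothetical two-column table:

```r
library(data.table)

DT = data.table(id = 1:2, col_name = c(10, 20))

# Character measure.vars: variable holds the column name, before and after the planned change.
long = melt(DT, id.vars = "id", measure.vars = "col_name")
long$variable
```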
?dcast has always required fun.aggregate to return a single value, and when fill=NULL, dcast would indeed error if a vector with length!=1 was returned, but an undefined result was silently returned for non-NULL fill. Now dcast() will additionally warn that this is undefined behavior when fill is not NULL, #6032. In particular, this will warn for fun.aggregate=identity, which was observed in several revdeps. We may change this to an error in a future release, so revdeps should fix their code as soon as possible. Thanks to @tdhock for the PR, and @MichaelChirico for analysis of GitHub revdeps.
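As a sketch of a conforming call, here is an aggregate that always returns a single value per cell. The data and column names are invented for the example.

```r
library(data.table)

DT = data.table(id = c(1L, 1L, 2L), grp = c("a", "a", "b"), v = c(1, 2, 3))

# sum() returns exactly one number per (id, grp) cell, which is what dcast() documents.
# By contrast, fun.aggregate = identity would return both values for id 1, grp "a".
wide = dcast(DT, id ~ grp, value.var = "v", fun.aggregate = sum)
wide
```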

TRANSLATIONS

Fix a typo in a Mandarin translation of an error message that was hiding the actual error message, #6172. Thanks @trafficfan for the report and @MichaelChirico for the fix.
data.table is now translated into Brazilian Portuguese (pt_BR) and Spanish (es) as well as Mandarin (zh_CN). Thanks to the new translation teams consisting initially of @rffontenelle, @leofontenelle, and @italo-07 for Portuguese; and @rikivillalba, @rivaquiroga, and @MaraDestefanis for Spanish. The teams are open if you’d also like to join and support maintenance of these translations.
A more helpful error message for using := inside the first argument (i) of [.data.table is now available in translation, #6293. Previously, the code to display this assumed an earlier message was printed in English. The solution is for calling := directly (i.e., outside the second argument j of [.data.table) to throw an error of class dt_invalid_let_error. Thanks to Spanish translator @rikivillalba for spotting the issue and @MichaelChirico for the fix.

data.table v1.15.4 (27 March 2024)

BUG FIXES

Optimized shift per group produced wrong results when simultaneously subsetting, for example, DT[i==1L, shift(x), by=group], #5962. Thanks to @renkun-ken for the report and Benjamin Schwendinger for the fix.
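A minimal check of the fixed behavior, with a made-up table in which the filter column is called `flag` rather than `i`:

```r
library(data.table)

DT = data.table(
  flag  = c(1L, 1L, 2L, 1L, 1L),
  group = c("a", "a", "a", "b", "b"),
  x     = 1:5
)

# Subset and shift per group in one query; each group should start with NA.
res = DT[flag == 1L, shift(x), by = group]
res$V1   # NA 1 NA 4
```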

NOTES

Updated a test relying on > working for comparing language objects to a string, which will be deprecated by R, #5977; no user-facing effect. Thanks to R-core for continuously improving the language.

data.table v1.15.2 (27 Feb 2024)

BUG FIXES

An error in fwrite() is more robust across platforms – CRAN found the use of PRId64 does not always match the output of xlength(), e.g. on some Mac M1 builds #5935. Thanks CRAN for identifying the issue and @ben-schwen for the fix.
shift() of a vector in grouped queries (under GForce) returns a vector, consistent with shift() in other contexts, #5939. Thanks @shrektan for the report and @MichaelChirico for the fix.

data.table v1.15.0 (30 Jan 2024)

BREAKING CHANGE

shift() and nafill() now raise the error input must not be matrix or array when given a matrix or array, rather than returning a useless result, #5287. Thanks to @ethanbsmith for reporting.

NEW FEATURES

nafill() now applies fill= to the front/back of the vector when type="locf|nocb", #3594. Thanks to @ben519 for the feature request. It also now returns a named object based on the input names. Note that if you are considering joining and then using nafill(...,type='locf|nocb') afterwards, please review roll=/rollends= which should achieve the same result in one step more efficiently. nafill() is for when filling-while-joining (i.e. roll=/rollends=/nomatch=) cannot be applied.
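A short sketch of the new fill= behavior on an illustrative vector:

```r
library(data.table)

x = c(NA, 1, NA, 3, NA)

locf_plain = nafill(x, type = "locf")            # NA 1 1 3 3: nothing to carry into the first slot
locf_fill  = nafill(x, type = "locf", fill = 0)  # 0 1 1 3 3: fill= covers the leading NA
nocb_fill  = nafill(x, type = "nocb", fill = 0)  # 1 1 3 3 0: fill= covers the trailing NA
```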
mean(na.rm=TRUE) by group is now GForce optimized, #4849. Thanks to the h2oai/db-benchmark project for spotting this issue. The 1 billion row example in the issue shows 48s reduced to 14s. The optimization also applies to type integer64 resulting in a difference to the bit64::mean.integer64 method: data.table returns a double result whereas bit64 rounds the mean to the nearest integer.
fwrite() now writes UTF-8 or native csv files by specifying the encoding= argument, #1770. Thanks to @shrektan for the request and the PR.

data.table() no longer fills empty vectors with NA with warning. Instead a 0-row data.table is returned, #3727. Since data.table() is used internally by .(), this brings the following examples in line with expectations in most cases. Thanks to @shrektan for the suggestion and PR.

```
DT = data.table(A=1:3, B=letters[1:3])
DT[A>3,   .(ITEM='A>3', A, B)]  # (1)
DT[A>3][, .(ITEM='A>3', A, B)]  # (2)
# the above are now equivalent as expected and return:
Empty data.table (0 rows and 3 cols): ITEM,A,B
# Previously, (2) returned :
      ITEM     A      B
   <char> <int> <char>
1:    A>3    NA   <NA>
Warning messages:
1: In as.data.table.list(jval, .named = NULL) :
  Item 2 has 0 rows but longest item has 1; filled with NA
2: In as.data.table.list(jval, .named = NULL) :
  Item 3 has 0 rows but longest item has 1; filled with NA

DT = data.table(A=1:3, B=letters[1:3], key="A")
DT[.(1:3, double()), B]
# new result :
character(0)
# old result :
[1] "a" "b" "c"
Warning message:
In as.data.table.list(i) :
  Item 2 has 0 rows but longest item has 3; filled with NA
```

%like% on factors with a large number of levels is now faster, #4748. The example in the PR shows 2.37s reduced to 0.86s on a factor length 100 million containing 1 million unique 10-character strings. Thanks to @statquant for reporting, and @shrektan for implementing.
keyby= now accepts TRUE/FALSE together with by=, #4307. The primary motivation is benchmarking where by= vs keyby= is varied across a set of queries. Thanks to Jan Gorecki for the request and the PR.
```
DT[, sum(colB), keyby="colA"]
DT[, sum(colB), by="colA", keyby=TRUE]   # same
```
fwrite() gains a new datatable.fwrite.sep option to change the default separator, still "," by default. Thanks to Tony Fischetti for the PR. As is good practice in R in general, we usually resist new global options for the reason that a user changing the option for their own code can inadvertently change the behaviour of any package using data.table too. However, in this case, the global option affects file output rather than code behaviour. In fact, the very reason the user may wish to change the default separator is that they know a different separator is more appropriate for their data being passed to the package using fwrite but cannot otherwise change the fwrite call within that package.
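A minimal sketch of setting the option for one write and then restoring it; the temporary file and table are only for illustration.

```r
library(data.table)

old = options(datatable.fwrite.sep = ";")   # change the default separator
f = tempfile(fileext = ".csv")
fwrite(data.table(a = 1:2, b = 3:4), f)     # no sep= given, so the option applies
lines = readLines(f)
options(old)                                # restore the previous setting
unlink(f)
lines
```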
melt() now supports NA entries when specifying a list of measure.vars, which translate into runs of missing values in the output. Useful for melting wide data with some missing columns, #4027. Thanks to @vspinu for reporting, and @tdhock for implementing.
melt() now supports multiple output variable columns via the variable_table attribute of measure.vars, #3396 #2575 #2551, #4998. It should be a data.table with one row that describes each element of the measure.vars vector(s). These data/columns are copied to the output instead of the usual variable column. This is backwards compatible since the previous behavior (one output variable column) is used when there is no variable_table. New functions measure() and measurev() which use either a separator or a regex to create a measure.vars list/vector with variable_table attribute; useful for melting data that has several distinct pieces of information encoded in each column name. See new ?measure and new section in reshape vignette. Thanks to Matthias Gomolka, Ananda Mahto, Hugh Parsonage, Mark Fairbanks for reporting, and to Toby Dylan Hocking for implementing. Thanks to @keatingw for testing before release, requesting measure() accept single groups too #5065, and Toby for implementing.
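As a sketch of measure() with a separator, assume column names of the form `<value>_<number>`. The table below is hypothetical; see ?measure for the full interface.

```r
library(data.table)

DT = data.table(id = 1:2, x_1 = 1:2, x_2 = 3:4, y_1 = 5:6, y_2 = 7:8)

# Split names on "_": the first part names the output value column,
# the second part becomes an integer column called num.
long = melt(DT, measure.vars = measure(value.name, num = as.integer, sep = "_"))
long
```

Here `id` does not split into two parts, so it is kept as an id column.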

A new interface for programming on data.table has been added, closing #2655 and many other linked issues. It is built using base R’s substitute-like interface via a new env argument to [.data.table. For details see the new vignette programming on data.table, and the new ?substitute2 manual page. Thanks to numerous users for filing requests, and Jan Gorecki for implementing.

```
DT = data.table(x = 1:5, y = 5:1)

# parameters
in_col_name = "x"
fun = "sum"
fun_arg1 = "na.rm"
fun_arg1val = TRUE
out_col_name = "sum_x"

# parameterized query
#DT[, .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val))]

# desired query
DT[, .(sum_x = sum(x, na.rm=TRUE))]

# new interface
DT[, .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val)),
  env = list(
    in_col_name = "x",
    fun = "sum",
    fun_arg1 = "na.rm",
    fun_arg1val = TRUE,
    out_col_name = "sum_x"
  )]
```

DT[, if (...) .(a=1L) else .(a=1L, b=2L), by=group] now returns a 1-column result with warning j may not evaluate to the same number of columns for each group, rather than error 'names' attribute [2] must be the same length as the vector, #4274. Thanks to @robitalec for reporting, and Michael Chirico for the PR.
Typo checking in i available since 1.11.4 is extended to work in non-English sessions, #4989. Thanks to Michael Chirico for the PR.
fifelse() now coerces logical NA to other types and the na argument supports vectorized input, #4277 #4286 #4287. Thanks to @michaelchirico and @shrektan for reporting, and @shrektan for implementing.
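Two illustrative calls, one for each part of this item:

```r
library(data.table)

# A plain logical NA in no= is coerced to the integer type of yes=.
a = fifelse(c(TRUE, FALSE), 1L, NA)

# na= may now have the same length as test, supplying a value per position.
b = fifelse(c(TRUE, NA, FALSE), "yes", "no", na = c("n1", "n2", "n3"))
```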
.datatable.aware is now recognized in the calling environment in addition to the namespace of the calling package, dtplyr#184. Thanks to Hadley Wickham for the idea and PR.
New convenience function %plike% maps to like(..., perl=TRUE), #3702. %plike% uses Perl-compatible regular expressions (PCRE) which extend TRE, and may be more efficient in some cases. Thanks @KyleHaynes for the suggestion and PR.
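For example, a lookahead is a Perl-only construct, so a pattern like the one below needs %plike% rather than %like%. The input vector is made up.

```r
library(data.table)

x = c("a1", "ab", "b2")
hit = x %plike% "a(?=\\d)"   # "a" followed by a digit, using a PCRE lookahead
x[hit]
```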
fwrite() now accepts sep="", #4817. The motivation is an example where the result of paste0() needs to be written to file but paste0() takes 40 minutes due to constructing a very large number of unique long strings in R’s global character cache. Allowing fwrite(, sep="") avoids the paste0 and saves 40 mins. Thanks to Jan Gorecki for the request, and Ben Schwen for the PR.
data.table printing now supports customizable methods for both columns and list column row items, part of #1523. format_col is S3-generic for customizing how to print whole columns and by default defers to the S3 format method for the column’s class if one exists; e.g. format.sfc for geometry columns from the sf package, #2273. Similarly, format_list_item is S3-generic for customizing how to print each row of list columns (which lack a format method at a column level) and also by default defers to the S3 format method for that item’s class if one exists. Thanks to @mllg who initially filed #3338 with the seed of the idea, @franknarf1 who earlier suggested the idea of providing custom formatters, @fparages who submitted a patch to improve the printing of timezones for #2842, @RichardRedding for pointing out an error relating to printing wide expression columns in #3011, @JoshOBrien for improving the output for geometry columns, and @MichaelChirico for implementing. See ?print.data.table for examples.
tstrsplit(,type.convert=) now accepts a named list of functions to apply to each part, #5094. Thanks to @Kamgang-B for the request and implementing.
as.data.table(DF, keep.rownames=key='keyCol') now works, #4468. Thanks to Michael Chirico for the idea and the PR.
dcast() now supports complex values in value.var, #4855. This extends earlier support for complex values in formula. Thanks Elio Campitelli for the request, and Michael Chirico for the PR.
melt() was pseudo generic in that melt(DT) would dispatch to the melt.data.table method but melt(not-DT) would explicitly redirect to reshape2. Now melt() is standard generic so that methods can be developed in other packages, #4864. Thanks to @odelmarcelle for suggesting and implementing.
DT[i, nomatch=NULL] where i contains row numbers now excludes NA and any outside the range [1,nrow], #3109 #3666. Before, NA rows were returned always for such values; i.e. nomatch=0|NULL was ignored. Thanks Michel Lang and Hadley Wickham for the requests, and Jan Gorecki for the PR. Using nomatch=0 in this case when i is row numbers generates the warning Please use nomatch=NULL instead of nomatch=0; see news item 5 in v1.12.0 (Jan 2019).
```
DT = data.table(A=1:3)
DT[c(1L, NA, 3L, 5L)]  # default nomatch=NA
#        A
#    <int>
# 1:     1
# 2:    NA
# 3:     3
# 4:    NA
DT[c(1L, NA, 3L, 5L), nomatch=NULL]
#        A
#    <int>
# 1:     1
# 2:     3
```
DT[, head(.SD,n), by=grp] and tail are now optimized when n>1, #5060 #523. n==1 was already optimized. Thanks to Jan Gorecki and Michael Young for requesting, and Benjamin Schwendinger for the PR.
setcolorder() gains before= and after=, #4358. Thanks to Matthias Gomolka for the request, and both Benjamin Schwendinger and Xianghui Dong for implementing. Also thanks to Manuel López-Ibáñez for testing dev and mentioning needed documentation before release.
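A quick sketch of both arguments on a hypothetical three-column table:

```r
library(data.table)

DT = data.table(a = 1, b = 2, c = 3)

setcolorder(DT, "c", before = "a")   # column order is now c, a, b
setcolorder(DT, "a", after = "b")    # column order is now c, b, a
names(DT)
```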
base::droplevels() gains a fast method for data.table, #647. Thanks to Steve Lianoglou for requesting, Boniface Kamgang and Martin Binder for testing, and Jan Gorecki and Benjamin Schwendinger for the PR. fdroplevels() for use on vectors has also been added.
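An illustrative use of both entry points, with a factor that carries one unused level:

```r
library(data.table)

f = factor(c("a", "b"), levels = c("a", "b", "c"))
lv_vec = levels(fdroplevels(f))        # vector form: unused level "c" is dropped

DT = data.table(f = f)
lv_dt = levels(droplevels(DT)$f)       # data.table method, same outcome per column
```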

shift() now also supports type="cyclic", #4451. Elements that would normally be pushed out by type="lag" or type="lead" are re-introduced at the first/last positions instead. Thanks to @RicoDiel for requesting, and Benjamin Schwendinger for the PR.

```
# Usage
shift(1:5, n=-1:1, type="cyclic")
# [[1]]
# [1] 2 3 4 5 1
#
# [[2]]
# [1] 1 2 3 4 5
#
# [[3]]
# [1] 5 1 2 3 4

# Benchmark
x = sample(1e9) # 3.7 GB
microbenchmark::microbenchmark(
  shift(x, 1, type="cyclic"),
  c(tail(x, 1), head(x,-1)),
  times = 10L,
  unit = "s"
)
# Unit: seconds
#                          expr  min   lq mean  median   uq  max neval
#  shift(x, 1, type = "cyclic") 1.57 1.67 1.71    1.68 1.70 2.03    10
#    c(tail(x, 1), head(x, -1)) 6.96 7.16 7.49    7.32 7.64 8.60    10
```

fread() now supports “0” and “1” in na.strings, #2927. Previously this was not permitted since “0” and “1” can be recognized as boolean values. Note that it is still not permitted to use “0” and “1” in na.strings in combination with logical01 = TRUE. Thanks to @msgoussi for the request, and Benjamin Schwendinger for the PR.
setkey() now supports type raw as value columns (not as key columns), #5100. Thanks Hugh Parsonage for requesting, and Benjamin Schwendinger for the PR.

shift() is now optimized by group, #1534. Thanks to Gerhard Nachtmann for requesting, and Benjamin Schwendinger for the PR. Thanks to @neovom for testing dev and filing a bug report, #5547 which was fixed before release. This helped also in improving the logic for when to turn on optimization by group in general, making it more robust.

```
N = 1e7
DT = data.table(x=sample(N), y=sample(1e6,N,TRUE))
shift_no_opt = shift  # different name not optimized as a way to compare
microbenchmark(
  DT[, c(NA, head(x,-1)), y],
  DT[, shift_no_opt(x, 1, type="lag"), y],
  DT[, shift(x, 1, type="lag"), y],
  times=10L, unit="s")
# Unit: seconds
#                                       expr     min      lq    mean  median      uq     max neval
#                DT[, c(NA, head(x, -1)), y]  8.7620  9.0240  9.1870  9.2800  9.3700  9.4110    10
#  DT[, shift_no_opt(x, 1, type = "lag"), y] 20.5500 20.9000 21.1600 21.3200 21.4400 21.5200    10
#         DT[, shift(x, 1, type = "lag"), y]  0.4865  0.5238  0.5463  0.5446  0.5725  0.5982    10
```

Example from stackoverflow

```
set.seed(1)
mg = data.table(expand.grid(year=2012:2016, id=1:1000),
                value=rnorm(5000))
microbenchmark(v1.9.4  = mg[, c(value[-1], NA), by=id],
               v1.9.6  = mg[, shift_no_opt(value, n=1, type="lead"), by=id],
               v1.14.4 = mg[, shift(value, n=1, type="lead"), by=id],
               unit="ms")
# Unit: milliseconds
#     expr     min      lq    mean  median      uq    max neval
#   v1.9.4  3.6600  3.8250  4.4930  4.1720  4.9490 11.700   100
#   v1.9.6 18.5400 19.1800 21.5100 20.6900 23.4200 29.040   100
#  v1.14.4  0.4826  0.5586  0.6586  0.6329  0.7348  1.318   100
```

rbind() and rbindlist() now support fill=TRUE with use.names=FALSE instead of issuing the warning use.names= cannot be FALSE when fill is TRUE. Setting use.names=TRUE., #5444. Thanks to @sindribaldur, @dcaseykc, @fox34, @adrian-quintario and @berg-michael for testing dev and filing a bug report which was fixed before release.

```
DT1
#        A     B
#    <int> <int>
# 1:     1     5
# 2:     2     6

DT2
#      foo
#    <int>
# 1:     3
# 2:     4

rbind(DT1, DT2, fill=TRUE)   # no change
#        A     B   foo
#    <int> <int> <int>
# 1:     1     5    NA
# 2:     2     6    NA
# 3:    NA    NA     3
# 4:    NA    NA     4

rbind(DT1, DT2, fill=TRUE, use.names=FALSE)

# was:
#        A     B   foo
#    <int> <int> <int>
# 1:     1     5    NA
# 2:     2     6    NA
# 3:    NA    NA     3
# 4:    NA    NA     4
# Warning message:
# In rbindlist(l, use.names, fill, idcol) :
#   use.names= cannot be FALSE when fill is TRUE. Setting use.names=TRUE.

# now:
#        A     B
#    <int> <int>
# 1:     1     5
# 2:     2     6
# 3:     3    NA
# 4:     4    NA
```

fread() already made a good guess as to whether column names are present by comparing the type of the fields in row 1 to the type of the fields in the sample. This guess is now improved when a column contains a string in row 1 (i.e. a potential column name) but is entirely blank in the sample rows, #2526. Thanks @st-pasha for reporting, and @ben-schwen for the PR.
fread() can now read .zip and .tar directly, #3834. Moreover, if a compressed file name is missing its extension, fread() now attempts to infer the correct filetype from its magic bytes. Thanks to Michael Chirico for the idea, and Benjamin Schwendinger for the PR.
DT[, let(...)] is a new alias for the functional form of :=; i.e. DT[, ':='(...)], #3795. Thanks to Elio Campitelli for requesting, and Benjamin Schwendinger for the PR.
```
DT = data.table(A=1:2)
DT[, let(B=3:4, C=letters[1:2])]
DT
#        A     B      C
#    <int> <int> <char>
# 1:     1     3      a
# 2:     2     4      b
```
weighted.mean() is now optimized by group, #3977. Thanks to @renkun-ken for requesting, and Benjamin Schwendinger for the PR.
as.xts.data.table() now supports non-numeric xts coredata matrixes, #5268. Existing numeric only functionality is supported by a new numeric.only parameter, which defaults to TRUE for backward compatibility and the most common use case. To convert non-numeric columns, set this parameter to FALSE. Conversions of data.table columns to a matrix now use data.table::as.matrix, with all its performance benefits. Thanks to @ethanbsmith for the report and fix.
unique.data.table() gains cols to specify a subset of columns to include in the resulting data.table, #5243. This saves the memory overhead of subsetting unneeded columns, and provides a cleaner API for a common operation previously needing more convoluted code. Thanks to @MichaelChirico for the suggestion & implementation.
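A small sketch of the new argument on a made-up table; the comment reflects how ?unique.data.table describes cols (columns to keep in addition to by).

```r
library(data.table)

DT = data.table(a = c(1, 1, 2), b = c(1, 2, 3), c = 4:6)

# Deduplicate on a, keeping only a and b in the result; c is never subset.
u = unique(DT, by = "a", cols = "b")
u
```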
:= is now optimized by group, #1414. Thanks to Arun Srinivasan for suggesting, and Benjamin Schwendinger for the PR. Thanks to @clerousset, @dcaseykc, @OfekShilon, @SeanShao98, and @ben519 for testing dev and filing detailed bug reports which were fixed before release and their tests added to the test suite.

.I is now available in by for rowwise operations, #1732. Thanks to Rafael H. M. Pereira for requesting, and Benjamin Schwendinger for the PR.

```
DT
#       V1    V2
#    <int> <int>
# 1:     3     5
# 2:     4     6

DT[, sum(.SD), by=.I]
#        I    V1
#    <int> <int>
# 1:     1     8
# 2:     2    10
```

New functions yearmon() and yearqtr() give a combined representation of year() and month()/quarter(). These and also yday, wday, mday, week, month and year are now optimized for memory and compute efficiency by removing the POSIXlt dependency, #649. Thanks to Matt Dowle for the request, and Benjamin Schwendinger for the PR. Thanks to @berg-michael for testing dev and filing a bug report for special case of missing values which was fixed before release.
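For illustration, on two arbitrary dates: the results are doubles of the form year + (month-1)/12 and year + (quarter-1)/4, as documented in ?IDateTime.

```r
library(data.table)

d = as.IDate(c("2024-01-15", "2024-07-01"))

ym = yearmon(d)   # 2024.0 for January, 2024.5 for July
yq = yearqtr(d)   # 2024.0 for Q1, 2024.5 for Q3
```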
New function %notin% provides a convenient alternative to !(x %in% y), #4152. Thanks to Jan Gorecki for suggesting and Michael Czekanski for the PR. %notin% uses half the memory because it computes the result directly as opposed to ! which allocates a new vector to hold the negated result. If x is long enough to occupy more than half the remaining free memory, this can make the difference between the operation working, or failing with an out-of-memory error.
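A one-line comparison of the two spellings on toy input:

```r
library(data.table)

x = c(1L, 4L)
y = 1:3

direct  = x %notin% y    # FALSE TRUE, computed in one pass
negated = !(x %in% y)    # same answer, but allocates an intermediate vector
```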
tables() is faster by default by excluding the size of character strings in R’s global cache (which may be shared) and excluding the size of list column items (which also may be shared). mb= now accepts any function which accepts a data.table and returns a higher and better estimate of its size in bytes, albeit more slowly; e.g. mb = utils::object.size.

BUG FIXES

by=.EACHI when i is keyed but on= different columns than i’s key could create an invalidly keyed result, #4603 #4911. Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a data.table is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.
print(DT, trunc.cols=TRUE) and the corresponding datatable.print.trunc.cols option (new feature 3 in v1.13.0) could incorrectly display an extra column, #4266. Thanks to @tdhock for the bug report and @MichaelChirico for the PR.
fread(..., nrows=0L) now works as intended and the same as nrows=0; i.e. returning the column names and typed empty columns determined by the large sample, #4686, #4029. Thanks to @hongyuanjia and @michaelpaulhirsch for reporting, and Benjamin Schwendinger for the PR. Also thanks to @HughParsonage for testing dev and reporting a bug which was fixed before release.
Passing .SD to frankv() with ties.method='random' or with na.last=NA failed with .SD is locked, #4429. Thanks @smarches for the report.
Filtering data.table using which=NA to return non-matching indices will now properly work for non-optimized subsetting as well, closes #4411.
When j returns an object whose class "X" inherits from data.table; i.e. class c("X", "data.table", "data.frame"), the derived class "X" is no longer incorrectly dropped from the class of the data.table returned, #4324. Thanks to @HJAllen for reporting and @shrektan for the PR.
as.data.table() failed with .subset2(x, i, exact = exact): attempt to select less than one element in get1index when passed an object inheriting from data.table with a different [[ method, such as the class dfidx from the dfidx package, #4526. Thanks @RicoDiel for the report, and Michael Chirico for the PR.
rbind() and rbindlist() of length-0 ordered factors failed with Internal error: savetl_init checks failed, #4795 #4823. Thanks to @shrektan and @dbart79 for reporting, and @shrektan for fixing.
data.table(NULL)[, firstCol:=1L] created data.table(firstCol=1L) ok but did not update the internal row.names attribute, causing Error in '$<-.data.frame'(x, name, value) : replacement has 1 row, data has 0 when passed to packages like ggplot which use DT as if it is a data.frame, #4597. Thanks to Matthew Son for reporting, and Cole Miller for the PR.
X[Y, .SD, by=] (joining and grouping in the same query) could segfault if i) by= is supplied custom data (i.e. not simple expressions of columns), and ii) some rows of Y do not match to any rows in X, #4892. Thanks to @Kodiologist for reporting, @ColeMiller1 for investigating, and @tlapak for the PR.
Assigning a set of 2 or more all-NA values to a factor column could segfault, #4824. Thanks to @clerousset for reporting and @shrektan for fixing.
as.data.table(table(NULL)) now returns data.table(NULL) rather than error attempt to set an attribute on NULL, #4179. The result differs slightly from as.data.frame(table(NULL)) (0-row, 1-column) because 0-column works better with other data.table functions like rbindlist(). Thanks to Michael Chirico for the report and fix.
melt with a list for measure.vars would output variable inconsistently between na.rm=TRUE and FALSE, #4455. Thanks to @tdhock for reporting and fixing.
by=...get()... could fail with object not found, #4873 #4981. Thanks to @sindribaldur for reporting, and @OfekShilon for fixing.
print(x, col.names='none') now removes the column names as intended for wide data.tables whose column names don’t fit on a single line, #4270. Thanks to @tdhock for the report, and Michael Chirico for fixing.
DT[, min(colB), by=colA] when colB is type character would miss blank strings ("") at the beginning of a group and return the smallest non-blank instead of blank, #4848. Thanks to Vadim Khotilovich for reporting and for the PR fixing it.
Assigning a wrong-length or non-list vector to a list column could segfault, #4166 #4667 #4678 #4729. Thanks to @fklirono, Kun Ren, @kevinvzandvoort and @peterlittlejohn for reporting, and to Václav Tlapák for the PR.
as.data.table() on xts objects containing a column named x would return an index of type plain integer rather than POSIXct, #4897. Thanks to Emil Sjørup for reporting, and Jan Gorecki for the PR.
A fix to as.Date(c("", ...)) in R 4.0.3, 17909, has been backported to data.table::as.IDate() so that it too now returns NA for the first item when it is blank, even in older versions of R back to 3.1.0, rather than the incorrect error character string is not in a standard unambiguous format, #4676. Thanks to Arun Srinivasan for reporting, and Michael Chirico both for the data.table PR and for submitting the patch to R that was accepted and included in R 4.0.3.
uniqueN(DT, by=character()) is now equivalent to uniqueN(DT) rather than internal error 'by' is either not integer or is length 0, #4594. Thanks Marco Colombo for the report, and Michael Chirico for the PR. Similarly for unique(), duplicated() and anyDuplicated().
melt() on a data.table with list columns for measure.vars would silently ignore na.rm=TRUE, #5044. Now the same logic as is.na() from base R is used; i.e. if list element is scalar NA then it is considered missing and removed. Thanks to Toby Dylan Hocking for the PRs.
fread(fill=TRUE) could segfault if the input contained an improperly quoted character field, #4774 #5041. Thanks to @AndeolEvain and @e-nascimento for reporting and to Václav Tlapák for the PR.
fread(fill=TRUE, verbose=TRUE) would segfault on the out-of-sample type bump verbose output if the input did not contain column names, #5046. Thanks to Václav Tlapák for the PR.
.SDcols=-V2:-V1 and .SDcols=(-1) could error with xcolAns does not pass checks and argument specifying columns specify non existing column(s), #4231. Thanks to Jan Gorecki for reporting and the PR. Thanks Toby Dylan Hocking for tracking down an error caused by the initial fix and Michael Chirico for fixing it.
.SDcols=<logical vector> is now documented in ?data.table and it is now an error if the logical vector’s length is not equal to the number of columns (consistent with data.table’s no-recycling policy; see new feature 1 in v1.12.2 Apr 2019), #4115. Thanks to @Henrik-P for reporting and Jan Gorecki for the PR.
melt() now outputs scalar logical NA instead of NULL in rows corresponding to missing list columns, for consistency with non-list columns when using na.rm=TRUE, #5053. Thanks to Toby Dylan Hocking for the PR.
as.data.frame(DT), setDF(DT) and as.list(DT) now remove the "index" attribute which contains any indices (a.k.a. secondary keys), as they already did for other data.table-only attributes such as the primary key stored in the "sorted" attribute. When indices were left intact, a subsequent subset, assign, or reorder of the data.frame by data.frame-code in base R or other packages would not update the indices, causing incorrect results if then converted back to data.table, #4889. Thanks @OfekShilon for the report and the PR.
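A minimal check of the attribute handling described above, using a throwaway table:

```r
library(data.table)

DT = data.table(a = 3:1, b = letters[1:3])
setindex(DT, a)                               # create a secondary index on a
had_index = !is.null(attr(DT, "index"))       # the index lives in the "index" attribute

DF = as.data.frame(DT)                        # conversion now drops that attribute
index_after = attr(DF, "index")
```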
dplyr::arrange(DT) uses vctrs::vec_slice which retains data.table’s class but uses C to bypass [ method dispatch and does not adjust data.table’s attributes containing the index row numbers, #5042. data.table’s long-standing .internal.selfref mechanism to detect such operations by other packages was not being checked by data.table when using indexes, causing data.table filters and joins to use invalid indexes and return incorrect results after a dplyr::arrange(DT). Thanks to @Waldi73 for reporting; @avimallu, @tlapak, @MichaelChirico, @jangorecki and @hadley for investigating and suggestions; and @mattdowle for the PR. The intended way to use data.table is data.table::setkey(DT, col1, col2, ...) which reorders DT by reference in parallel, sets the primary key for automatic use by subsequent data.table queries, and permits rowname-like usage such as DT["foo",] which returns the now-contiguous-in-memory block of rows where the first column of DT’s key contains "foo". Multi-column-rownames (i.e. a primary key of more than one column) can be looked up using DT[.("foo",20210728L), ]. Using == in i is also optimized to use the key or indices, if you prefer using column names explicitly and ==. An alternative to setkey(DT) is returning a new ordered result using DT[order(col1, col2, ...), ].
A segfault occurred when nrow/throttle < nthread, #5077. With the default throttle of 1024 rows (see ?setDTthreads), at least 64 threads would be needed to trigger the segfault since there needed to be more than 65,535 rows too. It occurred on a server with 256 logical cores where data.table uses 128 threads by default. Thanks to Bennet Becker for reporting, debugging at C level, and fixing. It also occurred when the throttle was increased so as to use fewer threads; e.g. at the limit setDTthreads(throttle=nrow(DT)).
fread(file=URL) now works rather than failing with the error 'does not exist or is non-readable', #4952. fread(URL) and fread(input=URL) worked before and continue to work. Thanks to @pnacht for reporting and @ben-schwen for the PR.
fwrite(DF, row.names=TRUE) where DF has specific integer rownames (e.g. using rownames(DF) <- c(10L,20L,30L)) would ignore the integer rownames and write the row numbers instead, #4957. Thanks to @dgarrimar for reporting and @ColeMiller1 for the PR. Further, when quote='auto' (default) and the rownames are integers (either default or specific), they are no longer quoted.
test.data.table() would fail on test 1894 if the variable z was defined by the user, #3705. The test suite already ran in its own separate environment. That environment’s parent is no longer .GlobalEnv to isolate it further. Thanks to Michael Chirico for reporting, and Matt Dowle for the PR.
fread(text="a,b,c") (where input data contains no \n but text= has been used) now works instead of failing with the error 'file not found: a,b,c', #4689. Thanks to @trainormg for reporting, and @ben-schwen for the PR.
na.omit(DT) did not remove NA in nanotime columns, #4744. Thanks Jean-Mathieu Vermosen for reporting, and Michael Chirico for the PR.
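
The key and index items above can be illustrated with a minimal sketch. The table, its column names (id, day, v) and its values are hypothetical, chosen only for illustration. It sets a primary key with setkey(), uses the rowname-like lookups described above, and then checks that the key and index attributes are dropped when converting to a plain data.frame.

```r
library(data.table)

# Hypothetical toy table; names and values are illustrative only.
DT = data.table(id  = c("foo", "bar", "foo"),
                day = c(20210728L, 20210728L, 20210729L),
                v   = 1:3)

setkey(DT, id, day)                    # reorder by reference and set the primary key
foo_rows = DT["foo", ]                 # lookup on the first key column
one_row  = DT[.("foo", 20210728L), ]   # lookup on both key columns

setindex(DT, v)                        # add a secondary index on v
DF = as.data.frame(DT)                 # plain data.frame copy

is.null(attr(DF, "sorted"))            # primary key attribute is not carried over
is.null(attr(DF, "index"))             # index attribute is not carried over either
```

DT itself keeps its key and index; only the converted copy loses them.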

DT[, min(intCol, na.rm=TRUE), by=grp] would return Inf for any groups containing all NAs, with a type change from integer to numeric to hold the Inf, and with warning. Similarly max would return -Inf. Now NA is returned for such all-NA groups, without warning or type change. This is almost-surely less surprising, more convenient, consistent, and efficient. There was no user request for this, likely because our desire to be consistent with base R in this regard was known (base::min(x, na.rm=TRUE) returns Inf with warning for all-NA input). Matt Dowle made this change when reworking internals, #5105. The old behavior seemed so bad, and since there was a warning too, it seemed appropriate to treat it as a bug.

DT
#         A     B
#    <char> <int>
# 1:      a     1
# 2:      a    NA
# 3:      b     2
# 4:      b    NA

DT[, min(B,na.rm=TRUE), by=A]  # no change in behavior (no all-NA groups yet)
#         A    V1
#    <char> <int>
# 1:      a     1
# 2:      b     2

DT[3, B:=NA]                   # make an all-NA group
DT
#         A     B
#    <char> <int>
# 1:      a     1
# 2:      a    NA
# 3:      b    NA
# 4:      b    NA

DT[, min(B,na.rm=TRUE), by=A]  # old result
#         A    V1
#    <char> <num>              # V1's type changed to numeric (inconsistent)
# 1:      a     1
# 2:      b   Inf              # Inf surprising
# Warning message:             # warning inconvenient
# In gmin(B, na.rm = TRUE) :
#   No non-missing values found in at least one group. Coercing to numeric
#   type and returning 'Inf' for such groups to be consistent with base

DT[, min(B,na.rm=TRUE), by=A]  # new result
#         A    V1
#    <char> <int>              # V1's type remains integer (consistent)
# 1:      a     1
# 2:      b    NA              # NA because there are no non-NA, naturally
                               # no inconvenient warning

On the same basis, min and max methods for empty IDate input now return NA_integer_ of class IDate, rather than NA_real_ of class IDate together with base R’s warning 'no non-missing arguments to min; returning Inf', #2256. The type change and warning would cause an error in grouping; see the example below. Since NA was returned before, it seems clear that still returning NA but of the correct type and with no warning is appropriate, backwards compatible, and a bug fix. Thanks to Frank Narf for reporting, and Matt Dowle for fixing.

DT
#             d      g
#        <IDat> <char>
# 1: 2020-01-01      a
# 2: 2020-01-02      a
# 3: 2019-12-31      b

DT[, min(d[d>"2020-01-01"]), by=g]

# was:

# Error in `[.data.table`(DT, , min(d[d > "2020-01-01"]), by = g) :
#   Column 1 of result for group 2 is type 'double' but expecting type
#   'integer'. Column types must be consistent for each group.
# In addition: Warning message:
# In min.default(integer(0), na.rm = FALSE) :
#   no non-missing arguments to min; returning Inf

# now :

#         g         V1
#    <char>     <IDat>
# 1:      a 2020-01-02
# 2:      b       <NA>

DT[, min(int64Col), by=grp] (and max) would return incorrect results for bit64::integer64 columns, #4444. Thanks to @go-see for reporting, and Michael Chirico for the PR.

fread(dec=',') was able to guess sep=',' and return an incorrect result, #4483. Thanks to Michael Chirico for reporting and fixing. It was already an error to provide both sep=',' and dec=',' manually.

fread('A|B|C\n1|0,4|a\n2|0,5|b\n', dec=',')  # no problem

#        A     B      C
#    <int> <num> <char>
# 1:     1   0.4      a
# 2:     2   0.5      b

fread('A|B,C\n1|0,4\n2|0,5\n', dec=',')

#       A|B     C    # old result guessed sep=',' despite dec=','
#    <char> <int>
# 1:    1|0     4
# 2:    2|0     5

#        A   B,C     # now detects sep='|' correctly
#    <int> <num>
# 1:     1   0.4
# 2:     2   0.5

IDateTime() ignored the tz= and format= arguments because ... was not passed through to submethods, #2402. Thanks to Frank Narf for reporting, and Jens Peder Meldgaard for the PR.

IDateTime("20171002095500", format="%Y%m%d%H%M%S")

# was :
# Error in charToDate(x) :
#   character string is not in a standard unambiguous format

# now :
#         idate    itime
#        <IDat>  <ITime>
# 1: 2017-10-02 09:55:00

DT[i, sum(b), by=grp] (and other optimized-by-group aggregates: mean, var, sd, median, prod, min, max, first, last, head and tail) could segfault if i contained row numbers and one or more were NA, #1994. Thanks to Arun Srinivasan for reporting, and Benjamin Schwendinger for the PR.

identical(fread(text="A\n0.8060667366\n")$A, 0.8060667366) is now TRUE, #4461. This is one of 13 numbers in the set of 100,000 between 0.80606 and 0.80607 in 0.0000000001 increments that were not already identical. In all 13 cases R’s parser (same as read.table) and fread straddled the true value by a very similar small amount. fread now uses /10^n rather than *10^-n to match R identically in all cases. Thanks to Gabe Becker for requesting consistency, and Michael Chirico for the PR.

for (i in 0:99999) {
  s = sprintf("0.80606%05d", i)
  r = eval(parse(text=s))
  f = fread(text=paste0("A\n",s,"\n"))$A
  if (!identical(r, f))
    cat(s, sprintf("%1.18f", c(r, f, r)), "\n")
}
#        input    eval & read.table         fread before            fread now
# 0.8060603509 0.806060350899999944 0.806060350900000055 0.806060350899999944
# 0.8060614740 0.806061473999999945 0.806061474000000056 0.806061473999999945
# 0.8060623757 0.806062375699999945 0.806062375700000056 0.806062375699999945
# 0.8060629084 0.806062908399999944 0.806062908400000055 0.806062908399999944
# 0.8060632774 0.806063277399999945 0.806063277400000056 0.806063277399999945
# 0.8060638101 0.806063810099999944 0.806063810100000055 0.806063810099999944
# 0.8060647118 0.806064711799999944 0.806064711800000055 0.806064711799999944
# 0.8060658349 0.806065834899999945 0.806065834900000056 0.806065834899999945
# 0.8060667366 0.806066736599999945 0.806066736600000056 0.806066736599999945
# 0.8060672693 0.806067269299999944 0.806067269300000055 0.806067269299999944
# 0.8060676383 0.806067638299999945 0.806067638300000056 0.806067638299999945
# 0.8060681710 0.806068170999999944 0.806068171000000055 0.806068170999999944
# 0.8060690727 0.806069072699999944 0.806069072700000055 0.806069072699999944
#
# remaining 99,987 of these 100,000 were already identical

dcast(empty-DT) now returns an empty data.table rather than error Cannot cast an empty data.table, #1215. Thanks to Damian Betebenner for reporting, and Matt Dowle for fixing.
DT[factor("id")] now works rather than error i has evaluated to type integer. Expecting logical, integer or double, #1632. DT["id"] has worked forever by automatically converting to DT[.("id")] for convenience, and joins have worked forever between char/fact, fact/char and fact/fact even when levels mismatch, so it was unfortunate that DT[factor("id")] managed to escape the simple automatic conversion to DT[.(factor("id"))] which is now in place. Thanks to @aushev for reporting, and Matt Dowle for the fix.
All-NA character key columns could segfault, #5070. Thanks to @JorisChau for reporting and Benjamin Schwendinger for the fix.
In v1.13.2 a version of an old bug was reintroduced where during a grouping operation list columns could retain a pointer to the last group. This affected only attributes of list elements and only if those were updated during the grouping operation, #4963. Thanks to @fujiaxiang for reporting and @avimallu and Václav Tlapák for investigating and the PR.
shift(xInt64, fill=0) and shift(xInt64, fill=as.integer64(0)) (but not shift(xInt64, fill=0L)) would error with INTEGER() can only be applied to a 'integer', not a 'double' where xInt64 conveys bit64::integer64, 0 is type double and 0L is type integer, #4865. Thanks to @peterlittlejohn for reporting and Benjamin Schwendinger for the PR.
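
As a small sketch of the DT[factor("id")] item above, the following uses a hypothetical keyed table (the names id and v are illustrative only) and compares the character lookup with the factor lookup.

```r
library(data.table)

# Hypothetical keyed table; names and values are illustrative only.
DT = data.table(id = c("a", "b", "c"), v = 1:3, key = "id")

by_char   = DT["b"]           # long-standing shorthand for DT[.("b")]
by_factor = DT[factor("b")]   # now converted to DT[.(factor("b"))]

by_char$v
by_factor$v
```

Both lookups are expected to select the same row of DT.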

DT[i, strCol:=classVal] did not coerce using the as.character method for the class, resulting in either an unexpected string value or an error such as 'To assign integer64 to a target of type character, please use as.character() for clarity'. Discovered during work on the previous issue, #5189.

DT
#         A
#    <char>
# 1:      a
# 2:      b
# 3:      c
DT[2, A:=as.IDate("2021-02-03")]
DT[3, A:=bit64::as.integer64("4611686018427387906")]
DT
#                      A
#                 <char>
# 1:                   a
# 2:          2021-02-03  # was 18661
# 3: 4611686018427387906  # was error 'please use as.character'

tables() failed with argument "..." is missing when called from within a function taking ...; e.g. function(...) { tables() }, #5197. Thanks @greg-minshall for the report and @michaelchirico for the fix.
DT[, prod(int64Col), by=grp] produced wrong results for bit64::integer64 due to incorrect optimization, #5225. Thanks to Benjamin Schwendinger for reporting and fixing.
fintersect(..., all=TRUE) and fsetdiff(..., all=TRUE) could return incorrect results when the inputs had columns named x and y, #5255. Thanks @Fpadt for the report, and @ben-schwen for the fix.
fwrite() could produce not-ISO-compliant timestamps such as 2023-03-08T17:22:32.:00Z when under a whole second by less than numerical tolerance of one microsecond, #5238. Thanks to @avraam-inside for the report and Václav Tlapák for the fix.
merge.data.table() silently ignored the incomparables argument, #2587. It is now implemented and any other ignored arguments (e.g. misspellings) are now warned about. Thanks to @GBsuperman for the report and @ben-schwen for the fix.
DT[, c('z','x') := {x=NULL; list(2,NULL)}] now removes column x as expected rather than incorrectly assigning 2 to x as well as z, #5284. The x=NULL is superfluous while the list(2,NULL) is the final value of {} whose items correspond to c('z','x'). Thanks @eutwt for the report, and @ben-schwen for the fix.
as.data.frame(DT, row.names=) no longer silently ignores row.names, #5319. Thanks to @dereckdemezquita for the fix and PR, and @ben-schwen for guidance.
data.table(...) unnamed arguments are deparsed in an attempt to name the columns but when called from do.call() the input data itself was deparsed taking a very long time, #5501. Many thanks to @OfekShilon for the report and fix, and @michaelchirico for guidance. Unnamed arguments to data.table(...) may now be faster in other cases not involving do.call() too; e.g. expressions spanning a lot of lines or other function call constructions that led to the data itself being deparsed.
```
DF = data.frame(a=runif(1e6), b=runif(1e6))
DT1 = data.table(DF)                 # 0.02s before and after
DT2 = do.call(data.table, list(DF))  # 3.07s before, 0.02s after
identical(DT1, DT2)                  # TRUE
```
fread(URL) with https: and ftps: could timeout if proxy settings were not guessed right by curl::curl_download, #1686. fread(URL) now uses download.file() as default for downloading files from urls. Thanks to @cderv for the report and Benjamin Schwendinger for the fix.
split.data.table() works for downstream methods that don’t implement DT[i] form (i.e., requiring DT[i, j] form, like plain data.frames), for example sf’s [.sf, #5365. Thanks @barryrowlingson for the report and @michaelchirico for the fix.
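
Two of the fixes above, multi-column := with a NULL item and as.data.frame(DT, row.names=), can be sketched as follows. The table and the row names are hypothetical. The sketch uses the plain list(2, NULL) form, leaving out the superfluous x=NULL from the reported case.

```r
library(data.table)

# Hypothetical table; names and values are illustrative only.
DT = data.table(x = 1:3, y = c("a", "b", "c"))

# Items of the RHS list correspond to c("z", "x"): z is assigned 2, x is removed.
DT[, c("z", "x") := list(2, NULL)]
names(DT)

# row.names is passed through rather than silently ignored.
DF = as.data.frame(DT, row.names = c("r1", "r2", "r3"))
rownames(DF)
```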

NOTES

New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of 0 (type double) vs 0L (type integer) is too inconvenient for you the user. So such coercions happen in data.table automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling set() many times in a loop, so there’s no speed penalty to warn you about either. However, we believe that assigning a character value such as "2" into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with as.<type>() so that it is clear to readers of your code that a coercion from character to that type is intended. For example:
```
x = c(2L,NA,4L,5L)
nafill(x, fill=3)                 # no warning; requiring 3L too inconvenient
nafill(x, fill="3")               # warns in case either x or "3" was a mistake
nafill(x, fill=3.14)              # warns that precision has been lost
nafill(x, fill=as.integer(3.14))  # no warning; the as.<type> conveys intent
```
CsubsetDT exported C function has been renamed to DT_subsetDT. This requires R_GetCCallable("data.table", "CsubsetDT") to be updated to R_GetCCallable("data.table", "DT_subsetDT"). Additionally there is now a dedicated header file for data.table C exports include/datatableAPI.h, #4643, thanks to @eddelbuettel, which makes it easier to import data.table C functions.
In v1.12.4, fractional fread(..., stringsAsFactors=) was added. For example if stringsAsFactors=0.2, any character column with fewer than 20% unique strings would be cast as factor. This is now documented in ?fread as well, #4706. Thanks to @markderry for the PR.
cube(DT, by="a") now gives a more helpful error that j is missing, #4282.
v1.13.0 (July 2020) fixed a segfault/corruption/error (depending on version of R and circumstances) in dcast() when fun.aggregate returned NA (type logical) in an otherwise character result, #2394. This fix was the result of other internal rework and there was no news item at the time. A new test to cover this case has now been added. Thanks Vadim Khotilovich for reporting, and Michael Chirico for investigating, pinpointing when the fix occurred and adding the test.
DT[subset] where DT[(subset)] or DT[subset==TRUE] was intended; i.e., subsetting by a logical column whose name conflicts with an existing function, now gives a friendlier error message, #5014. Thanks @michaelchirico for the suggestion and PR, and @ColeMiller1 for helping with the fix.
Grouping by a list column has its error message improved stating this is unsupported, #4308. Thanks @sindribaldur for filing, and @michaelchirico for the PR. Please add your vote and especially use cases to the #1597 feature request.
OpenBSD 6.9 released May 2021 uses a 16 year old version of zlib (v1.2.3 from 2005) plus cherry-picked bug fixes (i.e. a semi-fork of zlib) which induces Compress gzip error: -9 from fwrite(), #5048. Thanks to Philippe Chataignon for investigating and fixing. Matt asked on OpenBSD’s mailing list if zlib could be upgraded to 4 year old zlib 1.2.11 but forgot his tin hat: https://marc.info/?l=openbsd-misc&m=162455479311886&w=1.
?".", ?"..", ?".(", and ?".()" now point to ?data.table, #4385 #4407, to help users find the documentation for these convenience features available inside DT[...]. Recall that . is an alias for list, and ..var tells data.table to look for var in the calling environment as opposed to a column of the table.
DT[, lhs:=rhs] and set(DT, , lhs, rhs) no longer raise a warning on zero length lhs, #4086. Thanks to Jan Gorecki for the suggestion and PR. For example, DT[, grep("foo", names(DT)) := NULL] no longer warns if there are no column names containing "foo".
melt()’s internal C code is now more memory efficient, #5054. Thanks to Toby Dylan Hocking for the PR.
?merge and ?setkey have been updated to clarify that the row order is retained when sort=FALSE, and why NAs are always first when sort=TRUE, #2574 #2594. Thanks to Davor Josipovic and Markus Bonsch for the reports, and Jan Gorecki for the PR.
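
The convenience features mentioned in the notes above (the . alias, the .. prefix, and a zero-length LHS in :=) can be sketched on a hypothetical two-column table. The variable name cols is an assumption made for illustration.

```r
library(data.table)

# Hypothetical table; names and values are illustrative only.
DT = data.table(a = 1:3, b = 4:6)
cols = "b"                       # a variable in the calling environment, not a column

with_dot  = DT[, .(a)]           # . is an alias for list
with_list = DT[, list(a)]
from_env  = DT[, ..cols]         # ..cols looks up cols outside the table

# No column name contains "foo", so the LHS has length zero and DT is unchanged.
DT[, grep("foo", names(DT)) := NULL]
names(DT)
```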

For nearly two years, since v1.12.4 (Oct 2019) (note 11 below in this NEWS file), using options(datatable.nomatch=0) has produced the following message:

The option 'datatable.nomatch' is being used and is not set to the default NA. This option
is still honored for now but will be deprecated in future. Please see NEWS for 1.12.4 for
detailed information and motivation. To specify inner join, please specify `nomatch=NULL`
explicitly in your calls rather than changing the default using this option.

The message has now been upgraded to a warning stating that the option is ignored.
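
A minimal sketch of the replacement recommended in the message, using two hypothetical tables X and Y: pass nomatch=NULL in the call itself to get an inner join, rather than changing the default through the option.

```r
library(data.table)

# Hypothetical tables; names and values are illustrative only.
X = data.table(id = c("a", "b", "c"), x = 1:3)
Y = data.table(id = c("b", "c", "d"), y = 4:6)

all_of_Y = X[Y, on = "id"]                  # default nomatch=NA keeps unmatched rows of Y
inner    = X[Y, on = "id", nomatch = NULL]  # inner join: unmatched rows are dropped

nrow(all_of_Y)
nrow(inner)
```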

The options datatable.print.class and datatable.print.keys are now TRUE by default. They have been available since v1.9.8 (Nov 2016) and v1.11.0 (May 2018) respectively.
Thanks to @ssh352, Václav Tlapák, Cole Miller, András Svraka and Toby Dylan Hocking for reporting and bisecting a significant performance regression in dev. This was fixed before release thanks to a PR by Jan Gorecki, #5463.
key(x) <- value is now fully deprecated (from warning to error). Use setkey() to set a table’s key. We started warning not to use this approach in 2012, with a stronger warning starting in 2019 (1.12.2). This function will be removed in the next release.
Argument logicalAsInt to fwrite() now warns. Use logical01 instead. We stated the intention to begin removing this option in 2018 (v1.11.0). It will be upgraded to an error in the next release before being removed in the subsequent release.
Option datatable.CJ.names no longer has any effect, after becoming TRUE by default in v1.12.2 (2019). Setting it now gives a warning, which will be dropped in the next release.
In the NEWS for v1.11.0 (May 2018), section “NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES” item 2, we stated the intention to eventually change logical01 to be TRUE by default. After some consideration, reflection, and community input, we have decided not to move forward with this plan, i.e., logical01 will remain FALSE by default in both fread() and fwrite(). See discussion in #5856; most importantly, we think changing the default would be a major disruption to reading “sharded” CSVs where data with the same schema is split into many files, some of which could be converted to logical while others remain integer. We will retain the option datatable.logical01 for users who wish to use a different default – for example, if you are doing input/output on tables with a large number of logical columns, where writing ‘0’/‘1’ to the CSV many millions of times is preferable to writing ‘TRUE’/‘FALSE’.
Some clarity is added to ?GForce for the case when subtle changes to j produce different results because of differences in locale. Because data.table always uses the “C” locale, small changes to queries which activate/deactivate GForce might cause confusingly different results when sorting is involved, #5331. The inspirational example compared DT[, .(max(a), max(b)), by=grp] and DT[, .(max(a), max(tolower(b))), by=grp] – in the latter case, GForce is deactivated owing to the ad-hoc column, so the result for max(a) might differ for the two queries. An example is added to ?GForce. As always, there are several options to guarantee consistency, for example (1) use namespace qualification to deactivate GForce: DT[, .(base::max(a), base::max(b)), by=grp]; (2) turn off all optimizations with options(datatable.optimize = 0); or (3) set your R session to always sort in C locale with Sys.setlocale("LC_COLLATE", "C") (or temporarily with e.g. withr::with_locale()). Thanks @markseeto for the example and @michaelchirico for the improved documentation.
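
As a rough sketch of options (1) and (3) above, the following compares max() on a character column with and without namespace qualification. The table is hypothetical. The sketch temporarily sets LC_COLLATE to "C" so that both queries sort the same way, then restores the previous setting. Outside the "C" locale, the base::max result may differ depending on the session's collation.

```r
library(data.table)

# Hypothetical table with mixed-case strings; names and values are illustrative only.
DT = data.table(grp = c(1L, 1L, 2L), b = c("a", "B", "c"))

old_collate = Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")           # option (3): sort in C locale

gforce_on  = DT[, .(m = max(b)), by = grp]         # eligible for GForce
gforce_off = DT[, .(m = base::max(b)), by = grp]   # option (1): namespace-qualified

identical(gforce_on, gforce_off)

Sys.setlocale("LC_COLLATE", old_collate)   # restore the previous collation
```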

data.table v1.14.10 (Dec 2023) back to v1.10.0 (Dec 2016) has been moved to NEWS.1.md