Primitive Presto data types to R types

In this vignette, we introduce how primitive Presto data types are translated into R types in the RPresto package.

Category	Presto data type	R type
Boolean	BOOLEAN	logical
Integer	TINYINT	integer
Integer	SMALLINT	integer
Integer	INTEGER	integer
Integer	BIGINT	{integer\|integer64\|numeric\|character}
Floating-point	REAL	numeric
Floating-point	DOUBLE	numeric
Fixed-precision	DECIMAL	character
String	VARCHAR	character
String	CHAR	character
String	VARBINARY	raw
String	JSON	not supported
Date and Time	DATE	Date (S3 class)
Date and Time	TIME	difftime (S3 class)
Date and Time	TIME WITH TIME ZONE	difftime (S3 class)
Date and Time	TIMESTAMP	POSIXct (S3 class)
Date and Time	TIMESTAMP WITH TIME ZONE	POSIXct (S3 class)
Date and Time	INTERVAL YEAR TO MONTH	Duration (S4 class)
Date and Time	INTERVAL DAY TO SECOND	Duration (S4 class)

Data types mapping to R

Boolean

Translating boolean values from Presto to R is fairly straightforward. true and false values are mapped to TRUE and FALSE in R and null is mapped to NA which is by default a logical (i.e., boolean) type in R.

(
  df.boolean <- dbGetQuery(con, "SELECT boolean FROM presto_primitive_types")
)
#> # A tibble: 3 x 1
#>   boolean
#>   <lgl>  
#> 1 TRUE   
#> 2 FALSE  
#> 3 NA

We can verify that all three values in R are logical.

purrr::map_chr(df.boolean$boolean, class)
#> [1] "logical" "logical" "logical"

Integers

Presto has 4 integer data types.

TINYINT is 8-bit and ranges from -2^7 to 2^7-1.
SMALLINT is 16-bit and ranges from -2^15 to 2^15-1.
INTEGER is 32-bit and ranges from -2^31 to 2^31-1.
BIGINT is 64-bit and ranges from -2^63 to 2^63-1.

Presto data type	Bits	Minimum value	Maximum value
TINYINT	8	-128	127
SMALLINT	16	-32,768	32,767
INTEGER	32	-2,147,483,648	2,147,483,647
BIGINT	64	-9,223,372,036,854,775,808	9,223,372,036,854,775,807

In comparison, base R only ships with one 32-bit integer type, but the range is slightly different from Presto. R’s integer type ranges from -2^31+1 = -2,147,483,647 to 2^31-1 = 2,147,483,647. The one number (out of ~4.3 billion 32-bit integer numbers) that’s a valid INTEGER in Presto but not in R is the range lower bound -2,147,483,648.

Non-BIGINT integers

Since all TINYINT, SMALLINT, and INTEGER numbers (except -2,147,483,648) can be fit into R’s integer type, we translate all of them to integers.

(
  df.non_bigint_int <- dbGetQuery(
    con, "SELECT tinyint, smallint, integer FROM presto_primitive_types"
  )
)
#> # A tibble: 3 x 3
#>   tinyint smallint     integer
#>     <int>    <int>       <int>
#> 1    -128   -32768 -2147483647
#> 2     127    32767  2147483647
#> 3      NA       NA          NA

We can verify that all three columns in R are integer types.

purrr::map_chr(df.non_bigint_int, class)
#>   tinyint  smallint   integer 
#> "integer" "integer" "integer"

We can also show that mapping the one exception -2,147,483,648 to R results in an error while the query runs correctly in Presto. Users whose data contains the exceptional integer should consider using the BIGINT type rather than INTEGER to encode the data in Presto.

dbGetQuery(con, "SELECT CAST('-2147483648' AS INTEGER) AS non_bigint_exception")
#> # A tibble: 1 x 1
#>   non_bigint_exception
#>                  <int>
#> 1                   NA

BIGINT integers

Since the Presto BIGINT type is 64-bit, its range is clearly much wider than base R’s integer type.

However, by default we still map BIGINT integers from Presto to integer in R, for two reasons.

R doesn’t have native 64-bit integer support yet. Mapping BIGINT to a 64-bit integer in R will require additional package and add extra dependencies.
The 32-bit integer range covers 2+ billion positive integers and 4.3 billion integers in total. Most integer values used in data analysis and modeling can be well covered in that range.

Besides the default mapping to integer (albeit 32-bit) type, we offer three other options when it comes BIGINT mapping.

Firstly, using the bigint = "character" argument instructs RPresto to cast the BIGINT value to a character type. This is particularly useful when BIGINT is used to store long IDs rather than large numbers (i.e., the numbers are not used in arithmetic computations).
Using bigint = "integer64" makes RPresto translate BIGINT values to a value of the integer64 S3 class from the bit64 package.
bigint = "numeric" makes RPresto store the BIGINT value in a numeric ( i.e., double) type.

Whenever we map an integer to another numeric type, we always need to consider the precision of the mapping, that is, whether there’s any precision loss during the translation. Among the three translation options above, the first one (i.e., casting BIGINT to character) doesn’t involve any precision translation, so we will focus the precision discussion on the other two translations ( bit64::integer64 and numeric).

On the receiving end, the bit64::integer64 type has a range from -2^63+1 = -9,223,372,036,854,775,807 to 2^63-1 = 9,223,372,036,854,775,807.

bit64::lim.integer64()
#> integer64
#> [1] -9223372036854775807 9223372036854775807

Comparing the range to Presto’s BIGINT range, it seems that in theory the bit64::integer64 range is only infinitesimally smaller than the BIGINT range by 1 number (again, the lower bound number). However, in practice, the range of BIGINT values that can be translated into bit64::integer64 without precision loss is much smaller.

The limitation comes from how the Presto REST API communicates data with R. It uses the JSON format to encode the query result data and sends it to R for RPresto to process.

JSON by default encodes integers as double numbers, so its precision is limited to +/-(2^53-1) = +/-9,007,199,254,740,991. Any integers outside of this range will lose precision during the translation.

Since the translation limitation is caused by the JSON format encoding integers using double, the same limitation applies when BIGINT values are mapped to numeric types in R.

bigint =	R type	Range without precision loss	Range with possible precision loss
character	character	NA	NA
integer64	bit64::integer64	+/-9,007,199,254,740,991	+/- 9,223,372,036,854,775,807
numeric	numeric	+/-9,007,199,254,740,991	[-9,223,372,036,854,775,808, 9,223,372,036,854,775,807]

The table created by the create_primitive_types_table() function has the values from the no-precision-lost range. Below we show how different bigint arguments change the output R types.

bigint = "character"

dbGetQuery(
  con, "SELECT bigint FROM presto_primitive_types", bigint = "character"
)
#> # A tibble: 3 x 1
#>   bigint           
#>   <chr>            
#> 1 -9007199254740991
#> 2 9007199254740991 
#> 3 <NA>

bigint = "integer64"

dbGetQuery(
  con, "SELECT bigint FROM presto_primitive_types", bigint = "integer64"
)
#> # A tibble: 3 x 1
#>              bigint
#>             <int64>
#> 1 -9007199254740991
#> 2  9007199254740991
#> 3                NA

bigint = "numeric"

dbGetQuery(
  con, "SELECT bigint FROM presto_primitive_types", bigint = "numeric"
)
#> # A tibble: 3 x 1
#>              bigint
#>               <dbl>
#> 1 -9007199254740991
#> 2  9007199254740991
#> 3                NA

If you attempt to translate integers outside of the no-precision-loss range, a warning message will show up.

dbGetQuery(
  con,
  "
  SELECT SIGN(bigint) * (ABS(bigint) + 1) AS bigint_precision_loss
  FROM presto_primitive_types
  ",
  bigint = "numeric"
)
#> Warning in as.double.integer64(x): integer precision lost while converting to
#> double
#> # A tibble: 3 x 1
#>   bigint_precision_loss
#>                   <dbl>
#> 1     -9007199254740992
#> 2      9007199254740992
#> 3                    NA

Floating numbers

R only has one floating point data type, double (its class is numeric). All floating numbers are stored in double precision format (i.e., 64-bit). This matches Presto’s DOUBLE type, so translation between Presto and R is straightforward.

(
  df.floating_point <- dbGetQuery(
    con,
    "SELECT real, double FROM presto_primitive_types"
  )
)
#> # A tibble: 3 x 2
#>    real double
#>   <dbl>  <dbl>
#> 1     1      1
#> 2     2      2
#> 3    NA     NA

We can verify that both floating point types are translated to numeric.

purrr::map_chr(df.floating_point, class)
#>      real    double 
#> "numeric" "numeric"

Fixed-precision numbers

Presto offers a DECIMAL data type that has fixed precision up to 38 digits. This means that it can be used to represent a very large integer which is obviously outside of the precision limits mentioned above. We can’t decide one precise R class to translate the DECIMAL data type into, so RPresto currently translates the type into a string (i.e., character) in R.

(
  df.fixed_precision <- dbGetQuery(
    con,
    "SELECT decimal FROM presto_primitive_types"
  )
)
#> # A tibble: 3 x 1
#>   decimal            
#>   <chr>              
#> 1 -9007199254740991.5
#> 2 9007199254740991.5 
#> 3 <NA>

Strings

`VARCHAR` and `CHAR` types

VARCHAR and CHAR data types in Presto are mapped to R’s character type.

(
  df.characters <- dbGetQuery(
    con,
    "SELECT varchar, char FROM presto_primitive_types"
  )
)
#> # A tibble: 3 x 2
#>   varchar char 
#>   <chr>   <chr>
#> 1 abc     a    
#> 2 def     b    
#> 3 <NA>    <NA>

We can verify the resulting R types to be character.

purrr::map_chr(df.characters, class)
#>     varchar        char 
#> "character" "character"

`VARBINARY`

Presto’s VARBINARY type stores string data in raw bytes. It can be nicely mapped to R’s raw type.

(
  df.bytes <- dbGetQuery(
    con,
    "SELECT varbinary FROM presto_primitive_types"
  )
)
#> # A tibble: 3 x 1
#>   varbinary
#>   <list>   
#> 1 <raw [3]>
#> 2 <raw [3]>
#> 3 <raw [0]>

We can verify all elements in the resulting column are of raw data type.

purrr::map_chr(df.bytes$varbinary, class)
#> [1] "raw" "raw" "raw"

We can also convert the bytes data back to the string values.

dplyr::mutate(df.bytes, string = purrr::map_chr(varbinary, rawToChar))
#> # A tibble: 3 x 2
#>   varbinary string
#>   <list>    <chr> 
#> 1 <raw [3]> "abc" 
#> 2 <raw [3]> "def" 
#> 3 <raw [0]> ""

Date and Time

In R, there are three types of date/time data that refer to an instant in time.

A date. Tibbles print this as <date>.
A time within a day. Tibbles print this as <time>.
A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Elsewhere in R these are called POSIXct.

We prefer to use the lubridate package to handle date and date-time objects. Time objects are not very commonly used and R lacks a strong native support for it. In RPresto, we uses the hms package to handle time objects.

`DATE`

The DATE type is by far the most used date-and-time types. Note that a DATE value isn’t tied to any particular time zone, so it isn’t associated with a unique point in time (i.e., seconds or microseconds since epoch).

We use base R’s Date S3 class to translate Presto’s DATE type.

(
  df.date <- dbGetQuery(
    con,
    "SELECT date FROM presto_primitive_types"
  )
)
#> # A tibble: 3 x 1
#>   date      
#>   <date>    
#> 1 2000-01-01
#> 2 2000-01-02
#> 3 NA

We can verify the R type of the column.

purrr::map_chr(df.date, class)
#>   date 
#> "Date"

`TIMESTAMP`

The POSIXct type values, on the other hand, are associated with a unique point in time. That is, they can be translated to a unique numeric value that refers to the number of time units (usually seconds, milliseconds, or microseconds) elapsed since epoch (i.e., the beginning of time). This is why the mode of a POSIXct value in R is numeric and you can call the as.integer() function on it.

foo <- lubridate::ymd_hms("2000-01-01 01:02:03", tz = "America/New_York")
mode(foo)
#> [1] "numeric"
as.integer(foo)
#> [1] 946706523

In this case, the integer value refers to the number of seconds elapsed since 1970-01-01 00:00:00 in the UTC time zone.

Presto’s TIMESTAMP and TIMESTAMP WITH TIME ZONE types follow the same logic and thus can be mapped to the POSIXct type. To ensure consistency, RPresto always translates the timestamp to match the Presto session’s timezone.

You can check the session time zone by printing the session.timezone slot of the PrestoConnection object.

con@session.timezone
#> [1] "UTC"

Here we get the TIMESTAMP and TIMESTAMP WITH TIME ZONE values from the table.

(
  df.posixct <- dbGetQuery(
    con,
    "SELECT timestamp, timestamp_with_tz FROM presto_primitive_types"
  )
)
#> # A tibble: 3 x 2
#>   timestamp           timestamp_with_tz  
#>   <dttm>              <dttm>             
#> 1 2000-01-01 01:02:03 2000-01-01 06:02:03
#> 2 2000-01-02 02:03:04 2000-01-02 07:03:04
#> 3 NA                  NA

We can check the R types of the two columns.

purrr::map(df.posixct, class)
#> $timestamp
#> [1] "POSIXct" "POSIXt" 
#> 
#> $timestamp_with_tz
#> [1] "POSIXct" "POSIXt"

We can also verify that the time zone attributes match the session time zone.

purrr::map_chr(df.posixct$timestamp, lubridate::tz)
#> [1] "UTC" "UTC" "UTC"
purrr::map_chr(df.posixct$timestamp_with_tz, lubridate::tz)
#> [1] "UTC" "UTC" "UTC"

`INTERVAL`s

Presto has two INTERVAL types: INTERVAL YEAR TO MONTH and INTERVAL DAY TO SECOND. We map both of them to lubridate::Duration-class.

(
  df.duration <- dbGetQuery(
    con,
    "
    SELECT
      interval_year_to_month,
      interval_day_to_second
    FROM presto_primitive_types
    "
  )
)
#> # A tibble: 3 x 2
#>   interval_year_to_month  interval_day_to_second
#>   <Duration>              <Duration>            
#> 1 36817200s (~1.17 years) 187506.5s (~2.17 days)
#> 2 73634400s (~2.33 years) 284889.6s (~3.3 days) 
#> 3 NA                      NA

Primitive Presto data types to R types

Package setup

Overview

Walkthrough preparation

Local Presto server in memory

Presto connection in R

Primitive data types table in Presto

Data types mapping to R

Boolean

Integers

Non-BIGINT integers

BIGINT integers

Floating numbers

Fixed-precision numbers

Strings

`VARCHAR` and `CHAR` types

`VARBINARY`

Date and Time

`DATE`

`TIMESTAMP`

`INTERVAL`s

Walkthrough wrap-up

Primitive Presto data types to R types

Package setup

Overview

Walkthrough preparation

Local Presto server in memory

Presto connection in R

Primitive data types table in Presto

Data types mapping to R

Boolean

Integers

Non-BIGINT integers

BIGINT integers

Floating numbers

Fixed-precision numbers

Strings

VARCHAR and CHAR types

VARBINARY

Date and Time

DATE

TIMESTAMP

INTERVALs

Walkthrough wrap-up

`VARCHAR` and `CHAR` types

`VARBINARY`

`DATE`

`TIMESTAMP`

`INTERVAL`s