Extracting Variables from Cost Reports

Robert J. Gambrel

2017-04-02

This vignette walks through a brief example of how to use this package to extract data from Medicare cost reports. It distinguishes the parts the package can handle automatically from the parts that require you to consult the cost report documentation or the actual worksheets themselves.

Loading Cost Report Data

Medicare cost reports for skilled nursing facilities, hospitals, and home health agencies are available at CMS’s website. There is a separate site for hospice cost reports. Documentation of how each sector reports, along with copies of the actual cost report worksheets, is available here; these will help you determine which worksheet, row, and column you need to extract a particular variable of interest.

Hospices and HHAs have used the same reporting forms since the mid-1990s. Skilled nursing facilities and hospitals, however, switched their reporting guidelines in 2010, so extracts written for variables on the 1996 forms will not work on data reported under the 2010 forms. Crosswalks are available on the sidebar here. Once you find the variable you want in the 1996 form, it’s easy to translate the worksheet number, row, and column to the newer format. It’s good practice to check a few facilities’ results from both sources to make sure the data is consistent across the reporting switch; this will confirm your extracts work for both periods.
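
For example, a consistency check across the switch might look like the sketch below, using the cr_extract() function introduced later in this vignette. This is illustrative only: the worksheet, row, and column codes and the 2009/2011 objects are placeholders rather than real crosswalk values, so substitute whatever you find in the documentation and crosswalk.

# hypothetical consistency check across the 2010 reporting switch; the
# worksheet/row/column codes and the *_2009 / *_2011 objects are placeholders
old_beds <- cr_extract(nmrc_2009, "S300001", 100, 100, "beds")
new_beds <- cr_extract(nmrc_2011, "S300001", 200, 100, "beds")
# rpt_rec_num changes with every submission, so attach each facility's CMS
# provider number from the corresponding rpt file before comparing years
old_beds <- merge(old_beds, rpt_2009[, c("rpt_rec_num", "prvdr_num")])
new_beds <- merge(new_beds, rpt_2011[, c("rpt_rec_num", "prvdr_num")])
# eyeball a few providers that appear on both sides of the switch
both <- merge(old_beds, new_beds, by = "prvdr_num", suffixes = c("_2009", "_2011"))
head(both[, c("prvdr_num", "beds_2009", "beds_2011")])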

In order to extract the variables, you’ll have to consult the actual forms the facilities fill out. In my experience, the appropriate documentation files are:

In this vignette, we’ll focus on data from the hospice cost reports. That is one of the smaller datasets, and its documentation is relatively straightforward. Hospice reporting rules also haven’t changed over time, so we could download every year’s data and run the same extract on each one if we wanted to.

Demo data

I’ve included cost report data for 500 hospices in 2014. The data is raw and identical to what you get when importing the downloaded CSV, so it has no headers or names and is initially pretty unwieldy.

library(medicare)
library(dplyr)
library(magrittr)
# optional for final maps
library(ggplot2)
library(maps)
alpha_14 <- hospiceALPHA
nmrc_14 <- hospiceNMRC
rpt_14 <- hospiceRPT

These are pretty indiscernible at first glance, and they don’t have variable names by default. The names are all available in the documentation, but I’ve made wrapper functions that make assigning them quick and painless. Still, it’s hard to know what to make of the data.

names(alpha_14) <- cr_alpha_names()
names(nmrc_14) <- cr_nmrc_names()
names(rpt_14) <- cr_rpt_names()

lapply(list(alpha_14, nmrc_14, rpt_14), head)
## [[1]]
##   rpt_rec_num wksht_cd line_num clmn_num
## 1       34033  A000000      100        0
## 2       34033  A000000      200        0
## 3       34033  A000000      300        0
## 4       34033  A000000      400        0
## 5       34033  A000000      500        0
## 6       34033  A000000      600        0
##                 itm_alphanmrc_itm_txt
## 1   0100CAPITAL REL COSTS-BLDG & FIXT
## 2 0200CAPITAL REL COSTS-MOVABLE EQUIP
## 3 0300PLANT OPERATION AND MAINTENANCE
## 4            0400TRANSPORTATION-STAFF
## 5  0500VOLUNTEER SERVICE COORDINATION
## 6      0600ADMINISTRATIVE AND GENERAL
## 
## [[2]]
##   rpt_rec_num wksht_cd line_num clmn_num itm_val_num
## 1       34033  A000000    00400     0300          52
## 2       34033  A000000    00400     0600          52
## 3       34033  A000000    00400     0800          52
## 4       34033  A000000    00400     1000          52
## 5       34033  A000000    00500     0500           1
## 6       34033  A000000    00500     0600           1
## 
## [[3]]
##   rpt_rec_num prvdr_ctrl_type_cd prvdr_num npi rpt_stus_cd  fy_bgn_dt
## 1       34033                  4    111714  NA           1 11/26/2013
## 2       34071                  4    341598  NA           1 10/23/2013
## 3       34375                  5     31621  NA           1 10/11/2013
## 4       35065                  4    361664  NA           1 10/01/2013
## 5       35167                  6    671777  NA           1 11/01/2013
## 6       35451                  5    671784  NA           1 11/11/2013
##    fy_end_dt    proc_dt initl_rpt_sw last_rpt_sw trnsmtl_num fi_num
## 1 12/31/2013 04/23/2014           NA        <NA>         941  11001
## 2 12/31/2013 05/05/2014           NA        <NA>         941  11001
## 3 12/31/2013 06/04/2014           NA        <NA>         941   6014
## 4 12/31/2013 06/24/2014           NA        <NA>         942  11001
## 5 12/31/2013 06/24/2014           NA        <NA>         942  11001
## 6 12/31/2013 06/26/2014           NA        <NA>         942  11001
##   adr_vndr_cd fi_creat_dt util_cd npr_dt spec_ind fi_rcpt_dt
## 1         C05  04/10/2014      NA   <NA>       NA 04/08/2014
## 2         C05  05/01/2014      NA   <NA>       NA 04/23/2014
## 3         C05  05/30/2014      NA   <NA>       NA 05/27/2014
## 4         C05  06/16/2014      NA   <NA>       NA 05/22/2014
## 5         C05  06/17/2014      NA   <NA>       NA 05/29/2014
## 6         C05  06/18/2014      NA   <NA>       NA 06/02/2014

You’d be correct in surmising that rpt_rec_num is the internal link between the three files. The rpt file has one entry per hospice submission (usually just one per year, but sometimes more). The alpha and nmrc files, though, have many rows per submission, because they collapse data from multiple worksheets into one uniform long format: each row points to a single cell on a given worksheet.
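
A quick way to see this one-to-many structure is to count rows per report record in each file, for example with dplyr’s count() (output omitted here):

# one row per rpt_rec_num in rpt, many rows per rpt_rec_num in alpha and nmrc
count(rpt_14, rpt_rec_num) %>% head()
count(alpha_14, rpt_rec_num) %>% head()
count(nmrc_14, rpt_rec_num) %>% head()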

ALPHA and NMRC data

To extract a variable, you’ll need to look through the actual worksheets that facilities fill out. If you download the hospice documentation linked above, you’ll find an Excel workbook with multiple sheets. Some have address and location info; others report patient counts and treatment days; still others have staffing information and annual revenue and cost totals.

First, we can see that the hospice name is on worksheet S-1. Lines are numbered, and the name is on line 1; similarly for columns, it’s in column 1. The file convention is that the worksheet code is always 7 characters, with no punctuation and padded with trailing 0’s, and row and column numbers are always multiplied by 100. Worksheet S-1, line 1, column 1 therefore becomes worksheet "S100000", row 100, column 100. Since the name is an alphanumeric value, we should expect to find it in the alpha file. Note what happens if we try to extract it from the nmrc file.

hospice_names <- cr_extract(alpha_14, "S100000", 100, 100, "hospice_name")
nrow(hospice_names)
## [1] 500
hospice_names_nmrc <- cr_extract(nmrc_14, "S100000", 100, 100, "hospice_name")
## Warning in subset_row(worksheet_subset, row): No data found with specified
## row number.
## Warning in subset_column(row_subset, column): No data found with specified
## column number.
## Warning in cr_extract(nmrc_14, "S100000", 100, 100, "hospice_name"): Final
## result has no data. Double-check parameters or consider switching alpha/
## nmrc extraction dataset.

Several warnings are thrown for the attempted numeric extract. We can do similar extracts for the hospice address, state, and zip code from the alpha file, and for the ownership type, beneficiary count, costs, revenues, and net income from the nmrc file.

hospice_address <- cr_extract(alpha_14, "S100000", 100, 200, "address")
hospice_state <- cr_extract(alpha_14, "S100000", 100, 400, "state")
hospice_zip <- cr_extract(alpha_14, "S100000", 100, 500, "zip")
hospice_ownership <- cr_extract(nmrc_14, "S100000", 700, 100, "ownership")
hospice_benes <- cr_extract(nmrc_14, "S100000", 1600, 600, "benes")
hospice_costs <- cr_extract(nmrc_14, "G200002", 1500, 200, "costs")
hospice_revenues <- cr_extract(nmrc_14, "G200001", 600, 100, "revenues")
hospice_net_income <- cr_extract(nmrc_14, "G200002", 1600, 200, "net_income")

The zip codes were found in the alpha file, when you might expect them to be strictly numeric. Some variables are ambiguous like this and might require you to check both files. In this case, 9-digit zips were saved with a - after the first 5 digits, so zip code is stored as a character variable.
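
When it isn’t obvious which file holds a variable, one option is a small helper that tries both and keeps whichever extract returns data. This is a sketch only, not part of the package, and it assumes a failed cr_extract() returns NULL or an empty result (as the warnings above suggest):

# illustrative helper (not part of the package): try the alpha file first,
# then fall back to the nmrc file if that extract comes back empty
extract_either <- function(alpha, nmrc, worksheet, row, column, new_name) {
  result <- suppressWarnings(cr_extract(alpha, worksheet, row, column, new_name))
  if (is.null(result) || nrow(result) == 0) {
    result <- suppressWarnings(cr_extract(nmrc, worksheet, row, column, new_name))
  }
  result
}

zip_check <- extract_either(alpha_14, nmrc_14, "S100000", 100, 500, "zip")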

All the files can be linked by rpt_rec_num, so let’s merge them.

hospice_data <- Reduce(full_join, list(hospice_names, hospice_address, 
                                       hospice_state, hospice_zip, hospice_ownership,
                                       hospice_benes, hospice_costs, 
                                       hospice_revenues, hospice_net_income))
head(hospice_data)
##   rpt_rec_num               hospice_name                           address
## 1       34033     MT BERRY HOSPICE  INC.     4300 MARTHA BERRY HIGHWAY  NE
## 2       34071      AMEDISYS HOSPICE CARE              1072 US HIGHWAY 64 W
## 3       34375            HORIZON HOSPICE 7500 DREAMY DRAW DRIVE  SUITE 225
## 4       35065     QUEEN CITY HOSPICE LLC 4055 EXECUTIVE PARK DR  SUITE 240
## 5       35167   BEYONDFAITH HOSPICE  LLC           604 OAK STREET  STE 105
## 6       35451 HARBOR HOSPICE OF BAY CITY       12808 AIRPORT BLVD  STE 335
##   state        zip ownership benes  costs revenues net_income
## 1    GA      30165         4     1     NA       NA         NA
## 2    NC 27962-9396         4    13 156786    42822    -113964
## 3    AZ      85282         5    10 126576    55432     -71144
## 4    OH 45241-2019         4    29 526940   332545    -194395
## 5    TX 76450-3070         6     5 102770    45546     -57224
## 6    TX      77478         5    NA  10539       NA     -10539

rpt data

The rpt dataset has one entry per cost report filing. It includes the facility’s CMS provider ID as well as its NPI, which can be used to link to other data sources. It also has the fiscal year start and end dates, so you know whether the data is current as of the end of the calendar year or as of a mid-year fiscal year end. Many of the variables aren’t that useful, but it’s worth skimming the documentation to see what you need. For now, we’ll keep a few key variables and merge them with the rest of the data.

hospice_rpt_info <- rpt_14 %>% select(rpt_rec_num, prvdr_num, fy_bgn_dt, fy_end_dt)
hospice_all <- full_join(hospice_rpt_info, hospice_data)
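
One optional cleanup step, shown as a small sketch here, assumes the fiscal year dates were imported as mm/dd/yyyy character strings (as the printout above suggests) and converts them to R’s Date class before using them:

# convert the mm/dd/yyyy fiscal year dates to Date class
hospice_all <- hospice_all %>%
  mutate(
    fy_bgn_dt = as.Date(fy_bgn_dt, format = "%m/%d/%Y"),
    fy_end_dt = as.Date(fy_end_dt, format = "%m/%d/%Y")
  )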

Analyses and Takeaways

We now have a working dataset ready for some initial analyses. For starters, let’s recode the ownership variable, collapsing it into for-profit, nonprofit, and government-run groups.

hospice_all <- hospice_all %>%
  mutate(
    profit_group = ifelse(ownership <= 2, "nonprofit", 
                          ifelse(ownership > 2 & ownership <= 6, "for-profit",
                                 "government"))
  ) %>%
  mutate(
    profit_group = factor(profit_group, levels = c("for-profit", "nonprofit", "government")),
    per_bene_margin = net_income / benes
  )
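
# quick sanity check on the recode (output not shown): tabulate the groups
hospice_all %>% count(profit_group)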

# drop extreme outliers
upper_bound <- quantile(hospice_all$per_bene_margin, 0.99, na.rm = T)
lower_bound <- quantile(hospice_all$per_bene_margin, 0.01, na.rm = T)

graph_data <- hospice_all %>%
  filter(
    !is.na(per_bene_margin), 
    per_bene_margin <= upper_bound, 
    per_bene_margin >= lower_bound
  )

ggplot() +
  geom_boxplot(data = graph_data, aes(profit_group, per_bene_margin))

It looks like government-run agencies have very little variance in per-beneficiary profit rates. Overall, for-profit agencies appear to have higher average profit rates than nonprofit agencies, but both show high variation.

# use the state geometry data from the 'maps' package, via ggplot2's map_data()
state_map <- map_data("state")

# make state names lowercase, to match the region values in state_map
states <- data.frame(state.abb, state.name)
names(states) <- c("state", "state_name")
states$state <- as.character(states$state)
states$state_name <- tolower(states$state_name)

graph_data %<>% full_join(states, by = "state")

mean_by_state <- graph_data %>%
  filter(!is.na(state_name)) %>%
  group_by(state_name, profit_group) %>%
  summarize(
    mean_profits = mean(per_bene_margin, na.rm = T)
  )

ggplot() +
  geom_map(data = mean_by_state, 
           aes(map_id = state_name, fill = mean_profits),
           map = state_map) +
  expand_limits(x = state_map$long, y = state_map$lat) +
  facet_wrap(~profit_group) +
  scale_fill_gradient(low = "red", high = "blue")

Here, the sample size limits our ability to draw meaningful conclusions from the maps. The demo data includes only 500 of the 2,700+ observations available in the full cost reports, so there are many gaps. Still, this illustrates some of the potential of this data.