Specifying Column Types when Importing xlsx Data to R with Package readxl

jackw19

I'm importing Excel 2007 .xlsx tables into R 3.2.1 (patched) using package readxl 0.1.0 under 64-bit Windows 7. The tables are on the order of 25,000 rows by 200 columns.

Function read_excel() works a treat. My only problem is with its assignment of column class (datatype) to sparsely populated columns. For example, a given column may be NA for 20,000 rows and then take a character value on row 20,001. read_excel() appears to default to column type numeric when it scans the first n rows of a column and finds only NAs. The problem data are character values in a column assigned type numeric: when the error limit is reached, execution halts. I actually want the data in the sparse columns, so setting the error limit higher isn't a solution.

I can identify the troublesome columns by reviewing the warnings thrown. And read_excel() has an option for asserting a column's datatype by setting argument col_types according to the package docs:

Either NULL to guess from the spreadsheet or a character vector containing blank, numeric, date or text.

But does this mean I have to construct a vector of length 200, populated in almost every position with blank and with text in the handful of positions corresponding to the offending columns?

There's probably a way of doing this in a couple of lines of R code: create a vector of the required length and fill it with blanks, maybe create another vector containing the numbers of the columns to be forced to text, and then ... Or maybe it's possible to call out for read_excel() just the columns for which its guesses aren't as desired.
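For what it's worth, that two-line idea can be sketched in base R like this (the column numbers 57 and 121 are hypothetical stand-ins for whatever columns the warnings point at):

```r
# Build a 200-element type vector: mostly "blank", with the known
# problem columns forced to "text" (the indices here are hypothetical)
col_types <- rep("blank", 200)
text_cols <- c(57, 121)
col_types[text_cols] <- "text"
```

The resulting vector can then be passed as the col_types argument to read_excel().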

I'd appreciate any suggestions.

Thanks in advance.

Tags: r, readxl

Answers

answered 3 years ago jeremycg #1

Reading the source, it looks like column types are guessed by the functions xls_col_types and xlsx_col_types, which are implemented in Rcpp but have R wrappers with these defaults:

xls_col_types <- function(path, na, sheet = 0L, nskip = 0L, n = 100L, has_col_names = FALSE) {
    .Call('readxl_xls_col_types', PACKAGE = 'readxl', path, na, sheet, nskip, n, has_col_names)
}

xlsx_col_types <- function(path, sheet = 0L, na = "", nskip = 0L, n = 100L) {
    .Call('readxl_xlsx_col_types', PACKAGE = 'readxl', path, sheet, na, nskip, n)
}

My C++ is very rusty, but it looks like n = 100L is the argument controlling how many rows are scanned.

As these are non-exported functions, paste in:

fixInNamespace("xls_col_types", "readxl")
fixInNamespace("xlsx_col_types", "readxl")

And in the pop-up editor, change n = 100L to a larger number. Then rerun your file import.
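If you'd rather not edit the function interactively every session, the same change can be scripted with utils::assignInNamespace — a sketch that assumes readxl 0.1.0's internal signature, which may well change in later versions:

```r
# Replace the internal xlsx guesser with one whose default n is larger.
# The signature mirrors readxl 0.1.0; it is NOT guaranteed for other versions.
patched_xlsx_col_types <- function(path, sheet = 0L, na = "",
                                   nskip = 0L, n = 10000L) {
  .Call('readxl_xlsx_col_types', PACKAGE = 'readxl', path, sheet, na, nskip, n)
}
assignInNamespace("xlsx_col_types", patched_xlsx_col_types, ns = "readxl")
```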

answered 3 years ago Jack Wasey #2

It depends on whether your data is sparse in different places in different columns, and how sparse it is. I found that having more rows didn't improve the parsing: the majority were still blank and interpreted as text, even if later on they became dates, etc.

One workaround is to generate the first data row of your Excel table to include representative data for every column, and use that to guess column types. I don't like this because I want to leave the original data intact.

Another workaround, if you have complete rows somewhere in the spreadsheet, is to use nskip instead of n; nskip gives the starting point for the column guessing. Say data row 117 has a full set of data:

readxl:::xlsx_col_types(path = "a.xlsx", nskip = 116, n = 1)

Note that you can call the function directly, without having to edit the function in the namespace.

You can then use the vector of spreadsheet types to call read_excel:

col_types <- readxl:::xlsx_col_types(path = "a.xlsx", nskip = 116, n = 1)
dat <- readxl::read_excel(path = "a.xlsx", col_types = col_types)

Then you can manually update any columns which it still gets wrong.
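A minimal sketch of that last manual step (the five guessed types and the index of the wrong column are hypothetical):

```r
# Suppose the guesser returned these types for a five-column sheet,
# but we know the last column actually holds character data
col_types <- c("numeric", "text", "date", "numeric", "numeric")
col_types[5] <- "text"
# col_types can now be passed to readxl::read_excel(..., col_types = col_types)
```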

answered 3 years ago Stanislav #3

I have encountered a similar problem.

In my case, empty rows and columns were used as separators, and the sheets contained many tables with different formats. So neither the {openxlsx} nor the {readxl} package suits this situation: openxlsx removes empty columns (and there is no parameter to change this behaviour), while readxl behaves as you described, so some data may be lost.

As a result, I think the best solution, if you want to handle large amounts of Excel data automatically, is to read the sheets unchanged in 'text' format and then process the data frames according to your own rules.

This function can read sheets without changes (thanks to @jack-wasey):

loadExcelSheet <- function(excel.file, sheet) {
  require("readxl")
  sheets <- readxl::excel_sheets(excel.file)
  sheet.num <- match(sheet, sheets) - 1
  num.columns <- length(readxl:::xlsx_col_types(excel.file, sheet = sheet.num,
                                                nskip = 0, n = 1))

  return.sheet <- readxl::read_excel(excel.file, sheet = sheet,
                                     col_types = rep("text", num.columns),
                                     col_names = FALSE)
  return.sheet
}

answered 2 years ago Stefan Jansson #4

The internal functions for guessing column types can be told to scan any number of rows, but read_excel() doesn't expose that (yet?).

The solution below is just a rewrite of the original function read_excel() with an additional argument n_max that defaults to all rows. For lack of imagination, this extended function is named read_excel2.

Just replace read_excel with read_excel2 to evaluate column types by all rows.

# Inspiration: https://github.com/hadley/readxl/blob/master/R/read_excel.R 
# Rewrote read_excel() to read_excel2() with additional argument 'n_max' for number
# of rows to evaluate in function readxl:::xls_col_types and
# readxl:::xlsx_col_types()
# This is probably an unstable solution, since it calls internal functions from readxl.
# May or may not survive next update of readxl. Seems to work in version 0.1.0
library(readxl)

read_excel2 <- function(path, sheet = 1, col_names = TRUE, col_types = NULL,
                       na = "", skip = 0, n_max = 1050000L) {

  path <- readxl:::check_file(path)
  ext <- tolower(tools::file_ext(path))

  switch(readxl:::excel_format(path),
         xls =  read_xls2(path, sheet, col_names, col_types, na, skip, n_max),
         xlsx = read_xlsx2(path, sheet, col_names, col_types, na, skip, n_max)
  )
}
read_xls2 <- function(path, sheet = 1, col_names = TRUE, col_types = NULL,
                     na = "", skip = 0, n_max = 1050000L) {

  sheet <- readxl:::standardise_sheet(sheet, readxl:::xls_sheets(path))

  has_col_names <- isTRUE(col_names)
  if (has_col_names) {
    col_names <- readxl:::xls_col_names(path, sheet, nskip = skip)
  } else if (readxl:::isFALSE(col_names)) {
    col_names <- paste0("X", seq_along(readxl:::xls_col_names(path, sheet)))
  }

  if (is.null(col_types)) {
    col_types <- readxl:::xls_col_types(
      path, sheet, na = na, nskip = skip, has_col_names = has_col_names, n = n_max
    )
  }

  readxl:::xls_cols(path, sheet, col_names = col_names, col_types = col_types, 
                    na = na, nskip = skip + has_col_names)
}

read_xlsx2 <- function(path, sheet = 1L, col_names = TRUE, col_types = NULL,
                       na = "", skip = 0, n_max = 1050000L) {
  path <- readxl:::check_file(path)
  sheet <-
    readxl:::standardise_sheet(sheet, readxl:::xlsx_sheets(path))

  if (is.null(col_types)) {
    col_types <-
      readxl:::xlsx_col_types(
        path = path, sheet = sheet, na = na, nskip = skip + isTRUE(col_names), n = n_max
      )
  }

  readxl:::read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, na = na,
             nskip = skip)
}

You might take an evil performance hit because of this extended guessing. I haven't tried it on really big data sets yet, just on data small enough to verify that the function works.

answered 2 years ago C8H10N4O2 #5

Reviewing the source, we see that there is an Rcpp call that returns the guessed column types:

xlsx_col_types <- function(path, sheet = 0L, na = "", nskip = 0L, n = 100L) {
    .Call('readxl_xlsx_col_types', PACKAGE = 'readxl', path, sheet, na, nskip, n)
}

You can see that by default, nskip = 0L, n = 100L checks the first 100 rows to guess column types. You can change nskip to ignore the header text and increase n (at the cost of a much slower runtime) by doing:

col_types <- .Call('readxl_xlsx_col_types', PACKAGE = 'readxl',
                   path = file_loc, sheet = 0L, na = "",
                   nskip = 1L, n = 10000L)

# if a column type is "blank", no values yet encountered -- increase n or just guess "text"
col_types[col_types=="blank"] <- "text"

raw <- read_excel(path = file_loc, col_types = col_types)

Without looking at the Rcpp source, it's not immediately clear to me whether nskip = 0L skips the header row (the zeroth row in C++ counting) or skips no rows. I avoided the ambiguity by just using nskip = 1L, since skipping one row of my dataset doesn't affect the overall column type guesses.

answered 9 months ago R Yoda #6

New solution since readxl version 1.x:

The solution in the currently preferred answer no longer works with versions of readxl newer than 0.1.0, since the package-internal function readxl:::xlsx_col_types it relies on no longer exists.

The new solution is to use the newly introduced parameter guess_max to increase the number of rows used to guess the appropriate data type of each column:

read_excel("My_Excel_file.xlsx", sheet = 1, guess_max = 1048576)

The value 1,048,576 is the maximum number of rows Excel currently supports; see the Excel specs: https://support.office.com/en-us/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

PS: If you are worried about the performance cost of using all rows to guess the data types: read_excel seems to read the file only once, and the guessing is then done in memory, so the performance penalty is very small compared to the work saved.
