Reading and Writing an Excel File in R - Paul Oldham's Analytics Blog

This post was updated in 2018 and you can read it here

The CRAN Project has the following to say about importing Excel files into R.

“The first piece of advice is to avoid doing so if possible! If you have access to Excel, export the data you want from Excel in tab-delimited or comma-separated form, and use read.delim or read.csv to import it into R. (You may need to use read.delim2 or read.csv2 in a locale that uses comma as the decimal point.).”

This is very sound advice. The best option when dealing with Excel is generally to use save as to save the file as a .csv and then import it into R. However, there are a number of ways of reading an Excel file into R. We will deal with two of them in this walk through focusing on the patent datasets in our open access patent datasets folder. Download the GitHub .zip file here. Feel free to use your own dataset.

One challenge with R and Excel files is that no one package seems to do everything that you want. In particular, reading from URLs is a bit of a minefield particularly on secure connections (https:). If this walk through doesn’t meet your needs then try this R-bloggers overview on the range of available packages. The R-bloggers excel topic listing also has lots of useful articles covering working with Excel in more depth than this short article. To find additional help try stackoverflow. We will focus on:

Using the xlsx package
Testing the new readxl package

To read an Excel file into R first install the package or tick the box in the Packages list to load it or load the library.

install.packages("xlsx")

Load the library

library(xlsx)

## Loading required package: rJava

## Loading required package: xlsxjars

You can use your own local excel file but we will use the file wipotrends in the patent dataset folder for this example. Other test Excel datasets in the folder are ewaste and solarcooking. Download the file and save it to your computer. Then copy the local file path.

Reading a local file

We will use a file called wipotrends

Let’s open the file up to inspect it briefly. We will see that it contains one worksheet and that the column headings begin at row 5. To load it into R we will use the read.xlxs function and specify arguments to tell R where to look for and handle the data.

wipotrends <- read.xlsx("/Users/pauloldham17inch/Desktop/open_source_master/2_datasets/wipo/wipotrends.xlsx", sheetIndex = 1, startRow = 5, endRow = 23, as.data.frame = TRUE, header=TRUE)

sheetIndex = n tells R to import the first worksheet (working numerically). startRow = n tells R where to start reading the data (if not the first row). endRow = n tells R where to stop reading the data. Note that in this case the data stops at row 23 from the first row. You do not need to specify this value but in some cases R will read in NA values for extra rows below the actual data (try excluding endRow = and reimport the data to test this) as.data.frame = tells R whether to convert the data into a data frame. Generally this is a good thing. The default will import the data as a list. header = TRUE tells R whether or not there are column headings in the start row.

In general it is good practice in your work to create Excel workbooks with 1 sheet and headings in the first row. However, as we can see from the WIPO example, reality tends to be different. That means that it is important to inspect and clean the data before hand. Keep a copy of the original file for reference by creating a .zip file. Other things to consider are:

Checking for corrupted characters and correcting them using find and replace in Excel or Open Office (see this video).
Tidy up column names by removing characters such as ‘' or brackets that could cause problems (for example R will generally import inventor(s) as inventor.s). Consider removing blank spaces in column titles or replacing with’_’ and regularising the case (e.g. all lower case ). This will make life easier later.
Dealing with any leading or trailing spaces using TRIM() in Excel or Open Office.
Filling blank cells with NA (see this quick video)
Any formulas, such as column or row sum functions, may not be wanted and could cause confusion when you run your own calculations.

The above preparation steps will generally take a few minutes but can save a lot of work later on. Jeff Leek provides a very good guide to preparatory steps in The Elements of Data Analytic Style and we will be following these steps in our patent analysis work.

Let’s take a look at the other available arguments by calling up the description.

?read.xlsx()

The range of arguments is below.

read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL, startRow=NULL, endRow=NULL, colIndex=NULL, as.data.frame=TRUE, header=TRUE, colClasses=NA, keepFormulas=FALSE, encoding="unknown", ...)

Because Excel workbooks often contain more than one sheet, R needs to know where to find the right sheet. This is generally easy to do by number rather than name using sheetName =. Row index = will indicate the rows that you want to extract (if there are specific rows).

startRow = will indicate whether to start reading into R from the first row or from a later row. Quite often there are spaces or explanatory text in the top row or rows. It pays to examine the dataset first and count the rows. As a matter of good practice use the first rows for column headings only and put other material elsewhere (a readme text file or a new worksheet called readme).

The endRow = argument specifies where to stop reading the data.

colIndex - indicates the columns that you want to extract. NULL is the default and will import all columns.

as.data.frame = TRUE helpfully tells R to create a data frame. If not then a List will be created.

header = TRUE or FALSE specifies whether the columns have names. In this case if we had not started at startRow = 5, the header would have appeared as “Figure.A.1.1.1.Trend.in.patent.applications.worldwide” followed by more text. To try this for yourself change the startRow to 1 and reimport the data giving wipotrends a different name.

Let’s take a look at wipotrends

wipotrends

##    Year Applications Growth.rate....
## 1  1995      1047700              NA
## 2  1996      1088800             3.9
## 3  1997      1163400             6.9
## 4  1998      1214900             4.4
## 5  1999      1269000             4.5
## 6  2000      1377800             8.6
## 7  2001      1456500             5.7
## 8  2002      1443300            -0.9
## 9  2003      1485800             2.9
## 10 2004      1570100             5.7
## 11 2005      1703600             8.5
## 12 2006      1794300             5.3
## 13 2007      1866000             4.0
## 14 2008      1914800             2.6
## 15 2009      1846800            -3.6
## 16 2010      1987600             7.6
## 17 2011      2149000             8.1
## 18 2012      2347700             9.2

In reviewing wipotrends note that the row numbers refer to data rows (we have excluded the padding in rows 1 -4). If we were spending time with this data we might also want to turn the columns to lowercase and growth rate to growth_rate (but see below on readxl).

Writing Excel Files

It is generally better to write a .csv file rather than an Excel file because the results can be used in a wider range of tools (including Excel) and will be cleaner (see below). However, to write an Excel file with the new data frame use the write.xlsx() function. Before running the command it is generally a good idea to use the command getwd() to display the working directory you are in so that you know where the file will be saved. To change the directory to a new location use setwd("yourpathtofile").

write.xlsx(wipotrends, "yourfilenamepath_new.xlsx", sheetName="Sheet1", col.names = TRUE, row.names = TRUE, append = FALSE, showNA = TRUE)

This will create a new file called wipotrends_new. Note three points here:

Give your file a new name if writing into the same directory. Otherwise R will overwrite your existing file. Assuming you don’t want to overwrite the original give the new file a sensible name.
If you select row.names = FALSE R will write a new column with row numbers (in this case)
Selecting showNA = TRUE will fill any blank cells with NA. That is useful when coming back into R to tidy up and select data. Blank cells are the enemy of calculations and it is better to fill the cells with a value where possible.

Writing Excel to CSV

While Excel is popular in reality it is better to use .csv when using or sharing data across a range of software tools. To write results into .csv use write.csv(). Call up the description for write.csv with ?write.csv in console. See the .csv walk through for further details.

write.csv(wipotrends, file = "yourfilenamepath_new.csv", row.names = FALSE)

Using the Readxl package

readxl is a new package from RStudio and is still a work in progress. We will cover it here because as the package develops it will become more popular and you are more likely to use it.

install.packages("readxl")

library(readxl)

At the moment readxl version 0.1.0 has two functions.

excel_sheets(path) where path is the path to the xls/xlsx file. This function will list all the sheets in an excel spreadsheet to help you select the sheet that you want to import.

For example, if we add a couple of random sheets to wipotrends and then use excel_sheets("myfilenamepath") will provide names that look something like this:

[1] “Sheet1” “my sheet” “another sheet”

This is very helpful if you don’t know how many sheets are in a workbook or you want to call them by name. 2. read_excel()

library(readxl)
read_wipo <- read_excel("/Users/pauloldham17inch/Desktop/WIPO_Training/training_datasets/wipo/wipotrends.xlsx", col_names = TRUE, na = "",  skip = 5)

When we read in this file and print it to the console we will notice something unusual.

read_wipo

## # A tibble: 17 x 3
##    `1995` `1047700`   X__1
##     <dbl>     <dbl>  <dbl>
##  1  1996.  1088800.  3.90 
##  2  1997.  1163400.  6.90 
##  3  1998.  1214900.  4.40 
##  4  1999.  1269000.  4.50 
##  5  2000.  1377800.  8.60 
##  6  2001.  1456500.  5.70 
##  7  2002.  1443300. -0.900
##  8  2003.  1485800.  2.90 
##  9  2004.  1570100.  5.70 
## 10  2005.  1703600.  8.50 
## 11  2006.  1794300.  5.30 
## 12  2007.  1866000.  4.00 
## 13  2008.  1914800.  2.60 
## 14  2009.  1846800. -3.60 
## 15  2010.  1987600.  7.60 
## 16  2011.  2149000.  8.10 
## 17  2012.  2347700.  9.20

That is while we have specified that col_names = TRUE and skip = 5 the function has not returned the column names in the dataset. While this is a bit puzzling ( the package was released less than a month ago), it suggests that it is still a work in progress. Unless this is a glitch with our data then one option would be to specify col_names = FALSE and then rename the X0 X1 X2 column names that are generated. readr is under active development and you can follow its progress here.

This is a useful reminder of one of the important principles of clean and tidy data. The first row should contain the column names.

Bear in mind that readxl may struggle with reading dates correctly, but expect that to also change in the future.

At the time of writing there is no write_excel function but expect that to change.

The main advantage of read_excel (as with read_csv in the readr package) is that the data imports into an easy to print object with three attributes a tbl_df a tbl and a data.frame.

If you are using dplyr and tidyr (and we assume that you will be) then the creation of a tbl_df makes life much easier.

In summary, readxl is a welcome development for those who prefer using Excel (or are forced too), but it is very recent. It’s main strength is the ability to easily see what worksheets are in a workbook and also the automatic creation of a data frame or data frame table at the time of import. The absence of a write function will hopefully encourage hardened Excel uses to adopt comma separated or tab delimited files as their standard and to take advantage of the fuller functionality of the readr package. You know it makes sense.

Reading Excel files from URL locations

It is faster to simply download the file to your drive, or swim the Atlantic ocean, than to successfully download an excel file on http: or, in particular https:. So maybe ask yourself what is the path of least resistance and run with that.

Getting Help and Further Resources

For additional functionality experiment with the very useful XLConnect package in Packages. Read the documentation on CRAN. This adds a lot of functionality in working with Excel files in R.
See the R-bloggers overview article on the range of packages for working with Excel files.

Paul Oldham
Updated 13/05/2015