Importing and exporting data

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • What problems are there with Excel files?

  • How can these problems be solved?

Objectives
  • Understand the difference between spreadsheets and tables.

  • Understand reasons for storing data in the CSV format

  • Export data from a spreadsheet to a CSV file.

  • Import data from a CSV file into a spreadsheet.

The Excel file format

By default, Excel, stores spreadsheet data in the Excel file format (*.xls or *.xlsx - depending on the Excel version).

The CSV format for data tables

When archiving, publishing or sharing data, it’s a good idea to consider an alterative format – one which is universal, open and static (no formulas, scripts, etc.) format.

The de facto standard format in these cases would be the CSV format (comma-separated values).

When it comes to data exchange and long-term storage, simplicity is king and CSV is perfect! But CSV files can only contain data tables (one per file), nothing of the other stuff you can have in a spreadsheet like formulas, merged cells, formatting, comments, figures, etc.

Options for CSV files

A CSV file comes with some options that you need to be aware of when saving data as CSV or importing data from CSV. The most important are:

Saving a table in Excel as CSV

All spreadsheet programs can save to CSV, but they differ a bit in how well they do it.

  1. From the top menu select ‘File’ and ‘Save as’.
  2. In the ‘Format’ field, from the list, select ‘Comma Separated Values’ (*.csv).
  3. Double check the file name and the location where you want to save it and hit ‘Save’.

Saving an Excel file to CSV

Warning: Excel does not give you the option to choose character encoding, but instead selects a legacy character encoding based on which system and language you’r using. Before sharing a CSV file from Excel, you should convert it to Unicode (UTF-8) – this can be done in a text editor like Notepad++.

Oddly enough, the Mac version of Excel provides a separate option in the file format menu called “CSV UTF-8 (Comma delimited)”, but this is not available in the Windows version.

CSV export is more straightforward from LibreOffice or Google Sheets. LibreOffice provides you with all the options you need, while Google Sheets just use sensible defaults.

Importing CSV files

Some software are quite smart when it comes to detecting the above-mentioned CSV options upon import. Excel is not, but it provides you with an import wizard where you can select the settings yourself by trying and failing. If you see question marks like this it means the file is encoded using a different encoding than the one you have selected:

Wrong encoding during import

In this case, the file was encoded as a legacy encoding called “Windows (ANSI)” or “CP-1252”, which is limited to the Latin alphabet with some extensions.

In the next step of the import, you have to select delimiter and text qualifier. This is easier since there are fewer options to choose from, and Excel will preview the result live. For the text qualifier setting, the default value is also usually the correct one.

Correct delimiter during import

Challenge: Importing a CSV file

  1. Download the CSV file with a list of all countries from DataHub: https://datahub.io/core/country-list (go to the ‘Data Files’ on the web page and look for “csv” under “Download” to the right).
  2. To import the csv-file, go to the ‘Data’ tab, click From text and locate the csv-file that you downloaded.
  3. Use the info from the ‘Importing CSV files’ section to import it properly.

A final note on cross-platform operability

(or, how typewriters are ruining your work)

By default, most coding and statistical environments expect UNIX-style line endings (\n) as representing line breaks. However, Windows uses an alternate line ending signifier (\r\n) by default for legacy compatibility with Teletype-based systems. As such, when exporting to CSV using Excel, your data will look like this:

data1,data2\r\n1,2\r\n4,5\r\n…

which, upon passing into most environments (which split on \n), will parse as:

data1
data2\r
1
2\r

thus causing terrible things to happen to your data. For example, 2\r is not a valid integer, and thus will throw an error (if you’re lucky) when you attempt to operate on it in R or Python. Note that this happens on Excel for OSX as well as Windows, due to legacy Windows compatibility.

There are a handful of solutions for enforcing uniform UNIX-style line endings on your exported CSVs:

  1. When exporting from Excel, save as a “Windows comma separated (.csv)” file
  2. If you store your data file under version control (which you should be doing!) using Git, edit the .git/config file in your repository to automatically translate \r\n line endings into \n. Add the follwing to the file (see the detailed tutorial):

    [filter "cr"]
    clean = LC_CTYPE=C awk '{printf(\"%s\\n\", $0)}' | LC_CTYPE=C tr '\\r' '\\n'
    smudge = tr '\\n' '\\r'` 
    

    and then create a file .gitattributes that contains the line:

    *.csv filter=cr
    
  3. Use dos2unix (available on OSX, *nix, and Cygwin) on local files to standardize line endings.

A note on Python and xls

There are Python packages that can read xls files (as well as Google spreadsheets). It is even possible to access different worksheets in the xls documents.

But

Key Points

  • Use the CSV file format for data storage and exchange