It's Time to Retire the CSV
Written by Alex Rasmussen on August 16, 2021
If you’ve worked with data for any length of time, you’ve come across the Comma-Separated Values (CSV) format. Its simplicity and ubiquity make CSV an extremely popular way for organizations to exchange data both internally and externally. While many programs can’t read or write Excel spreadsheets, almost anything can read and write CSVs, and a human can open a CSV file in any text editor and understand roughly what it contains.
Despite this ubiquity and ease of access, CSV is a wretched way to exchange data. The CSV format itself is notoriously inconsistent, with myriad competing and mutually-exclusive formats that often coexist within a single dataset (or, if you’re particularly unlucky, a single file). Exporting a dataset as a CSV robs it of a wealth of metadata that is very hard for the reader to reconstruct accurately, and many programs’ naïve CSV parsers ignore the metadata reconstruction problem entirely as a result. In practice, CSV’s human-readability is more of a liability than an asset.
CSVs often begin life as exported spreadsheets or table dumps from legacy databases, and often end life as a pile of undifferentiated files in a data lake, awaiting the restoration of their precious metadata so they can be organized and mined for insights. Much of the job of data preparation products is restoring the metadata lost when a spreadsheet is exported as a CSV. The world has lost a mind-boggling amount of time and information to CSV’s imprecision and under-specification.
I’ve spent many years battling CSVs in various capacities. As an engineer at Trifacta, a leading data preparation tool vendor, I saw numerous customers struggling under the burden of myriad CSVs that were the product of even more numerous Excel spreadsheet exports and legacy database dumps. As an engineering leader in biotech, my team struggled to ingest CSVs from research institutions and hospital networks who gave us sample data as CSVs more often than not. As a consultant, I’ve written more than one internal system that attempts to reconstruct the metadata of a CSV of unknown provenance using a combination of heuristics and brute force. In short, CSV is a scourge that has followed me throughout my career.
In recent years, it’s become much easier for organizations to exchange data using metadata-rich, clearly-defined, and well-structured file formats. Some organizations even directly share their live datasets with one another. Despite these advances, CSV is still the path of least resistance and, as a consequence, it remains the lingua franca of intra- and inter-organizational data exchange. The time has long passed to retire CSV and replace it with something better.
The trouble with CSV
The CSV format itself dates at least as far back as the early 1970s. Tired of having to manually align their input data to punch-cards’ columns, FORTRAN users quickly embraced “list-directed input/output”, a format in which (according to the FORTRAN IV manual for VM/370) “input entries are specified by blanks or commas, with successive commas indicating values to be omitted”. Most of us don’t use punch-cards anymore, but that ease of authorship remains one of CSV’s most attractive qualities. CSVs can be read and written by just about anything, even if that thing doesn’t know about the CSV format itself.
CSVs are easy to read and write because the format - if it can even be called a format - is extremely simple. At its most basic, CSV has only two delimiting characters: a comma to delimit columns, and a newline to delimit rows. In principle, reading such a file could hardly be simpler: scan the file one character at a time, building up a value’s characters in a buffer until you encounter a comma, and adding completed values to a row until you encounter a newline.
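Here's roughly what that naïve scan looks like as a Python sketch (my own illustration, not any particular library's parser):

```python
# A naïve CSV reader: commas end values, newlines end rows, nothing else.
# This only holds up if no value ever contains a comma, quote, or newline.
def parse_naive_csv(text):
    rows = []
    row, value = [], []
    for ch in text:
        if ch == ",":
            row.append("".join(value))
            value = []
        elif ch == "\n":
            row.append("".join(value))
            rows.append(row)
            row, value = [], []
        else:
            value.append(ch)
    if value or row:  # a final row with no trailing newline
        row.append("".join(value))
        rows.append(row)
    return rows

print(parse_naive_csv("a,b,c\n1,2,3\n"))  # [['a', 'b', 'c'], ['1', '2', '3']]
```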
In practice, unfortunately, it’s a lot more complicated than that.
One name, many formats
The CSV format is extremely spartan and largely informally specified. For decades, as long as you separated values with commas and terminated rows with newlines, you could say you were writing a CSV. It wasn’t until 2005 that a developer at a now-defunct startup in New York wrote RFC-4180, the most formal specification of CSV that I’m aware of. As a result of this under-specification, a whole slew of different CSV variants have arisen over the years.
The first big source of variation among CSV formats is which character(s) they expect to represent a newline. Usually, these variants use the operating system’s notion of a newline in that operating system’s virtual teletype: Linux, macOS and other Unix-like operating systems use a “line feed” character to denote a newline, while Windows uses a “carriage return” character followed by a line feed. Older Macs and a host of Zilog Z80-based microcomputers just used a single carriage return. If you work at an organization that’s old enough, you’ve probably encountered all of these variants at least once.
The second major source of variation between formats is how delimiter characters are escaped: that is, how the literal occurrence of a comma or newline (often called a “literal” for short) is differentiated from its delimiter equivalent. Some variants escape delimiters by replacing them with escape sequences, characters that signal to the parser “these characters are the literal version of a delimiter character”. One popular escape sequence is a backslash followed by a single character, with \,, \r, and \n representing a literal comma, carriage return, and newline, respectively. Some variants (including the one Excel uses) avoid escape sequences altogether and instead enclose all values within quotes, treating every character between the quotes as a literal. Of course, in this case the quote character is itself also a delimiter, and any literal quotes need to be escaped somehow. Excel uses two quotation marks ("") as an escaped version of the quote character. This has the unfortunate side-effect of making it easy for a manual editor to mistake an escaped quote for an empty quoted value and make an edit within the wrong pair of quotes. To account for these kinds of errors, many CSV parsers make quotes optional and only use them when values need to be escaped. This additional ambiguity introduces a whole new set of ways that manual editors can cause parse errors.
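If you happen to know which variant you're dealing with, most CSV libraries will at least let you spell these choices out explicitly. Here's a quick sketch using Python's standard csv module; the file names are placeholders, and the dialect settings are assumptions about files like the ones described above:

```python
import csv

# Excel-style: values enclosed in quotes, literal quotes doubled ("").
with open("export_from_excel.csv", newline="") as f:
    excel_rows = list(csv.reader(f, delimiter=",", quotechar='"', doublequote=True))

# Backslash-escape style: no quoting at all, delimiters escaped with a backslash.
with open("dump_from_legacy_db.csv", newline="") as f:
    escaped_rows = list(csv.reader(f, quoting=csv.QUOTE_NONE, escapechar="\\"))
```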
If you're lucky enough to deal with a CSV that's just been exported, you may be able to avoid these problems. If the file you're working with has been around for a while and has ever been edited manually (which happens far more often than you might think), you're likely about to play a prolonged game of missing-delimiter whack-a-mole before the file will parse properly.
Maddeningly missing metadata
Another big drawback of CSV is its almost complete lack of metadata. While a human can often intuit what a file contains by looking at it, it’s much harder for software to do that without being given a lot of hints.
Some variants of CSV (including RFC-4180) allow for the first row to be optionally used as a header, but few variants explicitly identify headers, leaving the parser either to be told there are headers or to try to make an educated guess.
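Python's standard library even ships a heuristic for exactly this guessing game. A small sketch (the file name is a placeholder, and has_header is documented as a best-effort guess):

```python
import csv

# Sniffer.has_header compares the first row against the rows below it
# and makes an educated guess -- nothing more.
with open("mystery.csv", newline="") as f:
    sample = f.read(4096)
print(csv.Sniffer().has_header(sample))  # True or False, either way a guess
```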
A column’s data type is even more difficult to determine automatically. Values in CSVs are just sequences of characters. They might represent more complex types, but there’s no way for a parser to know that type information just from looking at the file.
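In practice, tools fall back on brute force: try a handful of casts on each value and see what sticks. Something like this sketch, which is my own illustration rather than any particular product's logic:

```python
from datetime import datetime

def guess_type(value):
    # Try progressively looser interpretations of the raw characters.
    for cast, name in ((int, "integer"), (float, "float")):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y"):
        try:
            datetime.strptime(value, fmt)
            return "date"  # but which date format? We still can't tell.
        except ValueError:
            pass
    return "string"
```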
You might think that this fuzziness isn’t a big deal. After all, an expert user can examine the CSV, figure out if headers exist and where they are, determine each column’s type, and give all that information to the parser. This is a significant time sink for the user, of course, but it can be done. Unfortunately, when you’re dealing with CSVs in the wild, it’s rarely that simple. Often, you’ve got to reconcile hundreds or thousands of files from many different sources with one another. Each CSV’s schema may be subtly different from the last, and the only way to know for sure is to determine each file’s metadata in turn or (and this is much more common) just throw a bunch of similar CSVs at the same parser and hope for the best.
Even if you resign yourself to manually determining each CSV’s schema, a trained user may not be able to pin that schema down for certain just by looking at the file. For example, suppose you had a column of values that looks like this:
12/7/2021
5/11/2021
6/1/2021
1/4/2021
This column obviously contains dates, but which dates? Most of the world uses day-first dates and would read the column as July 12th, November 5th, January 6th, and April 1st. In marked contrast, Americans use month-first dates and would read these dates as December 7th, May 11th, June 1st, and January 4th. Even if you know what part of the world the CSV comes from, it’s impossible to know the author’s intended representation with certainty. Sometimes you’ll get lucky and run across a date like 31/5/2021 that makes it clear that the dates are day-first instead of month-first, but even in those cases, how do you know that someone didn’t try to enter 3/15/2021 manually and make a typo?
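To see how much the reader's assumption matters, here's a small sketch using the dateutil library, where the dayfirst flag is the only thing that changes:

```python
from dateutil import parser

value = "12/7/2021"
print(parser.parse(value, dayfirst=True))   # 2021-07-12 00:00:00 (July 12th)
print(parser.parse(value, dayfirst=False))  # 2021-12-07 00:00:00 (December 7th)
```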
A third major piece of metadata missing from CSVs is information about the file’s character encoding. Automatically detecting character encodings is impossible in the general case for much the same reason that we can’t be certain whether the dates in the example above are day-first or month-first: there’s often no conclusive way to exclude all but one candidate encoding. We’ve had pretty good auto-detectors for a while (e.g. chardet), but even if an auto-detector reads the entire file to attempt to determine its encoding, the detector can’t make any guarantees about being correct. This makes encoding detection a slow and error-prone process, and a lot of systems will just assume a common encoding like UTF-8 or ASCII and once again hope for the best.
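For what it's worth, a detector like chardet will at least tell you how unsure it is. A minimal sketch, with the file name assumed:

```python
import chardet

with open("mystery.csv", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)
# e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...} -- a guess with a
# confidence score, not a guarantee.
print(guess["encoding"], guess["confidence"])
```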
CSVs are built for humans, not machines
If you’ve made it this far, I’ve hopefully convinced you that CSV has a whole host of problems that are difficult or impossible to work around. These problems are manifold, but they share a common root cause: CSV’s human-readability. CSV has to worry about escaping delimiters because it’s using the same character set for both representing and delimiting data. That common character set makes it extremely easy to read and write CSVs, but it comes with a steep penalty. CSV’s laissez-faire attitude on character encodings is partly due to the format’s age (it predates UTF-8 by more than a decade), but it’s partly an acquiescence to the desire to make the format readable and writable by as many programs as possible with zero friction.
In the early 1970s, when computers were room-sized and mostly used by universities and research labs, these problems were easy to ignore. After all, it was a lot nicer than manually aligning columns on punch-cards, and data in its raw form was rarely shared outside the team of experts that had generated it. As with many things in software, however, CSV’s usage grew and spread in ways and at speeds that its authors could hardly have anticipated. By the time we realized just how much trouble we were in, the CSV format was omnipresent and entrenched.
If not CSV, then what?
I’m not going to advocate for any one successor file format here, but there are certainly a lot of contenders. The big data ecosystem gave us formats like Avro, Parquet, and Arrow that are widely used as an intermediate representation when transferring data between systems. HDF5 [1] is widely used in the scientific computing community. An enormous amount of structured information is stored in SQLite databases, collections of XML files, and any number of domain-specific binary formats.
Each potential successor format has its own strengths and weaknesses, but all successors have a few things in common.
First, files are self-describing. They contain metadata in a well-known location (either within the file itself or via a permalink to a separate schema file) that software can use to easily and unambiguously determine the file’s schema. This description sometimes doubles as an index into the file, which makes random reads a lot easier to do.
Second, values stored in the files are typed. The sophistication of the type system varies from format to format, but there’s always at least support for integers, floating-point numbers, booleans, strings, dates, and times. Many formats also allow for repeated and nested structures.
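To make those first two properties concrete, here's a rough sketch of reading a Parquet file with pyarrow; the file name and column name are placeholders of my own invention:

```python
import pyarrow.parquet as pq

table = pq.read_table("events.parquet")  # hypothetical file
print(table.schema)                      # column names and types ship with the file
print(table.column("timestamp").type)    # e.g. timestamp[ms] -- no sniffing required
```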
Third, and most importantly, these formats trade human readability and writability for precision. You can’t look at the bytes of an Avro or Parquet file and know at a glance what’s in it, nor can you edit that file without a specialized editor, but if you modify a value you can rest assured that you did so in a safe way. You can also assume that if you write a file and pass it to someone else, even someone you’ve never met in some far distant time, that person will be able to read exactly the data that you wrote.
You’ll notice that I didn’t mention either JSON or YAML as potential replacements for CSV. Despite (or, perhaps, as a contributing factor to) their overwhelming popularity, JSON and YAML exist in a fuzzy intermediate point between CSV and structured formats like Avro, carrying some of the benefits and disadvantages of each. Both JSON and YAML have a type system, can specify complex structures, and are reasonably self-describing (more so if you factor in things like JSON Schema). However, like CSV, JSON and YAML sacrifice precision in favor of human readability and writability. A YAML file can express far more complicated and metadata-rich structures than a CSV can, but you’re no less likely to corrupt a YAML file by using the wrong number of spaces to indent a block or forgetting quotes around a string that contains delimiting characters.
Ultimately, even if it could muster the collective will, the industry doesn’t have to unify around a single successor format. We only need to move to some collection of formats that are built with machine-readability and clarity as first-order design principles. But how do we get there?
How we get out of this mess
Truly retiring CSV as a format is going to take a lot of work. As with any other widespread and serious problem, our approach to getting rid of CSV is twofold: keep new CSVs from getting created, and actively port existing CSVs to better formats.
To prevent the problem from getting any worse, we have to stop normalizing CSV as an output or input format for data transfer. If no programs allowed users to export data as a CSV - opting for some other format(s) that meets the criteria described above instead - we could at least rest assured that the world isn’t producing any more CSVs that future unfortunates will have to wrangle. Of course, that’s an easy thing to say and a hard thing to do. Many organizations, particularly those in regulated industries like finance and healthcare, can hold onto software for decades without upgrading it. That said, someone has to draw the line in the sand sometime, and that time might as well be now.
In order to move the world off of CSVs, we also need to make its successor format(s) as easy for people to read and write as possible. If someone has to write a script, download a JAR, or pay for a single vendor’s product to read and write CSV’s replacement, that replacement will never take hold. To usurp CSV as the path of least resistance, its successor must have equivalent or superior support for both editing and viewing.
The biggest and thorniest problem to solve is the people problem: how do you convince people to stop creating new CSVs when they’ve never done things any other way? Fortunately - for this problem, anyway - most of the world’s business data is born in one of a handful of programs that are owned by an increasingly small number of companies. If Microsoft and Salesforce were somehow convinced to move away from CSV support in Excel and Tableau, a large portion of business users would move to a successor format as a matter of course. Of course, it’s debatable whether that kind of change is in the best interest of those companies, but I’m cautiously optimistic.
Even if we can’t stop CSVs from being created, we can at least aggressively decrease their number by converting them to better formats. In my (admittedly biased) view, this conversion process can only be accomplished through democratized, machine-assisted data transformation with a human in the loop. The process has to be democratized (in the sense that it can’t be the sole responsibility of a data engineering team) and machine-assisted because of the sheer size of the problem in many organizations. If you’ve only got a few dozen CSVs, you might be able to push through the conversion process in a week or two. If you’ve got tens of thousands or more, however, you may never finish unless you’ve got a lot of help and that help isn’t editing CSVs and writing conversion scripts by hand. CSV is ambiguous enough that humans need to be involved somewhere to help resolve that ambiguity, whether that human is training machine learning models that convert CSVs or performing the conversion task themselves with assistance from software.
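Once those humans have pinned down a given file's metadata, the mechanical part of the conversion is the easy bit. A sketch with pandas, where every column name, type, and parsing flag is a hypothetical stand-in for metadata someone had to recover first:

```python
import pandas as pd

# Everything passed here -- encoding, types, date columns, day-first parsing --
# is metadata that a human (or a model) had to recover; none of it lives in the CSV.
df = pd.read_csv(
    "samples.csv",
    encoding="utf-8",
    dtype={"sample_id": "string", "count": "int64"},
    parse_dates=["collected_on"],
    dayfirst=True,
)
df.to_parquet("samples.parquet", index=False)
```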
A proactive move away from CSVs is the ideal, but ultimately much of this work will likely be accomplished by the passage of time. Data has a shelf-life, after all. As technology changes and companies come and go, there may come a time when you just don’t see many CSVs anymore. That change will not happen on its own, however. Authors of data-centric software will have to actively push for better file formats, both for their own well-being and that of their customers. Companies who are serious about “digital transformation” will have to recognize the problems that CSVs pose and actively work to address them. With enough concerted effort and a little luck, future generations may view CSV as a curious relic of a bygone age rather than an active nuisance. For my part, I hope to see that glorious future and will continue to do my part to bring it about, but we may be stuck with CSV for a long, long time.
Thanks for reading! If your organization is struggling with its CSV problem, I'd love to talk to you more about it - feel free to contact me, or subscribe to the Bits on Disk newsletter by entering your email address in the form at the bottom of this page.
[1] The visual similarity of HDF5’s format name to the name of Hadoop’s HDFS distributed storage system is proof that the universe has a sense of humor.