The ever painful truth about CSV

Data Thistle
Feb 11, 2020

In the last five years, we have gathered information related to over 7,000,000 performances of around 700,000 live events held at one of over 80,000 venues, so while we claim significant domain experience, we still face challenging problems related to certain types of data.

Over the years, we have significantly increased the processing of structured feeds of data. Typically, these come from a range of box office systems, theatre chains, festival organisers and music promoters. As recently as 2010, manual data entry provided 90% of our live events data, but now over 90% of our event records come from structured feeds. Each week, we load files related to thousands of venues, acquiring all the new information and merging the overlapping data that comes from more than one source.

But nearly 10% of our events, or around 24,000 in 2018, arrive in other ways. Our website provides an online event submission form which we have steadily developed over the years, adding formatting tips, venue lookups and additional guidance, and we do end up with data in our structure. Unlike data from feeds, where we can be sure of the authority of the source, this data requires checks before we can pass it for publication and distribution. The online form is a robust way of gathering the most fragmented data.

Within this final 10%, alongside the website submission, we also work with organisers, such as festival marketing teams, whose most common data management practice is to keep listings in comma-separated value (CSV) files. These listings remain challenging but relevant for us to gather, and they represent several thousand events a year.

To date, we have offered a service that lets event submitters send us CSV files containing more than a minimum number of events. We tried to address variation in file format and consistency with a CSV template, which we make available. However, event organisers often have a variety of intended uses for their sheet, or may not realise we provide a template until they contact us with a file in their own format.

Arriving at the view that CSV files are not a fruitful solution for the interchange of data is hardly a revelation, and it is not restricted to events data. Quite a few companies have tried over the last decade to create smart CSV data mapping and management services. Notable examples include Google’s Needlebase and Google Refine, which later became OpenRefine.

None has solved the central problem: CSV files are unstructured, so the data must be cleaned before it can be mapped, and that cleaning is slow, hard work.
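
As a rough illustration of why that cleaning is laborious, here is a minimal Python sketch of one small part of it: mapping the varied column headers found in submitted files onto a single set of field names. The alias table, the target field names and the function `normalise_rows` are all hypothetical, chosen only for this example.

```python
import csv
import io

# Hypothetical mapping from header spellings seen in submitted files
# to the field names of an assumed target structure.
HEADER_ALIASES = {
    "event": "title",
    "event name": "title",
    "title": "title",
    "venue": "venue",
    "location": "venue",
    "date": "start_date",
    "start date": "start_date",
}


def normalise_rows(csv_text):
    """Map each row's headers onto the target field names.

    Returns (rows, unmapped), where unmapped is the sorted list of headers
    with no known alias, which still need a person to look at them.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows, unmapped = [], set()
    for raw in reader:
        row = {}
        for header, value in raw.items():
            if header is None:
                # Extra cells beyond the header row; ignored in this sketch.
                continue
            key = HEADER_ALIASES.get(header.strip().lower())
            if key is None:
                unmapped.add(header.strip())
            else:
                row[key] = (value or "").strip()
        rows.append(row)
    return rows, sorted(unmapped)


sample = "Event Name, Venue ,Date,Notes\nCeilidh, Town Hall ,2020-02-11,bring shoes\n"
rows, unmapped = normalise_rows(sample)
```

Even in this toy case, the `Notes` column ends up in `unmapped` and needs a manual decision. Real files add inconsistent date formats, merged cells and free-text prices on top.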

In our online form, by contrast, we build in ‘smarts’ to help with data entry.

In the form we:

  • ensure people only enter a single numeric data point under minimum price and provide a free text box for the exceptions
  • perform a lookup on the venue, and ensure a close duplicate is not created
  • easily allow for repeating performances without manual entry of each one
  • provide an integrated capability to add images
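
The first three checks in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the form's actual implementation: the venue list, the 0.8 similarity cutoff and the function names are hypothetical, and the fuzzy match uses the standard library's `difflib` as a stand-in for whatever lookup a real form would use.

```python
import difflib
import re
from datetime import date, timedelta

# Hypothetical venue list; a real lookup would query a venue database.
KNOWN_VENUES = ["Usher Hall", "The Queen's Hall", "King's Theatre"]


def parse_min_price(text):
    """Accept exactly one non-negative number, e.g. "12" or "12.50".

    Anything else (a range, a currency symbol, prose) raises ValueError so
    the form can point the submitter to the free-text box instead.
    """
    cleaned = text.strip()
    if not re.fullmatch(r"\d+(\.\d{1,2})?", cleaned):
        raise ValueError("enter a single number; put exceptions in the notes")
    return float(cleaned)


def suggest_venue(name, known=KNOWN_VENUES, cutoff=0.8):
    """Return the closest known venue, or None if nothing is similar enough."""
    lowered = {v.lower(): v for v in known}
    matches = difflib.get_close_matches(
        name.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


def repeat_performances(first, every_days, count):
    """Expand one performance into `count` dates spaced `every_days` apart."""
    return [first + timedelta(days=every_days * i) for i in range(count)]
```

For example, `suggest_venue("Usher Hal")` would offer the existing "Usher Hall" record rather than creating a near-duplicate, and `repeat_performances(date(2020, 2, 11), 7, 3)` yields three weekly dates from one entry.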

We recently analysed the work involved in a sample of the CSV files we have received. For each case, we considered the work of the provider, our manual data editing and the finalisation of the CSV file load process. A CSV file often consumes 2–4 hours of effort across the parties, usually for just 40 events.
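
To put those figures in per-event terms, the short calculation below assumes the 2–4 hour range applies to a file of 40 events. The function name is hypothetical.

```python
def minutes_per_event(hours, events):
    """Average effort per event, in minutes."""
    return hours * 60 / events


low = minutes_per_event(2, 40)   # 3.0 minutes per event
high = minutes_per_event(4, 40)  # 6.0 minutes per event
```

On these assumptions, each event in a typical CSV file costs between three and six minutes of combined effort.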

We conclude that there is nothing better than structured data, and we want to encourage event submitters to make this their format of choice. We will return to event promoters the data they submit via our form, in a structured format that they can use with other third parties. This is a no-charge additional element of our listings service for event organisers.

We have also concluded that we should restrict our use of CSV to files with more than 60 events, directing submitters with fewer events to the form with the functionality we have described in painful detail.
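
A rule like this is simple to apply automatically. The sketch below is one possible way to do it, assuming one event per row and a single header row. The function name and return values are hypothetical.

```python
import csv
import io

MIN_CSV_EVENTS = 60  # threshold from the text: CSV only for more than 60 events


def submission_route(csv_text):
    """Return "csv" or "form" depending on how many event rows the file has.

    Assumes one event per row and a single header row; blank rows are ignored.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    events = sum(1 for row in reader if any(cell.strip() for cell in row))
    return "csv" if events > MIN_CSV_EVENTS else "form"
```

A file with exactly 60 events is routed to the form, since the rule requires more than 60.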

Postscript: 20th April 2020:
A story by Ian Watt about collecting COVID-19 data that magnificently illustrates the problems of unstructured data, including CSV.
