This blog post is Human-Centered Content: Written by humans for humans.
When I’m doing data engineering work for a client, it’s very common to work with data that’s been exported from a source system to a Comma Separated Variable (CSV) file.
For a lot of data transfer scenarios, these simple text-based files can work well. Where I have seen issues with CSV and other text based file formats, the problem often boils down to an attempt to store a free text column, without taking account of all the special characters that the column might contain.
CSV files separate columns and rows using commas and newline characters. If a free text column itself contains a comma or new line character, this breaks the structure of the file.
Example – CSV
id,name,comment,date 1,Florence,Hello, world!,2025-03-02 2,Dougal,Great work!, 2025-03-02
When the file is read, the “Hello, “world”!” comment is split into two, with “world!” loaded into the date column. Any other columns after the comment column would also then be thrown off.
JSON files are more resilient as they use key-value pairs, and routinely encode fields in double-quotes, but free text can still cause problems.
Example – JSON
{ "id": 1, "name": "Florence", "comment": "Hello "world"!" }
In this case, the free text field “Hello World”!” contains quotation marks. This could be handled by the process writing the JSON file prefixing quotes with escape characters (\” ). You just need to be sure that this is done before the text is added to the JSON document.
When using CSV files, one approach is to specify a delimiter other than the standard comma. It’s common to use tabs, pipes (|), semicolons or tildes to separate values, and to enclose each text field in double quotes. Many data platforms include options to specify these kind of format variations, but none of these provide a robust solution for unfiltered free text. If you’re dealing with enough free text, then whatever delimiter you pick, someone will have used (or will soon use) just that combination of characters in a free text string.
Once the file has been written, the damage is already done. In a large data file, these kind of issues can be devilishly hard to unpick. What we need is an alternative way of storing free text data in text files, without getting into a tangle trying to find ever more obscure delimiters.
Enter Base64
Base64 is an encoding format where any data can be represented as a sequence of “safe” ASCII characters (just A-Z, a-z, 0-9, +, / and =).
Free text encoded in base64 can be stored as a field in a CSV, TSV, JSON, XML or any other text based file format, and then later decoded back to its original form retaining any special characters.
Example conversion:
Original text: “Hello, world!”
Base64 encoded: “SGVsbG8sIHdvcmxkIQ==”
In a CSV file:
id,name,comment 1,Florence,SGVsbG8sIHdvcmxkIQ== 2,Dougal,R3J1YXQgd29yayE=
Whatever the contents of the text, there are no commas, new- lines or quotation marks to interfere with the CSV file structure.
In JSON:
{ "id": 1, "name": "Florence", "comment": "SGVsbG8sIHdvcmxkIQ==" }
The JSON remains fully valid with no need for escape codes.
What Does This Look Like in Code?
Base64 is widely supported across lots of platforms and programming languages. Here’s a couple of examples showing base64 coding and decoding in python and Snowflake SQL.
Python
import base64 original_text = """ "To be, or not to be", that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles And by opposing end them. """ # Encode to Base64 encoded_text = base64.b64encode(original_text.encode("utf- 8")).decode("utf-8") print("Encoded:", encoded_text) # Now decode the Base64 text decoded_text = base64.b64decode(encoded_text).decode("utf-8") print("Decoded:", decoded_text)
Snowflake SQL
-- Encoding text to Base64 (no need for TO_BINARY) SELECT BASE64_ENCODE(' "To be, or not to be", that is the question: Whether ''tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles And by opposing end them."') AS encoded_text; -- Decoding Base64 back to text SELECT BASE64_DECODE_STRING('CiJUbyBiZSwgb3Igbm90IHRvIGJlIiwgdGhhdCBpcyB0aGUgcXVlc3Rpb246IApXaGV0aGVyICd0aXMgbm9ibGVyIGluIHRoZSBtaW5kIHRvIHN1ZmZlcgpUaGUgc2xpbmdzIGFuZCBhcnJvd3Mgb2Ygb3V0cmFnZW91cyBmb3J0dW5lLApPciB0byB0YWtlIGFybXMgYWdhaW5zdCBhIHNlYSBvZiB0cm91YmxlcwpBbmQgYnkgb3Bwb3NpbmcgZW5kIHRoZW0uIg==');
Base64 is not Encryption
Just to be 100% clear for anyone who has skim read this post: Base64 is not a way of protecting sensitive data. Anyone with a copy of the encoded text can easily decode it back to the original with no additional information. If you need to make sure that your data is encrypted, in transit and at rest, there are lots of options available. Talk to us about your particular requirements, and we can help you find an appropriate solution.