eric | May 7, 2021, 4:47 p.m.
Dealing with different file encodings for a set of data files can be a bit of a pain [1], but there is one tool that is really useful in this situation. Using the readr package [2] with its guess_encoding function for reading files works most of the time, and a few additions can make it even better.
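To give a feel for what guess_encoding returns, here is a minimal call; the path is just the example file used later in this post, and the exact output will of course depend on your data:

library(readr)

# guess_encoding() returns a tibble with one row per candidate encoding
# and a confidence score between 0 and 1 (columns `encoding` and `confidence`).
guess_encoding("data-original/2016/BIOREGION.csv")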
So what is the problem with encodings? Say you want to read a CSV-file. If you don’t specify encoding when using read.csv(), you might find yourself in trouble:
df1_2016 <- read.csv(
  "~/Git/data_science/Natura2000/data-original/2016/BIOREGION.csv",
  header=TRUE, sep=",")

Error in make.names(col.names, unique = TRUE) :
  invalid multibyte string at '<0a>'
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 2 appears to contain embedded nulls
3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 3 appears to contain embedded nulls
4: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 4 appears to contain embedded nulls
5: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 5 appears to contain embedded nulls
6: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 1 appears to contain embedded nulls
What is going on? This looks like a typical encoding problem. If we check the encoding of the file from the command line, we get "UTF-16LE". This needs to be specified in the fileEncoding argument of read.csv. From the file help page:
The encodings ‘”UCS-2LE”’ and ‘”UTF-16LE”’ are treated specially, as they are appropriate values for Windows ‘Unicode’ text files.
In addition, we should specify the dec argument. We then get:
> df1_2016 <- read.csv(
    "~/Git/data_science/Natura2000/data-original/2016/BIOREGION.csv",
    header=TRUE, sep=",", fileEncoding="UTF-16LE", dec=".")
This works, but so far it is a hopelessly manual process that will not work well in a reproducible data pipeline. How can we create a fully automated process for reading CSVs with different encodings?
To automate the reading process, we need the encoding of each file to be read. If you know the possible set of encodings, you can combine that knowledge with guess_encoding to get a much more reliable, and possibly fully automated, solution. It is not absolutely fail-safe though, so proper error handling must be part of the script. One possible set of rules for the guessing process: (1) try the encodings suggested by guess_encoding that are also among the probable encodings, (2) if none of them work, try the remaining probable encodings, (3) if that also fails, try every encoding guess_encoding is able to detect, and (4) after every attempt, check that the result actually makes sense.
Typically, a file read that does not use the correct encoding will take a long time, so a timeout could be used to discard such reads as unsuccessful. This is a high-risk strategy, however, since reading big files and/or reading from shared resources could also lead to time-outs.
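For completeness, here is a minimal sketch of what such a timeout could look like in base R, using setTimeLimit. It is not part of the script below, precisely because of the risk described above, and the 30-second limit is an arbitrary choice:

# Hypothetical helper: abort a read that takes longer than `seconds`.
# Not used in the pipeline below, since slow shared storage or genuinely
# large files would be discarded as failures too.
read_with_time_limit <- function(file_to_read, file_encoding, seconds = 30) {
  setTimeLimit(elapsed = seconds, transient = TRUE)
  on.exit(setTimeLimit(elapsed = Inf), add = TRUE)  # always lift the limit again
  tryCatch(
    read.csv(file_to_read, header = TRUE, sep = ",",
             fileEncoding = file_encoding, dec = "."),
    error = function(cond) {
      message(paste("Read failed or timed out for encoding:", file_encoding))
      NA
    }
  )
}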
First of all, we need a function that reads a CSV and gives us some meaningful feedback if it fails. It could look something like this:
read_encoded_csv <- function(file_to_read, file_encoding) {
  out <- tryCatch(
    {
      message("Trying encoding")
      read.csv(file_to_read, header=TRUE, sep=",",
               fileEncoding=file_encoding, dec=".")
    },
    error=function(cond) {
      message(paste("Encoding doesn't seem to work:", file_encoding))
      message("Here's the original error message:")
      message(cond)
      # Return value in case of error
      return(NA)
    },
    warning=function(cond) {
      message(paste("Encoding caused a warning:", file_encoding))
      message("Here's the original warning message:")
      message(cond)
      # Return value in case of warning
      return(NULL)
    },
    finally={
      message("\n")
      message(paste("Processed file:", file_to_read))
      message(paste("Processed encoding:", file_encoding))
    }
  )
  return(out)
}
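On its own, the function can be called like this, with the path and encoding we found manually above:

df1_2016 <- read_encoded_csv(
  "~/Git/data_science/Natura2000/data-original/2016/BIOREGION.csv",
  "UTF-16LE")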
Combining guess_encoding from the readr package with knowledge of the encodings the files are most likely to have, we can write something like this:
determine_encoding <- function(file_to_read, encodings_probable) {
  message(paste("DETERMINE ENCODING FOR NEW FILE:", file_to_read))
  encoding_used <- ""
  encodings_guessed <- guess_encoding(file_to_read)[["encoding"]]
  encodings_all <- c("UTF-8","UTF-16BE","UTF-16LE","UTF-32BE","UTF-32LE",
                     "Shift_JIS","ISO-2022-JP","ISO-2022-CN","ISO-2022-KR",
                     "GB18030","Big5","EUC-JP","EUC-KR","ISO-8859-1","ISO-8859-2",
                     "ISO-8859-5","ISO-8859-6","ISO-8859-7","ISO-8859-8",
                     "ISO-8859-9","windows-1250","windows-1251","windows-1252",
                     "windows-1253","windows-1254","windows-1255","windows-1256",
                     "KOI8-R","IBM420","IBM424")
  # First, try the guessed encodings that are also among the probable ones
  for (encoding_guessed in encodings_guessed) {
    print(encoding_guessed)
    print(encodings_guessed)
    if (encoding_guessed %in% encodings_probable) {
      t <- read_encoded_csv(file_to_read, encoding_guessed)
      if (reading_makes_sense(t)) {
        encoding_used <- encoding_guessed
        break()
      }
    }
  }
  # Second, fall back to all the probable encodings
  if (encoding_used == "") {
    for (encoding_probable in encodings_probable) {
      t <- read_encoded_csv(file_to_read, encoding_probable)
      if (reading_makes_sense(t)) {
        encoding_used <- encoding_probable
        break()
      }
    }
  }
  # Finally, fall back to every encoding guess_encoding can detect
  if (encoding_used == "") {
    for (encoding_all in encodings_all) {
      t <- read_encoded_csv(file_to_read, encoding_all)
      if (reading_makes_sense(t)) {
        encoding_used <- encoding_all
        break()
      }
    }
  }
  return(encoding_used)
}
If the initial guess fails, we try all the probable encodings. If that fails, we try all possible encodings. We could list every encoding R knows about with iconvlist(), but the readr package uses stringi::stri_enc_detect under the hood of guess_encoding, and that function has its limits in terms of the encodings it can detect, so it only makes sense to check for those:

UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE
Shift_JIS (Japanese)
ISO-2022-JP (Japanese)
ISO-2022-CN (Simplified Chinese)
ISO-2022-KR (Korean)
GB18030 (Chinese)
Big5 (Traditional Chinese)
EUC-JP (Japanese)
EUC-KR (Korean)
ISO-8859-1 (Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish)
ISO-8859-2 (Czech, Hungarian, Polish, Romanian)
ISO-8859-5 (Russian)
ISO-8859-6 (Arabic)
ISO-8859-7 (Greek)
ISO-8859-8 (Hebrew)
ISO-8859-9 (Turkish)
windows-1250 (Czech, Hungarian, Polish, Romanian)
windows-1251 (Russian)
windows-1252 (Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish)
windows-1253 (Greek)
windows-1254 (Turkish)
windows-1255 (Hebrew)
windows-1256 (Arabic)
KOI8-R (Russian)
IBM420 (Arabic)
IBM424 (Hebrew)
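If you want to see what stringi itself reports, you can call stri_enc_detect directly on the raw bytes of a file. A rough sketch, where the file path and the number of bytes to sample are just example choices:

library(stringi)

# Read a sample of the raw bytes and let stringi guess the encoding.
# The result is a list containing a data frame with the columns
# Encoding, Language and Confidence, which is where the language
# information in the table above comes from.
raw_sample <- readBin("data-original/2016/BIOREGION.csv", what = "raw", n = 10000)
stri_enc_detect(raw_sample)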
We should check that the reading makes sense, even when an encoding is found and read.csv() reads the CSV-file without a hitch. Using what is known about the CSVs, pertinent rules can be: (1) the result of the file read is a data frame, (2) the data frame has rows, and (3) the data frame has columns:
reading_makes_sense <- function(content_read) {
  out <- (
    is.data.frame(content_read) &&
    nrow(content_read) > 0 &&
    ncol(content_read) > 0
  )
  return(out)
}
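If more is known about the files, the check can be made stricter. A sketch, with a purely hypothetical column name standing in for whatever header you actually expect:

# Hypothetical stricter check: also require that known columns are present.
# "SITECODE" is only an example; replace it with the columns you expect.
reading_makes_sense_strict <- function(content_read, expected_columns = c("SITECODE")) {
  is.data.frame(content_read) &&
    nrow(content_read) > 0 &&
    ncol(content_read) > 0 &&
    all(expected_columns %in% names(content_read))
}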
The main loop could look something like this:
files <- list.files(path="data-original", pattern="*.csv",
                    full.names=TRUE, recursive=TRUE)
encodings_probable <- c("UTF-8","UTF-16LE")

lapply(files, function(x) {
  file_encoding <- determine_encoding(x, encodings_probable)
  message(paste("Determined encoding: ", file_encoding))
  if (file_encoding != "") {
    t <- read_encoded_csv(x, file_encoding)
    # apply function
    out <- dim(t)
    rm(t)
  } else {
    out <- "No encoding found!"
  }
  # write to file
  write.table(out, file = "data/test.txt", append = TRUE, sep = "\t",
              quote = FALSE, row.names = TRUE, col.names = x)
})
It isn't particularly pretty, but it gets the job done. In fact, the code has been used extensively on real data sets and has only failed when dealing with corrupted files, which would trip up any file-reading function. Any ideas on refactoring are most welcome.
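One candidate for refactoring is that determine_encoding already performs a successful read before throwing the result away, so the main loop ends up reading each file twice. A minimal sketch of a combined function, assuming the helpers above are defined and readr is loaded (the final fallback to all detectable encodings is left out for brevity):

# Hypothetical refactoring: return both the encoding and the data frame,
# so a file that has already been read successfully is not read again.
determine_encoding_and_read <- function(file_to_read, encodings_probable) {
  encodings_guessed <- guess_encoding(file_to_read)[["encoding"]]
  # Guessed encodings that are also probable first, then the remaining probable ones
  candidates <- unique(c(intersect(encodings_guessed, encodings_probable),
                         encodings_probable))
  for (encoding in candidates) {
    t <- read_encoded_csv(file_to_read, encoding)
    if (reading_makes_sense(t)) {
      return(list(encoding = encoding, data = t))
    }
  }
  list(encoding = "", data = NULL)
}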
It is possible to automate the reading of files with different encodings by combining what you know about the files: their likely encodings, the separator used, the decimal sign used, and the desired shape of the result. Testing has shown that this works as long as the files use one of the encodings that readr's guess_encoding can detect.
[1] How to detect the right encoding for read.csv - https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv/35854870#35854870
[2] The readr package - https://cran.r-project.org/package=readr
In practical use, the script above has only failed to produce a result when dealing with corrupt files. A typical sign that a file is corrupt is output like the following from the script:
[1] "DETERMINE ENCODING: NEW FILE" [1] "UTF-8" [1] "UTF-8" "windows-1252" "Big5" "windows-1254" Trying encoding Encoding caused a warning: UTF-8 Here's the original warning message: embedded nul(s) found in input Processed file: data-original/2012/NATURA2000SITES.csv Processed encoding: UTF-8 [1] "windows-1252" [1] "UTF-8" "windows-1252" "Big5" "windows-1254" [1] "Big5" [1] "UTF-8" "windows-1252" "Big5" "windows-1254" [1] "windows-1254" [1] "UTF-8" "windows-1252" "Big5" "windows-1254" Trying encoding Encoding caused a warning: UTF-8 Here's the original warning message: embedded nul(s) found in input Processed file: data-original/2012/NATURA2000SITES.csv Processed encoding: UTF-8 Trying encoding Encoding caused a warning: UTF-16LE Here's the original warning message: line 1 appears to contain embedded nulls [...] Processed file: data-original/2012/NATURA2000SITES.csv Processed encoding: KOI8-R Trying encoding Encoding caused a warning: IBM420 Here's the original warning message: invalid input found on input connection 'data-original/2012/NATURA2000SITES.csv' Processed file: data-original/2012/NATURA2000SITES.csv Processed encoding: IBM420 Trying encoding Encoding caused a warning: IBM424 Here's the original warning message: invalid input found on input connection 'data-original/2012/NATURA2000SITES.csv' Processed file: data-original/2012/NATURA2000SITES.csv Processed encoding: IBM424 Determined encoding: