As someone with a background in statistics, I acknowledge that I have to keep improving my computer science and engineering skills almost every day. While thinking about distributions, statistics and other key concepts of data analysis comes naturally to me, writing efficient and clean code does not.
Luckily, I’ve had the opportunity to work with a lot of engineers who have taught me why code needs to be clean and efficient. If I can summarize that need in one sentence, the best one comes from John Donne’s poem (a saying almost 400 years old!): ‘No man is an island’.
When it comes to developing our code and scripts, we are not an island. Working collaboratively is one of the greatest skills to have as a data scientist or analyst (or in almost any other profession). If you want to make a career out of analyzing data, it is almost certain that someone will have to look at your code in the future. The better organized your code is, the easier it will be for someone to read, debug and improve it. And this does not only apply to other people you might work with: it will also spare your future self a lot of hassle (who has never looked at their own code and thought: “What the hell am I doing in this function?”).
So let’s jump right into some common best practices when it comes to R Programming (some are disputed, others are widely accepted by the community)!
Libraries go first
The first thing that should go into your R script is your libraries: the dependencies of your code should be explicit right at the beginning. This avoids surprising someone who runs the code, hits a library call hidden in the middle of the script, and has not prepared their environment for it.
Most style guides agree with this recommendation and you should avoid doing something like:
my_vector <- c(1,2,3)
library(readxl)
read_excel(path_file)
The code above creates the vector my_vector before loading the readxl library, which is generally bad practice.
The only case where I find it acceptable to break this rule is teaching. For example, when a lecture introduces a new concept that relies on a library, that library can be loaded in the middle of the script so that students keep a visual link between the function being used and the library that provides it.
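As a minimal sketch of a script header that follows this rule, the block below declares every dependency at the top and fails early if one is missing. The packages stats and utils are only stand-ins (they ship with R so the example runs anywhere); the required_packages vector is a hypothetical name, not something R requires.

```r
# Dependencies are declared once, at the very top of the script.
# "stats" and "utils" ship with R; they stand in for your real packages.
required_packages <- c("stats", "utils")

# Check which packages are installed, without attaching them yet.
is_installed <- vapply(
  required_packages,
  requireNamespace,
  logical(1),
  quietly = TRUE
)
missing_packages <- required_packages[!is_installed]

# Fail early, with a readable message, instead of halfway through the script.
if (length(missing_packages) > 0) {
  stop("Please install: ", paste(missing_packages, collapse = ", "))
}

library(stats)
library(utils)

# The rest of the script starts only after every dependency is loaded.
my_vector <- c(1, 2, 3)
```

The explicit check is optional: plain library() calls at the top already make the dependencies visible, and the check only improves the error message.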
Hard-Coded Variables go second
Another generally accepted rule is that hard-coded variables, such as file paths or the configuration needed to access databases, should go at the beginning of the script, right after the libraries are loaded (if you have passwords, it’s even better to take a look at this article on managing secrets: https://cran.r-project.org/web/packages/httr/vignettes/secrets.html).
A lot of scripts consume data from csv or xlsx files so it’s a general best practice to do the following:
library(readxl)
path_file <- "data/data.xlsx"
my_df <- read_excel(path_file)
This way, whoever reads your code has two important bits of information about it:
Which libraries your code depends on.
Which files and folders the project folder structure should contain.
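A slightly fuller sketch of this layout, using only base R so it runs without extra packages. The DB_PASSWORD environment variable is a hypothetical name used for illustration; the point is only that the secret is read from the environment instead of being typed into the script.

```r
# 1. Libraries first ("utils" ships with R and provides read.csv).
library(utils)

# 2. Hard-coded values second, all in one place.
path_file <- file.path("data", "data.csv")
# DB_PASSWORD is a hypothetical environment variable; the secret itself
# never appears in the script. The result is NA when it is not set.
db_password <- Sys.getenv("DB_PASSWORD", unset = NA)

# 3. The logic only refers to the names defined above.
if (file.exists(path_file)) {
  my_df <- read.csv(path_file)
} else {
  message("Expected input file not found: ", path_file)
}
```

If the file location changes, only the line that defines path_file needs to be edited.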
Oh, and about file paths..
Relative paths over absolute paths
Absolute paths are almost never the way to go. This is especially relevant when you are working in a folder that sits inside your operating system’s user directory.
Absolute paths look like this:
"C:/Users/ivopb/My Documents/R Project/data/data.csv"
If I pass my code to someone who needs to run it on their machine, they will not be able to run the part that reads this file unless their username is also ivopb, even if they recreate the rest of the folder structure (My Documents/R Project/etc.).
Oh, and even if by random chance they are the user ivopb, but their hard drive is mapped to a letter other than C:, good luck executing the code!
Generally, relative paths are preferred:
"data/data.csv"
This forces you to set the working directory to the project folder, or to open the script from that folder, which is much better than hunting down and changing a huge number of paths scattered through your script!
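To make the idea concrete, here is a small self-contained sketch. It builds a throwaway project folder inside R's temporary directory (the folder name r_project is made up for the example), sets the working directory to it, and reads the file through a relative path.

```r
# Build a throwaway project folder so the example runs on any machine.
# "r_project" is a hypothetical folder name.
project_dir <- file.path(tempdir(), "r_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(
  data.frame(x = 1:3),
  file.path(project_dir, "data", "data.csv"),
  row.names = FALSE
)

# Point the working directory at the project folder once...
old_wd <- setwd(project_dir)

# ...and every path in the script can stay relative to it.
my_df <- read.csv(file.path("data", "data.csv"))

# Restore the previous working directory.
setwd(old_wd)
```

Nothing in the read.csv call mentions a user name or a drive letter, so the same line works wherever the project folder is copied.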
Naming Conventions — File Names
For file names, always use names that are easy to interpret and don’t use whitespace (I actually commit this blunder in the scripts of my course to match the lecture names on Udemy, something I might change in the future). A good and a bad example:
# Good example
my_file.R
# Bad example
My File.R
Also, aim for lower case in script names. If the goal of your script is, for example, to aggregate some specific data from a csv file, use a name that is tied to the overall objective of the script:
aggregating_data.R
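If you want to check these conventions automatically, a small helper can do it. The sketch below is only one possible reading of the rules above (lower case letters, digits and underscores, ending in .R); the function name IsTidyScriptName is hypothetical.

```r
# Hypothetical helper: TRUE when a script name uses only lower case letters,
# digits and underscores, and ends in ".R".
IsTidyScriptName <- function(file_name) {
  grepl("^[a-z0-9_]+\\.R$", file_name)
}

IsTidyScriptName(c("my_file.R", "My File.R", "aggregating_data.R"))
# returns TRUE FALSE TRUE
```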
Naming Conventions — Objects and Functions
This is a hot topic in any coding language — and people tend to battle over which naming convention is the best.
Apart from the naming convention that you choose, just be sure to follow the same one throughout your script — to me, that’s a general golden rule.
I like to use snake case (using _) for objects and either upper camel case or snake case for functions, but this is open for discussion. An example:
# Good
my_vector <- c(1,2,3)
# Bad
myvector <- c(1,2,3)
# Good
ThisFunction()
this_function()
# Bad
thisfunction()
Also, your object and function names should be as explicit and brief as possible. Imagine a function that takes a base and an exponent and computes the power:
ComputePowerOfBaseWithExponent <- function (base, exponent) {
  return (base**exponent)
}
The function name is really long, so it’s normally recommended to shorten it:
ComputePower <- function (base, exponent) {
  return (base**exponent)
}
Again, my only golden rule is to keep the style consistent throughout the script.
Returns
This, along with variable and function naming, is one of the most divisive topics inside the community (take a look at this thread to check both sides of the argument — https://stackoverflow.com/questions/11738823/explicitly-calling-return-in-a-function-or-not)
Return statements in R at the end of each function add redundancy — that’s a fact. Notice that I used an explicit return in the example above:
ComputePower <- function (base, exponent) {
  return (base**exponent)
}
The return statement can be removed from the function, changing from explicit return to implicit return:
ComputePower <- function (base, exponent) {
  base**exponent
}
Using explicit returns makes a marginal difference to the speed of your code: because return() is itself a function call, it can add a little overhead, but the effect is generally so tiny that it’s negligible.
I tend to think that explicit returns suit all levels of R coders, as they make it easier for a beginner to understand the flow of the code. But this is mostly the programmer’s choice. People sit on both sides of the fence on this one, and my personal opinion is that it’s not fair to scold anyone for using either implicit or explicit returns.
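A short sketch to show that both styles return the same value, plus the one situation where return() is not optional: leaving a function early. The names ComputePowerExplicit, ComputePowerImplicit and SafePower are made up for this comparison.

```r
# Explicit return: the returned value is spelled out.
ComputePowerExplicit <- function(base, exponent) {
  return(base^exponent)
}

# Implicit return: the value of the last evaluated expression is returned.
ComputePowerImplicit <- function(base, exponent) {
  base^exponent
}

# Both styles give exactly the same result.
identical(ComputePowerExplicit(2, 3), ComputePowerImplicit(2, 3))
# returns TRUE

# An early exit needs return(), whichever style you use for the last line.
SafePower <- function(base, exponent) {
  if (!is.numeric(base) || !is.numeric(exponent)) {
    return(NA_real_)
  }
  base^exponent
}

SafePower("a", 2)
# returns NA
```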
Be explicit in Loops
One of the important things when doing loops on objects is to name explicitly the elements you are looping on.
Let’s imagine the following exercise: you have a vector with ages of a specific group of people that you want to classify with “Major” or “Minor” and you use a for loop for that (let’s ignore other approaches that we could use that are better implementations for the sake of the argument):
ages_people = c(10, 20, 20, 30, 40)
ClassifyAge <- function (ages) {
  age_class <- c()
  for (age in ages) {
    if (age < 18) {
      age_class <- c(age_class, 'Minor')
    }
    else {
      age_class <- c(age_class, 'Major')
    }
  }
  age_class
}
The code would work just as well if we named the loop variable i or element:
ages_people = c(10, 20, 20, 30, 40)
ClassifyAge <- function (ages) {
  age_class <- c()
  for (i in ages) {
    if (i < 18) {
      age_class <- c(age_class, 'Minor')
    }
    else {
      age_class <- c(age_class, 'Major')
    }
  }
  age_class
}
It is generally better to name the elements of your loop explicitly — in the case above we are doing something based on an age so it would be better to name the loop element age and not i, element or j. This would make it easier for someone that is not acquainted with the code to understand what the code is doing at a functional level.
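For completeness, one of the alternative approaches set aside earlier is a vectorized call. The sketch below uses base R's ifelse and keeps the same explicit naming idea, just applied to the argument instead of a loop variable; the name ClassifyAgeVectorized is made up for the example.

```r
ages_people <- c(10, 20, 20, 30, 40)

# ifelse() tests every age at once, so no loop variable is needed.
ClassifyAgeVectorized <- function(ages) {
  ifelse(ages < 18, "Minor", "Major")
}

ClassifyAgeVectorized(ages_people)
# returns "Minor" "Major" "Major" "Major" "Major"
```

The result matches the loop version for this input, and it avoids growing age_class one element at a time.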
Using <- or = on Object Assignment
This is another hot topic in the community. I tend to stick with <- when creating objects or functions, although for a top-level assignment = does exactly the same thing. Most style guides agree with this, but there is no one-size-fits-all opinion on this one.
One generally accepted rule is that when you use the <- operator for assignments, you leave a space on each side of it, between the name of the object and the value you are assigning, as follows:
# Good example
my_vector <- c(1, 2, 3)
# Bad example
my_vector<-c(1, 2, 3)
Of course, as I switch a lot between Python and R scripts, sometimes the naughty = sneaks into an assignment :-)
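The two operators stop being interchangeable inside a function call, where = names an argument and <- still assigns. A minimal sketch of that difference, wrapped in a made-up function (DemoAssignment) so the check does not depend on what already exists in your workspace:

```r
# Hypothetical demo function: reports whether an object called x exists
# in the function's own environment after each kind of call.
DemoAssignment <- function() {
  # Inside a call, = only names the argument; no object x is created.
  mean_with_equals <- mean(x = c(1, 2, 3))
  created_by_equals <- exists("x", envir = environment(), inherits = FALSE)

  # Inside a call, <- assigns x first and then passes its value along.
  mean_with_arrow <- mean(x <- c(4, 5, 6))
  created_by_arrow <- exists("x", envir = environment(), inherits = FALSE)

  c(created_by_equals = created_by_equals, created_by_arrow = created_by_arrow)
}

DemoAssignment()
# created_by_equals is FALSE, created_by_arrow is TRUE
```

This is one reason many style guides reserve <- for assignment and = for naming arguments.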
Line Length
Avoid more than 80 characters per line so that your code fits in most IDE windows. This is also a general best practice in other programming languages such as Python. You want to spare the reader of the script from dragging the horizontal scroll bar back and forth (an easy recipe for getting lost in the code).
Imagine a function that is called with really long arguments:
CalculatesMeaningOfLife('This is a really long argument','This is another really long argument','This is a third really long argument!')
A general best practice when calling this code is to do the following:
CalculatesMeaningOfLife(
  'This is a really long argument',
  'This is another really long argument',
  'This is a third really long argument!'
)
RStudio indents automatically when you press enter after each comma. This is a neat feature that makes it easier to keep our code tidy and clean.
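If you want to spot offending lines without scrolling, base R's nchar can do the counting. In this sketch the script_lines vector is a made-up stand-in for the lines of a real script, which you would normally load with readLines.

```r
# Stand-in for the lines of a script; in practice something like
# readLines("aggregating_data.R") would provide them.
script_lines <- c(
  "my_vector <- c(1, 2, 3)",
  strrep("x", 95)  # an artificial 95-character line
)

max_line_length <- 80

# Positions of the lines that exceed the limit.
too_long <- which(nchar(script_lines) > max_line_length)
too_long
# returns 2
```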
Spacing
When calling functions or indexing objects it is always a good idea to provide a space after each ‘,’. This makes the code more readable and avoids the idea of “compressed code”. Compressed code is when we have the following:
my_array = array(1:10,c(2,5))
my_array[,c(1,2)]
The code works but all your code is tied together, with no spaces. Visually, it makes it difficult to understand which bit refers to the first or to the second dimension in the array.
A cleaner way to do this is by doing the following:
my_array = array(1:10, c(2, 5))
my_array[, c(1, 2)]
Notice how I’ve added a space after each comma that I have on the code. This is what one would generally call letting the code breathe — you make it easier to understand what you are indexing in each dimension and what is going into each argument of your function call.
DRY — Don’t Repeat Yourself
One of the most important concepts in any programming language is DRY. A general golden rule is that when you find yourself copying and pasting a lot of code, that’s a good use case for a function.
As a really simple example, let’s imagine we want to greet 8 students in a class using R:
paste('Hello','John')
paste('Hello','Susan')
paste('Hello','Matt')
paste('Hello','Anne')
paste('Hello','Joe')
paste('Hello','Tyson')
paste('Hello','Julia')
paste('Hello','Cathy')
You are repeating the same code multiple times, only changing the name of the student. In your mind, this should immediately trigger the need to use a function:
GreetStudent <- function(name) {
  paste('Hello', name)
}
class_names <- c('John', 'Susan', 'Matt', 'Anne',
                 'Joe', 'Tyson', 'Julia', 'Cathy')
for (student in class_names) {
  print(GreetStudent(student))
}
The cool thing is that now you can add more students to our class_names vector and you avoid repeating your paste command multiple times!
Functions are really one of the most powerful concepts in R. Rely on them often, as they make your code efficient and testable.
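As a small sketch of the "testable" part: once the greeting lives in a function, a one-line check documents what it should return. The example also relies on paste being vectorized, so the whole class can be greeted without a loop; the greetings name is only illustrative.

```r
GreetStudent <- function(name) {
  paste("Hello", name)
}

class_names <- c("John", "Susan", "Matt", "Anne",
                 "Joe", "Tyson", "Julia", "Cathy")

# paste() is vectorized, so one call greets the whole class.
greetings <- GreetStudent(class_names)

# A tiny test: the script stops here if the function ever changes behavior.
stopifnot(identical(GreetStudent("John"), "Hello John"))

greetings[1]
# returns "Hello John"
```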
And that’s it! Do you have anything else to add?
There are a lot of best practices that I haven’t covered here and probably a lot more that I don’t know myself. What fascinates me about open source languages is how fast they evolve and how the members of the community support each other in developing better programs and scripts, so that everyone becomes more productive.
As a final note, remember that learning a new language is a skill that takes time, and it is almost impossible for someone to be a know-it-all. Everyone around you, no matter how senior they are in the language, always has something to learn, and that learning mindset is the best way to become a better professional and human being in general.
Feel free to reach out and if you would like to learn more about R, you can join my R Programming course here: R Programming course for Absolute Beginners
Thank you for taking the time to read this post! Feel free to add me on LinkedIn (https://www.linkedin.com/in/ivobernardo/) and check my company’s website (https://daredata.engineering/home).
If you are interested in getting trained on Analytics and Data Science you can also visit my page on Udemy (https://www.udemy.com/user/ivo-bernardo/)
This lecture is taken from my R Programming course available on the Udemy platform — the course is suitable for beginners and people that want to learn the fundamentals of R Programming. The course also contains more than 50 coding exercises that enable you to practice as you learn new concepts.