Creating an R DataFrame the Bad Way
A bad but sometimes 'good enough' way of iteratively creating a dataframe.
One of the pieces of advice you’ll commonly see when learning about R and dataframes is that you should always, always vectorize your operations. That is: if you’re creating a dataframe row-by-row or column-by-column, you’re doing it wrong.
This could certainly be correct. However, I am fairly new to R, and sometimes I just don’t know how to go about vectorizing the operation that I’m currently working on. In other cases, I am reasonably sure that there isn’t a great way to vectorize the operation in the first place, or other considerations (code structure, &c.) take precedence. I’m also a fan of the idea that when you’re building something, you should get it working first, then optimize.
For all these reasons, in this post I’m going to share how you can create a dataframe row-by-row, in a way that is not very efficient but may be useful in some situations.
Creating the DataFrame
Let’s start with a contrived example. Imagine that I have a particle experiencing 1-D Brownian motion; that is, its position at every time step will be based on the previous one, plus some random factor. I want to store the time and position of the particle in a dataframe, potentially for graphing. Before we get started, I want to acknowledge that you could easily optimize this code; the purpose is to demonstrate the method of creating the dataframe.
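For comparison, here is a minimal sketch of what the optimized version might look like, assuming the step sizes are drawn uniformly from -1 to 1 as in the loop further down. The names n_steps, steps, and vectorized_data and the fixed seed are my own choices for illustration.

```r
# Assumption: a fixed seed, used only so that repeated runs give the same walk
set.seed(42)

n_steps <- 10

# Draw every random step in one call instead of one per loop iteration
steps <- runif(n_steps, -1, 1)

# The position at each time step is the running total of the steps so far
vectorized_data <- data.frame(
  time = seq_len(n_steps),
  position = cumsum(steps)
)

vectorized_data
```

This works because each position depends on the previous one only through addition, which is exactly what cumsum computes. The rest of this post sticks with the row-by-row approach.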
The basic idea is to create an ‘empty’ dataframe with the correct columns and types, and then each time through the loop use the rbind function to append the new data to the end of the existing data.
# creating the empty dataframe, ensuring that we have the correct types for the columns
existing_data <- data.frame(time=double(), position=double())
position <- 0
for (i in 1:10) {
  position <- position + runif(1, -1, 1)
  # creating a new dataframe with the same shape as the empty one
  new_data <- data.frame(time=i, position=position)
  # 'appending' the new data to the old
  existing_data <- rbind(existing_data, new_data)
}
existing_data
| time | position |
|---|---|
| <int> | <dbl> |
| 1 | 0.85349725 |
| 2 | 1.18046767 |
| 3 | 0.70371025 |
| 4 | 0.03398731 |
| 5 | 0.47465615 |
| 6 | 0.48316093 |
| 7 | 0.01961896 |
| 8 | 0.78249028 |
| 9 | 1.11642522 |
| 10 | 1.78370915 |
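And there we have it! It’s just that simple. You can use any of the built-in R types (double, character, factor, &c.) for the column types. One caveat: rbind drops zero-row dataframes before combining, so the column types in the result come from the first row you append. That is why time shows up as <int> above even though it was declared as double.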
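As a rough sketch of the same pattern with other column types, here is a version that mixes double, character, factor, and logical columns. The column names (label, direction, moving) and the values are hypothetical, chosen only for illustration.

```r
# Hypothetical columns, chosen only to show several column types at once
directions <- c("left", "right")

typed_data <- data.frame(
  time = double(),
  label = character(),
  direction = factor(character(), levels = directions),
  moving = logical(),
  stringsAsFactors = FALSE
)

for (i in 1:3) {
  new_row <- data.frame(
    time = as.double(i),               # convert explicitly so the column stays double
    label = paste0("step_", i),
    direction = factor(directions[(i %% 2) + 1], levels = directions),
    moving = i > 1,
    stringsAsFactors = FALSE
  )
  # same 'append' step as before
  typed_data <- rbind(typed_data, new_row)
}

# inspect the resulting column types
str(typed_data)
```

Giving the factor the same explicit levels in both the empty dataframe and each new row keeps the levels stable no matter which values happen to show up first.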