blog | oilshell.org

What Is a Data Frame? (In Python, R, and SQL)

2018-11-30

A data frame is a table-like data structure available in languages like R and Python. Statisticians, scientists, and programmers use them in data analysis code.

Once you've tried data frames, you'll reach for them during every data analysis project.

This post introduces the concepts behind them, and then shows how they work by solving the same problem in multiple ways:

  1. Without data frames, using straightforward but verbose Python code.
  2. With data frames in Python and R. You'll see that the underlying ideas are the same, even when the language is different.
  3. With SQL, for another comparison.

To run the code, clone the oilshell/blog-code repo, then follow the instructions in data-frames/README.md..

Table of Contents
Why Write About Data Frames?
Typical Operations
An Example Analysis
The Problem
Without Using Data Frames
Using the Pandas library in Python
Using the dplyr library in R
Using SQL
Interlude
Useful Documents
Recommended Books
Other Books
Conclusion
Appendix A: Implementations
Recommendations
Appendix B: Alternative Solution With "Base R"

Why Write About Data Frames?

This blog is about a new Unix shell, so the connection to data frames may not be obvious. I write about them for a few reasons:

  1. In the last post, I analyzed Oil's code with data frames.

    When writing the praise in that section, I looked for a friendly intro to the idea, but the best hit was from an inactive Clojure project. Surprisingly, Wikipedia doesn't have a page on data frames! This post aims to fill that gap.

  2. I've talked to more than one person who has done data analysis without data frames, e.g. at Recurse Center.

  3. Data frames may someday be a feature of the Oil language. Why? Because the output of both ls and ps is a table. (Caveat: this work is far in the future.)

Typical Operations

Before I show source code, here's some more color on how data frames are used. You may:

Logically, data frames look like SQL tables, but:

An Example Analysis

The Problem

Here's a measurement of hits over time to different pages on a website:

date URL hits
2018-11-30 /site.html 300
2018-11-30 /blog/ 1,000
2018-12-01 /site.html 320
2018-12-01 /blog/ 1,300
2018-12-02 /site.html 310
2018-12-02 /blog/ 1,800
2018-12-02 /data-frames.html 7,500

Let's analyze this data set in two ways.

  1. Daily Traffic: Total hits per day.
date hits
2018-11-30 1,300
2018-12-01 1,620
2018-12-02 9,610
  1. Popular Pages: The pages with the most hits, as a percentage of all traffic.
URL percentage
/data-frames.html 59.9%
/blog/ 32.7%
/site.html 7.4%

Notice that the first query aggregates over the date variable, while the second aggregates over URL, then sorts by percentage.

How would we write code to do this analysis?

Without Using Data Frames

We could write imperative code with explicit loops, building data structures cell-by-cell. I do this in without_data_frames.py, where the main() function is a dense 30 lines of code.

Excerpt:

# ...

by_url = collections.defaultdict(int)
by_date = collections.defaultdict(int)
total_hits = 0

for date, url, num_hits in reader:
    num_hits = int(num_hits)
    by_date[date] += num_hits
    by_url[url] += num_hits
    total_hits += num_hits

# ...

This works for small problems, but becomes tedious as the analysis gets more complex.

The next two solutions, using data frames, are shorter and use a functional style. They're more like expressions on values rather than algorithms to update mutable data structures.

Using the Pandas library in Python

Here's the core of the first query, excerpted from with_pandas.py:

# Daily Traffic
daily = traffic.groupby('date').sum()

Pandas uses the method chaining idiom on data frame objects.

It takes two steps to transform traffic into daily:

  1. Divide the rows into groups, such that each row in a group has the same value for the date column, e.g. 2018-11-30.
  2. Calculate the sum of the num_hits column within each group.

The second query takes a few more steps:

# Popular Pages
popular = (
    traffic.groupby('url')
    .apply(lambda x: sum(x.num_hits) / total_hits * 100.0)
    .reset_index(name='percentage')
    .sort_values(by='percentage', ascending=False)
)
  1. Instead of sum(), we apply a formula to each group, creating a new percentage column.
  2. Then we sort the new rows by percentage, in decreasing order.

Using the dplyr library in R

R implementations of these queries use the same ideas. They may look less familiar, due to a few features of R:

  1. Operator overloading, which lets us define a %>% operator that performs left-to-right function application. It's not part of R itself.
  2. Lazy evaluation, which removes the need for the lambda you saw above in Pandas.
  3. "Reversed" assignment lets you put the variable on the right, like this:
42 -> x  # same as x = 42 or x <- 42

Here's core of the first query, excerpted from with_dplyr.R:

# Daily Traffic
traffic %>%
  group_by(date) %>%
  summarize(num_hits = sum(num_hits)) ->
  daily    # assign to variable 'daily'

The second query is similar:

# Popular pages
traffic %>%
  group_by(url) %>%
  summarize(percentage = sum(num_hits) / total_hits * 100.0) %>%
  arrange(desc(percentage)) ->
  popular    # assign to variable 'popular'

Again:

  1. The summarize step uses a formula to compute a new percentage column.
  2. The arrange step sorts the resulting data frame by decreasing percentage.

I like the %>% operator because it reads left-to-right like a Unix pipeline. However, there are significant differences. The data passed between stages is:


The main point here is that data frames are a stable concept across multiple languages. The syntax is different, but the idea is largely the same.

Appendix B solves the problem with "base R", rather than with dplyr.

Using SQL

Raw SQL easily handles the Daily Traffic query:

SELECT date, SUM(num_hits) FROM traffic GROUP BY date;

It starts to look awkward for the Popular Pages query:

SELECT url,
       SUM(num_hits) * 100.0 / (SELECT SUM(num_hits) FROM traffic)
       AS percentage
FROM traffic
GROUP BY url
ORDER BY percentage DESC;

The inner SELECT clause is begging to be extracted into a total_hits variable.

In run.sh, I've used a Common Table Expression for total_hits, which will improve readability in complex cases, but makes the code longer in this case. It would be nice if SQL had Python-like scalar variables (i.e. variables that aren't tables).

It could also be done with two separate queries and a Python variable, but this would be incorrect in the face of concurrent updates. If someone adds a row between the two queries, then the total percentage will be more than 100%.

As your analysis becomes more complex, the advantage of data frames over SQL becomes clearer. In many contexts, it's useful to view SQL as a data extraction language, while R or Python is the data analysis language.

Interlude

The rest of this post is biased toward R, because it's what I use for data analysis. Feel free to leave a comment with analogous resources for Python or other languages.

Useful Documents

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure.

Each variable is a column, each observation is a row, and each type of observational unit is a table.

Once the concepts are clear, using data frames is a matter of internalizing the API through practice. I refer to this document most often:

Recommended Books

Disclosure: This section contains affiliate links. I earn a commission if you buy a book.

I mentioned my desire to write book reviews a few times, and R and data frames are good topics to start with.

Why? I've purchased at least six books on R, which tells me that it's the hardest programming language I've learned. In comparison, I felt productive in Python before buying any books at all.

I'm narrowing my recommendations to these two books:


R for Data Science: Import, Tidy, Transform, Visualize, and Model Data

This book, by Wickham and Grolemund, explains data analysis following the philosophy of the tidyverse. After using several data frame libraries, I'm now fully "on board" with the tidyverse. It's well-documented and a pleasure to use.


The Art of R Programming: A Tour of Statistical Software Design

This is the best book for programmers who want to learn R. If you're already fluent in another language, this book explains R in familiar terms.

Other R books are aimed at stats students with no programming background, and weren't as useful to me as this one.

I only recommend these books after you've written a few lines of code with data frames, using online resources like the ones above. The runnable code from this post may also be a useful starting point.

Other Books

I could write more about other books I own, but the short takeaway is that the two above are the best for programmers that want to learn to use data frames.

My copies:

R books

R for Data Science was written by the creator of the tidyverse. The book Python for Data Analysis is analogous since it was written by the creator of Pandas:

However, I haven't read it, so I can't comment on its quality.

Conclusion

This post explained the concept of data frames, and showed concrete examples of code in R, Python, and SQL.

I linked to my two favorite online resources about data frames, and two books out of six that I found most useful.

If you're interested concrete examples of how I use R, see the last post.

Leave a comment if you have questions. There are further thoughts and recommendations below.


Thanks to Eric Higgins and Yuri Vishnevsky for valuable feedback on a draft of this post.

Appendix A: Implementations

Mentioned in this post:

Other:

The wiki page I mentioned above has links to more implementations (some of them defunct).

Please leave a comment if you have experience with another data frame library.

Recommendations

If you're heavily invested in the Python ecosystem and don't know R, you may want to try Pandas first.

Otherwise, I recommend learning R with dplyr and the rest of the tidyverse.

A language that doesn't affect the way you think about programming, is not worth knowing. — Alan Perlis

R is inconsistent and difficult to learn, but the tidyverse largely fixes this:

For what it's worth, I wrote code in five ways for this blog post, and I found the dplyr version the most elegant. The syntax is a little unusual, but the ideas fit together well.

(Regarding R vs. Python debates: I think the answer for many data science problems is to use both. But I use R when the code requires data frames.)

Appendix B: Alternative Solution With "Base R"

For comparison, I've also written with_base_R.R, which uses functions built into R, rather than the newer dplyr library.

It's not important to understand this code. I only include it to show that dplyr is nicer.

The tidyverse, which inculdes dplyr, is essentially a "reform" of R, enabled by Lisp-like metaprogramming. This is inspirational to me because I hope to "reform" shell with the Oil language.