DataClassroom


Do you know how to make a Tidy Data table? Here's how.

“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data”

“Tidy datasets and tidy tools work hand in hand to make data analysis easier, allowing you to focus on the interesting domain problem, not on the uninteresting logistics of data.”

- Hadley Wickham, Fellow of the American Statistical Association

Until very recently, when we were simply teaching students to graph by hand on a piece of paper, the format of data probably didn’t matter very much in a grade 6-12 classroom. It does now.

In this moment, as we begin to teach students to work with data in the era of big data, it has become imperative to teach best practices from the beginning. Formatting data so that it can be easily read by a computer has already become a foundational skill, even if state standards haven’t caught up yet. Teaching students to set up a data table in a way that allows for easy analysis will set them up for future success when working with data in college and beyond. Tidy Data specifically has become the standard format for science and business because it allows people to easily turn a data table into graphs, analysis and insight.

Data Science with R by Garrett Grolemund

A key step in preparing data happens when the data is first entered and stored in tables, because this affects the way you look at the data later in your analysis. This is why we operate with Tidy Data, which was conceived to formalize some of the practices of creating data tables for easy graphing and analysis. Tidy Data is quickly becoming the industry standard for data formatting in science and business. The big advantage of Tidy Data is that it makes a clear distinction between a variable, an observation and a value. In this way, all data is standardized and can easily be read by a computer. It is never too early for students to learn this best practice. Students in grades 6-12+ can learn data formatting skills that will serve them well as they enter a world that is awash in data.

Variable, Observation, Value

Tidy Data is often not the most instinctive way for people to collect and enter data, but fortunately it is straightforward to convert to Tidy Format, and you can see a quick example below (shown both as a still image and as a gif). When your students work in Tidy Data, they learn data formatting practices that they will use in college and beyond, while data concepts are reinforced. Below is an example of initial data collected in a spreadsheet and entered in the way most novices would use (left), and the same data in Tidy Format (right):

This is clearly the same data, but it now has:

- Three clearly identified variables (days, plant type, and height), each of which has its own column.

- One row for each independent observation of the variables.

- A single value of a single variable in each cell.
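For readers who prefer to see the conversion as code, here is a minimal sketch of the same wide-to-tidy reshaping in plain Python. The heights are made-up illustration values (not the data from the example image), and the `melt` helper and the column names `plant_type` and `height_cm` are hypothetical names chosen for this sketch.

```python
# Hypothetical wide table: one row per day, one height column per plant.
# The heights (in cm) are invented illustration values.
wide_rows = [
    {"day": 1, "Plant A": 2.0, "Plant B": 1.5},
    {"day": 2, "Plant A": 2.6, "Plant B": 2.1},
    {"day": 3, "Plant A": 3.4, "Plant B": 2.8},
]


def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own row (wide -> tidy)."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id column is repeated on each new row instead
            tidy.append({
                id_column: row[id_column],
                variable_name: column,   # old column header becomes a value
                value_name: value,       # old cell becomes the measurement
            })
    return tidy


tidy_rows = melt(wide_rows, id_column="day",
                 variable_name="plant_type", value_name="height_cm")

# Each printed row is one observation: a day, a plant type, and a height.
for row in tidy_rows:
    print(row)
```

The three wide rows become six tidy rows, one per plant per day, and the plant names move out of the column headers and into a column of their own.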


Here is the process of tidying that dataset, shown as a series of actions on the spreadsheet:

Learn how to Melt your data into Tidy Format here.

The DataClassroom User Guide also has an article and step-by-step video on how to convert your data to Tidy format.

Hadley Wickham on teaching students to make data tidy

Statistician Hadley Wickham is a celebrity in the data world. He coined the term Tidy Data in a widely cited 2014 paper. He has developed statistical software packages in R that are among the most used by scientists and businesses around the world working with data, including the likes of Google, Facebook, Twitter, The New York Times, stats journalism site FiveThirtyEight, and government agencies like the FDA and DEA, just to name a few. We asked Hadley about the importance of teaching data skills and, specifically, about teaching students in grades 6-12 the skills to make their data tidy.

How important do you think it is at this moment to teach skills for working with data in the K-12 math and science curriculum?

“Data skills give you powerful and fundamentally new ways of looking at the world. I think it's important for every student to understand the basis of data and how to work with it so that they can be informed citizens.”

What would you say to a teacher who says, "Tidy Data doesn't seem neater to me in this example (above). Isn't it easier to look at the table on the left and compare the heights for Plant A? Why should I teach my students to make their data tidy?"

“Storing data in this way works great for simple situations, but it starts to get really challenging to work with when you add more variables. For example, how would you record the data if you had four plants, where two were exposed to sunlight and two were kept in the dark? Or what if you wanted to record the height each day until the plant flowered? Or what if the experiments were carried out in two different classrooms and they started and ended on different days? Using tidy data sets you up for success no matter how complicated your experiment begins.”
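One way to picture this answer is a small sketch in plain Python, assuming a tidy table stored as a list of rows. All values are invented for illustration, and the names `observations`, `light_by_plant`, and `with_light` are hypothetical. In a tidy table, a new experimental condition becomes one more column, and another day of measurements becomes more rows.

```python
# Hypothetical tidy observations; all values are invented for illustration.
observations = [
    {"day": 1, "plant": "A", "height_cm": 2.0},
    {"day": 1, "plant": "B", "height_cm": 1.5},
    {"day": 1, "plant": "C", "height_cm": 1.8},
    {"day": 1, "plant": "D", "height_cm": 1.6},
]

# New variable: which plants were kept in sunlight and which in the dark.
light_by_plant = {"A": "sunlight", "B": "sunlight", "C": "dark", "D": "dark"}

# Adding a variable adds one column; there is still one row per observation.
with_light = [
    {**row, "light": light_by_plant[row["plant"]]} for row in observations
]

# Recording another day adds rows; the columns do not change.
with_light.append(
    {"day": 2, "plant": "A", "height_cm": 2.6, "light": "sunlight"}
)

for row in with_light:
    print(row)
```

The table grows longer or gains a column, but its shape stays the same: one row per observation, one column per variable.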

How important is the format of a dataset to the ease with which insights can be gleaned from it?

“I think the organization of data is just as important as how we organize words into sentences and sentences into paragraphs. Sureyoucanreadasentencewithnospaces, but proper punctuation makes things so much clearer! The same is even more true with data since most of the time, you'll be using a computer to work with it, and computers have a much smaller capability to "read between the lines" and understand what you're really trying to say.”

Do you think it is a good idea for younger students to learn the basics of Tidy Data as soon as they start recording data in tables? Why?

“I think the most important reason is that it gives them a set of skills they can continue to use for the rest of their life. It's a little bit harder when you get started, but by teaching tidy data, you give young students exactly the same skills that professional data scientists have!”

What advantages does Tidy Data have for learning data skills?

1) Tidy Format makes it much easier to clearly convey the concept of a variable and what type that variable has. Notice that when the data is in Tidy Format (example above), you can more readily see that you have two numeric variables (day and height), and one categorical variable (plant type; A or B). This is a huge advantage when teaching the basics of experimental design and you are working to get students to become fluent with the skill of identifying independent and dependent variables.

2) Consistency. One of the most common challenges that data scientists face is the problem of formatting “messy” datasets. If your students learn the basics of Tidy Data, they will always be able to format data in a way that is suitable for downstream analysis. Whether that downstream analysis is plotting by hand on a piece of paper, or running statistical models with high powered software, Tidy Data works well.

3) Compatibility. Tidy data has full compatibility with more advanced analysis tools used in university courses. When you have your students make their data Tidy, you aren’t just teaching them a format that will work for one particular program. You are teaching them to organize data in a logical way that can be read easily by a computer. They will use this again in the real world.
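The first two advantages can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular tool works: the heights are invented, and the `variable_kind` helper and the column names are hypothetical. Because each column holds exactly one variable, checking a variable's type and summarizing by group are both short loops over the rows.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tidy table; the heights (in cm) are invented for illustration.
tidy_rows = [
    {"day": 1, "plant_type": "A", "height_cm": 2.0},
    {"day": 2, "plant_type": "A", "height_cm": 2.6},
    {"day": 1, "plant_type": "B", "height_cm": 1.5},
    {"day": 2, "plant_type": "B", "height_cm": 2.1},
]


def variable_kind(rows, column):
    """Label a column 'numeric' if every value is a number, else 'categorical'."""
    values = [row[column] for row in rows]
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in values):
        return "numeric"
    return "categorical"


# One column per variable, so each variable's type can be read off directly.
kinds = {column: variable_kind(tidy_rows, column) for column in tidy_rows[0]}
print(kinds)

# Downstream analysis: group the heights by plant type, then average them.
heights_by_plant = defaultdict(list)
for row in tidy_rows:
    heights_by_plant[row["plant_type"]].append(row["height_cm"])

mean_height = {plant: mean(h) for plant, h in heights_by_plant.items()}
print(mean_height)
```

Here `day` and `height_cm` come out as numeric and `plant_type` as categorical, matching the two numeric variables and one categorical variable described in point 1.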

At DataClassroom, we spend a lot of time thinking about how to present data concepts for students. We are the first to acknowledge that the first messy table in the example above can feel more natural to use when collecting data. But our mission is to improve real understanding of important concepts, rather than just getting an answer. This involves wrestling with and making decisions about exactly what it is that students should be putting mental effort into themselves. That is why we deliberately have not made a magic format-my-data button that would remove some valuable learning. We want your students to learn how to make their data Tidy.

We are always open to input from real teachers using our tool in their classrooms. If you have an inspiration as to how our tool could ease the (often necessary) conversion process to Tidy Data, and still convey the important concepts, we would love to hear from you!


Further reading:

Statistician Hadley Wickham’s original paper formalizing Tidy Data.

University of Chicago article explaining how/why Tidy data is used for the 'R' statistical package.

Blog post from molecular biologist, Joachim Goedhart, explaining why and how to convert to tidy data.