DataClassroom


Understanding data types

Why is understanding data types so important, and why can it be difficult? When is a number categorical? Why are dates problematic?

Having data organized by variable, and knowing which type each variable is, is very important, as it guides (or limits) the mathematical tests you can perform.

I’m going to assume you already know and love the concept of Tidy Data, as explained in this blog post and in our User Guide, which means your data will be divided up into:

  • Columns which each represent one variable

  • Rows which each represent one sample or data point

Each entry in the table is then the measured value of the column variable. And the value has a meaningful type.

After importing some data, the first thing DataClassroom will show you is that you need to set the type for each column. A little yellow warning indicator will show until you sort this out.

Until you’ve done this, you won’t be able to do much with the data!

You may be asking - why doesn’t DataClassroom do this for me? Surely it can work out what kind of data is in each column?

Well… two reasons. The most important is that DataClassroom is a learning tool, and understanding data types is important. The other is that (unfortunately) the tool does not have the psychic abilities that would let it guess what kind of data you have collected, and get it right every time!

So, what is the process?

The first question to ask is: is this data something we are going to be using for an analysis or graph?

If the answer is “yes”, it should almost always be characterized as Categorical or Numeric. If not, it is really just there as information, and can be categorized as Info. See more detail, and some other types, in the User Guide.

The simple cases of Categorical and Numeric data seem obvious, but there are some traps to fall into.

Numeric data consists of numbers (no surprise there!) - but they must be numbers that can be meaningfully used in mathematical operations. So, a ZIP code, while appearing numeric, doesn’t have any meaning as a number. You can’t imagine adding 1 to a ZIP code and having the result be at all useful, right?

A ZIP code might be used as a Categorical variable. A Categorical variable should have values that are taken from a fixed, limited set of possible values.

If you are using the ZIP code to compare some situation between (say) five districts in a city, then you have a Categorical variable with five possible values that you can use to group your samples.

And that’s maybe the key point: a Categorical variable is used to group data into a set of categories. Kind of like a “label”.
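To make the “label” idea concrete, here is a minimal Python sketch. It is not part of DataClassroom; the sample values and the helper name group_means are invented for illustration. The ZIP codes are stored as text, so nobody is tempted to do arithmetic on them, and they are only used to group the measured values.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (ZIP code, measured value).
# The ZIP codes are strings: they are labels, not quantities.
samples = [
    ("10001", 4.0),
    ("10002", 6.0),
    ("10001", 8.0),
    ("10003", 5.0),
    ("10002", 2.0),
]

def group_means(rows):
    """Group the values by their categorical label and average each group."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return {label: mean(values) for label, values in groups.items()}

means_by_zip = group_means(samples)
print(means_by_zip)
```

The numeric column is the one that gets averaged; the ZIP code only decides which group each row belongs to.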

Dates and times

Dates and times are always tricky. They may have been collected in any number of ways, and in any number of formats. When faced with a column with this kind of data, take a deep breath and ask yourself:

What is this data going to be used for?

If it is for labelling or grouping data, then it doesn’t really matter what format it is in. You can use a variable with the values Tuesday, Wednesday, Thursday as a Categorical variable, for example to compare hospital outcomes for people admitted on those days.

But often you may want to use the information mathematically. Maybe you recorded the date so you could see how many days some treatment had been applied. Or you recorded the time so you could see how long a chemical process had taken.

Then you need to convert the dates or times to numbers.

We recommend that you make a new numerical column, and convert the date/time values to simple numeric values that represent the numbers you are interested in. For example Treatment Length (days) or Reaction Time (sec).

Make sure these are simple numeric values that you can perform math on. Use “seconds” and not “minutes:seconds” or you will be back to square one!
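If you would rather do the conversion in a script than in a spreadsheet, the sketch below shows one possible way in Python. It assumes times written as “minutes:seconds” and dates written as YYYY-MM-DD; your data may well use other formats, and the function names and example values are made up for illustration.

```python
from datetime import date

def mmss_to_seconds(text):
    """Convert a 'minutes:seconds' string such as '2:30' to plain seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)

def days_between(start_iso, end_iso):
    """Number of days from one YYYY-MM-DD date to another."""
    start = date.fromisoformat(start_iso)
    end = date.fromisoformat(end_iso)
    return (end - start).days

# Values for a hypothetical "Reaction Time (sec)" column
reaction_times = [mmss_to_seconds(t) for t in ["0:45", "2:30", "10:05"]]

# Value for a hypothetical "Treatment Length (days)" column
treatment_days = days_between("2023-03-01", "2023-03-15")

print(reaction_times, treatment_days)
```

The results are ordinary numbers, so they can go straight into a new Numeric column.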

Conversion tips

Due to the enormous number of possible formats for such data, DataClassroom can’t itself perform conversions like the above. So you’ll need to do it either manually or using an external spreadsheet program, experimenting until you get the correct, usable values. It’s easy to copy/paste data between DataClassroom and other programs.

And, did you know that DataClassroom can perform mathematical transforms on whole columns of data? If you need to adjust a column by multiplying, adding, subtracting, taking the square root… or many other operations, see Transforming data in the User Guide.

Other data types

You may come across other names for data types, such as Ordinal, Continuous, Discrete... see this article for how they relate to the above in DataClassroom.

Exceptions

There’s always an exception, and in DataClassroom you can also label a variable as a Sample Count. This is a special case, used where you are not writing down each instance of a result as a separate sample (row) but just counting how many times it happens. Doing this limits the ways your data can be analyzed, but it can result in a much more readable and usable table, if you really don’t need every row. Use carefully and see the examples in the User Guide!
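As a rough illustration of the idea (not of how DataClassroom stores it), the Python sketch below collapses one-row-per-observation data into a table with one row per result and a count. The flower colours are invented example data.

```python
from collections import Counter

# One entry per observation: the "every row" version of the data
observations = ["red", "white", "red", "pink", "red", "white"]

# One row per distinct result, with a count of how often it occurred
counts = Counter(observations)
count_table = sorted(counts.items())

for colour, n in count_table:
    print(colour, n)
```

Six rows become three, which is easier to read, but the individual observations are no longer there to analyze separately.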

Errors and warnings

Once you have selected a type for a column, DataClassroom can check whether everything looks OK. It will warn you if there is non-numeric data in a numeric column (maybe you typed L instead of 1?), which can be useful.

It will also warn of empty cells, which are OK in a numeric column (although you probably want to check whether you expected them) but can be problematic for categorical data, as all data should belong to a certain category.

Errors and warnings won’t affect graphs (the data will just be left out) and may not affect your hypothesis tests, but don’t worry, you’ll receive another warning if something fails as a result.
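If you are preparing data outside DataClassroom, you can run a similar check yourself before importing. The Python sketch below is only an assumption about what such a check might look like, not DataClassroom’s own logic: it reports which cells in a supposedly numeric column are empty and which cannot be read as numbers.

```python
def check_numeric_column(cells):
    """Return (empty_rows, non_numeric_rows) as lists of row indexes."""
    empty, non_numeric = [], []
    for index, cell in enumerate(cells):
        text = cell.strip()
        if text == "":
            empty.append(index)
            continue
        try:
            float(text)  # succeeds only if the text reads as a number
        except ValueError:
            non_numeric.append(index)
    return empty, non_numeric

# Hypothetical column with two typos (a letter L typed for a 1) and one empty cell
column = ["12", "1L", "", "3.5", "l0"]
empty_rows, bad_rows = check_numeric_column(column)
print(empty_rows, bad_rows)
```

Note that Python’s float also accepts text such as "nan" and "inf", so a stricter check may be needed if those could appear in your data.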

A final thought

Often, the experiment has already been carried out and the data collected before aspects like data types are thought about. This can cause difficulties!

Consider going through the above before you collect the data, or indeed while planning the experiment. Discuss which variables are going to be used, and what types they are going to have. This may lead to insights that can even improve the value of the experiment!



