Teaching Students to Apply Statistics in R and RStudio

Overview: To apply statistics in real life, students must be able to implement statistical techniques using software. It is even better if they learn a programming language that is either designed for statistical analysis (e.g. R) or has many packages designed for statistics (e.g. Python, Julia). In my experience, R has been the first choice for practitioners and is relatively accessible. It is important to note that in certain fields (most commonly within the social sciences), practitioners still rely heavily on proprietary tools (e.g. SPSS, Stata, SAS), and these still dominate the statistical “market share” in those contexts. However, the popularity of these tools has started to fall as many practitioners in these fields move toward open-source resources like R and Python (see: Example 1 and Example 2).

There is a great need within SNC and the Division of Natural Sciences for students to learn R. The following quote comes from an anonymous professor responding to a survey I sent out to assess what other Natural Science faculty wanted to see in a statistics course: “In my fields of ecology/environment, using R for statistical analyses is a must. It would be fabulous for our students to gain experience in that language, especially because it is free, and they will be expected to use it in the next steps post-SNC.” With this in mind, I re-designed SNC’s MATH 221 – Statistics in the Sciences class to place a heavy emphasis on using R to apply the statistical techniques discussed in class. This included using R to answer pen-and-paper questions, produce lab reports, and analyze datasets in class. To help the students learn R, I leveraged a package called learnr, which allowed me to create interactive tutorials in which students could run R code while receiving guidance and hints. I took an activity-based learning approach to the class: I lectured for around 30 minutes each class period and designed tutorials (think interactive worksheets) for students to complete during the rest of class while I walked around giving help.

R Programming Language: R is a free programming language designed specifically for statistics (R: The R Project for Statistical Computing). It is one of the tools most commonly used by scientists, statisticians, data analysts, and data scientists.

[Figure: KDnuggets Analytics/Data Science 2019 Software Poll: top tools in 2019, and their share in the 2017 and 2018 polls (Source)]

R has powerful core packages which implement all of the standard statistical techniques a user would want. Beyond these core packages, R is free and open-source, so cutting-edge techniques can be developed by anyone and packaged into libraries that anyone can use. In fact, the tutorials for this class were distributed as an R package. New packages are being added all the time to implement new statistical techniques and deal with new types of data. Without going into too much detail, R is flexible enough for a trained user to implement their own statistical techniques for tackling new and interesting data-oriented problems. It is easy to see why R has become a preferred tool for researchers both in and outside of academia.
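To give a flavor of what “standard techniques out of the box” means, here is a minimal sketch using only base R and two datasets that ship with it (`sleep` and `cars`). The choice of datasets and analyses is mine for illustration; it is not taken from the course materials.

```r
# Both datasets ship with base R (the 'datasets' package), so nothing
# needs to be installed to run this.

# Welch two-sample t-test: does the extra sleep differ between the two groups?
result <- t.test(extra ~ group, data = sleep)
print(result)

# Simple linear regression: stopping distance as a function of speed
fit <- lm(dist ~ speed, data = cars)
print(summary(fit))
```

Both `t.test()` and `lm()` live in the `stats` package, which is loaded automatically whenever R starts.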

RStudio: On its own, R can be a bit intimidating for first-time users, and its graphical user interface (GUI) is unappealing even to long-time users… Enter RStudio! RStudio is an IDE (short for Integrated Development Environment) with an accessible GUI that makes learning R much less intimidating and helps keep projects much more organized. The RStudio GUI provides several “panes” which display the console (i.e. the place you run code), a text editor where you can write R scripts, a pane which monitors the environment you’re working in, and a files/plots/packages/help pane which displays… wait for it… files, plots, packages, and help pages. In the context of learning R for the first time, having all of this information in front of you at once helps to flatten the learning curve.

[Figure: RStudio IDE screenshot. RStudio GUI: Top Left – Text Editor, Bottom Left – Console, Top Right – Environment, Bottom Right – Files, Plots, Packages, and Help (Source)]

learnr: R Markdown is an extremely versatile file type which allows R users to create high-quality documents for disseminating…. well… lots of things! R Markdown files allow users to generate PDFs and HTML documents which seamlessly switch between chunks of prose, code and its output (not even just R code), formatted text, typeset mathematical expressions, images, tables, and many other useful types of output. A gallery of R Markdown documents displaying the range of possibilities can be found here. The package learnr provides tools for converting R Markdown documents into interactive tutorials. Using learnr, one can create HTML files which contain R consoles (boxes where you can run R code) referred to as “exercises”. As an instructor, you can use these tutorials to teach students how to apply new concepts in R by leveraging exercises to appropriately scaffold the content for students.
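For readers who have not seen one, a tutorial file looks roughly like the sketch below. This is a minimal illustration rather than one of my actual tutorials: the title, section heading, and chunk label `ttest` are made up, while `learnr::tutorial`, `runtime: shiny_prerendered`, and the `exercise=TRUE` chunk option are what the package (named `learnr` on CRAN) uses to turn a chunk into a runnable console.

````markdown
---
title: "Two-sample t-test"   # hypothetical tutorial title
output: learnr::tutorial
runtime: shiny_prerendered
---

```{r setup, include=FALSE}
library(learnr)
```

## Comparing two groups

Run a Welch t-test on the built-in `sleep` data.

```{r ttest, exercise=TRUE}
t.test(extra ~ group, data = sleep)
```
````

Knitting this file (or clicking “Run Document” in RStudio) opens the tutorial. If the file is bundled inside a package, students can launch it with `learnr::run_tutorial()`, passing the tutorial name and the package name.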

The exercises have many features you can use to help students learn. You can add hints, solutions, and even leverage a package called gradethis to check the results of exercises. I will note, however, that while checking solutions is possible, it is extremely complex (in all but the most basic use-cases) and time-consuming to set up. I started off the semester using this functionality and eventually eschewed it in favor of just making my last hint the solution to the exercise.
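The sketch below shows how these pieces attach to an exercise, assuming the chunk-naming convention learnr uses: hint and check chunks share the exercise’s label plus a suffix. The exercise itself (`mean-mpg`) and the feedback messages are hypothetical, and the check is deliberately one of the “most basic use-cases”.

````markdown
```{r setup, include=FALSE}
library(learnr)
library(gradethis)
gradethis_setup()   # enable gradethis checking for this tutorial
```

```{r mean-mpg, exercise=TRUE}
# Compute the mean of the mpg column in mtcars

```

```{r mean-mpg-hint-1}
# Which function computes an average?
```

```{r mean-mpg-hint-2}
mean(mtcars$___)
```

```{r mean-mpg-hint-3}
# Last hint doubles as the solution
mean(mtcars$mpg)
```

```{r mean-mpg-check}
grade_this({
  # .result holds the value returned by the student's code
  if (isTRUE(all.equal(.result, mean(mtcars$mpg)))) {
    pass("Correct!")
  }
  fail("Not quite. Check which column you averaged.")
})
```
````

Dropping the `-check` chunk (and the gradethis lines in `setup`) leaves you with the hints-only approach I ended up using.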

One feature of learnr that I found quite frustrating is that the different consoles/exercises are stand-alone, meaning that if a student defines an object in one console, they can’t access it in other consoles. This lack of functionality is particularly noticeable when you’re trying to scaffold content for students. There are many times when you may want to start a tutorial with an exercise in which students must clean, manipulate, and preprocess a dataset. Ideally, you’d like the students to be able to return to that same dataset later on in the tutorial and apply whatever statistical technique of the day is being covered. There are ways of “faking” this behavior by defining global variables that are the same as the ones you want the students to define, but it is less than ideal and requires students to use the same variable names as you.
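Here is a rough sketch of that workaround, assuming the tutorial’s global `setup` chunk is used to pre-define the object. The object name `cars_clean` and the cleaning step are hypothetical stand-ins for whatever the students are asked to build.

````markdown
```{r setup, include=FALSE}
library(learnr)
# Objects created here are visible to every exercise in the tutorial
cars_clean <- subset(mtcars, !is.na(mpg))   # hypothetical "cleaned" data
```

```{r clean-data, exercise=TRUE}
# Students practise the cleaning step themselves
cars_clean <- subset(mtcars, !is.na(mpg))
```

```{r fit-model, exercise=TRUE}
# This works because cars_clean comes from the setup chunk,
# not from whatever the student typed in the previous exercise
lm(mpg ~ wt, data = cars_clean)
```
````

The second exercise only ever sees the instructor’s version of `cars_clean`, which is exactly why the approach feels like faking it. learnr also has an `exercise.setup` chunk option for running a named chunk before one specific exercise, which helps when different exercises need different starting objects.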

Class Time: My original goal for this course was to do almost no lecturing and have a completely activity-based structure. At the beginning of the semester, the learnr tutorials I crafted would start with long sections of definitions and explanations that students could reference when they needed them.

Unfortunately, this proved to be:

  • Too time consuming: It was taking me an inordinate amount of time to craft each tutorial. This was beginning to impact my other courses and a lot of the materials I produced were not quite at the quality I would’ve hoped.
  • Not what students seemed to need: While timing was the primary issue, I found that students needed a bit more guidance before they were fully capable of diving into an activity. It is very possible that this was partially due to me not crafting the most effective activities and so I plan to re-visit this in the future when I have time.

Instead, I started using slides created in R Markdown. By using R Markdown to create my slides, I could display text/content alongside the R code for implementing the concepts. My basic formula for introducing a concept was to do so through the lens of a research problem. I would begin each lesson by describing a realistic research situation, distill it down to a research question that could more easily be converted into a statistical question, and introduce a dataset addressing that question. My goal in doing this was to help students understand why statistics is important, think about how statistics is vital to the scientific process, and mentally anchor the concepts to topics they may have already discussed in other courses.
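As an illustration of what such a slide deck can look like, here is a minimal sketch. It assumes the `ioslides_presentation` output format, which is only one of several slide formats R Markdown supports, and the lesson content is invented for the example.

````markdown
---
title: "Simple linear regression"   # hypothetical lesson
output: ioslides_presentation
---

## Research question

Does a car's weight predict its fuel efficiency?

## Fitting the model

```{r, echo=TRUE}
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```
````

Each `##` heading starts a new slide, and `echo=TRUE` shows the code on the slide together with its output.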

After this introduction, I would spend time explaining the topic of the day, making sure to bring the concepts back to the dataset we used to conceptualize the problem. Once this portion of class was complete, I would have students open a learnr tutorial on their personal computers and work through a kind of case study in which the exercises walked them through each step in the process.

What Worked: Overall, it seemed that students enjoyed working through the tutorials and managed to gain a surface-level understanding of how to use R. The tutorials went extremely smoothly, and students asked questions indicating that they were engaging with the material at the level I would expect. While I would not expect a student who had taken this course to be prepared to boot up R and dive head-first into data analysis, I would expect them to have the tools to teach themselves how to analyze data from, for example, a summer research project. Students who entered without any R experience should retain a general idea of what R syntax looks like and how it’s organized, know which packages to use for different types of analyses, and know WHERE and HOW to look for resources, help, and documentation. Students who did have some previous R experience should be comfortable applying the statistical techniques outlined in class using R.

I was SHOCKED by how few technical difficulties I encountered. There were certainly times when students had trouble loading packages, but I can only remember one instance where we had a really major bug that impacted a large portion of the class. Frankly, I was expecting to spend a much larger portion of my time this semester dealing with technical issues. I feel the need to thank the newly graduated Ms. Cassie Nooyen, the student working at the tech bar who was assigned to help my students; based on the lack of problems I had in class, she did a wonderful job.

What Didn’t Work: Unfortunately I tend to be the kind of person who focuses on negatives instead of positives. The two major issues I had in implementing this course were (1) time management and (2) assessment.

  1. Time Management: Generating these tutorials was, and I can’t emphasize this enough… EXTREMELY time consuming, especially at the beginning of the course when I was implementing Exercise Checking and trying to add all of the content to the tutorials. I would estimate that I spent between 3 and 8 hours developing EACH lesson, with a median of between 4 and 5 hours. This took a rather large toll on me (mentally, physically, and emotionally), as I felt like I never had time and was constantly scrambling to put together lessons for this class and the rest of my classes. However, I am hoping that this effort will pay off down the line. Now that I have all of the tutorials developed, it will be much less time consuming to just tweak them going forward. That being said, if you are planning to implement a strategy similar to the one outlined in this blog post, I suggest you take a summer to produce the majority of the resources you’ll use in class.
  2. Assessment: I struggled throughout the course to figure out how to effectively assess students. There were several issues here, some philosophical and some technological. While I wanted students to spend a lot of time in R and get used to using R, I was not expecting them to become experts in R, and I was not expecting them to be able to quickly write up R scripts on their own. In addition, about half the class were Data Analytics majors who came in with previous R experience and the other half were Natural Science students who didn’t. As a result, I had trouble figuring out how to write assessments that were both fair and effective. This was even reflected in a mid-semester survey I sent out to students. Basically, we spent the majority of class writing code in R, and then on exams students either didn’t need to use R at all or only needed to interpret output from R. While I plan to address this in future iterations of the course, I’m still contemplating the best way to do this. I’m thinking of making it clear that there are two primary types of assessments in the class. The first will be labs, which are meant to assess students’ R abilities. The second will be written exams, which are solely meant for assessing conceptual knowledge. I will then need to make sure that BOTH of these types of assessments are aligned to the lectures. Another issue with assessment was just figuring out how to have students submit their work. A major drawback of learnr is that there is not (currently) an easy way of collecting students’ responses. As a result, I needed to take a slapdash approach of having students either upload a PDF/HTML document of their work to Moodle (which frequently had issues) or copy their solutions into a Google Form (which doesn’t show output). I need to figure out a better way to handle submission.

Future Directions: Below are listed a bunch of tasks that I’d like to incorporate in this class at some point in the future:

  • Base as many activities and labs as possible on actual data and research from fellow SNC faculty members. By doing this, I hope to better prepare students to engage in undergraduate research with faculty.
  • Figure out a better way to handle document submission.
  • Collect more meaningful data (in a pedagogical sense) during class. When I would circulate the classroom I was pretty much reliant on students volunteering that they needed help (which most of them were fine doing) but I feel that a better approach would be to have “check-ins” where I could ask them meaningful questions about what they had done and assess their progress based on their answers.
  • Create more interactive lectures. I think a really nice next step would be to intersperse problem solving sessions throughout the lectures, rather than lecture for 30 minutes and then have them work on activities for 30 minutes.
  • Host learnr tutorials online: My initial goal was to host the tutorials online. However, this proved to be more complicated or expensive than I could manage. Doing so would remove most of the risk of technical problems during class, since students would no longer need to install anything on their own computers.
  • Emphasize the process of statistical inquiry: After the class was over, I felt that my students viewed statistics as a set of disjoint methods and techniques rather than an iterative process of inquiry involving decision making and problem-solving. This is, of course, a result of the way I taught the course and, indeed, something that the American Statistical Association warns against in their Guidelines for Assessment and Instruction in Statistical Education (GAISE) report. Going forward I plan to emphasize this process of statistical thought. I’m not exactly sure how 🤔 but I will!

I owe much more than a simple thank you to Krissy Lukens, Nick Plank, Annicka Rabida, Susan Ashley, my fellow Digital Fellows (Zoe, Brandon, Christina), Cassie Nooyen, and, of course, my lovely, wonderful, amazing students for all the hard work they put in this semester.
