Coding Assignment 1

Exploratory Data Analysis of Phagocytosis Assay

You will work on a real biological dataset from Dr. Williamson, a professor in our Bio Department!

Important: Your goal

To provide an exploratory data analysis (EDA) and craft a detailed report of your findings.

A detailed explanation of Dr. Williamson’s dataset and experiment is provided in a separate document on Moodle (Background Information).

1 What are you trying to answer with your analysis?

You are trying to answer the following questions.

  1. Does the number of particles phagocytes eat depend on the particle’s stiffness?

  2. Does the number of particles phagocytes eat depend on the substrate’s stiffness?

  3. Are there any interesting effects on the number of particles eaten that reflect both substrate and particle stiffness?

  4. Are there any noticeable batch effects [1] in the dataset?

I want you to use what we’ve learned in lectures and labs to explore this dataset. The tools you have learned so far will go a long way, but you will also need to learn new tools on your own, because that is an essential part of learning how to code. Visit R resources often!

2 Exploratory data analysis (EDA)

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. Your goal during EDA is to develop an understanding of your data.

During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.

EDA involves exploring a dataset in three [2] ways:

  1. Identifying missing values and deciding how to deal with them.

  2. Visualizing a dataset using charts.

  3. Summarizing a dataset using descriptive statistics.

The report you will compile should contain four sections (see Section 3).

2.1 Load & View the Data & Deal with Missing Data

This step involves three main tasks: a) getting the dataset into R, b) making sure that it is being read properly and, if needed, changing your code to account for missing data, and c) taking a first look at your data to understand how it is structured and how R interprets it. You should narrate what you are doing before each chunk of code.

Important

Narrate everything! Code and comments related to code go in your code chunks. Everything else explaining your process is text, not code, and should be outside code chunks.

Look at the spreadsheet and the associated README file, try to understand what’s in it, and decide how you will read it into R. We learned about some ways to do this in labs and DataCamp, and there are many others.

Once the dataset is in R, identify any missing data in the datasets and assess whether they are being handled properly or not and, if not, change your code accordingly.

This step also involves having that first look at your data.frame (or matrix). What does it look like? Which R data structure is best suited for this (matrix, data.frame, etc.)? Does the dataset follow the rules for how a dataset should be organized (rows are individual observations; columns are variables of interest)? What are the dimensions of this dataset? You should assess this before and after dealing with missing data.
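As a rough illustration of the three tasks above, the sketch below reads a small made-up table, checks how R interprets it, and counts missing values. Everything in it is an assumption for demonstration only: the column names (`replicate`, `particle`, `substrate`, `eaten`), the values, and the `-` placeholder for missing entries are hypothetical and are not taken from Dr. Williamson’s spreadsheet. Dropping incomplete rows is shown as one possible choice, not the required one.

```r
# Toy stand-in for a spreadsheet: all column names and values are hypothetical.
raw_text <- "replicate,particle,substrate,eaten
1,soft,soft,12
1,stiff,soft,NA
2,soft,stiff,-
2,stiff,stiff,7"

# Without na.strings, the "-" placeholder stops R from reading `eaten` as numbers.
naive <- read.csv(text = raw_text)
class(naive$eaten)

# Declaring every missing-value code lets R read the column as numeric.
assay <- read.csv(text = raw_text, na.strings = c("NA", "-", ""))
str(assay)

dim(assay)              # rows and columns before handling missing data
colSums(is.na(assay))   # number of missing values in each column

# One possible choice (not the only one): drop rows with any missing value.
assay_complete <- na.omit(assay)
dim(assay_complete)     # rows and columns after handling missing data
```

For a real file you would pass a file path instead of `text =`, and the right missing-value codes depend on what the README says about the spreadsheet.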

2.2 Visualizing the dataset using charts

This step involves deciding what types of visualizations make sense for your data, and revising your initial plots once you have seen them. What are the variables? What type of data are they? What types of plots and summaries are appropriate, and which ones will help you answer the questions above?

Caution: Not everything that is a number is a numeric variable

Remember: not everything that is a number is a numeric variable, and the same goes for other types of variables. We covered the dos and don’ts of data visualization in class and labs.
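To make the caution concrete, here is a minimal sketch using invented data in which an ID column is stored as a number but really acts as a label. The data frame and its column names (`replicate`, `eaten`) are hypothetical and only serve to show the idea.

```r
# Hypothetical data: `replicate` is stored as a number but is really a label.
assay <- data.frame(
  replicate = c(1, 1, 2, 2, 3, 3),
  eaten     = c(12, 9, 7, 10, 4, 6)
)

summary(assay$replicate)   # the "mean replicate" printed here is meaningless

# Convert the ID to a factor so R treats it as a category.
assay$replicate <- factor(assay$replicate)
table(assay$replicate)     # number of observations per replicate

# plot() draws a scatter plot for a numeric x, but one box per level
# when x is a factor.
plot(eaten ~ replicate, data = assay,
     xlab = "Replicate", ylab = "Particles eaten")
```

Whether a column should be a factor depends on what it means in the experiment, not on how it happens to be stored in the file.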

As outlined above, you should narrate what you are doing before each chunk of code.

2.3 Summarizing the dataset using descriptive statistics

This step involves one main task: summarising the data based on the types of variables it has.

Tip: Let the data guide you

Remember: let the data type guide the visualization and summary processes! Since the question has to do with effects of different treatments on macrophage feeding habits, you will need to somehow summarise the observations for each of the groups/treatments.
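Summarising observations per group can be sketched with base R as follows. This is only one possible approach, shown on invented data: the column names (`particle`, `substrate`, `eaten`) and all values are hypothetical, and other tools covered in labs may suit your analysis better.

```r
# Hypothetical data: two categorical treatments and one numeric outcome.
assay <- data.frame(
  particle  = c("soft", "soft", "stiff", "stiff", "soft", "stiff"),
  substrate = c("soft", "stiff", "soft", "stiff", "soft", "stiff"),
  eaten     = c(12, 9, 7, 10, 14, 6)
)

# Number of observations in each treatment combination.
group_counts <- table(assay$particle, assay$substrate)

# Mean of `eaten` for each particle/substrate combination (long format).
group_means <- aggregate(eaten ~ particle + substrate, data = assay, FUN = mean)
group_means

# The same means arranged as a particle-by-substrate table.
mean_table <- tapply(
  assay$eaten,
  list(particle = assay$particle, substrate = assay$substrate),
  mean
)
mean_table
```

Swapping `mean` for `median`, `sd`, or another function gives other descriptive statistics for the same groups.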

2.4 Discussion

Although your report will not include proper statistical testing of hypotheses, I want you to do your best in terms of interpreting your exploratory data analysis.

The discussion should include:

  • A quick recap of your main observations.
  • An interpretation of what these observations say about the questions you are trying to answer.
  • Limitations/future studies: include any thoughts you have about the limitations of this experiment and what could be interesting to add to it.

3 How to structure your report

Your report should have at least the following sections. You may use subheadings as well.

  • Background: give some information (in your own words) about the experiment and about the format of the dataset in the spreadsheet.

  • Exploratory data analysis: this includes the visualizations and descriptive statistics. Subheadings may come in handy here. This is the core part of your report. See Section 2.2 and Section 2.3.

  • Discussion: (see Section 2.4)

  • References: include a reference section. There are no stringent formatting needs here. See below.

4 What to cite and how to cite it

Sharing/Reusing Code

There is a huge volume of code available on the web to solve any number of problems. You should use online resources (e.g. StackOverflow, R-bloggers) to help you troubleshoot errors or learn new tools. You will not be penalized for searching for and/or recycling code as long as the source is explicitly cited (see below).

The format of the citation is not going to be evaluated strictly, but you must cite your sources and include: a link (if available), who (person, article, blog post, etc.), when (approximate is fine), what (how this person/AI/post helped you), and “personal communication” (if this came from a conversation).

For example, to cite this thread on Stack Overflow, and a particularly good answer in said thread [3], you would format it like this:

Holcombe, A. How to change legend title in ggplot2? Retrieved October 10, 2025, from https://stackoverflow.com/questions/14622421/how-to-change-legend-title-in-ggplot

And cite it in-text as (Holcombe, 2020).

To cite a discussion with a classmate/TA:

Cavalieri, N. (2025, October 25). Personal communication. Helped clarify how to change colors in boxplots.

And cite in-text as (Cavalieri, 2025).

To cite an interaction with an AI such as ChatGPT, make it clear what the query was.

4.1 LLMs/AI

If you do use these tools, this is what is allowed:

  • to brainstorm ideas
  • to decode cryptic error messages or explain why they might have occurred
  • to ask specific questions about R-language related things like function parameters, etc.

Warning: You may not use AI for:
  • writing any portion of your report for you.
  • writing your code for you.

Don’t forget, your instructor, TA, and coursemates are here to help you! Collaborating is very much encouraged (as long as you don’t plagiarize), and pretty fun.

All of the allowed uses must be properly cited. Recycled code that is not explicitly cited will be treated as plagiarism. Do not hesitate to ask if you are unsure what constitutes “direct use” or “substantive inspiration”.

If you do use LLMs for a coding assignment, you must include citations for each instance where you used it, including the prompt you used.

For example:

ChatGPT, Oct 10th 2025. My query was: “can you tell me a funny reason why collaborating with people on writing R code is better than using AI to do it?”

5 Evaluation

You will be submitting 2 files:

  • an R Markdown script that will include your text and your R code, all in one document, AND
  • the compiled PDF that your R Markdown script will generate.

Finally, I am providing you with a template for this assignment in the Posit Cloud project titled Coding_Assignment_1. The assignment will be graded using the rubric described below. Effort will be rewarded. When in doubt, be as descriptive/detailed/explicit as you can!

What will be evaluated:

  • Your R Markdown (.Rmd) file and the PDF that results from compiling it. You will upload the PDF into Moodle under Coding Assignment 1.
  • Your inclusion of each of the sections mentioned here and outlined in the R Markdown template file I am providing.
  • Running your code from top to bottom should allow me to recreate your results. I will test this by running your code within your copy of the Posit Cloud project, so make sure the file is there, especially if you have modified it in any way.
  • The plots and descriptive statistics should be appropriate for the type of data.
  • The report must attempt to answer the questions outlined.

6 Instructions for resubmission

If your grade was less than 90% in this assignment, you may resubmit it based on my feedback. The final grade will be the mean of both attempts.

  • Make a new R Markdown file whose name ends in _v2.Rmd. Leave a copy of this file in the Posit Cloud space for this assignment.
  • Knit the markdown into a PDF.
  • Upload both files by the new due date: TBD
  • Add the two extra portions below to the front of your PDF.

Very important: write a document separately or add it as a cover to your PDF where you explain IN DETAIL what you changed in this second version. I don’t have time to re-grade the assignment if you don’t tell me what to look for.

Very important: Use two copies of the rubric posted on Moodle to evaluate your version 1 and your updated version. This is a required part of your re-submission.

7 Learn more about EDA and R markdown

Footnotes

  1. In molecular biology, a batch effect occurs when non-biological factors in an experiment cause changes in the data produced by the experiment. The background information document provides important information about the biological replicates used in this study.

  2. Technically, four, because the first thing to do is to clean up the data, but we haven’t learned that yet!

  3. I am using this name just as an example, as it was the name of the user with the top reply in that thread.