Ecotourism Tutorial

When do glowworms glow? Joining wildlife, weather and tourism data

Author

Vajinder Kaur

🌿 The question

Imagine you work for a small ecotourism company in Queensland.

A client asks: “When is the best time to see glowworms?”

Glowworms lighting up a cave in Australia

To answer that, you need to answer three smaller questions first.

Where are glowworms actually spotted? What is the weather like there? And how busy is the region with other tourists?

This tutorial helps you find those answers.

🎯 Objectives

By the end of this tutorial you will be able to:

  • Filter wildlife occurrence records by region and month
  • Join occurrence data to daily weather using ws_id and date
  • Join occurrence data to quarterly tourism counts
  • Spot and explain missing data in a real dataset
  • Draw careful conclusions from incomplete information

Bonus: You will also learn to go online and check whether what the data shows matches what is actually true about glowworms in the wild.

🔧 Preparation

Install the ecotourism package if you have not already:

# install.packages("pak")
pak::pak("vahdatjavad/ecotourism")

Then load everything you need:

library(ecotourism)
library(tidyverse)
library(lubridate)
library(plotly)

We will work with four datasets from the package:

Dataset Description
glowworms Glowworm sighting records across Australia (2014-2024)
weather Daily weather observations by station
tourism_quarterly Quarterly overnight trip counts by region and purpose (trips in thousands)
tourism_region Tourism region names and locations

All four datasets share one variable: ws_id. This is the weather station ID that links a sighting to its nearest weather station and tourism region. You will use it in every join.

🔍 Meet the data

Before we answer any questions, let us look at what we are working with.

The organisms

The ecotourism package has four wildlife datasets. Why did we choose glowworms?

# Compare all organisms -- size, weather coverage, tourism coverage
list(
  `Gouldian Finch` = gouldian_finch,
  `Manta Rays`     = manta_rays,
  `Glowworms`      = glowworms,
  `Orchids`        = orchids
) |>
  purrr::map_dfr(function(df) {
    # Aggregate tourism then join both datasets
    tq <- tourism_quarterly |>
      group_by(ws_id, year, quarter) |>
      summarise(total_trips = sum(trips, na.rm = TRUE), 
                .groups = "drop")
    df_full <- df |>
      mutate(quarter = quarter(date)) |>
      left_join(tq, by = c("ws_id", "year", "quarter")) |>
      left_join(weather |> select(ws_id, date, max),
                by = c("ws_id", "date"))
    tibble(
      Sightings          = nrow(df),
      `Weather coverage` = paste0(round(mean(!is.na(df_full$max))         * 100, 1), "%"),
      `Tourism coverage` = paste0(round(mean(!is.na(df_full$total_trips)) * 100, 1), "%"),
      States             = n_distinct(df$obs_state)
    )
  }, .id = "Organism") |>
  kable() |>
  kable_styling(full_width = FALSE)
Organism Sightings Weather coverage Tourism coverage States
Gouldian Finch 3922 54% 50.6% 7
Manta Rays 953 75.1% 4.9% 5
Glowworms 124 8.9% 39.5% 3
Orchids 35052 7.5% 36.2% 1

Orchids and Gouldian Finches have too many records for a focused tutorial. Manta rays have great weather data but almost no tourism matches. That makes a three-way join nearly impossible.

Glowworms sit in a sweet spot. Small enough to explore comfortably. Good enough tourism coverage for a meaningful join. Glowworms have the least weather data of any organism. That turns out to be useful for teaching.

There is one more reason. Glowworms are bioluminescent. They glow after dark to attract prey. That means the hour of a sighting carries real biological meaning. A sighting at 2am tells a very different story than one at 2pm. No other organism in this package gives you that.

The sightings

Let us look at the key variables in the glowworm dataset:

tibble(
  Variable    = c("obs_lat", "obs_lon", "date", 
                  "hour", "obs_state", "ws_id"),
  Description = c(
    "Latitude of sighting",
    "Longitude of sighting",
    "Date of sighting",
    "Hour of day (0-23)",
    "Australian state",
    "Nearest weather station ID"
  )
) |>
  kable() |>
  kable_styling(full_width = FALSE)
Variable Description
obs_lat Latitude of sighting
obs_lon Longitude of sighting
date Date of sighting
hour Hour of day (0-23)
obs_state Australian state
ws_id Nearest weather station ID

Three things to notice. First, ws_id links each sighting to its nearest weather station. Second, hour tells us what time of day the sighting was recorded. Third, obs_state tells us which Australian state it was in.

Why Queensland and November?

We compared all three states before choosing:

glowworms |>
  count(obs_state, sort = TRUE) |>
  kable() |>
  kable_styling(full_width = FALSE)
obs_state n
Queensland 61
Tasmania 40
New South Wales 23

Queensland has the most sightings. But here is what made us choose November specifically:

glowworms |>
  filter(obs_state == "Queensland") |>
  mutate(
    time_of_day = case_when(
      hour >= 20 | hour <= 4 ~ "Night (8pm-4am)",
      hour >= 17              ~ "Evening (5pm-8pm)",
      TRUE                    ~ "Daytime"
    )
  ) |>
  count(month, time_of_day) |>
  complete(month = 1:12, time_of_day, fill = list(n = 0)) |>
  mutate(month_name = month.abb[month]) |>
  plot_ly(
    x      = ~month_name,
    y      = ~n,
    color  = ~time_of_day,
    type   = "scatter",
    mode   = "lines+markers"
  ) |>
  layout(
    xaxis  = list(title = "Month", categoryorder = "array",
                  categoryarray = month.abb),
    yaxis  = list(title = "Number of sightings"),
    legend = list(title = list(text = "Time of day")),
    margin = list(t = 20)
  )

 

Night sightings peak in November. The blue line rises sharply while daytime sightings stay flat. That matches what we know about glowworm biology.

Your bonus task: Go online and check. Are glowworms in Queensland really most active in November? What does the Atlas of Living Australia say?

We also noticed Tasmania has surprisingly good data coverage in November. We focus on Queensland here. But Tasmania is waiting for you in the extensions at the end.

📥 Exercises

Exercise 1 — Where is the weather?

You are planning glowworm tours in Queensland in November.

Your first instinct is to check the weather. Warm humid nights are best for glowworm activity. So you join the sighting records to the weather data.

But something unexpected happens.

Your tasks:

  1. Filter glowworms to Queensland sightings in November.
  2. Join to weather using ws_id and date.
  3. Count how many sightings have a non-missing max temperature.
  4. What proportion of sightings have no weather data at all?
  5. Why do you think weather data is missing for these locations?
Tip

Before you start: Look at the ws_id values in your filtered data. Then look at the ws_id values in weather. Do they overlap?

# Step 1 -- filter to Queensland November
glow_qld_nov <- glowworms |>
  filter(obs_state == "Queensland", month == 11)

cat("Total sightings:", nrow(glow_qld_nov), "\n")
Total sightings: 12 
# Step 2 -- join to weather
glow_qld_nov_wx <- glow_qld_nov |>
  left_join(weather, by = c("ws_id", "date"))

# Steps 3 and 4 -- completeness summary
glow_qld_nov_wx |>
  summarise(
    total_sightings = n(),
    with_weather    = sum(!is.na(max)),
    without_weather = sum(is.na(max)),
    pct_missing     = round(mean(is.na(max)) * 100, 1)
  ) |>
  kable() |>
  kable_styling(full_width = FALSE)
total_sightings with_weather without_weather pct_missing
12 0 12 100
# Step 5 -- check station date coverage
weather |>
  filter(ws_id %in% unique(glow_qld_nov$ws_id)) |>
  group_by(ws_id) |>
  summarise(
    n_records = n(),
    min_date  = min(date),
    max_date  = max(date)
  ) |>
  kable(caption = "Weather coverage for Queensland November glowworm stations") |>
  kable_styling(full_width = FALSE)
Weather coverage for Queensland November glowworm stations
ws_id n_records min_date max_date
945820-99999 1 2014-09-25 2014-09-25

What did you find? All 12 sightings show 100% missing weather. Even though one station (945820-99999) does appear in the weather dataset.

Check its coverage. It has exactly one record: September 25, 2014. Our November sightings span 2018 to 2024. The station was set up, recorded a single day, and then went silent.

This is what real monitoring data looks like. A station appearing in a dataset does not mean it has useful coverage. Always check the date range, not just the station ID.

Your bonus task: Search for Springbrook National Park in Queensland. Is it near a major weather station? What does that tell you about the kind of places glowworms live?

Exercise 2 — Who is visiting?

Weather data is missing. But tourism data might still help.

Before you join, there is one thing to sort out. The tourism_quarterly dataset has two rows per region per quarter. One for Holiday trips and one for Business trips. If you join directly to glowworm sightings you will get duplicate rows. One sighting will appear twice.

Try it and see:

glow_qld_nov |>
  left_join(tourism_quarterly, by = c("ws_id", "year")) |>
  nrow()
[1] 76

More rows than sightings. That is the many-to-many problem.

Think of it this way. You have 12 glowworm sightings. Each has a station ID. The tourism table has two rows per station. One for Holiday, one for Business. When you join, each sighting matches both rows. 12 sightings become 24 rows. Your data doubled without adding anything real.

The fix is simple. Before joining, collapse those two rows into one. Add Holiday and Business trips together to get one total_trips number per station per quarter. Now each sighting matches exactly one tourism row.

Your tasks:

  1. Aggregate tourism_quarterly by summing Holiday and Business trips into total_trips per ws_id, year and quarter.
  2. Add a quarter column to glow_qld_nov.
  3. Join your aggregated tourism data to glow_qld_nov.
  4. How many sightings have a matching tourism record?
  5. Are glowworm sighting locations Holiday destinations or Business destinations?
Tip

Hint: November is in quarter 4. Use lubridate::quarter(date) to calculate it automatically.

# Step 1 -- aggregate tourism to avoid many-to-many
tourism_total <- tourism_quarterly |>
  group_by(ws_id, year, quarter) |>
  summarise(
    total_trips    = sum(trips, na.rm = TRUE),
    holiday_trips  = sum(trips[purpose == "Holiday"],  na.rm = TRUE),
    business_trips = sum(trips[purpose == "Business"], na.rm = TRUE),
    .groups = "drop"
  )

# Steps 2 and 3 -- add quarter and join
glow_qld_nov_q <- glow_qld_nov |>
  mutate(quarter = quarter(date))

glow_qld_nov_tour <- glow_qld_nov_q |>
  left_join(tourism_total, by = c("ws_id", "year", "quarter"))

# Step 4 -- completeness summary
glow_qld_nov_tour |>
  summarise(
    total_sightings = n(),
    with_tourism    = sum(!is.na(total_trips)),
    without_tourism = sum(is.na(total_trips)),
    pct_missing     = round(mean(is.na(total_trips)) * 100, 1)
  ) |>
  kable() |>
  kable_styling(full_width = FALSE)
total_sightings with_tourism without_tourism pct_missing
12 5 7 58.3
# Step 5 -- Holiday vs Business
tourism_quarterly |>
  filter(ws_id %in% unique(glow_qld_nov$ws_id)) |>
  group_by(purpose) |>
  summarise(total_trips = round(sum(trips, na.rm = TRUE), 0)) |>
  plot_ly(
    x          = ~purpose,
    y          = ~total_trips,
    color      = ~purpose,
    type       = "bar",
    showlegend = FALSE
  ) |>
  layout(
    xaxis  = list(title = "Trip purpose"),
    yaxis  = list(title = "Total overnight trips (thousands)"),
    margin = list(t = 20)
  )

Holiday trips are nearly nine times higher than Business trips. Each unit of trips represents thousands of overnight visits. So these glowworm regions receive roughly 2.9 million Holiday overnight trips versus 333 thousand Business trips across all quarters on record.

That makes sense. People do not travel to remote Queensland rainforests for work meetings. They go to see glowworms, walk rainforest trails, and watch wildlife after dark.

Your bonus task: Look up the tourism regions that match these stations. Are they well known ecotourism destinations in Queensland?

Exercise 3 — Putting it all together

You now know two things about Queensland glowworm sightings in November.

Weather data is completely missing. Tourism data exists for about 42% of sightings.

Now bring everything together into one joined dataset. Then answer the question your client actually asked: when is the best time to see glowworms?

Your tasks:

  1. Start from glow_qld_nov_q (your Queensland November sightings with quarter added).
  2. Join tourism data using tourism_total.
  3. Join weather data using ws_id and date.
  4. How complete is your final joined dataset across all three sources?
  5. Using sightings by year, which years have tourism data and which do not?
  6. Look at the hour of sightings. What time of day are most glowworms spotted in November?
  7. Based on everything you have found, what would you actually tell your client?
Tip

Think before you code: You already created tourism_total in Exercise 2. You do not need to recreate it. Just use it directly.

# Steps 1, 2, 3 -- full three-way join
# Select only needed weather columns to avoid duplicate year/month columns
glow_full_join <- glow_qld_nov_q |>
  left_join(tourism_total, by = c("ws_id", "year", "quarter")) |>
  left_join(
    weather |> select(ws_id, date, temp, min, max, prcp, wind_speed),
    by = c("ws_id", "date")
  )

# Step 4 -- completeness summary
glow_full_join |>
  summarise(
    total_sightings = n(),
    has_weather     = sum(!is.na(max)),
    has_tourism     = sum(!is.na(total_trips)),
    has_both        = sum(!is.na(max) & !is.na(total_trips)),
    pct_weather     = round(mean(!is.na(max))         * 100, 1),
    pct_tourism     = round(mean(!is.na(total_trips)) * 100, 1)
  ) |>
  kable(caption = "Data completeness after three-way join") |>
  kable_styling(full_width = FALSE)
Data completeness after three-way join
total_sightings has_weather has_tourism has_both pct_weather pct_tourism
12 0 5 0 0 41.7
# Step 5 -- sightings by year colored by tourism availability
glow_full_join |>
  count(year, name = "n_sightings") |>
  mutate(
    tourism = ifelse(
      year %in% (glow_full_join |> 
                   filter(!is.na(total_trips)) |> 
                   pull(year) |> 
                   unique()),
      "Tourism data available",
      "No tourism data"
    )
  ) |>
  plot_ly(
    x         = ~factor(year),
    y         = ~n_sightings,
    color     = ~tourism,
    colors    = c("Tourism data available" = "#5cb85c",
                  "No tourism data"        = "#d9534f"),
    type      = "bar",
    text      = ~paste0(year, ": ", n_sightings, " sightings"),
    hoverinfo = "text"
  ) |>
  layout(
    xaxis  = list(title = "Year"),
    yaxis  = list(title = "Number of sightings"),
    legend = list(title = list(text = "")),
    margin = list(t = 20)
  )
# Step 6 -- time of day distribution
glow_full_join |>
  mutate(
    time_of_day = case_when(
      hour >= 20 | hour <= 4 ~ "Night (8pm-4am)",
      hour >= 17              ~ "Evening (5pm-8pm)",
      TRUE                    ~ "Daytime"
    )
  ) |>
  count(time_of_day) |>
  plot_ly(
    labels = ~time_of_day,
    values = ~n,
    type   = "pie"
  ) |>
  layout(
    legend = list(title = list(text = "Time of day")),
    margin = list(t = 20)
  )

What did you find?

Weather data is completely absent for Queensland November sightings. You cannot say anything about temperature or rainfall conditions.

Tourism data covers five of the twelve sightings. The years with tourism data are 2018, 2021 and 2022. After 2022 the tourism dataset ends. So 2023 and 2024 sightings have no tourism match at all. This is a data currency problem, not a coverage problem.

The hour data is the most useful finding. More than half of November sightings happen after dark. Glowworms are bioluminescent. They glow at night to attract prey. The data confirms the biology.

So what do you tell your client?

“November is a good month to look for glowworms in Queensland. Go after dark. Most sightings happen between 8pm and 4am. Head to known Holiday tourism areas. And bring a rain jacket. We have no weather data for these remote locations.”

That is an honest answer. It uses what the data can tell you. And it is clear about what it cannot.

👌 Finishing up

You set out to answer one question: when is the best time to see glowworms in Queensland?

You ended up learning three things that go beyond that question.

First, a station appearing in a dataset does not mean it has useful data. Always check date ranges and coverage before trusting a join.

Second, the many-to-many problem is not an error. It is a signal that your data needs preparation before joining.

Third, missing data tells a story too. The places where glowworms live in Queensland are remote, off the grid, and far from weather stations. That is part of what makes them special.

Try these extensions:

  • Re-run this entire analysis for Tasmania. Tasmania had 50% weather coverage and 83% tourism coverage in November. Does the story change? Is the client better served by a Tasmania glowworm tour?
  • Try inner_join() instead of left_join() in Exercise 1. How many sightings do you keep? What are you losing?
  • The tourism data only goes to 2022. Sightings go to 2024. What would you need to complete the picture?
  • Go to the Atlas of Living Australia at ala.org.au. Search for glowworms in Queensland. Do the locations match what you found here?