As if on cue, the birding gods delivered two early vagrants to Britain: a **Steppe Grey Shrike** in Dunbar, Lothian, and a **Swainson’s Thrush** on St Kilda, Outer Hebrides. Both occurred on the same day, and in this brief report I will apply the newly developed probability methodology outlined in my previous post to these two records.

**Steppe Grey Shrike**

Steppe Grey Shrike is typically associated with the later autumn period; all bar five of the 29 records in the British Isles have occurred after 20th September. The five exceptions are three spring records (Cornwall in 1992, Isle of Man in 2003 and Highland in 2023), a 14th September record from North Ronaldsay in 1994, and the famous, ridiculously early August bird on Chipping Sodbury Common in Gloucestershire in 2022.

Therefore, firstly, the occurrence of the Dunbar bird on such an early date is not unprecedented, but it is still notable, and the probability score should reflect that. Secondly, unlike the Eastern Yellow Wagtail, the Steppe Grey Shrike records likely follow a normal distribution (Shapiro-Wilk score = 0.976, p-value = 0.789), so the method needs a slight adjustment: the core process is the same, but the calculations differ. I did also check whether a skew-normal distribution performed better, but it had a higher **Akaike Information Criterion** score than the normal model (a higher AIC indicating a worse trade-off between fit and complexity), so I proceeded using the normal distribution.
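As a rough sketch of this step — using made-up day-of-year values rather than the real record dates, and assuming the tests come from `scipy` (the post doesn’t say which software was used) — the normality check and the AIC comparison might look like:

```python
import numpy as np
from scipy import stats

# Illustrative day-of-year values for autumn records (NOT the real data)
days = np.array([257, 264, 268, 270, 274, 279, 281, 285, 288, 290,
                 295, 300, 302, 305, 310, 312, 318, 322, 326, 330])

# Shapiro-Wilk test: a high p-value means we cannot reject normality
W, p_sw = stats.shapiro(days)

# Fit both candidate models by maximum likelihood and compare AIC
# (AIC = 2k - 2*log-likelihood; LOWER AIC indicates the better model)
mu, sigma = stats.norm.fit(days)
aic_norm = 2 * 2 - 2 * stats.norm.logpdf(days, mu, sigma).sum()

a, loc, scale = stats.skewnorm.fit(days)
aic_skew = 2 * 3 - 2 * stats.skewnorm.logpdf(days, a, loc, scale).sum()

print(f"Shapiro-Wilk W = {W:.3f}, p = {p_sw:.3f}")
print(f"AIC: normal = {aic_norm:.1f}, skew-normal = {aic_skew:.1f}")
```

Note the skew-normal pays an AIC penalty for its extra shape parameter, so it only wins if the skew genuinely improves the likelihood.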

The model fit was rather good but not as good as the Eastern Yellow Wagtail one (two-sample Anderson-Darling score = 45.7, p-value = 0.837) and the associated probability for a Steppe Grey Shrike to occur on 10th September was **0.479% (3 significant figures)**. Not exactly a dead cert (indeed decidedly unlikely), but a damn sight more likely than an Eastern Yellow Wagtail on the 7th September – which is exactly what we expected!
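One plausible reading of the date probability — I’m assuming here that it’s the probability mass the fitted distribution assigns to that single calendar day, as the post doesn’t spell out the exact calculation — is a difference of the fitted normal CDF across the day:

```python
import numpy as np
from scipy import stats

# Illustrative day-of-year record dates (NOT the real data)
days = np.array([257, 264, 268, 274, 281, 288, 295, 302, 310, 318])
mu, sigma = stats.norm.fit(days)

# 10th September is day 253 of a non-leap year; take the probability mass
# the fitted normal assigns to that one day
day = 253
p_day = stats.norm.cdf(day + 1, mu, sigma) - stats.norm.cdf(day, mu, sigma)
print(f"P(record on day {day}) = {100 * p_day:.3f}%")
```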

One interesting facet of this was the treatment of outliers: I removed the spring records, as the occurrence pattern and mechanism in the British Isles for this species is fundamentally different in that season. But how should I treat the August record?

Removing this record actually made the distribution **less** normal – the Shapiro-Wilk p-value dropped all the way down to 0.463 (3 s.f.). Additionally, while the normal model still outperformed a skew-normal model on AIC, its fit wasn’t as good as that of the previous normal model (two-sample Anderson-Darling score = 55.0, p-value = 0.703).

The analysis of this finding can cut one of two ways. Either:

- The occurrence pattern of Steppe Grey Shrike is changing, causing the distribution of record timing to shift, and/or:
- Steppe Grey Shrike has previously been under-recorded in the UK in early autumn.

There are merits to both arguments; however, the data is not extensive enough to support either conclusion definitively. In my view, the circumstantial evidence points either to the latter or to a combination of both, rather than the former alone. Aside from the previous early September record on North Ronaldsay, there are also several long-standing records of Steppe Grey Shrike from the Continent in the early autumn period, including the Netherlands on 4th September 1994 and 12th September 2014, and Norway on 5th September 1953, suggesting early records are not entirely unprecedented.

However, this provides an important cautionary tale about comparing apples and pears: a seemingly minor alteration to the data may not be interpreted as minor by the computer. For instance, the kurtosis of the record timings (i.e. a measure of the strength and number of outliers in the data) dropped by 11%, from **2.40** to **2.13**. Given the kurtosis of a “perfect” normal distribution is 3.00, this would suggest that the outlier was actually not a “true” outlier, and instead fell well within the scope of the Steppe Grey Shrike record distribution. However, this assumption only holds if the records **do in fact** follow a normal distribution. It could be that the “true” distribution is skew-normal (like Eastern Yellow Wagtail), in which case the August record could be a genuine outlier.
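For what it’s worth, the kurtosis comparison is easy to reproduce. One wrinkle worth flagging: `scipy` defaults to Fisher’s definition (where a normal distribution scores 0), so the Pearson convention used above (normal = 3.00) needs `fisher=False`. The values below are illustrative, not the real data:

```python
import numpy as np
from scipy.stats import kurtosis

# Illustrative day-of-year record dates (NOT the real data);
# the first value stands in for the early August outlier
days = np.array([241, 257, 264, 268, 274, 281, 288, 295, 302, 310, 318, 330])

# Pearson kurtosis: a perfectly normal sample would score 3.0
k_with = kurtosis(days, fisher=False)

# Dropping the earliest record changes the tail behaviour the measure sees
k_without = kurtosis(days[1:], fisher=False)

print(f"kurtosis with earliest record: {k_with:.2f}")
print(f"kurtosis without: {k_without:.2f}")
```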

This is why **this entire process is dependent on a sufficient sample size** – the smaller the number of records, the fewer reliable interpretations we can draw.

What was less contentious was the probability of a Steppe Grey Shrike on 10th September dropping when the August record was removed – the new probability calculation came out at **0.343%**, which may not sound like much, but it’s a big drop relative to the previous value (about a 29% drop).

**Swainson’s Thrush**

With double the records of either Steppe Grey Shrike or Eastern Yellow Wagtail, we may be able to get more of an idea of this method’s robustness by looking at the recent **Swainson’s Thrush** on St Kilda.

There are two important differences in the distribution of Swainson’s Thrush records compared to either of the other two:

- The range is much shorter – there are no records of Swainson’s Thrush later than the fourth week of October, whereas with both of the others there are plenty.
- The recent Swainson’s Thrush record is earlier than any of the prior autumn records by a good five days – there are no earlier outliers like with Steppe Grey Shrike.

Owing to the lack of the distinct central peak required by a normal distribution, the normality score is quite low (Shapiro-Wilk score = 0.985, p-value = 0.663). However, the fit of the distribution is very high indeed (two-sample Anderson-Darling score = 56.2, p = 0.983). The associated probability of a Swainson’s Thrush occurring on 10th September is **0.109%** – less likely than a Steppe Grey Shrike, but more likely than an Eastern Yellow Wagtail.

### Method Evaluation: The Problem with Model Fit Assessment

The high fit of a normal distribution model on a set of records with a relatively low normality score is very odd, and led me to realise a problem with my initial method.

The way the fit of the continuous models is assessed, both normal and skew-normal, is to take a random sample of numbers from the distribution spat out by the computer and compare this random sample to the actual data. However, in order for the comparison test (the two-sample Anderson-Darling test) to work, the random sample has to be the same length as the data. Drawing only 30-odd samples from a distribution is highly variable, and so will likely give unreliable model-fit results. So, how can we improve the reliability of the assessment of model fit?
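Concretely, a single run of that fit check might be sketched as below (Python/`scipy` assumed, with made-up dates; note that `scipy`’s `anderson_ksamp` caps its reported significance level to the range 0.001–0.25, so an implementation reporting exact p-values would behave differently):

```python
import warnings
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Illustrative day-of-year record dates (NOT the real data)
days = np.array([257, 264, 268, 274, 281, 288, 295, 302, 310, 318])

# Fit the model, then draw a synthetic sample the SAME length as the data
mu, sigma = stats.norm.fit(days)
synthetic = rng.normal(mu, sigma, size=len(days))

# Two-sample Anderson-Darling test: real data vs model-generated sample
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # scipy warns when p falls outside its table
    res = stats.anderson_ksamp([days, synthetic])

print(f"AD statistic = {res.statistic:.2f}, p = {res.significance_level:.3f}")
```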

The answer lies with a process called **Monte Carlo simulation**. In this instance, events with an uncertain outcome (i.e. the random sampling of 30 numbers from a given distribution) are simulated a large number of times (normally some power of ten) and the outcomes then mapped graphically.
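A minimal sketch of such a simulation (again with made-up dates, and using `scipy`’s two-sample Anderson-Darling test, whose reported significance level is capped between 0.001 and 0.25):

```python
import warnings
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative day-of-year record dates (NOT the real data)
days = np.array([257, 264, 268, 274, 281, 288, 295, 302, 310, 318])
mu, sigma = stats.norm.fit(days)

# Repeat the draw-and-compare step many times, keeping each p-value
n_sims = 1000
pvals = np.empty(n_sims)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # scipy warns when p falls outside its table
    for i in range(n_sims):
        synthetic = rng.normal(mu, sigma, size=len(days))
        res = stats.anderson_ksamp([days, synthetic])
        pvals[i] = res.significance_level

# The distribution of these p-values is what gets graphed and summarised
print(f"median simulated p = {np.median(pvals):.3f}")
```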

In the case of the Swainson’s Thrush, 1000 simulations were performed, and here are the p-values graphed:

We can choose to take a couple of values as an answer for the model’s fit, such as:

- **Kernel Density Estimate (KDE):** the maximum point of the density curve, found using **kernel density estimation**; broadly similar to a mode (0.902)
- **Expected value (EV):** essentially a **weighted average** (0.684)
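The two summaries can be computed from the simulated p-values along these lines (the beta-distributed stand-in values here are purely illustrative, not the real Monte Carlo output):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in for the Monte Carlo p-values (illustrative, skewed towards 1)
rng = np.random.default_rng(1)
pvals = rng.beta(8, 2, size=1000)

# KDE summary: evaluate the smoothed density on a grid, take its peak
kde = gaussian_kde(pvals)
grid = np.linspace(0, 1, 1001)
p_kde = grid[np.argmax(kde(grid))]

# Expected value: the mean of the simulated p-values
p_ev = pvals.mean()

print(f"p (KDE) = {p_kde:.3f}, p (EV) = {p_ev:.3f}")
```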

There are merits to both, as they measure slightly different things, so in **all future probability analyses by OrniStats, both values will be given**.

Consequently, here are the two values for the taxa we’ve previously examined this way:

- **Steppe Grey Shrike (with Gloucestershire 2022):** p-value (KDE) = 0.776, p-value (EV) = 0.595
- **Steppe Grey Shrike (without Gloucestershire 2022):** p-value (KDE) = 0.600, p-value (EV) = 0.557
- **Eastern Yellow Wagtail:** p-value (KDE) = 0.00105, p-value (EV) = 0.0623

Observant readers will note that the p-value for the Eastern Yellow Wagtail distribution is significantly worse than what was previously reported; looking at the Monte Carlo simulations, it seems that the 0.996 value we got initially was nothing more than a fluke! The expected p-value is **just** above the 0.05 threshold needed to consider the model useful, but we shall treat all predictions from that model with extreme caution – the small sample size will likely distort a lot of the modelling we can do.

**Conclusions**

While the shortcomings of this method are obvious at low sample sizes, the close match to intuition yielded by these models is encouraging, and shows there may be a path forward for them with more refinement and significantly more data! The latter point is especially important, as it seems that the estimates of model fit, both KDE and EV, scale significantly with sample size. An investigation into this relationship will be published at a later date.