A first look at Underdog

Author: Paul King

Published: 2025-04-17 10:30PM


Let’s explore Whisky profiles using Underdog!

Underdog is a relatively new data science library. It has many Groovy-powered features that deliver a very expressive developer experience. Let’s use it to explore whisky profiles.

Underdog sits on top of some well-known data science libraries like Smile, Tablesaw, and Apache ECharts. If you have used any of them, you’ll recognise parts of the functionality.

First, we’ll load our CSV file:

// Locate whiskey.csv on the classpath
def file = new File(getClass().classLoader.getResource('whiskey.csv').file)
// Read the CSV into a dataframe and drop the RowID column
def df = Underdog.df().read_csv(file.path).drop('RowID')

Let’s look at the shape and schema of the data:

println df.shape()
println df.schema()

It gives this output:

86 rows X 13 cols
        Structure of whiskey.csv
 Index  |  Column Name  |  Column Type  |
-----------------------------------------
     0  |   Distillery  |       STRING  |
     1  |         Body  |      INTEGER  |
     2  |    Sweetness  |      INTEGER  |
     3  |        Smoky  |      INTEGER  |
     4  |    Medicinal  |      INTEGER  |
     5  |      Tobacco  |      INTEGER  |
     6  |        Honey  |      INTEGER  |
     7  |        Spicy  |      INTEGER  |
     8  |        Winey  |      INTEGER  |
     9  |        Nutty  |      INTEGER  |
    10  |        Malty  |      INTEGER  |
    11  |       Fruity  |      INTEGER  |
    12  |       Floral  |      INTEGER  |

Let’s look at a correlation matrix plot of the data:

def plot = Underdog.plots()
def features = df.columns - 'Distillery'
plot.correlationMatrix(df[features]).show()

This produces the following plot:

correlation plot
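
Each cell of a correlation matrix holds a correlation coefficient between two feature columns, typically the Pearson coefficient. As a rough illustration of that arithmetic, here is a minimal sketch in plain Groovy. The `pearson` helper and the sample scores are made up for this example; they are not part of Underdog and not taken from the dataset, and Underdog may compute its matrix differently.

```groovy
// Pearson correlation between two equal-length lists of numbers.
// Hypothetical helper for illustration only; not an Underdog API.
double pearson(List<Number> xs, List<Number> ys) {
    assert xs.size() == ys.size() && xs.size() > 1
    double meanX = xs.sum() / xs.size()
    double meanY = ys.sum() / ys.size()
    double cov = 0, varX = 0, varY = 0
    for (int i = 0; i < xs.size(); i++) {
        double dx = xs[i] - meanX
        double dy = ys[i] - meanY
        cov += dx * dy      // accumulates the co-variation of the two columns
        varX += dx * dx     // accumulates the variation of each column
        varY += dy * dy
    }
    cov / Math.sqrt(varX * varY)
}

// Made-up scores, not values from whiskey.csv
def smoky = [0, 1, 2, 3, 4]
def medicinal = [0, 0, 1, 3, 4]
println pearson(smoky, medicinal)
```

A value near 1 means the two features tend to rise together, a value near -1 means one rises as the other falls, and a value near 0 means there is little linear relationship.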

We can also look at the data for any individual distillery using a radar plot. Let’s look at it for row 0:

def data = df[features] as double[][]
plot.radar(
    features,
    [4] * features.size(),
    data[0].toList(),
    df['Distillery'][0]
).show()

This produces the following plot:

radar plot
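
The expression `[4] * features.size()` in the radar call relies on Groovy’s list multiplication, which repeats a list a given number of times. Presumably it supplies a maximum of 4 for every axis of the radar; that reading is an assumption based on the call, not something confirmed here. A small standalone sketch of the idiom, using a shortened feature list:

```groovy
// Shortened feature list, for illustration only
def features = ['Body', 'Sweetness', 'Smoky']

// Multiplying a list by n repeats its contents n times
def maxima = [4] * features.size()

assert maxima == [4, 4, 4]
println maxima
```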

Let’s now cluster the distilleries using k-means:

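def ml = Underdog.ml()
// Group the distilleries into 3 clusters based on their feature scores
def clusters = ml.clustering.kMeans(data, nClusters: 3)
// Store each row's cluster index as a new column
df['Cluster'] = clusters.toList()

// Print the distilleries belonging to each cluster
println 'Clusters'
for (int i in clusters.toSet()) {
    println "$i:${df[df['Cluster'] == i]['Distillery'].join(', ')}"
}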

It gives the following output:

Clusters
0:Aberfeldy, Aberlour, Auchroisk, Balmenach, Belvenie, BenNevis, Benrinnes, Benromach, BlairAthol, Dailuaine, Dalmore, Edradour, GlenOrd, Glendronach, Glendullan, Glenfarclas, Glenlivet, Glenrothes, Glenturret, Knochando, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla
1:Ardbeg, Balblair, Bowmore, Bruichladdich, Caol Ila, Clynelish, GlenGarioch, GlenScotia, Highland Park, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, Springbank, Talisker, Teaninich
2:AnCnoc, Ardmore, ArranIsleOf, Auchentoshan, Aultmore, Benriach, Bladnoch, Bunnahabhain, Cardhu, Craigallechie, Craigganmore, Dalwhinnie, Deanston, Dufftown, GlenDeveronMacduff, GlenElgin, GlenGrant, GlenKeith, GlenMoray, GlenSpey, Glenallachie, Glenfiddich, Glengoyne, Glenkinchie, Glenlossie, Glenmorangie, Inchgower, Linkwood, Loch Lomond, Mannochmore, Miltonduff, OldFettercairn, RoyalBrackla, Scapa, Speyburn, Speyside, Strathmill, Tamdhu, Tamnavulin, Tobermory, Tomatin, Tomintoul, Tomore, Tullibardine
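
For readers new to k-means, the idea is to alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The following is a minimal sketch of that loop in plain Groovy, not the implementation Underdog uses. The helper names `dist2` and `simpleKMeans` are hypothetical, the sample points are made up, and the first `k` points serve as starting centroids purely to keep the sketch deterministic.

```groovy
// Squared Euclidean distance between two points.
double dist2(List<Number> a, List<Number> b) {
    double sum = 0
    for (int i = 0; i < a.size(); i++) {
        double diff = a[i] - b[i]
        sum += diff * diff
    }
    sum
}

// Minimal k-means sketch (hypothetical helper, not an Underdog API).
// Returns one cluster index per input point.
List<Integer> simpleKMeans(List<List<Number>> points, int k, int maxIter = 100) {
    // Naive seeding: start from the first k points
    def centroids = points.take(k).collect { p -> p.collect { x -> x as double } }
    List<Integer> labels = []
    for (int iter = 0; iter < maxIter; iter++) {
        // Assignment step: index of the nearest centroid for each point
        def newLabels = points.collect { p ->
            (0..<k).min { j -> dist2(p, centroids[j]) }
        }
        if (newLabels == labels) break   // no change, so we have converged
        labels = newLabels
        // Update step: move each centroid to the mean of its members
        for (int c = 0; c < k; c++) {
            def members = (0..<points.size())
                .findAll { idx -> labels[idx] == c }
                .collect { idx -> points[idx] }
            if (members) {
                centroids[c] = (0..<points[0].size()).collect { dim ->
                    (members*.getAt(dim).sum() / members.size()) as double
                }
            }
        }
    }
    labels
}

// Made-up 2D points forming two obvious groups
def points = [[1, 1], [1, 2], [9, 9], [9, 8]]
def labels = simpleKMeans(points, 2)
println labels
```

The result of k-means depends on where the centroids start, so cluster numbering and membership can differ between runs or libraries when a different seeding strategy is used.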

Finally, let’s project our data onto two dimensions using principal component analysis (PCA) and plot the result as a scatter plot, coloured by cluster:

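// Fit a PCA that keeps 2 components, then project the data onto them
def pca = ml.features.pca(data, 2)
def projected = pca.apply(data)

// Store the two projected coordinates as new columns
df['X'] = projected*.getAt(0)
df['Y'] = projected*.getAt(1)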

plot.scatter(
    df['X'],
    df['Y'],
    df['Cluster'],
    'Whiskey Clusters'
).show()

The output looks like this:

scatter plot
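
To give some intuition for what PCA does, the sketch below finds only the first principal component, the direction along which the data varies most, using power iteration on the covariance matrix. It is a simplified illustration in plain Groovy under stated assumptions, not how Underdog or its underlying libraries implement PCA. The `firstComponent` helper and the sample rows are hypothetical.

```groovy
// First principal component via power iteration.
// Hypothetical helper for illustration; not an Underdog API.
// Caveat: the all-ones starting vector fails if it happens to be
// orthogonal to the principal direction.
List<Double> firstComponent(List<List<Number>> rows, int iterations = 200) {
    int n = rows.size()
    int d = rows[0].size()
    // Centre each column on its mean
    def means = (0..<d).collect { j -> (rows*.getAt(j).sum() / n) as double }
    def centred = rows.collect { r -> (0..<d).collect { j -> (r[j] - means[j]) as double } }
    // Sample covariance matrix (d x d)
    def cov = (0..<d).collect { a ->
        (0..<d).collect { b ->
            (centred.collect { it[a] * it[b] }.sum() / (n - 1)) as double
        }
    }
    // Power iteration: repeatedly multiply by cov and renormalise
    List<Double> v = (0..<d).collect { 1.0d }
    iterations.times {
        def w = cov.collect { row -> (0..<d).collect { j -> row[j] * v[j] }.sum() as double }
        double norm = Math.sqrt(w.collect { it * it }.sum() as double)
        v = w.collect { (it / norm) as double }
    }
    v
}

// Made-up rows lying on the line y = 2x
def rows = [[1, 2], [2, 4], [3, 6]]
def pc1 = firstComponent(rows)
println pc1
```

Projecting each centred row onto this direction (a dot product) gives its coordinate along the first component; a two-dimensional projection repeats the idea with a second direction orthogonal to the first.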

Further information

Conclusion

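We have taken a first look at Underdog, using it to load a CSV file, inspect its shape and schema, plot a correlation matrix and a radar chart, cluster the distilleries with k-means, and project the data onto two dimensions with PCA.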

Update history

17/Apr/2025: Initial version