Natural Language Processing with Groovy, OpenNLP, CoreNLP, Nlp4j, Datumbox, Smile, Spark NLP, DJL and TensorFlow

Author: Paul King
Published: 2022-08-07 07:34AM


Natural Language Processing is certainly a large and sometimes complex topic with many aspects. Some of those aspects deserve entire blogs in their own right. For this blog, we will briefly look at a few simple use cases illustrating where you might be able to use NLP technology in your own project.

Language Detection

Knowing what language some text represents can be a critical first step to subsequent processing. Let’s look at how to predict the language using a pre-built model and Apache OpenNLP. Here, ResourceHelper is a utility class used to download and cache the model. The first run may take a little while as it downloads the model. Subsequent runs should be fast. Here we are using a well-known model referenced in the OpenNLP documentation.

def helper = new ResourceHelper('https://dlcdn.apache.org/opennlp/models/langdetect/1.8.3/')
def model = new LanguageDetectorModel(helper.load('langdetect-183'))
def detector = new LanguageDetectorME(model)

[ spa: 'Bienvenido a Madrid', fra: 'Bienvenue à Paris',
  dan: 'Velkommen til København', bul: 'Добре дошли в София'
].each { k, v ->
    assert detector.predictLanguage(v).lang == k
}

The LanguageDetectorME class lets us predict the language. In general, the predictor may not be accurate on small samples of text, but it was good enough for our example. We’ve used the language code as the key in our map, and we check that against the predicted language.

A more complex scenario is training your own model. Let’s look at how to do that with Datumbox. Datumbox has a pre-trained models zoo but its language detection model didn’t seem to work well for the small snippets in the next example, so we’ll train our own model. First, we’ll define our datasets:

def datasets = [
    English: getClass().classLoader.getResource("training.language.en.txt").toURI(),
    French: getClass().classLoader.getResource("training.language.fr.txt").toURI(),
    German: getClass().classLoader.getResource("training.language.de.txt").toURI(),
    Spanish: getClass().classLoader.getResource("training.language.es.txt").toURI(),
    Indonesian: getClass().classLoader.getResource("training.language.id.txt").toURI()
]

The de training dataset comes from the Datumbox examples. The training datasets for the other languages are from Kaggle.

We set up the training parameters needed by our algorithm:

def trainingParams = new TextClassifier.TrainingParameters(
    numericalScalerTrainingParameters: null,
    featureSelectorTrainingParametersList: [new ChisquareSelect.TrainingParameters()],
    textExtractorParameters: new NgramsExtractor.Parameters(),
    modelerTrainingParameters: new MultinomialNaiveBayes.TrainingParameters()
)

We’ll use a Naïve Bayes model with Chisquare feature selection.

Next we create our algorithm, train it with our training dataset, and then validate it against the training dataset. We’d normally want to split the data into training and testing datasets, to give us a more accurate statistic of the accuracy of our model. But for simplicity, while still illustrating the API, we’ll train and validate with our entire dataset:

def config = Configuration.configuration
def classifier = MLBuilder.create(trainingParams, config)
classifier.fit(datasets)
def metrics = classifier.validate(datasets)
println "Classifier Accuracy (using training data): $metrics.accuracy"

When run, we see the following output:

Classifier Accuracy (using training data): 0.9975609756097561

Our test dataset will consist of some hard-coded illustrative phrases. Let’s use our model to predict the language for each phrase:

[   'Bienvenido a Madrid', 'Bienvenue à Paris', 'Welcome to London',
    'Willkommen in Berlin', 'Selamat Datang di Jakarta'
].each { txt ->
    def r = classifier.predict(txt)
    def predicted = r.YPredicted
    def probability = sprintf '%4.2f', r.YPredictedProbabilities.get(predicted)
    println "Classifying: '$txt',  Predicted: $predicted,  Probability: $probability"
}

When run, it has this output:

Classifying: 'Bienvenido a Madrid',  Predicted: Spanish,  Probability: 0.83
Classifying: 'Bienvenue à Paris',  Predicted: French,  Probability: 0.71
Classifying: 'Welcome to London',  Predicted: English,  Probability: 1.00
Classifying: 'Willkommen in Berlin',  Predicted: German,  Probability: 0.84
Classifying: 'Selamat Datang di Jakarta',  Predicted: Indonesian,  Probability: 1.00

Given these phrases are very short, it is nice to get them all correct, and the probabilities all seem reasonable for this scenario.

Parts of Speech

Parts of speech (POS) analysers examine each part of a sentence (the words and potentially punctuation) in terms of the role they play in a sentence. A typical analyser will assign or annotate words with their role like identifying nouns, verbs, adjectives and so forth. This can be a key early step for tools like the voice assistants from Amazon, Apple and Google.

We’ll start by looking at a perhaps lesser known library Nlp4j before looking at some others. In fact, there are multiple Nlp4j libraries. We’ll use the one from nlp4j.org, which seems to be the most active and recently updated.

This library uses the Stanford CoreNLP library under the covers for its English POS functionality. The library has the concept of documents, and annotators that work on documents. Once annotated, we can print out all of the discovered words and their annotations:

var doc = new DefaultDocument()
doc.putAttribute('text', 'I eat sushi with chopsticks.')
var ann = new StanfordPosAnnotator()
ann.setProperty('target', 'text')
ann.annotate(doc)
println doc.keywords.collect{  k -> "${k.facet - 'word.'}(${k.str})" }.join(' ')

When run, we see the following output:

PRP(I) VBP(eat) NN(sushi) IN(with) NNS(chopsticks) .(.)

The annotations, also known as tags or facets, for this example are as follows:

PRP

Personal pronoun

VBP

Present tense verb

NN

Noun, singular

IN

Preposition

NNS

Noun, plural

The documentation for the libraries we are using give a more complete list of such annotations.

A nice aspect of this library is support for other languages, in particular, Japanese. The code is very similar but uses a different annotator:

doc = new DefaultDocument()
doc.putAttribute('text', '私は学校に行きました。')
ann = new KuromojiAnnotator()
ann.setProperty('target', 'text')
ann.annotate(doc)
println doc.keywords.collect{ k -> "${k.facet}(${k.str})" }.join(' ')

When run, we see the following output:

名詞(私) 助詞(は) 名詞(学校) 助詞(に) 動詞(行き) 助動詞(まし) 助動詞(た) 記号(。)

Before progressing, we’ll highlight the result visualization capabilities of the GroovyConsole. This feature lets us write a small Groovy script which converts results to any swing component. In our case we’ll convert lists of annotated strings to a JLabel component containing HTML including colored annotation boxes. The details aren’t included here but can be found in the repo. We need to copy that file into our ~/.groovy folder and then enable script visualization as shown here:

How to enable visualization in the groovyconsole

Then we should see the following when running the script:

natural language processing in the groovyconsole with visualization

The visualization is purely optional but adds a nice touch. If using Groovy in notebook environments like Jupyter/BeakerX, there might be visualization tools in those environments too.

Let’s look at a larger example using the Smile library.

First, the sentences that we’ll examine:

def sentences = [
    'Paul has two sisters, Maree and Christine.',
    'No wise fish would go anywhere without a porpoise',
    'His bark was much worse than his bite',
    'Turn on the lights to the main bedroom',
    "Light 'em all up",
    'Make it dark downstairs'
]

A couple of those sentences might seem a little strange, but they are selected to show off quite a few of the different POS tags.

Smile has a tokenizer class which splits a sentence into words. It handles numerous cases like contractions and abbreviations ("e.g.", "'tis", "won’t"). Smile also has a POS class based on the hidden Markov model and a built-in model is used for that class. Here is our code using those classes:

def tokenizer = new SimpleTokenizer(true)
sentences.each {
    def tokens = Arrays.stream(tokenizer.split(it)).toArray(String[]::new)
    def tags = HMMPOSTagger.default.tag(tokens)*.toString()
    println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}

We run the tokenizer for each sentence. Each token is then displayed directly or with its tag if it has one.

Running the script gives this visualization:

Paul
NNP
has
VBZ
two
CD
sisters
NNS
,
Maree
NNP
and
CC
Christine
NNP
.
No
DT
wise
JJ
fish
NN
would
MD
go
VB
anywhere
RB
without
IN
a
DT
porpoise
NN
His
PRP$
bark
NN
was
VBD
much
RB
worse
JJR
than
IN
his
PRP$
bite
NN
Turn
VB
on
IN
the
DT
lights
NNS
to
TO
the
DT
main
JJ
bedroom
NN
Light
NNP
'em
PRP
all
RB
up
RB
Make
VB
it
PRP
dark
JJ
downstairs
NN

[Note: the scripts in the repo just print to stdout which is perfect when using the command-line or IDEs. The visualization in the GoovyConsole kicks in only for the actual result. So, if you are following along at home and wanting to use the GroovyConsole, you’d change the each to collect and remove the println, and you should be good for visualization.]

The OpenNLP code is very similar:

def tokenizer = SimpleTokenizer.INSTANCE
sentences.each {
    String[] tokens = tokenizer.tokenize(it)
    def posTagger = new POSTaggerME('en')
    String[] tags = posTagger.tag(tokens)
    println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}

OpenNLP allows you to supply your own POS model but downloads a default one if none is specified.

When the script is run, it has this visualization:

Paul
PROPN
has
VERB
two
NUM
sisters
NOUN
,
PUNCT
Maree
PROPN
and
CCONJ
Christine
PROPN
.
PUNCT
No
DET
wise
ADJ
fish
NOUN
would
AUX
go
VERB
anywhere
ADV
without
ADP
a
DET
porpoise
NOUN
His
PRON
bark
NOUN
was
AUX
much
ADV
worse
ADJ
than
ADP
his
PRON
bite
NOUN
Turn
VERB
on
ADP
the
DET
lights
NOUN
to
ADP
the
DET
main
ADJ
bedroom
NOUN
Light
NOUN
'
PUNCT
em
NOUN
all
ADV
up
ADP
Make
VERB
it
PRON
dark
ADJ
downstairs
NOUN

The observant reader may have noticed some slight differences in the tags used in this library. They are essentially the same but using slightly different names. This is something to be aware of when swapping between POS libraries or models. Make sure you look up the documentation for the library/model you are using to understand the available tag types.

Entity Detection

Named entity recognition (NER), seeks to identity and classify named entities in text. Categories of interest might be persons, organizations, locations dates, etc. It is another technology used in many fields of NLP.

We’ll start with our sentences to analyse:

String[] sentences = [
    "A commit by Daniel Sun on December 6, 2020 improved Groovy 4's language integrated query.",
    "A commit by Daniel on Sun., December 6, 2020 improved Groovy 4's language integrated query.",
    'The Groovy in Action book by Dierk Koenig et. al. is a bargain at $50, or indeed any price.',
    'The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.',
    'I saw Ms. May Smith waving to June Jones.',
    'The parcel was passed from May to June.',
    'The Mona Lisa by Leonardo da Vinci has been on display in the Louvre, Paris since 1797.'
]

We’ll use some well-known models, we’ll focus on the person, money, date, time, and location models:

def base = 'http://opennlp.sourceforge.net/models-1.5'
def modelNames = ['person', 'money', 'date', 'time', 'location']
def finders = modelNames.collect { model ->
    new NameFinderME(DownloadUtil.downloadModel(new URL("$base/en-ner-${model}.bin"), TokenNameFinderModel))
}

We’ll now tokenize our sentences:

def tokenizer = SimpleTokenizer.INSTANCE
sentences.each { sentence ->
    String[] tokens = tokenizer.tokenize(sentence)
    Span[] tokenSpans = tokenizer.tokenizePos(sentence)
    def entityText = [:]
    def entityPos = [:]
    finders.indices.each {fi ->
        // could be made smarter by looking at probabilities and overlapping spans
        Span[] spans = finders[fi].find(tokens)
        spans.each{span ->
            def se = span.start..<span.end
            def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end)
            entityPos[span.start] = pos
            entityText[span.start] = "$span.type(${sentence[pos]})"
        }
    }
    entityPos.keySet().sort().reverseEach {
        def pos = entityPos[it]
        def (from, to) = [pos.from, pos.to + 1]
        sentence = sentence[0..<from] + entityText[it] + sentence[to..-1]
    }
    println sentence
}

And when visualized, shows this:

A commit by
Daniel Sun
person
on
December 6, 2020
date
improved Groovy 4's language integrated query.
A commit by
Daniel
person
on Sun.,
December 6, 2020
date
improved Groovy 4's language integrated query.
The Groovy in Action book by
Dierk Koenig
person
et. al. is a bargain at
$50
money
, or indeed any price.
The conference wrapped up
yesterday
date
at
5:30 p.m.
time
in
Copenhagen
location
,
Denmark
location
.
I saw Ms.
May Smith
person
waving to
June Jones
person
.
The parcel was passed from
May to June
date
.
The Mona Lisa by
Leonardo da Vinci
person
has been on display in the Louvre,
Paris
location
since 1797
date
.

We can see here that most examples have been categorized as we might expect. We’d have to improve our model for it to do a better job on the "May to June" example.

Scaling Entity Detection

We can also run our named entity detection algorithms on platforms like Spark NLP which adds NLP functionality to Apache Spark. We’ll use glove_100d embeddings and the onto_100 NER model.

var assembler = new DocumentAssembler(inputCol: 'text', outputCol: 'document', cleanupMode: 'disabled')

var tokenizer = new Tokenizer(inputCols: ['document'] as String[], outputCol: 'token')

var embeddings = WordEmbeddingsModel.pretrained('glove_100d').tap {
    inputCols = ['document', 'token'] as String[]
    outputCol = 'embeddings'
}

var model = NerDLModel.pretrained('onto_100', 'en').tap {
    inputCols = ['document', 'token', 'embeddings'] as String[]
    outputCol ='ner'
}

var converter = new NerConverter(inputCols: ['document', 'token', 'ner'] as String[], outputCol: 'ner_chunk')

var pipeline = new Pipeline(stages: [assembler, tokenizer, embeddings, model, converter] as PipelineStage[])

var spark = SparkNLP.start(false, false, '16G', '', '', '')

var text = [
    "The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris."
]
var data = spark.createDataset(text, Encoders.STRING()).toDF('text')

var pipelineModel = pipeline.fit(data)

var transformed = pipelineModel.transform(data)
transformed.show()

use(SparkCategory) {
    transformed.collectAsList().each { row ->
        def res =  row.text
        def chunks = row.ner_chunk.reverseIterator()
        while (chunks.hasNext()) {
            def chunk = chunks.next()
            int begin = chunk.begin
            int end = chunk.end
            def entity = chunk.metadata.get('entity').get()
            res = res[0..<begin] + "$entity($chunk.result)" + res[end<..-1]
        }
        println res
    }
}

We won’t go into all of the details here. In summary, the code sets up a pipeline that transforms our input sentences, via a series of steps, into chunks, where each chunk corresponds to a detected entity. Each chunk has a start and ending position, and an associated tag type.

This may not seem like it is much different to our earlier examples, but if we had large volumes of data, and we were running in a large cluster, the work could be spread across worker nodes within the cluster.

Here we have used a utility SparkCategory class which makes accessing the information in Spark Row instances a little nicer in terms of Groovy shorthand syntax. We can use row.text instead of row.get(row.fieldIndex('text')). Here is the code for this utility class:

class SparkCategory {
    static get(Row r, String field) { r.get(r.fieldIndex(field)) }
}

If doing more than this simple example, the use of SparkCategory could be made implicit through various standard Groovy techniques.

When we run our script, we see the following output:

22/08/07 12:31:39 INFO SparkContext: Running Spark version 3.3.0
...
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
...
onto_100 download started this may take some time.
Approximate size to download 13.5 MB
...
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|          embeddings|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The Mona Lisa is ...|[{document, 0, 98...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 12, T...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
PERSON(The Mona Lisa) is a DATE(16th century) oil painting created by PERSON(Leonardo). It's held at the FAC(Louvre) in GPE(Paris).

The result has the following visualization:

The Mona Lisa
PERSON
is a
16th century
DATE
oil painting created by
Leonardo
PERSON
. It's held at the
Louvre
FAC
in
Paris
GPE
.

Here FAC is facility (buildings, airports, highways, bridges, etc.) and GPE is Geo-Political Entity (countries, cities, states, etc.).

Sentence Detection

Detecting sentences in text might seem a simple concept at first but there are numerous special cases.

Consider the following text:

def text = '''
The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951. It describes a method for
measuring the amount of protein (even as small as 0.2 γ, were γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf. Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings. He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.
'''

There are full stops at the end of each sentence (though in general, it could also be other punctuation like exclamation marks and question marks). There are also full stops and decimal points in abbreviations, URLs, decimal numbers and so forth. Sentence detection algorithms might have some special hard-coded cases, like "Dr.", "Ms.", or in an emoticon, and may also use some heuristics. In general, they might also be trained with examples like above.

Here is some code for OpenNLP for detecting sentences in the above:

def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')
def model = new SentenceModel(helper.load('en-sent'))
def detector = new SentenceDetectorME(model)
def sentences = detector.sentDetect(text)
assert text.count('.') == 28
assert sentences.size() == 4
println "Found ${sentences.size()} sentences:\n" + sentences.join('\n\n')

It has the following output:

Downloading en-sent
Found 4 sentences:
The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951.

It describes a method for
measuring the amount of protein (even as small as 0.2 γ, were γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf.

Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings.

He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.

We can see here, it handled all of the tricky cases in the example.

Relationship Extraction with Triples

The next step after detecting named entities and the various parts of speech of certain words is to explore relationships between them. This is often done in the form of subject-predicate-object triplets. In our earlier NER example, for the sentence "The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.", we found various date, time and location named entities.

We can extract triples using the MinIE library (which in turns uses the Standford CoreNLP library) with the following code:

def parser = CoreNLPUtils.StanfordDepNNParser()
sentences.each { sentence ->
    def minie = new MinIE(sentence, parser, MinIE.Mode.SAFE)

    println "\nInput sentence: $sentence"
    println '============================='
    println 'Extractions:'
    for (ap in minie.propositions) {
        println "\tTriple: $ap.tripleAsString"
        def attr = ap.attribution.attributionPhrase ? ap.attribution.toStringCompact() : 'NONE'
        println "\tFactuality: $ap.factualityAsString\tAttribution: $attr"
        println '\t----------'
    }
}

The output for the previously mentioned sentence is shown below:

Input sentence: The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.
=============================
Extractions:
        Triple: "conference"    "wrapped up yesterday at"       "5:30 p.m."
        Factuality: (+,CT)      Attribution: NONE
        ----------
        Triple: "conference"    "wrapped up yesterday in"       "Copenhagen"
        Factuality: (+,CT)      Attribution: NONE
        ----------
        Triple: "conference"    "wrapped up"    "yesterday"
        Factuality: (+,CT)      Attribution: NONE

We can now piece together the relationships between the earlier entities we detected.

There was also a problematic case amongst the earlier NER examples, "The parcel was passed from May to June.". Using the previous model, detected "May to June" as a date. Let’s explore that using CoreNLP’s triple extraction directly. We won’t show the source code here but CoreNLP supports simple and more powerful approaches to solving this problem. The output for the sentence in question using the more powerful technique is:

Sentence #7: The parcel was passed from May to June.
root(ROOT-0, passed-4)
det(parcel-2, The-1)
nsubj:pass(passed-4, parcel-2)
aux:pass(passed-4, was-3)
case(May-6, from-5)
obl:from(passed-4, May-6)
case(June-8, to-7)
obl:to(passed-4, June-8)
punct(passed-4, .-9)

Triples:
1.0 parcel was passed
1.0 parcel was passed to June
1.0 parcel was passed from May to June
1.0 parcel was passed from May

We can see that this has done a better job of piecing together what entities we have and their relationships.

Sentiment Analysis

Sentiment analysis is a NLP technique used to determine whether data is positive, negative, or neutral. Standford CoreNLP has default models it uses for this purpose:

def doc = new Document('''
StanfordNLP is fantastic!
Groovy is great fun!
Math can be hard!
''')
for (sent in doc.sentences()) {
    println "${sent.toString().padRight(40)} ${sent.sentiment()}"
}

Which has the following output:

[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.6 sec].
[main] INFO edu.stanford.nlp.sentiment.SentimentModel - Loading sentiment model edu/stanford/nlp/models/sentiment/sentiment.ser.gz ... done [0.1 sec].
StanfordNLP is fantastic!                POSITIVE
Groovy is great fun!                     VERY_POSITIVE
Math can be hard!                        NEUTRAL

We can also train our own. Let’s start with two datasets:

def datasets = [
    positive: getClass().classLoader.getResource("rt-polarity.pos").toURI(),
    negative: getClass().classLoader.getResource("rt-polarity.neg").toURI()
]

We’ll first use Datumbox which, as we saw earlier, requires training parameters for our algorithm:

def trainingParams = new TextClassifier.TrainingParameters(
    numericalScalerTrainingParameters: null,
    featureSelectorTrainingParametersList: [new ChisquareSelect.TrainingParameters()],
    textExtractorParameters: new NgramsExtractor.Parameters(),
    modelerTrainingParameters: new MultinomialNaiveBayes.TrainingParameters()
)

We now create our algorithm, train it with or training dataset, and for illustrative purposes validate against the training dataset:

def config = Configuration.configuration
TextClassifier classifier = MLBuilder.create(trainingParams, config)
classifier.fit(datasets)
def metrics = classifier.validate(datasets)
println "Classifier Accuracy (using training data): $metrics.accuracy"

The output is shown here:

[main] INFO com.datumbox.framework.core.common.dataobjects.Dataframe$Builder - Dataset Parsing positive class
[main] INFO com.datumbox.framework.core.common.dataobjects.Dataframe$Builder - Dataset Parsing negative class
...
Classifier Accuracy (using training data): 0.8275959103273615

Now we can test our model against several sentences:

['Datumbox is divine!', 'Groovy is great fun!', 'Math can be hard!'].each {
    def r = classifier.predict(it)
    def predicted = r.YPredicted
    def probability = sprintf '%4.2f', r.YPredictedProbabilities.get(predicted)
    println "Classifing: '$it',  Predicted: $predicted,  Probability: $probability"
}

Which has this output:

...
[main] INFO com.datumbox.framework.applications.nlp.TextClassifier - predict()
...
Classifing: 'Datumbox is divine!', Predicted: positive, Probability: 0.83
Classifing: 'Groovy is great fun!', Predicted: positive, Probability: 0.80
Classifing: 'Math can be hard!', Predicted: negative, Probability: 0.95

We can do the same thing but with OpenNLP. First, we collect our input data. OpenNLP is expecting it in a single dataset with tagged examples:

def trainingCollection = datasets.collect { k, v ->
    new File(v).readLines().collect{"$k $it".toString() }
}.sum()

Now, we’ll train two models. One uses naïve bayes, the other maxent. We train up both variants.

def variants = [
        Maxent    : new TrainingParameters(),
        NaiveBayes: new TrainingParameters((CUTOFF_PARAM): '0', (ALGORITHM_PARAM): NAIVE_BAYES_VALUE)
]
def models = [:]
variants.each{ key, trainingParams ->
    def trainingStream = new CollectionObjectStream(trainingCollection)
    def sampleStream = new DocumentSampleStream(trainingStream)
    println "\nTraining using $key"
    models[key] = DocumentCategorizerME.train('en', sampleStream, trainingParams, new DoccatFactory())
}

Now we run sentiment predictions on our sample sentences using both variants:

def w = sentences*.size().max()

variants.each { key, params ->
    def categorizer = new DocumentCategorizerME(models[key])
    println "\nAnalyzing using $key"
    sentences.each {
        def result = categorizer.categorize(it.split('[ !]'))
        def category = categorizer.getBestCategory(result)
        def prob = sprintf '%4.2f', result[categorizer.getIndex(category)]
        println "${it.padRight(w)} $category ($prob)"
    }
}

When we run this we get:

Training using Maxent …done.
…

Training using NaiveBayes …done.
…

Analyzing using Maxent
OpenNLP is fantastic! positive (0.64)
Groovy is great fun! positive (0.74)
Math can be hard! negative (0.61)

Analyzing using NaiveBayes
OpenNLP is fantastic! positive (0.72)
Groovy is great fun! positive (0.81)
Math can be hard! negative (0.72)

The models here appear to have lower probability levels compared to the model we trained for Datumbox. We could try tweaking the training parameters further if this was a problem. We’d probably also need a bigger testing set to convince ourselves of the relative merits of each model. Some models can be over-trained on small datasets and perform very well with data similar to their training datasets but perform much worse for other data.

This example is inspired from the UniversalSentenceEncoder example in the DJL examples module. It looks at using the universal sentence encoder model from TensorFlow Hub via the DeepJavaLibrary (DJL) api.

First we define a translator. The Translator interface allow us to specify pre- and post-processing functionality.

class MyTranslator implements NoBatchifyTranslator<String[], double[][]> {
    @Override
    NDList processInput(TranslatorContext ctx, String[] raw) {
        var factory = ctx.NDManager
        var inputs = new NDList(raw.collect(factory::create))
        new NDList(NDArrays.stack(inputs))
    }

    @Override
    double[][] processOutput(TranslatorContext ctx, NDList list) {
        long numOutputs = list.singletonOrThrow().shape.get(0)
        NDList result = []
        for (i in 0..<numOutputs) {
            result << list.singletonOrThrow().get(i)
        }
        result*.toFloatArray() as double[][]
    }
}

Here, we manually pack our input sentences into the required n-dimensional data types, and extract our output calculations into a 2D double array.

Next, we create our predict method by first defining the criteria for our prediction algorithm. We are going to use our translator, use the TensorFlow engine, use a predefined sentence encoder model from the TensorFlow Hub, and indicate that we are creating a text embedding application:

def predict(String[] inputs) {
    String modelUrl = "https://storage.googleapis.com/tfhub-modules/google/universal-sentence-encoder/4.tar.gz"

    Criteria<String[], double[][]> criteria =
        Criteria.builder()
            .optApplication(Application.NLP.TEXT_EMBEDDING)
            .setTypes(String[], double[][])
            .optModelUrls(modelUrl)
            .optTranslator(new MyTranslator())
            .optEngine("TensorFlow")
            .optProgress(new ProgressBar())
            .build()
    try (var model = criteria.loadModel()
         var predictor = model.newPredictor()) {
        predictor.predict(inputs)
    }
}

Next, let’s define our input strings:

String[] inputs = [
    "Cycling is low impact and great for cardio",
    "Swimming is low impact and good for fitness",
    "Palates is good for fitness and flexibility",
    "Weights are good for strength and fitness",
    "Orchids can be tricky to grow",
    "Sunflowers are fun to grow",
    "Radishes are easy to grow",
    "The taste of radishes grows on you after a while",
]
var k = inputs.size()

Now, we’ll use our predictor method to calculate the embeddings for each sentence. We’ll print out the embeddings and also calculate the dot product of the embeddings. The dot product (the same as the inner product for this case) reveals how related the sentences are.

var embeddings = predict(inputs)

var z = new double[k][k]
for (i in 0..<k) {
    println "Embedding for: ${inputs[i]}\n${embeddings[i]}"
    for (j in 0..<k) {
        z[i][j] = dot(embeddings[i], embeddings[j])
    }
}

Finally, we’ll use the Heatmap class from Smile to present a nice display highlighting what the data reveals:

new Heatmap(inputs, inputs, z, Palette.heat(20).reverse()).canvas().with {
    title = 'Semantic textual similarity'
    setAxisLabels('', '')
    window()
}

The output shows us the embeddings:

Loading:     100% |========================================|
2022-08-07 17:10:43.212697: ... This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
...
2022-08-07 17:10:52.589396: ... SavedModel load for tags { serve }; Status: success: OK...
...
Embedding for: Cycling is low impact and great for cardio
[-0.02865048497915268, 0.02069241739809513, 0.010843578726053238, -0.04450441896915436, ...]
...
Embedding for: The taste of radishes grows on you after a while
[0.015841705724596977, -0.03129228577017784, 0.01183396577835083, 0.022753292694687843, ...]

The embeddings are an indication of similarity. Two sentences with similar meaning typically have similar embeddings.

The displayed graphic is shown below:

Heatmap plot of sentence encodings

This graphic shows that our first four sentences are somewhat related, as are the last four sentences, but that there is minimal relationship between those two groups.

Conclusion

We have look at a range of NLP examples using various NLP libraries. Hopefully you can see some cases where you could use additional NLP technologies in some of your own applications.