Created by Pablo Duboue / @pabloduboue
Natural Language Generation
The other NLP
Decisions, decisions, decisions
Continuous enrichment
My thesis (defended ten years ago!) was on Machine Learning for NLG. I worked on two full NLG systems:
Even though I gravitated towards ML and IR, half my papers are in NLG and I'm coming back to the field. (I recently ran for a position on the board of SIGGEN.)
Adapted from https://code.google.com/p/simplenlg/wiki/AppendixA
content determination decides what information will appear in the output text. This depends on what your goal is, who the audience is, what sort of input information is available to you in the first place and other constraints such as allowed text length.
document structuring decides how chunks of content should be grouped in a document, how to relate these groups to each other and in what order they should appear. For instance, when describing last month’s weather, you might talk first about temperature, then rainfall. Or you might start off generally talking about the weather and then provide specific weather events that occurred during the month.
OpenSchema performs both tasks.
lexicalization decides what specific words should be used to express the content. For example, the actual nouns, verbs, adjectives and adverbs to appear in the text are chosen from a lexicon. Particular syntactic structures are chosen as well. For example you can say ‘the car owned by Mary’ or you might prefer the phrase ‘Mary’s car’.
referring expression generation decides which expressions should be used to refer to entities (both concrete and abstract). The same entity can be referred to in many ways. For example, March of last year can be referred to as 'last March', 'the previous March', or simply 'March', depending on context.
aggregation decides how the structures created by document planning should be mapped onto linguistic structures such as sentences and paragraphs. For instance, two ideas can be expressed in two sentences or in one:
The month was cooler than average. The month was drier than average.
vs.
The month was cooler and drier than average.
linguistic realisation uses rules of grammar (about morphology and syntax) to convert abstract representations of sentences into actual text.
structure realisation converts abstract structures such as paragraphs and sentences into mark-up symbols which are used to display the text.
SimpleNLG performs the last part, namely surface realisation.
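To make realisation concrete, here is a minimal sketch that produces the aggregated weather sentence above, using the SimpleNLG 4.x API introduced later in these slides; the class name and setup are illustrative assumptions, not code from any of the systems discussed here.

import simplenlg.features.Feature;
import simplenlg.features.Tense;
import simplenlg.framework.NLGFactory;
import simplenlg.lexicon.Lexicon;
import simplenlg.phrasespec.SPhraseSpec;
import simplenlg.realiser.english.Realiser;

// Sketch: surface realisation of the aggregated weather sentence.
public class WeatherSentence {
    public static void main(String[] args) {
        Lexicon lexicon = Lexicon.getDefaultLexicon();
        NLGFactory nlgFactory = new NLGFactory(lexicon);
        Realiser realiser = new Realiser(lexicon);

        // abstract clause: subject, copula, complement
        SPhraseSpec s = nlgFactory.createClause("the month", "be",
                "cooler and drier than average");
        s.setFeature(Feature.TENSE, Tense.PAST);

        // should print something like: The month was cooler and drier than average.
        System.out.println(realiser.realiseSentence(s));
    }
}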
PostGraphe, a system developed as part of Dr. Fasciano's thesis at UdeM.
Basic intentions covered in PostGraphe:
OpenSchema takes care of selecting what to say and structuring the selected information. This is achieved by executing an augmented transition network (ATN), which, for the purposes of this software package, is a grammar for a regular language (think regular expressions) over discourse predicates that are also defined as part of the schema itself.
RDF is a graph description notation used in the Semantic Web.
The output of the OpenSchemaPlanner is a DocumentPlan, which contains a list of paragraphs, each of which is a list of aggregation segments. Finally, an aggregation segment is a list of clauses, where each clause is a hierarchical attribute-value matrix, represented as a Java Map from String to Object.
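As a rough illustration (using plain Java collections rather than the actual DocumentPlan accessors, whose names may differ), the nested structure can be pictured and walked like this:

import java.util.List;
import java.util.Map;

// Sketch of the nested document plan structure with plain collections:
// paragraphs -> aggregation segments -> clauses (attribute-value Maps).
public class DocumentPlanWalker {
    public static void walk(List<List<List<Map<String, Object>>>> documentPlan) {
        for (List<List<Map<String, Object>>> paragraph : documentPlan) {
            for (List<Map<String, Object>> aggregationSegment : paragraph) {
                // clauses in the same segment are candidates to be merged
                // into a single sentence by the aggregation stage
                for (Map<String, Object> clause : aggregationSegment) {
                    System.out.println(clause); // e.g. {pred=attributive, pred0=..., pred1=...}
                }
            }
        }
    }
}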
schema biography(self: c-person)  ; name of the schema 'biography'
                                  ; self is the person the bio is about, required

; first paragraph, the person plus
pred-person(person|self)
optional pred-birth(person|self)

star ; zero or more aliases
    pred-alias(person|self)

star ; zero or more parents
    choice
        pred-father(self|self,parent|parent)
        pred-mother(self|self,parent|parent)
    star
        pred-person(person|parent)

star ; zero or more education
    pred-education(person|self)

paragraph-boundary
predicate pred-person
    variables
        req def person : c-person
        occupation : c-occupation
    properties ; properties that the variables have to hold
        occupation == person.occupation
    output
        ; use this for template generation
        template "{{name-first}} {{name-last}} is a {{occupation}}. "
        name-first person.name.first-name
        name-last person.name.last-name
        occupation occupation.#TYPE
        ; use this preds for SimpleNLG
        pred attributive
        pred0 person
        pred1 occupation
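The 'template' part of the output section can be filled directly from a clause's attribute-value map. A hypothetical helper (not part of OpenSchema) that substitutes the {{...}} placeholders might look like this:

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: fill {{placeholder}} slots in a template from a
// clause's attribute-value map. Illustrative only, not OpenSchema code.
public class TemplateFiller {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{([^}]+)\\}\\}");

    public static String fill(String template, Map<String, Object> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            Object value = values.get(m.group(1)); // e.g. "name-first" -> "Dilma"
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(value)));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}

For instance, filling the template above with name-first = "Dilma", name-last = "Rousseff" and occupation = "economist" (values borrowed from the DBpedia examples later in the deck) would yield "Dilma Rousseff is a economist. ", which also shows the limits of raw templates (no a/an agreement) and one reason the predicate can alternatively emit SimpleNLG preds.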
Tutorial, adapted from https://code.google.com/p/simplenlg/wiki/Tutorial
Lexicon lexicon = Lexicon.getDefaultLexicon();
NLGFactory nlgFactory = new NLGFactory(lexicon);
Realiser realiser = new Realiser(lexicon);
NLGElement s1 =
nlgFactory.createSentence("my dog is happy");
String output = realiser.realiseSentence(s1);
System.out.println(output);
My dog is happy.
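For reference, here is a self-contained version of this first example, assuming the SimpleNLG 4.x package layout (simplenlg.*); adjust the imports if your version differs.

import simplenlg.framework.NLGElement;
import simplenlg.framework.NLGFactory;
import simplenlg.lexicon.Lexicon;
import simplenlg.realiser.english.Realiser;

public class HelloSimpleNLG {
    public static void main(String[] args) {
        Lexicon lexicon = Lexicon.getDefaultLexicon();
        NLGFactory nlgFactory = new NLGFactory(lexicon);
        Realiser realiser = new Realiser(lexicon);

        // createSentence wraps the canned string; the realiser capitalises
        // it and adds the final period.
        NLGElement s1 = nlgFactory.createSentence("my dog is happy");
        System.out.println(realiser.realiseSentence(s1)); // My dog is happy.
    }
}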
SPhraseSpec p = nlgFactory.createClause();
p.setSubject("Mary");
p.setVerb("chase");
p.setObject("the monkey");
String output = realiser.realiseSentence(p);
System.out.println(output);
Mary chases the monkey.
NPPhraseSpec subject1 =
nlgFactory.createNounPhrase("Mary");
NPPhraseSpec subject2 =
nlgFactory.createNounPhrase("your", "giraffe");
CoordinatedPhraseElement subj =
nlgFactory.createCoordinatedPhrase(subject1, subject2);
p.setSubject(subj);
Mary and your giraffe chase the monkey.
NPPhraseSpec object1 =
nlgFactory.createNounPhrase("the monkey");
NPPhraseSpec object2 =
nlgFactory.createNounPhrase("George");
CoordinatedPhraseElement obj =
nlgFactory.createCoordinatedPhrase(object1, object2);
obj.addCoordinate("Martha");
p.setObject(obj);
obj.setFeature(Feature.CONJUNCTION, "or");
Mary and your giraffe chase the monkey, George or Martha.
p.setFeature(Feature.TENSE, Tense.FUTURE);
p.setFeature(Feature.NEGATED, true);
Mary will not chase the monkey.
Going from the OpenSchema predicates to SimpleNLG classes:
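One possible bridge is sketched below, under the assumption that an 'attributive' clause maps pred0 to the subject and pred1 to a copular complement; this is not the actual OpenSchema-to-SimpleNLG code, just an illustration of the idea.

import java.util.Map;

import simplenlg.framework.NLGFactory;
import simplenlg.lexicon.Lexicon;
import simplenlg.phrasespec.SPhraseSpec;
import simplenlg.realiser.english.Realiser;

// Sketch: turn an 'attributive' clause map into a copular SPhraseSpec.
public class ClauseToSimpleNLG {
    public static String realise(Map<String, Object> clause) {
        Lexicon lexicon = Lexicon.getDefaultLexicon();
        NLGFactory nlgFactory = new NLGFactory(lexicon);
        Realiser realiser = new Realiser(lexicon);

        SPhraseSpec p = nlgFactory.createClause();
        if ("attributive".equals(clause.get("pred"))) {
            p.setSubject(String.valueOf(clause.get("pred0"))); // e.g. the person
            p.setVerb("be");                                   // attributive -> copula
            p.setObject(String.valueOf(clause.get("pred1")));  // e.g. the occupation
        }
        return realiser.realiseSentence(p);
    }
}

Realising a clause such as {pred=attributive, pred0="Dilma Rousseff", pred1="an economist"} would then produce something like "Dilma Rousseff is an economist."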
https://github.com/DrDub/Alusivo
{ Dilma_Rousseff Antonio_Palocci Celso_Amorim Hu_Jintao }
http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/ontology/OfficeHolder
http://dbpedia.org/ontology/profession http://dbpedia.org/resource/Economist
Input: RDF Triples
Output: distinguishing properties
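As a toy illustration of the task (Alusivo implements the actual published REG algorithms; the following is just a naive greedy selection over property maps):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Naive sketch: greedily keep property-value pairs of the referent that rule
// out the other entities, until the referent is uniquely identified.
public class NaiveReferringExpression {
    public static List<Map.Entry<String, String>> distinguish(
            Map<String, String> referent,            // property -> value for the target
            List<Map<String, String>> confusors) {   // same, for the other entities
        List<Map.Entry<String, String>> selected = new ArrayList<>();
        Set<Integer> remaining = new HashSet<>();
        for (int i = 0; i < confusors.size(); i++) remaining.add(i);

        for (Map.Entry<String, String> property : referent.entrySet()) {
            if (remaining.isEmpty()) break;
            Set<Integer> ruledOut = new HashSet<>();
            for (int i : remaining)
                if (!property.getValue().equals(confusors.get(i).get(property.getKey())))
                    ruledOut.add(i);
            if (!ruledOut.isEmpty()) {    // keep properties that rule someone out
                selected.add(property);
                remaining.removeAll(ruledOut);
            }
        }
        return selected; // empty 'remaining' means the referent is uniquely picked out
    }
}

For example, when referring to Dilma_Rousseff against the other three office holders above, a property from the triples shown earlier is kept only if it rules out at least one of the remaining distractors.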
In Use
Evaluating Robustness of Referring Expression Generation Algorithms
We took DBpedia 2011/1, generated referring expressions, and evaluated them on DBpedia 2014/5
Most algorithms behaved correctly
Sampling to obtain differences beyond 3 years
Dimensionality reduction
What is missing?
Capturing generalizations
Do you want to learn more?
I have here the material for my NLG class from 2011.
(A graduate-level, semester-long course.)
Keep in touch with Pablo at: