Showcase Project: Working with Lex Rules




Introduction

This project showcases some of the advanced features you can benefit from when you use lex rules (lexicalization rules).

Lex rules enable people with some knowledge of grammar to design narratives above the sentence level. This means that, with lex rules, you will see benefits such as the following:

  • similar sentences being joined together (known as sentence aggregation and subject elision)
  • names of people, places, and concepts being handled intelligently according to their context (known as named entity referencing)
  • technical terms being defined once only

Lex rules can reference each other by ID or by group. This, together with conditional selection, OptionSets, and CaseSets, allows you to build complex language variation into your narratives.

Lex rules also have many within-sentence grammatical features that you can use to enhance your narratives. These include: subject-verb agreement; handling of singulars and plurals; inflection of noun, verb and adjectival phrases; and handling of numerical phrases.

Lex rules are defined in special grammar-like tags (e.g. <NP></NP>) in XML files (click Lex Rules in the left navigation ribbon to view the files). The rules accept JSON objects and other variables as their input data.

Note: You can only use lex rules in JSON projects.

The realise function sends data to lex rules and generates text from them, for example:
realise((WholeJSON.obj1,"lexRuleClass1"), (WholeJSON.obj2,"lexRuleClass2")) sends the data WholeJSON.obj1 to the rule denoted by "lexRuleClass1" and WholeJSON.obj2 to the rule denoted by "lexRuleClass2".

This documentation assumes that you have a good, working knowledge of ATL and Studio.

Scripts in this project

The Main script calls the user-defined function describeCamera, which generates a subsection for each camera. It also calls the Summary script, which generates a single summary section comparing the cameras.

Key lex-rule features and where to find them in this project

This section is organized as follows:

Varying references to concepts using named entity referencing
Language variation using a CaseSet
Once-only explanations of acronyms and technical terms using OptionSets
Complex conditional variation
Variation in sentence structures
Subject elision
Syntactic aggregation
Date formatting
Country adjectives
Subject-verb agreement
Modifier-count noun agreement

Varying References to Concepts Using Named Entity Referencing
Named entity references allow you to vary references to an entity (person, place, or concept) throughout your narrative. If it is the first mention of the entity, then normally the full name is used. In subsequent mentions a shorter name or a pronoun (e.g. he, she, him, her, they, it) is used.

Note: Use of a shorter name or pronoun also depends on the distance from the last mention of the entity and whether a pronoun such as ‘it’ would be ambiguous in the context.

See the lex rule with classes="resolutionClass" in the file resolutionLexRules.xml. The noun phrase (or ‘NP’) component in the subject has reference="{{msg.name}}"; this means it is a named entity reference using the value of the key "name" in the input JSON object. The "name" key-value pair should use the special configuration shown in our data (see Data view): the keys "fullName" and "shortName", whose values are the strings used in referencing, and the key "__class", whose value must be a unique identifier. The name generated here uses the possessive form (specified by the feature possessive="true"), so, for the first camera object in our data, the result would be “the Nikon D850’s”, “the D850’s”, or “its”, depending on the context.
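The first-mention and later-mention behaviour described above can be illustrated with a short Python sketch. It is a deliberately simplified model and not Studio's algorithm: it only switches from the full name to the short name, and it ignores pronouns, distance from the last mention, and ambiguity. The key names follow the data shape described above; the "__class" value, the name strings, and the function names are illustrative assumptions.

```python
# Simplified model of named entity referencing (not Studio's actual algorithm).
# Keys fullName, shortName, and __class follow the data shape described above;
# the values and the function names are hypothetical.

def make_referrer():
    """Return a function that picks a name form based on earlier mentions."""
    mentioned = set()  # "__class" identifiers already used in the narrative

    def refer(name, possessive=False):
        key = name["__class"]
        # First mention uses the full name; later mentions use the short name.
        text = name["shortName"] if key in mentioned else name["fullName"]
        mentioned.add(key)
        return text + "’s" if possessive else text

    return refer


camera_name = {
    "fullName": "the Nikon D850",
    "shortName": "the D850",
    "__class": "camera-d850",  # hypothetical unique identifier
}

refer = make_referrer()
first = refer(camera_name, possessive=True)   # full name on first mention
second = refer(camera_name, possessive=True)  # short name afterwards
print(first)
print(second)
```

Keying the mention history on "__class" rather than on the name strings is what makes the unique identifier necessary: two entities with similar names are still tracked separately.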

Other named entity references to the camera name can be viewed in the following:

  • File resolutionLexRules.xml, lex rule with id="resolution"
  • File priceLexRules.xml, lex rules with id="priceSentence", id="threeThousand", and id="underThousand"
  • File autoFocusLexRules.xml, lex rules with id="autofocus1", id="autofocus2", id="autofocus3", id="autofocus4", and id="autofocus5".
  • File continuousModelLexRules.xml, lex rule with id="s1".

If you preview the output from the Main script, you will see the result of named entity referencing in action, where long names, short names, and pronouns have been generated in appropriate places and in the appropriate forms. This is especially evident where the ordering of sentences in the description of each camera changes according to three ATL conditionals (see our describeCamera user-defined function). When the price sentence comes first, it uses the full name; when it comes last, it uses the short name or a pronoun. The next two sentences (about resolution and auto-focus) can also vary in their ordering. In the Summary section, many cameras are mentioned, giving rise to possible ambiguities. Below is a fragment of the generated output showing the named entity references in context:

The Pentax K-1 Mark II
Announced: February 2018.
Price: £1,799
Megapixels: 36.4MP
The Pentax K-1 Mark II’s full-frame 36.4MP sensor gives you a lot of megapixels for the money. It features a fast focusing (even at the corners) 33-point AF, 25 cross-type system.
Burst shooting at 6 fps maximum is respectable. This enthusiast/professional-level Japanese camera has a 3 inch three-way-tilt screen with 1,037,000 dots per inch and captures 1K video.
At £1,799.00, it is at a price within the reach of enthusiasts.

The Nikon D3400

Announced: August 2017.
Price: £338
Megapixels: 24.2MP
At £338.00, the Nikon D3400 is cheaper than many models.
The 11-point AF, 1 cross-type system means it’s harder to focus on off-centre subjects. Its APS-C 24.2MP sensor doesn’t have the magnificent resolution of other models.
In continuous shooting mode, the D3400 shoots a modest 5 fps maximum. This beginner-level Japanese camera has a 3 inch screen with 921,000 dots per inch and captures 1K video.

Summary
The camera with the best resolution is the Nikon D850 with 45.4MP and the one with the worst is the Nikon D500 with 20.9MP.
The Nikon D500 has the largest number of autofocus points (153) while the Nikon D3400 has only 11 AF points, the smallest number.
In the high price range, the EOS 5D costs over £3,000. Among the lower-priced options, the D3400 and the Canon EOS Rebel T7i are valued at under £1,000.

Language Variation Using a CaseSet
A CaseSet is similar to a switch statement in Java. It chooses a case according to a value in the input data. We have used a CaseSet in file resolutionLexRules.xml, in the lex rule with id="resolution". The subject of the lex rule is a noun phrase component (NP) with a reference-by-ID to a CaseSet (rule="#frameFormats"). Scroll down in resolutionLexRules.xml until you see a rule with id="frameFormats". This rule is a CaseSet that chooses a case according to the value of the data key "sensor_format". The choices are “full-frame”, “APS-C”, or any other value (the default case). In the default case, an empty string is returned. The other two cases have reference-by-ID calls to OptionSets explained below.
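As a rough illustration of the selection logic just described, the following Python sketch models the "frameFormats" CaseSet as a lookup with a default. It is not the lex rule itself: the function name is hypothetical, and the returned strings merely stand in for the OptionSet references that the real cases contain.

```python
# Illustrative model of the "frameFormats" CaseSet: choose a case from the
# value of "sensor_format" and fall back to an empty string by default.
# The returned strings are stand-ins for the OptionSet references in the rule.

def frame_format_case(msg):
    cases = {
        "full-frame": "full-frame",
        "APS-C": "APS-C",
    }
    # Any other value (or a missing key) selects the default case.
    return cases.get(msg.get("sensor_format"), "")


print(frame_format_case({"sensor_format": "APS-C"}))
```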

Once-Only Explanations of Acronyms and Technical Terms Using OptionSets
When a report uses an acronym or a technical term, it is good practice to write it out in full or explain what it means, but only the first time it is used. An OptionSet allows you to do this. That is the purpose of two OptionSets in file resolutionLexRules.xml. One has id="fullframeOptions" and the other id="apscOptions". Both OptionSets give a choice between two strings. The first string is a full explanation, for example: “full-frame (equivalent in size to 36 mm × 24 mm film)”. The second string is a shorter version, for example: “full-frame”. An important feature ensures that the first string is always generated the first time this OptionSet is accessed in a report and that the second string is always used for subsequent accesses. The feature that does this is strategy="last".

This technique is used in the following lex rules files:

  • File resolutionLexRules.xml, OptionSets id="fullframeOptions" and id="apscOptions"
  • File autoFocusLexRules.xml, OptionSet id="autoFocusOptions"
  • File continuousModelLexRules.xml, OptionSet id="continuousModelOptions"
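The once-only behaviour can be modelled in a few lines of Python. This sketch covers only the two-option case described above (full explanation first, short form afterwards); how strategy="last" treats longer option lists is not covered here, and the class name is hypothetical.

```python
class OnceOnlyOptions:
    """Return the first option on first access and the last option afterwards."""

    def __init__(self, options):
        self.options = list(options)
        self.accessed = False

    def choose(self):
        choice = self.options[-1] if self.accessed else self.options[0]
        self.accessed = True
        return choice


fullframe_options = OnceOnlyOptions([
    "full-frame (equivalent in size to 36 mm × 24 mm film)",
    "full-frame",
])

explained = fullframe_options.choose()  # full explanation, first access only
short = fullframe_options.choose()      # short form from then on
print(explained)
print(short)
```

The state lives in the option set rather than in the calling rule, which is why the explanation appears once per report regardless of which camera subsection triggers it first.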

Complex Conditional Variation
You can build quite complex language variations by combining conditional lex rule selection with OptionSets used for random variation. We demonstrate this technique in the lex rule with id="resolution" in file resolutionLexRules.xml. This uses reference-by-group to a set of lex rules, then conditional rule selection, and finally references to OptionSets. The reference-by-group to a group of verb phrase (VP) rules is rule=".sensorChoiceGroups". This links to a group of three VP rules that specify groups="sensorChoiceGroups". Each VP rule also has a condition that constrains whether it is chosen. In our example, this depends on the value of the JSON object key "megapixels", for example condition="{{msg.megapixels >= 40}}". The conditions cover all possible cases (high, medium, and low) to ensure that one of the rules is chosen. Each VP rule contains a reference-by-ID to an OptionSet with three alternatives. This time, the OptionSet has strategy="random_loop", meaning that an option is chosen randomly but is then not chosen again until all other options have been used.
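The two mechanisms combined above, conditional rule selection followed by a random choice without repetition, can be sketched in Python as follows. This is an illustrative model rather than Studio's implementation. Only the `>= 40` threshold comes from the text; the other threshold, the placeholder phrases, and all class and function names are hypothetical.

```python
import random


class RandomLoopOptions:
    """Pick options at random, repeating none until all have been used."""

    def __init__(self, options, rng=None):
        self.options = list(options)
        self.rng = rng or random.Random()
        self.remaining = []

    def choose(self):
        if not self.remaining:
            # Start a new cycle in a fresh random order.
            self.remaining = self.options[:]
            self.rng.shuffle(self.remaining)
        return self.remaining.pop()


def select_rule(msg, rules):
    """Return the option set of the first rule whose condition holds."""
    for condition, option_set in rules:
        if condition(msg):
            return option_set
    raise ValueError("no rule condition matched")


# Thresholds other than 40 and all phrases are hypothetical placeholders.
sensor_choice_group = [
    (lambda m: m["megapixels"] >= 40,
     RandomLoopOptions(["high A", "high B", "high C"])),
    (lambda m: 25 <= m["megapixels"] < 40,
     RandomLoopOptions(["medium A", "medium B", "medium C"])),
    (lambda m: m["megapixels"] < 25,
     RandomLoopOptions(["low A", "low B", "low C"])),
]

options = select_rule({"megapixels": 45.4}, sensor_choice_group)
cycle = [options.choose() for _ in range(3)]
print(sorted(cycle))
```

Because the conditions partition the whole value range, `select_rule` always finds a match; the `ValueError` only guards against a rule group whose conditions leave a gap.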

Variation in Sentence Structures
The sentence about a camera’s auto-focus demonstrates how sentences with very different structures can be implemented. The sentence is generated with the call realise((camera,"autofocusClass")). In the Lex Rules view, open the lex rules file autoFocusLexRules.xml and you will find five lex rules with classes="autofocusClass". These lex rules have IDs autofocus1, autofocus2, autofocus3, autofocus4, and autofocus5. Between them, they use four different structures (autofocus1 and autofocus5 share the same structure). The choice among the five rules is constrained by the condition specified in each rule, for example: condition="{{msg.autofocus_points > 100}}". The resulting sentences are very different, for example:

“The Canon EOS 5D Mark IV features a professional autofocus system, 61-point AF, 41 cross-type, for all your action shots.” and “A fast and effective 33-point AF, 25 cross-type system gives the K-1 Mark II the sharp focus you need.”

Subject Elision
Open the lex rules file otherFeaturesLexRules.xml and you will find two lex rules with id="screenSize" and id="video". Both rules have the feature <Feature name="elidable" value="true"/> in the Subject (Subj) component. This means that if two or more adjacent sentences have identical subjects, the sentences can be joined and the subjects of the sentences after the first can be omitted. Our example is trivial, but you can see the idea: with two separate sentences you would get, for instance, “This expert-level Japanese camera has a 3.2 inch touchscreen with 1,620,000 dots per inch. This expert-level Japanese camera captures 4K video.”, but with subject elision you would see “This expert-level Japanese camera has a 3.2 inch touchscreen with 1,620,000 dots per inch and captures 4K video.”
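A minimal Python sketch of the idea, assuming each sentence is already split into a subject and a predicate, might look like the following. It is not how Studio performs elision; the function name is hypothetical, and joining with a plain "and" is a simplification that reads well only for two or three predicates.

```python
def elide_subjects(sentences):
    """Join adjacent (subject, predicate) pairs that share an identical subject."""
    merged = []  # list of (subject, [predicates]) in narrative order
    for subject, predicate in sentences:
        if merged and merged[-1][0] == subject:
            # Same subject as the previous sentence: keep only the predicate.
            merged[-1][1].append(predicate)
        else:
            merged.append((subject, [predicate]))
    return [f"{subject} {' and '.join(predicates)}." for subject, predicates in merged]


sentences = [
    ("This expert-level Japanese camera",
     "has a 3.2 inch touchscreen with 1,620,000 dots per inch"),
    ("This expert-level Japanese camera", "captures 4K video"),
]
joined = elide_subjects(sentences)
print(joined[0])
```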

Syntactic Aggregation
Syntactic aggregation is similar to subject elision, but in syntactic aggregation, two or more sentences that are identical apart from their subjects are joined. For example, “John likes chocolate.” and “Mary likes chocolate.” are aggregated to form “John and Mary like chocolate.”

See the lex rules file priceLexRules.xml and you will find two lex rules at the end of the file with classes="threeThousandClass" and classes="underThousandClass". The feature that enables aggregation to take place is in the Subj component of the rule, <Feature name="surface_aggregatable" value="true"/>.

There is only one camera that costs over £3,000, so the first rule generates “In the high price range, the EOS 5D costs over £3,000.” However, there are two cameras costing under £1,000, so instead of generating “Among the lower-priced options, the D3400 is valued at under £1,000.” and “Among the lower-priced options, the Canon EOS Rebel T7i is valued at under £1,000.”, the sentences are aggregated to produce “Among the lower-priced options, the D3400 and the Canon EOS Rebel T7i are valued at under £1,000.” Notice how the verb is automatically inflected to agree with a singular or plural subject.
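The aggregation and agreement behaviour described above can be approximated with the Python sketch below. It is a simplified illustration, not Studio's mechanism: the caller supplies both verb forms instead of relying on automatic inflection, the serial comma for three or more subjects is an assumption, and the function names are hypothetical.

```python
def join_subjects(subjects):
    """Coordinate subjects: 'A', 'A and B', or 'A, B, and C'."""
    if len(subjects) == 1:
        return subjects[0]
    if len(subjects) == 2:
        return " and ".join(subjects)
    return ", ".join(subjects[:-1]) + ", and " + subjects[-1]


def aggregate(prefix, subjects, verb_singular, verb_plural, rest):
    """Build one sentence for all subjects, with the verb agreeing in number."""
    verb = verb_singular if len(subjects) == 1 else verb_plural
    return f"{prefix} {join_subjects(subjects)} {verb} {rest}"


high = aggregate("In the high price range,", ["the EOS 5D"],
                 "costs", "cost", "over £3,000.")
low = aggregate("Among the lower-priced options,",
                ["the D3400", "the Canon EOS Rebel T7i"],
                "is valued", "are valued", "at under £1,000.")
print(high)
print(low)
```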

Date Formatting
A lex rule formats a date time stamp string as “July 2017”. The rule is referenced in a call to the realise function, realise((camera,"announceDateClass")). Look in the lex rules file dateLexRules.xml and you will see a sentence (S) rule with classes="announceDateClass". The Text component with reference="{{msg.date}}" and refType="dateTime" means that this is a special date-time reference using the value of the "date" key in the input JSON object, a time-stamp string, for example: "2017-07-25T11:30:00". The time stamp is formatted according to the format string “MMMM yyyy” (based on the Java DateTimeFormatter) to produce “July 2017”.
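For readers who want to check the expected result outside Studio, the same transformation can be reproduced in Python. This is only an equivalent sketch: it uses the strftime pattern "%B %Y" as the counterpart of “MMMM yyyy”, the function name is hypothetical, and the month name is in English only under the default (C or English) locale.

```python
from datetime import datetime


def format_month_year(timestamp):
    """Format an ISO 8601 timestamp as full month name plus four-digit year."""
    parsed = datetime.fromisoformat(timestamp)
    # "%B %Y" is the strftime counterpart of the Java pattern "MMMM yyyy".
    return parsed.strftime("%B %Y")


formatted = format_month_year("2017-07-25T11:30:00")
print(formatted)
```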

Country Adjectives
If the JSON data has country names, a lex rule can express them as adjectives; this means that if the data has “Australia”, it can appear in the narrative as “Australian”. Open the lex rules file otherFeaturesLexRules.xml and you will find two lex rules with id="screenSize" and id="video". Both rules have an adjective phrase (AdjP) component with word="{{msg.country}}". Whenever the data item is a country name, the correct adjectival form is generated. As it turns out, all cameras in our dataset were made in Japan, so you will see the country adjective “Japanese” when you preview the output.

Subject-Verb Agreement
All Sentence (S) lex rules demonstrate automatic subject-verb agreement. For example, in the Summary section sentence “Among the lower-priced options, the D3400 and the Canon EOS Rebel T7i are valued at under £1,000.”, the verb ‘value’ in the passive form is inflected to ‘are valued’ because there are two cameras in the subject. This sentence is produced by the S lex rule with classes="underThousandClass" in file priceLexRules.xml.

Modifier-Count Noun Agreement
The S lex rule with classes="screenSizeClass" in file otherFeaturesLexRules.xml has a noun phrase component containing a Number modifier in front of the count noun “dot”: <Number int="{{msg.screen_dots}}"/> <N word="dot"/>. When the value of the current camera’s screen_dots key is supplied, the noun “dot” is inflected to agree with the number. For instance, if the number is 2359000, the noun is inflected to produce “dots”.
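The agreement rule can be pictured with a small Python sketch. It is a naive stand-in for the real inflection: it simply appends "s" for any count other than 1 and so does not handle irregular plurals, and the function name is hypothetical. The thousands separators mirror the number style seen in the generated output.

```python
def count_noun_phrase(count, noun):
    """Render a number with thousands separators and a noun that agrees with it."""
    # Naive pluralization: append "s" unless the count is exactly 1.
    word = noun if count == 1 else noun + "s"
    return f"{count:,} {word}"


phrase = count_noun_phrase(2359000, "dot")
print(phrase)
```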