LEXICAL EXTRACTION

To create a corpus map, we first need to select a set of lexical sequences with which to model the text, as well as a label to represent each of these sequences in the map.

The label can be the sequence itself as found in the corpus, or a normalized version that is easier to read or that abstracts away from variability in how the "concept" expressed by a set of sequences is referred to. For instance, we may label the singular and plural forms of a sequence with the singular, or the past tense of a verb with the infinitive, and so on.
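As a rough illustration of the labelling idea, the sketch below groups surface sequences under a shared label. The mapping table and the function name are hypothetical and hand-written for this example; in practice the mapping would come from a lemmatizer or an entity linker, not from a fixed dictionary.

```python
from collections import defaultdict

# Hypothetical mapping from surface forms to a normalized label.
# It is written by hand here purely for illustration.
VARIANT_TO_LABEL = {
    "election district": "election district",
    "election districts": "election district",  # plural -> singular
    "vote": "vote",
    "voted": "vote",                             # past tense -> infinitive
}

def group_by_label(sequences):
    """Group surface sequences under their label.

    A sequence without an entry in the mapping acts as its own label.
    """
    groups = defaultdict(list)
    for seq in sequences:
        label = VARIANT_TO_LABEL.get(seq, seq)
        groups[label].append(seq)
    return dict(groups)

groups = group_by_label(
    ["election districts", "voted", "election district", "judge"]
)
```

Here "election districts" and "election district" end up under one label, while "judge", which has no entry in the mapping, is kept as found.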

We used two methods for lexical extraction. The first method, explained under tab 1 below, is based on Entity Linking to DBpedia. The second method, keyphrase extraction with the Yatea tool, is described below.

1. DBpedia ENTITY MENTIONS


2. KEYPHRASE EXTRACTION


Keyphrases are important terms in a corpus, and extracting them is sometimes used to get an overview of the corpus. To identify them, we used the Yatea tool.

In lexical extraction based on Entity Linking, a set of lexical items was normalized to a single label representing the set. For keyphrase-based extraction, no such normalization was performed: each keyphrase is used as "its own label".

Yatea

Yatea is a rule-based keyphrase extractor. It takes as its input part-of-speech tagged text in Treetagger output format.

We performed part-of-speech tagging (PoS-tagging) with Treetagger. Based on the PoS tags, Yatea first chunks the text to identify noun phrases, according to configurable PoS patterns. The tool then filters the resulting noun phrases to eliminate candidates that, although they match one of the expected patterns, contain uninformative sequences. For instance, terms containing the preposition + noun sequence "of course" would be filtered out.
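The two steps described above (pattern-based chunking, then filtering) can be sketched as follows. This is a simplified illustration, not Yatea's implementation: the PoS pattern, the tag classes, the stop-sequence list, and all function names are assumptions made for the example, and the input is a small invented sentence in a three-column token, tag, lemma layout.

```python
import re

# Invented sample in a tab-separated layout: token, PoS tag, lemma.
TAGGED = """\
the\tDT\tthe
principle\tNN\tprinciple
of\tIN\tof
utility\tNN\tutility
is\tVBZ\tbe
a\tDT\ta
matter\tNN\tmatter
of\tIN\tof
course\tNN\tcourse
"""

def parse_tagged(text):
    """Return (token, tag) pairs from three-column tagged text."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            rows.append((parts[0], parts[1]))
    return rows

def tag_class(tag):
    """Map a PoS tag to a one-letter class so patterns can be regexes."""
    if tag.startswith("NN") or tag.startswith("NP"):
        return "N"  # noun
    if tag.startswith("JJ"):
        return "J"  # adjective
    if tag == "IN":
        return "P"  # preposition
    if tag == "DT":
        return "D"  # determiner
    return "O"      # anything else

# Assumed pattern: adjectives, nouns, then optional "prep (det) adj* noun+".
NP_PATTERN = re.compile(r"J*N+(?:PD?J*N+)*")

# Assumed list of uninformative sequences.
STOP_SEQUENCES = ["of course"]

def extract_candidates(rows):
    """Chunk the token sequence into noun-phrase candidates."""
    classes = "".join(tag_class(tag) for _, tag in rows)
    return [
        " ".join(tok for tok, _ in rows[m.start():m.end()])
        for m in NP_PATTERN.finditer(classes)
    ]

def is_informative(candidate):
    """Reject candidates containing a stop sequence as whole words."""
    padded = f" {candidate} "
    return not any(f" {stop} " in padded for stop in STOP_SEQUENCES)

candidates = extract_candidates(parse_tagged(TAGGED))
kept = [c for c in candidates if is_informative(c)]
```

Both "principle of utility" and "matter of course" match the pattern, but only the first survives the filter, since the second contains "of course". Encoding each tag as a single letter keeps the pattern readable and makes match offsets line up with token positions.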

We configured the tool to output both phrases with several words and single-word phrases.

Annotation Selection

Keyphrases with at least 10 occurrences in the corpus were initially kept, giving a list of ca. 2550 terms. This list was filtered further with regular expressions to eliminate ill-formed terms, such as terms containing punctuation, which result from tokenization errors caused by irregular corpus formatting. After applying the regular expressions, the list was filtered manually to eliminate the remaining irrelevant terms. This yielded a final list of approx. 1950 terms. From these, the 250 most frequent terms were used to create corpus maps. The list of terms is shown below.
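The selection steps can be sketched as a small pipeline. The threshold of 10 and the cut-off of 250 come from the description above; the well-formedness regular expression, the function name, the `rejected` parameter (a stand-in for the manual filtering step), and the toy counts are assumptions for illustration only.

```python
import re
from collections import Counter

MIN_OCCURRENCES = 10  # frequency threshold given in the text
TOP_N = 250           # number of terms used for the corpus maps

# Assumed well-formedness check: lowercase words joined by spaces or hyphens.
WELL_FORMED = re.compile(r"[a-z]+(?:[ -][a-z]+)*")

def select_terms(counts, min_occurrences=MIN_OCCURRENCES, top_n=TOP_N,
                 rejected=frozenset()):
    """Apply frequency, regex, and manual filters, then keep the top terms."""
    # Step 1: keep terms that occur often enough.
    frequent = {t: n for t, n in counts.items() if n >= min_occurrences}
    # Step 2: drop ill-formed terms (e.g. containing punctuation).
    well_formed = {t: n for t, n in frequent.items()
                   if WELL_FORMED.fullmatch(t)}
    # Step 3: drop terms rejected by hand.
    kept = {t: n for t, n in well_formed.items() if t not in rejected}
    # Step 4: return the most frequent terms, highest count first.
    return [t for t, _ in Counter(kept).most_common(top_n)]

# Toy counts invented for the example; not taken from the corpus.
toy_counts = {
    "public opinion": 40,
    "law ,": 30,        # tokenization error: stray punctuation
    "judge": 25,
    "of course": 12,    # well-formed but irrelevant
    "rare term": 3,     # below the frequency threshold
}

selected = select_terms(toy_counts, top_n=2, rejected={"of course"})
```

With these toy counts, "rare term" falls below the threshold, "law ," fails the regular expression, and "of course" is removed by the manual stand-in, leaving "public opinion" and "judge".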

Term Table

The list of 250 keyphrases, selected as just described, follows here:

Keyphrases
absolute monarchy | english lawyers | litigation | public functionary
act of parliament | english practice | logic | public interest
adam smith | evidence | lord chancellor | public mind
aggregate mass | evil | lord president | public opinion
american united states | exchequer bills | lords delegates | public opinion tribunal
anglo-american united states | external instruments of felicity | majesty | public spirit
annuity | factitious causes | majority of the people | pure monarchy
appropriate active talent | factitious delay | man | quantity of money
appropriate aptitude | factitious honor | marginal insertion | quantity of time
appropriate moral aptitude | factitious reward | mass of money | question of fact
arbitrary power | failure of justice | matter of corruption | question of law
author of the acts | female sex | members | radical reform
bank paper | fictitious entities | men of law | rate of interest
bentham esq | fictitious entity | military force | real evidence
bill | fide appeals | mischief | real law
body of men | fide defendant | mixt monarchy | real wealth
body of the law | fide suitor | mode of procedure | reform
body of the people | field of law | mode of voting | reign
breach of trust | field of legislation | monarch | relation
british constitution | forms of government | money | religion
business of government | forthcomingness | moral sanction | religion of jesus
case | freedom of suffrage | national wealth | review chamber
case admitts | general interest | natural causes | rise of prices
cause | general rule | natural procedure | rule of action
circumstantial evidence | great britain | natural system | scotch law
civil war | greater number | nature of man | secrecy of suffrage
common interest | greatest happiness | nature of the case | secret mode
common sense | greatest number | new south wales | securities
commons house | hands of the judge | non agenda | security
community | holy ghost | number | self-regarding interest
constituted authorities | house of commons | number of individuals | separate interest
constitution | human beings | number of the members | side of the cause
constitutional law | human breast | number of the persons | single hand
constitutive power | human happiness | official establishment | single individual
corrupt dependence | human mind | open mode | single person
corruption | individual | operative power | single word
corruptive influence | individual case | original draught | sinister end
country | individual instance | paul | sinister influence
court | individual occasion | penitentiary house | social affection
court of justice | influence | person | species of evidence
defendant | influence of understanding | plaintiff | standard of rectitude
degree | influence of will | pleasure | state of dependence
delay vexation | inner house | point of fact | statute law
difficulty | instruments of felicity | political community | statutory law
direct evidence | interest of the people | political power | substantive branch of the law
distant dependencies | interest of the ruling | political state | sum of money
division of power | interest of the subject | population | supreme operative
doctrine | jeremy bentham | power | supreme operative power
efficient cause | judge | powers of government | system of pleading
efficient causes | judicial establishment | presence of the judge | system of procedure
election | judicial injustice | principal fact | theory
election district | judicial procedure | principle of utility | tothill fields
election districts | jurisprudential law | private interest | ultramarian provinces
elector | justice | probative force | universality
encrease of wealth | language | profit | unwritten law
end of government | law | proper end | vast majority
end of justice | legislative power | proper end of government | vices
ends of judicature | legislator | proportion | westminster hall
english constitution | limited monarchy | public discussion | work
english jurisprudence | line of conduct | public functionaries | written evidence