LEXICAL EXTRACTION
To create a corpus map, we first need to select a set of lexical sequences with which
to model the text, as well as a label to represent each of these sequences in the map.
The label can be the sequence itself as found in the corpus, or a normalized
version that is easier to read or that abstracts away from variability in how
the "concept" expressed by a set of sequences is referred to. For instance, we may label
the singular and plural forms of a sequence with the singular, or the past tense of a verb
with the infinitive, and so on.
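The idea of mapping several surface forms to one label can be sketched in Python as follows. This is only an illustration: the variant table and the function name `label_for` are hypothetical and are not taken from the project. A real system would derive such a mapping from a lemmatizer or a linked resource rather than from a hand-written dictionary.

```python
# Hypothetical table mapping surface forms to a normalized label.
# Illustration only; not the mapping used for the corpus maps.
VARIANT_TO_LABEL = {
    "election districts": "election district",  # plural -> singular
    "efficient causes": "efficient cause",      # plural -> singular
    "voted": "vote",                            # past tense -> infinitive
}


def label_for(sequence: str) -> str:
    """Return the label for a lexical sequence.

    The sequence is lowercased and its whitespace collapsed first.
    Sequences without an entry in the table act as their own label.
    """
    key = " ".join(sequence.lower().split())
    return VARIANT_TO_LABEL.get(key, key)


if __name__ == "__main__":
    for seq in ["Election  Districts", "voted", "public opinion"]:
        print(seq, "->", label_for(seq))
```

Sequences that have no entry in the table are returned unchanged, which corresponds to using the sequence itself as the label.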
We used two methods for lexical extraction. The first method, explained on
tab 1 below, is based on Entity Linking to DBpedia. The second method, described here,
is keyphrase extraction with the Yatea tool.
Keyphrases are important terms in a corpus, and keyphrase extraction is sometimes
used to get an overview of a corpus. To identify keyphrases, we used the Yatea tool.
Unlike lexical extraction based on Entity Linking, where a set of lexical items
was normalized to a single label representing the set, keyphrase-based extraction
involved no such normalization. We could say that each keyphrase is used as "its own label".
Yatea
Yatea is a rule-based keyphrase extractor. It takes as input part-of-speech
tagged text in Treetagger output format, so we performed part-of-speech tagging
(PoS-tagging) with Treetagger. Based on the PoS tags, Yatea first chunks the text
to identify noun phrases, according to configurable PoS patterns. The tool then
filters the resulting noun phrases to eliminate candidates that match one of the
expected patterns but contain uninformative sequences. For instance, terms
containing the preposition + noun sequence "of course" would be filtered out.
We configured the tool to output both multi-word and single-word phrases.
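The two stages described above (PoS-pattern chunking, then filtering of uninformative sequences) can be illustrated with a small Python sketch. This is not Yatea's implementation or its configuration format. The pattern, the coarse tag classes, the stop-sequence list and all function names are assumptions made for the example. The input is assumed to be tab-separated token, tag and lemma columns with Penn-style English tags (`NN`, `JJ`, `IN`, `DT`).

```python
import re

# Assumed coarse classes for a few Penn-style tags; anything else becomes "X".
COARSE = {"JJ": "A", "NN": "N", "NNS": "N", "IN": "P", "DT": "D"}

# Toy noun-phrase pattern over coarse classes: optional adjectives, one or more
# nouns, optionally followed by one prepositional phrase (e.g. "court of justice").
# A lone noun also matches, so single-word candidates are produced as well.
PATTERN = re.compile(r"A*N+(?:PD?A*N+)?")

# Hypothetical list of uninformative sequences used for filtering.
STOP_SEQUENCES = ("of course",)


def read_tagged(text):
    """Parse tab-separated lines (token, tag, lemma) into (token, tag) pairs."""
    pairs = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def extract_candidates(pairs):
    """Return lowercased token sequences whose tag classes match PATTERN."""
    classes = "".join(COARSE.get(tag, "X") for _, tag in pairs)
    phrases = []
    for match in PATTERN.finditer(classes):
        # One class letter per token, so match offsets are token indices.
        words = [tok.lower() for tok, _ in pairs[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


def drop_uninformative(phrases):
    """Remove candidates that contain one of the stop sequences."""
    return [
        p for p in phrases
        if not any(f" {stop} " in f" {p} " for stop in STOP_SEQUENCES)
    ]


SAMPLE = "\n".join([
    "a\tDT\ta",
    "matter\tNN\tmatter",
    "of\tIN\tof",
    "course\tNN\tcourse",
    "in\tIN\tin",
    "the\tDT\tthe",
    "court\tNN\tcourt",
    "of\tIN\tof",
    "justice\tNN\tjustice",
])

if __name__ == "__main__":
    candidates = extract_candidates(read_tagged(SAMPLE))
    print(candidates)
    print(drop_uninformative(candidates))
```

On the sample, chunking proposes "matter of course" and "court of justice". The filter then drops the first candidate because it contains "of course".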
Annotation Selection
Keyphrases with at least 10 occurrences in the corpus were initially kept, giving a
list of ca. 2550 terms. This list was filtered further with regular expressions to
eliminate ill-formed terms, for example terms containing punctuation as a result of
tokenization errors caused by irregular corpus formatting.
After the regular expressions were applied, the list was filtered manually to
eliminate remaining irrelevant terms. This yielded a final list of approx. 1950
terms. From these, the 250 most frequent terms were used to create corpus maps.
The list of terms is shown below.
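The selection steps just described (frequency threshold, regular-expression filter, manual filter, top-N cut) can be sketched in Python as follows. The regular expression, the function name `select_terms`, the `rejected` set standing in for the manual filtering step, and the toy counts are all assumptions made for illustration. The actual expressions used in the project are not given in the text.

```python
import re
from collections import Counter

# Assumed well-formedness check: lowercase words joined by single spaces or
# hyphens, so that terms carrying stray punctuation are rejected.
WELL_FORMED = re.compile(r"[a-z]+(?:[ -][a-z]+)*")


def select_terms(counts, min_count=10, top_n=250, rejected=frozenset()):
    """Apply the filters in sequence and return the top_n most frequent terms.

    counts:   mapping from keyphrase to its number of occurrences
    rejected: terms removed by hand (stand-in for the manual filtering step)
    """
    # 1. Keep keyphrases with at least min_count occurrences.
    frequent = {t: c for t, c in counts.items() if c >= min_count}
    # 2. Drop ill-formed terms with a regular expression.
    well_formed = {t: c for t, c in frequent.items() if WELL_FORMED.fullmatch(t)}
    # 3. Drop terms judged irrelevant by hand.
    kept = {t: c for t, c in well_formed.items() if t not in rejected}
    # 4. Rank by descending frequency, breaking ties alphabetically.
    ranked = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Toy counts, invented for the example.
    toy_counts = Counter({
        "law": 120,
        "public opinion": 40,
        "judge ,": 15,                  # ill-formed: stray punctuation
        "self-regarding interest": 12,
        "ditto": 11,                    # removed by hand in this example
        "rare term": 3,                 # below the frequency threshold
    })
    print(select_terms(toy_counts, top_n=3, rejected={"ditto"}))
```

Ties in frequency are broken alphabetically here so that the result is deterministic. The text does not say how ties were handled in the original selection.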
Term Table
The list of 250 keyphrases, selected as just described, follows here:
Keyphrases |
absolute monarchy | english lawyers | litigation | public functionary |
act of parliament | english practice | logic | public interest |
adam smith | evidence | lord chancellor | public mind |
aggregate mass | evil | lord president | public opinion |
american united states | exchequer bills | lords delegates | public opinion tribunal |
anglo-american united states | external instruments of felicity | majesty | public spirit |
annuity | factitious causes | majority of the people | pure monarchy |
appropriate active talent | factitious delay | man | quantity of money |
appropriate aptitude | factitious honor | marginal insertion | quantity of time |
appropriate moral aptitude | factitious reward | mass of money | question of fact |
arbitrary power | failure of justice | matter of corruption | question of law |
author of the acts | female sex | members | radical reform |
bank paper | fictitious entities | men of law | rate of interest |
bentham esq | fictitious entity | military force | real evidence |
bill | fide appeals | mischief | real law |
body of men | fide defendant | mixt monarchy | real wealth |
body of the law | fide suitor | mode of procedure | reform |
body of the people | field of law | mode of voting | reign |
breach of trust | field of legislation | monarch | relation |
british constitution | forms of government | money | religion |
business of government | forthcomingness | moral sanction | religion of jesus |
case | freedom of suffrage | national wealth | review chamber |
case admitts | general interest | natural causes | rise of prices |
cause | general rule | natural procedure | rule of action |
circumstantial evidence | great britain | natural system | scotch law |
civil war | greater number | nature of man | secrecy of suffrage |
common interest | greatest happiness | nature of the case | secret mode |
common sense | greatest number | new south wales | securities |
commons house | hands of the judge | non agenda | security |
community | holy ghost | number | self-regarding interest |
constituted authorities | house of commons | number of individuals | separate interest |
constitution | human beings | number of the members | side of the cause |
constitutional law | human breast | number of the persons | single hand |
constitutive power | human happiness | official establishment | single individual |
corrupt dependence | human mind | open mode | single person |
corruption | individual | operative power | single word |
corruptive influence | individual case | original draught | sinister end |
country | individual instance | paul | sinister influence |
court | individual occasion | penitentiary house | social affection |
court of justice | influence | person | species of evidence |
defendant | influence of understanding | plaintiff | standard of rectitude |
degree | influence of will | pleasure | state of dependence |
delay vexation | inner house | point of fact | statute law |
difficulty | instruments of felicity | political community | statutory law |
direct evidence | interest of the people | political power | substantive branch of the law |
distant dependencies | interest of the ruling | political state | sum of money |
division of power | interest of the subject | population | supreme operative |
doctrine | jeremy bentham | power | supreme operative power |
efficient cause | judge | powers of government | system of pleading |
efficient causes | judicial establishment | presence of the judge | system of procedure |
election | judicial injustice | principal fact | theory |
election district | judicial procedure | principle of utility | tothill fields |
election districts | jurisprudential law | private interest | ultramarian provinces |
elector | justice | probative force | universality |
encrease of wealth | language | profit | unwritten law |
end of government | law | proper end | vast majority |
end of justice | legislative power | proper end of government | vices |
ends of judicature | legislator | proportion | westminster hall |
english constitution | limited monarchy | public discussion | work |
english jurisprudence | line of conduct | public functionaries | written evidence |