LEXICAL EXTRACTION
To create a corpus map, we first need to select a set of lexical sequences to model the
text with, as well as a label to represent each lexical sequence in the map.
The label can be the sequence itself as found in the corpus, or a normalized
version that is easier to read or that abstracts away from variability in the ways of referring
to the "concept" expressed by a set of sequences. For instance, we may label the singular
and plural forms of a sequence with the singular, or the past tense of a verb with
the infinitive, and so on.
We used two methods for lexical extraction. The first method, explained below, is based
on Entity Linking to DBpedia. The second method is keyphrase extraction with the Yatea tool,
and is described on tab 2 (below, on the right).
1. DBpedia ENTITY MENTIONS
DBpedia
The terms we chose to map the corpus with are based on lexical sequences found in the corpus
that express DBpedia concepts (read on for links and details).
DBpedia is a knowledge base that expresses Wikipedia content
(including text, category hierarchy, link structure and other structured information)
in semantic web format, which makes it easy for a program to use that information
automatically.
DBpedia is navigable in this browser.
More information about it is available in this description.
DBpedia Spotlight
The tool used to find sequences in the text that express DBpedia concepts was
DBpedia Spotlight [doc] [demo].
The tool compares the content of sequences of words in a text against DBpedia articles.
The comparison is based on several elements:
- The definition text in the DBpedia article
- Links in the DBpedia article's text
- DBpedia link structure (redirection links, homonymy etc.)
If the comparison provides a good match, a DBpedia term is assigned to its mention in the
text (i.e. the lexical sequence expressing that term).
This makes it possible to tag the same DBpedia term in the text in spite of varying ways to
refer to it. For instance, a term like Monarch can be tagged whether the word monarch or the
word king is found in the text.
The technology used by DBpedia Spotlight is called Entity Linking / Wikification.
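As an illustration only, the sketch below shows how a text could be submitted to Spotlight
for annotation and how mentions could be read from the answer. It assumes the public
Spotlight web service and its JSON response fields; this page does not state which Spotlight
deployment or version was used for this work, and the function names are hypothetical.
The request is built but not sent, since sending it requires network access.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the public DBpedia Spotlight web service (English model).
# The deployment actually used for this work is not stated on this page.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_annotate_request(text, confidence=0.1):
    """Build (but do not send) an annotation request for a piece of text."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(f"{SPOTLIGHT_URL}?{query}",
                   headers={"Accept": "application/json"})


def mentions_from_response(payload):
    """Turn a decoded JSON answer into (mention, DBpedia URI) pairs.

    Assumes the service's JSON layout: a "Resources" list whose items
    carry "@surfaceForm" (the mention) and "@URI" (the DBpedia term).
    """
    return [(r["@surfaceForm"], r["@URI"])
            for r in payload.get("Resources", [])]


request = build_annotate_request("The king summoned parliament.")
```

To run it for real, the request would be passed to `urllib.request.urlopen` and the body
decoded with `json.loads` before calling `mentions_from_response`.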
Term Selection
Automatic term selection
Spotlight finds many more terms than were used in this work's maps.
We restricted the terms using the following criteria:
- Minimum extraction confidence of 0.1. Spotlight assigns a confidence score to each
extraction based on factors like the number of competing candidate DBpedia terms for a
mention (e.g. a mention like Clinton is compatible with more terms than the more precise
mention Hillary Clinton).
- Minimum corpus frequency of 100 for each variant referring to a DBpedia term (i.e. each
mention). Less frequent variants were not considered.
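The two criteria above can be sketched as a simple filter. This is a minimal sketch, not the
code used for this work: the input format (one `(mention, term, confidence)` tuple per
extraction) and the function name are hypothetical, and we assume that frequency is counted
after the confidence filter and without regard to case, which the criteria do not spell out.

```python
from collections import Counter

MIN_CONFIDENCE = 0.1  # threshold stated in the criteria above
MIN_FREQUENCY = 100   # threshold stated in the criteria above


def select_variants(extractions, min_conf=MIN_CONFIDENCE, min_freq=MIN_FREQUENCY):
    """Keep (term, variant) pairs that pass both thresholds.

    extractions: iterable of (mention, term, confidence) tuples, one per
    occurrence found in the corpus (hypothetical input format).
    Returns a dict mapping (term, lower-cased mention) to its frequency.
    """
    counts = Counter()
    for mention, term, confidence in extractions:
        # Assumption: low-confidence extractions do not count towards frequency.
        if confidence >= min_conf:
            counts[(term, mention.lower())] += 1
    return {pair: n for pair, n in counts.items() if n >= min_freq}


# Toy data with a lowered frequency threshold, for illustration only.
toy = [("King", "Monarch", 0.9)] * 3 + [("king", "Monarch", 0.05), ("crown", "Crown", 0.5)]
selected = select_variants(toy, min_freq=3)
```

In the toy data, the single low-confidence "king" extraction is ignored and "crown" is
dropped for being too rare, so only the Monarch / king pair survives.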
Manual refinement by a domain expert
A domain expert corrected a small number of obvious errors made by DBpedia Spotlight.
Otherwise the list below was not further refined by a domain expert, in order to show what
the automatic method can provide by itself.
Deeper involvement of an expert, selecting more terms that are specifically relevant to
Bentham's work, could be expected to result in more informative maps.
Term Labels
In the maps, instead of using the DBpedia label for a term,
we used the most frequent variant with which the term was expressed in the corpus
(see the Term Table below).
In cases where a mention has been disambiguated as more than one DBpedia term,
the most frequent term was retained.
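The labelling rule can be sketched as follows. This is a minimal sketch under stated
assumptions: the input is a hypothetical dict of `(term, mention)` frequencies, "most
frequent term" is read as the term most often assigned to that particular mention, and
ties are resolved arbitrarily since the text does not say how they were handled.

```python
from collections import defaultdict


def label_terms(counts):
    """Choose a map label for each DBpedia term.

    counts: dict mapping (term, mention) to corpus frequency
    (hypothetical input format).
    Returns a dict mapping term to (label, variants by decreasing frequency).
    """
    # Step 1: a mention linked to several terms keeps only its most frequent term.
    best = {}
    for (term, mention), n in counts.items():
        if mention not in best or n > best[mention][1]:
            best[mention] = (term, n)

    # Step 2: group the remaining mentions (variants) under their term.
    variants = defaultdict(dict)
    for mention, (term, n) in best.items():
        variants[term][mention] = n

    # Step 3: the label is the most frequent variant of the term.
    return {
        term: (max(v, key=v.get), sorted(v, key=v.get, reverse=True))
        for term, v in variants.items()
    }


# Invented frequencies, for illustration only.
toy_counts = {
    ("Monarch", "king"): 500,
    ("Monarch", "monarch"): 200,
    ("King_(chess)", "king"): 3,
}
labels = label_terms(toy_counts)
```

With these invented counts, the mention "king" is kept for Monarch only, and Monarch is
labelled "king" because that variant is more frequent than "monarch".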
Validity of the Method
Is it valid to use a current knowledge source like Wikipedia / DBpedia in order to
analyze an 18th / 19th century corpus?
The validity of the approach is suggested by the fact that a domain expert found
that the terms in each map's clusters correspond to issues Bentham discussed
in relation to the general theme of each cluster.
However, the domain expert also found that maps based on keyphrase extraction are more
informative for a Bentham specialist than the DBpedia-based maps, as they contain
terms very specific to Bentham's thought that are absent from the maps based on DBpedia
mentions. The expert saw the DBpedia-based maps as helpful for users unfamiliar with
Bentham, giving them a first overview of his writings.
Term Table
Column Label in the table is the label used in the maps to represent occurrences
of the textual mentions given in column Variants.
Matching of variants against the corpus was configured as case-insensitive.
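Case-insensitive matching of variants can be sketched with a regular expression. This is an
illustration, not the matching performed by the tools used in this work: the function name
is hypothetical, and whole-word matching with a preference for longer variants is our
assumption. The three rows are taken from the Term Table.

```python
import re


def build_tagger(table):
    """Return a function that maps a text to the labels of the variants it contains.

    table: dict mapping each label to its list of variants.
    """
    variant_to_label = {v.lower(): label
                        for label, variants in table.items()
                        for v in variants}
    # Assumption: longer variants win, so "common law" is not tagged as "law".
    alternatives = sorted(variant_to_label, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,  # case-insensitive, as stated in the text
    )

    def tag(text):
        return [variant_to_label[m.group(0).lower()] for m in pattern.finditer(text)]

    return tag


tag = build_tagger({
    "king": ["monarch", "king", "monarchy"],
    "law": ["law", "legal", "laws"],
    "common law": ["common law"],
})
found = tag("The King upheld the Common Law and other laws.")
```

Here `found` lists one label per matched variant, in text order, regardless of the
capitalization used in the text.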
Note that the tool that creates the corpus maps from the output of the lexical extraction
process just described (CorText Manager [doc]) was set to create two sets of maps, each
using a subset of the 270 terms that came out of the lexical extraction process (using the
settings described above).
- First, maps with 157 terms were created (see the links for "150 terms" maps)
- Second, more detailed maps with 261 terms were created (see the links for "250 terms" maps)
Label | Variants
abuse | abuse
action | action
acts | acts
addition | addition
aggregate | aggregate
appeal | appeals, appeal
application | application
applied | applied
aptitude | aptitude
argument | arguments, argument
art | art
article | article
attention | attention
authority | authorities, authority
belief | belief
benefit | benefit
bentham | jeremy bentham, bentham
bill | bill
body | body
bribery | bribery, bribe
capital | capital
case | case, cases
class | class
code | code
codification | codification
common law | common law
community | community
conception | conception
consideration | consideration
constitution | constitutional, constitution
contract | contract
corinthians | cor
corruption | corruption
cortes | cortes
country | country, countries
court | court, courts
court of session | court of session
crime | criminal, crime
crown | crown
damage | damage
death | death
decision | decision
defence | defence
defendant | defendant
degree | degree
demand | demand
democracy | democracy
design | design
despotism | despotism
dignity | dignity
discourse | discourse
doctrine | doctrine
dominion | dominion
duty | duty
economy | economy
election | election
elector | elector
employed | employed
england | england
english | english
english law | english law
entities | entities, entity
equity | equity
evidence | proof, evidence
evil | evils, bad, evil
execution | execution
exercised | exercise, exercised
existence | existence, existing
expected | expected, expectation
experience | experience
fact | fact
faculty | faculty
faith | faith
fallacies | fallacies
fear | fear
fide | fide
field | field
force | force
foreign | foreign
fraud | fraud
free | free
function | function
god | god
good | good
goods | goods
government | government
hands | hands, hand
happiness | happiness
hope | hope
house of commons | house of commons
house of lords | house of lords
human | human
idea | idea
income | income
individual | individuals, individual
injury | injury
injustice | injustice
object | object
instrument | instrument
intellectual | intellectual
interest | interest
jesus | jesus
john | john
judge | judge
judgment | judgment
judicature | judicatory, judicial, judicature
jurisdiction | jurisdiction
jurisprudential | jurisprudential
jury | juries, jury
justice | justice
king | monarch, king, monarchy
knowledge | knowledge
labour | labour
language | language
law | law, legal, laws
lawyers | lawyers, lawyer
learned | learned
legislation | legislation
legislator | legislator
length | length
letter | letter
liable | liable
liberty | liberty
life | life
logic | false, logic
lordship | lordship, lord
love | love
luke | luke
majority | majority
man | men, man
mark | mark
mass | mass
matter | matter
measure | measure
member | member
mind | mind, minds
minister | minister
miracle | miracles, miracle
money | pecuniary, money
moral | moral
motion | motion
nation | nation, nations
number | number
object | object
obligation | obligation
observation | observed, observations, observation
office | offices, office
official | official
opinion | opinions, opinion
opposition | opposition
ordinary | ordinary
pain | pain
paper | paper
parliament | parliament
parliamentary | parliamentary
parties | parties
party | party
patronage | patronage
paul | paul
people | people
performed | performed
person | person, persons
personal | personal
persuasion | persuasion
peter | peter
plaintiff | plaintiff
plan | plan
pleasure | pleasure
point | point
political | political
population | population
possession | possession
power | power
powers | powers
practical | practical
practice | practice
prejudice | prejudice
price | prices, price
principle | principle
private | private
probability | probable, probability
procedure | procedure
productive | productive
profit | profit
property | property, properties
proportion | proportion
proposition | proposition, propositions
public | public
punishment | punishment
purpose | purpose
quality | quality
quantity | amount, quantity
question | question
rate | rate
real | real
reason | reason
reference | reference
reform | reform
reform bill | reform bill
regulations | regulations
relation | relation
religion | religion
remedy | remedy
representatives | legislative, representatives
reputation | reputation
rights | rights
rule | rule
rulers | rulers
sacrifice | sacrifice
science | science
seat | seat
security | security
service | service
share | share
silent | silent
sin | sin
sinister | sinister
sir | sir
social | social
society | society
source | source
space | space
spain | spain
spanish | spanish
spanish america | spanish america
species | species
stock | stock
subject | subject
subordinate | subordinate
suffering | suffering, suffer
suffrage | suffrage
lawsuit | litigation, suit
supreme | supreme
tax | tax, taxes
testimony | testimony
theory | theory
thought | thought
title | title
trade | trade
tripoli | tripoli
trust | trust
truth | truth
understanding | understood, understand, understanding
universal | universal
universal suffrage | universal suffrage
utility | utility
view | view
virtue | virtue
vote | vote, voting
war | war
wealth | wealth
whigs | whigs
wisdom | wisdom
witness | witnesses, witness
word | word
worth | worth
writing | written, writing
year | year, years