LEXICAL EXTRACTION

To create a corpus map, we first need to select a set of lexical sequences with which to model the text, as well as a label to represent each of those sequences in the map.

The label can be the sequence itself as found in the corpus, or a normalized version that is easier to read or that abstracts away from variability in the ways of referring to the "concept" expressed by a set of sequences. For instance, we may label the singular and plural forms of a sequence with the singular, or the past tense of a verb with the infinitive, and so on.

We used two methods for lexical extraction. The first method, explained below, is based on Entity Linking to DBpedia. The second method is keyphrase extraction with the Yatea tool, and is described on tab 2 below, on the right.

1. DBpedia ENTITY MENTIONS


2. KEYPHRASE EXTRACTION


DBpedia

The terms we chose to map the corpus with are based on lexical sequences found in the corpus that express DBpedia concepts (read on for links and details).

DBpedia is a knowledge-base that expresses Wikipedia content (including text, category hierarchy, link structure and other structured information) in semantic web format, which makes it easy for a program to use that information automatically.

DBpedia is navigable in this browser. More information about it is available in this description.

DBpedia Spotlight

The tool used to find sequences in the text that express DBpedia concepts was DBpedia Spotlight [doc] [demo]. The tool compares sequences of words in a text against DBpedia articles. The comparison is based on several elements. If the comparison provides a good match, a DBpedia term is assigned to its mention in the text (i.e. the lexical sequence expressing that term). This makes it possible to tag the same DBpedia term in the text in spite of varying ways of referring to it. For instance, a term like Monarch can be tagged whether the word monarch or the word king is found in the text.

The technology used by DBpedia Spotlight is called Entity Linking, also known as Wikification.
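As an illustration of what the output of entity linking looks like, here is a minimal sketch that reads a response in the JSON shape returned by the DBpedia Spotlight web service and pairs each DBpedia term with its mention. The sample response is hand-written for this example (it is not actual output for the Bentham corpus), and the function name `extract_mentions` is hypothetical.

```python
import json

# Hand-written sample in the shape of a DBpedia Spotlight JSON response.
# The values are illustrative only, not real tool output.
SAMPLE_RESPONSE = json.dumps({
    "@text": "The king and the monarch both appear here.",
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/Monarch",
         "@surfaceForm": "king", "@offset": "4"},
        {"@URI": "http://dbpedia.org/resource/Monarch",
         "@surfaceForm": "monarch", "@offset": "17"},
    ],
})


def extract_mentions(response_text):
    """Return (term, mention, offset) triples from a Spotlight-style response."""
    data = json.loads(response_text)
    triples = []
    # Treat a missing "Resources" key as "no annotations found".
    for resource in data.get("Resources", []):
        # The term name is the last path segment of the DBpedia URI.
        term = resource["@URI"].rsplit("/", 1)[-1]
        mention = resource["@surfaceForm"]
        offset = int(resource["@offset"])
        triples.append((term, mention, offset))
    return triples


mentions = extract_mentions(SAMPLE_RESPONSE)
for term, mention, offset in mentions:
    print(term, mention, offset)
```

Both mentions, king and monarch, end up attached to the same term, which is what allows a single label to stand for several variants in the maps.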

Term Selection

Automatic term selection
Spotlight finds many more terms than were used in this work's maps. We restricted the terms using the following criteria:

Manual refinement by a domain-expert
A domain expert corrected a small number of obvious errors made by DBpedia Spotlight.

Otherwise the list below was not further refined by a domain-expert, in order to show what the automatic method can provide by itself.

It can be expected that deeper involvement of an expert in order to select more terms specifically interesting for Bentham's work could result in more informative maps.

Term Labels

In the maps, instead of using the DBpedia label for a term, we used the most frequent variant with which the term was expressed in the corpus (see the Term Table below).
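The label choice just described can be sketched as a frequency count over the mentions of each term. This is a minimal illustration, not the code used for the maps: the function name `choose_labels`, the lower-casing step, and the sample data are assumptions made for the example.

```python
from collections import Counter, defaultdict


def choose_labels(mentions):
    """Map each DBpedia term to the variant that expresses it most often."""
    counts = defaultdict(Counter)
    for term, surface_form in mentions:
        # Assumption: variants are compared in lower case.
        counts[term][surface_form.lower()] += 1
    # most_common(1) yields the single most frequent (variant, count) pair.
    return {term: c.most_common(1)[0][0] for term, c in counts.items()}


# Hypothetical (term, mention) pairs, invented for illustration.
sample_mentions = [
    ("Monarch", "king"), ("Monarch", "King"), ("Monarch", "monarch"),
    ("Jury", "juries"), ("Jury", "jury"), ("Jury", "jury"),
]
labels = choose_labels(sample_mentions)
print(labels)
```

In this toy data the term Monarch is labelled king because that variant occurs twice against a single occurrence of monarch.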

In cases where a mention has been disambiguated as more than one DBpedia term, the most frequent term was retained.
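A minimal sketch of this rule follows. It assumes that "most frequent" means the term most often assigned to that particular mention; the source does not state whether frequency was counted per mention or over the whole corpus. The function name `resolve_mentions` and the sample term names are hypothetical.

```python
from collections import Counter, defaultdict


def resolve_mentions(annotations):
    """Keep, for each mention, the term it was most often linked to."""
    counts = defaultdict(Counter)
    for mention, term in annotations:
        # Assumption: frequency is counted per mention, in lower case.
        counts[mention.lower()][term] += 1
    return {mention: c.most_common(1)[0][0] for mention, c in counts.items()}


# Hypothetical (mention, term) pairs, invented for illustration.
sample_annotations = [
    ("bill", "Bill_(law)"),
    ("bill", "Bill_(law)"),
    ("Bill", "Invoice"),
    ("crown", "The_Crown"),
]
resolved = resolve_mentions(sample_annotations)
print(resolved)
```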

Validity of the Method

Is it valid to use a current knowledge source like Wikipedia / DBpedia to analyze an 18th / 19th century corpus?

The validity of the approach is suggested by the fact that a domain-expert found that the terms in each of the map's clusters correspond to issues Bentham discussed as regards the general theme of each cluster.

However, the domain-expert also found that maps based on keyphrase extraction are more informative for a Bentham specialist than the DBpedia-based maps, as they contain very specific terms in Bentham's thought absent from the maps based on DBpedia mentions. The DBpedia-based maps were perceived by the expert as helpful for users unfamiliar with Bentham, for them to get a first overview of his writings.

Term Table

The Label column of the table gives the label used in the maps to represent occurrences of the textual mentions listed in the Variants column.

Matching of variants against the corpus was configured as case-insensitive.
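To make the case-insensitive matching concrete, here is a minimal sketch that counts label occurrences in a text from a small label-to-variants table. It is an illustration under stated assumptions, not the CorText Manager configuration: the names `TERM_TABLE`, `build_matcher` and `count_labels` are hypothetical, and the use of whole-word regular-expression matching is an assumption.

```python
import re

# Two rows taken from the Term Table, in label -> variants form.
TERM_TABLE = {
    "king": ["monarch", "king", "monarchy"],
    "common law": ["common law"],
}


def build_matcher(term_table):
    """Compile one case-insensitive pattern covering every variant."""
    variant_to_label = {}
    for label, variants in term_table.items():
        for variant in variants:
            variant_to_label[variant.lower()] = label
    # Longer variants first, so multiword variants are tried before short ones.
    ordered = sorted(variant_to_label, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    # Assumption: variants are matched as whole words only.
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return pattern, variant_to_label


def count_labels(text, term_table):
    """Count occurrences of each label, whatever the case of the mention."""
    pattern, variant_to_label = build_matcher(term_table)
    counts = {}
    for match in pattern.finditer(text):
        label = variant_to_label[match.group(0).lower()]
        counts[label] = counts.get(label, 0) + 1
    return counts


counts = count_labels(
    "The King upheld the Common Law; the monarchy endured.", TERM_TABLE
)
print(counts)
```

Here King and monarchy are both counted under the label king, and Common Law is matched despite its capitalization.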

Note that the tool that creates the corpus maps (CorText Manager) [doc], working from the output of the lexical extraction process just described, was set to create two sets of maps by selecting two subsets of 270 terms that came out of the lexical extraction process (using the settings described above).

Label               Variants
abuse               abuse
action              action
acts                acts
addition            addition
aggregate           aggregate
appeal              appeals, appeal
application         application
applied             applied
aptitude            aptitude
argument            arguments, argument
art                 art
article             article
attention           attention
authority           authorities, authority
belief              belief
benefit             benefit
bentham             jeremy bentham, bentham
bill                bill
body                body
bribery             bribery, bribe
capital             capital
case                case, cases
class               class
code                code
codification        codification
common law          common law
community           community
conception          conception
consideration       consideration
constitution        constitutional, constitution
contract            contract
corinthians         cor
corruption          corruption
cortes              cortes
country             country, countries
court               court, courts
court of session    court of session
crime               criminal, crime
crown               crown
damage              damage
death               death
decision            decision
defence             defence
defendant           defendant
degree              degree
demand              demand
democracy           democracy
design              design
despotism           despotism
dignity             dignity
discourse           discourse
doctrine            doctrine
dominion            dominion
duty                duty
economy             economy
election            election
elector             elector
employed            employed
england             england
english             english
english law         english law
entities            entities, entity
equity              equity
evidence            proof, evidence
evil                evils, bad, evil
execution           execution
exercised           exercise, exercised
existence           existence, existing
expected            expected, expectation
experience          experience
fact                fact
faculty             faculty
faith               faith
fallacies           fallacies
fear                fear
fide                fide
field               field
force               force
foreign             foreign
fraud               fraud
free                free
function            function
god                 god
good                good
goods               goods
government          government
hands               hands, hand
happiness           happiness
hope                hope
house of commons    house of commons
house of lords      house of lords
human               human
idea                idea
income              income
individual          individuals, individual
injury              injury
injustice           injustice
object              object
instrument          instrument
intellectual        intellectual
interest            interest
jesus               jesus
john                john
judge               judge
judgment            judgment
judicature          judicatory, judicial, judicature
jurisdiction        jurisdiction
jurisprudential     jurisprudential
jury                juries, jury
justice             justice
king                monarch, king, monarchy
knowledge           knowledge
labour              labour
language            language
law                 law, legal, laws
lawyers             lawyers, lawyer
learned             learned
legislation         legislation
legislator          legislator
length              length
letter              letter
liable              liable
liberty             liberty
life                life
logic               false, logic
lordship            lordship, lord
love                love
luke                luke
majority            majority
man                 men, man
mark                mark
mass                mass
matter              matter
measure             measure
member              member
mind                mind, minds
minister            minister
miracle             miracles, miracle
money               pecuniary, money
moral               moral
motion              motion
nation              nation, nations
number              number
object              object
obligation          obligation
observation         observed, observations, observation
office              offices, office
official            official
opinion             opinions, opinion
opposition          opposition
ordinary            ordinary
pain                pain
paper               paper
parliament          parliament
parliamentary       parliamentary
parties             parties
party               party
patronage           patronage
paul                paul
people              people
performed           performed
person              person, persons
personal            personal
persuasion          persuasion
peter               peter
plaintiff           plaintiff
plan                plan
pleasure            pleasure
point               point
political           political
population          population
possession          possession
power               power
powers              powers
practical           practical
practice            practice
prejudice           prejudice
price               prices, price
principle           principle
private             private
probability         probable, probability
procedure           procedure
productive          productive
profit              profit
property            property, properties
proportion          proportion
proposition         proposition, propositions
public              public
punishment          punishment
purpose             purpose
quality             quality
quantity            amount, quantity
question            question
rate                rate
real                real
reason              reason
reference           reference
reform              reform
reform bill         reform bill
regulations         regulations
relation            relation
religion            religion
remedy              remedy
representatives     legislative, representatives
reputation          reputation
rights              rights
rule                rule
rulers              rulers
sacrifice           sacrifice
science             science
seat                seat
security            security
service             service
share               share
silent              silent
sin                 sin
sinister            sinister
sir                 sir
social              social
society             society
source              source
space               space
spain               spain
spanish             spanish
spanish america     spanish america
species             species
stock               stock
subject             subject
subordinate         subordinate
suffering           suffering, suffer
suffrage            suffrage
lawsuit             litigation, suit
supreme             supreme
tax                 tax, taxes
testimony           testimony
theory              theory
thought             thought
title               title
trade               trade
tripoli             tripoli
trust               trust
truth               truth
understanding       understood, understand, understanding
universal           universal
universal suffrage  universal suffrage
utility             utility
view                view
virtue              virtue
vote                vote, voting
war                 war
wealth              wealth
whigs               whigs
wisdom              wisdom
witness             witnesses, witness
word                word
worth               worth
writing             written, writing
year                year, years