Corpus linguistics has recently emerged as a method for addressing problems in legal interpretation. Corpus linguistics draws on evidence of language use from large, coded, electronic collections of natural language that can be designed to sample the linguistic conventions of a wide variety of speech communities, industries, or linguistic contexts. And corpora (plural of corpus) have begun to see increasing use by judges, scholars, and advocates, including in the U.S. Supreme Court. This Teleforum will first provide an overview for those unfamiliar with corpus linguistics, and then address advantages and limitations of using language evidence from linguistic corpora in legal interpretation, such as when interpreting contracts, statutes, or constitutions, as well as highlight the use of corpus linguistics in recent cases.
Donald A. Daugherty, Jr., Senior Counsel, Wisconsin Institute for Law and Liberty
Stephen C. Mouritsen, Shareholder, Parr Brown Gee & Loveless
James C. Phillips, Assistant Professor of Law, Fowler School of Law, Chapman University
Teleforum calls are open to all dues paying members of the Federalist Society. To become a member, sign up on our website. As a member, you should receive email announcements of upcoming Teleforum calls which contain the conference call phone number. If you are not receiving those email announcements, please contact us at 202-822-8138.
Dean Reuter: Welcome to Teleforum, a podcast of The Federalist Society's practice groups. I’m Dean Reuter, Vice President, General Counsel, and Director of Practice Groups at The Federalist Society. For exclusive access to live recordings of practice group teleforum calls, become a Federalist Society member today at fedsoc.org.
Nick Marr: Welcome to The Federalist Society’s Teleforum conference call. This afternoon, August 6, will be a discussion on “Corpus Linguistics in Legal Interpretation.” My name is Nick Marr. I’m Assistant Director of Practice Groups at The Federalist Society.
As always, please note that all expressions of opinion are those of the experts on today’s call.
Today I have the privilege of introducing our moderator, Don Daugherty. He’s Senior Counsel at the Wisconsin Institute for Law and Liberty. And Don, before I hand it off to you, a quick note for our audience. We’re hoping to have some time left over at the end for Q&A, so be thinking of questions as we go along and have them ready to go for when we get to that portion of the call. And with that, Don, thanks for being with us. The floor is yours.
Donald A. Daugherty, Jr.: Great. Thanks, Nick, and good afternoon to everybody. Good morning, I guess, depending if you’re sitting out west. My name is Don Daugherty. I’m Senior Counsel at the Wisconsin Institute for Law and Liberty, which is a litigation and policy shop based out of Milwaukee, Wisconsin.
I appreciate everybody’s flexibility and apologize for the technical difficulties on Monday that we had. And we were able to reschedule. It’s unfortunate. I think someone needs to tell, perhaps, Senator Whitehouse that FedSoc needs even more dark money to maybe buy our own satellite or something like that. But in any event, we got through it, and we’re all here today. And we’re going to be hearing today about truly what I think is a cutting-edge legal tool: corpus linguistics.
As a neophyte, I understand corpus linguistics to essentially allow us to go back to the future in order to interpret legal text. Now, I’m certain our two speakers today will give you a much better, much more in-depth understanding of the topic than I just gave you. Those two speakers are James Phillips and Stephen Mouritsen, who are leading experts in this area.
Before I introduce James and Stephen, let me first preview our format for you. To begin with, Stephen will give us a primer on corpus linguistics so that we can better understand how the tool works. James will then discuss the use of corpus linguistics in constitutional interpretation. And Stephen will then talk about its application to statutory and contractual interpretation, although he may cover some of the statutory interpretation issues when he gives us his opening primer. And hopefully afterwards, again, we’ll have some time for your questions.
So onto our speakers. James Phillips is an Assistant Professor at Chapman University’s law school in Southern California. He teaches courses in civil procedure and law and religion there. He has published more than two dozen academic articles in journals like the Penn and Southern Cal Law Reviews, as well as the Harvard Journal of Law and Public Policy. His shorter writing has appeared in outlets like the Atlantic, the LA Times, the National Review. Notably for our purposes today, James designed and he supervised the initial stages of the creation of the Corpus of Founding Era American English. And he is a pioneer in applying corpus linguistics to constitutional interpretation.
After law school, James served as a visiting assistant professor at BYU’s law school. Prior to that, he clerked for Justice Thomas Lee of the Utah Supreme Court. And Justice Lee actually is one of the judges who is truly a pioneer and a leader in the area of corpus linguistics. James also clerked for Judge Thomas Griffith of the D.C. Circuit Court of Appeals. He has been a fellow at the Becket Fund and at Stanford’s Constitutional Law Center.
He has worked in private practice as well, focusing on First Amendment issues and Supreme Court litigation. James has a PhD in jurisprudence and social policy from Cal Berkeley. He earned his law degree also from Cal Berkeley and graduated Order of the Coif and served on the Law Review. He has a masters in mass communication from BYU and a bachelor’s degree in history from Arizona State.
Stephen Mouritsen, Stephen currently serves as an adjunct professor at BYU’s law school where he has taught courses on the theory and practice of legal interpretation and law and corpus linguistics. In addition to his role at BYU, Stephen is also a shareholder at the Salt Lake City law firm of Parr Brown Gee & Loveless. His writing has appeared in the Yale Law Journal, the Washington Law Review, the Columbia Science and Technology Law Review, and on the Volokh Conspiracy law blog.
He has been cited by judges in the Third, Sixth, and Tenth Circuit Courts of Appeal, as well as the Idaho, Michigan, Montana, and Utah Supreme Courts. His work has also been cited in casebooks on legislation and contracts and in the Congressional Research Service’s report on Statutory Interpretation. Stephen received his JD magna cum laude from BYU and also received a masters in linguistics and a bachelor’s degree in English from BYU. It’s a pleasure today to be able to hear from both James and Stephen. Now I’ll turn the floor over to Stephen.
Stephen C. Mouritsen: Thank you, Don. Thank you for that introduction. So we’re talking today about corpus linguistics, which for a lot of listeners may be something that’s new. It’s the focus of my research and writing. I’m going to give sort of a basic outline of what corpus linguistics is about and then how it has comparatively recently begun to be applied to questions of legal interpretation.
A corpus is an electronic collection of natural language texts that is designed to serve as a representative sample of the language uses of a particular speech community at a particular point in time. So the idea is to gather examples of the way the different speech communities use language, whether spoken or written. Now, I didn’t invent corpus linguistics. Corpus linguistics, in particular computational corpus linguistics, has been around now for 50 or 60 years.
Linguists use corpora, which is the plural of corpus, to make observations about language that cannot be made through introspection or with a dictionary. And they’ve been relying on linguistic corpora for decades in fields like lexicography, in language teaching, in automated speech recognition and machine learning and machine translation. And basically wherever there’s an interface between language and technology, somebody is sampling human linguistic behavior by creating a corpus.
Now, since I started writing my first papers on law and corpus linguistics, we’ve seen -- and certainly since Justice Thomas Lee has started putting corpus linguistic reasoning in some of his earlier opinions, we’ve seen sort of an explosion of academic interest. Judges and lawyers and legal scholars have been citing and engaging and certainly criticizing the use of linguistic corpora in legal interpretation. My work has been to examine ways in which you can apply corpora to questions of interpretation.
So I want to give just one example, and this is from statutory interpretation. And if you were to pick up a case book on legislation, one of the cases that they’re almost certainly going to deal with is the case of Muscarello v. United States. That’s a United States Supreme Court case from 1998. And the facts of the case were that Frank Muscarello had unlawfully sold marijuana that he had carried in his truck to the place of sale.
And the officers found a handgun when he was arrested that was locked in the truck’s glove compartment. And under Title 18 U.S.C. § 924(c)(1), it says that you get a five-year mandatory prison term increase for anyone who uses or carries a firearm during and in relation to a drug trafficking crime. So the question that goes up to the United States Supreme Court is what is the ordinary meaning of “carries a firearm,” or is a guy who has a firearm locked in his glove box carrying the firearm?
Now, the case came down to an interesting split. It was five justices for the majority, written by Justice Breyer, joined by Justice Thomas. The dissent was written by Justice Ginsburg, joined by Justice Scalia.
And the majority does something very, very familiar when you’re dealing with cases of the interpretation of the statute. The majority relies on dictionaries. So Justice Breyer has a section of this opinion that says, “Consider first the words’ primary meaning. The Oxford English Dictionary gives as its first definition ‘convey originally by cart or wagon, hence in any vehicle by ship or on horseback.’” I’ll add that Justice Breyer puts the word “first” in italics and then notes the first definition in the Webster’s Third New International Dictionary and the first definition in the Random House Dictionary.
So Justice Breyer engages in something that judges often do, which is an assumption that, number one, the ordinary meaning of this statutory term can be found in a dictionary and, number two, the more ordinary meaning is going to be the first one that is listed. Now, the problem with that reasoning is that, first of all, both the Webster’s Third New International Dictionary and the Oxford English Dictionary do not rank their senses according to how ordinary or how important they are. They rank them according to antiquity.
So in the Oxford English Dictionary, the reason that carry by means of a vehicle comes before carry on your person is because one of them was found first in English in 1310, and the other was found first in 1330 A.D. And that is one of the reasons that Justice Breyer gives for putting Frank Muscarello in jail for an additional five years.
Justice Breyer next says that the origin of the word “carries” explains why the first or basic meaning of the word “carry” includes conveyance in a vehicle and cites a number of etymological dictionaries that demonstrate that the word “carry” comes from the Latin carrum, which means car or cart. But this fallacy is so common that it already has a name and its own Wikipedia entry. It’s called the etymological fallacy, the idea that a word has to mean something that it meant in a long-ago cognate language.
But if that’s the case, then December should be the tenth month. October should be the eighth month. If you’re trying to interpret your company’s dress code, you wouldn’t be able to tell if you’re wearing a skirt or a shirt because those words both come from the same proto-Indo-European root. An anthology would be a bouquet of flowers. Words mean what they mean in the speech community that is using them at the time that they are used and not what they meant in the cognate language from whence they came.
So one of the things that Justice Breyer is doing though is a very, very familiar application of the use of dictionaries. He’s essentially saying, as judges frequently do, the plain or ordinary meaning of X is Y. See the dictionary. And if you were to enter essentially that sentence structure into Westlaw, you would break Westlaw because there are so many sentences -- so many cases in which judges make this type of reasoning. The plain or ordinary meaning of a word is X. See the dictionary.
But in fact, no dictionary sets out to provide the ordinary meaning. Ordinary meaning is a legal term and not a linguistic one. That’s why the Webster’s Third New International in the front matter said, “the best sense is the one that most aptly fits the context of an actual genuine utterance.”
So what are we left with? We’re left with a couple of limitations. One of the limitations is the limitation on dictionary use. Judges are turning to dictionaries for ordinary meaning, and dictionaries don’t contain ordinary meaning. We’re also left with one of the limitations of human intuition, which is that the more common a word is the more likely it is to have multiple different senses. So the word “carry,” for example, in the Oxford English Dictionary has dozens and dozens of different senses because it’s a very common word.
The problem is that the more senses a word has, the more you and I are likely to disagree about its meaning in a given context. And that creates this counterintuitive result that we are more likely to disagree about the meaning of common words than uncommon ones. It seems counterintuitive until you open up a legislation textbook and you see that the big debates about ordinary meaning very often center on very common words. So we have these limitations. And what those of us who advocate for corpus linguistics have proposed is using databases that sample language use to learn more about the way language is used and, by doing so, to try to find out what the ordinary meaning of a word might be in a given context.
So I’m going to talk a little bit now about what linguistic corpora are and some of the linguistic corpora that are available. I’m going to be mostly looking at the website English-corpora.org. Corpora’s spelled c-o-r-p-o-r-a. So English-corpora.org.
And most of my work has been with the corpora listed here, the “Corpus of Contemporary American English,” the “Corpus of Historical American English,” and what is called the “News on the Web”, or NOW Corpus. These are corpora that represent a variety of standard written American English. And in fact, the Corpus of Contemporary American English also has a spoken language section that includes transcriptions of speech.
So the corpus, then, is an electronic collection of speech or writing that’s designed to sample the language use of a given speech community in a given context and at a given point in history. You can also use comparative corpora to compare the language of different speech communities at the same time or different speech communities at different times, for example, comparing specialized regions or industries or the language use of a particular racial or ethnic group. And you can do a comparison of the way that different speech communities are using the language.
The corpora are also tagged with linguistic metadata. And what that means is that when you do a search in a corpus you don’t necessarily just have to do sort of a control F type word search. The corpus will allow you to search for parts of speech. So this word -- any noun that co-occurs with “carry” or any noun that co-occurs with “firearm” or appears within so many words of “firearm.” The corpus is tagged with that sort of metadata to allow you to look not just at individual words but at certain phrasal and clausal grammatical constructions.
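To make the idea of a tagged search concrete, here is a minimal sketch in Python. The hand-tagged sample sentences, the tag labels, and the function name `nouns_near` are all assumptions made up for illustration; they are not the actual interface or tag set of any corpus at English-corpora.org.

```python
# Toy tagged corpus: each sentence is a list of (word, part-of-speech) pairs.
# The sentences and tags are invented for illustration only.
TAGGED_SENTENCES = [
    [("he", "PRON"), ("carries", "VERB"), ("a", "DET"), ("pistol", "NOUN"),
     ("in", "ADP"), ("his", "PRON"), ("holster", "NOUN")],
    [("she", "PRON"), ("carried", "VERB"), ("the", "DET"),
     ("groceries", "NOUN"), ("home", "ADV")],
    [("the", "DET"), ("truck", "NOUN"), ("was", "VERB"), ("red", "ADJ")],
]

# Hypothetical list of inflected forms of the verb "carry".
CARRY_FORMS = {"carry", "carries", "carried", "carrying"}


def nouns_near(sentences, target_forms, window=4):
    """Return the nouns found within `window` tokens of any target form."""
    hits = []
    for sentence in sentences:
        for i, (word, _tag) in enumerate(sentence):
            if word.lower() not in target_forms:
                continue
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                other_word, other_tag = sentence[j]
                if j != i and other_tag == "NOUN":
                    hits.append(other_word.lower())
    return hits


print(nouns_near(TAGGED_SENTENCES, CARRY_FORMS))
```

The point of the sketch is only that the search key is a grammatical category (any noun) rather than a fixed string, which a plain control-F search cannot express.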
That allows you not just to do a word search in the way you would look up a word in a dictionary but to see how the words function in particular environments -- so taking into account syntax and inflection, the semantic role that the word plays, and even some of the pragmatic context, the social or spatial context in which the text containing the word appears. And the nature of the corpus is that you can construct a corpus out of any existing text. That means that you can construct a corpus to represent any type of speech community from any point in history for which there are surviving texts.
Now, the corpus that I used when I wrote about the Muscarello case in my first article on corpus linguistics -- I used the “Corpus of Contemporary American English,” which at the time was the largest freely available corpus. I should mention that all of the corpora at English-corpora.org are free to use. If you make a small donation, the designer of the corpus—which is not me. It’s my thesis chair, Mark Davies—will make them -- will eliminate the request for a donation that you’ll see. But other than that, they are free to use.
And I use the “Corpus of Contemporary American English” because it represents text from 1990 to the present in what’s called a monitor corpus, a corpus that is updated with new text every couple of months, so it remains current. And I performed a search. And what I was able to do with looking at Muscarello was not simply to search for the word “carry” and to see how the word “carry” is used because, as Justice Ginsburg points out in her dissent, Muscarello’s case was not about the meaning of “carry.” It was about the meaning of “carry a firearm.”
So one of the things you can do with a corpus is look at collocation, the statistical tendency of words to appear with other words. And I used the corpus to find all of the most common synonyms of “firearm”: “rifle,” “pistol,” and, of course, “firearm,” singular and plural -- all of the most common synonyms. And then I looked at “carry” within a certain number of words, basically within the same sentence, of the word “firearm” or any of its synonyms, where there was a human agent (somebody doing the carrying) and where the firearm was the thing carried -- so I was able to take into account not only the word but the specific context that appeared in the statute.
And what you find is that in the overwhelming majority of cases, something in the realm of 90 to 95 percent of the time, the phrase “carry a firearm” or “carry” plus a synonym of “firearm” has reference to carrying on one’s person. I should say, however, that that is not 100 percent of the time. It is -- you know, I’m calling you right now from Utah. And in Utah, it is a perfectly acceptable sentence to say, “I am carrying a firearm in my car.”
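The retrieval step just described, finding “carry” near “firearm” or one of its synonyms, can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the sentences are invented, the synonym list is a hypothetical one (it adds “gun” and “handgun” to the words named above), and the check for a human agent and for the firearm being the thing carried is left to a human reader of the retrieved lines.

```python
import re

# Hypothetical synonym list and toy sentences, invented for illustration.
FIREARM_WORDS = {"firearm", "firearms", "gun", "guns", "pistol", "pistols",
                 "rifle", "rifles", "handgun", "handguns"}
CARRY_FORMS = {"carry", "carries", "carried", "carrying"}

SENTENCES = [
    "The officer carries a pistol on his belt.",
    "She carried the groceries to the car.",
    "He was carrying a rifle in the trunk of his car.",
    "The store sells rifles and pistols.",
]


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def concordance_lines(sentences, node_forms, collocate_forms, window=6):
    """Keep sentences where a node form occurs within `window` tokens of a collocate."""
    kept = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        nodes = [i for i, tok in enumerate(tokens) if tok in node_forms]
        collocates = [i for i, tok in enumerate(tokens) if tok in collocate_forms]
        if any(abs(n - c) <= window for n in nodes for c in collocates):
            kept.append(sentence)
    return kept


for line in concordance_lines(SENTENCES, CARRY_FORMS, FIREARM_WORDS):
    print(line)
```

The sentences that survive this filter are the raw material for the analysis: each one still has to be read and classified by sense, for example as carrying on the person or carrying in a vehicle.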
So then you have a question, and you begin to be able to quantify, well, what do we mean by “ordinary meaning?” Do we insist that it is the exclusive or only meaning, or are we starting to have a conversation about probabilities? What would the person likely expect the word to mean? And I think in the case of “carry a firearm,” a person would most commonly think that that had reference to carrying on your person. And so as we talk about questions of notice and lenity, that might become a key issue.
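The shift from an exclusive meaning to a question of probabilities can be illustrated with a short sketch. Assume each retrieved concordance line has already been read and hand-labeled with a sense; the labels and counts below are invented purely for illustration and are not the results of the Muscarello study.

```python
from collections import Counter


def sense_shares(coded_lines):
    """Return each sense label's share of the hand-coded concordance lines."""
    counts = Counter(coded_lines)
    total = sum(counts.values())
    return {sense: count / total for sense, count in counts.items()}


# Invented labels for ten imaginary concordance lines; not real corpus results.
coded = ["on_person"] * 7 + ["in_vehicle"] * 2 + ["unclear"]

shares = sense_shares(coded)
for sense, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{sense}: {share:.0%}")
```

A table of shares like this does not decide the legal question by itself. It gives the interpreter a number to argue about: how dominant a sense has to be before it counts as the ordinary meaning.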
Now, I should say that nobody -- I think I can speak for James on this. But he’s about to speak, so he can speak for himself. But nobody is arguing that corpus linguistics is a black box into which you can input a question and get a definitive, 100 percent certain answer. What corpus linguistics does is it allows you to gather evidence about language usage from particular speech communities.
So when Justice Breyer says that the ordinary meaning of “carry a firearm” means carry in your car, first of all, we can call that conclusion into question. We can also start to have a serious discussion and give some content to these vague terms we use in the law. What, in fact, do we mean when we say “ordinary meaning”? Because it turns out that carry a firearm in your car is a much, much less common meaning. And so we begin to have a more serious and, I think, evidence-based discussion about what these terms in the law mean. And with that, I’ll turn the time to James.
James C. Phillips: Great. Thanks, Stephen, and thanks to The Federalist Society for hosting me today. I’m happy to talk about corpus linguistics, and I just want to note that I’m kind of standing on the shoulders of giants, so to speak. Stephen and Justice Thomas Lee are kind of the giants who first got this going.
So I want to talk about the context of corpus linguistics with constitutional interpretation, particularly originalism. So there are different stripes of originalism. There’s original intent. There’s original methods. The most common one is original public meaning. And what that is, as described by some scholars such as Michael Stokes Paulsen, Randy Barnett, Kurt Lash, and others, is the meaning the words and phrases of the Constitution would have had in context to ordinary readers, speakers, and writers of the English language reading a document of this type at the time adopted.
So how do we figure out how people 200 plus years ago, if you’re looking at the original Constitution, or 150 plus years ago, if you’re looking at, say, the Reconstruction Amendments, would have understood that document? Well, up until this point, originalists have tended to use a couple of different sources, both of which have severe shortcomings. One is they looked at dictionaries, and Stephen’s already highlighted some of the problems with modern dictionaries.
Founding Era dictionaries are even more problematic, besides the fact, common to all dictionaries, that dictionaries tend not to define phrases, and a phrase can be more or less than the linguistic sum of its parts, which is the concept of compositionality or the closely related idiom principle. I think Judge Easterbrook has a great example of this. He says try looking up “cut the mustard” in the dictionary. You look up “cut,” you look up “the,” you look up “mustard,” and it won’t tell you what the phrase “cut the mustard” means to us today.
But besides that, Founding Era dictionaries have some other problems. So one is what’s called lexicographical prescriptivism. And that’s basically the principle that dictionaries at that time, and for some time after that, tended to define words as they should be used or what their proper definitions were rather than how they might actually have been used. And in fact, when the Webster’s International Third Edition in the 1960s decided to break from this tradition and just define things on how they were being used, it was quite controversial and radical.
And another problem with Founding Era dictionaries is they’re often not the proper timeframe. For example, they will often provide examples of language usage from Shakespeare or the King James version of the Bible, which is centuries old by that time. And there’s the potential of missing what’s called linguistic drift. The meaning of words can change over time. And I’m going to give an example of that in a little bit.
Another problem with Founding Era dictionaries is, unlike today’s dictionaries, they tend to be the work of just one person, one mind, so they could be idiosyncratic. Samuel Johnson’s famous dictionary was Samuel Johnson’s. Webster’s Dictionary was Webster’s.
And on top of that, there tended to be a bit of what you might call copying of other dictionaries or, if you wanted to give it a fancy term, lexicographical piracy. You can imagine that putting together a dictionary by just one person is quite a feat. So they would tend to copy the definitions in other dictionaries. And this could mean that Webster might be copying Samuel Johnson. Samuel Johnson might be copying somebody from the 1730s. And you actually have a definition that’s 100 years old.
And finally, there’s the concept of lumpers and splitters. So some dictionaries tend to lump similar or related senses together into one kind of meta-sense. Others tend to split out senses or definitions based on their various nuances. And as you can imagine, when it’s the work of one mind, Founding Era dictionaries tended to be more lumpers. So dictionaries are great. I love dictionaries, but they can only get us so far.
The other thing that originalists have tended to do so far, both jurists and scholars, is they looked at a small sample size of sources from the Founding Era. And that’s just hard to generalize to the entire population of the United States, or at least the population that we want to look at -- the speech community or the public, as originalists would say. Just as if you were to ask ten people who they’re voting for for president in 2020, you would have no confidence that that’s how the nation’s going to vote.
Another problem with a handful of sources is they tend to be unrepresentative. They might primarily be the Federalist Papers, for example. And while the Federalist Papers are great, they may not necessarily represent how ordinary folks would have understood language at the time. And there can tend to be a little bit of -- cherry picking is always a problem when you just have a few samples.
And finally, there’s a tendency to mix legal and ordinary sources. And sometimes dictionaries -- the Constitution, for example, uses legal terms. “Corruption of blood” is a term that’s kind of a legal term of art. Maybe the man or woman on the street in 1789 wouldn’t know what corruption of blood meant. Maybe they thought it had something to do with being sick. So looking to how they would have understood that actually might be problematic. And those sources tend to get mixed interchangeably by originalists.
So that gives us corpus linguistics. To paraphrase a quote from Star Wars, corpus linguistics is the tool that originalists have been waiting for because it’s based on the principle that the best way to find out how people understand language is to find out how people actually use language. And Stephen has already talked a bit about some of the tools of corpus linguistics. I just want to say that it’s more familiar and more everywhere than people realize.
So for example, our brains work like this. And cellphones do this, right, when they predict the next word that you’re going to text or type. That’s using a form of corpus linguistics. Modern dictionaries are built using corpus linguistics and, in fact, have been for a little while. Some of you may have seen the movie -- I think it’s called The Professor and the Madman, or something like that, with Mel Gibson and Sean Penn, which is like a very early corpus of just papers in the Oxford English Dictionary’s first edition in the scriptorium there.
And you could think of Google or Westlaw as a type of corpus. It’s not a linguistic corpus. It’s not trying to represent a speech community per se. But it’s a rudimentary type of database or corpus.
And the law has recognized some of the tools that linguistic corpora use for some time. So Stephen mentioned collocation or collocates, which I think of as just word neighbors. Words appear in the same semantic environment as some words but not others. So the word “dark” is going to be more likely to appear near the word “light” or “night” but not near the word “perfume” because we just don’t use those two words together. And this collocation or word neighbor phenomenon has been recognized in the law in the principle of noscitur a sociis -- the thing is known by its associates.
But the problem with using corpus linguistics for originalism is there really wasn’t a corpus to use it on. So the oldest corpus we have in American English is the “Corpus of Historical American English,” and it doesn’t start until 1810, which is just a little bit too late for the original Constitution or the Bill of Rights because of the potential phenomenon of linguistic drift. So while I was a visiting professor at BYU, I had this idea to create this Founding Era corpus and so helped design it and kind of oversee the initial creation of it. And then after I left, others carried on and finished the work and did a great job.
Although, as with all corpora, obviously, it’s not perfect, and there are always improvements to be made. And COFAE, which can be found at lawcorpus.byu.edu (COFAE is what we call the “Corpus of Founding Era American English” for short), covers from 1760 to 1799. And it has kind of three smaller corpora within it. One is from the Evans Early American Imprint series, about a third of COFAE, and it has books, pamphlets, broadsides, speeches, and is kind of more of a mix of, quote/unquote, “ordinary folks” and Founders.
Then another third of the corpus is from the Founders’ archives -- Founders’ papers that the National Archives has put together. And that has mostly letters to and from Founders -- or about six famous Founders, actually. And then about a third of the corpus is legal documents from HeinOnline. So that’s statutes, cases, state records, legal treatises and the like.
And so I just want to walk through a few examples of using COFAE to try to get some leverage on what aspects of the Constitution may have meant. This first one is a very uncontroversial one that nobody’s really litigating over. It’s the term “domestic violence” in the Constitution where it protects the states against domestic violence. They can ask for the federal government or the Executive Branch to intervene and protect them in such instances.
If you do a frequency search using both COFAE and the “Corpus of Historical American English,” COHA, which goes from 1810 to 2000, you see that it’s used pretty sparingly, about zero to nine times a year -- or decade, sorry, a decade up until about the 1990s. And then it’s used 90 times a year roughly for the next two decades. So something happened in the ‘80s and ‘90s. And so frequency is one tool -- kind of an exploratory tool that you can use on a corpus.
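A frequency search of this kind amounts to counting hits for a phrase and grouping them by period. The following is a minimal sketch in Python; the dated snippets are invented stand-ins for corpus documents, and the function name `hits_by_decade` is hypothetical rather than part of any corpus tool.

```python
from collections import Counter

# Invented (year, text) records standing in for corpus documents.
DOCUMENTS = [
    (1788, "protect each state against invasion and domestic violence"),
    (1865, "the legislature asked for aid against domestic violence"),
    (1994, "a shelter for victims of domestic violence"),
    (1996, "domestic violence hotlines and domestic violence coalitions"),
]


def hits_by_decade(documents, phrase):
    """Count occurrences of `phrase` per decade across (year, text) records."""
    counts = Counter()
    for year, text in documents:
        decade = year // 10 * 10
        counts[decade] += text.lower().count(phrase.lower())
    return dict(sorted(counts.items()))


print(hits_by_decade(DOCUMENTS, "domestic violence"))
```

Raw counts like these are only exploratory. Because the amount of text differs from decade to decade, a fuller analysis would normally divide each count by the number of words in that decade's portion of the corpus before comparing periods.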
The next thing I looked at was collocates of domestic violence, and I broke it up based on that exploration of this change in frequency of usage. So from 1760 to 1979, the top collocates or word neighbors of “domestic violence” were words like “against,” “states,” “protect,” “convene,” “invasion,” “suppress,” “legislature,” “foreign,” “congress.” And that reflects the meaning that’s in the Constitution -- the Founding Era meaning of kind of a state insurrection. But from 1980 to 2009, the top collocates of domestic violence were “women,” “abused,” “honor,” “national,” “victims,” “killings,” “coalitions,” “issues,” and “violence.”
And you can see that it’s talking about a completely different sense of the word there and reflecting that -- the sense that’s most common nowadays, which is kind of abuse of a family member. And then, of course, you can always go through and categorize the senses that you’re seeing. And again, in that time period up until the end of the 1970s, 98 percent of the time it’s that insurrection sense. And then from the 1980s on, 96 percent of the time it’s this family abuse sense.
So Justice Lee and I joked in our paper that this wasn’t linguistic drift. This was linguistic divorce because in about ten years you saw a complete flip of almost predominantly being used one way to almost predominantly being used another way. And that’s something that the corpus can document.
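The collocate comparison behind this example, ranking the word neighbors of “domestic violence” separately for two periods, can be sketched as follows. The dated snippets and the stopword list are invented for illustration, and the ranking here uses raw counts, which is an assumption of this sketch rather than a description of how COFAE or any other corpus interface scores collocates.

```python
import re
from collections import Counter

# Invented (year, text) records; real corpora hold millions of words.
DOCUMENTS = [
    (1788, "protect the states against invasion and domestic violence"),
    (1790, "the legislature may apply to suppress domestic violence"),
    (1995, "women who are victims of domestic violence"),
    (1998, "a national coalition for victims of domestic violence"),
]

# Hypothetical list of function words to ignore when ranking neighbors.
STOPWORDS = {"the", "a", "and", "of", "to", "for", "who", "are", "may"}


def top_collocates(documents, phrase, start, end, window=5, n=5):
    """Rank words within `window` tokens of `phrase` in documents dated start..end."""
    phrase_tokens = phrase.lower().split()
    size = len(phrase_tokens)
    counts = Counter()
    for year, text in documents:
        if not start <= year <= end:
            continue
        tokens = re.findall(r"[a-z]+", text.lower())
        for i in range(len(tokens) - size + 1):
            if tokens[i:i + size] != phrase_tokens:
                continue
            left = tokens[max(0, i - window):i]
            right = tokens[i + size:i + size + window]
            counts.update(tok for tok in left + right if tok not in STOPWORDS)
    return counts.most_common(n)


print(top_collocates(DOCUMENTS, "domestic violence", 1760, 1979))
print(top_collocates(DOCUMENTS, "domestic violence", 1980, 2009))
```

Running the same query over two date ranges and comparing the lists side by side is what makes a change in sense visible even before anyone reads the individual lines.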
Here’s an example of something that’s a little bit more controversial, and that’s the word “commerce.” And commerce has -- scholars have argued that it has different senses, including a very broad, any kind of market-based economic activity sense, to a very narrow trade sense and somewhere in between. And so one of the things that I did was what are called engrams or cluster searches. And that’s where you’re trying to find kind of an X, Y, and Z pattern or an X and a Y or whatever the pattern may be.
And so one pattern that I found is that frequently in the Founding Era in these documents you’d see the phrase “agriculture,” “commerce,” and blank. And when you look at the words that show up as blank, overwhelming amount of the time the word is “manufactures.” Any of the other words that show up only usually show up once or twice, and there’s maybe six or eight other words that show up in that phrase: “agriculture, commerce, and.”
And so when you look at that “agriculture, commerce, and manufactures,” which is the most common phrase there that’s used, it begins to show you that, well, it wouldn’t make a lot of sense for commerce to mean any market-based economic activity because that would make agriculture and manufactures redundant. Right? It would be swallowed up in the term “commerce.” And so unless we’re going to make those redundant, which doesn’t make sense given how frequently that was used, perhaps then they have independent meaning. And that would point more towards the trade meaning of commerce.
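The pattern search described above, a fixed frame with one open slot, can be sketched with a regular expression. This is an illustrative approximation, not the search syntax of any particular corpus tool; `slot_fillers`, its handling of optional commas, and the three sample phrases are assumptions made for the example.

```python
import re
from collections import Counter

def slot_fillers(texts, lead_words):
    """Count the word that follows the sequence `lead_words`.

    Words may be separated by whitespace, optionally preceded by a
    comma or semicolon, so "commerce, and" and "commerce and" both match.
    """
    sep = r"[,;]?\s+"
    frame = sep.join(re.escape(w) for w in lead_words)
    pattern = re.compile(r"\b" + frame + sep + r"([a-z]+)", re.IGNORECASE)
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in pattern.findall(text))
    return counts

# Invented phrases, for illustration only.
toy_texts = [
    "the agriculture, commerce, and manufactures of the country",
    "Agriculture, commerce and manufactures flourish",
    "our agriculture, commerce, and navigation",
]

fillers = slot_fillers(toy_texts, ["agriculture", "commerce", "and"])
print(fillers.most_common())
```

A filler that dominates the slot, as "manufactures" is reported to do, is what supports the redundancy argument; scattered one-off fillers would not.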
Let me give you another example from another context and a little bit different tool, and that’s the word “emolument.” So there’s some litigation going on right now regarding President Trump and whether he’s violated the Foreign Emoluments Clause, which prevents officers of the United States from receiving any kind of emolument from a foreign entity or government leader. And the great thing about corpus linguistics is you can really boil down the context.
So what we’re interested in is not just what is perhaps the most common or dominant sense of the term “emolument” or “emoluments” but how is it used most often in the context of emolument from government because that’s the context of the Foreign Emoluments Clause. It’s emoluments received from a foreign government. And there are two general senses of emolument in the Founding Era.
One is a very broad sense, which is really any kind of benefit that you get from anything, really. And there’s a narrow sense, which is kind of the perks or pay of office -- government office or employ. And when you look at the context of emoluments from government in COFAE, what you see is that 87 to 97 percent of the time, whether you’re looking at the legal documents, the Founders’ documents, or the more ordinary documents, it’s referring to that narrow sense. And so it really -- the corpus is great for really trying to drill down on context in a way that you just never can get with dictionaries.
Now, this doesn’t mean that a corpus or corpus linguistics is a panacea for constitutional interpretation. Sometimes there just isn’t enough data to really get any leverage or to feel confident in what you’re finding. Sometimes the results are inconclusive. If you find roughly a 50/50 split of two competing senses, it’s hard to say that one is the more ordinary one. But I think that’s useful information in and of itself, just like a public opinion poll that shows a 50/50 election is useful information, even if you can’t figure out a winner from that.
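The polling analogy can be made concrete by putting an interval around an observed sense share. The speaker does not name a statistical method; the Wilson score interval used below is this sketch's own choice, and the counts are invented. It treats the coded lines as a simple random sample, which a real corpus study would have to justify.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Invented counts: 26 of 50 lines coded as sense A, versus 90 of 100.
close_call = wilson_interval(26, 50)
clear_case = wilson_interval(90, 100)
print(close_call, clear_case)
```

When the interval straddles 0.5, as it does for the first pair of counts, the data do not single out either sense as the more ordinary one; when the whole interval sits well above 0.5, they do.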
Sometimes corpus linguistics just might not be the right tool or might not be the most helpful tool. For example, some research that I’m planning on doing on stare decisis and whether that played a role or was thought to be seen as part of the judicial power -- Article III judicial power at the Founding, it’s not clear to me that corpus linguistic analysis will be super helpful there. It might just be nothing more than a database to try to find some discussions of it. And of course the more you have closely related senses the trickier it is to parse things out and the harder that is.
But that being said, we’re still honestly working things out with it. But it’s a useful tool. It’s an advance over the tools that have been often used in originalism. And it is, at the end of the day, just a tool and data. And so it’s not ideological in that sense. You just never know what answers you’re going to get.
I’ve done some research on the Second Amendment. And what I found with my co-author Josh Blackman is that both Justice Scalia and Justice Stevens made mistakes. So that’s the great thing about tools and data is they don’t have an agenda. So anyway, we’re at the very early stages of this, but I think there’s a lot more to do. But I’m excited to see where it goes. And now I’ll turn it back over to Stephen.
Stephen Mouritsen: Thanks, James, and I should also say thank you to The Federalist Society for hosting this. It’s always a pleasure to talk about research that I’m working on and to hear from James, who, as he said, created the first Founding Era corpus and deserves all the credit in the world for it. It’s hard to overstate what a task that is. English spelling was not regularized at the time of the Founding. English orthography is not easily subjected to optical character recognition. So creating a database of Founding Era text is a huge challenge. And it was a remarkable achievement.
So one last note, I won’t spend a lot of time because I’d like to open it up to questions. I’m going to talk briefly about some problems in contract interpretation. But I think that one of the exciting points that James really highlighted is that, while corpus linguistics itself has been around for some time, the application of corpus linguistics to questions of legal interpretation has not. And we are at a very preliminary stage in taking an entirely new method for collecting evidence of language usage and applying that -- bringing that evidence to bear on questions of legal interpretation. So we’re at a very early stage.
The only thing that I feel like the -- in looking over what I have written about law and corpus linguistics, the thing that I feel most strongly about is that, when interpreting a legal text, the text -- the written word in that text always matters. And one of the ways that you can try to interpret that text is by gathering up evidence of the way the language is used by both the drafters of the text and the people who are charged with interpreting the text -- their linguistic conventions. And then the corpus is one of the ways to gather that evidence.
With that said, whether or not James or I or anyone else who has been writing in law and corpus linguistics is necessarily doing it right I think is totally open to debate. It is a very new field. And it’s an exciting thing to see the explosion of interest and papers and responses and even criticism to the method.
I was talking briefly about contract interpretation. There’s a big debate in contract interpretation between formalists who are represented by Williston and primarily, from a jurisdictional standpoint, by the jurisdiction of New York arguing that you -- where there is no ambiguity in the text you should apply the text as it is written in the four corners of the document. And you have on the other side of that debate Corbin and Llewellyn and the legal realists and, from a jurisdictional standpoint, California and about ten other state jurisdictions arguing for contextualism, the idea that you should allow some evidence of the drafting history, of the understandings of the parties to interpret the contract.
Now, formalism, its advocates will often argue that it will increase efficiency, predictability. Its detractors will note that it may not give you a full picture of the parties’ intent and that -- one of the criticisms that I would level against formalism is it relies very heavily on dictionaries and intuition, which we’ve already discussed. Contextualism, on the other hand, argues that it better represents the parties’ intentions, but it has a problem of potentially inviting strategic behavior and increasing the cost.
So I’ve argued in my most recent paper on corpus linguistics in contract interpretation that corpus linguistics could offer a middle way. It would offer better evidence of the linguistic conventions of contracting parties than -- clearly better evidence than some of the traditional tools of formalism, like dictionaries or the judge’s linguistic intuition. It could avoid some of the costs and some of the risk associated with strategic behavior in contextualism. And it may also help us, as we discussed earlier, give content to vague legal terms like “plain meaning” and “ambiguity.”
One of the things that I’m working on right now is a notion in the Uniform Commercial Code of “usage of trade.” One of the things that you can do with linguistic corpora is use comparative corpora to compare the speech or writing conventions of different linguistic communities. And in fact, the UCC tried to account for that by allowing you to take into account “usage of trade” or “custom” in the interpretation of the legal text.
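One simple way to compare two speech communities is to normalize a term's frequency by corpus size and set the two rates side by side. The sketch below assumes each corpus is just a list of strings; `per_million`, the term "broiler," and the sample sentences are hypothetical and are not taken from the speaker's research.

```python
import re

def per_million(texts, term):
    """Occurrences of `term` per million word tokens across `texts`."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = sum(len(pattern.findall(t)) for t in texts)
    tokens = sum(len(re.findall(r"\w+", t)) for t in texts)
    return 1_000_000 * hits / tokens if tokens else 0.0

# Invented mini-corpora: one for a trade, one for general usage.
trade_texts = ["ship the broiler lots today", "broiler prices rose"]
general_texts = ["we roasted a chicken for dinner", "the broiler oven is hot"]

print(per_million(trade_texts, "broiler"))
print(per_million(general_texts, "broiler"))
```

A rate comparison only shows that a term is more common in one community; deciding whether it carries a different sense there still requires reading and coding the individual hits.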
And so one of the things you could imagine doing is creating a specific bespoke corpus that is representative of the linguistic usage of a particular industry or a particular geographic region. For folks doing administrative law, we might even imagine creating a corpus that is representative of the speech or usage of a particular regulatory regime or regulatory agency. So that’s where my work is going forward right now. Rather than -- seeing where we are with the time, rather than diving into some of the examples, I think it’s probably a good idea to open things up for questions.
The one thing I want to add that we were discussing before the call began is that, right now, for law and corpus linguistics there is not a great primer. And this is one of -- this is my fault. It is the fault of some of us early advocates. We haven’t got a great how-to for doing corpus linguistics.
Probably the best thing to do is to pick up a -- if you’re curious about getting started, pick up a recent paper by myself or James or some of the other early advocates. One of the things that we try to do very carefully in the footnotes wherever we do corpus analysis is to explain the searches that we perform. But I will also add that Justice Thomas Lee and myself, we’ve collaborated on a couple of articles. And we’re right now writing a book on law and corpus linguistics. And we’re hopeful that each chapter in that book is going to have exercises that you could follow that would lead you to being able to perform basic corpus linguistics analysis.
As for right now, looking at the existing publications, using the footnotes and the instructions in those footnotes for corpus analysis may be a good start. And then I just also want to say to anyone on this call, if you’re using corpus linguistics for advocacy and you have any questions about it or using it in your research, by all means shoot me an email. You can find me on Parr Brown’s website. I’m always happy to talk about it. And with that, we’ll open it up for questions.
Nick Marr: Okay. Great. We’ll go to audience questions now.
Caller 1: Yes, thanks very much. Fascinating stuff. I’m definitely going to try to get more information on this from you gentlemen. One quick question, do you believe this resolves questions of law or of fact that go to judges or juries?
Stephen C. Mouritsen: I’ll go first, and then, James, if you want to add anything. I have a written research agenda that has a paper that I’ve been working on called “The Evidentiary Status of Language.” That is, I think, a difficult question because, historically, if you look at the case of Nix v. Hedden, the famous tomato is a vegetable case, the Court not only concludes that a tomato is a vegetable, but it says something really interesting.
It says that the Court will take judicial notice of the ordinary meaning of words—and I’m paraphrasing but not by much what the Court says about it—and, of course, will cite dictionaries to remind the court of those meanings. Now, of course, the standard for judicial notice is not just a fact that is not in dispute but that cannot be disputed, which is a funny thing to say about a dispute that centers entirely on the meaning of a word. So historically, I think that courts have thought of interpretation as questions -- interpretive questions as questions of law.
Now, Justice Marshall saying that it is emphatically the -- well, now I’m going to not quote correctly Justice Marshall. But it’s the province of the Court to say what the law is. And so historically courts have been just fine thinking of the meaning of terms in a legal text as a question of law.
Now, some originalists as well and some legal scholars and certainly many linguists think of the meaning of a word as derived from its usage and as a factual issue. And so the question is whether or not the interpretation of language has some sort of special evidentiary status. One might argue that, well, it’s a legislative fact, the kind of fact like legislative history where you are doing factual investigation, but it is the kind of factual investigation that we’re comfortable with judges doing.
When a judge looks up a word in a dictionary, the judge is really investigating a fact about the universe. How is this word defined? Which really means how did the lexicographer, who is an expert on word usage -- how did the lexicographer look at the evidence and decide how it should be defined?
I think that very often the question will depend on the complexity of the information presented to the judge. I can imagine uses of linguistic corpora that really belong in the domain of an expert. Here I’m thinking of forensic linguistics, a very common -- a place where there’s no question that if you’re doing feature comparison, using linguistics in the same way that some experts use hair follicle analysis to identify the person who wrote the ransom note, for example. And corpora are used in those contexts. That is something that will likely involve an expert.
On the other end, there may be simply dropping a footnote to say, “If you take a look at this search, this word is used most commonly this way.” I think that that is one of the open questions, but I think it’s complicated by the fact that we are bringing tools that haven’t existed before to turn what was traditionally a legal question into something that more resembles a fact question. So how courts will sort that out is going to be at least interesting to watch. I don’t know that I have the answer, though.
James C. Phillips: I don’t have anything to add.
Nick Marr: Okay. We’ll go to our next question, then.
Eugene Kirman: Hey, this is Eugene Kirman here. Thanks for the great work you’re doing. I just had a question. In a context like Muscarello where -- which it doesn’t occur maybe too often that criminals’ guns get discovered in their cars and they get additional sentence for that or not, is corpus linguistics going to provide many answers, considering, for example, that maybe states’ carry laws are not being enforced so much or discussed so much in the context of firearm possession in a vehicle as opposed to on a person?
Stephen C. Mouritsen: So there was actually some research on the effect of the Muscarello case. And as a result of the Court’s decision, there was an estimate that some 200 people a year would get that heightened penalty -- an additional 200 people because they were carrying a firearm in a vehicle. Now, it’s certainly the case that my research had to do only with the federal law.
And so the state firearm laws, I don’t have -- they could have a very different text. But I do think that, yes, even in a highly idiosyncratic case where you’ve got a statute that is unusual, that doesn’t appear very often -- and I think James’ work on emoluments is a good example. There may be two or three people before the election of the President who had given a lot of thought to what emoluments meant and not a lot of cases interpreting what that means. But then there comes a point where the case is in front of the court and you want to be able to present evidence of the way that word is used by a speech community.
So I think it can be useful. And I, in my own practice, I have on occasion included a short paragraph or a short footnote, nothing terribly fancy, that simply argues that if you look at the evidence of the way this word is used you can see that the other side’s interpretation is incorrect. That being said, I have the advantage of practicing in a jurisdiction where the Supreme Court has already recognized the validity of using corpus linguistics. And right now, I think there are four such state jurisdictions and two federal circuits.
Eugene Kirman: Thank you.
Nick Marr: We’ll go to our next question here.
Caller 3: Hi, gentlemen, and thank you for such a thought-provoking and intriguing topic. I had kind of a two-part question. The first is Congress’ reaction to this new mode or blended mode of interpretation that you put forth, particularly on the kind of criticism to dictionaries -- if judges were improperly reaching for dictionaries or solely for dictionaries. But Congress hasn’t necessarily criticized that approach and said, “That’s not what you should be doing, judges, to interpret the words of what we’re saying.”
Should -- what state do they need to kind of speak up to say that? And then my second part of that question is just the concept of congressional insiders and outsiders, where legislation is often written by small groupings of people that use particular language that ensures -- at least attempts to ensure consistency through the entire corpus juris.
Stephen C. Mouritsen: So on the first question, I should say that personally I use dictionaries. I have a stack of them right here on my desk. I actually -- my specialty was in corpus lexicography, so using corpora to write dictionaries. And for my graduate thesis, I wrote a dictionary called “The Frequency Dictionary of Newsprint Arabic” that defined the most common words in Arabic newsprint. So I love dictionaries. And I have no problem with judges citing dictionaries.
The thing that you can’t -- you can cite dictionaries for a variety of reasons. One is attestation, simply making the case that a word has been used in a certain way. Sometimes parties will dispute that. They’re also just useful to give you sort of a metalanguage, a way to talk about the words that you have to define. There’s no problem in my view with citing a dictionary.
The thing you can’t cite it for is ordinary meaning. Judges and Congress and, I think, a lot of lay people just haven’t really noticed the incoherence of the citation to a dictionary. If you think about -- imagine a statute that defined the verb “to set.” Well, in the Oxford English Dictionary, the second edition, that is the most defined word in the dictionary. It covers 22 pages and has some 430 different senses.
And so if the judge says, “The ordinary meaning of ‘set’ is X. See the Oxford English Dictionary,” the judge’s reasoning there is incoherent. The dictionary doesn’t tell you an ordinary meaning. In fact, it gives you 400 different senses. And it might seem like, well, judges don’t do that. And the reality is that they do.
In my contracts paper, I highlight a number of cases where a judge will say, “The parties are arguing that this word means A or B, but the dictionary says it means A.” And the judge fails to point out it also means B, and the same dictionaries contain both definitions. This is a very common phenomenon. I honestly believe that judges and most lay people just haven’t noticed. But I don’t have any trouble with judges citing the dictionary.
Now, on the second issue with respect to specialized constituencies writing and interpreting statutes, I completely agree with that. In fact, I think that that is one of the things that corpus linguistics can do to assist in the interpretation of legal text is helping us to understand the differences between, say, a standard variety of English usage such as, basically, your newspaper English, the sort of educated reader and writer of English, versus the specialized variety of English of the lawyers who write statutes. And it may be that we have some -- in fact, this is actually well-documented that statutes are written in a very particular type of language.
And it would be helpful for us to have a better understanding of those differences when interpreting legal text. So I think that that’s a really interesting and open area for research and that corpus linguistics could be a part of that research.
Caller 3: Thank you.
Nick Marr: So since we’re up on the hour now, I’ll give our speakers a chance for some closing remarks. Apologies to anyone who’s still in the queue. If you call back, we’ll have lots of Teleforum calls, lots of opportunities to ask questions. Thanks, all, for your time. So any closing remarks, Stephen, James, or Don?
Stephen C. Mouritsen: I’ll finish where I finished last time and simply say that this is an area that is wide open. It’s brand new. There’ve been -- you’ve seen it expand well beyond the work that I’ve done, that Justice Lee has done, or that James has done. A number of scholars writing about it, some favorably and some critically, and a number of advocates using corpus linguistics to try to persuade appellate courts at both the state and federal level. And so I am not proposing that every brief and every legal argument needs linguistic corpora. But it can be -- once you have a sense of what it’s about, it can be a useful tool for some arguments. And that’s all.
James C. Phillips: Yeah. I agree with that, and I just want to give a shout out and a thanks to David Armond and his team at BYU who have put so much work into COFAE and helped make it a great tool. And, of course, there’s lots more work to do.
Nick Marr: Great. Thanks. And on behalf of The Federalist Society I want to thank you all for being here today. Be sure to check our website for upcoming Teleforum calls. We have one coming up tomorrow. So thanks, all, and we’re adjourned.
Dean Reuter: Thank you for listening to this episode of Teleforum, a podcast of The Federalist Society’s practice groups. For more information about The Federalist Society, the practice groups, and to become a Federalist Society member, please visit our website at fedsoc.org.