Text Mining the Đại Việt Sử Ký Toàn Thư

There is an emerging “field” in places like North America and Europe that people are calling the “Digital Humanities.” People who work in this field are employing digital tools to enhance the traditional work of humanities scholars, and they are also thinking critically about how digital media might be transforming the way that we produce and process knowledge.

One tool that some scholars employ is software for text mining. Through text mining, scholars can search through large quantities of text to try to detect certain patterns that they can then examine more closely through the traditional techniques of humanities scholars – namely, the close reading of texts.

I decided to try a simple experiment with Voyant Tools, a free online site for basic text mining (http://voyant-tools.org/).

I input the text of the first chapter of the Đại Việt sử ký toàn thư, first in the original classical Chinese and then in Vietnamese translation.

The site offers various ways to analyze the text. It creates a “word cloud.” It shows how frequently each word is used. You can produce graphs of a word’s usage over the course of the text simply by clicking on that word. And finally, you can see each word in its contexts.

In order to text mine the classical Chinese text, I had to put a space between every character so that the software could recognize each one as a separate word.
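For anyone who would rather not insert those spaces by hand, here is a minimal Python sketch of that preprocessing step. It is only an illustration: the function name is my own, and it assumes that every non-whitespace character (punctuation included) should become its own token.

```python
def space_out_characters(text: str) -> str:
    """Return the text with a single space between every character.

    Existing whitespace is dropped first so the spacing comes out uniform.
    Assumption: punctuation marks are treated like any other character.
    """
    return " ".join(ch for ch in text if not ch.isspace())


# A phrase quoted later in this post, used here only as sample input.
print(space_out_characters("我越之基"))  # 我 越 之 基
```

The spaced-out result can then be pasted into a tool that splits words on whitespace.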

So did the results reveal anything interesting? Kind of. I can see, for instance, how this software could help a person analyze Vietnamese-language translations of classical Chinese texts.

Take a look at the frequencies.

In the classical Chinese text, the term “quốc” (國) appears 27 times. In the Vietnamese translation, the term “nước,” meaning “country,” appears 46 times. Perhaps in some cases nước is used to refer to “water” rather than a “country.” But perhaps this is a sign that the translator injected a term in places where it did not belong.
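The same kind of frequency comparison can be sketched in a few lines of Python. This is not how Voyant Tools works internally, only a rough equivalent. The helper names are hypothetical, and the sample strings are just three phrases quoted in this post, not the full chapter, so the counts below are not the 27 and 46 reported above.

```python
import unicodedata
from collections import Counter

# Punctuation stripped from the edges of each token (an assumed, minimal set).
PUNCTUATION = ".,;:!?\"'()“”‘’"


def token_counts(text: str) -> Counter:
    """Count whitespace-separated tokens, ignoring case and edge punctuation."""
    # NFC normalization keeps differently encoded Vietnamese diacritics comparable.
    normalized = unicodedata.normalize("NFC", text).lower()
    tokens = (tok.strip(PUNCTUATION) for tok in normalized.split())
    return Counter(tok for tok in tokens if tok)


def count_term(text: str, term: str) -> int:
    """Return how many times a single-token term occurs in the text."""
    key = unicodedata.normalize("NFC", term).lower()
    return token_counts(text)[key]


# Illustrative input only: three phrases quoted in this post.
chinese = "萬 國 赤 鬼 國 我 越 之 基"
vietnamese = "muôn nước. nước Xích Quỷ. cơ nghiệp của nước Việt ta."

print(count_term(chinese, "國"))        # 2
print(count_term(vietnamese, "nước"))   # 3
```

A count like this cannot tell nước “country” apart from nước “water,” which is exactly why the numbers are only a starting point.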

Whatever the case may be, this numerical discrepancy leads one to want to investigate the issue further by doing what humanities scholars have always done, that is, to read the text closely.
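To move from counts to close reading, it helps to pull up each occurrence of a word together with its neighbors, which is what the “words in their contexts” view does. Below is a minimal keyword-in-context sketch in Python. The function name, the window size, and the bracket formatting are my own assumptions, and the sample sentence is one phrase from the translation rather than the full text.

```python
from typing import List


def keyword_in_context(text: str, term: str, width: int = 3) -> List[str]:
    """List every occurrence of `term` with up to `width` tokens on each side."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        # Compare without edge punctuation or case so "nước." still matches.
        if tok.strip(".,;:!?").lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Sample input: one phrase from the Vietnamese translation.
for line in keyword_in_context("cơ nghiệp của nước Việt ta", "nước", width=2):
    print(line)  # nghiệp của [nước] Việt ta
```

Reading down a list like this makes it easier to spot the places where nước has no counterpart in the original.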

In the first instance, vạn quốc (萬國, “the ten thousand kingdoms”) is translated as muôn nước. Then Xích Quỷ quốc (赤鬼國, “the Scarlet Ghost Kingdom”) is translated as nước Xích Quỷ.

The above two translations are very straightforward and unproblematic. The third time that the term “nước” appears in the translation, however, is not as clear-cut. There we find that “Ngã Việt chi cơ” (我越之基, “the enterprise of We/Our Việt”) is translated as cơ nghiệp của nước Việt ta.

First of all, in the original there is no term here for kingdom/country (國) as there is in the first two cases. So the translator added the term “nước.” Was that addition warranted?

Does Ngã Việt refer to a “country”? How do we know? Was there ever a kingdom called “Ngã Việt”? Did the term ever appear in expressions like “Ngã Việt chi sơn hà” (我越之山河) to refer to the “mountains and rivers” or “territory” of Ngã Việt?

From my reading of original texts, I don’t get the sense that Ngã Việt was used to refer to a “country” in a territorial sense. Instead, I get the sense that it was a concept that was restricted to the elite, and which referred mainly to the existence of a political tradition as the extended phrase “Ngã Việt chi cơ” (我越之基, “the enterprise of We/Our Việt”) indicates here.

So if I were translating this, I would not add the word “nước” here. It’s not in the original, and adding it distorts the way that the premodern elite viewed the world.

Adding that term, however, does make the past fit the way we view the world in the present. But in the process, the uniqueness of the past is erased.

In any case, it just struck me that this might be one way that text mining could be put to productive use, namely by helping to identify issues to examine, and then enabling a scholar to focus in on the issue and engage in a close textual reading.


This Post Has 6 Comments

  1. tranthanh

    I don’t think this is a new field in Chinese literature. But you are probably right that no one has ever tried it with the text of the Dai Viet Su Ky Toan Thu.

    1. tranthanh

      I heard that there were a lot of Nom texts. I wonder what people in the past would write about “guo” in Nom texts…

      1. leminhkhai

        Nom texts are not all the same. I’ve seen Nom texts in which almost everything is in Hán except for the verb là. Or you can find texts that have much more Vietnamese than Hán. So it would depend to some extent on who wrote the text, in what context, for what purpose, etc. But I’ve definitely seen nước used the most.

    2. leminhkhai

      I’m not sure about the field of Chinese literature, but in general humanities scholars in North America have only recently started to experiment with text mining. What I demonstrated here is much simpler than what people are doing. Most people who text mine search through thousands (or millions) of documents to try to look for “trends” in word usage, and then they look more closely to see what might explain this.

      The article below is about a text mining project that has focused on the records of the London criminal court, the Old Bailey.

      (http://www.nytimes.com/2011/08/18/books/old-bailey-trials-are-tabulated-for-scholars-online.html?_r=2)
      August 17, 2011

      As the Gavels Fell: 240 Years at Old Bailey

      By PATRICIA COHEN

      For 240 years the grand parade of human greed, love, cruelty, longing, and foolishness was captured in the Proceedings, the published record of trials that took place at the Old Bailey, the central criminal court, in London.

      Now, powerful digital tools developed by an international team of researchers to search these trial reports and summaries have begun to offer new insights into the evolution of the justice system, the institution of marriage and changing morals.

      The Old Bailey offers a unique window into the criminal justice system and, by extension, British culture. The free searchable online archive, oldbaileyonline.org, contains accounts of nearly 198,000 trials between 1674 and 1913. “It’s the largest body of accurately transcribed historical texts online,” said Tim Hitchcock, a historian at the University of Hertfordshire in England and part of the team. “All of human life is here.”

      Mr. Hitchcock argues that new methods of digitally analyzing and mapping the history of crime using the entire Proceedings will revise “the history of the criminal trial.” After scouring the 127 million words in the database for patterns in a project called Data Mining With Criminal Intent, he and William J. Turkel, a historian at the University of Western Ontario, came up with a novel discovery. Beginning in 1825 they noticed an unusual jump in the number of guilty pleas and the number of very short trials. Before then most of the accused proclaimed their innocence and received full trials. By 1850, however, one-third of all cases involved guilty pleas. Trials, with their uncertain outcomes, were gradually crowded out by a system in which defendants pleaded guilty outside of the courtroom, they said.

      Conventional histories cite the mid-1700s as the turning point in the development of the modern adversarial system of justice in England and Colonial America, with defense lawyers and prosecutors facing off in court, Mr. Hitchcock and Mr. Turkel said. Their analysis tells a different story, however.

      “Mapping all trials suggests that the real moment of evolution was in the first half of the 19th century,” with the advent of plea bargains that resulted in many more convictions, Mr. Hitchcock said. “The defendant’s experience of the criminal justice system changed radically. You were much more likely to be found guilty.” Last month the scholars submitted an article to the British journal Past and Present on their findings.

      Profound shifts were behind the turn toward negotiated agreements. The class of professional lawyers, police officers and judges was growing quickly at the same time that prison began to be used as an alternative to exile or capital punishment, historians have noted. (The first modern prison in Britain can be dated to 1792.) As Mr. Hitchcock said, “It’s hard to have plea bargaining when all they are going to do is hang you.”

      Scholars have long considered the Old Bailey an invaluable resource. The court’s practices not only deeply influenced the authors of the American constitution and the young nation’s developing legal code, but the files are also the only large-scale printed source of how ordinary Englishmen actually spoke in the 17th, 18th and 19th centuries.

      Centuries of Londoners avidly read the Proceedings for entertainment, moral instruction and news, learning in 1763 that William David was executed for robbing a man of his watch and hat, or in 1910 that Stanley Dennis was guilty of murdering his wife, Violet, but was found insane. The spread of newspapers and a hike in publishing costs abruptly ended the Proceedings in 1913.

      John H. Langbein, a professor of law and legal history at Yale University and an author of “History of the Common Law: The Development of Anglo-American Legal Institutions,” called the Proceedings “a wonderful narrative resource.” Mr. Langbein was the first scholar to use the archive when it was initially digitized more than a decade ago, thanks partly to Mr. Hitchcock’s efforts.

      Sophisticated tabulation methods have expanded what researchers can do. Previously a keyword search of the Old Bailey documents, for example, could produce thousands of records that someone would then have to read through laboriously and interpret. The latest tools, developed by scholars from Britain, Canada and the United States, can search, organize and analyze large quantities of information in myriad ways. Months of work can be reduced to a few days or hours. When asked to comment on a brief summary of the new quantitative results that were presented at a National Endowment for the Humanities conference in June, Mr. Langbein was deeply skeptical of promised revelations.

      The Proceedings changed character over time, he said, evolving from “true-crime-type pop literature” to quasi-official reports that were “drastically edited and compressed,” and focused on crimes involving “sex, blood and gore.” Drawing conclusions from something like trial lengths is, therefore, very misleading, he said.

      “The Old Bailey Online project has done a great service in making those sources widely (and costlessly) available,” Mr. Langbein wrote in an e-mail. But he complained that the claims about data mining have “a breathless quality: ‘you can expect big things from us,’ but as yet it’s all method and no results.” He said that the new findings belittle the work of a generation of scholars who focused on the 18th century as the turning point in the evolution of the criminal justice system.

      Mr. Turkel, who developed some of the digital tools, said that data mining reveals unexpected trends and connections that no one would have thought to look for before. Previous scholars “tended to cherry-pick anecdotes without having a sense that it was possible to measure all of that text and treat the whole archive as a single unit,” he said.

      Dan Cohen, a historian of science at George Mason University and the lead United States researcher on the Criminal Intent project, found other revelations in the data. He noticed that in the 1700s there were nearly equal numbers of male and female defendants, but that a century later men outnumbered women nearly 10 to 1. The shift may reflect a change in the type of cases adjudicated, Mr. Cohen said. Adultery and theft of food or animals were crowded out by highway robbery, pickpocketing and other crimes common to an increasingly industrial and urbanized center.

      One exception was bigamy. After 1870 there was a small but significant rise in cases brought against women at the same time that the penalties meted out became much less severe. “In the 1700s bigamy cases were inflamed affairs, with long, drawn-out, rather brutal trials of women, with character witnesses,” Mr. Cohen said. By the late 1800s they were handled with brief summary judgments. He speculates that the “slap on the wrist” may indicate that there was “no longer a need for a moral trial in addition to a criminal proceeding.”

      As Britain expanded its colonial empire around the globe, more husbands were separated from their wives for years at a time, making it more understandable that women might seek new spouses without being able to get proper divorces. The trials may reflect a measure of control that Victorian women began to exercise over their own lives, Mr. Cohen said.

      The results complemented data mining that Mr. Cohen and his colleague Fred Gibbs, a historian of medicine at George Mason, conducted on billions of words published in 19th century books and digitized by Google. An electronic search found a large uptick in references to “loveless” marriages after 1870. “This criminal record of marriage,” he said, “helps us to flesh out the darker side.”

  2. tranthanh

    “‘Ngã Việt chi cơ’ (我越之基, ‘the enterprise of We/Our Việt’) is translated as cơ nghiệp của nước Việt ta.” I agree, there is no need for “nước” here. But I guess the person who would write 我越之基 might also agree with Nguyen Trai(?)’s expression “Nga Dai Viet chi quoc”, or might not?!?

    1. leminhkhai

      In that instance Nga is modifying Dai Viet.

      So does that mean that Nga is modifying Viet in Nga Viet and that what was really being referred to was just Viet? By just looking at one example you could argue that, but I’ve seen it used a lot, and that is not the sense I get. Nga Viet is a fuzzy term. It doesn’t clearly correspond to any term or concept we have today. When it was used, it seems to have mainly called to mind the royal enterprise/political tradition. It’s a concept that made sense to the ruling elite at a time before concepts like ethnicity, nationality, citizenship, etc. were part of the political discourse.
