Announcing Stemmaweb

[Cross-posted from the Tree of Texts project blog]

The Tree of Texts project formally comes to an end in a few days; it’s been a fun two years and it is now time to look at the fruits of our research. We (that is, Tara) gave a talk at the DH 2012 conference in July about the project and its findings; we also participated in a paper led by our colleagues in the Leuven CS department about computational analysis of stemma graph models, which was presented at the CoCoMILE workshop during the European Conference on Artificial Intelligence. We are now engaged in writing the final project paper; following up on the success of our DH talk, we will submit it for inclusion in the DH-related issue of LLC. Alongside all this, work on the publication of proceedings from our April workshop continues apace; nearly all the papers are in and the collection will soon be sent to the publisher.

More excitingly, from the perspective of text scholars and critical editors with an interest in stemmatic analysis, we have made our analysis and visualization tools available on the Web! We are pleased to present Stemmaweb, which was developed in cooperation with members of the Interedition project and which provides an online interface for examining text collations and their stemmata. Stemmaweb has two homes: the official KU Leuven site, and Tara’s personal server (less official but much faster).

If you have a Google account or another OpenID account, you can use that to log in; once there you can view the texts that others have made public, and even upload your own. For any of your texts you can create a stemma hypothesis and analyze it with the tools we have used for the project; we will soon provide a means of generating a stemma hypothesis from a phylogenetic tree, and we hope to link our tools to those emerging soon from the STAM group at the Helsinki Institute for Information Technology.

Like almost all tools for the digital humanities, these are highly experimental. Unexpected things might happen, something might go wrong, or you might have a purpose for a tool that we never imagined.  So send us feedback! We would love to hear from you.

Hamburg here I come

As I write this I am on my way to Hamburg for DH2012. I’m very much looking forward to the conference this year, not only because of the wide variety of interesting papers and the chance to explore a city I’ve heard a lot of nice things about, but also because this year I feel like I have some substantial research of my own to contribute.

My speaking slot is on Friday morning (naturally opposite a lot of other interesting and influential speakers, but that seems to be the perpetual curse of DH.)  In preparation for that, I thought I might set down the background for the project I have been working on for the last two years, and discuss a little of what I will be presenting on Friday. After all, if I can set it down in a blog post then I can present it, right?

The project is titled The Tree of Texts, and its aim is to provide a basis for empirical modelling of text transmission. It grows out of the problem of text stemmatology, and specifically the stemmatology of medieval texts that were transmitted through manual copies by scribes who were almost never the author of the original text (if, indeed, a single original text ever existed.)

It is well known that texts vary as they are copied, whether through mistakes, changes in dialect, or intentional adaptation of the text to its context; almost as long as texts have been copied, therefore, scholars have tried in one way or another to get past these variations to what they believe to be the original text.  Even in cases where there was never a written original text, or where the interest of the scholar is more in the adaptation than in the starting point, there is a lot to be gained if we can understand how the text changed over time.

Stemmatology, the formal reconstruction of the genesis of a text, developed as a discipline over the course of the nineteenth century; the most common (“Lachmannian”) method is based on the principle that if two or more manuscripts share a copying error, they are likely to have been copied either one from the other or both from the same (lost) exemplar. There has been a lot of effort, scholarship, and argument on the subject of how one distinguishes ‘error’ from an original (or archetypal) reading, and how one distinguishes genealogical error (e.g. the misreading of a few words in a nigh-irreversible way so that the meaning of the entire sentence is changed) from coincidental error (e.g. variation in word spelling or dialect, which probably says more about the scribe than about the manuscript being copied).  The classical Lachmannian method requires the practitioner to decide in advance which variants are likely to have been in the original; more recent and computationally-based neo-Lachmannian methods allow the scholar to withhold that particular pre-judgment, but still require a distinction to be made concerning which shared variants are likely or unlikely to have been coincidental or reversible.

A method that requires the scholar to know the answer in advance was always likely to encounter opposition, and Lachmannian stemmatology has spawned entire sub-disciplines in protest at the sheer arrogance (so an anti-Lachmannian might describe it) of claiming to know in advance what is important and what is trivial. Nevertheless the problem remains: how to trace the history of a text, particularly if we begin with the assumption that we know no more, and perhaps considerably less, than the scribes who made the copies?  The first credible answer was borrowed from the field of evolutionary biology, where they have a similar problem in trying to understand the order in which features of species might have evolved and the specific relationships to each other of members of a group.  This is the discipline of phylogenetics, and there are several statistical methods to reconstruct likely family trees based upon nothing more than the DNA sequences of species living today.  Treat a manuscript as an organism, imagine that its text is its DNA sequence, et voilà – you can create an instant family tree.
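The analogy can be made concrete in a few lines of code. The sketch below is a toy illustration only: the witness readings are invented, the distance measure is a crude count of disagreements, and the clustering is simple single-linkage agglomeration; real phylogenetic methods (parsimony, likelihood, Bayesian inference) are far more sophisticated. But it shows the basic move of treating each manuscript’s text as a sequence and grouping the witnesses that agree most closely.

```python
# Toy sketch of the phylogenetic analogy: each witness's text is a
# "sequence" of aligned readings (invented for illustration).
witnesses = {
    "A": ["in", "the", "beginning", "was", "the", "word"],
    "B": ["in", "the", "begynnyng", "was", "the", "word"],
    "C": ["in", "the", "begynnyng", "was", "that", "word"],
    "D": ["at", "the", "beginning", "was", "the", "word"],
}

def distance(w1, w2):
    """Count the aligned positions at which two witnesses disagree."""
    return sum(a != b for a, b in zip(witnesses[w1], witnesses[w2]))

# Agglomerative clustering: repeatedly merge the two closest groups.
clusters = [frozenset([w]) for w in witnesses]
merges = []
while len(clusters) > 1:
    pairs = [(c1, c2) for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]]
    c1, c2 = min(pairs, key=lambda p: min(distance(x, y)
                                          for x in p[0] for y in p[1]))
    clusters.remove(c1)
    clusters.remove(c2)
    clusters.append(c1 | c2)
    merges.append(sorted(c1 | c2))

print(merges)  # [['A', 'B'], ['A', 'B', 'C'], ['A', 'B', 'C', 'D']]
```

The merge order is the “instant family tree”: A and B (one disagreement) group first, C joins them via its shared reading with B, and D attaches last. Note that this toy treats every disagreement as equally weighty, which is precisely the assumption criticized below.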

And yet phylogenetics, if you ask the Lachmannians and other text scholars besides, has its own problems.  First, the phylogenetic model assumes that any species living today is by definition not an ancestor species, and therefore must appear only at the edge of the family tree; in contrast we certainly still possess manuscripts that served as the ‘parent’ of other extant manuscripts.  Second, in evolutionary terms it is reasonable to model the tree as a bifurcating one – that is, a species only ever divides into two, and then as time progresses either or both of these may divide further.  This also fails to match the manuscript model, where it is easy to see a single text spawning two, three, or ten direct copies. Third, where the evolutionary model is assumed to be continuously branching, it is well known that a manuscript can be copied with reference to two, three, or even four exemplars. This is next to impossible to represent in a tree (and indeed is not usually handled in a Lachmannian stemma either, serving more often as a reason why a stemma was not attempted.)  Fourth is the problem of significance of variants: while some scholars will insist that variants should simply not be pre-judged in terms of their significance, most will acknowledge the probable truth that some sorts of variation are more telling than other sorts.  Most phylogenetic programs do not by default take variant significance into account, and most users of phylogenetic trees don’t even try.

In a recent paper, some of the luminaries of text phylogeny argue that none of these problems are insurmountable. NeighborNet diagrams can give some clues regarding multiple text parentage; some more recent and specialized algorithms such as Semstem are able to build trees so that a text can be an ancestor of another text, and so that a text can have more (or fewer) than two descendants.  The authors also argue that the problem of significance can be handled trivially in the phylogenetic analysis by anyone who cares to assign weighting factors to the variant sets s/he provides to the program.

While it is undoubtedly true that automated algorithms can handle assignment of significance (that is, weighting), it also remains true that there are only two options for assigning these weightings:

  1. Treat all variants as equal
  2. Assign the weights arbitrarily, according to philological ‘common sense’, personal experience, or any other criterion that takes your fancy.
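The difference between the two options amounts to a single parameter in any distance calculation. Here is a minimal sketch; the variant categories and weight values are entirely invented (which is, of course, the whole problem), and the function is a hypothetical illustration rather than anything from an actual stemmatological package.

```python
# Variant sites between two hypothetical witnesses, tagged by type.
# Both the tags and the weights below are invented for illustration.
variants = [
    {"site": 12, "type": "spelling"},      # e.g. begynnyng vs beginning
    {"site": 40, "type": "word_order"},
    {"site": 77, "type": "substitution"},  # a different word entirely
]

def weighted_distance(variants, weights=None):
    """Sum the significance weights of all variant sites."""
    if weights is None:
        weights = {}  # Option 1: every variant counts equally.
    return sum(weights.get(v["type"], 1) for v in variants)

# Option 1: treat all variants as equal.
print(weighted_distance(variants))  # 3

# Option 2: weights assigned by philological 'common sense'.
hunches = {"spelling": 0.1, "word_order": 0.5, "substitution": 2}
print(weighted_distance(variants, hunches))  # 2.6
```

Either way, the numbers that drive the analysis rest on assumption rather than evidence, which is exactly the gap described next.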

This is exactly the ‘missing link’ in text stemmatology: what sorts of variants occurred in medieval copying, how common were they, how commonly were they copied, and how commonly were they changed?  If we can build a realistic picture of what, statistically speaking, variation actually looked like in medieval times, it will be an enormous step toward reconstructing the stemmata by whatever means the philologist chooses, be it neo-Lachmannian, phylogenetic, or a method yet to be invented.

What we have done in the Tree of Texts project is to create a model for representing text variation, and a model for representing stemmata, and methods for analyzing the text against the stemma in order to answer exactly the questions of what sort of variation occurred when and how.  I’ll be presenting all of these methods on Friday, as well as some preliminary results of the number crunching. If you are at DH I hope to see you there!

Of circumstance and Armenian chroniclers

I promised to start blogging an inventory of my publications back in April. Yes, it’s now July. It turns out that my breezy confidence concerning the ease of discovery of my rights to my own work was…misguided.

My first publication arose from my M.Phil. thesis. The thesis itself was an enormous logic and date-accounting puzzle, which I thought was all kinds of fun but which, when described to fellow students, tended to get the reaction “I’m so sorry, that sounds horribly boring!”  That says something about the geek disposition, I suppose.

The topic of my thesis, and the eventual paper, was the chronological weirdness of the first book of the Chronicle of Matthew of Edessa. There is a back story there, on how a vaguely Byzantium-fancying computer geek came to be writing about an Armenian historical chronicle concerned in large part with a topic (the Crusades) that, had I been asked in 2003, I would have found utterly uninteresting.  It’s also a tale of how the smallest sorts of circumstance can shape a career.

I began grad school on the heels of the Great Dot-com Bust.  My bachelor’s degree was a strange MIT hybrid (“Humanities and Engineering”) which really meant that I had been on course to do a computer science degree when I realized that I could have a lot more fun doing half my coursework in history, and at the end of it I would still probably get a programming job at some Internet startup.  So it came to pass, but I could never shake the urge to go back and give history a more proper study.  In the end the universe did me a perverse sort of favor when my company laid me off just as I was finally resolving to prepare those grad school applications.

This is how I found myself in a room at Exeter College one gorgeous afternoon working out, together with the other new master’s students, what I ought to be doing for the next two years. Among the decisions we needed to make was the language we would study for the examination requirement; the (rather fantastic-sounding) options were Greek, Latin, Armenian, Syriac, Church Slavonic, and Arabic. I had enough Greek and Latin to be getting on with, but my powers as a dead-language autodidact had already failed me once when confronted with Armenian. Why not get some actual tuition in it and see how I did?

Of such whims are career paths made.  Once I had expressed a guarded interest in Armenian language, well, it seemed evident to the assembled dons that I should apply it by studying some Armenian history.  That turned out to be a field so very under-studied that potential thesis topics were lurking under nearly every assigned primary text and journal article.  I resolved eventually to write a thesis on the subject of the Armenian economy of the tenth and eleventh centuries, seeing what we might piece together by looking critically at literary and epigraphic sources. I dutifully began to read, and by August I had a collection of notes on the three main historians of the era (dots indicate approximate note volume):

  • [..]  Aristakes of Lastivert
  • [….]  Stephen of Taron       
  • [……………………………………………………………….]  Matthew of Edessa    

Hm. Clearly my thesis had chosen a direction, even if I hadn’t.  It was not Matthew’s poetic writing, vivid narrative, or historical accuracy that had caught my attention – in the latter case, rather the opposite. How could such a vast history be so very full of such obvious mistakes? Was there any rhyme or reason to them? Could we trust *anything* that Matthew was trying to tell us? If so, what? It took a few months more for the thesis topic to resolve itself to these chronological mistakes, but I got there in the end. The whole process began to turn into an intriguing logic puzzle that I had a lot of fun trying to solve, and it seemed a little unbelievable that no one had beaten me to it.

It took me three years (and another job in industry) to condense the thesis to an article suitable for publication, but I finally submitted it in 2008 to the standard journal for Armenian scholarship, the Revue des études arméniennes. My reward was a charming hand-written letter from the editor acknowledging my contribution and saying that he would be happy to publish it, though he wondered what my view was on certain issues I hadn’t addressed. I got to pretend for a moment that I was about fifty years older than I am, initiated into the academic community in an era when scholarship was carried on through personal correspondence.

As I have not heard anything from Peeters (and cannot find any information online) concerning author rights, and as I don’t believe I actually signed anything handing over any rights in any event, I have chosen to go with the safest reasonable option for open access: the final version of the article content, before typesetting.

Andrews, Tara L., ‘The Chronology of the Chronicle: An Explanation of the Dating Errors within Book 1 of the Chronicle of Matthew of Edessa’, Revue des études arméniennes 32 (2010): 141-64.

Introduction, inclusion, and open access

For as long as I have been part of the wider Digital Humanities community, I have felt like an outsider. On the periphery. Not part of the “clique”, although for the most part I have found DH people to be pretty welcoming.  So I was a little struck by the reports from the Cologne Dialogue on the Digital Humanities that took place this week, as (according to Twitter) disputant after disputant also claimed to be on the periphery of DH.  What is it with this field, that so many of its apparent members claim that they are not part of the “in-crowd”?

I don’t have a full answer to that (for now), but it was a similar thought process that got me to start this blog.  I’m a Byzantinist, I’m a computer hacker, I combine the two as often as I can–there are really no grounds here for exclusion.  What I did realize is that, more than most geeky pursuits, perceived membership in the “DH club” has almost everything to do with how often and how visibly you speak up.

So, by way of a general introduction, I am going to take a leaf from the book (ebook? blog?) of Melissa Terras, and make a series of posts about the work I have done to date and the publications that my work has led to.  Along the way I will check the open-access policy for each of the publishers, make sure that anything that can be open access is, and post a link to it.  Unlike Melissa’s, mine will be a chronological history; given my odd hybrid career, it is best to avoid backtracking.  And really I think it is a great idea for us scholars to take advantage of whatever rights our publishers allow us to retain over our own work (which is more than I would have thought, for many journal publishers), and get that work out there and indexed in search engines.

This should be fun! Coming soon: how I ended up learning Armenian, and proof that I am indeed a hopeless nerd.

Coding and collaboration

So here we are in 2012, the Year of Code, and we should all be learning to code! Shouldn’t we? Especially if we belong to this community known as Digital Humanities, a field that is endlessly wrestling with its self-definition. Who’s in, who’s out? Is it really necessary to code? Don’t we have to know our stuff, computationally, if we are to understand what computers can do for us? Does coding culture exclude women, and is this imperative therefore sexist? Wouldn’t we be better off concentrating on being better humanists?

As a historian (since, arguably, 1999) and a coder (since 1996), I have to tell you: it’s not easy.  Sure, the ability to make things, to dream up a system and watch it take shape, to save yourself three days of work with five minutes of command-line scripting, is wonderfully empowering, and I wouldn’t have it any other way.  But along the way, to get to the triumph of having your tests pass and having your program actually work, there is a lot of grunt work, even more frustration, and a lot of time spent looking to your flanks, chasing after problems that aren’t directly related to your actual goal.

The gritty reality of learning to code

This is something I don’t think I have ever seen acknowledged in the Great DH Debates. To learn a little bit of code, enough to be able to manipulate variables and add some logic to a ‘for’ loop and wrap something else in an ‘if’ statement, is not hard at all.  To follow along with the Code Academy lessons, and learn exactly how some of that JavaScript web programming magic actually works, is a fine and productive thing to do.  To import that stuff onto your own website and make something creative and informative out of it is excellent.  But the thing that nobody tells you, and that you don’t have a visceral understanding for until you have been coding (preferably professionally) for a long time, is that, for all the “Eureka” moments, there are a hundred moments of wondering why your test is failing now, finding the misplaced parenthesis lurking in your code, realizing that your computer system upgrade means some libraries have moved around and your programs need to be updated, having the sinking feeling that you have solved this particular annoying data transformation problem three separate ways on four separate occasions, but none of them are exactly appropriate for the case you are facing now, so you will have to mostly reimplement the whole thing. That task you thought would take fifteen minutes has now taken over your entire day.

Or you run into a problem that you haven’t solved before, but it seems so obvious and so necessary that you know it must have been done.  And indeed, you will find eventually that it has been done, but as it is not part of a standard library and the problem is so integrated and/or specific, no one has seen fit to design and release a general-purpose solution for it (which would be far too much overhead anyway.)

Yak shaving

My apologies to anyone whom I lost in the preceding pair of paragraphs. The point I am trying to make actually got a name, long ago in Internet history:

You see, yak shaving is what you are doing when you’re doing some stupid, fiddly little task that bears no obvious relationship to what you’re supposed to be working on, but yet a chain of twelve causal relations links what you’re doing to the original meta-task. [Source]

Yak Shaving is the last step of a series of steps that occurs when you find something you need to do. “I want to wax the car today.”
“Oops, the hose is still broken from the winter. I’ll need to buy a new one at Home Depot.”
“But Home Depot is on the other side of the Tappan Zee bridge and getting there without my EZPass is miserable because of the tolls.”
“But, wait! I could borrow my neighbor’s EZPass…”
“Bob won’t lend me his EZPass until I return the mooshi pillow my son borrowed, though.”
“And we haven’t returned it because some of the stuffing fell out and we need to get some yak hair to restuff it.”
And the next thing you know, you’re at the zoo, shaving a yak, all so you can wax your car. [Source]

In fact, I wonder how many budding coders fully realize how prevalent this is.  You aren’t three levels deep in browser tabs looking for help on some odd jQuery problem you’re having just because you’re inexperienced; you’re there because all coders are there, at some time or another, and the need to do this never goes away.

You may not even be looking for help. Fundamentally, computer programming is a very low-level task, and the “do what I mean” language has never been invented. You might be able to describe the thing you want to do in a single sentence, but then you have to break it down to a series of computer statements, and you have to break some of those down even further, and you have to be ultra-precise in your interpretation. At some point you will realize that there is some detail of the system that you intended to disregard, but that turns out to be important. There is a parallel to be drawn here with transcription or translation of manuscript texts. It doesn’t get you any credit to speak of, nobody likes doing very much of it, we take shortcuts and then desperately wish we hadn’t because now we have to go re-do some of the work, and we all wish we could pass it off to enthusiastic but cheap helpers. Unless the work gets done, though, you will have nothing to show for your actual idea.

I would even say that the problem is worse, the more interesting the task you are trying to do–and let’s face it, the whole reason you’re a digital humanist is that you want to do interesting things that involve the computer, right?  The whole point is to try things that (hopefully) have never been tried before, and certainly to try things you have never tried before.  Unlike software contractors who might be providing Solution A for Company Z with a few improvements learned along the way, nearly everything you do is (or ought to be) in an exploratory direction.  You will constantly run into situations that you don’t understand, you will write and rewrite and refine the precise set of statements that reflect the concept you thought you had adequately coded six months ago, and you will never feel like an expert at this whole programming business.

Bring on the collaboration

Well, it’s time to bring in the experts then, isn’t it?  Here is where we come to another issue that DH (and before that, humanities computing, and before that, academic programming) has been facing for a long time.  What does it mean to collaborate?

The answer to this question, in fact, might depend on your answer to the question “does a digital humanist need to learn to code?”  The answers that I have seen tend to fall into two categories:

  1. No, as long as you can think systematically and understand the possibilities that digital methods open to humanities research, who cares if you know how to run a compiler? That’s what collaboration is for.
  2. Of course you have to learn to code, because otherwise you will never fully understand the possibilities, and anyway you will simply not get anywhere if you sit around waiting for others to provide the tools for your specific problems.

So it is clear in both of these answers that the two themes of methodological theory and programming skill are relevant, and in one answer they are more intertwined than in the other. But how far can collaboration really take us, today, in digital humanities research?

As Andrew Prescott most recently pointed out, in most collaborations between the academic and the programmer, the academic considers him- or herself the lead partner, and it is the responsibility of the programmer to realize the vision that will lead to a successful research outcome.  The vision may well have been shaped by the programmer, but the primary goal was the academic one all along. The dynamic has not disappeared with the establishment of dedicated Departments of Digital Humanities, with DH academic programs. The “traditional” humanist still tends to call the shots; the digital humanist supplies the hired help, and it is then up to him or her to find some means of extracting academic credit for the substantial work that is nevertheless not considered to be academic output worthy of record. In this model, while equal partnerships can happen, they are exceedingly rare. (That said, a properly equal partnership of this form does usually indicate a truly innovative project, since it implies that there is something there that is academically interesting to multiple fields.)

So to make any headway on the tenure track, it seems, the digital humanist must often put him- or herself in the driver’s seat of the project–that is, mostly on the humanities side, and seek collaboration with one or more programmers. This is the model of collaboration implied by those who see no need for digital humanists to do the coding themselves. But in this case there is no balance to be struck. Both the research result and the methodological credit will go to the non-coding humanist, digital or otherwise, who will simply have contracted out the grunt work necessary to build the actual tools. Now the coder is in the same position that the digital humanist occupied in the first scenario, only with even less of the academic credit; it is usually assumed that the coder is not really an academic at all. The work becomes just another programming job, albeit one that makes for good dinner conversation. Thus, while this is a fine model for employment if the humanist can afford it, it is not academic collaboration either.

The fundamental problem with humanities computing (if I may return to the slightly outdated phrase, and revive it to refer specifically to the practice of writing computer programs to solve problems in the humanities) is that an awful lot of the work has an awful lot of yak hair stuck to it. True, the end product might be spectacular. The methodological concepts behind the code might be mind-bendingly innovative. But how many academics can afford either the time to carry these projects through, or the money to hire people who can?

So by all means, get out there, learn to code. Find out what is possible. But understand that the things you want to do are still going to be hard, and forbiddingly time-consuming, without any sort of guarantee that the investment will pay off. If every digital humanist who doesn’t already know how to code gets out there tomorrow and signs up for a class, if the doors to this field are trampled down by techies and early dot-com retirees who really are code wizards and want a change of pace, what then? How will we explain to funders that we haven’t written any papers for the last six months because we were too busy trying to build a computational model for the evolution of Greek iconography from the tenth to the sixteenth centuries, and ran into some problems with databases along the way, and realized halfway through that the model needed to be re-designed to include UV identification of ink types? Put another way, how is our field going to bridge the gap between what we would like to do and what we are able to do?

Χαῖρε, κόσμε (Hello, world)

I’ve been meaning to start this blog for a very long time now, just as soon as I could work out what I might have to talk about. As time passes, though, it becomes increasingly clear that (at least in my own little hybrid sector of the humanities) scholars need a web presence nearly as much as they need a decent list of publications on their CVs.

So here I am, joining the 21st-century version of the Republic of Letters. I’ll be ruminating about topics on the digital humanities, and (I very much hope) topics on Byzantine and Eastern Christian history, in the months to come. I know there are quite a few digital humanists of various stripes in this new Republic; I hope I can ferret out a few more Byzantinists as well!