Kura - Datamodel
an open-source multi-language multi-user linguistics database application
Kura is an application intended to facilitate descriptive and analytic linguistic work in the tradition described by Dixon as 'Basic Theory'. It will be a multi-user, multi-language application with facilities for linking data between languages. Since linguistic data can come in the form of sound files, manuscript scans and textual data, Kura will be a true multi-media application. Kura will be an extensible application, based on open standards such as SQL, XML and Unicode.
The Summer Institute of Linguistics has for years provided the gratis software package Shoebox, a capable single-user linguistics database and analysis tool. However, it is a closed-source application that only runs on non-standard and volatile environments like Microsoft Windows and the Apple Macintosh. Also from SIL is CELLAR, the Computing Environment for Linguistics, Literary and Anthropological Research (Simons and Thomson 1998). This and related projects are described in Nerbonne (1998); Lawler and Dry (1998) give a general introduction to the subject. The Himalaya Languages Project in Leyden started a similar project at the instigation of the present author, but was unable to pursue it for lack of funding.
The language of development is Python; the back-end can be any SQL database. The current implementation of the back-end runs on MySQL, but PostgreSQL, Oracle, Sybase, DB2, mSQL or Ingres should work too, as long as a standard Python DB interface is available. Two separate interfaces are planned: a web-server based interface and a graphical interface for the Unix KDE desktop. A modular design will facilitate the development of other interface components.
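As a minimal sketch of what back-end independence through the standard Python DB interface looks like, consider the following. The stdlib sqlite3 module stands in here for a driver like MySQLdb; the table name follows the lng_ convention used later in this document, but the columns are illustrative, not the final schema.

```python
# Sketch: every conforming driver module exposes the same connect() /
# cursor() / execute() calls, so swapping databases means swapping the
# imported driver, not the application code.
import sqlite3 as dbdriver  # e.g. MySQLdb for the MySQL back-end


def fetch_languages(conn):
    """Return all language names, independent of the back-end."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM lng_language ORDER BY name")
    return [row[0] for row in cur.fetchall()]


conn = dbdriver.connect(":memory:")
conn.execute("CREATE TABLE lng_language "
             "(languagenr INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO lng_language (name) VALUES ('Kham')")
conn.execute("INSERT INTO lng_language (name) VALUES ('Yamphu')")
print(fetch_languages(conn))  # -> ['Kham', 'Yamphu']
```

Only fetch_languages() would need to stay driver-neutral; the connect() call is the single place where the back-end choice is visible.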
Other possible choices would have been storing the data in an XML format, or using a multi-platform toolkit for the GUI client, such as Tkinter or wxPython. XML storage was not chosen because storing the parsed data instead of the compiled data improves concurrent access; there will be a module to export the data to an XML or HTML file, and the KDE interface uses HTML internally to present the data. Of the multi-platform toolkits, Tkinter does not offer table widgets, and wxPython uses the far from stable GTK widget set on Unix, while with version 2.0 Qt is well on the way to pervasive Unicode support. Due to constraints on available development time and resources, writing the application in C or a C-derived language such as Java is out of the question.
Python has the added advantage of being very much a language designed to be usable for the subject specialist who is not a software engineer. Linguists can use Python to add their own modules to Kura.
While the relational model that SQL uses is not particularly suited to hierarchical, ordered data, these problems can be overcome with well-known and time-honoured design techniques.
At present, Unicode support is not feasible without investing in expensive, closed-source components. However, Qt 2.0 already supports Unicode, the WChar extension package provides Unicode support for Python, and the maintainer of the pyKDE/pyQt interface between Python and KDE/Qt is actively developing a version that will support Qt 2.0 and Unicode. Of the back-end servers, to my current knowledge only PostgreSQL and Oracle are already capable of storing Unicode text. Unicode support will be an essential feature of the finished application.
At the core of the application is the raw linguistic data. True linguistic data comes in one of two forms: text and sound. Textual data can be in the form of original manuscripts or other written materials; sound data results from field work tapes. Together, these data constitute the corpus for a language. Attributes of this corpus include place of origin, date of origin, author, recording technique, and recording or transcribing linguist. Other attributes are possible.
Linguistic data can be analysed in two ways: on a phonetic/phonological or graphical level, or on a morphological/syntactical or semantic level. An analysis of linguistic data is often hierarchical in nature: texts divide into sentences, sentences into phrases, phrases into words, words into morphemes and morphemes into sounds. On the other hand, linguistic data is always linearly ordered: words and sounds follow each other. The relational model is singularly ill equipped for storing linear data, but with some programming complexity this is solvable. However, it is clear that this will be the most complex and error-prone aspect of the Kura project.
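The standard technique for this is to give every analysed element both a parent pointer (the hierarchy) and a sequence number (the linear order). The sketch below shows how the original linear text can then be recovered from flat relational rows; the row layout and names are illustrative, not the final schema.

```python
# Sketch: flat rows carrying a parent pointer (hierarchy) and a seqnr
# (linear order), the classic adjacency-list technique for trees in a
# relational model.
rows = [
    # (elementnr, parent, seqnr, form)
    (1, None, 1, "sentence"),
    (2, 1,    1, "word:mo.ba"),
    (3, 1,    2, "word:buni"),
    (4, 2,    1, "morpheme:mo"),
    (5, 2,    2, "morpheme:ba"),
]


def children(rows, parent):
    """Children of one node, in their stored linear order (seqnr)."""
    return sorted((r for r in rows if r[1] == parent), key=lambda r: r[2])


def flatten(rows, parent=None):
    """Depth-first, seqnr-ordered traversal: yields the linear text."""
    out = []
    for r in children(rows, parent):
        out.append(r[3])
        out.extend(flatten(rows, r[0]))
    return out


print(flatten(rows))
# -> ['sentence', 'word:mo.ba', 'morpheme:mo', 'morpheme:ba', 'word:buni']
```

The programming complexity mentioned above lies mostly in keeping the seqnr values consistent when elements are inserted or reordered.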
Linguistic endeavours are divided into projects: some large, like 'preparing a grammatical description of Kham'; some small, like 'the phonological status of centro-palatal stops in Nepali'; others span more than one language, like 'the development of the tense system in the Bantu languages'. Projects are carried out by one or more linguists. Proper attribution of the analyses presented by the Kura system is important from the point of view of scholarly accountability. Since linguistics as a scholarly discipline is a process-oriented endeavour, it is important to preserve the history of analyses. This is done by making the user and the date part of the natural key of every relevant table.
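The effect of putting the user and the date into the natural key is that a new analysis never overwrites an older one: history accumulates as extra rows. A minimal sketch, with hypothetical table and column names:

```python
# Sketch: user and date as part of the natural key preserve the history
# of analyses -- re-analysing adds a row instead of replacing one.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE lng_analysis (
    elementnr INTEGER,               -- the analysed element
    username  TEXT,                  -- who made this analysis
    created   TEXT,                  -- when it was made
    gloss     TEXT,
    PRIMARY KEY (elementnr, username, created)
)""")
conn.execute("INSERT INTO lng_analysis VALUES (1, 'boud', '1999-07-31', 'REP')")
conn.execute("INSERT INTO lng_analysis VALUES (1, 'boud', '1999-08-05', 'ReMP')")

# Both analyses of element 1 survive, in chronological order.
history = conn.execute(
    "SELECT created, gloss FROM lng_analysis "
    "WHERE elementnr = 1 ORDER BY created").fetchall()
print(history)  # -> [('1999-07-31', 'REP'), ('1999-08-05', 'ReMP')]
```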
Ultimately Kura will assist the linguist in producing flexible multi-media descriptions of linguistic data, according to the program laid out in Rempt (1999).
The fieldworker provides language data in the form of manuscripts and recordings. These are transcribed and form the basis for the analysed data. The transcribed data is analysed, producing linear phonetic data as well as linearly and hierarchically ordered lexical and structural data.
It is difficult to present an implementation plan for an open source project, dependent as it is upon spare time. The first version of the datamodel was ready by July 31st. It is intended that a pre-alpha quality release based on a simple MySQL interface and a reduced feature set should be ready by August 20th.
The application prefix for Kura is provisionally lng_, to provide a dedicated namespace for the application within the SQL database. An important design decision is whether to allow customisation through imploded tables, and if so, to what degree. This issue affects not only the usability of the final application, but also its performance. Provisionally it has been decided to allow user-defined tags, but not to make every attribute user-definable; there are therefore multiple look-up tables. All look-up tables are suffixed with _code; look-up tables with information pertinent to the running of the application are suffixed with _sys. All tables have a four-letter alias which is used in the naming of keys and many-to-many link tables. Every non-code table has a surrogate primary key, whose column is named after the table (without prefix) plus nr. All tables that represent data in sequential form have a column seqnr.
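The naming conventions above can be illustrated with a small piece of DDL. The sketch uses the stdlib sqlite3 module purely as a convenient test bed; the real schema targets MySQL, and the column types and contents here are illustrative.

```python
# Sketch of the naming conventions: lng_ as application prefix, _code
# for a look-up table, textnr as the surrogate key (table name + nr),
# and seqnr to order sequential data.
import sqlite3

ddl = """
CREATE TABLE lng_text_code (      -- look-up table: suffix _code
    code    TEXT PRIMARY KEY,
    descr   TEXT
);
CREATE TABLE lng_text (           -- data table: prefix lng_
    textnr  INTEGER PRIMARY KEY,  -- surrogate key: table name + nr
    seqnr   INTEGER,              -- linear order of sequential data
    code    TEXT REFERENCES lng_text_code (code),
    content TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # -> ['lng_text', 'lng_text_code']
```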
The administration of the projects could be further expanded with an interface to a bibliographic database. For now, the lng_references table provides a simple implementation of a bibliographic database, so that analyses can be directly linked to references, via a tag.
The definition of a language must be further expanded, for instance with script, Unicode range, sort order, etc.
Every user can work on more than one project; more than one user can work on a project. A user is affiliated to an institution, such as a university. Institutions, users and projects can have home pages. It remains to be seen whether it is relevant for projects and institutions to have addresses, too. In that case, a general address table might be useful.
The most important decision with regard to the language data is whether the multimedia content, namely the recordings and the scans, should be stored in the database or on the file system. Storing it in the database has the advantage of enforced synchronization: it becomes difficult to sever the connection between the data and the file. Storing it in the file system makes for a faster and less complicated implementation. For the time being, multimedia content will be stored on the file system, reachable by a URL to ensure network transparency; fortunately, both Python and KDE support this pervasively. Recordings and scans are transcribed into texts. Every scan, recording and text originates with one project, although texts can be used in more than one project.
The lng_stream table subdivides the texts into analyzable, complete and reasonably coherent chunks, such as can be used as complete examples.
Elements form the analysis of the structure of the text: they represent the parsed and tagged elements that the text consists of. An element can refer to an item in the lexicon. Elements can be defined recursively, if necessary down to the phonetic level.
Lexical items can be practically anything, from proper names to words, morphemes and even sounds. The exact type of a lexical item depends upon the tag used. Lexical items can be related to each other using the table lng_lex_lex.
A thesaurus database could be used to dynamically link semantically related items within the lexicon, for a single language or a range of languages. A many-to-many table links lexical items to each other in a structured fashion, for instance for the elements of compounds or set phrases.
Tags offer structural information on streams and elements. As many tags as necessary can be added to an element, using the link tables lng_text_tag, lng_stream_tag, lng_element_tag and lng_lex_tag. Tags are unique in value.
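The link tables follow the usual many-to-many pattern: a row per (element, tag) pair. A minimal sketch of lng_element_tag, again with sqlite3 as a stand-in back-end and illustrative columns:

```python
# Sketch of the many-to-many link-table pattern used for tagging:
# lng_element_tag pairs element keys with tag keys, so an element can
# carry as many tags as necessary and tag values stay unique.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lng_tag (tagnr INTEGER PRIMARY KEY, value TEXT UNIQUE);
CREATE TABLE lng_element (elementnr INTEGER PRIMARY KEY, form TEXT);
CREATE TABLE lng_element_tag (
    elementnr INTEGER REFERENCES lng_element (elementnr),
    tagnr     INTEGER REFERENCES lng_tag (tagnr),
    PRIMARY KEY (elementnr, tagnr)
);
INSERT INTO lng_tag VALUES (1, 'ERG'), (2, 'ELA');
INSERT INTO lng_element VALUES (1, 'mo.ban.no?');
INSERT INTO lng_element_tag VALUES (1, 1), (1, 2);
""")

# All tags attached to element 1, via the link table.
tags = [r[0] for r in conn.execute("""
    SELECT t.value FROM lng_tag t
    JOIN lng_element_tag et ON et.tagnr = t.tagnr
    WHERE et.elementnr = 1 ORDER BY t.value""")]
print(tags)  # -> ['ELA', 'ERG']
```

The composite primary key on the link table prevents the same tag from being attached to the same element twice.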
The purpose of this chapter is to take a small text and see if it fits into the datamodel shown above. I've taken a random bit from a text from Rutgers (1998: 447), with the transcription adapted to the possibilities of the web.
2. mo.ba buni laett.a.ji.ro laett.a.ji.ro - syal.ae?ae jal.di.
   that.ELA bond_friend be.PT.DU.REP be.PT.DU.ReMP syal.POS jal.EXH
   'They were bond friends -- this was the jackal's trick.'

3. mo.ban.no? syal.dae? ikko ping toks.u.ro.
   that.ELA.EXF syal.ERG one ping make.->3.REP
   'The jackal built a swing.'
The main Kura window uses the explorer interface familiar to most users. The user can drill down to the data he or she wishes to view or edit, starting from data, languages or users. For instance, when working with the Yamphu Grammar project, the right-hand pane will show the users working on that project, the texts, recordings and scans that belong to it, and so on.
When a certain text, recording or scan is chosen to work with, this view changes to an interlinear view of the text, a screen that allows transcription from a recording or a scan, or a view of the lexicon.
Other entry screens will be made for the maintenance of administrative data; these screens will also be available as HTML forms, using CGI scripts.
For the semi-automated interlinearizing of texts a special interface element will have to be developed.
The first implementation of the back-end datastructure is finished and ready for downloading. This version makes some use of MySQL idiosyncrasies, but those will disappear in the next version.
The interface prototype can be downloaded too.
The central database that stores all linguistic data can be kept on one networked server. The actual configuration depends on the number of concurrent users and the volume of data, but ten concurrent users and a gigabyte of data can be successfully served from a commodity PC running a dependable server OS like FreeBSD or Linux.
If legacy clients like Macintoshes and Windows PCs must be supported, a second machine serving the KDE applications and the dynamic web content can be installed. Again, a commodity PC will provide sufficient throughput. Currently, it looks like the KDE applications will need about 10 MB per concurrent client. The Mac and Windows machines must then run one of the available (possibly free) X servers and a web browser.
Lawler, John M. and Helen Aristar Dry. 1998. Using Computers in Linguistics. Routledge. (http://www.routledge.com/routledge/linguistics/using-comp.html)
Nerbonne, John (ed.). 1998. Linguistic Databases. Stanford, CSLI Publications.
Rempt, Boudewijn. 1999. The ideal grammar. (http://rempt.xs4all.nl/conlang/dream.html).
Rutgers, Roland. 1998. Yamphu. Leyden, Research School CNWS.
Simons, Gary F. and John V. Thomson. 1998. 'Multilingual Data Processing in the CELLAR Environment', in Nerbonne (1998), 203-234.