C. M. Sperberg-McQueen
19 December 2018
Note October 2022: this document is out of date.
This document describes issues relating to the duplication of data in the TLRR 2 project between a ‘front end’ (a Subversion repository managed by Apache and updated via WebDAV) and a ‘back end’ (a BaseX database on another server, accessed over HTTP through BaseX's HTTP REST interface).
An early design decision in the project was that the front-end copy of any record is the master copy; the back-end database is used as a subsidiary device to allow more powerful searching. Whenever records are updated, first the change is made in the front-end system and then an HTTP message is sent to the back-end server essentially saying “Record so-and-so has changed; update your copy.”
This approach is not always the right one. It was chosen in part because it seemed a good idea to have a complete revision history in Subversion, to allow recovery from catastrophic errors, and in part because it was familiar from earlier projects. The original choice was also influenced by the fact that XQuery search and retrieval were more mature than the update facility, and by the fact that I didn't know the update facility at all well.
In practice, there are several ways in which the front and back ends can fall out of synch.
When a record is updated in the front end, the XForm can fail to request a corresponding back-end update.
This has not happened (that I am aware of).
The corresponding back-end update can fail.
This has happened when the collection parameter was introduced; the change was propagated to some forms and queries but not to all, and back-end updates failed without the failure being visible to the user.
When manual updates are made to records and checked in to the Subversion repository, a script must be run to tell the back end to update its copy of those records; that step can be forgotten.
This has apparently happened on one or more occasions.
There are three mechanisms for checking whether the front and back ends are in synch.
The ID-list synchronization check for front and back ends (restricted-access link) asks the user to specify a database and a record type, and then (rather cumbrously) loads lists of IDs for those records from both the front end and the back end, and compares the lists. It thus makes it possible to detect records that have been added or deleted in one database but not the other.
It does not check that records with the same ID are in synch.
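The comparison performed by the ID-list check amounts to two set differences. The following is a minimal sketch in Python, not the form's actual implementation; the function name and the sample IDs are hypothetical, and fetching the lists from Subversion and BaseX is left out.

```python
def compare_id_lists(front_ids, back_ids):
    """Return (only_in_front, only_in_back) as sorted lists of IDs."""
    front, back = set(front_ids), set(back_ids)
    return sorted(front - back), sorted(back - front)


# Hypothetical record IDs, for illustration only.
front = ["t001", "t002", "t003"]
back = ["t002", "t003", "t004"]

only_front, only_back = compare_id_lists(front, back)
print("in front end but not back end:", only_front)  # ['t001']
print("in back end but not front end:", only_back)   # ['t004']
```

Like the form, this reports only additions and deletions; two records that share an ID but differ in content pass unnoticed.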
The Single-record comparison form allows the user to load one record from front and back ends and compare the two versions. If they appear to differ, the user also has the option of asking that the back end be updated.
Top-level fields are aligned visually to ease comparison. For simplicity of implementation, however, the form does not perform a careful comparison of the corresponding fields; instead it uses two simple heuristics to guess whether they are the same or not.
To check for differences in character-data content, it compares the whitespace-normalized string values of the two fields being compared. This will detect any changes to the text of the field, but it also sometimes detects differences when the back end introduces insignificant whitespace which is not present in the front end.
A more aggressive use of xsl:strip-space reduced the number of false reports of difference, but did not eliminate them entirely. (Some of the residual errors involved non-meaningful whitespace at the beginning or end of a block element, in mixed content.)
Perhaps the comparison normalize-space($a) = normalize-space($b) should be supplemented (or even replaced) by translate(normalize-space($a),' ','') = translate(normalize-space($b),' ','').
To check for differences in markup, it compares the number of element nodes in the two fields being compared; this detects most changes to markup but not simple changes from one element type to another or changes to attribute values.
The heuristics are almost embarrassingly simple-minded, but they appear in practice to serve their purpose: they allow the stylesheet to draw the user's eye to areas of difference. Over-sensitivity in the comparison is irritating but does no substantive harm: in any case of doubt, it is always safe to ask the back end to update itself; the only cost is unnecessary network use and CPU time.
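The two heuristics, and the stricter whitespace test suggested above, can be imitated outside XSLT. The sketch below uses Python rather than the form's own stylesheet; the function names, the element names in the samples, and the choice to count the field element itself are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# XPath normalize-space() treats only space, tab, CR and LF as whitespace.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(s):
    """Approximate XPath normalize-space(): collapse whitespace runs, trim ends."""
    return _XML_WS.sub(" ", s).strip(" ")


def compare_fields(front_xml, back_xml, ignore_all_space=False):
    """Return (text_same, markup_same) for two serialized fields."""
    a, b = ET.fromstring(front_xml), ET.fromstring(back_xml)
    # The string value of an element is the concatenation of its text nodes.
    ta = normalize_space("".join(a.itertext()))
    tb = normalize_space("".join(b.itertext()))
    if ignore_all_space:
        # Counterpart of translate(normalize-space($a), ' ', '').
        ta, tb = ta.replace(" ", ""), tb.replace(" ", "")
    # Markup heuristic: compare the number of element nodes only.
    count_a = sum(1 for _ in a.iter())
    count_b = sum(1 for _ in b.iter())
    return ta == tb, count_a == count_b


# Hypothetical fields; the element names are not taken from the TLRR vocabularies.
front = "<note><p>One.</p><p>Two.</p></note>"
indented = "<note>\n  <p>One.</p>\n  <p>Two.</p>\n</note>"
renamed = "<note><p>One.</p><q>Two.</q></note>"
merged = "<note><p>One.Two.</p></note>"

print(compare_fields(front, indented))                         # (False, True)
print(compare_fields(front, indented, ignore_all_space=True))  # (True, True)
print(compare_fields(front, renamed))                          # (True, True)
print(compare_fields(front, merged))                           # (True, False)
```

The first result is a false report of difference caused purely by indentation, and the second shows the stricter test suppressing it. The price of the stricter test is that a real change in word boundaries ("ab c" versus "a bc") would go unreported. The third result shows the markup heuristic missing a change of element type, as described above.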
We do not currently have a web page from which it's possible to request a comparison of all records in the database. It would be a good idea, but we ran out of time for implementation, and it was eventually quicker to do it locally than on the Web. The procedure is as follows.
Use the ID-list synchronization checker to make sure both databases have the same records.
Update the local copy of the front-end database using svn update.
Get a list of IDs for the appropriate database from the back end, by issuing the query for $d in collection("TLRR2") return base-uri($d) on the back end (either from the BaseX dbadmin interface or using curl and appropriate credentials; the BaseX user must have read access to the database).
Store the results in a file, making sure the last line ends in a newline.
Make a local copy of the back-end database using the bash commands cat ids | while read -r u; do echo "$u"; curl --silent --user "${USER}:${PW}" "http://modeleditions.blackmesatech.com/BaseX831/rest/${u}" > "$u"; done.
In temporary directories, make whitespace-normalized copies of the front- and back-end databases by using Saxon to evaluate the tlrr-pretty-print.xsl stylesheet on them.
For example: for f in fe/TLRR2/trials/*.xml; do echo $f ...; ~/bin/saxon-he-wrapper.sh $f ../../../lib/tlrr-pretty-print.xsl fe-pretty/TLRR/trials/${f##*/}; done and similarly for the back end.
Run diff -rqw on the pretty-printed front and back ends.
Check any differences manually. Or just update all the relevant records and have done with it; I check the individual files because I want to know why the front and back ends have fallen out of synch.
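For readers without diff to hand, the final comparison step can be approximated in a few lines of Python. This is a sketch, not part of the project's tooling: the function names, the directory name be-pretty, and the sample files are hypothetical. It is also more lenient than diff -rqw, since it ignores line breaks as well as spaces and tabs.

```python
import os
import tempfile


def squeeze(text):
    """Drop all whitespace so that only the remaining characters are compared."""
    return "".join(text.split())


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def compare_trees(front_dir, back_dir):
    """List (path, status) for files missing on one side or differing in content."""
    front, back = list_files(front_dir), list_files(back_dir)
    report = []
    for rel in sorted(front | back):
        if rel not in back:
            report.append((rel, "only in front end"))
        elif rel not in front:
            report.append((rel, "only in back end"))
        else:
            with open(os.path.join(front_dir, rel), encoding="utf-8") as f:
                a = f.read()
            with open(os.path.join(back_dir, rel), encoding="utf-8") as f:
                b = f.read()
            if squeeze(a) != squeeze(b):
                report.append((rel, "differs"))
    return report


def write(directory, name, text):
    """Write one sample file for the demonstration below."""
    with open(os.path.join(directory, name), "w", encoding="utf-8") as f:
        f.write(text)


# Demonstration on throwaway directories with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    fe = os.path.join(tmp, "fe-pretty")
    be = os.path.join(tmp, "be-pretty")
    os.makedirs(fe)
    os.makedirs(be)
    write(fe, "t1.xml", "<trial><name>A</name></trial>")
    write(be, "t1.xml", "<trial>\n  <name>A</name>\n</trial>\n")
    write(fe, "t2.xml", "<trial><name>B</name></trial>")
    write(be, "t2.xml", "<trial><name>C</name></trial>")
    write(fe, "t3.xml", "<trial/>")
    demo_report = compare_trees(fe, be)

print(demo_report)
# [('t2.xml', 'differs'), ('t3.xml', 'only in front end')]
```

Every path in the report is a candidate for the manual check described in the last step.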
The pretty-printing stylesheet performs an identity transform; it is TLRR-specific only in that it has xsl:strip-space and xsl:preserve-space elements which list the elements of the TLRR vocabularies.
Saxon appears to indent more aggressively than xsltproc, which means that running both databases through the pretty-printer with Saxon eliminates a high percentage of the noise differences involving non-significant whitespace which otherwise render comparisons difficult. Using xsltproc, the pretty printing is less effective and the output raises reports of differences which prove uninteresting.
Saxon and BaseX appear to perform very similarly in indentation, but they use different indentation levels and there are apparently other differences which cause unhelpful diff output even when the -w (ignore whitespace) flag is used. So the pretty printing needs to be applied to the back end as well as the front end.
Last updated 19 March 2018