Krzysztof Lis
XML-native databases (BaseX, eXist-db) are valuable tools for exploring and processing massive XML data.
In this article, I show an example of loading, exploring and applying business logic to TARIC, the integrated tariff database of the European Union.
Using BaseX, we can load the massive 5GiB dataset directly, without any conversions. Despite the data size, we can use efficient queries to inspect the structure of the data.
The XQuery language provides very flexible queries of XML documents, but requires some mental effort when used for complex algorithms.
Using XML-native technologies avoids conversions to other formats and data-structure design, significantly reducing the development effort.
BaseX is ready to be integrated into a distributed application, for example as a REST data-source.
While the core data-structure of SQL databases is a table with homogenous rows, the XML databases store and query a tree of XML nodes. We evaluate the following XML-native databases:
BaseX [site], [source], [documentation]
author: Christian Grün
license: BSD
project start: 2007
eXist-db [site], [source], [documentation]
authors: Wolfgang Meier, Pierrick Brihayne, Leif-Jöran Olsson, Adam Retter, Dmitriy Shabanov and Dannes Wessels
license: LGPL
project start: 2000
They share many common characteristics and features:
In the experiments that follow, we find that BaseX can deal with bigger files, and executes queries several times faster than eXist-db.
We speculate it may be due to its extensive indexing, or query optimization (the optimized version of the query is shown in the bottom left corner of the window).
Considering those results, we recommend BaseX over eXist-db.
The TARIC database concerns EU import/export tariffs, the classification of goods, agricultural measures, prohibitions and restrictions to import and export, customs currency exchange rates and so on.
We download the dataset for the beginning of 2020: [ISZTAR4 - Tariff Browser - Startup file download].
The size of the decompressed file base-20200101T000000-20200116T013000.xml
is 5.3 GiB.
This poses a challenge for traditional tools such as text editors or DOM parsers,
while an XML-native database is well suited for processing such data.
Inspection/exploration: We want to inspect the document to understand its structure. Many text editors fail to load such a big file, and anyway manual inspection is hard for a document of this size and complexity. The database offers flexible XPath/XQuery queries, capable of efficient exploration of the data - we show examples in a later section.
Memory: It is a common approach to parse the XML file to an in-memory DOM and operate on it with the DOM API of one's programming language of choice. Here the DOM's size might exceed the available memory. A streaming parser (SAX) would be memory efficient, but it is less flexible: we would need to know beforehand what information to keep from the stream and how to organize it into the program's data structures; for example, sorting would be difficult. In contrast, a database is designed to operate on data larger than available memory.
Indexing: The database can index the data and optimize queries, allowing efficient operation at scale.
In this experiment we focus on the section concerning the classification of goods. The goods are arranged into a hierarchy of categories, for example bees are located at Live Animals / Other live animals / Insects / Bees. The database contains only a flat list of categories, and the hierarchy has to be implied from product codes and other properties. We intend to reconstruct the hierarchy, for use for example in a selection GUI. We implement the following:
basex
- BaseX installation directory, decompressed from BaseX931.zipbasex-data
exist-distribution
- eXist-db installation directory, decompressed from exist-distribution-5.1.1-unix.tar.bz2exist-data
data_in
- download input data
startup-file-20200116T013000
base-20200101T000000-20200116T013000.xml
split
data_out
- saved processing outputssrc
- processing code
basex
exist
Let us start with BaseX.
We perform the setup, import and queries using the BaseX GUI
At the time of the experiment, the latest version is 9.3.1
.
There is a Docker image available: basex/basexhttp
, but it is based on JDK 8.
We prefer to use a newer Java (openjdk 11.0.6) which is installed on our Ubuntu system (package default-jre).
For deployment, Docker (potentially with docker-compose
) is a very convenient option and it should be easy to rebuild the image on top of updated OpenJDK.
Download BaseX931.zip and extract it as the basex
subdirectory.
Launch BaseX GUI:
cd basex
bin/basexgui
Apply initial settings:
exist-data
directorySet the admin password by executing the following query:
user:password("admin", "fff")
We load the dataset into BaseX without problems and keeping the memory usage low.
Select Database -> New from the main menu
Specify the input file and database name. We also select the custom XML parser which claims better performance. It seems possible to load files directory from compressed archives, so we could have loaded the data directly from the downloaded ZIP archive (204MiB).
Start the import (press Ok).
The progress indicator also shows memory used, which does not go above 200MiB while loading the file.
Then BaseX builds indexes, using around 500MiB.
The import process is memory efficient and fast, taking slighlty above 6 minues.
Database 'taric' created in 383660.1 ms.
Inspect the created database using Database -> Properties from the main menu.
We can see the various index structures created by BaseX, using tag-names, paths, and text content.
It manages to infer that some ID fields are numeric (hjid
).
Interestingly, the size of the stored database is 4050 MiB, that is less than the input file of 5467 MB - the sotrage format is more efficient than textual XML.
eXist-db is quite similar to BaseX, but instead of a desktop GUI, it has the eXide web interface. As the time of the experiment, the latest version is 5.1.1
.
Like before, there is a Docker image: existdb/existdb
, but it too is based on JDK 8 and we will use our system's newer JVM instead.
Download exist-distribution-5.1.1-unix.tar.bz2 and extract it as the exist-distribution
subdirectory.
Edit the config exist-distribution/etc/conf.xml
. Our modified config file is in src/exist/conf.xml
.
exist-data
:<db-connection ... files="../../exist-data" ... >
...
<recovery ... journal-dir="../../exist-data" .../>
<parser><xml><features>
section.<db-connection>
, disable the <watchdog>
- we hoped this would make the import of a big file possible, but it seems to have had no effect.Launch the eXist-db server:
cd exist-distribution
bash bin/startup.sh
Set the admin password using the Java client:
cd exist-distribution
bash bin/client.sh
xmldb:exist://localhost:8080/exist/xmlrpc
admin
and set its passwordConnect a browser to the web interface at http://localhost:8080
, select eXide to view and edit the internal file system.
There are many ways to import data into eXist-db: [Getting Data into eXist-db]. Below we detail several attempts to load the big XML file into eXist-db, but all of them have ultimately failed. In the end we extracted the goods nomenclature section (~500MiB) which could be loaded. Nevertheless, we conclude eXist-db struggles with big data.
First we try the Java client.
cd exist-distribution
bash bin/client.sh --user admin --password fff --collection /db/taric/02_import --parse ../data_in/startup-file-20200116T013000/base-20200101T000000-20200116T013000.xml
This attempts to load the whole file into memory and crashes:
Caused by: java.lang.OutOfMemoryError: Required array size too large
Using the GUI client yields the same error, but the GUI does not even inform us about the error and instead shows an endless progress bar.
The eXide web interface has an upload feature but already warns us that it can only deal with files up to 100 MiB. True to its word, it silently fails when we supply our big file. No message is shown in the site UI, the only indication of the failure can be found in the JS console.
The database's internal file system is also accessible through WebDAV. There are many programs which can view and manage WebDAV directories, including many OS's file managers. Here we show how to upload the file using the simple CLI program cadaver:
$ cadaver http://localhost:8080/exist/webdav/db
Authentication required for exist on server `localhost':
Username: admin
Password:
cd /exist/webdav/db
mkcol taric
cd taric
mkcol 02_import
cd /exist/webdav/db/taric/02_import
put data_in/startup-file-20200116T013000/base-20200101T000000-20200116T013000.xml taric-base.xml
quit
The upload completes, and the file is stored in /tmp/exist-db-temp-file-manager-10164176935423371398
for processing (we know this location from the logs).
Then the database becomes unresponsive and uses a whole CPU core for several minutes.
When the processing is over, the data is unfortunately nowhere to be found inside the database.
We only get a mysterious log line (exist-distribution/logs/exist.log
):
2020-01-23 17:29:50,879 [qtp42338572-53] WARN (TransactionManager.java [close]:400) - Transaction was not committed or aborted, auto aborting!
No reason is given why the transaction should be aborted, perhaps this is a bug.
We decide to load only the goods nomenclature section which is just ~500MiB.
From the exploration performed with BaseX, we know each section is inside a <IsztarHistoryItem>
element, so we split the file after each occurence of IsztarHistoryItem
.
We adapt the awk
solution described here:
mkdir -p data_in/split
awk 'BEGIN{x="section_0_top";} /IsztarHistoryItem/{x="data_in/split/section_"++i".xml"; next;}{print > x;}' data_in/startup-file-20200116T013000/base-20200101T000000-20200116T013000.xml
The goods nomenclature is located in section_63.xml
. We load it using WebDAV as described above, storing it in /db/taric/02_import/goods_nomenclature.xml
.
cd /exist/webdav/db/taric/02_import
put data_in/split/section_63.xml goods_nomenclature.xml
Having loaded the document, we want to inspect its structure. XPath/XQuery allows us to formulate very flexible queries. Thanks to the index structures, these are executed quickly. Such interactivity would not be possible if we had to read the whole file and do a naive string search.
In the BaseX GUI, we open our database from the top menu: Database -> Open & Manage. Some part of the file is shown in the bottom left result viewer (we can return to view the database document by clicking the house button there).
In the top left part of the window, we can browse the code on the disk. Let us open src/basex
.
The scripts used in this section are in src/basex/03_exploration
.
Clicking on a code file will open in in the editor. Then we run the query using the green arrow above the editor.
First we list the names of the top level elements:
(: List names of top-level elements (one below root) :)
let $doc := doc("/taric/base-20200101T000000-20200116T013000.xml")
let $root := $doc/*
return $root/*/name()
Unfortunately they are just <IsztarHistoryItem>
s:
ResultsInfo
IsztarHistoryItem
IsztarHistoryItem
IsztarHistoryItem
...
Descending one level, we see the interesting section names.
(: List names of sections, 2 levels down from root :)
let $root := doc("/taric/base-20200101T000000-20200116T013000.xml")/*
let $section_names := $root/*/*/name()
return <SectionOverview>
<Num>{count($section_names)}</Num>
<Names>{$section_names}</Names>
</SectionOverview>
The output can be structured as an XML document:
<SectionOverview>
<Num>39</Num>
<Names>totalRecords databaseDate executionDate findFootnoteTypeByDatesResponse findCertificateTypeByDatesResponse findAdditionalCodeTypeByDatesResponse findLanguageByDatesResponseHistory
...
Now we know the goods section is located at $root/IsztarHistoryItem/findGoodsNomenclatureByDatesResponse
.
We count how many items there are in the goods classification hierarchy. They are given as a flat list.
let $root := doc("/taric/base-20200101T000000-20200116T013000.xml")/*
let $goods_section := $root/IsztarHistoryItem/findGoodsNomenclatureByDatesResponse
let $goods_items := $goods_section/*
return count($goods_items)
There are 62313
items.
We run the same query in eXist-db, replacing the top two lines because in eXist-db we have only loaded the goods section:
let $goods_section := doc("/db/taric/02_import/goods_nomenclature.xml")/findGoodsNomenclatureByDatesResponse
The query takes 0.1s in BaseX but 40.1s in eXist-db.
We can also view a chosen item:
return $goods_items[100]
yields:
<GoodsNomenclature xmlns:ns2="http://www.mf.gov.pl/schematy/isztar/ecipSeapHistoriaObi/2014/01">
<hjid>19774026</hjid>
<metainfo>
<opType>C</opType>
<origin>T</origin>
<status>L</status>
<transactionDate>2010-06-25T18:55:25</transactionDate>
</metainfo>
<oid>19774026</oid>
<sid>27722</sid>
<goodsNomenclatureItemId>0102904991</goodsNomenclatureItemId>
...
The TARIC XML is an extract from some object database, inluding the history of changes. The database is a list of transactions - adding, modifying or deleting objects.
Specifically, each object tag contains a <metainfo>
element describing the nature of the change:
the <opType>
can be "C" for Create, "D" for Delete and so on.
Two types of documents are published: snapshots for a given date or differential updates since the last published update. We operate on the snapshot for 2020-01-16, so all transactions are creating objects. We can therefore treat the database as a list of existing objects, and ignore the transaction data.
To be sure, we verify that all opTypes are "C" (create) and that the latest transaction is in the past (2019-12-27).
let $root := doc("/taric/base-20200101T000000-20200116T013000.xml")/*
let $goods_section := $root/IsztarHistoryItem/findGoodsNomenclatureByDatesResponse
let $transactions := $goods_section//metainfo
let $t_optypes := $transactions/opType/text()
let $t_dates := $transactions/transactionDate/xs:dateTime(text())
return <transactions>
<num_transaction_metainfo>{count($transactions)}</num_transaction_metainfo>
<uniqueOpTypes>{distinct-values($t_optypes)}</uniqueOpTypes>
<transactionDates>
<earliest>{min($t_dates)}</earliest>
<latest>{max($t_dates)}</latest>
</transactionDates>
</transactions>
This query takes from 7 to 9 seconds (35.4s in eXist) and confirms that all transactions are Creating objects and all have been in the past.
<transactions>
<num_transaction_metainfo>1206873</num_transaction_metainfo>
<uniqueOpTypes>C</uniqueOpTypes>
<transactionDates>
<earliest>1410-07-15T12:00:00</earliest>
<latest>2019-12-27T19:19:12</latest>
</transactionDates>
</transactions>
The database contains historical data and we need to filter it out to calculate the current state.
The scripts used in this section are in src/basex/04_filter_and_extract
.
Using the <validityStartDate>
and <validityEndDate>
, we filter the currently valid items:
let $root := doc("/taric/base-20200101T000000-20200116T013000.xml")/*
let $goods_section := $root/IsztarHistoryItem/findGoodsNomenclatureByDatesResponse
let $goods_items := $goods_section/*
let $date_now := xs:dateTime("2020-01-20T12:00:00")
let $goods_valid := $goods_items[
(: start date in the past :)
xs:dateTime(validityStartDate/text()) <=$date_now
and
(
(: no end date, or end date in the future :)
empty(validityEndDate)
or
xs:dateTime(validityEndDate/text()) > $date_now
)
]
return <ValidityStats>
<numGoodsAll>{count($goods_items)}</numGoodsAll>
<numGoodsValid>{count($goods_valid)}</numGoodsValid>
</ValidityStats>
It appears that around 1/3 of items are currently valid:
<ValidityStats>
<numGoodsAll>62313</numGoodsAll>
<numGoodsValid>24427</numGoodsValid>
</ValidityStats>
The goods categories are stored in a very verbose way in the original dataset, we extract the relevant and up-to-date information.
This is done by the query src/basex/04_filter_and_extract/extract02_filter_and_extract.xql
and associated functions in taric_goods_extraction.xqm
.
We are interesting in the following information:
<goodsNomenclatureItemId>
and product line from <produclineSuffix>
<goodsDescription>
<goodsNomenclatureIndents>
Since the descriptions and indents may have historical version, we take care to extract the latest. The extracted information about a category looks like this:
<GoodsNomenclature code="0101" indent="0" productline="80" hjid="19773322">
<description lang="DE">Pferde, Esel, Maultiere und Maulesel, lebend</description>
<description lang="EN">Live horses, asses, mules and hinnies</description>
<description lang="FR">Chevaux, ânes, mulets et bardots, vivants</description>
<description lang="PL">Konie, osły, muły i osłomuły, żywe</description>
</GoodsNomenclature>
We save the results to data_out/goods_nomenclature_flat.xml
.
Removing historical data and distilling relevant information reduces the data size (of the goods section) from 500MiB to 12MiB.
...
let $goods_info_useful := for $goods_elem in $goods_items_valid
let $extracted_info := taricxml:extract_goods_info($goods_elem)
order by $extracted_info/code
return $extracted_info
let $goods_nomenclature_flat := <GoodsNomenclatureFlat>{$goods_info_useful}</GoodsNomenclatureFlat>
return (
(: store in the DB, http://docs.basex.org/wiki/Database_Module#db:replace :)
db:replace("taric", "goods_nomenclature_flat.xml", $goods_nomenclature_flat),
(: write output to file system :)
file:write("../data_out/goods_nomenclature_flat.xml", $goods_nomenclature_flat)
)
The provided list of goods has no explicit hierarchy, such as a parent's code, or list of children. We reconstruct the hierarchy based on the following factors.
TARIC code
The code describes a good's place in the hierarchy.
The first two digits specify a high level category, such as 01 Live Animals or 07 Edible Vegetables And Certain Roots And Tubers. The following digits define a hierarchy of finer categories.
Both the broad categories and specific goods are represented by nodes in the database. A high level category will have its digits specified, followed by zeros. For example:
0100000000
LIVE ANIMALS0102000000
Live bovine animals0102310000
BuffaloWe sort by the codes to ensure that children follow their parent node. We remove the trailing zeros and verify that a parent's code is a prefix of its children's codes.
Product line
If two items have the same code, the one with the lower product line is the parent.
Indents If an item with k indents is followed by an item with k+1 indents, the second one is the first one's child.
The tree construction algorithm is implemented in src/basex/05_process_tree/build_hierarchy.xql
.
Broadly speaking, it involves:
The implementation of a complex algorithm in XQuery is somewhat unintuitive.
The language is functional and its data structures are immutable - instead of removing a processed element from a sequence, we return a new sequence without the first element (tail
function).
There is no while
loop, so instead we use a recursive function to loop over all items, in the style of tail recursion.
The basic construct of the language is FLWOR,
a kind of for-each loop, but it can not be terminated in the middle of the sequence;
in this task we do not use it because we do not know the number of an item's children beforehand.
It is unclear how to return multiple values from a function.
We needed to return a tuple (node, sequence)
; we can not use a second sequence, because it would merge with the sequence inside. We used an array which does the job but with tedious syntax:
let $a_and_b := f($x)
let $a := $a_and_b(1)
let $b := $a_and_b(2)
Perhaps we should have used an in-memory XML element.
We save the resulting tree as data_out/goods_nomenclature_tree.xml
and extract a compact representation (of one language) to data_out/goods_nomenclature_tree_readable_en.xml
.
This processing step took 3s.
<G code="" name="Goods Nomenclature">
<G code="01" name="LIVE ANIMALS">
<G code="0101" name="Live horses, asses, mules and hinnies">
<G code="010121" name="Horses">
<G code="010121" name="Pure-bred breeding animals"/>
<G code="010129" name="Other">
<G code="0101291" name="For slaughter"/>
<G code="0101299" name="Other"/>
</G>
</G>
<G code="01013" name="Asses"/>
<G code="01019" name="Other"/>
</G>
<G code="0102" name="Live bovine animals">
<G code="010221" name="Cattle">
<G code="010221" name="Pure-bred breeding animals">
<G code="0102211" name="Heifers (female bovines that have never calved)"/>
<G code="0102213" name="Cows"/>
<G code="0102219" name="Other"/>
</G>
<G code="010229" name="Other">
<G code="01022905" name="Of the sub-genus Bibos or of the sub-genus Poephagus"/>
<G code="0102291" name="Other">
<G code="0102291" name="Of a weight not exceeding 80 kg">
<G code="010229101" name="Young male bovine animals, intended for fattening"/>
<G code="010229102" name="Heifers of the grey, brown or yellow mountain breeds and spotted Pinzgau breed, other than for slaughter"/>
<G code="010229103" name="Heifers of the Schwyz and Fribourg breeds, other than for slaughter"/>
<G code="010229104" name="Heifers of the spotted Simmental breed, other than for slaughter"/>
<G code="010229105" name="Bulls of the Schwyz, Fribourg and spotted Simmental breeds, other than for slaughter"/>
<G code="010229109" name="Other"/>
</G>
...
In addition to processing big XML data, the databases can act as a part of a distributed system as a web-server. We show a simple example: an API to search for a name of a goods category. Our query extracts the matching nodes and their parent hierarchy. The web-API is very easy to specify thanks to RESTXQ.
xquery version "3.1";
module namespace page = 'http://taricxml.fraktal.com.pl/web';
declare function page:tree_search_hierarchical($tree as node()?, $lang as xs:string, $query as xs:string) as node()*
{
let $name := $tree/description[@lang=$lang]/text()
let $this_node_matches := contains(lower-case($name), $query)
let $children_matching := $tree/GoodsTree ! page:tree_search_hierarchical(., $lang, $query)
return if($this_node_matches or not(empty($children_matching))) then
<G code="{$tree/@code}" name="{$name}">
{ $children_matching }
</G>
else
()
};
(: http://docs.basex.org/wiki/RESTXQ#Query_Parameters :)
declare
%rest:GET
%rest:path("goods/search")
%rest:query-param("query", "{$text_query}", "")
%rest:query-param("lang", "{$lang}", "EN")
function page:hello($text_query as xs:string, $lang as xs:string) {
if (string-length($text_query) = 0) then
<error>No query specified</error>
else
let $doc := db:open("taric-base", "goods_nomenclature_tree.xml")/GoodsTree
let $lang_up := upper-case($lang)
let $result_hierarchy := page:tree_search_hierarchical($doc, $lang_up, lower-case($text_query))
return <response>
<query>{$text_query}</query>
<lang>{$lang_up}</lang>
<result-hierarchy>{$result_hierarchy}</result-hierarchy>
</response>
};
To launch the web service:
src/basex/06_search/tree_search.xql
to basex/webapp/goods/tree_search.xql
cd basex && bash bin/basexhttp
http://localhost:8984/goods/search?query=name&lang=EN
For example http://localhost:8984/goods/search?query=parrot
yields:
<response>
<query>parrot</query>
<lang>EN</lang>
<result-hierarchy>
<G code="" name="Goods Nomenclature">
<G code="01" name="LIVE ANIMALS">
<G code="0106" name="Other live animals">
<G code="010631" name="Birds">
<G code="010632" name="Psittaciformes (including parrots, parakeets, macaws and cockatoos)"/>
</G>
</G>
</G>
</G>
</result-hierarchy>
</response>
This query is not very efficient, it could be improved using full text features [doc 1] [doc 2].
BaseX proves to be a valuable tool for dealing with massive XML data. It can be used for inspection, business logic, and web-services. It achieves good performance thanks to indexes and query optimization.
The XQuery language provides very flexible queries of XML documents, but requires some mental effort when used for complex algorithms.
Using XML-native technologies for this task saves a lot of development effort, as no conversion to other formats was needed.
BaseX is ready to be deployed as part of an application, for example as a REST data-source.