Implementing an extensible XML processor

A side effect of implementing tax-related software in Germany is that I work a lot more with XML than I used to.

The electronic interface of the German fiscal administration, ELSTER, uses XML as its core technology for transferring tax documents. They provide an XML Schema for each type of document and a client library for validating and submitting documents that adhere to a given schema.

To make it easier to integrate ELSTER into various software projects, I want to implement a generic and extensible XML processing library that reads a given XML Schema and produces XML documents that are valid against it.

Normalizing XML Schemas

A core feature of the XML processor will be schema normalization. The structure of XML schemas can become very complex due to a variety of features:

- Referencing other schema files using xs:include or xs:import elements

- Defining types globally or inline

- Referencing elements using the ref attribute

- Referencing types using the type attribute

- Using substitution groups to substitute elements for other elements

- Using abstract elements

The first step after parsing the schema is to normalize it into a form that is simple to reason about and to process in later steps.

During normalization, all referenced schemas are merged recursively with the initial schema. In this step, it's important to preserve the target namespace of each schema.
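A minimal sketch of the recursive merge, assuming a heavily simplified schema model. The names SchemaDoc, QualifiedElement, mergeSchemas and the load callback are hypothetical and only serve the illustration. The sketch also ignores that an xs:include of a schema without a target namespace adopts the namespace of the including schema.

```typescript
// Hypothetical, simplified schema model (not a real library API).
interface SchemaDoc {
  targetNamespace: string;
  references: string[]; // locations named by xs:include / xs:import
  elements: string[]; // local names of the global elements
}

interface QualifiedElement {
  namespace: string;
  name: string;
}

// Recursively collects the global elements of a schema and of every schema
// it references. `load` is an assumed callback that resolves a location to
// an already parsed schema.
function mergeSchemas(
  location: string,
  load: (location: string) => SchemaDoc,
  visited: Set<string> = new Set<string>(),
): QualifiedElement[] {
  if (visited.has(location)) return []; // guard against circular references
  visited.add(location);

  const schema = load(location);
  // Tag each element with the target namespace of the schema declaring it,
  // so the namespace survives the merge.
  let merged: QualifiedElement[] = schema.elements.map((name) => ({
    namespace: schema.targetNamespace,
    name,
  }));
  for (const reference of schema.references) {
    merged = merged.concat(mergeSchemas(reference, load, visited));
  }
  return merged;
}

// Example with two made-up schemas that reference each other.
const schemaFiles = new Map<string, SchemaDoc>([
  ["main.xsd", { targetNamespace: "urn:example:main", references: ["common.xsd"], elements: ["Document"] }],
  ["common.xsd", { targetNamespace: "urn:example:common", references: ["main.xsd"], elements: ["Address"] }],
]);

function loadFromMap(location: string): SchemaDoc {
  const schema = schemaFiles.get(location);
  if (schema === undefined) throw new Error(`unknown schema: ${location}`);
  return schema;
}

const mergedElements = mergeSchemas("main.xsd", loadFromMap);
```

Storing the namespace on every merged declaration, rather than once per file, means later steps no longer need to know which file a declaration came from.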

Next, all inline types are transformed into global types which are referenced by the elements they were contained in.
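Hoisting inline types could look roughly like this. The model (ElementDecl, InlineType, GlobalType) and the naming scheme for generated types are assumptions made for the sketch. A real implementation would also have to make sure a generated name does not collide with an existing global type.

```typescript
// Hypothetical, simplified model of element declarations.
interface InlineType {
  children: ElementDecl[];
}

interface ElementDecl {
  name: string;
  type?: string; // reference to a global type
  inlineType?: InlineType; // anonymous type defined inside the element
}

interface GlobalType {
  name: string;
  children: ElementDecl[];
}

// Replaces every inline type with a reference to a newly created global type.
// Generated names are derived from the element path (an assumed convention).
function hoistInlineTypes(
  elements: ElementDecl[],
  globals: GlobalType[],
  path = "",
): ElementDecl[] {
  return elements.map((el) => {
    if (el.inlineType === undefined) return el; // already references a type
    const typeName = `${path}${el.name}Type`;
    // Hoist nested inline types first, so the new global type only
    // contains references.
    const children = hoistInlineTypes(el.inlineType.children, globals, `${path}${el.name}_`);
    globals.push({ name: typeName, children });
    return { name: el.name, type: typeName };
  });
}

// Example: an Order element whose inline type contains an Item element
// that has an inline type of its own.
const hoistedGlobals: GlobalType[] = [];
const hoistedElements = hoistInlineTypes(
  [
    {
      name: "Order",
      inlineType: {
        children: [
          { name: "Item", inlineType: { children: [{ name: "Sku", type: "xs:string" }] } },
        ],
      },
    },
  ],
  hoistedGlobals,
);
```

After this step every element carries a type reference, so later steps only have to handle one way of attaching a type to an element.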

Element references (the ref attribute) are resolved to the global element declarations they point to.

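Substitution groups are converted into xs:choice groups that list the head element (unless it is abstract) together with all elements that can substitute for it.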
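As an illustration of the substitution group step, here is a sketch that computes the options of such a choice for a given head element. GlobalElement and choiceFor are hypothetical names. The sketch follows substitution groups transitively and leaves out abstract elements, but it ignores the block and final attributes.

```typescript
// Hypothetical, simplified model of a global element declaration.
interface GlobalElement {
  name: string;
  abstract?: boolean;
  substitutionGroup?: string; // name of the head element this one can replace
}

// Returns the names of all elements that may appear where `head` is
// referenced, i.e. the options of the resulting choice.
function choiceFor(head: string, elements: GlobalElement[]): string[] {
  const options: string[] = [];
  const seen: string[] = [];

  const visit = (name: string): void => {
    if (seen.indexOf(name) !== -1) return; // defensive guard against cycles
    seen.push(name);
    const decl = elements.find((e) => e.name === name);
    if (decl === undefined) return;
    // Abstract elements cannot appear in a document themselves.
    if (decl.abstract !== true) options.push(name);
    // Members of a member's own substitution group are allowed as well.
    elements
      .filter((e) => e.substitutionGroup === name)
      .forEach((member) => visit(member.name));
  };

  visit(head);
  return options;
}

// Example with made-up element names.
const paymentOptions = choiceFor("Payment", [
  { name: "Payment", abstract: true },
  { name: "CardPayment", substitutionGroup: "Payment" },
  { name: "BankTransfer", substitutionGroup: "Payment" },
  { name: "SepaTransfer", substitutionGroup: "BankTransfer" },
]);
```

Because abstract elements drop out of the option list here, they need no separate handling once the choices have been generated.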

https://www.uni-rostock.de/storages/uni-rostock/Alle_IEF/Informatik/Homepages/Meike_Klettke/Students/SA_tobias_tiedt.pdf

Mapping domain data to XML elements

I see multiple approaches to mapping data from the domain layer to corresponding XML elements while adhering to the XML Schema:

- Using a mapping interface (for example a TypeScript interface) which translates between both layers

- Using an object-oriented structure where each XML element is represented as a class

- Using a dynamic context retrieval system which the XML layer uses to retrieve data for a given XML element from the domain layer

The last approach seems the most flexible and scalable. However, I could combine it with the first approach to make the context retrieval type-safe: the XML processor could generate TypeScript types for the expected context.
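To make the idea of type-safe context retrieval more concrete, here is a rough sketch. InvoiceContext stands in for a type the XML processor might generate. Its keys, and the names Resolvers and createContext, are assumptions and not a settled design.

```typescript
// Hypothetical type that the XML processor could generate from a schema.
// Using element paths as keys is just one possible convention.
interface InvoiceContext {
  "Invoice/Number": string;
  "Invoice/Total": number;
}

// The domain layer supplies one retrieval function per context key.
type Resolvers<Ctx> = { [K in keyof Ctx]: () => Ctx[K] };

// Wraps the resolvers in a lookup function whose return type depends on the
// requested key.
function createContext<Ctx>(resolvers: Resolvers<Ctx>) {
  return <K extends keyof Ctx>(key: K): Ctx[K] => resolvers[key]();
}

// Made-up domain data.
const invoice = { id: "2024-001", netCents: 1000, taxCents: 190 };

const getContext = createContext<InvoiceContext>({
  "Invoice/Number": () => invoice.id,
  "Invoice/Total": () => (invoice.netCents + invoice.taxCents) / 100,
});

// The compiler knows the value type of each key.
const invoiceNumber: string = getContext("Invoice/Number");
const invoiceTotal: number = getContext("Invoice/Total");
```

With a generated context type, a missing resolver or a resolver returning the wrong type becomes a compile-time error instead of an invalid document at runtime.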

Some questions still remain:

How to identify data points in the context and map them to a specific element? How to determine whether the data point is inserted as a child or an attribute value?

Serializing XML documents

Instead of explicitly keeping track of namespace declarations per element in the XML object model, each element and attribute should track its own namespace. The serializer then takes care of inserting namespace declarations where necessary during recursive serialization of the XML tree.
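A minimal sketch of such a serializer, under a few simplifying assumptions: the object model (XmlElement, XmlAttribute) is hypothetical, prefixes are generated as ns1, ns2 and so on, and no default namespace is ever emitted, so elements and attributes can be qualified the same way.

```typescript
// Hypothetical object model: every node only knows its own namespace URI.
interface XmlAttribute {
  namespace?: string;
  name: string;
  value: string;
}

interface XmlElement {
  namespace?: string;
  name: string;
  attributes?: XmlAttribute[];
  children?: (XmlElement | string)[];
}

function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function serialize(root: XmlElement): string {
  let prefixCounter = 0;

  const write = (el: XmlElement, inherited: Map<string, string>): string => {
    // Copy the inherited scope so declarations do not leak to siblings.
    const scope = new Map<string, string>();
    inherited.forEach((prefix, namespace) => scope.set(namespace, prefix));
    const declarations: string[] = [];

    // Returns the qualified name and declares the namespace on this element
    // if no prefix for it is in scope yet.
    const qualify = (namespace: string | undefined, name: string): string => {
      if (namespace === undefined) return name;
      let prefix = scope.get(namespace);
      if (prefix === undefined) {
        prefixCounter += 1;
        prefix = `ns${prefixCounter}`;
        scope.set(namespace, prefix);
        declarations.push(` xmlns:${prefix}="${escapeXml(namespace)}"`);
      }
      return `${prefix}:${name}`;
    };

    const tag = qualify(el.namespace, el.name);
    const attrs = (el.attributes ?? [])
      .map((a) => ` ${qualify(a.namespace, a.name)}="${escapeXml(a.value)}"`)
      .join("");
    const body = (el.children ?? [])
      .map((c) => (typeof c === "string" ? escapeXml(c) : write(c, scope)))
      .join("");
    const open = `<${tag}${declarations.join("")}${attrs}`;
    return body === "" ? `${open}/>` : `${open}>${body}</${tag}>`;
  };

  return write(root, new Map<string, string>());
}

// Example with made-up namespaces.
const serialized = serialize({
  namespace: "urn:example:tax",
  name: "Return",
  attributes: [{ name: "version", value: "1" }],
  children: [
    { namespace: "urn:example:tax", name: "Year", children: ["2024"] },
    { namespace: "urn:example:common", name: "Note", children: ["a < b"] },
  ],
});
```

In this example the Year element reuses the prefix declared on its parent, while Note gets its own declaration because its namespace is not in scope yet.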