Wednesday, December 22, 2010

Continual Aggregate Hub and the Data Warehouse

Development in this area has made this article obsolete.

The CAH will contain aggregates represented as XML documents. These aggregates are tuned for the usage patterns relevant to the process being executed, i.e. the primary products. Many other usage patterns exist that are less relevant, but not unimportant (for more secondary products). These are often not as time-critical for the main process and can wait. We see the data warehouse (DWH), or a more dedicated operational data store (ODS), in the role of fulfilling the purpose of the secondary products. (The CAH is an ODS to some extent.) The CAH emits all new aggregates produced, which enables the DWH to be more operational. The DWH can query and collect data at any interval it is capable of. The aggregates are also clearly defined and identified, which makes the ETL process simpler. Furthermore, all details remain available for querying in the CAH, so the DWH does not need to keep them. These capabilities lessen the burden on the DWH.
Although the CAH stores data as XML, the DWH may store it in whatever form best suits its purpose.
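The pull could be as simple as filtering on the delivery timestamp in the common aggregate header. Below is a minimal Java sketch; `AggregateHeader` and `deliveredSince` are hypothetical names standing in for the CAH query interface, not the actual API.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: the DWH pulls only aggregates delivered since its
// last ETL run, using the delivery timestamp from the common header.
class DwhExtract {

    // Minimal stand-in for the common aggregate header described in the text.
    record AggregateHeader(String owner, String producer, String type,
                           int fiscalYear, Instant deliveredAt) {}

    // Everything delivered strictly after the DWH's last ETL run.
    static List<AggregateHeader> deliveredSince(List<AggregateHeader> all, Instant lastRun) {
        return all.stream()
                  .filter(h -> h.deliveredAt().isAfter(lastRun))
                  .collect(Collectors.toList());
    }
}
```

Because the header is common to every aggregate type, this one extraction routine serves all secondary products; the DWH chooses its own interval.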
Creative Commons License
Continual Aggregate Hub and the Data Warehouse by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Continual data hub architecture and restaurant

The continual aggregate hub (CAH) has these core (albeit high level) architectural elements:
  • Component based and layered
  • Process oriented
  • Open standards
  • Object oriented
  • Test, re-run and simulate
  • and last: it should behave like a restaurant
CAH processing architecture

Component based and layered
The main concern is to protect the business logic from the technical domain. Freedom to deploy! This is more or less obvious, but the solution should consist of small, compact Modules with a clear functional purpose, embedded in a container that abstracts technical constraints from functional ones. This is roughly dependency injection and the Java container.
The components and the layers should be independent so that they can scale out.
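As a rough illustration of the container idea, here is a hypothetical Java sketch in which a functional Module receives its technical dependency through constructor injection; `AggregateStore`, `FeeCalculationModule` and the fee rule are all invented for illustration.

```java
// Hypothetical sketch: a Module with a clear functional purpose depends
// only on an abstract store; the container supplies the implementation
// (constructor injection), keeping technical concerns out of the logic.
class ModuleWiring {

    // Technical concern, abstracted away from the Module.
    interface AggregateStore {
        void put(String type, String owner, String xml);
    }

    // A functional Module: knows nothing about how aggregates are stored.
    static class FeeCalculationModule {
        private final AggregateStore store;

        FeeCalculationModule(AggregateStore store) { this.store = store; }

        void process(String owner, long fixedValue) {
            long fee = fixedValue / 4;  // placeholder business rule
            store.put("imposed-fee", owner, "<fee>" + fee + "</fee>");
        }
    }
}
```

Because the Module only sees the interface, it can be deployed against an in-memory grid, a database, or a test stub without change.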

The TaxInfo data store and cache is preferably some in-memory based architecture (e.g. GigaSpace…), with relevant services around it. A Module will deploy service components into this architecture. These services peek into (consume) or poke (produce) the aggregates that the Module owns. The services are used by humans or other systems. The Bounded Context of DDD spans vertically across the User Interface Architecture, TaxInfo and the Processing Architecture.

The User Interface Architecture is a browser-based GUI platform where the user can go from one page to another without knowing which system they are actually working with. We do not need any portal product, but some set of common components, standards and a common security architecture.

Process oriented
The Modules should cooperate by producing Aggregates. The architecture must continually produce Aggregates to facilitate simple consumption. This lays the foundation for parallel processing, sharding and linear scalability. It will also make the process more robust.
Manual and automated processes cooperate through process state and common Case understanding.
A functional domain consists of Modules for automated processing and Modules for human interaction. Human interaction is mainly querying or inspecting a set of data, or making manual changes to data.
In between processing modules there is either a queue or some sort of buffer, or the Module may process at will.

Open standards
The solution we implement will live for many years (20-30?), so we must choose open standards that we expect to last for some time, and architectural paradigms that will outlast the standards themselves. This stack will never consist of one technology.
We must protect our investment in business logic (see components based), and we must protect the data we collect and store.
As of today we say Java, HTML, and XML (for the permanent data store).

Object oriented
Business logic within our domain is best represented in an object-oriented manner. Any aggregate has an object-oriented model and logic with basic behaviour (a basic Module). Business logic can then operate on these aggregates and perform higher-level logic.
By having a 1:1 relationship between stored data and basic behaviour (in the application layer), locking and IO are reduced.
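A minimal Java sketch of that 1:1 idea, with invented names (`IncomeAggregate`, `ValueEntry`) and no claim about the real domain model:

```java
import java.util.List;

// Hypothetical sketch: one stored aggregate maps to one object structure
// that carries its basic behaviour, so loading and locking touch a single
// document instead of many relational rows.
class IncomeAggregate {

    // An Entity within the aggregate.
    record ValueEntry(String source, long amount) {}

    private final String owner;
    private final List<ValueEntry> values;

    IncomeAggregate(String owner, List<ValueEntry> values) {
        this.owner = owner;
        this.values = values;
    }

    // Basic behaviour that belongs with the data itself.
    long totalIncome() {
        return values.stream().mapToLong(ValueEntry::amount).sum();
    }

    String owner() { return owner; }
}
```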

Test, re-run and simulate
By having discrete processing steps with defined modules and aggregates, testing is enhanced.
All systems have errors, and these often affect data that has been produced over some time. By having defined aggregates and a journal archive of all events (the Process state component), it is possible to simply truncate the affected aggregates and re-run the corrected Modules.
Simulation is always on the wish-list of the business. Simulation modules can easily sit side-by-side with the real ones, by producing specific aggregates for the simulated result. Simulated aggregates will only be valid in a limited context and will not affect the real process flow.

The Continual Aggregate Hub restaurant
Think of the CAH as a restaurant. Within the restaurant there are tables with customers who place orders, waiters who serve them, and a kitchen with cooks. A table in the CAH is a collection of Aggregates, in our domain the “Tax Family”. (All tax families are independent and can be processed in parallel.) The waiters serve the tables that have changes that need to be served. The change results in an order with the relevant aggregates, which the waiter brings forward to the kitchen. All orders are placed in a queue (or order buffer of some sort), and the cooks process orders at capacity. It is important that the cook does not run to the shop for every ingredient; the kitchen must have the resources necessary for processing the order. So the kitchen is autonomous (and can scale by having more cooks, or specialized cooks), but of course has a defined protocol (types of orders and known aggregates). When the dish (aggregate) is finished, it is brought back to the table (it is stored).
So the main message is that the cook must not go to the shop while cooking. I think this is where many systems take the wrong approach: somewhere in the code the program does some search here or there. Don’t collect information as you calculate. Separate the concerns: collecting data, processing it, and storing the result!
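The restaurant can be sketched with a plain bounded queue in Java. `Kitchen`, `Order` and the dish format are all hypothetical; the point is only that the order already carries every aggregate the cook needs, so the cook never "runs to the shop" while cooking.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the restaurant: the waiter queues an Order that
// already contains all required aggregates; cooks work through the queue
// at their own capacity.
class Kitchen {

    // An order carries every "ingredient" (aggregate) up front.
    record Order(String table, List<String> aggregates) {}

    private final BlockingQueue<Order> orders = new ArrayBlockingQueue<>(100);

    // The waiter places an order; false means the buffer is full.
    boolean placeOrder(Order o) { return orders.offer(o); }

    // One cook takes the next order, if any, and produces a dish (aggregate).
    String cookOne() {
        Order o = orders.poll();
        if (o == null) return null;       // nothing to cook right now
        return "dish-for-" + o.table();   // placeholder: combine aggregates into a new one
    }
}
```

More cooks means more concurrent calls to cookOne(); because tax families are independent, they never contend for the same table.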

Creative Commons License
Continual data hub architecture and restaurant by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Tuesday, November 16, 2010

Concept for an aggregate-store and processing architecture

Continual aggregate hub

In our domain, tax administration, we are working on an information systems design for handling our challenges. We collect information about real-world assets (“values”) and calculate some fee on them. These assets have multi-dimensional characteristics. Within this information systems design, we define an aggregate-store and processing architecture that can handle large volumes and uptime, deal with rules and regulations that change from year to year, be robust and maintainable under tough lifetime requirements, and provide effective parallel processing of independent tax subjects.

We would like to present this design to you: the continual aggregate hub.
  • Continual; ongoing processing and multi-dimensional information (re-used in different business domains)
  • Aggregate; as defined in Domain Driven Design, stored as XML documents
  • Hub; one place to share Aggregates

We are inspired by Domain Driven Design, and have used design elements from Tuple Space, BASE, SOA, the CQRS pattern, the Operational Data Store (ODS), Integration Patterns, and experience from operating large distributed environments. Definitions from Domain Driven Design are used in the text: Modules, Entities, Aggregates, Services and Repositories.
We believe that other systems with similar characteristics may benefit from the design, and it is a great fit for In-Memory and Big Data platforms. 
(January 2013: A Proof of Concept is documented here, and we are now actually building our systems on this design)
(June 2013: This is now in production. BIG DATA advisory)

Illustration 1. Continual Data Hub

Continual aggregate hub (CAH)
This Repository is the core of the solution, and its task is to keep different sets of Aggregates. The repository is aggregate-agnostic. It does not impose any schema on them; it is up to the producer and consumers to understand the content (and for them it must be verbose). The only thing the repository mandates is a common header for all documents. The header describes who the information belongs to, who produced it, what type it is, for what time it is legitimate, and when it was delivered. This means that all Aggregates share these metadata. An Aggregate is data that some Module produced and that is meaningful for the domain the Module handles. The Aggregate is materialized in the Repository as an XML document, to provide flexibility and preserve the lifetime requirements. The Aggregate is the product and purpose of the Module that produced it. The Aggregate should be in a state that makes it valid for a period of time, and that is relevant for the domain.
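The mandated header could look something like the following Java sketch; the field names are assumptions derived from the description above, not the actual CAH schema.

```java
import java.time.Instant;

// Hypothetical sketch of the only schema the repository mandates: a common
// header shared by every stored aggregate, while the body stays opaque XML.
class StoredAggregate {

    record Header(String belongsTo,      // who the information belongs to (the subject)
                  String producedBy,     // the Module that produced it
                  String type,           // aggregate type, e.g. "fixed-values"
                  String legitimateFor,  // the period it is legitimate for, e.g. "2010"
                  Instant deliveredAt) {}  // when it was delivered

    private final Header header;
    private final String xmlBody;        // opaque to the repository

    StoredAggregate(Header header, String xmlBody) {
        this.header = header;
        this.xmlBody = xmlBody;
    }

    Header header() { return header; }
    String xmlBody() { return xmlBody; }
}
```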
Together with the CAH there is a process state store and some sort of process engine. Its task is to hold the event for each produced Aggregate and make sure that subsequent Services are executed (see Enterprise wide business process handling). This is our soft state and contains high-level events that are meaningful outside the Module. These Services may be Modules. This may very well follow the CQRS pattern.
(October 2014: We now know the design of the Case Container within "Process state & handling")

A Module lives from being fed Aggregates that have changed since its last processing. It may be one Aggregate or several, depending on the task and what content the Module needs to do its work. The feeding may be a live event, a batched set at some frequency, or the Module may feed when it feels like it. When a Module is finished with its task, its Aggregate (or several) is sent to the CAH. The Module is the sole owner of its Aggregates and is the only source for them. A Module should ideally have no state other than the Aggregates, and therefore has a temporary life in the processing environment. It may, though, be a large “silo” sort of system that just interacts with the CAH.
(Module and Aggregate Design)

An Aggregate is some distinct set of data that has loose coupling to other Aggregates, though they may share Entities. Just as the Modules are distinct because they are loosely coupled, the data must be too. Clearly some types of enterprise applications are not suited for this, if the data model is large and there is no clear loose coupling. The Aggregate encapsulates the enterprise data for parallel processing, ensures the lifetime requirements, and makes it easier to handle migration to newer systems. The Aggregate may also be tagged with security levels, so that we have a general way of handling access. Entities should map to objects that have behaviour, and an Aggregate should map to an object structure, providing a 1:1 mapping between data and the basic behaviour on them. This will also reduce IO and locking compared to classic OR-mapping. (see Aggregate persistence for the Enterprise Application )

The architecture provides for event-driven processes, but the consequence of one event is not necessarily handled in real time. Any Module being fed must be able to consume as best suited, or even re-run if there is a fault in the data produced. And there may very well be no Module to handle an event; the Module for the task may be introduced later. The events make the produced Aggregates form a logical queue, and we take ownership of the events (process state). The events must not be hidden within some messaging engine or queue product.
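An owned, append-only event log where each consumer keeps its own cursor gives exactly these properties: a late-arriving Module can consume from the start, and rewinding a cursor re-runs the work. A hypothetical Java sketch (`ProcessEventLog` and `Event` are invented names):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: produced-aggregate events live in an append-only
// log that we own, not inside a messaging product. Consumers keep their
// own cursors, so they can consume later, in batches, or rewind and re-run.
class ProcessEventLog {

    record Event(long seq, String aggregateType, String owner) {}

    private final List<Event> log = new ArrayList<>();

    synchronized Event append(String aggregateType, String owner) {
        Event e = new Event(log.size(), aggregateType, owner);
        log.add(e);
        return e;
    }

    // Everything after the consumer's cursor; rewinding the cursor re-runs.
    synchronized List<Event> readFrom(long cursor) {
        return List.copyOf(log.subList((int) cursor, log.size()));
    }
}
```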

This is continual handling of aggregates: the hub acts as a bus for its producing and consuming parties, but stores the aggregates for subsequent consumption. Data that would have moved from one module to another anyway (over an ESB) is stored centrally, once. This means that the modules participating in the process do not need any permanent store or state themselves. This is good because we can exchange one module for another implementation without migrating any data. It also means that things may be repeated (because of errors, maybe) and that a data consumer does not depend on the (uptime of the) module that produced it (loose coupling). The system that is the producer (owner) of some Aggregate must deliver functionality (some sort of plug-in) to the CAH, so that the Aggregate can be understood by a consuming system or shown in some GUI.

We believe that uptime requirements, the ability to change, and the ability to understand the total system are enhanced with this design. Functionally we isolate the production of an aggregate from the consumption of it, which will make it easier to reach our uptime requirements. Data is always prepared and ready. The CAH only handles querying and accepting new versions of an aggregate, not producing new aggregates. A major problem with many large systems is that nobody really knows what data the system produces, and we know for sure that a lot of intermediate data is stored unnecessarily. Large systems also tend to have too many responsibilities and get tied down. In our proposed architecture, if there is need for a new product (an aggregate), just make a new module and introduce a new aggregate. The data warehouse is an important element in our total architecture, for running reports, because we do not want to mess up our production flow with secondary products. (see CAH and Data Warehouse.html)

We also see that testing is enhanced, because we have defined input and output on a rich functional level, and the modules participating in the process do not need to know about the CAH. As for scaling, this design handles linear scalability nicely, since we now have independent sets of data (aggregates) that are organised on distinct sets of owners. Any calculation in the modules does (in most cases) not span different owners of aggregates.

No system is made from scratch, and the solution gives legacy systems a chance to participate and gives their aggregates the uptime that is required. With the right interfaces and aggregates, a legacy system can be connected to the solution at its own pace. A legacy system may be large, have its own state and take a long time to migrate. (Here we will have a problem with redundancy and things getting out of sync, but we know all previous data and state, so we are better equipped to handle these situations.) (see Migration strategy and the cloud)

A module may also include some manual tasks. In this context the task is propagated to a task list (a module handling manual tasks for users). The task points to a module where the human may do their work. It is transparent to the architecture whether a task is human or not. Only the module knows. (see CAH architecture.html)

As you may have noticed, the process flow is not orchestrated, but choreographed. The SOA view, where everything is processes and services, overlooks the importance of state and the fact that some sets of data must be seen together in a domain. So a major part of handling a process is actually done within the domain (in a module), and there is no loose coupling between the data and the process. So we say: respect the modules, and let modules cooperate in getting the enterprise process done. If there is a need for an orchestrated process, build a module for it.

Our domain requirements
Below is our wheel, which illustrates a continuous process that acts on information from the outside and responds back. We seek to respond to events in the real world as they happen, not later on. Information would be salary being paid to a person, the savings and interest of a bank account, or any IRS-form etc. This information is collected as Values, and the Values are subject to tax calculation or control purposes. We need to introduce new Values and validations quickly. To enable good-quality information we need to respond immediately with a validation result, and maybe also with the tax that should be paid. The wheel must handle different tax domains (inheritance, VAT, income tax…) side-by-side, because many tasks in such a process are generic and information is in many cases common. Another challenge is the variation in data formats and rules (within the same domain) from year to year, so the wheel must offer different handling based on legitimate periods (e.g. fiscal year) side-by-side. The wheel only handles information about real objects and the tax imposed on a subject, not the subjects themselves (address or id of persons, companies or organizations), nor the money that is the imposed fee. But of course, these are also processes (systems / modules) that the wheel must cooperate with.
A regular citizen with a normal economic life actually does not need to take any action. We collect and pre-fill everything for them.
On the volume side there are approx. 300 million Values on a yearly basis, which result in 20 million tax forms (Fixed values), and the system must handle 10 years of data (and corresponding logic) because of complaints and detected fraud.
We must also be able to put a new type of Value into production in less than 2 weeks.
Illustration 2. CAH Tax domain

Processing Modules

Reporter channel
Any subject that reports about other subjects they are handling, such as employers, payroll systems, bank systems or stock-trading systems. These report directly to us. Information is stored as Values ("assets" from the real world). This Module is highly generic, but will have different types of validation.
Self Service channel
Any subject that either wants to look at information reported on themselves, or wants to report own data. Own data may be from systems or via forms on the web. Information is stored as Values. This Module is highly generic, but will have different types of validation.
Validate, consolidate and fix
Information stored in Values is aggregated and fixed after a validation process where the legitimacy of the information is established. There are many groups of fixation: income, savings, property, etc.
There will exist Modules for each tax domain, and they will be quite complex and involve manual handling.
The product (aggregate) of this step is relevant for the self service Module, so that the subject can object to the result and claim other Values than the ones collected.
Fee calculation
A fee is calculated based on the Fixed values. The Module produces the Imposed fee.
This Module will exist in many flavours, each for different tax domains.
This Module interacts with the accounting system and produces the Deducted amount. Most subjects have tax prepaid directly by their employers or pay some pre-filled amount during the year.
Also at this step is the production of a tax-card back to the payroll systems instructing them to change tax-percentage if necessary.

The “Tuple Space” Repository with Aggregates. All aggregates have this in common: who the information belongs to (the tax subject), who produced it, what type it is (a reference to an XSD, maybe), for what time it is legitimate (e.g. fiscal year or period), and when it was created. An aggregate is never changed, but comes in new versions.
TaxInfo should internally have a high degree of partitioning freedom, so that we can scale “out of the box”.
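An aggregate that is never changed but comes in new versions suggests an append-only store where readers fetch the latest version for a given subject, type and legitimate period. A hypothetical Java sketch (`VersionedRepository` and `Key` are invented names, not the TaxInfo API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an aggregate is never updated in place; a new
// version is appended under its key, and readers ask for the latest one.
class VersionedRepository {

    record Key(String subject, String type, int fiscalYear) {}

    private final Map<Key, List<String>> versions = new HashMap<>();

    // Appends a new version and returns its version number (starting at 1).
    synchronized int store(Key key, String xml) {
        List<String> list = versions.computeIfAbsent(key, k -> new ArrayList<>());
        list.add(xml);
        return list.size();
    }

    // The newest version for this key, or null if none exists.
    synchronized String latest(Key key) {
        List<String> list = versions.get(key);
        return list == null ? null : list.get(list.size() - 1);
    }
}
```

Because keys are independent per subject, the map can be partitioned on subject, which is what gives the “out of the box” scaling freedom mentioned above.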
Process state & handling
Soft state, event log and choreography.
The Process state and its xml-documents are structured and referenced in the design of the Case Container.

Creative Commons License
Concept for an aggregate-store and processing architecture by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.