Semantic richness of source content formats

Post by **FirstLight** » Wed Mar 02, 2022 2:34 pm

I've been unabashedly public about the differences among content source formats as they relate to the emerging world of graph-driven content and microcontent. I define microcontent not as whole files but as substructures within files (for example, just the set of steps within a multipage topic file, one of countless examples). The differentiating factor, I believe, is the granularity at which content can be retrieved, assembled, and delivered.
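To make the "set of steps" example concrete, here is a minimal sketch in Python using only the standard library. The topic content, its `id`, and the `extract_steps` helper are all invented for illustration; the element names (`task`, `taskbody`, `steps`, `step`, `cmd`) follow the DITA task model. It is not a production extractor, just a demonstration that containment lets you address a substructure by path.

```python
import xml.etree.ElementTree as ET

# Hypothetical DITA task topic, invented for illustration only.
TOPIC = """<task id="replace-filter">
  <title>Replace the filter</title>
  <taskbody>
    <prereq>Power off the unit.</prereq>
    <steps>
      <step><cmd>Open the access panel.</cmd></step>
      <step><cmd>Remove the old filter.</cmd></step>
      <step><cmd>Insert the new filter.</cmd></step>
    </steps>
  </taskbody>
</task>"""


def extract_steps(topic_xml):
    """Return the text of each step command in a task topic (hypothetical helper)."""
    root = ET.fromstring(topic_xml)
    # The path mirrors the containment hierarchy: task > taskbody > steps > step > cmd.
    commands = root.findall("./taskbody/steps/step/cmd")
    return ["".join(cmd.itertext()).strip() for cmd in commands]


steps = extract_steps(TOPIC)
for number, text in enumerate(steps, start=1):
    print(number, text)
```

The prerequisite and the title are skipped without any guessing, because the markup says which container is the steps. Doing the same against a Markdown file would mean inferring which numbered list is the procedure from headings or position.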

I believe that DITA (or any DOM-based format) is infinitely superior to formats like Markdown or RST because it is predictably machine-processable, thanks to multi-level containment and the separation of structure from logic. A format like Markdown, for example, is fine for casual contribution where contributors can't create structured content (and it can be embedded in a DITA collection or transformed, no less). Shops have gone to these semantically poor source formats for ease of authoring and cost (free editors abound). However, the ease-of-authoring argument is all but moot now that far more visual structured authoring tools are on the market. I've been harsh in publicly asserting that those who thought DITA was getting "long in the tooth" and moved to non-DOM formats as an overall strategy are in for a rude awakening when they reach the advanced ontological/graph-based AI/ML world that's now on the near horizon; they will rue the day when they find themselves craving microcontent and ML-driven Content-as-a-Service (CaaS).

Now, that doesn't mean we can ignore these semantically poor source formats. More often than not we have to live with them and include them alongside binary large objects (BLOBs) such as PowerPoint decks, Excel workbooks, PDFs, MP4s, raster graphics, and so on.

This makes me ask a series of questions ripe for discussion:

So what do we do with mixed content formats?
Will folks who use formats such as MD and RST simply be limited to whole-file object usage?
Will they add semantics to create subfile objects and containment for classification and extraction? Will doing that make those formats infinitely more complex than the structured formats they originally wanted to avoid in the first place?
Do we use DITA to wrap non-DITA objects to classify and use them?
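
As a starting point for discussing that last question, here is a rough Python sketch of one way the wrapping could look: a DITA map whose `topicref` elements point at non-DITA files and carry classification keywords in `topicmeta`. The file names, keywords, and the `wrap_in_map` helper are hypothetical, and the `format` values are assumptions (check what your processor actually expects). Note that this classifies whole files only; it does not create any subfile structure.

```python
import xml.etree.ElementTree as ET


def wrap_in_map(title, resources):
    """Build a DITA map that references and classifies non-DITA files.

    resources: iterable of (href, format, keywords) tuples (hypothetical shape).
    """
    ditamap = ET.Element("map")
    ET.SubElement(ditamap, "title").text = title
    for href, fmt, keywords in resources:
        # The format attribute tells the processor the target is not a DITA topic.
        ref = ET.SubElement(ditamap, "topicref", href=href, format=fmt)
        meta = ET.SubElement(ref, "topicmeta")
        keyword_list = ET.SubElement(meta, "keywords")
        for word in keywords:
            ET.SubElement(keyword_list, "keyword").text = word
    return ditamap


# Invented example resources; none of these files come from a real project.
ditamap = wrap_in_map(
    "Filter maintenance",
    [
        ("install.md", "markdown", ["installation"]),
        ("specs.pdf", "pdf", ["specifications"]),
        ("demo.mp4", "mp4", ["video"]),
    ],
)
print(ET.tostring(ditamap, encoding="unicode"))
```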