The Future of Internet Publishing
by Norbert H. Mikula
Senior Online Information Architect, DataChannel, Inc.
Introduction
Traditional Internet publishing
Examples of Internet publishing formats
Advantages of XML to Internet publishing
XML and Distributed Computing
XML and Meta-Data
Enhanced hyperlinks through XML
Emerging stylesheet standards
DataChannel and XML
Conclusion
References
Introduction
Little has captured the imagination of publishers like the idea of disseminating information over the Internet. This paper looks first at the requirements that shaped Internet publishing in the past, then at some of the more successful formats. Next, the focus shifts to emerging stylesheet standards and XML [eXtensible Markup Language], each of which fills important gaps in Internet publishing.
Finally, the conceptual model of Internet publishing is challenged and revised. Network administrators are facing the demand to accommodate Internet publishing with existing bandwidth, while for business units the mandate is making sure that the right information gets to the right person at the right time. The next generation of publishing standards will satisfy these goals.
New standards for document formatting, metadata, document rendering, and for document/information interchange protocols all point naturally to the next conceptual leap: meta-content routing.
Traditional Internet publishing
Delivery of structured information via the World Wide Web has long been a challenge to professional publishers. There are four basic requirements that determine the optimal format for electronic publishing:
- structure/markup
- standards
- rendering
- acceptance/support.
Structure/Markup
Structure and markup here refer to a format's capability to preserve contextual information and semantics about certain parts of a document. Structure allows the representation of the hierarchy that is inherent to most documents dealt with on a day-to-day basis. The simplest example is a book, which consists of a title, followed by a table of contents, followed by chapters that are broken down into sections and subsections. It is structure that allows statements like "I want to read the 3rd section of the 1st chapter of that book."
Markup enables the creator of a document to preserve information about parts of a document. For instance, explicitly marking the title lets software identify it reliably, which enables powerful processing of the document.
Document formats vary in terms of how much structure and markup they allow.
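The book example above can be made concrete with a small sketch. It assumes a hypothetical XML vocabulary (the element names book, chapter, section, and title are illustrative only and do not come from any published DTD) and uses Python's standard xml.etree.ElementTree module to answer the request for "the 3rd section of the 1st chapter".

```python
import xml.etree.ElementTree as ET

# Hypothetical book markup; the element names are assumptions made for
# illustration and are not taken from any standard document type.
BOOK_XML = """
<book>
  <title>Internet Publishing</title>
  <chapter>
    <title>Formats</title>
    <section><title>PDF</title></section>
    <section><title>SGML</title></section>
    <section><title>HTML</title></section>
  </chapter>
  <chapter>
    <title>Stylesheets</title>
    <section><title>Overview</title></section>
  </chapter>
</book>
"""


def section_title(book_xml: str, chapter: int, section: int) -> str:
    """Return the title of the given section, counting from 1."""
    root = ET.fromstring(book_xml.strip())
    # ElementTree's limited XPath support includes 1-based positions.
    node = root.find(f"./chapter[{chapter}]/section[{section}]/title")
    if node is None:
        raise LookupError(f"no section {section} in chapter {chapter}")
    return node.text


# Markup: the title is explicitly tagged, so it can be read directly.
book_title = ET.fromstring(BOOK_XML.strip()).findtext("title")

# Structure: "the 3rd section of the 1st chapter".
third_section = section_title(BOOK_XML, chapter=1, section=3)
```

Neither lookup depends on how the book is laid out on a page; both rely only on the hierarchy and the tags.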
Standards
Publishers must manage large collections of documents, often of a similar type and on similar topics. It is in their interest to create documents using formats that can be processed or modified for a long time after the original creation.
A standard should also be independent, not under the control of a single vendor. Industry consortiums or similar groups of widespread participants are better suited to control a standard.
Rendering
Human beings are visual. Information needs to be prepared in a way that is visually appealing to the reader. Although parts of the process may be machine-to-machine transfer, at some point a person becomes involved.
Rendering information should always be kept separate from the content of a document. This guarantees that the rendering can be context driven. The content of a document might be rendered as a paper print-out from a laser printer, audio from a speech-synthesis program, or the textured output of a Braille device.
As rendering is of special importance to electronic publishing, a separate section is devoted to the discussion of this particular subject.
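To illustrate what keeping rendering separate from content can look like, here is a minimal sketch under stated assumptions. The article/title/para vocabulary and both renderer functions are hypothetical; they stand in for real stylesheets and output devices and are not part of any standard.

```python
import xml.etree.ElementTree as ET

# Content only: nothing in this (hypothetical) markup says how it should look.
DOC = (
    "<article>"
    "<title>XML News</title>"
    "<para>XML separates content from style.</para>"
    "</article>"
)


def render_print(root: ET.Element) -> str:
    """Plain-text layout, e.g. for a printer: upper-case, underlined title."""
    title = root.findtext("title")
    lines = [title.upper(), "=" * len(title)]
    lines += [para.text for para in root.findall("para")]
    return "\n".join(lines)


def render_speech(root: ET.Element) -> str:
    """A single line of text that could be handed to a speech synthesizer."""
    parts = [f"Title: {root.findtext('title')}."]
    parts += [para.text for para in root.findall("para")]
    return " ".join(parts)


# The rendering is chosen by context (the target medium), not by the document.
RENDERERS = {"print": render_print, "speech": render_speech}


def render(doc_xml: str, medium: str) -> str:
    return RENDERERS[medium](ET.fromstring(doc_xml))


printed = render(DOC, "print")
spoken = render(DOC, "speech")
```

The document is written once; supporting another medium means adding a renderer, not editing the content.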
Acceptance/Support
Acceptance and support of a document format is especially important when the targeted audience is large (everyone on the Internet, for example). In such a case, the tools to read and manipulate the format must be easily available.
This paper evaluates four document formats with respect to the criteria defined above: PDF, SGML, HTML, and XML. All are important approaches to Internet publishing.
Examples of Internet publishing formats
PDF
PDF (Portable Document Format) is a proprietary format from Adobe Systems, Inc. that enables publishers to maintain a consistent look and feel of published documents across different platforms. In terms of maintaining a consistent page layout across platforms, PDF is a powerful tool. What PDF does not do, however, is preserve structure and contextual information about a document. It is also a proprietary format of a single company, and requires a proprietary reader.
This rules out the employment of PDF in business applications that require machine-driven post-processing and other tasks that require indexing and querying. Finally, PDF does not separate rendering and content, which makes it unsuitable for many application scenarios.
SGML
SGML (Standard Generalized Markup Language) is an international standard (ISO 8879) for the markup of electronic documents. SGML enables the creation of documents with varying degrees of structure and contextual information. SGML has a high level of acceptance in the professional publishing industry and is used successfully in many large-scale systems.
Transferring SGML documents from one place to another is ideal since no information is lost during that process. An SGML-based electronic publishing system seems an obvious solution.
The problem with SGML is its complexity. SGML-based Internet publishing requires that all participating parties use SGML systems in their existing infrastructure, and the effort it takes to understand and build such systems makes SGML very costly. SGML software is traditionally very expensive, and SGML experts are hard to find (when found, their salaries tend to be "upper level"). For these reasons, SGML is not widely used. The SGML solution breaks down on the acceptance criterion.
HTML
The Hypertext Markup Language (HTML) is an application of SGML. HTML was specifically designed to provide a lightweight means for the markup of documents to be published on the World Wide Web. The widespread success of HTML can be attributed to many factors. The two most important are probably its restricted vocabulary (making it easy to understand) and its support by numerous tools.
As successful as HTML has been, certain problems have arisen. The simplicity of HTML resulted in documents with limited rendering information. There is very little opportunity to include information about the document itself. Finally, HTML has only limited support of hierarchies.
Publishers that maintain their documents in SGML format can translate these repositories to HTML for publishing on the Internet. The translation is lossy, however: once a document is converted to HTML, much of the richness of the original is gone.
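The loss described here can be sketched in a few lines. The example assumes a hypothetical source vocabulary (record, author, publisher) and a deliberately naive converter that maps every field to an HTML paragraph; real SGML-to-HTML conversions are more elaborate, but the effect on semantics is similar in kind.

```python
import xml.etree.ElementTree as ET

# Hypothetical richly tagged source: each element name says what the text means.
SOURCE = (
    "<record>"
    "<author>Norbert H. Mikula</author>"
    "<publisher>DataChannel, Inc.</publisher>"
    "</record>"
)


def to_html(xml_text: str) -> str:
    """Naive conversion: every child element becomes a generic <p>."""
    root = ET.fromstring(xml_text)
    body = ET.Element("body")
    for child in root:
        paragraph = ET.SubElement(body, "p")
        paragraph.text = child.text
    return ET.tostring(body, encoding="unicode")


html = to_html(SOURCE)

# Before conversion the roles are explicit; afterwards they are identical.
source_tags = [child.tag for child in ET.fromstring(SOURCE)]
html_tags = [child.tag for child in ET.fromstring(html)]
```

After the conversion a program can no longer tell which paragraph holds the author and which the publisher, although a human reader might still guess.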
XML
Extensible Markup Language (XML) is an effort of leading SGML experts to combine the power of SGML with the acceptance and simplicity of HTML. These experts are coordinated within a working group of the World Wide Web Consortium (W3C).
XML is a subset of SGML which to a large extent is formulated by simply leaving out features which are rarely used or cause problems in terms of processing speed.
The W3C is the organization that controls the standardization of many WWW-related formats and protocols, such as HTML and HTTP. Because it is controlled by the W3C, XML has all the characteristics of a standard. A further reason to consider XML a reliable standard is that it is explicitly compatible with SGML: any system conforming to the SGML standard is also able to read and process XML documents.
Advantages of XML to Internet publishing
Since its first days in the public light, XML has been very successful. This success can be seen in the presence of a multitude of free or inexpensive software systems. XML has the clear potential to become the lingua franca for information exchange on the WWW.
XML is not in competition with SGML or HTML. Rather, XML has been designed to fill the gap between the two standards.
© 1998 DataChannel, Inc.
Copyright 2002 Jupitermedia Corporation, All Rights Reserved.