- Text, HTML, and XML
- Web pages are most often created using HTML (HyperText Markup
Language). A Web page can, however, consist of just plain text, or a
combination of HTML and pre-formatted text.
A great deal of information is lost in the process of converting data
into HTML:
- <H1>New Millennium Software Company</H1>
<P>
144 West Villa Theresa Dr.<BR>
Phoenix, AZ 85023<BR>
Telephone: 602-368-8141
XML (eXtensible Markup Language) is a more structured language for
the representation of data on the Web:
- <COMPANY>
<NAME>New Millennium Software Company</NAME>
<ADDRESS>
<STREET>144 West Villa Theresa</STREET>
<CITY>Phoenix</CITY>
<STATE>AZ</STATE>
<ZIP>85023</ZIP>
</ADDRESS>
<PHONE>602-368-8141</PHONE>
</COMPANY>
- Parsing
- Parsing is the process of reading a Text, HTML, or XML Web page to
discover the structure of the document. The result of parsing is an
object-based representation of the elements of a Web page.
Objects correspond to the HTML or XML elements that are discovered
during the parsing. After parsing, arrays of like objects can be
accessed by either zero-based indexing or wildcard matching.
For the HTML example above, object references can be used to access
the company address and phone number:
- doc.p[0].line[0-1].text
- doc.p[0].line['*Telephone*'].text
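The two references above can be imitated with Python's standard library. This is only a sketch of the idea, not the product's own object model: the `ParagraphLines` class and `parse_lines` helper are hypothetical names, and Python's `fnmatch` stands in for the wildcard matching.

```python
from fnmatch import fnmatch
from html.parser import HTMLParser

HTML = """<H1>New Millennium Software Company</H1>
<P>
144 West Villa Theresa Dr.<BR>
Phoenix, AZ 85023<BR>
Telephone: 602-368-8141
"""


class ParagraphLines(HTMLParser):
    """Collect the text of each <P>, split into lines at every <BR>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # one list of lines per <P>
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "p":
            self.paragraphs.append([""])
            self._in_p = True
        elif tag == "br" and self._in_p:
            self.paragraphs[-1].append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1][-1] += data


def parse_lines(html):
    parser = ParagraphLines()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return [[line.strip() for line in p] for p in parser.paragraphs]


paragraphs = parse_lines(HTML)
# Zero-based indexing: lines 0 and 1 of the first paragraph.
address = paragraphs[0][0:2]
# Wildcard matching: every line that contains "Telephone".
phone = [line for line in paragraphs[0] if fnmatch(line, "*Telephone*")]
print(address)
print(phone)
```

The address has to be found by position and the phone number by its label, because the HTML itself does not say which line is which.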
For the XML example above, the object references are more
straightforward:
- doc.company[0].address[0].text
- doc.company[0].phone[0].text
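A comparable sketch for the XML case, using Python's `xml.etree.ElementTree` as a stand-in parser (an assumption; the references above belong to a different object model). Because the elements are named, no positional guessing or wildcard is needed.

```python
import xml.etree.ElementTree as ET

XML = """<COMPANY>
  <NAME>New Millennium Software Company</NAME>
  <ADDRESS>
    <STREET>144 West Villa Theresa</STREET>
    <CITY>Phoenix</CITY>
    <STATE>AZ</STATE>
    <ZIP>85023</ZIP>
  </ADDRESS>
  <PHONE>602-368-8141</PHONE>
</COMPANY>"""

company = ET.fromstring(XML)

# findall() returns a list of like elements, accessed by zero-based index.
address = company.findall("ADDRESS")[0]
phone = company.findall("PHONE")[0]

# The text of ADDRESS is the text of its child elements, in document order.
address_text = " ".join(part.text.strip() for part in address)
print(address_text)
print(phone.text)
```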
- Pattern Matching
- Pattern matching is an alternate way of collecting arrays of objects
from a parsed document.
In pattern matching, the elements of an HTML or XML document are
viewed as a stream of tokens. For example, given the following HTML:
- <UL>
<LI> Games for only: <B>$19.95</B>
<LI> ART: $100.95
<LI> Novel for only:
<B>$19.95</B>
</UL>
A pattern of "LI B" would collect every instance where the <LI> and
<B> tokens occur in the specified order; in this case, only the lines
for Games and Novel would be collected.
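One possible reading of this token-stream matching, sketched in Python. The `Tokenizer` class and the `tokenize` and `match_pattern` helpers are hypothetical, and the sketch assumes that a pattern matches start tags that follow one another directly in the stream.

```python
from html.parser import HTMLParser

HTML = """<UL>
<LI> Games for only: <B>$19.95</B>
<LI> ART: $100.95
<LI> Novel for only:
<B>$19.95</B>
</UL>"""


class Tokenizer(HTMLParser):
    """Flatten a document into a stream of [TAG, text] tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append([tag.upper(), ""])

    def handle_data(self, data):
        # Simplification: text is attached to the most recent start tag.
        if self.tokens:
            self.tokens[-1][1] += data


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def match_pattern(tokens, pattern):
    """Return the text of every run of tokens whose tags equal the pattern."""
    names = pattern.split()
    hits = []
    for i in range(len(tokens) - len(names) + 1):
        window = tokens[i:i + len(names)]
        if [tag for tag, _ in window] == names:
            hits.append([text.strip() for _, text in window])
    return hits


hits = match_pattern(tokenize(HTML), "LI B")
print(hits)
```

The ART item is skipped because its <LI> token is followed by another <LI> rather than by a <B>.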
- HTTP/HTTPS
- HTTP (Hypertext Transfer Protocol) is the protocol used to transfer
documents over the Web. HTTPS requires SSL (Secure Sockets Layer)
libraries, available as a separate product from webMethods.
Web Automation applications behave exactly like Web browsers in their
use of HTTP/HTTPS to submit requests to Web servers. In fact, Web
servers cannot distinguish a Web Automation application from a Web
browser.
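To see why the server cannot tell the two apart, it helps to look at what actually travels over the connection. The sketch below only builds the bytes of a minimal HTTP/1.1 GET request and sends nothing; the helper name `build_get_request`, the host `www.example.com`, and the header values are illustrative assumptions.

```python
def build_get_request(host, path="/", user_agent="Mozilla/5.0"):
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        # The client describes itself; the server has no way to verify it.
        f"User-Agent: {user_agent}",
        "Accept: text/html,application/xml,text/plain",
        "Connection: close",
        "",  # a blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_get_request("www.example.com", "/index.html")
print(request.decode("ascii"))
```

Any program that writes these bytes to a connection, whether a browser or an automation application, looks the same from the server's side.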
- Web Servers
- Web servers respond to HTTP requests by delivering a stream of data
(typically Text, HTML, or XML) to a calling application (typically a
browser).
Web servers can deliver documents from a local file system, invoke
CGI-BIN scripts, or access databases and legacy systems through any
number of integration technologies. Regardless of the source of the
data, Web servers always speak HTTP.
Web Automation applications leverage the fact that Web servers
provide a common protocol for requesting data from diverse back-end
systems.
UNICODE: NEW STANDARD
A standard for representing characters as integers. Unlike ASCII,
which defines 7-bit codes (usually stored one per 8-bit byte), Unicode
was originally designed around 16-bit codes, which can represent more
than 65,000 unique characters. The standard has since grown beyond 16
bits and now has room for more than one million code points. This is a
bit of overkill for English and Western-European languages, but it is
necessary for some other languages, such as Greek, Chinese and
Japanese. Many analysts believe that as the software industry becomes
increasingly global, Unicode will eventually supplant ASCII as the
standard character coding format.
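A short Python sketch of "characters as integers". The `describe` helper is a hypothetical name; it reports a character's code point, whether that code point falls in the ASCII range, and how many bits its UTF-16 encoding occupies. The sample characters are written as escapes: Greek capital omega and two CJK ideographs.

```python
def describe(ch):
    """Return (code point, is ASCII?, bits used by the UTF-16 encoding)."""
    code_point = ord(ch)
    utf16_bits = len(ch.encode("utf-16-be")) * 8
    return code_point, code_point < 128, utf16_bits


for ch in ["A", "\u03a9", "\u4e2d", "\u65e5"]:
    code_point, is_ascii, bits = describe(ch)
    print(f"U+{code_point:04X} ascii={is_ascii} utf16_bits={bits}")
```

Each of these characters fits in a single 16-bit unit; characters outside that range are encoded in UTF-16 as a pair of 16-bit units.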
A document type definition (DTD) is a series
of definitions for element types, attributes, entities and notations.
A DTD provides formal markup declarations.
Markup declarations:
<!ELEMENT Q-AND-A (QUESTION,ANSWER)+>
<!-- This allows: question, answer, question, answer ... -->
<!ELEMENT QUESTION (#PCDATA)>
<!-- Questions are just made up of text -->
<!ELEMENT ANSWER (#PCDATA)>
<!-- Answers are just text -->
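The content model for Q-AND-A can be read as a regular expression over the child element names. Python's standard library does not validate documents against a DTD, so the sketch below hand-rolls that one check; `matches_q_and_a` is a hypothetical helper, not a general DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# (QUESTION,ANSWER)+ : one or more question/answer pairs, in that order.
CONTENT_MODEL = re.compile(r"(QUESTION,ANSWER,)+")


def matches_q_and_a(xml_text):
    """Check that the root is Q-AND-A and its children fit the content model."""
    root = ET.fromstring(xml_text)
    if root.tag != "Q-AND-A":
        return False
    sequence = "".join(child.tag + "," for child in root)
    return CONTENT_MODEL.fullmatch(sequence) is not None


good = ("<Q-AND-A><QUESTION>Why?</QUESTION>"
        "<ANSWER>Because.</ANSWER></Q-AND-A>")
bad = ("<Q-AND-A><QUESTION>Why?</QUESTION>"
       "<QUESTION>How?</QUESTION></Q-AND-A>")
print(matches_q_and_a(good), matches_q_and_a(bad))
```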
Well-formedness and validity
XML has two notions of correctness. A well-formed document is one
whose markup is intelligible: every element is properly opened,
closed, and nested. Validity means using the right elements in the
appropriate locations: a valid document declares conformance to a DTD
and follows its declarations.
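Well-formedness is the part any XML parser checks. A minimal sketch with Python's `xml.etree.ElementTree`, which rejects markup that is not well-formed but does not check validity against a DTD; `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<A><B>text</B></A>"))  # properly nested
print(is_well_formed("<A><B>text</A></B>"))  # overlapping tags
```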
HYPERLINKS: Extended Links
XLink provides a notation for combining information from related
links; following only one of several related links can give a partial,
and therefore misleading, picture. XML extended links can point to
multiple resources: instead of linking a word to a single definition,
you can link it to several definitions simultaneously.
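A sketch of what such a link can look like and how a program might collect its targets. The element names (`definitions`, `source`) and the `example.com` URLs are made up for illustration; the `xlink:type` and `xlink:href` attributes and the namespace URI come from the W3C XLink recommendation.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

DOC = """<definitions xmlns:xlink="http://www.w3.org/1999/xlink"
             xlink:type="extended">
  <source xlink:type="locator" xlink:href="http://example.com/dictionary#term"/>
  <source xlink:type="locator" xlink:href="http://example.com/glossary#term"/>
  <source xlink:type="locator" xlink:href="http://example.com/encyclopedia#term"/>
</definitions>"""


def locators(xml_text):
    """Return the href of every locator inside an extended link."""
    root = ET.fromstring(xml_text)
    if root.get(f"{{{XLINK}}}type") != "extended":
        return []
    return [
        child.get(f"{{{XLINK}}}href")
        for child in root
        if child.get(f"{{{XLINK}}}type") == "locator"
    ]


targets = locators(DOC)
print(targets)
```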
Stylesheets
Stylesheets let us give Web pages a personalized visual format in the
style we want. Cascading Style Sheets (CSS) provide a standardized way
of specifying the visual formatting of Web pages.
Extensible Stylesheet Language (XSL) combines many features of CSS
with features of ISO's DSSSL stylesheet language.
Like XML, XSL is extensible.
Module for XSL
The Perl module XML::XSLT implements the W3C's XSLT specification.
XML::XSLT makes use of XML::DOM and LWP::Simple,
while XML::DOM uses XML::Parser. Therefore XML::Parser, XML::DOM
and LWP::Simple have to be installed properly for XML::XSLT to
run. IE5 and IE6, by contrast, have a DOM implementation built in.

The stylesheets and the documents may be passed as filenames, file
handles, regular strings, string references or DOM trees. Functions
that require sources (e.g. new) will accept either a named parameter
or simply the argument.
Either of the following is allowed:
my $xslt = XML::XSLT->new($xsl);
my $xslt = XML::XSLT->new(Source => $xsl);
In documentation, the named parameter `Source' is always shown, but it
is never required.
- new(Source => $xml [, %args])
- Returns a new XSLT parser object. Valid flags are:
- DOMparser_args
- Hashref of arguments to pass to the XML::DOM::Parser object's
parse method.
- variables
- Hashref of variables and their values for the stylesheet.
- base
- Base of URL for file inclusion.
- debug
- Turn on debugging messages.
- warnings
- Turn on warning messages.
- indent
- Starting amount of indentation for debug messages. Defaults to 0.
- indent_incr
- Amount to indent each level of debug message. Defaults to 1.
- open_xml(Source => $xml [, %args])
- Gives the XSLT object new XML to process. Returns an
XML::DOM object corresponding to the XML.
- base
- The base URL to use for opening documents.
- parser_args
- Arguments to pass to the parser.
- open_xsl(Source => $xml [, %args])
- Gives the XSLT object a new stylesheet to use in processing
XML. Returns an XML::DOM object corresponding to the stylesheet.
Any arguments present are passed to the XML::DOM::Parser.
- base
- The base URL to use for opening documents.
- parser_args
- Arguments to pass to the parser.
- process(%variables)
- Processes the previously loaded XML through the stylesheet using the
variables set in the argument.
- transform(Source => $xml [, %args])
- Processes the given XML through the stylesheet. Returns an
XML::DOM object corresponding to the transformed XML. Any arguments
present are passed to the XML::DOM::Parser.
- serve(Source => $xml [, %args])
- Processes the given XML through the stylesheet. Returns a string
containing the result. Example:
use XML::XSLT qw(serve);
# $xsl holds the stylesheet and $xml the document to transform
my $xslt = XML::XSLT->new($xsl);
print $xslt->serve($xml);
- If true, then prepends the appropriate HTTP headers (e.g.
Content-Type, Content-Length) to the result.
Defaults to true.
- xml_declaration
- If true, then the result contains the appropriate <?xml?> header.
Defaults to true.
- xml_version
- The version of the XML. Defaults to 1.0.
- doctype
- The type of DOCTYPE this document is. Defaults to SYSTEM.
- toString
- Returns the result of transforming the XML with the stylesheet as a
string.
- to_dom
- Returns the result of transforming the XML with the stylesheet as an
XML::DOM object.
- media_type
- Returns the media type (aka mime type) of the object.
- dispose
- Executes the dispose method on each XML::DOM object.