java.lang.Object org.w3c.tidy.Lexer
public class Lexer
Lexer for html parser.
Given a file stream fp it returns a sequence of tokens.
- GetToken(fp) gets the next token
- UngetToken(fp) provides one level undo
The tags include an attribute list:
- linked list of attribute/value nodes
- each node has 2 null-terminated strings.
- entities are replaced in attribute values
White space is compacted if not in preformatted mode. If not in preformatted mode then leading white space is discarded and subsequent white space sequences compacted to single space chars.
If XmlTags is no then Tag names are folded to upper case and attribute names to lower case.
Not yet done:
- Doctype subset and marked sections
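The white space rule in the class description (discard leading white space, compact later runs to a single space) can be illustrated with a minimal sketch. The class `WhitespaceSketch` and its `compact` method are hypothetical names used only for illustration; they are not part of JTidy, and the real lexer works on its byte buffer rather than on a `String`.

```java
final class WhitespaceSketch {
    /**
     * Hypothetical helper, not part of JTidy: drops leading white space and
     * collapses every later run of white space into one space character.
     */
    static String compact(String text) {
        StringBuilder out = new StringBuilder();
        // Starts as true so that leading white space is discarded.
        boolean wasWhite = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean white = c == ' ' || c == '\t' || c == '\n' || c == '\r';
            if (white) {
                if (!wasWhite) {
                    out.append(' '); // first white space char of a run
                }
                wasWhite = true;
            } else {
                out.append(c);
                wasWhite = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + compact("  Hello \n\t world") + "]");
    }
}
```

The `wasWhite` flag plays the role that the `waswhite` field is described as having: it remembers whether the previous character was white space. Note that in this sketch a trailing run is compacted to a single space, not removed.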
Field Summary | |
---|---|
protected short |
badAccess
for accessibility errors. |
protected short |
badChars
for bad char encodings. |
protected boolean |
badDoctype
set if html or PUBLIC is missing. |
protected short |
badForm
for mismatched/mispositioned form tags. |
protected short |
badLayout
for bad style errors. |
protected int |
columns
at start of current token. |
protected Configuration |
configuration
configuration. |
protected int |
doctype
version as given by doctype (if any). |
protected short |
errors
count of errors. |
protected java.io.PrintWriter |
errout
error output stream. |
protected boolean |
excludeBlocks
Netscape compatibility. |
protected boolean |
exiled
true if moved out of table. |
static short |
IGNORE_MARKUP
state: ignore markup. |
static short |
IGNORE_WHITESPACE
state: ignore whitespace. |
protected StreamIn |
in
file stream. |
protected Node |
inode
Inline stack for compatibility with Mosaic. |
protected int |
insert
for inferring inline tags. |
protected boolean |
insertspace
when space is moved after end tag. |
protected java.util.Stack |
istack
stack. |
protected int |
istackbase
start of frame. |
protected boolean |
isvoyager
true if xmlns attribute on html element. |
protected byte[] |
lexbuf
Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements. |
protected int |
lexlength
allocated. |
protected int |
lexsize
used. |
protected int |
lines
lines seen. |
static short |
MIXED_CONTENT
state: mixed content. |
static short |
PREFORMATTED
state: preformatted. |
protected boolean |
pushed
true after token has been pushed back. |
protected Report |
report
report. |
protected Node |
root
Root node is saved here. |
protected boolean |
seenEndBody
already seen end body tag? |
protected boolean |
seenEndHtml
already seen end html tag? |
protected short |
state
state of lexer's finite state machine. |
protected Style |
styles
used for cleaning up presentation markup. |
protected Node |
token
current node. |
protected int |
txtend
end of current node. |
protected int |
txtstart
start of current node. |
protected short |
versions
bit vector of HTML versions. |
protected short |
warnings
count of warnings in this document. |
protected boolean |
waswhite
used to collapse contiguous white space. |
Constructor Summary | |
---|---|
Lexer(StreamIn in,
Configuration configuration,
Report report)
Instantiates a new Lexer. |
Method Summary | |
---|---|
void |
addByte(int c)
Adds a byte to lexer buffer. |
void |
addCharToLexer(int c)
Store char c as UTF-8 encoded byte stream. |
boolean |
addGenerator(Node root)
Add meta element for Tidy. |
void |
addStringLiteral(java.lang.String str)
Calls addCharToLexer for each char in the string. |
void |
addStringToLexer(java.lang.String str)
Adds a string to lexer buffer. |
short |
apparentVersion()
Return the html version used in document. |
boolean |
canPrune(Node element)
Can the given element be removed? |
void |
changeChar(byte c)
Substitute the last char in buffer. |
boolean |
checkDocTypeKeyWords(Node doctype)
Check system keywords (keywords should be uppercase). |
AttVal |
cloneAttributes(AttVal attrs)
Clones an attribute value and adds any ASP or PHP node to the node list. |
Node |
cloneNode(Node node)
Clones a node and adds it to the node list. |
void |
deferDup()
Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated. |
boolean |
endOfInput()
Has end of input stream been reached? |
short |
findGivenVersion(Node doctype)
Examine DOCTYPE to identify version. |
boolean |
fixDocType(Node root)
Fixup doctype if missing. |
void |
fixHTMLNameSpace(Node root,
java.lang.String profile)
Fix xhtml namespace. |
void |
fixId(Node node)
Duplicates the name attribute as an id and checks whether id and name match. |
boolean |
fixXmlDecl(Node root)
Ensure XML document starts with <?XML version="1.0"?>. |
Node |
getCDATA(Node container)
Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo. |
Node |
getToken(short mode)
Gets a token. |
short |
htmlVersion()
Choose what version to use for new doctype. |
java.lang.String |
htmlVersionName()
Choose what version to use for new doctype. |
Node |
inferredTag(java.lang.String name)
Generates and inserts a new node. |
int |
inlineDup(Node node)
This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc. |
Node |
insertedToken()
|
static boolean |
isCSS1Selector(java.lang.String buf)
In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item). |
boolean |
isPushed(Node node)
Is the node in the stack? |
static boolean |
isValidAttrName(java.lang.String attr)
Check if attr is a valid name. |
Node |
newLineNode()
Adds a new line node. |
Node |
newNode()
Creates a new node and adds it to the node list. |
Node |
newNode(short type,
byte[] textarray,
int start,
int end)
Creates a new node and adds it to the node list. |
Node |
newNode(short type,
byte[] textarray,
int start,
int end,
java.lang.String element)
Creates a new node and adds it to the node list. |
Node |
parseAsp()
Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but it won't deal with the case where the ASP is used to tailor the attribute value. |
java.lang.String |
parseAttribute(boolean[] isempty,
Node[] asp,
Node[] php)
consumes the '>' terminating start tags. |
AttVal |
parseAttrs(boolean[] isempty)
Parse tag attributes. |
void |
parseEntity(short mode)
Parse an html entity. |
Node |
parsePhp()
PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>. |
int |
parseServerInstruction()
Invoked when < is seen in place of an attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings. |
char |
parseTagName()
Parses a tag name. |
java.lang.String |
parseValue(java.lang.String name,
boolean foldCase,
boolean[] isempty,
int[] pdelim)
Parse an attribute value. |
void |
popInline(Node node)
Pop a copy of an inline node from the stack. |
protected boolean |
preContent(Node node)
Is content acceptable for pre elements? |
void |
pushInline(Node node)
Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. |
boolean |
setXHTMLDocType(Node root)
Adds a new xhtml doctype to the document. |
void |
ungetToken()
|
protected void |
updateNodeTextArrays(byte[] oldtextarray,
byte[] newtextarray)
Update oldtextarray in the current nodes. |
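The summary for `addCharToLexer` says the character is stored as a UTF-8 encoded byte stream. A minimal sketch of that encoding step follows, assuming a `ByteArrayOutputStream` as a stand-in for the lexer buffer (JTidy itself uses the `lexbuf` byte array). `Utf8Sketch` and `addChar` are hypothetical names, and the sketch does not validate surrogates or out-of-range values.

```java
import java.io.ByteArrayOutputStream;

final class Utf8Sketch {
    /** Hypothetical helper: appends code point c to buf using UTF-8. */
    static void addChar(ByteArrayOutputStream buf, int c) {
        if (c < 0x80) {
            // 1 byte: 0xxxxxxx
            buf.write(c);
        } else if (c < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            buf.write(0xC0 | (c >> 6));
            buf.write(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            buf.write(0xE0 | (c >> 12));
            buf.write(0x80 | ((c >> 6) & 0x3F));
            buf.write(0x80 | (c & 0x3F));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            buf.write(0xF0 | (c >> 18));
            buf.write(0x80 | ((c >> 12) & 0x3F));
            buf.write(0x80 | ((c >> 6) & 0x3F));
            buf.write(0x80 | (c & 0x3F));
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        addChar(buf, 0x20AC); // euro sign
        for (byte b : buf.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Storing text as UTF-8 bytes is consistent with `lexbuf` being a `byte[]` and with nodes referring to their text by start and end offsets into that buffer.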
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final short IGNORE_WHITESPACE
public static final short MIXED_CONTENT
public static final short PREFORMATTED
public static final short IGNORE_MARKUP
protected StreamIn in
protected java.io.PrintWriter errout
protected short badAccess
protected short badLayout
protected short badChars
protected short badForm
protected short warnings
protected short errors
protected int lines
protected int columns
protected boolean waswhite
protected boolean pushed
protected boolean insertspace
protected boolean excludeBlocks
protected boolean exiled
protected boolean isvoyager
protected short versions
protected int doctype
protected boolean badDoctype
protected int txtstart
protected int txtend
protected short state
protected Node token
protected byte[] lexbuf
protected int lexlength
protected int lexsize
protected Node inode
protected int insert
protected java.util.Stack istack
protected int istackbase
protected Style styles
protected Configuration configuration
protected boolean seenEndBody
protected boolean seenEndHtml
protected Report report
protected Node root
Constructor Detail |
---|
public Lexer(StreamIn in, Configuration configuration, Report report)
in
- StreamIn
configuration
- configuration instance
report
- report instance, for reporting errors
Method Detail |
---|
public Node newNode()
public Node newNode(short type, byte[] textarray, int start, int end)
type
- node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE |
Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG |
Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray
- array of bytes contained in the Node
start
- start position
end
- end position
public Node newNode(short type, byte[] textarray, int start, int end, java.lang.String element)
type
- node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE |
Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG |
Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray
- array of bytes contained in the Node
start
- start position
end
- end position
element
- tag name
public Node cloneNode(Node node)
node
- Node
public AttVal cloneAttributes(AttVal attrs)
attrs
- original AttVal
protected void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)
Update oldtextarray in the current nodes.
oldtextarray
- previous text array
newtextarray
- new text array
public Node newLineNode()
public boolean endOfInput()
true
if end of input stream has been reached
public void addByte(int c)
c
- byte to add
public void changeChar(byte c)
c
- new char
public void addCharToLexer(int c)
c
- char to store
public void addStringToLexer(java.lang.String str)
str
- String to add
public void parseEntity(short mode)
mode
- mode
public char parseTagName()
public void addStringLiteral(java.lang.String str)
str
- input String
public short htmlVersion()
public java.lang.String htmlVersionName()
public boolean addGenerator(Node root)
root
- root node
true
if the tag has been added
public boolean checkDocTypeKeyWords(Node doctype)
doctype
- doctype node
public short findGivenVersion(Node doctype)
doctype
- doctype node
public void fixHTMLNameSpace(Node root, java.lang.String profile)
root
- root Node
profile
- current profile
public boolean setXHTMLDocType(Node root)
root
- root node
true
if a doctype has been added
public short apparentVersion()
public boolean fixDocType(Node root)
root
- root node
false
if current version has not been identified
public boolean fixXmlDecl(Node root)
Ensure XML document starts with <?XML version="1.0"?>. Add encoding attribute if not using ASCII or UTF-8 output.
root
- root node
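The rule stated for `fixXmlDecl` (make sure a declaration is present, and add an encoding attribute only when the output is neither ASCII nor UTF-8) can be sketched on plain strings. This is an illustration under stated assumptions, not JTidy's implementation, which operates on the node tree: `XmlDeclSketch`, `ensureXmlDecl` and the encoding names passed in are hypothetical, and the sketch writes the declaration in lower case as the XML specification requires.

```java
final class XmlDeclSketch {
    /**
     * Hypothetical helper: returns the document with an XML declaration in
     * front, adding encoding="..." unless the output is ASCII or UTF-8.
     */
    static String ensureXmlDecl(String document, String encoding) {
        if (document.startsWith("<?xml")) {
            return document; // a declaration is already present
        }
        StringBuilder decl = new StringBuilder("<?xml version=\"1.0\"");
        boolean needsEncoding = !encoding.equalsIgnoreCase("ASCII")
                && !encoding.equalsIgnoreCase("UTF-8");
        if (needsEncoding) {
            decl.append(" encoding=\"").append(encoding).append('"');
        }
        decl.append("?>\n");
        return decl + document;
    }

    public static void main(String[] args) {
        System.out.print(ensureXmlDecl("<html/>\n", "ISO-8859-1"));
    }
}
```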
public Node inferredTag(java.lang.String name)
name
- tag name
public Node getCDATA(Node container)
container
- container node
public void ungetToken()
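The class description says that ungetting a token provides one level of undo, and the `pushed` field is described as true after a token has been pushed back. A minimal sketch of that pattern, assuming a token source that is just an iterator of strings: `PushbackSketch` and its members are hypothetical and only mirror the documented field names; the real `getToken` also takes a mode argument and returns `Node` objects.

```java
import java.util.Arrays;
import java.util.Iterator;

final class PushbackSketch {
    private final Iterator<String> source; // hypothetical token source
    private String token;                  // most recently returned token
    private boolean pushed;                // true after the token was pushed back

    PushbackSketch(Iterator<String> source) {
        this.source = source;
    }

    /** Returns the pushed-back token if there is one, else the next token (null at end). */
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = source.hasNext() ? source.next() : null;
        return token;
    }

    /** One level of undo: the next getToken() call returns the same token again. */
    void ungetToken() {
        pushed = true;
    }

    public static void main(String[] args) {
        PushbackSketch lexer =
                new PushbackSketch(Arrays.asList("<p>", "text", "</p>").iterator());
        System.out.println(lexer.getToken());
        lexer.ungetToken();
        System.out.println(lexer.getToken()); // same token again
        System.out.println(lexer.getToken());
    }
}
```

Because only a flag is kept, calling `ungetToken()` twice in a row still undoes a single token, which matches the "one level" wording.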
public Node getToken(short mode)
mode
- one of the following:
MixedContent
-- for elements which don't accept PCDATA
Preformatted
-- white space preserved as is
IgnoreMarkup
-- for CDATA elements such as script, style
public Node parseAsp()
href='<%=rsSchool.Fields("ID").Value%>'
where the ASP that generates the attribute value is
masked from Tidy by the quotemarks.
public Node parsePhp()
PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.
public java.lang.String parseAttribute(boolean[] isempty, Node[] asp, Node[] php)
isempty
- flag is passed as array so it can be modified
asp
- asp Node, passed as array so it can be modified
php
- php Node, passed as array so it can be modified
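Several parameters here are documented as "passed as array so it can be modified". Java passes arguments by value, so a one-element array is a common way to let a method hand back an extra result to its caller. The sketch below shows the idiom on a deliberately simplified, hypothetical parser; `OutParamSketch` and `parseName` are not JTidy methods.

```java
final class OutParamSketch {
    /**
     * Hypothetical parser: returns the tag name from the text between '<' and '>'
     * and reports through isEmpty[0] whether the tag was self-closing.
     */
    static String parseName(String tagBody, boolean[] isEmpty) {
        String body = tagBody.trim();
        // Writing to element 0 is visible to the caller, unlike reassigning a boolean.
        isEmpty[0] = body.endsWith("/");
        if (isEmpty[0]) {
            body = body.substring(0, body.length() - 1).trim();
        }
        int space = body.indexOf(' ');
        return space < 0 ? body : body.substring(0, space);
    }

    public static void main(String[] args) {
        boolean[] isEmpty = new boolean[1];
        String name = parseName("br /", isEmpty);
        System.out.println(name + " empty=" + isEmpty[0]);
    }
}
```

The caller allocates the array, passes it in, and reads element 0 afterwards; the method's return value stays free for the main result.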
public int parseServerInstruction()
public java.lang.String parseValue(java.lang.String name, boolean foldCase, boolean[] isempty, int[] pdelim)
name
- attribute name
foldCase
- fold case?
isempty
- is attribute empty? Passed as an array reference to allow modification
pdelim
- delimiter, passed as an array reference to allow modification
public static boolean isValidAttrName(java.lang.String attr)
attr
- String to check, must be non-null
true
if attr is a valid name.
public static boolean isCSS1Selector(java.lang.String buf)
buf
- css selector name
true
if the given string is a valid css1 selector name
public AttVal parseAttrs(boolean[] isempty)
isempty
- is tag empty?
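The character rule given in the summary of `isCSS1Selector` (letters, digits, characters 161-255 and dash, with no leading dash or digit) can be sketched as a simple scan. This is a simplified reading, not JTidy's code: `Css1SelectorSketch` and `isSimpleCss1Selector` are hypothetical names, lower-case letters are accepted as an assumption, and escaped characters and numeric codes are not handled.

```java
final class Css1SelectorSketch {
    /** Hypothetical, simplified check of the CSS1 selector character rule. */
    static boolean isSimpleCss1Selector(String buf) {
        if (buf.isEmpty()) {
            return false;
        }
        for (int i = 0; i < buf.length(); i++) {
            char c = buf.charAt(i);
            boolean letter = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')   // assumption: case-insensitive
                    || (c >= 161 && c <= 255);
            boolean digit = c >= '0' && c <= '9';
            boolean dash = c == '-';
            if (i == 0 && !letter) {
                return false; // cannot start with a dash or a digit
            }
            if (!(letter || digit || dash)) {
                return false; // escapes are deliberately not supported here
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSimpleCss1Selector("heading-1"));
        System.out.println(isSimpleCss1Selector("1heading"));
    }
}
```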
public void pushInline(Node node)
<p><em> text <p><em> more text
Shouldn't be mapped to
<p><em> text </em></p><p><em><em> more text </em></em>
node
- Node to be pushed
public void popInline(Node node)
node
- Node to be popped
public boolean isPushed(Node node)
node
- Node
true
if the node is found in the stack
public int inlineDup(Node node)
<i><h1>italic heading</h1></i>
which is then treated as
equivalent to <h1><i>italic heading</i></h1>
This is implemented by setting the lexer
into a mode where it gets tokens from the inline stack rather than from the input stream.
node
- original node
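The effect described for `inlineDup`, where open inline elements are replayed inside a block-level element so that `<i><h1>` is treated like `<h1><i>`, can be sketched with a stack of tag names. This is only an illustration of the idea of taking tokens from the inline stack; `InlineDupSketch` and `openBlock` are hypothetical, and the real lexer duplicates `Node` objects rather than strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

final class InlineDupSketch {
    /**
     * Hypothetical helper: returns the start tags to emit when a block-level
     * element opens while the inline elements on inlineStack are still open.
     */
    static List<String> openBlock(String block, Deque<String> inlineStack) {
        List<String> tokens = new ArrayList<>();
        tokens.add("<" + block + ">");
        // Replay the open inlines inside the block, outermost first.
        // descendingIterator() walks from the oldest push to the newest.
        Iterator<String> it = inlineStack.descendingIterator();
        while (it.hasNext()) {
            tokens.add("<" + it.next() + ">");
        }
        return tokens;
    }

    public static void main(String[] args) {
        Deque<String> inlineStack = new ArrayDeque<>();
        inlineStack.push("i"); // <i> was opened before the heading
        System.out.println(openBlock("h1", inlineStack));
    }
}
```

The stack itself is left unchanged, so the same inline elements can be replayed again for the next block-level element.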
public Node insertedToken()
public boolean canPrune(Node element)
element
- node
true
if the element can be removed
public void fixId(Node node)
node
- Node to check for name/id attributes
public void deferDup()
protected boolean preContent(Node node)
node
- content
true
if node is acceptable in pre elements