{Html解析器.最近因为用到Html解析功能.在网上找了几款Delphi版本的,结果发现解析复杂的HTML都有一些问题.没办法自己写了一款,经测试到现在没遇到任何解析不了的Html.wr960204 武稀松 2013http://www.raysoftware.cn/?p=370感谢牛人杨延哲在HTML语法和CSS语法方面的帮助.Thank Yang Yanzhe.http://www.pockhero.com/本版本只支持DelphiXE3之后的版本.如果用早期Delphi请使用HTMLParser.pas文件.支持Windows,MacOSX,iOS,Android平台,完全去掉了对指针的使用.防止以后易博龙去掉移动平台对指针的支持.脱离了对旧版本的支持,甩掉包袱开发起来真的很爽!
}unit HtmlParser_XE3UP;interface{$IF (defined(IOS) and defined(CPUARM)) or defined(ANDROID)}
{$DEFINE MOBILE_DEV}
{$ENDIF}constLowStrIndex = Low(string); // 移动平台=0,个人电脑平台=1type{$IFNDEF MSWINDOWS}{ 接口使用WideString是为了可以给例如C++,VB等语言使用.但是如果离开了Windows平台,其他平台是没有WideString这个COM的数据类型的.}WideString = string;
{$ENDIF}IHtmlElement = interface;IHtmlElementList = interface;IHtmlElement = interface['{8C75239C-8CFA-499F-B115-7CEBEDFB421B}']// function GetOwner: IHtmlElement; stdcall;function GetTagName: WideString; safecall;function GetContent: WideString; safecall;function GetOrignal: WideString; safecall;function GetChildrenCount: Integer; stdcall;function GetChildren(Index: Integer): IHtmlElement; stdcall;function GetCloseTag: IHtmlElement; stdcall;function GetInnerHtml(): WideString; safecall;function GetOuterHtml(): WideString; safecall;function GetInnerText(): WideString; safecall;function GetAttributes(Key: WideString): WideString; safecall;function GetSourceLineNum(): Integer; stdcall;function GetSourceColNum(): Integer; stdcall;// 属性是否存在function HasAttribute(AttributeName: WideString): Boolean; stdcall;{ 用CSS选择器语法查找Element,不支持"伪类"CSS Selector Style search,not support Pseudo-classes.http://www.w3.org/TR/CSS2/selector.html}function SimpleCSSSelector(const selector: WideString): IHtmlElementList; stdcall;// 枚举属性function EnumAttributeNames(Index: Integer): WideString; safecall;property TagName: WideString read GetTagName;property ChildrenCount: Integer read GetChildrenCount;property Children[index: Integer]: IHtmlElement read GetChildren; default;property CloseTag: IHtmlElement read GetCloseTag;property Content: WideString read GetContent;property Orignal: WideString read GetOrignal;// property Owner: IHtmlElement read GetOwner;// 获取元素在源代码中的位置property SourceLineNum: Integer read GetSourceLineNum;property SourceColNum: Integer read GetSourceColNum;//property InnerHtml: WideString read GetInnerHtml;property OuterHtml: WideString read GetOuterHtml;property InnerText: WideString read GetInnerText;property Attributes[Key: WideString]: WideString read GetAttributes;end;IHtmlElementList = interface['{8E1380C6-4263-4BF6-8D10-091A86D8E7D9}']function GetCount: Integer; stdcall;function GetItems(Index: Integer): IHtmlElement; stdcall;property Count: Integer read GetCount;property Items[Index: Integer]: IHtmlElement read GetItems; default;end;function ParserHTML(const Source: WideString): IHtmlElement; stdcall;implementationusesSysUtils, generics.Collections;typeTStringDictionary = TDictionary<string, string>;TPropDictionary = TDictionary<string, WORD>;TStringDynArray = TArray<string>;constWhiteSpace =[' ', #13, #10, #9];// CSS Attribute Compare OperatorOperatorChar =['=', '!', '*', '~', '|', '^', '$'];MaxListSize = Maxint div 16;// TagPropertytpBlock = $01;tpInline = $02;tpEmpty = $04;tpFormatAsInline = $08;tpPreserveWhitespace = $10;tpInlineOrEmpty = tpInline or tpEmpty;typeTAttrOperator = (aoExist, aoEqual, aoNotEqual, aoIncludeWord, aoBeginWord, aoBegin, aoEnd, aoContain);PAttrSelectorItem = ^TAttrSelectorItem;TAttrSelectorItem = recordKey: string;AttrOperator: TAttrOperator;Value: string;end;TSelectorItemRelation = (sirNONE, sirDescendant, sirChildren, sirYoungerBrother, sirAllYoungerBrother);PCSSSelectorItem = ^TCSSSelectorItem;TCSSSelectorItem = recordRelation: TSelectorItemRelation;szTag: string;Attributes: array of TAttrSelectorItem;end;TCSSSelectorItems = array of TCSSSelectorItem;PCSSSelectorItems = ^TCSSSelectorItems;TCSSSelectorItemGroup = array of TCSSSelectorItems;//TSourceContext = recordprivatefunction GetCharOfCurrent(Index: Integer): Char; inline;publicCode: string;CodeIndex: Integer;LineNum: Integer;ColNum: Integer;CurrentChar: Char;
{$IFDEF DEBUG}currentCode: PChar;
{$ENDIF}procedure IncSrc(); overload; inline;procedure IncSrc(Step: Integer); overload; inline;procedure setCode(const ACode: string); inline;function ReadStr(UntilChars: TSysCharSet): string; inline;function PeekStr(Index: Integer): string; overload; inline;function PeekStr(): string; overload; inline;function subStr(Index, Count: Integer): string; overload; inline;function subStr(Count: Integer): string; overload; inline;procedure SkipBlank(); inline;property charOfCurrent[Index: Integer]: Char read GetCharOfCurrent;end;TAttributeItem = recordKey, Value: string;end;TAttributeDynArray = TArray<TAttributeItem>;TIHtmlElementList = class;THtmlElement = class;THtmlElementList = TList<THtmlElement>;THtmlElement = class(TInterfacedObject, IHtmlElement)protected// IHtmlElement// function GetOwner: IHtmlElement; stdcall;function GetTagName: WideString; safecall;function GetContent: WideString; safecall;function GetOrignal: WideString; safecall;function GetChildrenCount: Integer; stdcall;function GetChildren(Index: Integer): IHtmlElement; stdcall;function GetCloseTag: IHtmlElement; stdcall;function GetInnerHtml(): WideString; safecall;function GetOuterHtml(): WideString; safecall;function GetInnerText(): WideString; safecall;function GetAttributes(Key: WideString): WideString; safecall;function GetSourceLineNum(): Integer; stdcall;function GetSourceColNum(): Integer; stdcall;// 属性是否存在function HasAttribute(AttributeName: WideString): Boolean; stdcall;{ 用CSS选择器语法查找Element,不支持"伪类"CSS Selector Style search,not support Pseudo-classes.http://www.w3.org/TR/CSS2/selector.html}function SimpleCSSSelector(const selector: WideString): IHtmlElementList; stdcall;// 枚举属性function EnumAttributeNames(Index: Integer): WideString; safecall;property TagName: WideString read GetTagName;property ChildrenCount: Integer read GetChildrenCount;property Children[index: Integer]: IHtmlElement read GetChildren; default;property CloseTag: IHtmlElement read GetCloseTag;property Content: WideString read GetContent;property Orignal: WideString read GetOrignal;// property Owner: IHtmlElement read GetOwner;// 获取元素在源代码中的位置property SourceLineNum: Integer read GetSourceLineNum;property SourceColNum: Integer read GetSourceColNum;//property InnerHtml: WideString read GetInnerHtml;property OuterHtml: WideString read GetOuterHtml;property InnerText: WideString read GetInnerText;property Attributes[Key: WideString]: WideString read GetAttributes;privateFClosed: Boolean;//FOwner: THtmlElement;FCloseTag: IHtmlElement;FTagName: string;FIsCloseTag: Boolean;FContent: string;FOrignal: string;FSourceLine: Integer;FSourceCol: Integer;//FAttributes: TStringDictionary;FChildren: TIHtmlElementList;procedure _GetHtml(IncludeSelf: Boolean; Sb: TStringBuilder);procedure _GetText(IncludeSelf: Boolean; Sb: TStringBuilder);procedure _SimpleCSSSelector(const ItemGroup: TCSSSelectorItemGroup; r: TIHtmlElementList);procedure _Select(Item: PCSSSelectorItem; Count: Integer; r: TIHtmlElementList; OnlyTopLevel: Boolean = false);publicconstructor Create(AOwner: THtmlElement; AText: string; ALine, ACol: Integer);destructor Destroy; override;end;TIHtmlElementList = class(TInterfacedObject, IHtmlElementList)private// IHtmlElementListfunction GetItems(Index: Integer): IHtmlElement; stdcall;function GetCount: Integer; stdcall;protectedFList: TList<IHtmlElement>;procedure SetItems(Index: Integer; const Value: IHtmlElement); inline;function Add(Value: IHtmlElement): Integer; inline;procedure Delete(Index: Integer); inline;procedure Clear(); inline;publicconstructor Create;destructor Destroy; override;function IndexOf(Item: IHtmlElement): Integer;// IHtmlElementListproperty Items[index: Integer]: IHtmlElement read GetItems write SetItems; default;property Count: Integer read GetCount;end;function SplitStr(ACharSet: TSysCharSet; AStr: string): TStringDynArray;
varL, I: Integer;S: string;StrChar: Char;
beginResult := nil;if Length(AStr) <= 0 thenExit;I := Low(AStr);L := Low(AStr);StrChar := #0;while I <= High(AStr) dobeginif CharInSet(AStr[I], ['''', '"']) thenif StrChar = #0 thenStrChar := AStr[I]else if StrChar = AStr[I] thenStrChar := #0;// 不在字符串中,分隔符才生效if StrChar = #0 thenif CharInSet(AStr[I], ACharSet) thenbeginif I > L thenbeginS := Copy(AStr, L{$IF (LowStrIndex = 0)}  + 1{$ENDIF}, I - L);SetLength(Result, Length(Result) + 1);Result[Length(Result) - 1] := S;end;L := I + 1;end;Inc(I);end;if (I > L) thenbeginS := Copy(AStr, L{$IF (LowStrIndex = 0)}  + 1{$ENDIF}, I - L);SetLength(Result, Length(Result) + 1);Result[Length(Result) - 1] := S;end;
end;function StrRight(const Value: string; Count: Integer): string;
varstart: Integer;
beginstart := Length(Value) - Count + 1;if start <= 0 thenResult := ValueelseResult := Copy(Value, start, Count);
end;function StrLeft(const Value: string; Count: Integer): string;
beginResult := Copy(Value, LowStrIndex, Count);
end;// ComapreAttrfunction _aoExist(const Item: TAttrSelectorItem; E: THtmlElement): Boolean;
beginResult := E.FAttributes.ContainsKey(Item.Key);
end;function _aoEqual(const Item: TAttrSelectorItem; E: THtmlElement): Boolean;
beginResult := E.FAttributes.ContainsKey(Item.Key) and (E.FAttributes[Item.Key] = Item.Value);
end;function _aoNotEqual(const Item: TAttrSelectorItem; E: THtmlElement): Boolean;
beginResult := E.FAttributes.ContainsKey(Item.Key) and (E.FAttributes[Item.Key] <> Item.Value);
end;function _aoIncludeWord(const Item: TAttrSelectorItem; E: THtmlElement): Boolean;
varS: TStringDynArray;I: Integer;
beginResult := false;if not E.FAttributes.ContainsKey(Item.Key) thenExit;Result := True;S := SplitStr(WhiteSpace, E.FAttributes[Item.Key]);for I := Low(S) to High(S) doif S[I] = Item.Value thenExit;Result := false;
end;function _aoBeginWord(const Item: TAttrSelectorItem; E: THtmlElement): Boolean;
varS: TStringDynArray;I: Integer;
beginResult := false;if not E.FAttributes.ContainsKey(Item.Key) thenExit;S := SplitStr((WhiteSpace + ['_', '-']), E.FAttributes[Item.Key]);Result := (Length(S) > 0) and (S[0] = Item.Value);
end;function _aoBegin(const Item: TAttrSelectorItem; E: THtmlElement): Boolean;
varattr, Value: string;
beginResult := false;if not E.FAttributes.ContainsKey(Item.Key) thenExit;attr := E.FAttributes[Item.Key];Value := Item.Value;Result := (Length(attr) > Length(Value)) and (StrLeft(attr, Length(Value)) = Value);
end;function _aoEnd(const Item: TAttrSelectorItem; E: THtmlElement): Boolean;
varattr, Value: string;
beginResult := false;if not E.FAttributes.ContainsKey(Item.Key) thenExit;attr := E.FAttributes[Item.Key];Value := Item.Value;Result := (Length(attr) > Length(Value)) and (StrRight(attr, Length(Value)) = Value);
end;function _aoContain(const Item: TAttrSelectorItem; E: THtmlElement): Boolean;
beginResult := false;if not E.FAttributes.ContainsKey(Item.Key) thenExit;Result := Pos(Item.Value, E.FAttributes[Item.Key]) > 0;
end;typeTFNCompareAttr = function(const Item: TAttrSelectorItem; E: THtmlElement): Boolean;constAttrCompareFuns: array[TAttrOperator] of TFNCompareAttr = (_aoExist, _aoEqual, _aoNotEqual, _aoIncludeWord, _aoBeginWord, _aoBegin, _aoEnd, _aoContain);function ConvertEntities(S: string): string; forward;function GetTagProperty(const TagName: string): WORD; forward;procedure DoError(const Msg: string);
beginraise Exception.Create(Msg);
end;procedure _ParserAttrs(var sc: TSourceContext; var Attrs: TAttributeDynArray);
varItem: TAttributeItem;
beginSetLength(Attrs, 0);while True dobeginsc.SkipBlank();if sc.CurrentChar = #0 thenBreak;Item.Key := sc.ReadStr((WhiteSpace + [#0, '=']));Item.Value := '';sc.SkipBlank;if sc.CurrentChar = '=' thenbeginsc.IncSrc;sc.SkipBlank;Item.Value := sc.ReadStr((WhiteSpace + [#0]));end;SetLength(Attrs, Length(Attrs) + 1);Attrs[Length(Attrs) - 1] := Item;end;
end;procedure _ParserNodeItem(S: string; var ATagName: string; var Attrs: TAttributeDynArray);
varsc: TSourceContext;
beginsc.setCode(S);sc.SkipBlank;ATagName := UpperCase(sc.ReadStr((WhiteSpace + [#0, '/', '>'])));_ParserAttrs(sc, Attrs);
end;function CreateTextElement(AOwner: THtmlElement; AText: string; ALine, ACol: Integer): THtmlElement;
beginResult := THtmlElement.Create(AOwner, AText, ALine, ACol);with Result dobeginFContent := ConvertEntities(AText);FTagName := '#TEXT';FClosed := True;end;
end;function CreateScriptElement(AOwner: THtmlElement; AText: string; ALine, ACol: Integer): THtmlElement;
beginResult := THtmlElement.Create(AOwner, AText, ALine, ACol);with Result dobeginFContent := ConvertEntities(AText);FTagName := '#SCRIPT';FClosed := True;end;
end;function CreateStyleElement(AOwner: THtmlElement; AText: string; ALine, ACol: Integer): THtmlElement;
beginResult := THtmlElement.Create(AOwner, AText, ALine, ACol);with Result dobeginFContent := ConvertEntities(AText);FTagName := '#STYLE';FClosed := True;end;
end;function CreateCommentElement(AOwner: THtmlElement; AText: string; ALine, ACol: Integer): THtmlElement;
beginResult := THtmlElement.Create(AOwner, AText, ALine, ACol);with Result dobeginFContent := ConvertEntities(AText);FTagName := '#COMMENT';FClosed := True;end;
end;function CreateTagElement(AOwner: THtmlElement; AText: string; ALine, ACol: Integer): THtmlElement;
varStrs: TStringDynArray;I: Integer;Attrs: TAttributeDynArray;
beginResult := THtmlElement.Create(AOwner, AText, ALine, ACol);with Result dobegin// TODO 解析TagName和属性if AText = '' thenExit;// 去掉两头的<if AText[Low(AText)] = '<' thenAText := StrRight(AText, Length(AText) - 1);if AText = '' thenExit;if AText[High(AText)] = '>' thenAText := StrLeft(AText, Length(AText) - 1);// 检查是关闭节点,还是单个已经关闭的节点if AText = '' thenExit;FClosed := AText[High(AText)] = '/';FIsCloseTag := AText[LowStrIndex] = '/';if FIsCloseTag thenAText := StrRight(AText, Length(AText) - 1);if FClosed thenAText := StrLeft(AText, Length(AText) - 1);//_ParserNodeItem(AText, FTagName, Attrs);for I := Low(Attrs) to High(Attrs) doFAttributes.AddOrSetValue(LowerCase(Attrs[I].Key), ConvertEntities(Attrs[I].Value));end;
end;function CreateDocTypeElement(AOwner: THtmlElement; AText: string; ALine, ACol: Integer): THtmlElement;
beginResult := THtmlElement.Create(AOwner, AText, ALine, ACol);with Result dobeginFContent := ConvertEntities(AText);FTagName := '#DOCTYPE';FClosed := True;if FContent = '' thenExit;if FContent[1] = '<' thenDelete(FContent, 1, 1);if FContent = '' thenExit;if FContent[Length(FContent)] = '>' thenDelete(FContent, Length(FContent), 1);FContent := Trim(Copy(Trim(FContent), 9, Length(FContent)));end;
end;procedure _ParserHTML(const Source: string; AElementList: THtmlElementList);
varBeginLineNum, BeginColNum: Integer;sc: TSourceContext;function IsEndOfTag(TagName: string): Boolean;beginResult := false;if sc.charOfCurrent[1] = '/' thenbeginResult := UpperCase(sc.subStr(sc.CodeIndex + 2, Length(TagName))) = UpperCase(TagName);end;end;function PosCharInTag(AChar: Char): Boolean;varStrChar: Char;beginResult := false;StrChar := #0;while True dobeginif sc.CurrentChar = #0 thenBreak;if sc.CurrentChar = '"' thenbeginif StrChar = #0 thenStrChar := sc.CurrentCharelseStrChar := #0;end;if (sc.CurrentChar = AChar) and (StrChar = #0) thenbeginResult := True;Break;end;sc.IncSrc;end;end;function ParserStyleData(): string;varoldIndex: Integer;beginoldIndex := sc.CodeIndex;if sc.subStr(4) = '<!--' thenbeginsc.IncSrc(5);while True dobeginif sc.CurrentChar = #0 thenDoError(Format('未完结的Style行:%d;列:%d;', [sc.LineNum, sc.ColNum]))else if sc.CurrentChar = '>' thenbeginif (sc.charOfCurrent[-1] = '-') and (sc.charOfCurrent[-2] = '-') thenbeginsc.IncSrc;sc.SkipBlank();Break;end;end;sc.IncSrc;end;endelsewhile True dobegincase sc.CurrentChar of#0:beginBreak;end;'<':beginif IsEndOfTag('style') thenbeginBreak;end;end;end;sc.IncSrc;end;Result := sc.subStr(oldIndex, sc.CodeIndex - oldIndex);end;function ParserScriptData(): string;varoldIndex: Integer;stringChar: Char;PreIsblique: Boolean;beginoldIndex := sc.CodeIndex;stringChar := #0;sc.SkipBlank();if sc.subStr(4) = '<!--' thenbeginsc.IncSrc(5);while True dobeginif sc.CurrentChar = #0 thenDoError(Format('未完结的Script行:%d;列:%d;', [sc.LineNum, sc.ColNum]))else if sc.CurrentChar = '>' thenbeginif (sc.charOfCurrent[-1] = '-') and (sc.charOfCurrent[-2] = '-') thenbeginsc.IncSrc;sc.SkipBlank();Break;end;end;sc.IncSrc;end;endelsebeginwhile True dobegincase sc.CurrentChar of#0:Break;'"', '''': // 字符串beginstringChar := sc.CurrentChar;PreIsblique := false;sc.IncSrc();while True dobeginif sc.CurrentChar = #0 thenBreak;if (sc.CurrentChar = stringChar) and (not PreIsblique) thenBreak;if sc.CurrentChar = '\' thenPreIsblique := not PreIsbliqueelsePreIsblique := false;sc.IncSrc;end;end;'/': // 注释beginsc.IncSrc();case sc.CurrentChar of'/': // 行注释beginwhile True dobeginif CharInSet(sc.CurrentChar, [#0, #$0A]) thenbeginBreak;end;sc.IncSrc();end;end;'*': // 块注释beginsc.IncSrc();sc.IncSrc();while True dobeginif sc.CurrentChar = #0 thenBreak;if (sc.CurrentChar = '/') and (sc.charOfCurrent[-1] = '*') thenbeginBreak;end;sc.IncSrc();end;end;end;end;'<':beginif IsEndOfTag('script') thenbeginBreak;end;end;end;sc.IncSrc();end;end;Result := sc.subStr(oldIndex, sc.CodeIndex - oldIndex)end;varElementType: (EtUnknow, EtTag, EtDocType, EtText, EtComment);OldCodeIndex: Integer;tmp: string;Tag: THtmlElement;
beginsc.setCode(Source);while sc.CodeIndex <= high(sc.Code) dobeginElementType := EtUnknow;OldCodeIndex := sc.CodeIndex;BeginLineNum := sc.LineNum;BeginColNum := sc.ColNum;if sc.CurrentChar = #0 thenBreak;// "<"开头的就是Tag之类的if sc.CurrentChar = '<' thenbeginsc.IncSrc;if sc.CurrentChar = '!' then // 注释beginElementType := EtComment;sc.IncSrc;case sc.CurrentChar of'-': // <!--  -->beginsc.IncSrc; // -while True dobeginif not PosCharInTag('>') thenDoError('LineNum:' + IntToStr(BeginLineNum) + '无法找到Tag结束点:' + sc.subStr(100))else if (sc.charOfCurrent[-1] = '-') and (sc.charOfCurrent[-2] = '-') thenbeginsc.IncSrc;Break;end;sc.IncSrc;end;end;'[': // <![CDATA[.....]]>beginsc.IncSrc; //while True dobeginif not PosCharInTag('>') thenDoError('LineNum:' + IntToStr(BeginLineNum) + '无法找到Tag结束点:' + sc.subStr(100))else if (sc.charOfCurrent[-1] = ']') thenbeginsc.IncSrc;Break;end;sc.IncSrc;end;end;else // <!.....>beginif UpperCase(sc.PeekStr()) = 'DOCTYPE' thenbeginElementType := EtDocType;sc.IncSrc; //if PosCharInTag('>') thensc.IncSrcelseDoError('LineNum:' + IntToStr(BeginLineNum) + '无法找到Tag结束点:' + sc.subStr(100));endelsebeginsc.IncSrc; //if PosCharInTag('>') thensc.IncSrcelseDoError('LineNum:' + IntToStr(BeginLineNum) + '无法找到Tag结束点:' + sc.subStr(100));end;end;end;endelse if sc.CurrentChar = '?' then // <?...?>  XMLbeginElementType := EtComment;sc.IncSrc; //while True dobeginif not PosCharInTag('>') thenDoError('LineNum:' + IntToStr(BeginLineNum) + '无法找到Tag结束点:' + sc.subStr(100))else if (sc.charOfCurrent[-1] = '?') thenbeginsc.IncSrc;Break;end;sc.IncSrc;end;endelse // 正常节点beginElementType := EtTag;sc.IncSrc;if PosCharInTag('>') thensc.IncSrcelseDoError('LineNum:' + IntToStr(BeginLineNum) + '无法找到Tag结束点:' + sc.subStr(100));end;tmp := sc.subStr(OldCodeIndex, sc.CodeIndex - OldCodeIndex);endelse // 不是"<"开头的 那就是纯文本节点beginElementType := EtText;while True dobeginif CharInSet(sc.CurrentChar, [#0, '<']) thenBreak;sc.IncSrc;end;tmp := sc.subStr(OldCodeIndex, sc.CodeIndex - OldCodeIndex);end;//// ShowMessage(sc.subStr(30));case ElementType ofEtUnknow:beginDoError('LineNum:' + IntToStr(BeginLineNum) + '无法解析的内容:' + sc.subStr(100));end;EtDocType:beginTag := CreateDocTypeElement(nil, tmp, BeginLineNum, BeginColNum);AElementList.Add(Tag);end;EtTag:beginTag := CreateTagElement(nil, tmp, BeginLineNum, BeginColNum);AElementList.Add(Tag);//if (UpperCase(Tag.FTagName) = 'SCRIPT') and (not Tag.FIsCloseTag) and (not Tag.FClosed) thenbegin// 读取ScriptBeginLineNum := sc.LineNum;BeginColNum := sc.ColNum;tmp := ParserScriptData();Tag := CreateScriptElement(nil, tmp, BeginLineNum, BeginColNum);AElementList.Add(Tag);endelse if (UpperCase(Tag.FTagName) = 'STYLE') and (not Tag.FIsCloseTag) and (not Tag.FClosed) thenbegin// 读取StyleBeginLineNum := sc.LineNum;BeginColNum := sc.ColNum;tmp := ParserStyleData();Tag := CreateStyleElement(nil, tmp, BeginLineNum, BeginColNum);AElementList.Add(Tag);end;end;EtText:beginTag := CreateTextElement(nil, tmp, BeginLineNum, BeginColNum);Tag.FSourceLine := BeginLineNum;Tag.FSourceCol := BeginColNum;AElementList.Add(Tag);end;EtComment:beginTag := CreateCommentElement(nil, tmp, BeginLineNum, BeginColNum);Tag.FSourceLine := BeginLineNum;Tag.FSourceCol := BeginColNum;AElementList.Add(Tag);end;end;end;////
end;function BuildTree(ElementList: THtmlElementList): THtmlElement;
varI, J: Integer;E: THtmlElement;T: THtmlElement;FoundIndex: Integer;TagProperty: WORD;
beginResult := THtmlElement.Create(nil, '', 0, 0);Result.FTagName := '#DOCUMENT';Result.FClosed := false;ElementList.Insert(0, Result);I := 1;while I < ElementList.Count dobeginE := ElementList[I] as THtmlElement;TagProperty := GetTagProperty(E.FTagName);// 空节点,往下找,如果下一个带Tag的节点不是它的关闭节点,那么自动关闭FoundIndex := -1;if E.FIsCloseTag thenbeginfor J := (I - 1) downto 0 dobeginT := ElementList[J] as THtmlElement;if (not T.FClosed) and (T.FTagName = E.FTagName) and (not T.FIsCloseTag) thenbeginFoundIndex := J;Break;end;end;// 如果往上找,找不到的话这个关闭Tag肯定是无意义的.if FoundIndex > 0 thenbeginfor J := (I - 1) downto FoundIndex dobeginT := ElementList[J] as THtmlElement;T.FClosed := True;end;(ElementList[FoundIndex] as THtmlElement).FCloseTag := E;endelsebeginE.Free;end;ElementList.Delete(I);Continue;endelsebeginfor J := (I - 1) downto 0 dobeginT := ElementList[J] as THtmlElement;if not T.FClosed thenbeginif ((GetTagProperty(T.FTagName) and tpEmpty) <> 0) thenT.FClosed := TrueelsebeginT.FChildren.Add(E);E.FOwner := T;Break;end;end;end;end;Inc(I);end;Result.FClosed := True;
end;function ParserHTML(const Source: WideString): IHtmlElement; stdcall;
varElementList: THtmlElementList;
beginElementList := THtmlElementList.Create;_ParserHTML(Source, ElementList);Result := BuildTree(ElementList);ElementList.Free;
end;
{$REGION '转换表之类的'}vargEntities: TStringDictionary;typeTEntityItem = recordKey: string;Value: WideChar;end;constEntityTable: array[0..252 - 1] of TEntityItem = ((Key: '&nbsp;';Value: WideChar(160)), (Key: '&iexcl;';Value: WideChar(161)), (Key: '&cent;';Value: WideChar(162)), (Key: '&pound;';Value: WideChar(163)), (Key: '&curren;';Value: WideChar(164)), (Key: '&yen;';Value: WideChar(165)), (Key: '&brvbar;';Value: WideChar(166)), (Key: '&sect;';Value: WideChar(167)), (Key: '&uml;';Value: WideChar(168)), (Key: '&copy;';Value: WideChar(169)), (Key: '&ordf;';Value: WideChar(170)), (Key: '&laquo;';Value: WideChar(171)), (Key: '&not;';Value: WideChar(172)), (Key: '&shy;';Value: WideChar(173)), (Key: '&reg;';Value: WideChar(174)), (Key: '&macr;';Value: WideChar(175)), (Key: '&deg;';Value: WideChar(176)), (Key: '&plusmn;';Value: WideChar(177)), (Key: '&sup2;';Value: WideChar(178)), (Key: '&sup3;';Value: WideChar(179)), (Key: '&acute;';Value: WideChar(180)), (Key: '&micro;';Value: WideChar(181)), (Key: '&para;';Value: WideChar(182)), (Key: '&middot;';Value: WideChar(183)), (Key: '&cedil;';Value: WideChar(184)), (Key: '&sup1;';Value: WideChar(185)), (Key: '&ordm;';Value: WideChar(186)), (Key: '&raquo;';Value: WideChar(187)), (Key: '&frac14;';Value: WideChar(188)), (Key: '&frac12;';Value: WideChar(189)), (Key: '&frac34;';Value: WideChar(190)), (Key: '&iquest;';Value: WideChar(191)), (Key: '&Agrave;';Value: WideChar(192)), (Key: '&Aacute;';Value: WideChar(193)), (Key: '&Acirc;';Value: WideChar(194)), (Key: '&Atilde;';Value: WideChar(195)), (Key: '&Auml;';Value: WideChar(196)), (Key: '&Aring;';Value: WideChar(197)), (Key: '&AElig;';Value: WideChar(198)), (Key: '&Ccedil;';Value: WideChar(199)), (Key: '&Egrave;';Value: WideChar(200)), (Key: '&Eacute;';Value: WideChar(201)), (Key: '&Ecirc;';Value: WideChar(202)), (Key: '&Euml;';Value: WideChar(203)), (Key: '&Igrave;';Value: WideChar(204)), (Key: '&Iacute;';Value: WideChar(205)), (Key: '&Icirc;';Value: WideChar(206)), (Key: '&Iuml;';Value: WideChar(207)), (Key: '&ETH;';Value: WideChar(208)), (Key: '&Ntilde;';Value: WideChar(209)), (Key: '&Ograve;';Value: WideChar(210)), (Key: '&Oacute;';Value: WideChar(211)), (Key: '&Ocirc;';Value: WideChar(212)), (Key: '&Otilde;';Value: WideChar(213)), (Key: '&Ouml;';Value: WideChar(214)), (Key: '&times;';Value: WideChar(215)), (Key: '&Oslash;';Value: WideChar(216)), (Key: '&Ugrave;';Value: WideChar(217)), (Key: '&Uacute;';Value: WideChar(218)), (Key: '&Ucirc;';Value: WideChar(219)), (Key: '&Uuml;';Value: WideChar(220)), (Key: '&Yacute;';Value: WideChar(221)), (Key: '&THORN;';Value: WideChar(222)), (Key: '&szlig;';Value: WideChar(223)), (Key: '&agrave;';Value: WideChar(224)), (Key: '&aacute;';Value: WideChar(225)), (Key: '&acirc;';Value: WideChar(226)), (Key: '&atilde;';Value: WideChar(227)), (Key: '&auml;';Value: WideChar(228)), (Key: '&aring;';Value: WideChar(229)), (Key: '&aelig;';Value: WideChar(230)), (Key: '&ccedil;';Value: WideChar(231)), (Key: '&egrave;';Value: WideChar(232)), (Key: '&eacute;';Value: WideChar(233)), (Key: '&ecirc;';Value: WideChar(234)), (Key: '&euml;';Value: WideChar(235)), (Key: '&igrave;';Value: WideChar(236)), (Key: '&iacute;';Value: WideChar(237)), (Key: '&icirc;';Value: WideChar(238)), (Key: '&iuml;';Value: WideChar(239)), (Key: '&eth;';Value: WideChar(240)), (Key: '&ntilde;';Value: WideChar(241)), (Key: '&ograve;';Value: WideChar(242)), (Key: '&oacute;';Value: WideChar(243)), (Key: '&ocirc;';Value: WideChar(244)), (Key: '&otilde;';Value: WideChar(245)), (Key: '&ouml;';Value: WideChar(246)), (Key: '&divide;';Value: WideChar(247)), (Key: '&oslash;';Value: WideChar(248)), (Key: '&ugrave;';Value: WideChar(249)), (Key: '&uacute;';Value: WideChar(250)), (Key: '&ucirc;';Value: WideChar(251)), (Key: '&uuml;';Value: WideChar(252)), (Key: '&yacute;';Value: WideChar(253)), (Key: '&thorn;';Value: WideChar(254)), (Key: '&yuml;';Value: WideChar(255)), (Key: '&fnof;';Value: WideChar(402)), (Key: '&Alpha;';Value: WideChar(913)), (Key: '&Beta;';Value: WideChar(914)), (Key: '&Gamma;';Value: WideChar(915)), (Key: '&Delta;';Value: WideChar(916)), (Key: '&Epsilon;';Value: WideChar(917)), (Key: '&Zeta;';Value: WideChar(918)), (Key: '&Eta;';Value: WideChar(919)), (Key: '&Theta;';Value: WideChar(920)), (Key: '&Iota;';Value: WideChar(921)), (Key: '&Kappa;';Value: WideChar(922)), (Key: '&Lambda;';Value: WideChar(923)), (Key: '&Mu;';Value: WideChar(924)), (Key: '&Nu;';Value: WideChar(925)), (Key: '&Xi;';Value: WideChar(926)), (Key: '&Omicron;';Value: WideChar(927)), (Key: '&Pi;';Value: WideChar(928)), (Key: '&Rho;';Value: WideChar(929)), (Key: '&Sigma;';Value: WideChar(931)), (Key: '&Tau;';Value: WideChar(932)), (Key: '&Upsilon;';Value: WideChar(933)), (Key: '&Phi;';Value: WideChar(934)), (Key: '&Chi;';Value: WideChar(935)), (Key: '&Psi;';Value: WideChar(936)), (Key: '&Omega;';Value: WideChar(937)), (Key: '&alpha;';Value: WideChar(945)), (Key: '&beta;';Value: WideChar(946)), (Key: '&gamma;';Value: WideChar(947)), (Key: '&delta;';Value: WideChar(948)), (Key: '&epsilon;';Value: WideChar(949)), (Key: '&zeta;';Value: WideChar(950)), (Key: '&eta;';Value: WideChar(951)), (Key: '&theta;';Value: WideChar(952)), (Key: '&iota;';Value: WideChar(953)), (Key: '&kappa;';Value: WideChar(954)), (Key: '&lambda;';Value: WideChar(955)), (Key: '&mu;';Value: WideChar(956)), (Key: '&nu;';Value: WideChar(957)), (Key: '&xi;';Value: WideChar(958)), (Key: '&omicron;';Value: WideChar(959)), (Key: '&pi;';Value: WideChar(960)), (Key: '&rho;';Value: WideChar(961)), (Key: '&sigmaf;';Value: WideChar(962)), (Key: '&sigma;';Value: WideChar(963)), (Key: '&tau;';Value: WideChar(964)), (Key: '&upsilon;';Value: WideChar(965)), (Key: '&phi;';Value: WideChar(966)), (Key: '&chi;';Value: WideChar(967)), (Key: '&psi;';Value: WideChar(968)), (Key: '&omega;';Value: WideChar(969)), (Key: '&thetasym;';Value: WideChar(977)), (Key: '&upsih;';Value: WideChar(978)), (Key: '&piv;';Value: WideChar(982)), (Key: '&bull;';Value: WideChar(8226)), (Key: '&hellip;';Value: WideChar(8230)), (Key: '&prime;';Value: WideChar(8242)), (Key: '&Prime;';Value: WideChar(8243)), (Key: '&oline;';Value: WideChar(8254)), (Key: '&frasl;';Value: WideChar(8260)), (Key: '&weierp;';Value: WideChar(8472)), (Key: '&image;';Value: WideChar(8465)), (Key: '&real;';Value: WideChar(8476)), (Key: '&trade;';Value: WideChar(8482)), (Key: '&alefsym;';Value: WideChar(8501)), (Key: '&larr;';Value: WideChar(8592)), (Key: '&uarr;';Value: WideChar(8593)), (Key: '&rarr;';Value: WideChar(8594)), (Key: '&darr;';Value: WideChar(8595)), (Key: '&harr;';Value: WideChar(8596)), (Key: '&crarr;';Value: WideChar(8629)), (Key: '&lArr;';Value: WideChar(8656)), (Key: '&uArr;';Value: WideChar(8657)), (Key: '&rArr;';Value: WideChar(8658)), (Key: '&dArr;';Value: WideChar(8659)), (Key: '&hArr;';Value: WideChar(8660)), (Key: '&forall;';Value: WideChar(8704)), (Key: '&part;';Value: WideChar(8706)), (Key: '&exist;';Value: WideChar(8707)), (Key: '&empty;';Value: WideChar(8709)), (Key: '&nabla;';Value: WideChar(8711)), (Key: '&isin;';Value: WideChar(8712)), (Key: '&notin;';Value: WideChar(8713)), (Key: '&ni;';Value: WideChar(8715)), (Key: '&prod;';Value: WideChar(8719)), (Key: '&sum;';Value: WideChar(8721)), (Key: '&minus;';Value: WideChar(8722)), (Key: '&lowast;';Value: WideChar(8727)), (Key: '&radic;';Value: WideChar(8730)), (Key: '&prop;';Value: WideChar(8733)), (Key: '&infin;';Value: WideChar(8734)), (Key: '&ang;';Value: WideChar(8736)), (Key: '&and;';Value: WideChar(8743)), (Key: '&or;';Value: WideChar(8744)), (Key: '&cap;';Value: WideChar(8745)), (Key: '&cup;';Value: WideChar(8746)), (Key: '&int;';Value: WideChar(8747)), (Key: '&there4;';Value: WideChar(8756)), (Key: '&sim;';Value: WideChar(8764)), (Key: '&cong;';Value: WideChar(8773)), (Key: '&asymp;';Value: WideChar(8776)), (Key: '&ne;';Value: WideChar(8800)), (Key: '&equiv;';Value: WideChar(8801)), (Key: '&le;';Value: WideChar(8804)), (Key: '&ge;';Value: WideChar(8805)), (Key: '&sub;';Value: WideChar(8834)), (Key: '&sup;';Value: WideChar(8835)), (Key: '&nsub;';Value: WideChar(8836)), (Key: '&sube;';Value: WideChar(8838)), (Key: '&supe;';Value: WideChar(8839)), (Key: '&oplus;';Value: WideChar(8853)), (Key: '&otimes;';Value: WideChar(8855)), (Key: '&perp;';Value: WideChar(8869)), (Key: '&sdot;';Value: WideChar(8901)), (Key: '&lceil;';Value: WideChar(8968)), (Key: '&rceil;';Value: WideChar(8969)), (Key: '&lfloor;';Value: WideChar(8970)), (Key: '&rfloor;';Value: WideChar(8971)), (Key: '&lang;';Value: WideChar(9001)), (Key: '&rang;';Value: WideChar(9002)), (Key: '&loz;';Value: WideChar(9674)), (Key: '&spades;';Value: WideChar(9824)), (Key: '&clubs;';Value: WideChar(9827)), (Key: '&hearts;';Value: WideChar(9829)), (Key: '&diams;';Value: WideChar(9830)), (Key: '&quot;';Value: WideChar(34)), (Key: '&amp;';Value: WideChar(38)), (Key: '&lt;';Value: WideChar(60)), (Key: '&gt;';Value: WideChar(62)), (Key: '&OElig;';Value: WideChar(338)), (Key: '&oelig;';Value: WideChar(339)), (Key: '&Scaron;';Value: WideChar(352)), (Key: '&scaron;';Value: WideChar(353)), (Key: '&Yuml;';Value: WideChar(376)), (Key: '&circ;';Value: WideChar(710)), (Key: '&tilde;';Value: WideChar(732)), (Key: '&ensp;';Value: WideChar(8194)), (Key: '&emsp;';Value: WideChar(8195)), (Key: '&thinsp;';Value: WideChar(8201)), (Key: '&zwnj;';Value: WideChar(8204)), (Key: '&zwj;';Value: WideChar(8205)), (Key: '&lrm;';Value: WideChar(8206)), (Key: '&rlm;';Value: WideChar(8207)), (Key: '&ndash;';Value: WideChar(8211)), (Key: '&mdash;';Value: WideChar(8212)), (Key: '&lsquo;';Value: WideChar(8216)), (Key: '&rsquo;';Value: WideChar(8217)), (Key: '&sbquo;';Value: WideChar(8218)), (Key: '&ldquo;';Value: WideChar(8220)), (Key: '&rdquo;';Value: WideChar(8221)), (Key: '&bdquo;';Value: WideChar(8222)), (Key: '&dagger;';Value: WideChar(8224)), (Key: '&Dagger;';Value: WideChar(8225)), (Key: '&permil;';Value: WideChar(8240)), (Key: '&lsaquo;';Value: WideChar(8249)), (Key: '&rsaquo;';Value: WideChar(8250)), (Key: '&euro;';Value: WideChar(8364)));function HexToChar(Value: string): Char;
varI: Integer;W: WORD;
beginW := 0;for I := Low(Value) to High(Value) dobegincase Value[I] of'0'..'9':W := (W shl 4) or (ord(Value[I]) - ord('0'));'a'..'f':W := (W shl 4) or (ord(Value[I]) - ord('a') + 10);'A'..'F':W := (W shl 4) or (ord(Value[I]) - ord('A') + 10);elseW := 0;end;end;Result := Char(W);
end;function DecToChar(Value: string): Char;
varI: Integer;W: WORD;
beginW := 0;for I := Low(Value) to High(Value) dobegincase Value[I] of'0'..'9':W := 10 * W + (ord(Value[I]) - ord('0'));elseW := 0;end;end;Result := Char(W);
end;function ConvertEntities(S: string): string;
vartmp: string;I, p: Integer;Sb: TStringBuilder;
beginif Length(S) <= 3 thenExit(S);if Pos('&#', S) > 0 thenbeginS[low(S)] := S[low(S)];end;Sb := TStringBuilder.Create;I := 0;while I < Length(S) dobeginif S.Chars[I] = '&' thenbeginp := S.IndexOf(';', I);if p >= 0 thenbegintmp := LowerCase(S.Substring(I, p - I + 1));if (Length(tmp) > 2) and (tmp.Chars[1] = '#') thenbeginif (Length(tmp) > 3) and (tmp.Chars[2] = '$') thenSb.Append(HexToChar(tmp.Substring(3, Length(tmp) - 4)))elseSb.Append(DecToChar(tmp.Substring(2, Length(tmp) - 3)));endelse if gEntities.ContainsKey(tmp) thenSb.Append(gEntities[tmp])elseSb.Append(tmp);Inc(I, Length(tmp));endelsebeginSb.Append(S.Chars[I]);Inc(I);end;endelsebeginSb.Append(S.Chars[I]);Inc(I);end;end;Result := Sb.ToString;FreeAndNil(Sb);
end;function ConvertWhiteSpace(S: string): string;
varSb: TStringBuilder;I: Integer;PreIssWhite, ThisIsWhite: Boolean;
beginSb := TStringBuilder.Create;PreIssWhite := false;for I := Low(S) to High(S) dobeginThisIsWhite := CharInSet(S[I], WhiteSpace);if ThisIsWhite thenbeginif not PreIssWhite thenSb.Append(S[I]);PreIssWhite := True;endelsebeginSb.Append(S[I]);PreIssWhite := false;end;end;Result := Sb.ToString;Sb.Free;
end;constBlockTags: array[0..59 - 1] of string = ('HTML', 'HEAD', 'BODY', 'FRAMESET', 'SCRIPT', 'NOSCRIPT', 'STYLE', 'META', 'LINK', 'TITLE', 'FRAME', 'NOFRAMES', 'SECTION', 'NAV', 'ASIDE', 'HGROUP', 'HEADER', 'FOOTER', 'P', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'UL', 'OL', 'PRE', 'DIV', 'BLOCKQUOTE', 'HR', 'ADDRESS', 'FIGURE', 'FIGCAPTION', 'FORM', 'FIELDSET', 'INS', 'DEL', 'S', 'DL', 'DT', 'DD', 'LI', 'TABLE', 'CAPTION', 'THEAD', 'TFOOT', 'TBODY', 'COLGROUP', 'COL', 'TR', 'TH', 'TD', 'VIDEO', 'AUDIO', 'CANVAS', 'DETAILS', 'MENU', 'PLAINTEXT');InlineTags: array[0..56 - 1] of string = ('OBJECT', 'BASE', 'FONT', 'TT', 'I', 'B', 'U', 'BIG', 'SMALL', 'EM', 'STRONG', 'DFN', 'CODE', 'SAMP', 'KBD', 'VAR', 'CITE', 'ABBR', 'TIME', 'ACRONYM', 'MARK', 'RUBY', 'RT', 'RP', 'A', 'IMG', 'BR', 'WBR', 'MAP', 'Q', 'SUB', 'SUP', 'BDO', 'IFRAME', 'EMBED', 'SPAN', 'INPUT', 'SELECT', 'TEXTAREA', 'LABEL', 'BUTTON', 'OPTGROUP', 'OPTION', 'LEGEND', 'DATALIST', 'KEYGEN', 'OUTPUT', 'PROGRESS', 'METER', 'AREA', 'PARAM', 'SOURCE', 'TRACK', 'SUMMARY', 'COMMAND', 'DEVICE');EmptyTags: array[0..14 - 1] of string = ('META', 'LINK', 'BASE', 'FRAME', 'IMG', 'BR', 'WBR', 'EMBED', 'HR', 'INPUT', 'KEYGEN', 'COL', 'COMMAND', 'DEVICE');FormatAsInlineTags: array[0..19 - 1] of string = ('TITLE', 'A', 'P', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'PRE', 'ADDRESS', 'LI', 'TH', 'TD', 'SCRIPT', 'STYLE', 'INS', 'DEL', 'S');PreserveWhitespaceTags: array[0..4 - 1] of string = ('PRE', 'PLAINTEXT', 'TITLE', 'TEXTAREA');vargTagProperty: TPropDictionary;function GetTagProperty(const TagName: string): WORD;
varKey, S: string;
beginResult := 0;Key := UpperCase(TagName);if gTagProperty.ContainsKey(Key) thenResult := gTagProperty[UpperCase(TagName)]elseExit;
end;function ParserCSSSelector(const Value: string): TCSSSelectorItemGroup;
varsc: TSourceContext;function AddAttr(var Item: TCSSSelectorItem): PAttrSelectorItem;beginSetLength(Item.Attributes, Length(Item.Attributes) + 1);Result := @Item.Attributes[Length(Item.Attributes) - 1];end;function ParserAttr(): TAttrSelectorItem;varoldIndex: Integer;tmp: string;stringChar: Char;beginsc.IncSrc(); // [Result.Key := '';Result.AttrOperator := aoEqual;Result.Value := '';// Keysc.SkipBlank();oldIndex := sc.CodeIndex;while not CharInSet(sc.CurrentChar, (WhiteSpace + OperatorChar + [']', #0])) dosc.IncSrc();Result.Key := sc.subStr(oldIndex, sc.CodeIndex - oldIndex);Result.Key := LowerCase(Result.Key);// Operatorsc.SkipBlank();oldIndex := sc.CodeIndex;case sc.CurrentChar of'=', '!', '*', '~', '|', '^', '$':beginsc.IncSrc;if sc.CurrentChar = '=' thensc.IncSrc;end;']':beginResult.AttrOperator := aoExist;sc.IncSrc;Exit;end;elsebeginDoError(Format('无法解析CSS Attribute操作符[%d,%d]', [sc.LineNum, sc.ColNum]));end;end;tmp := sc.subStr(oldIndex, sc.CodeIndex - oldIndex);if Length(tmp) >= 1 thenbegincase tmp[LowStrIndex] of'=':Result.AttrOperator := aoEqual;'!':Result.AttrOperator := aoNotEqual;'*':Result.AttrOperator := aoContain;'~':Result.AttrOperator := aoIncludeWord;'|':Result.AttrOperator := aoBeginWord;'^':Result.AttrOperator := aoBegin;'$':Result.AttrOperator := aoEnd;end;end;// Valuesc.SkipBlank();oldIndex := sc.CodeIndex;if CharInSet(sc.CurrentChar, ['"', '''']) thenstringChar := sc.CurrentCharelsestringChar := #0;sc.IncSrc();while True dobeginif stringChar = #0 thenbeginif CharInSet(sc.CurrentChar, (WhiteSpace + [#0, ']'])) thenBreak;endelse if (sc.CurrentChar = stringChar) thenbeginsc.IncSrc();Break;end;sc.IncSrc();end;Result.Value := sc.subStr(oldIndex, sc.CodeIndex - oldIndex);// SetString(Result.Value, oldP, P - oldP);if (stringChar <> #0) and (Length(Result.Value) >= 2) thenResult.Value := Copy(Result.Value, 2, Length(Result.Value) - 2);Result.Value := ConvertEntities(Result.Value);//sc.SkipBlank();if sc.CurrentChar = ']' thensc.IncSrcelseDoError(Format('无法解析Attribute值[%d,%d]', [sc.LineNum, sc.ColNum]));end;procedure ParserItem(var Item: TCSSSelectorItem);vartmp: string;pAttr: PAttrSelectorItem;beginsc.SkipBlank();while True dobegincase sc.CurrentChar of#0, ',', ' ':Break;'.': // classbeginsc.IncSrc();pAttr := AddAttr(Item);pAttr^.Key := 'class';pAttr^.AttrOperator := aoIncludeWord;pAttr^.Value := sc.ReadStr((WhiteSpace + OperatorChar + ['[', ']', '"', '''', ',', '.', '#', #0]));end;'#': // idbeginsc.IncSrc();pAttr := AddAttr(Item);pAttr^.Key := 'id';pAttr^.AttrOperator := aoEqual;pAttr^.Value := sc.ReadStr((WhiteSpace + OperatorChar + ['[', ']', '"', '''', ',', '.', '#', #0]));end;'[': // attributebeginpAttr := AddAttr(Item);pAttr^ := ParserAttr();end;'/':beginsc.IncSrc();if sc.CurrentChar = '*' then // /**/beginsc.IncSrc();sc.IncSrc();while True dobeginif (sc.CurrentChar = '/') and (sc.charOfCurrent[-1] = '*') thenbeginsc.IncSrc;Break;end;sc.IncSrc;end;end;end;elsebeginItem.szTag := UpperCase(sc.ReadStr((WhiteSpace + ['[', ']', '"', '''', ',', '.', '#', #0])));end;end;end;end;function AddItems(var Group: TCSSSelectorItemGroup): PCSSSelectorItems;beginSetLength(Group, Length(Group) + 1);Result := @Group[Length(Group) - 1];end;function AddItem(var Items: TCSSSelectorItems): PCSSSelectorItem;beginSetLength(Items, Length(Items) + 1);Result := @Items[Length(Items) - 1];Result^.Relation := sirNONE;end;varTag: string;pitems: PCSSSelectorItems;pItem: PCSSSelectorItem;
beginsc.setCode(Value);//pitems := AddItems(Result);pItem := AddItem(pitems^);while True dobeginsc.SkipBlank;ParserItem(pItem^);sc.SkipBlank;case sc.CurrentChar of',':beginsc.IncSrc();pitems := AddItems(Result);pItem := AddItem(pitems^);end;'>':beginsc.IncSrc();pItem := AddItem(pitems^);pItem^.Relation := sirChildren;end;'+':beginsc.IncSrc();pItem := AddItem(pitems^);pItem^.Relation := sirYoungerBrother;end;'~':beginsc.IncSrc();pItem := AddItem(pitems^);pItem^.Relation := sirAllYoungerBrother;end;#0:Break;elsebeginpItem := AddItem(pitems^);pItem^.Relation := sirDescendant;end;end;end;
end;procedure Init();
varI: Integer;Key: string;S: WORD;
begingEntities := TStringDictionary.Create();gTagProperty := TPropDictionary.Create;for I := low(EntityTable) to high(EntityTable) dobegingEntities.Add(EntityTable[I].Key, EntityTable[I].Value);end;//for I := low(BlockTags) to high(BlockTags) dogTagProperty.AddOrSetValue(BlockTags[I], tpBlock);for I := low(InlineTags) to high(InlineTags) dogTagProperty.AddOrSetValue(InlineTags[I], tpInline);for I := low(EmptyTags) to high(EmptyTags) dobeginKey := EmptyTags[I];if gTagProperty.ContainsKey(Key) thenS := gTagProperty[Key]elseS := 0;S := S or tpEmpty;gTagProperty.AddOrSetValue(Key, S);end;for I := low(FormatAsInlineTags) to high(FormatAsInlineTags) dobeginKey := FormatAsInlineTags[I];if gTagProperty.ContainsKey(Key) thenS := gTagProperty[Key]elseS := 0;S := S or tpFormatAsInline;gTagProperty.AddOrSetValue(Key[I], S);end;for I := low(PreserveWhitespaceTags) to high(PreserveWhitespaceTags) dobeginKey := PreserveWhitespaceTags[I];if gTagProperty.ContainsKey(Key) thenS := gTagProperty[Key]elseS := 0;S := S or tpPreserveWhitespace;gTagProperty.AddOrSetValue(PreserveWhitespaceTags[I], S);end;
end;procedure UnInit();
begingTagProperty.Free;gEntities.Free;
end;
{$ENDREGION '转换表之类的'}
{ TIHtmlElementList }function TIHtmlElementList.Add(Value: IHtmlElement): Integer;
beginResult := FList.Add(Value);
end;procedure TIHtmlElementList.Clear;
beginFList.Clear;
end;constructor TIHtmlElementList.Create;
begininherited Create;FList := TList<IHtmlElement>.Create;
end;procedure TIHtmlElementList.Delete(Index: Integer);
beginFList.Delete(Index);
end;destructor TIHtmlElementList.Destroy;
beginFList.Free;inherited Destroy;
end;function TIHtmlElementList.GetCount: Integer;
beginResult := FList.Count;
end;function TIHtmlElementList.GetItems(Index: Integer): IHtmlElement;
beginResult := FList[Index];
end;function TIHtmlElementList.IndexOf(Item: IHtmlElement): Integer;
beginResult := FList.IndexOf(Item)
end;procedure TIHtmlElementList.SetItems(Index: Integer; const Value: IHtmlElement);
beginFList[Index] := Value;
end;{ THtmlElement }constructor THtmlElement.Create(AOwner: THtmlElement; AText: string; ALine, ACol: Integer);
begininherited Create;FAttributes := TStringDictionary.Create();FChildren := TIHtmlElementList.Create;FOwner := AOwner;FOrignal := AText;FSourceLine := ALine;FSourceCol := ACol;
end;destructor THtmlElement.Destroy;
beginFChildren.Free;FAttributes.Free;inherited Destroy;
end;function THtmlElement.EnumAttributeNames(Index: Integer): WideString;
varAttrs: TStringDynArray;
beginResult := '';Attrs := FAttributes.Keys.ToArray;if Index < Length(Attrs) thenResult := Attrs[Index];
end;function THtmlElement.GetAttributes(Key: WideString): WideString;
beginResult := '';Key := LowerCase(Key);if FAttributes.ContainsKey(Key) thenResult := FAttributes[Key];
end;function THtmlElement.GetChildren(Index: Integer): IHtmlElement;
beginResult := FChildren[Index];
end;function THtmlElement.GetChildrenCount: Integer;
beginResult := FChildren.Count;
end;function THtmlElement.GetCloseTag: IHtmlElement;
beginResult := FCloseTag;
end;function THtmlElement.GetContent: WideString;
beginResult := FContent;
end;procedure THtmlElement._GetHtml(IncludeSelf: Boolean; Sb: TStringBuilder);
varI: Integer;E: THtmlElement;
beginif IncludeSelf thenSb.Append(FOrignal);for I := 0 to FChildren.Count - 1 dobeginE := FChildren[I] as THtmlElement;E._GetHtml(True, Sb);end;if IncludeSelf and (FCloseTag <> nil) then(FCloseTag as THtmlElement)._GetHtml(True, Sb);
end;procedure THtmlElement._GetText(IncludeSelf: Boolean; Sb: TStringBuilder);
varI: Integer;E: THtmlElement;
beginif IncludeSelf and (FTagName = '#TEXT') thenbeginSb.Append(FContent);end;for I := 0 to FChildren.Count - 1 dobeginE := FChildren[I] as THtmlElement;E._GetText(True, Sb);end;
end;procedure THtmlElement._Select(Item: PCSSSelectorItem; Count: Integer; r: TIHtmlElementList; OnlyTopLevel: Boolean);function _Filtered(): Boolean;varI: Integer;beginResult := false;if (Item^.szTag = '') or (Item^.szTag = '*') or (Item^.szTag = FTagName) thenbeginfor I := Low(Item^.Attributes) to High(Item^.Attributes) doif not AttrCompareFuns[Item^.Attributes[I].AttrOperator](Item^.Attributes[I], Self) thenExit;Result := True;end;end;varf: Boolean;I, SelfIndex: Integer;PE, E: THtmlElement;Next: PCSSSelectorItem;
begin// ShowMessage(item^.szTag);// ShowMessage(item^.Attributes[0].Key + ' ' + item^.Attributes[0].Value);f := _Filtered();if f thenbeginif (Count = 1) thenbeginif (r.IndexOf(Self as IHtmlElement) < 0) thenr.Add(Self as IHtmlElement);endelse if Count > 1 thenbeginNext := Item;Inc(Next);PE := Self.FOwner;if PE = nil thenSelfIndex := -1elseSelfIndex := PE.FChildren.IndexOf(Self as IHtmlElement);case Next^.Relation ofsirDescendant, sirChildren:beginfor I := 0 to FChildren.Count - 1 dobeginE := FChildren[I] as THtmlElement;E._Select(Next, Count - 1, r, Next^.Relation = sirChildren);end;end;sirAllYoungerBrother, sirYoungerBrother:beginif (PE <> nil) and (SelfIndex >= 0) thenfor I := (SelfIndex + 1) to PE.FChildren.Count - 1 dobeginE := PE.FChildren[I] as THtmlElement;if (Length(E.FTagName) = 0) or (E.FTagName[LowStrIndex] <> '#') thenbeginE._Select(Next, Count - 1, r, True);if (Next^.Relation = sirYoungerBrother) thenBreak;end;end;end;end;end;end;if not OnlyTopLevel thenfor I := 0 to FChildren.Count - 1 dobeginE := FChildren[I] as THtmlElement;E._Select(Item, Count, r);end;
end;procedure THtmlElement._SimpleCSSSelector(const ItemGroup: TCSSSelectorItemGroup; r: TIHtmlElementList);
varI: Integer;
beginfor I := Low(ItemGroup) to High(ItemGroup) dobegin_Select(@ItemGroup[I][0], Length(ItemGroup[I]), r);end;
end;function THtmlElement.GetInnerHtml: WideString;
varSb: TStringBuilder;
beginSb := TStringBuilder.Create;_GetHtml(false, Sb);Result := Sb.ToString;Sb.Free;
end;function THtmlElement.GetInnerText: WideString;
varSb: TStringBuilder;
beginSb := TStringBuilder.Create;_GetText(True, Sb);Result := Sb.ToString;Sb.Free;
end;function THtmlElement.GetOrignal: WideString;
beginResult := FOrignal;
end;function THtmlElement.GetOuterHtml: WideString;
varSb: TStringBuilder;
beginSb := TStringBuilder.Create;_GetHtml(True, Sb);Result := Sb.ToString;Sb.Free;
end;function THtmlElement.GetSourceColNum: Integer;
beginResult := FSourceCol;
end;function THtmlElement.GetSourceLineNum: Integer;
beginResult := FSourceLine;
end;function THtmlElement.GetTagName: WideString;
beginResult := FTagName;
end;function THtmlElement.HasAttribute(AttributeName: WideString): Boolean;
beginResult := FAttributes.ContainsKey(LowerCase(AttributeName));
end;function THtmlElement.SimpleCSSSelector(const selector: WideString): IHtmlElementList;
varr: TIHtmlElementList;
beginr := TIHtmlElementList.Create;_SimpleCSSSelector(ParserCSSSelector(selector), r);Result := r as IHtmlElementList;
end;{ TSourceContext }function TSourceContext.subStr(Index, Count: Integer): string;
beginResult := System.Copy(Code, Index{$IF (LowStrIndex = 0)}  + 1{$ENDIF}, Count);
end;function TSourceContext.subStr(Count: Integer): string;
beginResult := subStr(CodeIndex, Count);
end;function TSourceContext.ReadStr(UntilChars: TSysCharSet): string;
varoldIndex: Integer;stringChar: Char;
beginSkipBlank;oldIndex := CodeIndex;if CharInSet(CurrentChar, ['"', '''']) thenstringChar := CurrentCharelsestringChar := #0;IncSrc;while True dobeginif stringChar = #0 thenbeginif CharInSet(CurrentChar, UntilChars) thenBreak;endelse if (CurrentChar = stringChar) thenbeginIncSrc;Break;end;IncSrc;end;Result := subStr(oldIndex, CodeIndex - oldIndex);if (stringChar <> #0) and (Length(Result) >= 2) thenResult := System.Copy(Result, 2, Length(Result) - 2);
end;function TSourceContext.GetCharOfCurrent(Index: Integer): Char;
beginResult := Code[CodeIndex + Index];
end;procedure TSourceContext.IncSrc;
beginif CurrentChar = #10 thenbeginInc(LineNum);ColNum := 1;endelseInc(ColNum);Inc(CodeIndex);CurrentChar := Code[CodeIndex];
{$IFDEF DEBUG}currentCode := PChar(@Code[CodeIndex]);
{$ENDIF}
end;procedure TSourceContext.IncSrc(Step: Integer);
varI: Integer;
beginfor I := 0 to Step - 1 doIncSrc();
end;function TSourceContext.PeekStr: string;
beginResult := PeekStr(CodeIndex);
end;procedure TSourceContext.setCode(const ACode: string);
beginCurrentChar := #0;Code := ACode;LineNum := 1;ColNum := 1;CodeIndex := Low(Code);if Length(ACode) > 0 thenbeginCurrentChar := Code[CodeIndex];
{$IFDEF DEBUG}currentCode := PChar(@Code[CodeIndex]);
{$ENDIF}end;end;function TSourceContext.PeekStr(Index: Integer): string;
varoldIndex: Integer;
beginResult := '';oldIndex := Index;while not CharInSet(Code[Index], (WhiteSpace + ['/', '>'])) doInc(Index);Result := subStr(oldIndex, Index - oldIndex);
end;procedure TSourceContext.SkipBlank();
beginwhile CharInSet(CurrentChar, WhiteSpace) doIncSrc();
end;initializationInit();finalizationUnInit();end.
FListNode := parserHtml(sResponseHtml);tmpIHtmlElementList := FListNode.SimpleCSSSelector('.item-main');for k := 0 to tmpIHtmlElementList.Count - 1 dobegintryFDivItemsRoot := tmpIHtmlElementList.Items[k];//YeartmpIHtmlElementList2 := FDivItemsRoot.SimpleCSSSelector('.s-gold-supplier-year-icon');if tmpIHtmlElementList2.Count > 0 thenbegintmpShopInfo.years := tmpIHtmlElementList2.Items[0].InnerText;end;......... end;

利用武稀松版HtmlParser解析Html,使用CSS选择器定位节点相关推荐

  1. 在.net中运用HTMLParser解析网页的原理和方法

    本文介绍了.net 版的一个HTMLParser网页解析开源类库(Winista.HTMLParser)的功能特性.工作原理和使用方法.对于使用.net进行Web信息提取的开发人员进行了一次HTMLP ...

  2. htmlparser解析网站时服务器返回的文件编码和页面编码不一致问题

    1.用htmlparser解析http://gz.fang.com/抛出如下异常: Exception in thread "main" org.htmlparser.util.E ...

  3. a标签在微信iOS版本的解析没有问题,但是在安卓版就解析不出来

    <a>标签在微信iOS版本的解析没有问题,但是在安卓版就解析不出来.例如<a href="xx.xx.xx.xx">立即绑定</a> ,在安卓版 ...

  4. 用HTMLParser解析html时报错:No module named 'htmlentitydefs'

    python3.6用HTMLParser解析html时报错 No module named 'htmlentitydefs'或No module named 'markupbase' 先上代码 fro ...

  5. 计算机操作系统汤晓丹第四版+指导解析

    1.计算机操作系统汤晓丹第四版+指导解析 链接:https://pan.baidu.com/s/1iTcKmfrkiuETYhdE1DU0Vw 提取码:私信 --来自百度网盘超级会员V3的分享 备用下 ...

  6. 无机化学(第四版)(上册、下册)课后答案 宋天佑 徐家宁 程功臻 王莉版答案解析 高等教育出版社 第5章课后题答案 习题解答

    无机化学(第四版)(上册.下册)课后答案 宋天佑 徐家宁 程功臻 王莉版答案解析 高等教育出版社 第5章课后题答案 完整版答案看文章末尾处 上册 第1章 化学基础知识 第2章 化学热力学基础 第3章 ...

  7. 高中数学选择填空题快速提分技巧_Word版含解析

    本文作者:vxbomath 高中数学是高考大三主科中的最难科目,也是最容易拉分的科目,高考选择填空题占分数高,高考想这两部分题目的满分,既容易也不容易.本文章主要以高中数学选择题快速提分技巧 推荐阅读 ...

  8. 使用HtmlParser解析HTML

    原文地址为: 使用HtmlParser解析HTML 如果要对HTML进行 解析,提取HTML的数据或者修改HTML数据, HtmlParser是一个不错的选择. 使用HtmlParser可以解析本地和 ...

  9. HTMLParser解析html详解

    HTMLParser具有小巧,快速的优点,缺点是相关文档比较少(英文的也少),很多功能需要自己摸索.对于初学者还是要费一些功夫的,而一旦上手以后,会发现HTMLParser的结构设计很巧妙,非常实用, ...

最新文章

  1. basequickadapter详解_在kotlin中如何使用BaseQuickAdapter适配器
  2. 【深度学习】Swin-Transformer和EfficientNet对比分析
  3. Apache Commons ArrayUtils.toString(Object)与JDK Arrays.toString(Object)
  4. [引]生成加密和解密的密钥
  5. 【js】JavaScript parser实现浅析
  6. HTML5的web开发,基于Html5技术的WEB开发
  7. 最热门的100个Web2.0网站
  8. linux运行脚本运行不了,解决linux 运行自动化脚本浏览器无法启动问题
  9. 一个具有对象计数功能的基类
  10. 每天一个实用小技巧!巧妙利用Mac标记,分类同类文件
  11. 教你微软原版win7下载安装教程
  12. Cordova+Vue实现Android APP开发
  13. IDEA教育版申请流程
  14. html向下的箭头符号,向下的箭头符号
  15. 颜色恒常性 传统算法(AWB)
  16. 杜国光博士,基于视觉的机器人抓取--物体定位,位姿估计到抓取估计课堂笔记
  17. 工作日志----统一工号补充了啦
  18. 字幕滚动效果---非常酷哦
  19. 据说Kivy可以将Python程序弄成App来玩,所以 安装Kivy。关于安装Kivy失败后的解决方案过程
  20. win10 安装sqlserver2008

热门文章

  1. 关于学历——全面了解下学信网的各种学历信息
  2. 实验室计算机管理系统,实验室管理系统
  3. 算法笔记- K均值(K-Means)
  4. 计算机注册表知识,win7注册表小常识
  5. turbotax原理_使用OpenTaxSolver作为TurboTax的开源替代品
  6. 预防颈椎之痛的几个小技巧
  7. 写数据到word模板代码
  8. const指针与指向const对象的指针
  9. 太太太太太卷了,累了
  10. 再读目标检测--ssd深度解析