Parsing Epub -- Pageturner

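The previous article, "Introduction to Epub", covered the Epub file format and its internal structure. This article uses Pageturner, an open-source project built on EpubLib, to walk through how an Epub file is parsed.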

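I. Building the book

1. Basic information

As described in the previous article, a book's information can be read from the metadata element of the content.opf file. In Pageturner, the class that stores this information is Metadata, with the following fields: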

public class Metadata implements Serializable {
    private static final long serialVersionUID = -2437262888962149444L;
    public static final String DEFAULT_LANGUAGE = "en";
    private boolean autoGeneratedId = true;
    // authors
    private List<Author> authors = new ArrayList<Author>();
    // contributors
    private List<Author> contributors = new ArrayList<Author>();
    // dates
    private List<Date> dates = new ArrayList<Date>();
    // language
    private String language = DEFAULT_LANGUAGE;
    // other properties
    private Map<QName, String> otherProperties = new HashMap<QName, String>();
    private List<String> rights = new ArrayList<String>();
    // book titles
    private List<String> titles = new ArrayList<String>();
    private List<Identifier> identifiers = new ArrayList<Identifier>();
    private List<String> subjects = new ArrayList<String>();
    private String format = MediatypeService.EPUB.getName();
    private List<String> types = new ArrayList<String>();
    // descriptions
    private List<String> descriptions = new ArrayList<String>();
    // publishers
    private List<String> publishers = new ArrayList<String>();
    private Resource coverImage;

    public Metadata() {
        identifiers.add(new Identifier());
        autoGeneratedId = true;
    }
}

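PackageDocumentMetadataReader is responsible for parsing the metadata. In the read method shown here, the utility method ResourceUtil.getAsDocument first builds a Document object from the package file; the guide, manifest, and cover are read from it, and PackageDocumentMetadataReader.readMetadata then extracts the basic information such as the cover and the title from the corresponding fields.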

public static void read(Resource packageResource, EpubReader epubReader, Book book, Resources resources)
        throws UnsupportedEncodingException, SAXException, IOException, ParserConfigurationException {
    Document packageDocument = ResourceUtil.getAsDocument(packageResource);
    String packageHref = packageResource.getHref();
    resources = fixHrefs(packageHref, resources);
    readGuide(packageDocument, epubReader, book, resources);

    // Books sometimes use non-identifier ids. We map these here to legal ones
    Map<String, String> idMapping = new HashMap<String, String>();

    resources = readManifest(packageDocument, packageHref, epubReader, resources, idMapping);
    book.setResources(resources);
    readCover(packageDocument, book);
    book.setMetadata(PackageDocumentMetadataReader.readMetadata(packageDocument, book.getResources()));
    book.setSpine(readSpine(packageDocument, epubReader, book.getResources(), idMapping));

    // if we did not find a cover page then we make the first page of the book the cover page
    if (book.getCoverPage() == null && book.getSpine().size() > 0) {
        book.setCoverPage(book.getSpine().getResource(0));
    }
}
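To make the metadata step more concrete, here is a minimal sketch that pulls Dublin Core values out of a package document using only the JDK's DOM parser. It is not EpubLib's implementation: the class OpfMetadataSketch, its methods, and the sample OPF string are all hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of EpubLib or Pageturner.
class OpfMetadataSketch {
    // Namespace of the Dublin Core elements (dc:title, dc:creator, ...).
    static final String NAMESPACE_DC = "http://purl.org/dc/elements/1.1/";

    // Made-up package document used as sample input.
    static final String SAMPLE_OPF =
        "<package xmlns=\"http://www.idpf.org/2007/opf\" version=\"2.0\">"
        + "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
        + "<dc:title>Sample Book</dc:title>"
        + "<dc:creator>Jane Doe</dc:creator>"
        + "<dc:language>en</dc:language>"
        + "</metadata></package>";

    // Parses OPF text into a namespace-aware DOM document.
    static Document parse(String opfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(opfXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed package document", e);
        }
    }

    // Returns the trimmed, non-empty text of every dc:<localName> element.
    static List<String> dcValues(Document doc, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(NAMESPACE_DC, localName);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getTextContent().trim();
            if (!text.isEmpty()) {
                values.add(text);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE_OPF);
        System.out.println("titles:  " + dcValues(doc, "title"));
        System.out.println("authors: " + dcValues(doc, "creator"));
    }
}
```

Returning lists rather than single values matches the shape of the Metadata class, where titles, authors, and similar fields are all stored as lists.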

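2. Table of contents

Once the basic information is available, the next step is to parse the table of contents. Likewise, the table of contents is read from toc.ncx. The sequence diagram of how Pageturner obtains it is as follows:

When a book is opened for the first time, it is parsed from its file location to obtain its basic information and table of contents. BookView's loadText method is called first, but the class that actually loads the book content is TextLoader, which wraps the relevant methods, including those for fetching chapter content. Before looking at how the table of contents is read, here is the data structure behind each table-of-contents item: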

public class TOCReference extends TitledResourceReference implements Serializable {
    private static final long serialVersionUID = 5787958246077042456L;
    private List<TOCReference> children;

    public TOCReference() {
        this(null, null, null);
    }
}

public class TitledResourceReference extends ResourceReference implements Serializable {
    private static final long serialVersionUID = 3918155020095190080L;
    private String fragmentId;
    private String title;

    public TitledResourceReference(Resource resource) {
        this(resource, null);
    }

    public TitledResourceReference(Resource resource, String title) {
        this(resource, title, null);
    }

    public TitledResourceReference(Resource resource, String title, String fragmentId) {
        super(resource);
        this.title = title;
        this.fragmentId = fragmentId;
    }
}

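As shown, there are four main fields:

String title: the title of the entry
String fragmentId: the fragment id associated with the entry
List<TOCReference> children: the child entries, which means multi-level nesting is supported
Resource resource: the most important field; it stores the detailed information for each table-of-contents item

Its members are as follows: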

public class Resource implements Serializable {
    private static final long serialVersionUID = 1043946707835004037L;
    private String id;
    private String title;
    private String href;
    private MediaType mediaType;
    private String inputEncoding = Constants.CHARACTER_ENCODING;
    private byte[] data;
}

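Next comes reading the table of contents. This is handled by NCXDocument: its read method locates the NCX resource, and readTOCReferences converts the navPoint nodes into TOCReference objects, as follows: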

public static Resource read(Book book, EpubReader epubReader) {
    Resource ncxResource = null;
    if (book.getSpine().getTocResource() == null) {
        log.error("Book does not contain a table of contents file");
        return ncxResource;
    }
    try {
        ncxResource = book.getSpine().getTocResource();
        if (ncxResource == null) {
            return ncxResource;
        }
        Document ncxDocument = ResourceUtil.getAsDocument(ncxResource);
        Element navMapElement = DOMUtil.getFirstElementByTagNameNS(ncxDocument.getDocumentElement(), NAMESPACE_NCX, NCXTags.navMap);
        TableOfContents tableOfContents = new TableOfContents(readTOCReferences(navMapElement.getChildNodes(), book));
        book.setTableOfContents(tableOfContents);
    } catch (Exception e) {
        log.error(e.getMessage(), e);
    }
    return ncxResource;
}

private static List<TOCReference> readTOCReferences(NodeList navpoints, Book book) {
    if (navpoints == null) {
        return new ArrayList<TOCReference>();
    }
    List<TOCReference> result = new ArrayList<TOCReference>(navpoints.getLength());
    for (int i = 0; i < navpoints.getLength(); i++) {
        Node node = navpoints.item(i);
        if (node.getNodeType() != Document.ELEMENT_NODE) {
            continue;
        }
        if (!(node.getLocalName().equals(NCXTags.navPoint))) {
            continue;
        }
        TOCReference tocReference = readTOCReference((Element) node, book);
        result.add(tocReference);
    }
    return result;
}

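The basic flow is the same as before: build a Document object from the toc.ncx file, get the Element for the navMap tag, and then parse its NodeList in readTOCReferences. The figure below shows the data of one parsed item, which makes the parsing flow easier to follow.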
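The same flow can be sketched with the JDK's DOM API alone. The example below is a simplified stand-in for NCXDocument and is not EpubLib's code: the class NcxTocSketch, the TocEntry type, and the sample NCX string are hypothetical. It also assumes that everything after the '#' in a content src is what ends up as the fragment id, and it recurses into nested navPoint elements to build the multi-level structure.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch; a simplified stand-in for NCXDocument, not EpubLib code.
class NcxTocSketch {
    static final String NAMESPACE_NCX = "http://www.daisy.org/z3986/2005/ncx/";

    // Simplified counterpart of TOCReference.
    static final class TocEntry {
        final String title;
        final String href;
        final String fragmentId; // null when the src has no '#' part (assumption)
        final List<TocEntry> children;

        TocEntry(String title, String href, String fragmentId, List<TocEntry> children) {
            this.title = title;
            this.href = href;
            this.fragmentId = fragmentId;
            this.children = children;
        }
    }

    // Made-up NCX document: one chapter with a nested section, then a second chapter.
    static final String SAMPLE_NCX =
        "<ncx xmlns=\"http://www.daisy.org/z3986/2005/ncx/\" version=\"2005-1\"><navMap>"
        + "<navPoint id=\"n1\"><navLabel><text>Chapter 1</text></navLabel><content src=\"ch1.html\"/>"
        + "<navPoint id=\"n2\"><navLabel><text>Section 1.1</text></navLabel><content src=\"ch1.html#s1\"/></navPoint>"
        + "</navPoint>"
        + "<navPoint id=\"n3\"><navLabel><text>Chapter 2</text></navLabel><content src=\"ch2.html\"/></navPoint>"
        + "</navMap></ncx>";

    // Parses the NCX text and returns the top-level entries of its navMap.
    static List<TocEntry> readToc(String ncxXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(ncxXml.getBytes(StandardCharsets.UTF_8)));
            Node navMap = doc.getElementsByTagNameNS(NAMESPACE_NCX, "navMap").item(0);
            if (navMap == null) {
                return new ArrayList<TocEntry>();
            }
            return readNavPoints(navMap.getChildNodes());
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed NCX document", e);
        }
    }

    // Converts each navPoint element in the list, recursing into nested navPoints.
    static List<TocEntry> readNavPoints(NodeList nodes) {
        List<TocEntry> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE || !"navPoint".equals(node.getLocalName())) {
                continue; // skip whitespace text nodes, navLabel, content, ...
            }
            Element navPoint = (Element) node;
            Element label = directChild(navPoint, "navLabel");
            Element text = (label == null) ? null : directChild(label, "text");
            Element content = directChild(navPoint, "content");
            String title = (text == null) ? "" : text.getTextContent().trim();
            String src = (content == null) ? "" : content.getAttribute("src");
            int hash = src.indexOf('#');
            String href = (hash < 0) ? src : src.substring(0, hash);
            String fragmentId = (hash < 0) ? null : src.substring(hash + 1);
            result.add(new TocEntry(title, href, fragmentId, readNavPoints(navPoint.getChildNodes())));
        }
        return result;
    }

    // Returns the first direct child element with the given local name, or null.
    static Element directChild(Element parent, String localName) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE && localName.equals(child.getLocalName())) {
                return (Element) child;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (TocEntry entry : readToc(SAMPLE_NCX)) {
            System.out.println(entry.title + " -> " + entry.href + " (" + entry.children.size() + " children)");
        }
    }
}
```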

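That covers the whole process of building the book information. Next, we look at how the content of each chapter is obtained while reading, and how it is combined with the CSS files to render rich text such as mixed text and images.

II. Chapter content

Chapter content is obtained as described above. When the reader opens a chapter, a LoadTextTask is created to fetch it. The Resource object provides the data stream for the chapter's content: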

public Reader getReader() throws IOException {
    return new XmlStreamReader(new ByteArrayInputStream(getData()), getInputEncoding());
}
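As a rough illustration of what getReader provides, the sketch below wraps a chapter's raw bytes in a Reader using only JDK classes. It is a simplification under stated assumptions: the class ChapterReaderSketch and its methods are hypothetical, and the sketch simply trusts the encoding it is given, whereas XmlStreamReader may do more than that.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; plain JDK classes stand in for Resource and XmlStreamReader.
class ChapterReaderSketch {
    // Wraps the in-memory chapter bytes in a Reader that decodes with the given encoding.
    static Reader openReader(byte[] data, String encoding) {
        return new InputStreamReader(new ByteArrayInputStream(data), Charset.forName(encoding));
    }

    // Drains the reader into a String and closes it.
    static String readAll(Reader reader) {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(openReader(data, "UTF-8")));
    }
}
```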

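The most important part of obtaining chapter content is how the HTML file and the CSS styles are combined. The following walks through this using htmlSpanner.fromHtml.

1. First, all the data is parsed into TagNode objects. This is done by HtmlCleaner.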

public TagNode clean(Reader reader, final CleanTimeValues cleanTimeValues) throws IOException {
    cleanTimeValues._openTags = new OpenTags();
    cleanTimeValues._headOpened = false;
    cleanTimeValues._bodyOpened = false;
    cleanTimeValues._headTags.clear();
    cleanTimeValues.allTags.clear();
    setPruneTags(properties.pruneTags, cleanTimeValues);
    cleanTimeValues.htmlNode = createTagNode("html", cleanTimeValues);
    cleanTimeValues.bodyNode = createTagNode("body", cleanTimeValues);
    cleanTimeValues.headNode = createTagNode("head", cleanTimeValues);
    cleanTimeValues.rootNode = null;
    cleanTimeValues.htmlNode.addChild(cleanTimeValues.headNode);
    cleanTimeValues.htmlNode.addChild(cleanTimeValues.bodyNode);

    HtmlTokenizer htmlTokenizer = new HtmlTokenizer(reader, properties, transformations, tagInfoProvider) {
        @Override
        void makeTree(List<BaseToken> tokenList) {
            HtmlCleaner.this.makeTree(tokenList, tokenList.listIterator(tokenList.size() - 1), cleanTimeValues);
        }

        @Override
        TagNode createTagNode(String name) {
            return HtmlCleaner.this.createTagNode(name, cleanTimeValues);
        }
    };
    htmlTokenizer.start();

    List<BaseToken> nodeList = htmlTokenizer.getTokenList();
    closeAll(nodeList, cleanTimeValues);
    createDocumentNodes(nodeList, cleanTimeValues);
    calculateRootNode(cleanTimeValues);

    // if there are some nodes to prune from tree
    if (cleanTimeValues.pruneNodeSet != null && !cleanTimeValues.pruneNodeSet.isEmpty()) {
        Iterator iterator = cleanTimeValues.pruneNodeSet.iterator();
        while (iterator.hasNext()) {
            TagNode tagNode = (TagNode) iterator.next();
            TagNode parent = tagNode.getParent();
            if (parent != null) {
                parent.removeChild(tagNode);
            }
        }
    }
    cleanTimeValues.rootNode.setDocType(htmlTokenizer.getDocType());
    return cleanTimeValues.rootNode;
}

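Calling htmlTokenizer.start() begins reading and traversing the chapter data. Within it, the content method reads characters until it reaches the '<' that opens the next tag, and saves what it has read as content; in other words, it creates a ContentNode.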

private boolean content() throws IOException {
    while (!isAllRead()) {
        if (isValidXmlCharSafe()) {
            saveCurrentSafe();
        }
        go();
        if (isCharSimple('<')) {
            break;
        }
    }
    return addSavedAsContent();
}
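The scanning idea behind content() can be shown with a much smaller tokenizer. The sketch below is not HtmlCleaner's code: the class ContentScanSketch and its tokenize method are hypothetical, and the sketch ignores comments, CDATA, attributes containing '>', and everything else a real HTML tokenizer has to handle. It only demonstrates accumulating characters until the next '<' and emitting them as one text token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; far simpler than HtmlCleaner's HtmlTokenizer.
class ContentScanSketch {
    // Splits markup into a flat list of tag tokens ("<p>") and text tokens ("Hello ").
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            if (html.charAt(pos) == '<') {
                // Tag token: everything up to and including the closing '>'.
                int close = html.indexOf('>', pos);
                int end = (close < 0) ? html.length() : close + 1;
                tokens.add(html.substring(pos, end));
                pos = end;
            } else {
                // Text token: same idea as content(), save characters until the next '<'.
                StringBuilder saved = new StringBuilder();
                while (pos < html.length() && html.charAt(pos) != '<') {
                    saved.append(html.charAt(pos));
                    pos++;
                }
                tokens.add(saved.toString());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<p>Hello <b>world</b></p>"));
    }
}
```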

2. Next, each node is traversed (text nodes carry style attributes), its data is read, and the styles are filled in.

public Spannable fromTagNode(TagNode node, HtmlSpanner.CancellationCallback cancellationCallback) {
    SpannableStringBuilder result = new SpannableStringBuilder();
    SpanStack stack = new SpanStack();
    // Build the spannable text. Styles are not applied yet; they are only
    // recorded (which range of the text uses which style).
    this.applySpan(result, node, stack, cancellationCallback);
    // Apply the recorded styles.
    stack.applySpans(this, result);
    return result;
}

public void handleTagNode(TagNode node, SpannableStringBuilder builder, int start, int end, Style useStyle, SpanStack stack) {
    if (useStyle.getDisplayStyle() == DisplayStyle.BLOCK) {
        this.appendNewLine(builder);
        if (useStyle.getMarginBottom() != null) {
            StyleValue styleValue = useStyle.getMarginBottom();
            if (styleValue.getUnit() == Unit.PX) {
                if (styleValue.getIntValue() > 0) {
                    this.appendNewLine(builder);
                    stack.pushSpan(new VerticalMarginSpan(styleValue.getIntValue()), builder.length() - 1, builder.length());
                }
            } else if (styleValue.getFloatValue() > 0.0F) {
                this.appendNewLine(builder);
                stack.pushSpan(new VerticalMarginSpan(styleValue.getFloatValue()), builder.length() - 1, builder.length());
            }
        }
    }
    stack.pushSpan(new StyleCallback(this.getSpanner().getFontResolver().getDefaultFont(), useStyle, start, builder.length()));
}

public final void handleTagNode(TagNode node, SpannableStringBuilder builder, int start, int end, SpanStack spanStack) {
    Style styleFromCSS = spanStack.getStyle(node, this.getStyle());
    this.handleTagNode(node, builder, start, end, styleFromCSS, spanStack);
}
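The two-phase pattern in fromTagNode (first build the text while recording style ranges, then apply the recorded styles) can be sketched without any Android classes. Everything in the example below is hypothetical: TwoPhaseSpanSketch, Node, and Span are illustrative stand-ins for the tag tree and SpanStack, styles are plain strings, and "applying" a span just reports which text it covers. Popping spans in last-in-first-out order is an assumption of this sketch, not a statement about the library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of "record spans first, apply them afterwards".
class TwoPhaseSpanSketch {
    // Minimal stand-in for a parsed tag tree.
    static final class Node {
        final String style; // e.g. "bold"; null means no style of its own
        final String text;  // text of a leaf; null for container nodes
        final List<Node> children;

        Node(String style, String text, Node... children) {
            this.style = style;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    // A recorded style range over the text built so far.
    static final class Span {
        final String style;
        final int start;
        final int end;

        Span(String style, int start, int end) {
            this.style = style;
            this.start = start;
            this.end = end;
        }
    }

    // Phase 1: append text and record which range each styled node covers.
    static void collect(Node node, StringBuilder text, Deque<Span> stack) {
        int start = text.length();
        if (node.text != null) {
            text.append(node.text);
        }
        for (Node child : node.children) {
            collect(child, text, stack);
        }
        if (node.style != null) {
            stack.push(new Span(node.style, start, text.length()));
        }
    }

    // Phase 2: pop every recorded span and "apply" it to the finished text.
    static List<String> apply(StringBuilder text, Deque<Span> stack) {
        List<String> applied = new ArrayList<>();
        while (!stack.isEmpty()) {
            Span span = stack.pop();
            applied.add(span.style + ":" + text.substring(span.start, span.end));
        }
        return applied;
    }

    static List<String> render(Node root) {
        StringBuilder text = new StringBuilder();
        Deque<Span> stack = new ArrayDeque<>();
        collect(root, text, stack);
        return apply(text, stack);
    }

    // Made-up tree: a block containing plain text followed by bold text.
    static Node sample() {
        return new Node("block", null, new Node(null, "Hello "), new Node("bold", "world"));
    }

    public static void main(String[] args) {
        System.out.println(render(sample()));
    }
}
```

Deferring the styling means a span's end offset is only fixed after all child text has been appended, which is why the ranges are recorded during the walk and applied once the text is complete.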

The styling work is done mainly by StyleCallback, which adapts the spans according to the CSS styles.


public class StyleCallback implements SpanCallback {
    private int start;
    private int end;
    private FontFamily defaultFont;
    private Style useStyle;

    public StyleCallback(FontFamily defaultFont, Style style, int start, int end) {
        this.defaultFont = defaultFont;
        this.useStyle = style;
        this.start = start;
        this.end = end;
    }

    public void applySpan(HtmlSpanner spanner, SpannableStringBuilder builder) {
        if (this.useStyle.getFontFamily() != null || this.useStyle.getFontStyle() != null || this.useStyle.getFontWeight() != null) {
            FontFamilySpan originalSpan = this.getFontFamilySpan(builder, this.start, this.end);
            FontFamilySpan newSpan;
            if (this.useStyle.getFontFamily() == null && originalSpan == null) {
                newSpan = new FontFamilySpan(this.defaultFont);
            } else if (this.useStyle.getFontFamily() != null) {
                newSpan = new FontFamilySpan(this.useStyle.getFontFamily());
            } else {
                newSpan = new FontFamilySpan(originalSpan.getFontFamily());
            }
            if (this.useStyle.getFontWeight() != null) {
                newSpan.setBold(this.useStyle.getFontWeight() == FontWeight.BOLD);
            } else if (originalSpan != null) {
                newSpan.setBold(originalSpan.isBold());
            }
            if (this.useStyle.getFontStyle() != null) {
                newSpan.setItalic(this.useStyle.getFontStyle() == FontStyle.ITALIC);
            } else if (originalSpan != null) {
                newSpan.setItalic(originalSpan.isItalic());
            }
            builder.setSpan(newSpan, this.start, this.end, 33);
        }
        if (spanner.isUseColoursFromStyle() && this.useStyle.getBackgroundColor() != null && this.useStyle.getBorderStyle() == null) {
            builder.setSpan(new BackgroundColorSpan(this.useStyle.getBackgroundColor()), this.start, this.end, 33);
        }
        if (this.useStyle.getBorderStyle() != null) {
            builder.setSpan(new BorderSpan(this.useStyle, this.start, this.end, spanner.isUseColoursFromStyle()), this.start, this.end, 33);
        }
        StyleValue styleValue;
        if (this.useStyle.getFontSize() != null) {
            styleValue = this.useStyle.getFontSize();
            if (styleValue.getUnit() == Unit.PX) {
                if (styleValue.getIntValue() > 0) {
                    builder.setSpan(new AbsoluteSizeSpan(styleValue.getIntValue()), this.start, this.end, 33);
                }
            } else if (styleValue.getFloatValue() > 0.0F) {
                builder.setSpan(new RelativeSizeSpan(styleValue.getFloatValue()), this.start, this.end, 33);
            }
        }
        if (spanner.isUseColoursFromStyle() && this.useStyle.getColor() != null) {
            builder.setSpan(new ForegroundColorSpan(this.useStyle.getColor()), this.start, this.end, 33);
        }
        if (this.useStyle.getTextAlignment() != null) {
            AlignmentSpan alignSpan = null;
            switch (this.useStyle.getTextAlignment()) {
                case LEFT:
                    alignSpan = new AlignNormalSpan();
                    break;
                case CENTER:
                    alignSpan = new CenterSpan();
                    break;
                case RIGHT:
                    alignSpan = new AlignOppositeSpan();
            }
            builder.setSpan(alignSpan, this.start, this.end, 33);
        }
        if (this.useStyle.getTextIndent() != null) {
            styleValue = this.useStyle.getTextIndent();
            int marginStart;
            for (marginStart = this.start; marginStart < this.end && builder.charAt(marginStart) == '\n'; ++marginStart) {
                ;
            }
            int marginEnd = Math.min(this.end, marginStart + 1);
            Log.d("StyleCallback", "Applying LeadingMarginSpan from " + marginStart + " to " + marginEnd + " on text " + builder.subSequence(marginStart, marginEnd));
            if (styleValue.getUnit() == Unit.PX) {
                if (styleValue.getIntValue() > 0) {
                    builder.setSpan(new Standard(styleValue.getIntValue(), 0), marginStart, marginEnd, 33);
                }
            } else if (styleValue.getFloatValue() > 0.0F) {
                builder.setSpan(new Standard((int)(10.0F * styleValue.getFloatValue()), 0), marginStart, marginEnd, 33);
            }
        }
        if (this.useStyle.getMarginLeft() != null) {
            styleValue = this.useStyle.getMarginLeft();
            if (styleValue.getUnit() == Unit.PX) {
                if (styleValue.getIntValue() > 0) {
                    builder.setSpan(new Standard(styleValue.getIntValue()), this.start, this.end, 33);
                }
            } else if (styleValue.getFloatValue() > 0.0F) {
                builder.setSpan(new Standard((int)(10.0F * styleValue.getFloatValue())), this.start, this.end, 33);
            }
        }
    }

    private FontFamilySpan getFontFamilySpan(SpannableStringBuilder builder, int start, int end) {
        FontFamilySpan[] spans = (FontFamilySpan[])builder.getSpans(start, end, FontFamilySpan.class);
        return spans != null && spans.length > 0 ? spans[spans.length - 1] : null;
    }
}

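After the processing above, we have a SpannableStringBuilder that holds the chapter content together with its style attributes, and it can be displayed directly in a TextView.
With that, the overall flow of parsing and displaying an Epub file is essentially complete.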