File Type Validation with Tika

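Table of Contents

File Type Validation with Tika
- What Tika Is
- How to Validate File Types with Tika
- Problems with Tika File Type Validation
  - How the Problem Arose
  - Source Code Analysis
  - Improving the Code
- Notes on Usage
- Summary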

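What Tika Is

As we all know, checking a file's extension does not tell you what type the file really is. Most file type checks instead read the file's magic number, because for most file types the magic number is fixed (for example, a class file always starts with CA FE BA BE). That is why most solutions found online put the magic numbers of various file types into a Map, read the magic number of the file being checked, and look up the corresponding type in the Map. But is the magic number really fixed for every file of the same type? It is not: MP4 files, for example, do not start with a fixed byte sequence. So if you register the leading bytes of one MP4 file, there is no guarantee that the next MP4 file will pass the check!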
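To make the Map-based approach and its weakness concrete, here is a minimal sketch using only the JDK. The class name `MagicNumberLookup`, the three table entries, and the sample byte arrays are illustrative assumptions, not code from any real project. The MP4 samples rely on the fact that an MP4 file begins with a four-byte box size followed by `ftyp`, so the first four bytes differ from file to file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the "magic numbers in a Map" approach.
final class MagicNumberLookup {
    // Keys are the first four bytes as upper-case hex; the table is illustrative, not exhaustive.
    private static final Map<String, String> MAGIC_TO_TYPE = new LinkedHashMap<>();
    static {
        MAGIC_TO_TYPE.put("CAFEBABE", "class");
        MAGIC_TO_TYPE.put("504B0304", "zip");
        MAGIC_TO_TYPE.put("89504E47", "png");
    }

    // Renders up to `length` leading bytes as an upper-case hex string.
    static String toHex(byte[] bytes, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(length, bytes.length); i++) {
            sb.append(String.format("%02X", bytes[i]));
        }
        return sb.toString();
    }

    // Returns the mapped type, or "unknown" when the first four bytes are not in the table.
    static String guessType(byte[] content) {
        return MAGIC_TO_TYPE.getOrDefault(toHex(content, 4), "unknown");
    }

    public static void main(String[] args) {
        byte[] classFile = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        // Two made-up MP4 headers: same "ftyp" marker at offset 4, different box sizes in front.
        byte[] mp4a = {0, 0, 0, 0x18, 'f', 't', 'y', 'p'};
        byte[] mp4b = {0, 0, 0, 0x20, 'f', 't', 'y', 'p'};
        System.out.println(guessType(classFile));
        System.out.println(guessType(mp4a));
        System.out.println(guessType(mp4b));
    }
}
```

Registering the first four bytes of one of the MP4 samples in the table would still leave the other one unrecognized, which is exactly the problem described above.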

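For this reason I recommend Tika, a parsing library from Apache, for file type validation. Tika can not only detect file types but also parse file contents, so it is quite powerful. This article covers only its file type detection.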

How to Validate File Types with Tika

So how do you validate a file type with Tika? It is very simple.

Add the dependency:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.28.1</version>
</dependency>

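Here is the code:

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;

public class Test {
    public static void main(String[] args) throws IOException {
        Tika tika = new Tika();
        // Detect the media type of a local file
        String mimeType = tika.detect(new File("G:\\test.zip"));
        System.out.println(mimeType);
    }
}

Output: application/zip

It looks very simple to use, but that turned out to be too optimistic: I ran into several problems while testing it myself.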

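Problems with Tika File Type Validation

How the Problem Arose

While building a zip import/export feature, I used Tika to validate the type of uploaded files. During testing I found that running the code above against an .xlsx file also returned application/zip. What was going wrong? I opened a zip file and an xlsx file in EditPlus to compare their magic numbers and found that their first four bytes are identical! I then looked into the detect-related API (excerpt):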

public String detect(InputStream stream) throws IOException {
    return detect(stream, new Metadata());
}

/**
 * Detects the media type of the given document. The type detection is
 * based on the content of the given document stream and the name of the
 * document.
 * <p>
 * If the document stream supports the
 * {@link InputStream#markSupported() mark feature}, then the stream is
 * marked and reset to the original position before this method returns.
 * Only a limited number of bytes are read from the stream.
 * <p>
 * The given document stream is <em>not</em> closed by this method.
 *
 * @since Apache Tika 0.9
 * @param stream the document stream
 * @param name document name
 * @return detected media type
 * @throws IOException if the stream can not be read
 */
public String detect(InputStream stream, String name) throws IOException {
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, name);
    return detect(stream, metadata);
}

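The code above shows that Tika also accepts the file name. Why would it offer an API that takes the file name, unless its authors already knew about this situation? So I tried again with the second method, and this time the type was detected correctly: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. But why does this work? Besides, the returned type string was not what I had hoped for. It is so elaborate that I could not see how to compare it against the formats I expect. Was I supposed to try every file type one by one and build a mapping myself? An Apache parsing library would hardly be designed that way. With little material available online, I had to read the source code to find out how it is meant to be used.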

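Source Code Analysis

A reasonable first guess is that, to recognize so many file types, Tika must ship its own type registry. So I ran a global search for the long type string returned for the xlsx file above and found the file tika-mimetypes.xml inside the Tika package. This file is effectively its file type registry (excerpt):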

<mime-type type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet">
  <_comment>Office Open XML Workbook</_comment>
  <glob pattern="*.xlsx"/>
  <sub-class-of type="application/x-tika-ooxml"/>
</mime-type>

<mime-type type="application/x-tika-ooxml">
  <sub-class-of type="application/zip"/>
  <!-- Only works if the Content Types or rels file is the first zip entry -->
  <magic priority="50">
    <match value="PK\003\004" type="string" offset="0">
      <match value="[Content_Types].xml" type="string" offset="30"/>
      <match value="_rels/.rels" type="string" offset="30"/>
    </match>
  </magic>
</mime-type>

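From the structure of this XML file alone we can guess that types in Tika form a parent/child hierarchy. There is also a glob pattern, which convinced me that Tika must maintain a mapping between types and file extensions!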
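The magic rule in the excerpt (a `PK\003\004` signature at offset 0 and the first entry name at offset 30) can be checked against real zip bytes with the JDK alone. The sketch below is only an illustration: the class `ZipHeaderPeek` and its helper names are made up for this example, and it builds tiny in-memory archives rather than reading real xlsx files.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper that inspects the first local file header of a zip archive.
final class ZipHeaderPeek {
    // Builds a tiny in-memory zip whose first (and only) entry has the given name.
    static byte[] zipWithFirstEntry(String entryName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // True when the bytes start with the local file header signature "PK\003\004".
    static boolean hasZipMagic(byte[] data) {
        return data.length >= 4 && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    // Reads `length` bytes at `offset` as ASCII; the fixed part of the header is 30 bytes long,
    // so the first entry name starts at offset 30.
    static String asciiAt(byte[] data, int offset, int length) {
        return new String(data, offset, length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String ooxmlEntry = "[Content_Types].xml";
        byte[] ooxmlLike = zipWithFirstEntry(ooxmlEntry);
        byte[] plainZip = zipWithFirstEntry("readme.txt");
        System.out.println(hasZipMagic(ooxmlLike) + " " + hasZipMagic(plainZip));
        System.out.println(asciiAt(ooxmlLike, 30, ooxmlEntry.length()));
        System.out.println(asciiAt(plainZip, 30, "readme.txt".length()));
    }
}
```

Both archives share the same four leading bytes, and only the entry name at offset 30 tells them apart. This also matches the XML comment's warning that the rule only works when the Content Types or rels file is the first zip entry.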

Next I searched for the places where Tika loads this XML file and found that it is used inside org.apache.tika.mime.MimeTypes#getDefaultMimeTypes(java.lang.ClassLoader). Looking through the methods of the MimeTypes class, I found this one:

/**
 * Returns the registered media type with the given name (or alias).
 * The named media type is automatically registered (and returned) if
 * it doesn't already exist.
 *
 * @param name media type name (case-insensitive)
 * @return the registered media type with the given name or alias
 * @throws MimeTypeException if the given media type name is invalid
 */
public MimeType forName(String name) throws MimeTypeException {
    ... ...
}

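Judging from the Javadoc, I guessed that this is the method that performs the mapping, so I wrote some code to verify it:

import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.mime.MimeType;
import org.apache.tika.mime.MimeTypes;

public class Test {
    public static void main(String[] args) throws IOException, TikaException {
        MimeTypes defaultMimeTypes = MimeTypes.getDefaultMimeTypes();
        // Look up the registered type by its media type name
        MimeType mimeType = defaultMimeTypes.forName("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
        System.out.println(mimeType.getExtension());
    }
}

Output: .xlsx

The result confirms my guess!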

Next, let's look at how the detect method is implemented. The source is fairly long, so here is only the key part:

public MediaType detect(InputStream input, Metadata metadata) throws IOException {
    List<MimeType> possibleTypes = null;

    // Get type based on magic prefix
    if (input != null) {
        input.mark(getMinLength());
        try {
            byte[] prefix = readMagicHeader(input);
            possibleTypes = getMimeType(prefix);
        } finally {
            input.reset();
        }
    }

    // Get type based on resourceName hint (if available)
    String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
    if (resourceName != null) {
        ... ... (omitted)
        if (name != null) {
            MimeType hint = getMimeType(name);
            // For server-side scripting languages, we cannot rely on the filename to detect the mime type
            if (!(isHttp && hint.isInterpreted())) {
                // If we have some types based on mime magic, try to specialise
                //  and/or select the type based on that
                // Otherwise, use the type identified from the name
                possibleTypes = applyHint(possibleTypes, hint);
            }
        }
    }
    ... ... (omitted)
    if (possibleTypes == null || possibleTypes.isEmpty()) {
        // Report that we don't know what it is
        return MediaType.OCTET_STREAM;
    } else {
        return possibleTypes.get(0).getType();
    }
}

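Two lines in the code above show that Tika, too, determines the file type by checking the file's magic number:

byte[] prefix = readMagicHeader(input);
possibleTypes = getMimeType(prefix);

So how does Tika tell files apart when their magic numbers are identical? The code below gives the answer. As mentioned earlier, passing in the file name (with its .xxx extension) is enough to get the right result. This is the code that makes it work: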

String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
if (resourceName != null) {
    if (name != null) {
        MimeType hint = getMimeType(name);
        if (!(isHttp && hint.isInterpreted())) {
            possibleTypes = applyHint(possibleTypes, hint);
        }
    }
}

private List<MimeType> applyHint(List<MimeType> possibleTypes, MimeType hint) {
    if (possibleTypes == null || possibleTypes.isEmpty()) {
        return Collections.singletonList(hint);
    } else {
        for (int i = 0; i < possibleTypes.size(); i++) {
            final MimeType type = possibleTypes.get(i);
            if (hint.equals(type) ||
                    registry.isSpecializationOf(hint.getType(), type.getType())) {
                return Collections.singletonList(hint);
            }
        }
    }
    return possibleTypes;
}

public boolean isSpecializationOf(MediaType a, MediaType b) {
    return isInstanceOf(getSupertype(a), b);
}

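In short:

- First, check whether the type Tika detected from the content is the same as the type derived from the name you passed in.
- If they differ, check whether the type derived from the name is a subtype of the detected type (the sub-class-of element in the XML file).
- If it is a subtype, return the type derived from the name.

For .xlsx, the hierarchy in the XML file is: application/zip > application/x-tika-ooxml > application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. That is why passing a file name with the extension yields the correct type. Note that, judging from applyHint, the name is trusted whenever it is compatible with the magic number, so a plain zip passed in with an .xlsx name would also be reported as the xlsx type.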
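The three steps above can be mirrored in a small, self-contained sketch. It is not Tika's implementation: the class `HintSketch` is hypothetical, types are plain strings instead of `MimeType` objects, and the parent map contains only the two sub-class-of relations from the XML excerpt.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, string-based imitation of the hint logic; not the real Tika classes.
final class HintSketch {
    static final String XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
    static final String OOXML = "application/x-tika-ooxml";
    static final String ZIP = "application/zip";

    // Child type -> parent type, mirroring the sub-class-of elements in the XML excerpt.
    private static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put(XLSX, OOXML);
        PARENT.put(OOXML, ZIP);
    }

    // True when `ancestor` appears somewhere above `type` in the parent chain.
    static boolean isSpecializationOf(String type, String ancestor) {
        for (String p = PARENT.get(type); p != null; p = PARENT.get(p)) {
            if (p.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    // Prefer the name-based hint when it equals or refines a magic-based candidate.
    static List<String> applyHint(List<String> possibleTypes, String hint) {
        if (possibleTypes == null || possibleTypes.isEmpty()) {
            return Collections.singletonList(hint);
        }
        for (String type : possibleTypes) {
            if (hint.equals(type) || isSpecializationOf(hint, type)) {
                return Collections.singletonList(hint);
            }
        }
        return possibleTypes;
    }

    public static void main(String[] args) {
        // Magic bytes said "zip", the name said "xlsx": the hint refines the result.
        System.out.println(applyHint(Collections.singletonList(ZIP), XLSX));
        // Magic bytes said "png": an xlsx hint is unrelated and is ignored.
        System.out.println(applyHint(Collections.singletonList("image/png"), XLSX));
    }
}
```

The sketch also makes the limitation visible: any content detected as application/zip is upgraded to the xlsx type as soon as the hint says xlsx, so the name hint narrows the result but does not prove that the archive really is a workbook.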

Improving the Code

Now that we understand how Tika's file type detection works, we know how to use it properly. Here is the original code, improved:

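private static final MimeTypes DEFAULT_MIME_TYPES = MimeTypes.getDefaultMimeTypes();

/**
 * Checks whether the detected file type matches the expected one.
 *
 * @param bytes          file data
 * @param expectFileType expected file type
 * @return true if the detected type maps to the expected extension
 */
public static boolean fileTypeDetect(byte[] bytes, FileTypeEnum expectFileType) {
    String extension = "." + expectFileType.getMsg();
    try {
        Tika tika = new Tika();
        // Pass the expected extension as the name hint
        String detectedMediaType = tika.detect(bytes, extension);
        // Map the detected media type back to its registered extensions
        MimeType mimeType = DEFAULT_MIME_TYPES.forName(detectedMediaType);
        return CollectionUtils.isNotEmpty(mimeType.getExtensions())
                && mimeType.getExtensions().stream().anyMatch(ext -> ext.equals(extension));
    } catch (Exception e) {
        // do something
    }
    // Treat a detection failure as a failed check instead of letting the file through
    return false;
}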

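Notes on Usage

Be aware that the stream is read while the magic number is being fetched! As the Javadoc of detect quoted earlier states, the stream is marked and reset to its original position only if it supports the mark feature.

byte[] readMagicHeader(InputStream stream) throws IOException {
    if (stream == null) {
        throw new IllegalArgumentException("InputStream is missing");
    }

    byte[] bytes = new byte[getMinLength()];
    int totalRead = 0;

    int lastRead = stream.read(bytes);
    while (lastRead != -1) {
        totalRead += lastRead;
        if (totalRead == bytes.length) {
            return bytes;
        }
        lastRead = stream.read(bytes, totalRead, bytes.length - totalRead);
    }

    // The stream ended early: return only the bytes that were actually read
    byte[] shorter = new byte[totalRead];
    System.arraycopy(bytes, 0, shorter, 0, totalRead);
    return shorter;
}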
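To show what "marked and reset" means in practice, here is a minimal JDK-only sketch of peeking at a header without losing it. The class `PeekHeader` and its `peek` method are hypothetical names for this illustration. The sketch assumes the caller passes a stream that supports mark/reset, for example a `ByteArrayInputStream` or any stream wrapped in a `BufferedInputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper: read the leading bytes of a stream, then rewind it.
final class PeekHeader {
    // Reads up to `length` leading bytes and resets the stream to where it started.
    // Streams without mark support should be wrapped in a BufferedInputStream first.
    static byte[] peek(InputStream in, int length) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(length);
            try {
                byte[] buf = new byte[length];
                int total = 0;
                int n;
                // Keep reading until the buffer is full or the stream ends
                while (total < length && (n = in.read(buf, total, length - total)) != -1) {
                    total += n;
                }
                return Arrays.copyOf(buf, total);
            } finally {
                in.reset();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayInputStream in = new ByteArrayInputStream(new byte[]{'P', 'K', 3, 4, 10, 20, 30});
        byte[] header = peek(in, 4);
        System.out.println(header.length);
        // Because of the reset, all seven bytes are still available to the next reader
        System.out.println(in.available());
    }
}
```

If the same stream is going to be handed to another consumer after detection (for example, code that unpacks the upload), checking `markSupported()` up front or wrapping the stream once is a simple way to avoid silently losing the first bytes.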

Summary

The source code will always give you the best answer~