Search is the foundation of many applications. Once data starts to pile up, users want to be able to find what they need in it. Search underpins the internet, and it is an ever-growing challenge that is never fully solved.

The field of Natural Language Processing (NLP) is evolving rapidly. Large-scale general language models are an exciting new capability, allowing us to add powerful functionality quickly with limited compute and people. Innovation continues, with new models and advancements arriving on what seems like a weekly basis.

This article introduces txtai, an AI-powered search engine that enables Natural Language Understanding (NLU) based search in any application.


Introducing txtai

txtai builds an AI-powered index over sections of text. It supports building text indices to perform similarity searches and to create extractive question-answering systems. txtai is open source and available on GitHub.
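To make "similarity search" concrete, here is a minimal sketch of the general idea: each section of text is represented as a vector, and a query is matched to the section whose vector is closest. The three-dimensional vectors, the `index` dictionary, and the `cosine` and `search` helpers are hypothetical and hand-written for illustration. They are not txtai's API, and real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", written by hand (not model output)
index = {
    "ice shelf collapses": [0.9, 0.1, 0.0],
    "lottery winner": [0.0, 0.2, 0.9],
    "virus cases rise": [0.1, 0.9, 0.1],
}


def search(query_vector, index):
    """Return the (text, score) pair whose vector is most similar to the query."""
    scored = ((text, cosine(query_vector, vector)) for text, vector in index.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical vector standing in for a query such as "climate change"
best_text, best_score = search([0.8, 0.2, 0.1], index)
print(best_text, round(best_score, 3))
```

This sketch compares the query against every vector in turn. Libraries such as Faiss, Annoy, and Hnswlib exist to make that nearest-neighbor lookup fast on large collections.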

txtai is built on the following stack:


  • Sentence Transformers

  • Transformers

  • Faiss, Annoy, Hnswlib

  • Python 3.6+

txtai and/or the concepts behind it have already been used to power the Natural Language Processing (NLP) applications listed below:


  • cord19q — COVID-19 literature analysis

  • paperai — AI-powered literature discovery and review engine for medical/scientific papers

  • neuspo — a fact-driven, real-time sports event and news site


  • codequestion — Ask coding questions directly from the terminal

Install and run txtai

The following code snippet shows how to install txtai and create an embeddings model.


pip install txtai

Next, we can create a simple in-memory model with a couple of sample records to try txtai out.


import numpy as np

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of the section that best matches the query
    uid = np.argmax(embeddings.similarity(query, sections))

    print("%-20s %s" % (query, sections[uid]))

Running the code above will print the following:


[Image: Embeddings query output]
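For contrast, here is a sketch of what a naive token-based search would do with the same queries and sections. The `tokens` and `token_matches` helpers are hypothetical and deliberately simplistic (lowercase, whole-word overlap only). They are an assumption made for illustration and do not represent how txtai or any particular search engine tokenizes text.

```python
import re

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

queries = ("feel good story", "climate change", "health", "war", "wildlife", "asia",
           "north america", "dishonest junk")


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_matches(query, sections):
    """Return the sections that share at least one whole-word token with the query."""
    query_tokens = tokens(query)
    return [section for section in sections if query_tokens & tokens(section)]


# Count how many sections each query matches on shared words alone
matches = {query: token_matches(query, sections) for query in queries}

for query, found in matches.items():
    print("%-20s %d matching section(s)" % (query, len(found)))
```

With this whole-word tokenizer, none of the eight queries shares a word with any section, so every query comes back empty. A substring match would fare only slightly better, since "war" happens to occur inside "warns".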

The example above shows that for almost all of the queries, the query text does not appear in the list of text sections. This is the true power of transformer models over token-based search. What you get out of the box is


