Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html
By Quentin Zervaas, 27 April 2006
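(Reposter's note: when I have time I will translate this into Chinese. It is not difficult in itself, so it can be read slowly.)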
This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.
There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).
Note that Zend_Search_Lucene requires at least PHP 5 – PHP 4 will not work.
In this article we will be covering the following:
  • How to index a document or series of documents
  • The different types of fields that can be indexed
  • Searching the index
To demonstrate this functionality, we will cover the implementation of a search engine for phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
How fulltext indexing and querying works
Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the index.
Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
Alternatively, we could store just the document ID and then look up those other details from the database using the ID. However, it is quicker to store them in the index (fewer queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data.
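To make the idea concrete, here is a minimal sketch of a keyword index in plain PHP. It is only an illustration of the concept and is not how Zend_Search_Lucene stores its data internally; the function names tokenize() and buildIndex() and the sample documents are made up for this example.

```php
<?php
// Toy keyword index: maps each keyword to the IDs of the documents that
// contain it, and keeps a few small display fields next to it.

function tokenize($text)
{
    // Lower-case the text and split on anything that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

function buildIndex(array $documents)
{
    $index = array('terms' => array(), 'stored' => array());
    foreach ($documents as $id => $doc) {
        // The body is indexed keyword by keyword, but the body itself is not kept.
        foreach (tokenize($doc['body']) as $word) {
            $index['terms'][$word][] = $id;
        }
        // Title and author are stored so results need no database lookup.
        $index['stored'][$id] = array(
            'title'  => $doc['title'],
            'author' => $doc['author'],
        );
    }
    return $index;
}

// Hypothetical sample data.
$documents = array(
    1 => array('title' => 'Intro to PHP', 'author' => 'Quentin', 'body' => 'PHP is a scripting language'),
    2 => array('title' => 'MySQL basics', 'author' => 'Quentin', 'body' => 'Using MySQL from PHP'),
);

$index = buildIndex($documents);
print_r($index['terms']['php']); // the IDs of the documents containing "php"
```

The keyword list answers "which documents contain this word?", while the stored fields give us something readable to show for each match.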
Querying the data
Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
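As a rough sketch of that lookup, the following plain PHP searches a small hand-written keyword index and orders the matches. The $terms array and the search() function are hypothetical, and the "score" here is simply the number of query words a document matched, which stands in for (and is much cruder than) the scoring Zend_Search_Lucene performs.

```php
<?php
// Hypothetical pre-built keyword index: term => list of document IDs.
$terms = array(
    'php'   => array(1, 2, 3),
    'mysql' => array(2),
    'ajax'  => array(3),
);

// Look up each query word and count how many words each document matched.
function search(array $terms, $query)
{
    $scores = array();
    $words = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    foreach (array_unique($words) as $word) {
        if (!isset($terms[$word])) {
            continue; // unknown term: it points back to zero documents
        }
        foreach ($terms[$word] as $docId) {
            $scores[$docId] = isset($scores[$docId]) ? $scores[$docId] + 1 : 1;
        }
    }
    arsort($scores); // highest score first; keys are the document IDs
    return $scores;
}

$hits = search($terms, 'php mysql');
print_r($hits); // document 2 matched both words, so it is listed first
```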
Keeping the index up-to-date
If a document is ever updated in the database, the index will most likely be out-of-date. This means our general content management needs one extra step: re-indexing that document.
There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
Getting started
The first thing we must do is install the Zend Framework, if you have not already done so. It is structured in a similar way to the PEAR file layout. At this stage, the Zend Framework is only in a “preview” phase. At the time of writing, the current version was Preview 0.1.3.
You can download this from http://framework.zend.com/download. If you use Subversion, you can also check out the trunk version, which may have newer code in it.
I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
Highlight: Plain
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend
So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php
Which now becomes:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
Creating our first index
The basic process for creating an index is:
  1. Open the index
  2. Add each document
  3. Commit (save) the index
The index is stored on the filesystem, so when you open an index, you specify the path where it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.
Highlight: PHP
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath, true);
?>
You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
Adding a document to our index
Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
Highlight: PHP
<?php
    $doc = new Zend_Search_Lucene_Document();
?>
The next thing we must do is determine which fields we need to add to our index.
There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
  • Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
  • Title – we’re definitely going to include the title in our results
  • Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
  • Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
  • Created – we’ll also store a timestamp of when the article was created
This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
  • UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
  • UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
  • Text – Data that is available for search and is stored in full (title and author)
There are also Keyword and Binary field types available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
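One way to keep the field types straight is to think of each one as a set of flags. The following plain PHP sketch summarises them as I understand them from the Zend_Search_Lucene manual; the $fieldTypes array and the typesWith() helper are made up for this illustration and are not part of the framework.

```php
<?php
// Field type => what happens to the value.
//   stored:    the original value comes back with each search hit
//   indexed:   the value can be searched
//   tokenized: the value is split into individual words before indexing
$fieldTypes = array(
    'Keyword'   => array('stored' => true,  'indexed' => true,  'tokenized' => false),
    'UnIndexed' => array('stored' => true,  'indexed' => false, 'tokenized' => false),
    'Binary'    => array('stored' => true,  'indexed' => false, 'tokenized' => false),
    'Text'      => array('stored' => true,  'indexed' => true,  'tokenized' => true),
    'UnStored'  => array('stored' => false, 'indexed' => true,  'tokenized' => true),
);

// Hypothetical helper: names of the field types that have the given flag set.
function typesWith(array $fieldTypes, $flag)
{
    $names = array();
    foreach ($fieldTypes as $name => $flags) {
        if ($flags[$flag]) {
            $names[] = $name;
        }
    }
    return $names;
}

echo 'Searchable: ', implode(', ', typesWith($fieldTypes, 'indexed')), "\n";
echo 'Returned with hits: ', implode(', ', typesWith($fieldTypes, 'stored')), "\n";
```

Read this way, our choices follow naturally: the body needs to be searchable but not returned (UnStored), the teaser needs to be returned but not searched (UnIndexed), and the title and author need both (Text).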
To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved using the Zend_Search_Lucene_Field class with the field type as the static method name.
In other words, to create the title field data, we use:
Highlight: PHP
<?php
    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>
Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
So to add all the data with the field types we just worked out, we would use this:
Highlight: PHP
<?php
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>
Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.
Finally, we add the document to the index using addDocument():
Highlight: PHP
<?php
    $index->addDocument($doc);
?>
We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
Committing / saving the index
Once all documents have been added, the index must be saved.
Highlight: PHP
<?php
    $index->commit();
?>
You can call commit() after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
Indexing all the articles on phpRiot
Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.
Extending Zend_Search_Lucene_Document
On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of using this class directly, we’re going to extend it so that it encapsulates all of the field-adding code we wrote there.
In other words, we’re going to move the calls to addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
Highlight: PHP
<?php
    class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
    {
        /**
         * Constructor. Creates our indexable document and adds all
         * necessary fields to it using the passed in DatabaseObject
         * holding the article data.
         */
        public function __construct(&$document)
        {
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url',     $document->generateUrl()));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
        }
    }
?>
As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
Building the full index
Now that we have our class, we can create our index, loop over the documents, and then save our index:
Highlight: PHP
<?php
    require_once('Zend/Search/Lucene.php');
    require_once('DatabaseObject/PhpriotDocument.class.php');
    // where to save our index
    $indexPath = '/var/www/phpriot.com/data/docindex';
    // get a list of all database ids for documents
    $doc_ids = PhpriotDocument::GetDocIds($db);
    // create our index
    $index = new Zend_Search_Lucene($indexPath, true);
    foreach ($doc_ids as $doc_id) {
        // load our databaseobject
        $document = new PhpriotDocument($db);
        $document->loadRecord($doc_id);
        // create our indexed document and add it to the index
        $index->addDocument(new PhpRiotIndexedDocument($document));
    }
    // write the index to disk
    $index->commit();
?>
The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
How querying works in Zend_Search_Lucene
Now comes the most important part: actually finding stuff! There are quite a lot of options when it comes to querying your index, giving you and your users a lot of control over the returned results.
When we created the index, we added six fields, but only three of those were actually searchable: the document title, the document content, and the author.
Each of these items is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)
Likewise, to search in the title field for ‘php’, we would use title:php.
As we briefly mentioned earlier in this article, the field that is searched by default is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
Including and excluding terms
By default, all specified terms are searched on using a boolean ‘or’. This means that a document is returned if it contains any of the terms. To force results to have a particular term, the plus symbol is used. To force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
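The rules above can be sketched in a few lines of plain PHP. This is a deliberately simplified illustration and not the parser Zend_Search_Lucene uses: the functions classifyTerms() and matches() are hypothetical, and they ignore field prefixes, phrases and the author:+Quentin form.

```php
<?php
// Split a query into required (+), prohibited (-) and optional terms.
function classifyTerms($query)
{
    $result = array('required' => array(), 'prohibited' => array(), 'optional' => array());
    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        if (strlen($token) > 1 && $token[0] === '+') {
            $result['required'][] = substr($token, 1);
        } elseif (strlen($token) > 1 && $token[0] === '-') {
            $result['prohibited'][] = substr($token, 1);
        } else {
            $result['optional'][] = $token;
        }
    }
    return $result;
}

// Decide whether a document, given as its list of keywords, matches.
function matches(array $keywords, array $terms)
{
    foreach ($terms['required'] as $term) {
        if (!in_array($term, $keywords, true)) {
            return false; // a required term is missing
        }
    }
    foreach ($terms['prohibited'] as $term) {
        if (in_array($term, $keywords, true)) {
            return false; // a prohibited term is present
        }
    }
    if (count($terms['required']) === 0) {
        // Boolean 'or': at least one of the optional terms must be present.
        return count(array_intersect($terms['optional'], $keywords)) > 0;
    }
    return true;
}

$terms = classifyTerms('php -ajax');
var_dump(matches(array('php', 'mysql'), $terms)); // has php, no ajax
var_dump(matches(array('php', 'ajax'), $terms));  // rejected because of ajax
```

Note that in this sketch a query made up only of prohibited terms matches nothing, since there is no positive term to select documents in the first place.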
Searching for phrases
It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
Sample queries
Here are some queries you can pass to Zend_Search_Lucene and their meanings.
Highlight: Plain
php
    // search the index for any article with the word php
 
php -author:quentin
    // find any article with the word php not written by me
 
author:quentin
    // find all the articles by me
 
php -ajax
    // find all articles with the word php that don't have the word ajax
 
title:mysql
    // find all articles with MySQL in the title
 
title:mysql -author:quentin
    // find all articles with MySQL in the title not by me
And so on. Hopefully you get the idea.
Scoring of results
All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
The results are ordered by their score, from highest to lowest.
I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
Querying our index
On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
Now we will look at actually pulling documents from our index using those queries.
There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.
In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
Highlight: PHP
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $hits = $index->find('php +author:Quentin');
?>
This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.
We could also manually build this same query with function calls like so:
Highlight: PHP
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();
    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
    $hits = $index->find($query);
?>
The second parameter to addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.
On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
Dealing with returned results
The results found by your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
Each of the stored fields is available as a property of each hit.
So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
Highlight: PHP
<?php
    require_once('Zend/Search/Lucene.php');
    $query = 'php +author:Quentin';
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $hits = $index->find($query);
    $numHits = count($hits);
?>
<p>
    Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
</p>
<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>
Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
Creating a simple search engine
Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
Highlight: PHP
<?php
    require_once('Zend/Search/Lucene.php');
    $query = isset(
Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html

By Quentin Zervaas, 27 April 2006

This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.

There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).

It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.

In this article we will be covering the following:

  • How to index a document or series of documents
  • The different types of fields that can be indexed

  • Searching the index

To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.

How fulltext indexing and querying works

Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.

The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.

So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.

Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.

Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.

Querying the data

Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.

Keeping the index up-to-date

If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.

There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.

Getting started

The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.

You can download this from http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.

I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.

Highlight: Plain

$ cd /usr/local/src

$ wget http://framework.zend.com/download/tgz

$ tar -zxf ZendFramework-0.1.3.tar.gz

$ mv ZendFramework-0.1.3/library /usr/local/lib/zend

So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php

Which now becomes:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend

Creating our first index

The basic process for creating an index is:

  1. Open the index
  2. Add each document

  1. Commit (save) the index

The index is stored on the filesystem, so when you open an index, you specify the path where the it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = newZend_Search_Lucene($indexPath, true);

?>

You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.

Adding a document to our index

Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:

Highlight: PHP

<?php

    $doc = newZend_Search_Lucene_Document();

?>

The next thing we must do is determine which fields we need to add to our index.

There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.

As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:

    • Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
    • Title – we’re definitely going to include the title in our results
    • Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
    • Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author

  • Created – We’ll also store a timestamp of when the article was created.

This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:

  • UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
  • UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)

  • Text – Data that is available for search and is stored in full (title and author)

There is also the Keyword and Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.

To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.

In other words, to create the title field data, we use:

Highlight: PHP

<?php

    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);

?>

Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.

So to add all the data with the field types we just worked out, we would use this:

Highlight: PHP

<?php

    $doc = newZend_Search_Lucene_Document();

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));

    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));

    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));

    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));

?>

Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.

Finally, we add the document to the index using addDocument():

Highlight: PHP

<?php

    $index->addDocument($doc);

?>

We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).

Committing / saving the index

Once all documents have been added, the index must be saved.

Highlight: PHP

<?php

    $index->commit();

?>

You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in lamen’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.

If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.

Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.

Indexing all the articles on phpRiot

Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.

Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.

Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.

Extending Zend_Search_Lucene_Document

On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of using this class directly, we’re going to extend it so that the subclass takes care of adding all of the field data itself.

In other words, we’re going to move the calls to addField() into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.

Highlight: PHP

<?php
    class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
    {
        /**
         * Constructor. Creates our indexable document and adds all
         * necessary fields to it using the passed in DatabaseObject
         * holding the article data.
         */
        public function __construct($document)
        {
            // stored for display only, not searchable
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url', $document->generateUrl()));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));

            // searchable and stored
            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));

            // searchable, but not stored in full
            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
        }
    }
?>

As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).

Building the full index

Now that we have our class, we can create our index, loop over the documents, and then save our index:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');
    require_once('DatabaseObject/PhpriotDocument.class.php');

    // where to save our index
    $indexPath = '/var/www/phpriot.com/data/docindex';

    // get a list of all database ids for documents
    $doc_ids = PhpriotDocument::GetDocIds($db);

    // create our index
    $index = new Zend_Search_Lucene($indexPath, true);

    foreach ($doc_ids as $doc_id) {
        // load our databaseobject
        $document = new PhpriotDocument($db);
        $document->loadRecord($doc_id);

        // create our indexed document and add it to the index
        $index->addDocument(new PhpRiotIndexedDocument($document));
    }

    // write the index to disk
    $index->commit();
?>

The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).

How querying works in Zend_Search_Lucene

Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.

When we created the index, we added six fields, but only three of those were actually searchable: the document title, the document content, and the author.

Each of these fields is indexed separately for each document, meaning you can search on them separately. The syntax used to search a field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).

So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)

Likewise, to search in the title field for ‘php’, we would use title:php.

As we briefly mentioned earlier in this article, the field that is searched by default is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
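The field:term rule can be sketched in a few lines of plain PHP. This is only an illustration of the syntax, not how Zend_Search_Lucene’s own parser is written; the function name splitFieldTerm() is made up for this example.

```php
<?php
// Split a single query token into (field, term).
// A token without a "field:" prefix falls back to the default field.
function splitFieldTerm($token, $defaultField = 'contents')
{
    $pos = strpos($token, ':');
    if ($pos === false) {
        return array($defaultField, $token);
    }
    return array(substr($token, 0, $pos), substr($token, $pos + 1));
}

print_r(splitFieldTerm('author:Quentin')); // field "author", term "Quentin"
print_r(splitFieldTerm('google'));         // field "contents", term "google"
```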

Including and excluding terms

By default, all specified terms are searched on using a boolean ‘or’. This means that a document is returned if it contains any of the terms. To force results to have a particular term, the plus symbol is used. To force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
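Here is a small sketch of how required, prohibited and plain terms combine. It follows the usual Lucene convention, which I am assuming applies here too; the preview release of Zend_Search_Lucene may differ in the details. The function matchesQuery() is hypothetical, and each document is represented simply as a list of the terms it contains.

```php
<?php
// Decide whether a document matches, given the terms it contains.
// $required   : every one of these must be present (+term)
// $prohibited : none of these may be present (-term)
// $optional   : plain terms; when there are no required terms,
//               at least one of these must be present
function matchesQuery(array $docTerms, array $required, array $prohibited, array $optional)
{
    foreach ($prohibited as $term) {
        if (in_array($term, $docTerms, true)) {
            return false;
        }
    }
    foreach ($required as $term) {
        if (!in_array($term, $docTerms, true)) {
            return false;
        }
    }
    if (count($required) > 0) {
        return true; // plain terms only influence the ranking in this case
    }
    foreach ($optional as $term) {
        if (in_array($term, $docTerms, true)) {
            return true;
        }
    }
    return false;
}

// "php -ajax" against a document containing php and mysql
var_dump(matchesQuery(array('php', 'mysql'), array(), array('ajax'), array('php')));
```

Under this convention, a query that mixes a plain term with a required one (for example php +author:Quentin) matches every document that has the required term, and the plain term only helps the ranking. Writing +php +author:Quentin makes both terms mandatory.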

Searching for phrases

It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
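To show what makes a phrase search different from a plain term search, here is a rough sketch in plain PHP: the words must appear next to each other and in the same order. This is an illustration of the idea only, not the library’s implementation, and both function names are invented for the example.

```php
<?php
// Lower-case a text and split it into word tokens.
function tokenize($text)
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// True if the words of $phrase occur next to each other, in order, in $text.
function containsPhrase($text, $phrase)
{
    $tokens = tokenize($text);
    $words  = tokenize($phrase);
    $n      = count($words);
    if ($n === 0) {
        return false;
    }
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        if (array_slice($tokens, $i, $n) === $words) {
            return true;
        }
    }
    return false;
}

var_dump(containsPhrase('A list of PHP articles and tutorials', 'PHP Articles'));
var_dump(containsPhrase('Articles about PHP', 'PHP Articles'));
```

The second call contains both words but not as a phrase, which is exactly the case a simple term search cannot tell apart.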

Sample queries

Here are some queries you can pass to Zend_Search_Lucene and their meanings.

Highlight: Plain

php
    // search the index for any article with the word php

php -author:quentin
    // find any article with the word php not written by me

author:quentin
    // find all the articles by me

php -ajax
    // find all articles with the word php that don't have the word ajax

title:mysql
    // find all articles with MySQL in the title

title:mysql -author:quentin
    // find all articles with MySQL in the title not by me

And so on. Hopefully you get the idea.

Scoring of results

All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.

The results are ordered by their score, from highest to lowest.

I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.

You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
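For a feel of what a relevance score can be built from, here is a deliberately simplified sketch based on term frequency and inverse document frequency (TF-IDF), the general idea behind scoring in Lucene-style engines. It is not the formula Zend_Search_Lucene uses, and the function name scoreDocuments() is hypothetical.

```php
<?php
// Score each document for one search term with a basic TF-IDF weighting.
// $docs maps a document id to its list of (already lower-cased) tokens.
function scoreDocuments(array $docs, $term)
{
    $containing = 0;
    foreach ($docs as $tokens) {
        if (in_array($term, $tokens, true)) {
            $containing++;
        }
    }
    if ($containing === 0) {
        return array(); // no document contains the term
    }

    // rarer terms get a larger weight
    $idf = log(count($docs) / $containing) + 1;

    $scores = array();
    foreach ($docs as $id => $tokens) {
        $tf = count(array_keys($tokens, $term, true));
        if ($tf > 0) {
            // divide by length so long documents are not favoured automatically
            $scores[$id] = ($tf / count($tokens)) * $idf;
        }
    }
    arsort($scores); // highest score first, keys preserved
    return $scores;
}

$docs = array(
    'a' => array('php', 'php', 'mysql'),
    'b' => array('php', 'ajax', 'css', 'html'),
    'c' => array('mysql'),
);
print_r(scoreDocuments($docs, 'php'));
```

Document a comes out ahead of b because ‘php’ makes up a larger share of its words, and c is left out because it does not contain the term at all.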

Querying our index

On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.

Now we will look at actually pulling documents from our index using those queries.

There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.

In either case, you use the find() method on the index. The find() method returns a list of matches from your index.

Highlight: PHP

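<?php
    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    // open the existing index (no second parameter)
    $index = new Zend_Search_Lucene($indexPath);

    // let Zend_Search_Lucene parse the raw query string
    $hits = $index->find('php +author:Quentin');
?>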

This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.

We could also manually build this same query with function calls like so:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();

    // 'php' in the default field, neither required nor prohibited
    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);

    // 'Quentin' in the author field, required
    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);

    $hits = $index->find($query);
?>

The second parameter for addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.

The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.
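The mapping between the query syntax and these parameters can be sketched as follows. The hypothetical function termSpecs() turns a simple query string into a list of term text, field and sign flag, which are the same three values passed to Zend_Search_Lucene_Index_Term and addTerm() in the example above. It does not create any Zend objects, so it runs on its own, and it only handles plain space-separated terms (no phrases or grouping).

```php
<?php
// Turn a simple query such as "php +author:Quentin -ajax" into a list of
// term specifications: text, field, and the sign flag described above
// (true = required, false = prohibited, null = neither).
function termSpecs($query, $defaultField = 'contents')
{
    $specs = array();
    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        $field = $defaultField;
        $sign  = null;

        // the sign may come before the field name ...
        if ($token[0] === '+' || $token[0] === '-') {
            $sign  = ($token[0] === '+');
            $token = (string) substr($token, 1);
        }
        $pos = strpos($token, ':');
        if ($pos !== false) {
            $field = substr($token, 0, $pos);
            $token = (string) substr($token, $pos + 1);
        }
        // ... or directly before the term
        if ($token !== '' && ($token[0] === '+' || $token[0] === '-')) {
            $sign  = ($token[0] === '+');
            $token = (string) substr($token, 1);
        }
        $specs[] = array('text' => $token, 'field' => $field, 'sign' => $sign);
    }
    return $specs;
}

print_r(termSpecs('php +author:Quentin'));
```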

On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.

Dealing with returned results

The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.

Each of the stored fields is available as a property of each hit (the contents field was added as UnStored, so it cannot be read back from a hit).

So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $query = 'php +author:Quentin';

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    $hits = $index->find($query);
    $numHits = count($hits);
?>

<p>
    Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
</p>

<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>

Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
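Because the hits come back as a plain array, splitting them into pages takes very little code. The following is a minimal sketch, assuming the result of find() is an ordinary PHP array as described above. Both function names are hypothetical, and a range of numbers stands in for real hits.

```php
<?php
// Return the slice of $hits that belongs on page $page (1-based).
function pageOfHits(array $hits, $page, $perPage = 10)
{
    $page   = max(1, (int) $page);
    $offset = ($page - 1) * $perPage;
    return array_slice($hits, $offset, $perPage);
}

// Number of pages needed to show all hits.
function pageCount(array $hits, $perPage = 10)
{
    return (int) ceil(count($hits) / $perPage);
}

$hits = range(1, 25); // stand-in for 25 search hits
echo count(pageOfHits($hits, 3)) . " hit(s) on page 3 of " . pageCount($hits) . "\n";
```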

Creating a simple search engine

Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    // read the submitted query from the form
    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    $hits = array();
    $numHits = 0;

    if (strlen($query) > 0) {
        $hits = $index->find($query);
        $numHits = count($hits);
    }
?>

<form method="get" action="search.php">
    <input type="text" name="query" value="<?=htmlspecialchars($query)?>" />
    <input type="submit" value="Search" />
</form>

<?php if (strlen($query) > 0) { ?>
    <p>
        Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
    </p>

    <?php foreach ($hits as $hit) { ?>
        <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
        <p>
            By <?=$hit->author?>
        </p>
        <p>
            <?=$hit->teaser?><br />
            <a href="<?=$hit->url?>">Read more...</a>
        </p>
    <?php } ?>
<?php } ?>

Error handling

The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.

Highlight: PHP

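<?php
    require_once('Zend/Search/Lucene.php');

    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    $hits = array();

    if (strlen($query) > 0) {
        try {
            $hits = $index->find($query);
        }
        catch (Zend_Search_Lucene_Exception $ex) {
            // the query could not be run, so treat it as having no results
            $hits = array();
        }
    }

    $numHits = count($hits);

    // the form and the results output are the same as in search.php above
?>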

This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.

Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
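If you do want to show the message, one way to keep the page code tidy is a small wrapper that returns both the hits and any error text. This is a sketch under a few assumptions: safeFind() is a hypothetical helper, it catches the base Exception class (which also covers Zend_Search_Lucene_Exception, since every exception in PHP 5 extends Exception), and StubIndex with its error message is an invented stand-in so the example runs without the framework.

```php
<?php
// Run a search and never let an exception escape.
// Returns array($hits, $error); $error is null when the search succeeded.
function safeFind($index, $query)
{
    try {
        return array($index->find($query), null);
    } catch (Exception $ex) {
        // on failure, report zero hits together with the error message
        return array(array(), $ex->getMessage());
    }
}

// Stand-in index: rejects queries ending in ":" with a made-up message.
class StubIndex
{
    public function find($query)
    {
        if (substr($query, -1) === ':') {
            throw new Exception('Missing term after field name');
        }
        return array('hit for ' . $query);
    }
}

list($hits, $error) = safeFind(new StubIndex(), 'title:');
echo ($error === null) ? count($hits) . " result(s)\n" : "Search error: $error\n";
```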

Keeping the index up-to-date

The other thing we haven’t yet dealt with is what happens when any of our documents are updated. There are several ways to handle this:

  • Update just the entry for the updated document straight away
  • Rebuild the entire index straight away when a document is updated
  • Rebuild the entire index at a certain time each day (or several times per day)

The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.

To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).

There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.

So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
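If you rebuild on a schedule rather than on every update, it is worth skipping the rebuild when nothing has changed. Here is a minimal sketch of that check. The function name indexIsStale() is hypothetical, and it assumes you record the time of the last rebuild yourself and can fetch each document’s last-modified time from your database as a Unix timestamp.

```php
<?php
// True if any document was modified after the index was last built.
// All values are Unix timestamps.
function indexIsStale($indexBuiltAt, array $documentUpdatedAt)
{
    foreach ($documentUpdatedAt as $updatedAt) {
        if ($updatedAt > $indexBuiltAt) {
            return true;
        }
    }
    return false;
}

$builtAt = 1000;
var_dump(indexIsStale($builtAt, array(900, 950)));  // nothing changed since the build
var_dump(indexIsStale($builtAt, array(900, 1200))); // one document is newer
```

A scheduled script could run this check first and only run the full indexing script from earlier when it returns true.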

Extending Zend_Search_Lucene

There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:

  • A custom tokenizer for determining keywords in a document
  • Custom scoring algorithms to determine how well a document matches a search query
  • A custom storage method, so your index is stored however and wherever you please

A custom tokenizer

There are many reasons why a custom tokenizer can be useful. Here are some ideas:

  • PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
  • Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
  • HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML keywords but only the actual content. You could make further improvements on this also, such as finding all headings and giving them higher preference than the rest of the content.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
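As a rough illustration of the HTML tokenizer idea, the following plain PHP function reduces an HTML fragment to its words. It is only a sketch of the text-extraction step and does not implement the Zend_Search_Lucene analyzer interface (see the manual link above for that); the function name htmlToTokens() is made up.

```php
<?php
// Reduce an HTML fragment to lower-case word tokens, ignoring the markup.
function htmlToTokens($html)
{
    // drop script and style blocks entirely, including their contents
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // put a space before each tag so adjacent elements do not run together
    $html = str_replace('<', ' <', $html);
    // remove the tags, then turn entities such as &amp; back into characters
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

$html = '<h1>PHP &amp; MySQL</h1><p>Fulltext <b>search</b></p><script>var x = 1;</script>';
print_r(htmlToTokens($html));
```

Note that this simple version only keeps unaccented letters and digits, so it would need more work for non-English content.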

Custom scoring algorithms

Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.

More information can be found on this at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
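The idea of favouring one field over another can be sketched without touching the library at all. Below, weightedScore() is a hypothetical function that combines the number of matches found in each field using a weight per field. It is not how Zend_Search_Lucene computes its score, only an illustration of the principle.

```php
<?php
// Combine per-field match counts into one score using per-field weights.
// Fields without an explicit weight count once.
function weightedScore(array $matchesPerField, array $weights)
{
    $score = 0.0;
    foreach ($matchesPerField as $field => $matches) {
        $weight = isset($weights[$field]) ? $weights[$field] : 1.0;
        $score += $weight * $matches;
    }
    return $score;
}

$weights = array('title' => 3.0); // a title match counts three times as much

echo weightedScore(array('title' => 1, 'contents' => 2), $weights) . "\n";
echo weightedScore(array('contents' => 4), $weights) . "\n";
```

With these weights, an article with one title match and two body matches ranks above an article with four body matches.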

Custom storage method

You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.

It may or may not be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.

Conclusion

In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.

We also looked briefly at some ways of extending the search capabilities.

Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.


GET['query']) ?
Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html

By Quentin Zervaas, 27 April 2006

This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.

There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).

It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.

In this article we will be covering the following:

  • How to index a document or series of documents
  • The different types of fields that can be indexed

  • Searching the index

To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.

How fulltext indexing and querying works

Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.

The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.

So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.

Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.

Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.

Querying the data

Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.

Keeping the index up-to-date

If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.

There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.

Getting started

The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.

You can download this from http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.

I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.

Highlight: Plain

$ cd /usr/local/src

$ wget http://framework.zend.com/download/tgz

$ tar -zxf ZendFramework-0.1.3.tar.gz

$ mv ZendFramework-0.1.3/library /usr/local/lib/zend

So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php

Which now becomes:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend

Creating our first index

The basic process for creating an index is:

  1. Open the index
  2. Add each document

  1. Commit (save) the index

The index is stored on the filesystem, so when you open an index, you specify the path where the it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = newZend_Search_Lucene($indexPath, true);

?>

You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.

Adding a document to our index

Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:

Highlight: PHP

<?php

    $doc = newZend_Search_Lucene_Document();

?>

The next thing we must do is determine which fields we need to add to our index.

There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.

As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:

    • Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
    • Title – we’re definitely going to include the title in our results
    • Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
    • Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author

  • Created – We’ll also store a timestamp of when the article was created.

This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:

  • UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
  • UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)

  • Text – Data that is available for search and is stored in full (title and author)

There is also the Keyword and Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.

To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.

In other words, to create the title field data, we use:

Highlight: PHP

<?php

    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);

?>

Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.

So to add all the data with the field types we just worked out, we would use this:

Highlight: PHP

<?php

    $doc = newZend_Search_Lucene_Document();

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));

    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));

    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));

    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));

?>

Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.

Finally, we add the document to the index using addDocument():

Highlight: PHP

<?php

    $index->addDocument($doc);

?>

We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).

Committing / saving the index

Once all documents have been added, the index must be saved.

Highlight: PHP

<?php

    $index->commit();

?>

You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in lamen’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.

If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.

Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.

Indexing all the articles on phpRiot

Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.

Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.

Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.

Extending Zend_Search_Lucene_Document

On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of calling this class directly, we’re going to extend this class to encapsulate all of the adding of data we also did.

In other words, we’re going to move the calls to addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.

Highlight: PHP

<?php

    classPhpRiotIndexedDocumentextendsZend_Search_Lucene_Document

    {

        /**

         * Constructor. Creates our indexable document and adds all

         * necessary fields to it using the passed in DatabaseObject

         * holding the article data.

         */

        publicfunction__construct(&$document)

        {

            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url',     $document->generateUrl()));

            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));

            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));

            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));

            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));

            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));

        }

    }

?>

As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).

Building the full index

Now that we have our class, we can create our index, loop over the documents, and then save our index:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    require_once('DatabaseObject/PhpriotDocument.class.php');

    // where to save our index

    $indexPath = '/var/www/phpriot.com/data/docindex';

    // get a list of all database ids for documents

    $doc_ids = PhpriotDocument::GetDocIds($db);

    // create our index

    $index = newZend_Search_Lucene($indexPath, true);

    foreach($doc_idsas$doc_id){

        // load our databaseobject

        $document = newPhpriotDocument($db);

        $document->loadRecord($doc_id);

        // create our indexed document and add it to the index

        $index->addDocument(newPhpRiotIndexedDocument($document));

    }

    // write the index to disk

    $index->commit();

?>

The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).

How querying works in Zend_Search_Lucene

Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.

When we created the indexed, we added six fields, but only three of those were actually searchable: document title, document content, and the author.

Each of these items are stored separately for each indexed documents, meaning you can search on them separately. The syntax used to search each section is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).

So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility)

Likewise, to search in the title field for ‘php’, we would use title:php.

As we briefly mentioned earlier in this article, the default section that is searched in is the field called contents. So if you wanted to search the document body for the world ‘google’, you could use contents:google or just google.

Including and excluding terms

By default, all specified terms are searched on using a boolean ‘or’. This means that any of the terms can exist for a document to be returned. To force results to have a particular term, the plus symbol is used. This force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or the term name. In other words, +author:Quentin and author:+Quentin are identical.

Searching for phrases

It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation, however, there is alot of information on this on the Zend_Search_Lucene manual section on query types.

Sample queries

Here are some queries you can pass to Zend_Search_Lucene and their meanings.

Highlight: Plain

php
    // search the index for any article with the word php

php -author:quentin
    // find any article with the word php not written by me

author:quentin
    // find all the articles by me

php -ajax
    // find all articles with the word php that don't have the word ajax

title:mysql
    // find all articles with MySQL in the title

title:mysql -author:quentin
    // find all articles with MySQL in the title not by me

And so on. Hopefully you get the idea.

Scoring of results

All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.

The results are ordered by their score, from highest to lowest.

I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.

You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
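For a rough feel of what such a score can be built from, the sketch below computes a simplified TF-IDF style value in plain PHP. A term counts for more when it appears often in a document, when it is rare across the whole index, and when the document is short. This is an assumption-based illustration loosely modelled on classic Lucene scoring, not the exact formula Zend_Search_Lucene uses, and the function sketchScore() is hypothetical.

```php
<?php
// Simplified TF-IDF style score; a sketch under assumptions, not the library's formula.

/**
 * $term:     the search word (lower case)
 * $docWords: all words of the document being scored
 * $allDocs:  every document in the "index", each one an array of words
 */
function sketchScore($term, $docWords, $allDocs)
{
    // term frequency: how often the word occurs in this document
    $counts = array_count_values($docWords);
    $freq   = isset($counts[$term]) ? $counts[$term] : 0;
    if ($freq === 0) {
        return 0.0;
    }

    // document frequency: how many documents contain the word at all
    $docFreq = 0;
    foreach ($allDocs as $words) {
        if (in_array($term, $words, true)) {
            $docFreq++;
        }
    }

    $tf   = sqrt($freq);                                  // dampen repeated words
    $idf  = log(count($allDocs) / ($docFreq + 1)) + 1.0;  // rare words count for more
    $norm = 1.0 / sqrt(count($docWords));                 // short documents count for more

    return $tf * $idf * $norm;
}
```

With this sketch, a four-word document that mentions ‘php’ three times scores higher than a four-word document that mentions it once, and a document without the word scores zero.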

Querying our index

In the previous section we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.

Now we will look at actually pulling documents from our index using those queries.

There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse it (ideal when you’re writing a search engine where you’re not sure what the user will enter), or manually building up the query with API function calls.

In either case, you use the find() method on the index. The find() method returns a list of matches from your index.

Highlight: PHP

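<?php
    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    // open the existing index (no second parameter, so nothing is recreated)
    $index = new Zend_Search_Lucene($indexPath);

    // let Zend_Search_Lucene parse the raw query string
    $hits = $index->find('php +author:Quentin');
?>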

This sample code searches our index for articles written by me. The term ‘php’ carries no plus sign, so it is optional here and only influences the score. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are querying the index, not creating it.

We could also manually build this same query with function calls like so:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath);

    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();

    // 'php' in the default field (contents), neither required nor prohibited
    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);

    // 'Quentin' in the author field, required
    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);

    $hits = $index->find($query);
?>

The second parameter of addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.

The second parameter of Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.

On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
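To show how a raw query maps onto those three sign values, here is a toy parser in plain PHP. It is a hypothetical helper (parseSimpleQuery() is not part of the Zend Framework) that only understands space-separated terms, optional field: prefixes and +/- signs. The real parser handles much more.

```php
<?php
// Toy parser for "field:term" clauses with optional +/- prefixes.
// A hypothetical helper for illustration, not the framework's query parser.

function parseSimpleQuery($queryString, $defaultField = 'contents')
{
    $clauses = array();
    $tokens  = preg_split('/\s+/', trim($queryString), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($tokens as $token) {
        $sign = null;                       // optional unless prefixed

        // the prefix may come before the field name (+author:Quentin) ...
        if ($token[0] === '+' || $token[0] === '-') {
            $sign  = ($token[0] === '+');
            $token = (string) substr($token, 1);
        }

        $field = $defaultField;
        $pos   = strpos($token, ':');
        if ($pos !== false) {
            $field = substr($token, 0, $pos);
            $token = (string) substr($token, $pos + 1);

            // ... or before the term (author:+Quentin)
            if ($token !== '' && ($token[0] === '+' || $token[0] === '-')) {
                $sign  = ($token[0] === '+');
                $token = (string) substr($token, 1);
            }
        }

        if ($token !== '') {
            $clauses[] = array('field' => $field, 'text' => $token, 'sign' => $sign);
        }
    }

    return $clauses;
}
```

Each returned clause could then be passed to addTerm() as a Zend_Search_Lucene_Index_Term built from its text and field, with its sign as the second argument.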

Dealing with returned results

The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.

Each of the stored fields is available as a property of the hit object.

So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $query = 'php +author:Quentin';

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath);

    $hits = $index->find($query);

    // find() returns an array, so count() gives the number of hits
    $numHits = count($hits);
?>

<p>
    Found <?=$numHits?> result(s) for query <?=$query?>.
</p>

<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>

Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
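The listing above prints each field exactly as it was stored. If the stored values are plain text rather than HTML, it is safer to escape them before output. A minimal sketch, assuming each hit exposes the title, author, teaser, url and score properties used above (renderHit() and escapeHtml() are hypothetical helpers, not part of the library):

```php
<?php
// Hypothetical helpers: render one hit as HTML with every text field escaped.

function escapeHtml($value)
{
    return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
}

/**
 * $hit: any object with title, author, teaser, url and score properties.
 */
function renderHit($hit)
{
    $html  = '<h3>' . escapeHtml($hit->title)
           . ' (score: ' . sprintf('%.2f', $hit->score) . ')</h3>' . "\n";
    $html .= '<p>By ' . escapeHtml($hit->author) . '</p>' . "\n";
    $html .= '<p>' . escapeHtml($hit->teaser) . '<br />'
           . '<a href="' . escapeHtml($hit->url) . '">Read more...</a></p>' . "\n";

    return $html;
}
```

Inside the loop you would then write echo renderHit($hit); instead of printing each property directly.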

Creating a simple search engine

Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:

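Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    // the search term submitted by the form (empty on the first page load)
    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);

    $hits = array();

    if ($query !== '') {
        $indexPath = '/var/www/phpriot.com/data/docindex';
        $index = new Zend_Search_Lucene($indexPath);
        $hits = $index->find($query);
    }

    $numHits = count($hits);
?>

<form method="get" action="search.php">
    <input type="text" name="query" value="<?=htmlspecialchars($query)?>" />
    <input type="submit" value="Search" />
</form>

<?php if ($query !== '') { ?>
    <p>
        Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
    </p>

    <?php foreach ($hits as $hit) { ?>
        <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
        <p>
            By <?=$hit->author?>
        </p>
        <p>
            <?=$hit->teaser?><br />
            <a href="<?=$hit->url?>">Read more...</a>
        </p>
    <?php } ?>
<?php } ?>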

Error handling

One thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.

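Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    try {
        $hits = $index->find($query);
    }
    catch (Zend_Search_Lucene_Exception $ex) {
        // the query could not be parsed or run, so treat it as no results
        $hits = array();
    }

    $numHits = count($hits);
?>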

This means that if an error occurs in the search, we simply assume zero hits were returned, handling the error without indicating to the user that anything went wrong.

Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
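As a sketch of that second option, the hypothetical wrapper below returns both the hits and any error message, so the page can decide what to show. It accepts any object with a find() method and catches the base Exception class, which is broader than Zend_Search_Lucene_Exception. That choice is an assumption made so the example does not depend on the framework being loaded.

```php
<?php
// Hypothetical wrapper: runs a search and reports failures instead of hiding them.

/**
 * $index: any object with a find($query) method, such as an opened index.
 * Returns array('hits' => array of hits, 'error' => message or null).
 */
function searchWithErrors($index, $query)
{
    $result = array('hits' => array(), 'error' => null);

    try {
        $result['hits'] = $index->find($query);
    } catch (Exception $ex) {
        // deliberately broad; narrow this to the library's own
        // exception class if you only want to handle search errors
        $result['error'] = $ex->getMessage();
    }

    return $result;
}
```

The calling page can then print the message when $result['error'] is not null, and fall back to the usual results list otherwise.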

Keeping the index up-to-date

The other thing we haven’t yet dealt with is what happens when any of our documents are updated. There are several ways to handle this:

  • Update just the entry for the updated document straight away
  • Rebuild the entire index straight away when a document is updated
  • Rebuild the entire index at a certain time each day (or several times per day)

The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.

To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).

There isn’t much documentation on this library yet, and nothing about this specifically. If anybody knows how, please drop me an email or submit a comment on this article.

So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
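For reference, one approach that may work on releases newer than the preview used here is to delete the old entry and then add the new one. The sketch below assumes the index object provides a delete() method taking a hit's internal id, and that each hit exposes an id property. Neither is confirmed for Preview 0.1.3, so treat this as an untested outline. It also assumes you can write a lookup query that matches only the entry being replaced, for example on a hypothetical searchable field holding the article's database ID. The function name replaceDocument() is made up for this example.

```php
<?php
// Sketch only. Assumes an index object offering find(), delete($id),
// addDocument() and commit(), and hits that expose an id property.

/**
 * $index:       the opened index
 * $lookupQuery: a query matching only the old entry for this article
 * $newDoc:      the freshly built document to store in its place
 * Returns the number of old entries that were removed.
 */
function replaceDocument($index, $lookupQuery, $newDoc)
{
    // remove every existing entry for this article
    $removed = 0;
    foreach ($index->find($lookupQuery) as $hit) {
        $index->delete($hit->id);
        $removed++;
    }

    // add the fresh version and write the changes
    $index->addDocument($newDoc);
    $index->commit();

    return $removed;
}
```

If the lookup query matches nothing, the function simply adds the document, so the same call could be used for new and updated articles alike.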

Extending Zend_Search_Lucene

There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:

  • A custom tokenizer for determining keywords in a document
  • Custom scoring algorithms to determine how well a document matches a search query
  • A custom storage method, so your index is stored however and wherever you please

A custom tokenizer

There are many reasons why a custom tokenizer can be useful. Here are some ideas:

  • PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
  • Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
  • HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML tags but only the actual content. You could make further improvements on this also, such as finding all headings and giving them more weight than the rest of the content.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
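As a starting point for the HTML case, the sketch below shows only the text-extraction step in plain PHP: it removes markup and returns the remaining words. It is not a Zend_Search_Lucene analyzer class, htmlToWords() is a hypothetical helper, and it only keeps unaccented ASCII letters and digits.

```php
<?php
// Hypothetical pre-processing step, not a Zend_Search_Lucene analyzer class:
// turns an HTML fragment into the lower-case words worth indexing.

function htmlToWords($html)
{
    // drop script and style blocks entirely, including their contents
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // put a space before every tag so that words in adjacent elements
    // do not run together, then remove the tags themselves
    $text = strip_tags(str_replace('<', ' <', $html));

    // decode entities such as &amp; back into characters
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // split on anything that is not an ASCII letter or digit
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}
```

The resulting word list could be joined back into a string and indexed as the contents field, so that tag names and attributes never end up as keywords.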

Custom scoring algorithms

Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.

Custom storage method

You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.

It may or may not be possible to use this to store all indexed data in a database, but I haven’t tried it so I’m not sure.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.

Conclusion

In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.

We also looked briefly at some ways of extending the search capabilities.

Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.


When we created the indexed, we added six fields, but only three of those were actually searchable: document title, document content, and the author.

Each of these items are stored separately for each indexed documents, meaning you can search on them separately. The syntax used to search each section is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).

So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility)

Likewise, to search in the title field for ‘php’, we would use title:php.

As we briefly mentioned earlier in this article, the default section that is searched in is the field called contents. So if you wanted to search the document body for the world ‘google’, you could use contents:google or just google.

Including and excluding terms

By default, all specified terms are searched on using a boolean ‘or’. This means that any of the terms can exist for a document to be returned. To force results to have a particular term, the plus symbol is used. This force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or the term name. In other words, +author:Quentin and author:+Quentin are identical.

Searching for phrases

It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation, however, there is alot of information on this on the Zend_Search_Lucene manual section on query types.

Sample queries

Here are some queries you can pass to Zend_Search_Lucene and their meanings.

Highlight: Plain

php

    // search the index for any article with the word php

 

php -author:quentin

    // find any article with the word php not written by me

 

author:quentin

    // find all the articles by me

 

php -ajax

    // find all articles with the word php that don't have the word ajax

 

title:mysql

    // find all articles with MySQL in the title

 

title:mysql -author:quentin

    // find all articles with MySQL in the title not by me

And so on. Hopefully you get the idea.

Scoring of results

All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.

The results are ordered by their score, from highest to lowest.

I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.

You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.

Querying our index

On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.

Now we will look at actually pulling documents from our index using that term.

There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.

In either case, you use the find() method on the index. The find() method returns a list of matches from your index.

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = newZend_Search_Lucene($indexPath);

    $hits = $index->query('php +author:Quentin');

?>

This sample code searches our index by also articles containing ‘php’, written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.

We could also manually build this same query with function calls like so:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = newZend_Search_Lucene($indexPath);

    $query = newZend_Search_Lucene_Search_Query_MultiTerm();

    $query->addTerm(newZend_Search_Lucene_Index_Term('php'), null);

    $query->addTerm(newZend_Search_Lucene_Index_Term('Quentin', 'author'), true);

    $hits = $index->query($query);

?>

The second parameter for addTerm used determines whether or not a field is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), null means it isn’t required or prohibited.

The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search index. By default this is contents.

On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.

Dealing with returned results

The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.

Each of the indexed fields are available as a class property.

So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $query = 'php +author:Quentin';

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = newZend_Search_Lucene($indexPath);

    $hits = $index->query($query);

    $numHits = count($hits);

?>

<p>

    Found <?=$hits?> result(s) for query <?=$query?>.

</p>

<?phpforeach($hitsas$hit){?>

    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>

    <p>

        By <?=$hit->author?>

    </p>

    <p>

        <?=$hit->teaser?><br />

        <a href="<?=$hit->url?>">Read more...</a>

    </p>

<?php}?>

Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.

Creating a simple search engine

Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $query = isset(

___FCKpd___126

___FCKpd___127

___FCKpd___128

___FCKpd___129

___FCKpd___130

___FCKpd___131

___FCKpd___132

___FCKpd___133

___FCKpd___134

___FCKpd___135

___FCKpd___136

___FCKpd___137

___FCKpd___138

___FCKpd___139

___FCKpd___140

___FCKpd___141

___FCKpd___142

___FCKpd___143

___FCKpd___144

___FCKpd___145

___FCKpd___146

___FCKpd___147

___FCKpd___148

___FCKpd___149

___FCKpd___150

___FCKpd___151

___FCKpd___152

Error handling

The one thing we haven’t dealt with yet are errors in the search. For instance, if we were to type in ‘title:’ with no query behind it then an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.

Highlight: PHP

___FCKpd___153

___FCKpd___154

___FCKpd___155

___FCKpd___156

___FCKpd___157

___FCKpd___158

___FCKpd___159

___FCKpd___160

___FCKpd___161

___FCKpd___162

___FCKpd___163

___FCKpd___164

___FCKpd___165

___FCKpd___166

This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.

Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).

Keeping the index up-to-date

The other thing we haven’t yet dealt with is if any of our documents are updated. There are several ways to handle this:

  • Update just the entry for the updated document straight away
  • Rebuild the entire index straight away when a document is updated
  • Rebuild the entire index at a certain time each day (or several times per day)

The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.

To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).

There isn’t much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment on this article.

So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
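Because a full rebuild can take a while, one way to keep searches from hitting a half-written index is to build into a temporary directory and swap it into place at the end. The sketch below only illustrates that idea. rebuildIndex() and removeDir() are hypothetical helpers, and the $build callback stands in for the indexing loop shown earlier (open the index at the given path with the second parameter set to true, add each document, commit). It also assumes both directories sit on the same filesystem so that rename() works.

```php
<?php
// Sketch only: rebuildIndex() and removeDir() are hypothetical helpers.

/** Recursively deletes a directory (used to discard an old index). */
function removeDir($dir)
{
    if (!is_dir($dir)) {
        return;
    }
    foreach (scandir($dir) as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        $path = $dir . '/' . $entry;
        if (is_dir($path)) {
            removeDir($path);
        } else {
            unlink($path);
        }
    }
    rmdir($dir);
}

/**
 * Builds a fresh index next to the live one, then swaps it into place.
 * $build receives the temporary path and must create the index there.
 */
function rebuildIndex($livePath, $build)
{
    $tmpPath = $livePath . '.new';
    $oldPath = $livePath . '.old';

    removeDir($tmpPath);                 // leftovers from a failed run
    call_user_func($build, $tmpPath);    // write the new index

    removeDir($oldPath);
    if (is_dir($livePath)) {
        rename($livePath, $oldPath);     // move the current index aside
    }
    rename($tmpPath, $livePath);         // put the new index in place
    removeDir($oldPath);
}
```

There is still a brief moment between the two rename() calls when no index exists at the live path, so the search script should be prepared for opening the index to fail.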

Extending Zend_Search_Lucene

There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:

  • A custom tokenizer for determining keywords in a document
  • Custom scoring algorithms to determine how well a document matches a search query
  • A custom storage method, so your index is stored however and wherever you please

A custom tokenizer

There are many reasons why a custom tokenizer can be useful. Here are some ideas:

  • PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
  • Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
  • HTML tokenizer – a tokenizer that can read HTML, so that it indexes only the actual content and not the HTML markup itself. You could improve on this further, for example by finding all headings and giving them more weight than the rest of the content.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
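The HTML idea can be illustrated without touching the analyzer API at all, by reducing the markup to plain words before the text is indexed. The function below is only a sketch of that preprocessing step. htmlToWords() is a hypothetical name, and this is not the Zend_Search_Lucene analysis interface, which is documented at the link above.

```php
<?php
// Sketch only: htmlToWords() is a hypothetical helper, not part of
// Zend_Search_Lucene. It strips markup and returns lower-cased words.

function htmlToWords($html)
{
    // Drop script and style blocks entirely; their content is not prose.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Replace every remaining tag with a space so adjacent words stay apart.
    $text = preg_replace('/<[^>]+>/', ' ', $text);

    // Turn entities such as &amp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Split on anything that is not a letter or digit (ASCII only here).
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}
```

The resulting words could be joined back together with implode(' ', ...) and passed to an UnStored field, as was done for the contents field earlier.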

Custom scoring algorithms

Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
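To make the idea of field weighting concrete, here is a toy calculation in plain PHP. It is not the Zend_Search_Lucene scoring interface (see the link above for that), and weightedScore() and its weights are made up for illustration: each occurrence of the term in a field is multiplied by that field's weight.

```php
<?php
// Toy illustration only: weightedScore() and its weights are hypothetical
// and unrelated to how Zend_Search_Lucene computes scores internally.

function weightedScore($term, array $fields, array $weights)
{
    $score = 0.0;
    foreach ($weights as $field => $weight) {
        if (!isset($fields[$field])) {
            continue;
        }
        // Count case-insensitive occurrences of the term in this field.
        $count = substr_count(strtolower($fields[$field]), strtolower($term));
        $score += $weight * $count;
    }
    return $score;
}
```

With weights such as array('title' => 5.0, 'contents' => 1.0), a single match in the title outweighs several matches in the body, which is the kind of preference described above.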

Custom storage method

You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.

It may be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.

Conclusion

In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.

We also looked briefly at some ways of extending the search capabilities.

Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.




GET['query'] : '';
Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html

By Quentin Zervaas, 27 April 2006

This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.

There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).

It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.

In this article we will be covering the following:

  • How to index a document or series of documents
  • The different types of fields that can be indexed

  • Searching the index

To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.

How fulltext indexing and querying works

Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.

The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.

So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.

Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.

Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.

Querying the data

Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.

Keeping the index up-to-date

If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.

There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.

Getting started

The first thing we must do is install the Zend Framework, if you have not already done so. Its files are organised in a similar way to PEAR’s. At this stage, the Zend Framework is only in a “preview” phase. At the time of writing, the current version was Preview 0.1.3.

You can download this from http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.

I’m not exactly sure where the developers intended the framework to be stored, but since PEAR is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.

Highlight: Plain

$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend

So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php

Which now becomes:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
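If you cannot edit httpd.conf, the same effect can usually be had at runtime from within a script. This is only a sketch: it assumes the framework was moved to /usr/local/lib/zend as above, and it would need to run before any Zend file is included.

```php
<?php
// Assumed install location from the steps above; adjust to your own setup.
$zendPath = '/usr/local/lib/zend';

// Append the directory to the include path if it is not already listed.
$paths = explode(PATH_SEPARATOR, get_include_path());

if (!in_array($zendPath, $paths)) {
    set_include_path(get_include_path() . PATH_SEPARATOR . $zendPath);
}
```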

Creating our first index

The basic process for creating an index is:

  1. Open the index
  2. Add each document
  3. Commit (save) the index

The index is stored on the filesystem, so when you open an index, you specify the path where it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath, true);
?>

You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.

Adding a document to our index

Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:

Highlight: PHP

<?php
    $doc = new Zend_Search_Lucene_Document();
?>

The next thing we must do is determine which fields we need to add to our index.

There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.

As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:

  • Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
  • Title – we’re definitely going to include the title in our results
  • Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
  • Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
  • Created – we’ll also store a timestamp of when the article was created

This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:

  • UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
  • UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)

  • Text – Data that is available for search and is stored in full (title and author)

There are also Keyword and Binary field types available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
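One way to keep the field types straight is to write them down as a table of flags. The array layout and the describeField() helper below are illustrative only and not part of the library; the 'tokenized' flag (whether the value is split into separate words) reflects my understanding of the library documentation rather than anything covered in this article.

```php
<?php
// Field type => what happens to the data. Illustrative summary only.
$fieldTypes = array(
    'Keyword'   => array('indexed' => true,  'stored' => true,  'tokenized' => false),
    'UnIndexed' => array('indexed' => false, 'stored' => true,  'tokenized' => false),
    'Binary'    => array('indexed' => false, 'stored' => true,  'tokenized' => false),
    'Text'      => array('indexed' => true,  'stored' => true,  'tokenized' => true),
    'UnStored'  => array('indexed' => true,  'stored' => false, 'tokenized' => true),
);

// Hypothetical helper that turns the flags into a readable description.
function describeField(array $fieldTypes, $type)
{
    $flags = $fieldTypes[$type];

    return $type . ': '
        . ($flags['indexed'] ? 'searchable' : 'not searchable') . ', '
        . ($flags['stored'] ? 'returned with results' : 'not returned with results');
}

foreach (array_keys($fieldTypes) as $type) {
    echo describeField($fieldTypes, $type), "\n";
}
```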

To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved using the Zend_Search_Lucene_Field class, with the field type as the static method name.

In other words, to create the title field data, we use:

Highlight: PHP

<?php
    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>

Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.

So to add all the data with the field types we just worked out, we would use this:

Highlight: PHP

<?php
    $doc = new Zend_Search_Lucene_Document();

    // stored with the document, but not searchable
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));

    // searchable and stored in full
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));

    // searchable, but not stored in full
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>

Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.

Finally, we add the document to the index using addDocument():

Highlight: PHP

<?php
    $index->addDocument($doc);
?>

We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).

Committing / saving the index

Once all documents have been added, the index must be saved.

Highlight: PHP

<?php
    $index->commit();
?>

You can call commit() after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.

If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.

Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.

Indexing all the articles on phpRiot

Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.

Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.

Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.

Extending Zend_Search_Lucene_Document

On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of calling this class directly, we’re going to extend this class to encapsulate all of the adding of data we also did.

In other words, we’re going to move the calls to addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.

Highlight: PHP

<?php
    class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
    {
        /**
         * Constructor. Creates our indexable document and adds all
         * necessary fields to it using the passed in DatabaseObject
         * holding the article data.
         */
        public function __construct(&$document)
        {
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url', $document->generateUrl()));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
        }
    }
?>

As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).

Building the full index

Now that we have our class, we can create our index, loop over the documents, and then save our index:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');
    require_once('DatabaseObject/PhpriotDocument.class.php');

    // where to save our index
    $indexPath = '/var/www/phpriot.com/data/docindex';

    // get a list of all database ids for documents
    $doc_ids = PhpriotDocument::GetDocIds($db);

    // create our index
    $index = new Zend_Search_Lucene($indexPath, true);

    foreach ($doc_ids as $doc_id) {
        // load our databaseobject
        $document = new PhpriotDocument($db);
        $document->loadRecord($doc_id);

        // create our indexed document and add it to the index
        $index->addDocument(new PhpRiotIndexedDocument($document));
    }

    // write the index to disk
    $index->commit();
?>

The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).

How querying works in Zend_Search_Lucene

Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.

When we created the index, we added six fields, but only three of those were actually searchable: the document title, the document content and the author.

Each of these items is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).

So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility)

Likewise, to search in the title field for ‘php’, we would use title:php.

As we briefly mentioned earlier in this article, the default field that is searched is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.

Including and excluding terms

By default, all specified terms are combined using a boolean ‘or’. This means a document is returned if it contains any of the terms. To force results to have a particular term, the plus symbol is used. To force results not to have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
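To illustrate the syntax, here is a small sketch that splits a raw query into its field, term and sign. The parseQuery() function is hypothetical and is not Zend_Search_Lucene’s own parser (which also handles phrases and reports errors); it simply shows why the two placements of the plus or minus sign mean the same thing.

```php
<?php
// Hypothetical helper: split a raw query into clauses of field, term and sign.
function parseQuery($query, $defaultField = 'contents')
{
    $clauses = array();
    $tokens  = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($tokens as $token) {
        $field = $defaultField;
        $sign  = null; // null = optional, true = required, false = prohibited

        // the sign may come before the field name...
        if ($token[0] === '+' || $token[0] === '-') {
            $sign  = ($token[0] === '+');
            $token = (string) substr($token, 1);
        }

        // ...which is followed by an optional "field:" prefix...
        $pos = strpos($token, ':');
        if ($pos !== false) {
            $field = substr($token, 0, $pos);
            $token = (string) substr($token, $pos + 1);
        }

        // ...or the sign may come directly before the term
        if ($token !== '' && ($token[0] === '+' || $token[0] === '-')) {
            $sign  = ($token[0] === '+');
            $token = (string) substr($token, 1);
        }

        if ($token === '') {
            continue; // this sketch silently skips clauses with no term
        }

        // searches are case-insensitive, so terms are lower-cased
        $clauses[] = array('field' => $field, 'term' => strtolower($token), 'sign' => $sign);
    }

    return $clauses;
}

print_r(parseQuery('php -author:Quentin'));
```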

Searching for phrases

It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.

Sample queries

Here are some queries you can pass to Zend_Search_Lucene and their meanings.

Highlight: Plain

php
    // search the index for any article with the word php

php -author:quentin
    // find any article with the word php not written by me

author:quentin
    // find all the articles by me

php -ajax
    // find all articles with the word php that don't have the word ajax

title:mysql
    // find all articles with MySQL in the title

title:mysql -author:quentin
    // find all articles with MySQL in the title not by me

And so on. Hopefully you get the idea.
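If it helps to see the rules spelled out, the sketch below checks a single document against a few of these queries. Everything in it is hypothetical: the matches() function, the way the document is represented as lists of words per field, and the queries written out by hand as clauses (true = required, false = prohibited, null = optional). It reflects my understanding of the rules, not the library’s actual code.

```php
<?php
// Hypothetical check of one document against a list of query clauses.
function matches(array $docTerms, array $clauses)
{
    $hasRequired = false;
    $optionalHit = false;

    foreach ($clauses as $clause) {
        $found = isset($docTerms[$clause['field']])
            && in_array($clause['term'], $docTerms[$clause['field']]);

        if ($clause['sign'] === true) {
            $hasRequired = true;
            if (!$found) {
                return false; // a required term is missing
            }
        } elseif ($clause['sign'] === false) {
            if ($found) {
                return false; // a prohibited term is present
            }
        } elseif ($found) {
            $optionalHit = true;
        }
    }

    // without required terms, at least one optional term has to match
    return $hasRequired || $optionalHit;
}

// a sample article, reduced to the words found in each searchable field
$doc = array(
    'contents' => array('php', 'mysql'),
    'title'    => array('mysql', 'basics'),
    'author'   => array('quentin'),
);

// php -author:quentin
$notByMe = array(
    array('field' => 'contents', 'term' => 'php', 'sign' => null),
    array('field' => 'author', 'term' => 'quentin', 'sign' => false),
);

// php -ajax
$noAjax = array(
    array('field' => 'contents', 'term' => 'php', 'sign' => null),
    array('field' => 'contents', 'term' => 'ajax', 'sign' => false),
);

var_dump(matches($doc, $notByMe)); // bool(false): the article is by quentin
var_dump(matches($doc, $noAjax));  // bool(true)
```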

Scoring of results

All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.

The results are ordered by their score, from highest to lowest.

I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.

You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.

Querying our index

On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.

Now we will look at actually pulling documents from our index using that term.

There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.

In either case, you use the find() method on the index. The find() method returns a list of matches from your index.

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    $hits = $index->find('php +author:Quentin');
?>

This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.

We could also manually build this same query with function calls like so:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();
    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);

    $hits = $index->find($query);
?>

The second parameter of addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.

The second parameter of Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.

On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.

Dealing with returned results

The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.

Each of the stored fields is available as a property of each hit.

So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $query = 'php +author:Quentin';

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    $hits = $index->find($query);
    $numHits = count($hits);
?>
<p>
    Found <?=$numHits?> result(s) for query <?=htmlSpecialChars($query)?>.
</p>

<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>

Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.

Creating a simple search engine

Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    if (strlen($query) > 0) {
        $hits = $index->find($query);
        $numHits = count($hits);
    }
?>
<form method="get" action="search.php">
    <input type="text" name="query" value="<?=htmlSpecialChars($query)?>" />
    <input type="submit" value="Search" />
</form>

<?php if (strlen($query) > 0) { ?>
    <p>
        Found <?=$numHits?> result(s) for query <?=htmlSpecialChars($query)?>.
    </p>

    <?php foreach ($hits as $hit) { ?>
        <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
        <p>
            By <?=$hit->author?>
        </p>
        <p>
            <?=$hit->teaser?><br />
            <a href="<?=$hit->url?>">Read more...</a>
        </p>
    <?php } ?>
<?php } ?>

Error handling

One thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.

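Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    try {
        $hits = $index->find($query);
    }
    catch (Zend_Search_Lucene_Exception $ex) {
        // treat a failed search as one that returned no results
        $hits = array();
    }

    $numHits = count($hits);
?>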

This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.

Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
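A sketch of that second approach follows. So that it runs on its own, it uses a made-up $fakeFind function in place of the call to the index and catches the generic Exception class; in the real script you would catch Zend_Search_Lucene_Exception around the search call instead. The safeSearch() wrapper is likewise hypothetical.

```php
<?php
// Hypothetical wrapper: $search is any callable that runs the query and may throw.
function safeSearch($search, $query)
{
    try {
        return array('hits' => call_user_func($search, $query), 'error' => null);
    } catch (Exception $ex) {
        // no hits, but keep the message so it can be shown to the user
        return array('hits' => array(), 'error' => $ex->getMessage());
    }
}

// Stand-in for the real search; it rejects a field name with no term after it.
$fakeFind = function ($query) {
    if (substr($query, -1) === ':') {
        throw new Exception('Missing term after field name');
    }

    return array('result for ' . $query);
};

$result = safeSearch($fakeFind, 'title:');

if ($result['error'] !== null) {
    // escape the message, since it may contain part of the user's input
    echo 'Search error: ' . htmlspecialchars($result['error']);
}
```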

Keeping the index up-to-date

The other thing we haven’t yet dealt with is if any of our documents are updated. There are several ways to handle this:

  • Update just the entry for the updated document straight away
  • Rebuild the entire index straight away when a document is updated
  • Rebuild the entire index at a certain time each day (or several times per day)

The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.

To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).

There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.

So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.

Extending Zend_Search_Lucene

There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:

  • A custom tokenizer for determining keywords in a document
  • Custom scoring algorithms to determine how well a document matches a search query
  • A custom storage method, so your index is stored however and wherever you please

A custom tokenizer

There are many reasons why a custom tokenizer can be useful. Here are some ideas:

  • PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
  • Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
  • HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML keywords but only the actual content. You could make further improvements on this also, such as finding all headings and treating them with higher preference to the rest of the content.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
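To give a feel for the HTML idea, here is a rough sketch in plain PHP that reduces a chunk of HTML to its words. It does not implement the library’s tokenizer interface (see the manual link above for that); the htmlToKeywords() function is made up, and it could just as easily be used to clean up the text before passing it to a field.

```php
<?php
// Hypothetical helper: reduce HTML to a list of lower-case words.
function htmlToKeywords($html)
{
    // put a space before each tag so words in adjacent elements don't run together
    $spaced = str_replace('<', ' <', $html);

    // drop the markup, then turn entities such as &amp; back into characters
    $text = html_entity_decode(strip_tags($spaced), ENT_QUOTES, 'UTF-8');

    // split on anything that is not a letter or digit
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(htmlToKeywords('<h1>PHP &amp; MySQL</h1><p class="intro">Search <em>made</em> easy</p>'));
```

Tag names and attributes such as class="intro" never make it into the word list, which is the point of the exercise.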

Custom scoring algorithms

Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.

More information can be found on this at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
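The idea of favouring one field over another can be sketched without the library at all. In the example below, the weights and the weightedScore() function are invented for illustration; this is not the scoring interface described in the manual, just the arithmetic behind the idea.

```php
<?php
// Hypothetical weights: a match in the title counts five times as much
// as a match in the contents.
$weights = array('title' => 5.0, 'contents' => 1.0);

function weightedScore(array $matchesPerField, array $weights)
{
    $score = 0.0;

    foreach ($matchesPerField as $field => $count) {
        // fields without an explicit weight count once per match
        $weight = isset($weights[$field]) ? $weights[$field] : 1.0;
        $score += $weight * $count;
    }

    return $score;
}

// one match in the title and two in the contents: 1 * 5 + 2 * 1 = 7
echo weightedScore(array('title' => 1, 'contents' => 2), $weights);
```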

Custom storage method

You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.

It may or may not be possible to change this to store all indexed data in a database; I haven’t actually tried this, so I’m not sure.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.

Conclusion

In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.

We also looked briefly at some ways of extending the search capabilities.

Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.


GET['query']) ?
Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html

By Quentin Zervaas, 27 April 2006

This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.

There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).

It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.

In this article we will be covering the following:

  • How to index a document or series of documents
  • The different types of fields that can be indexed

  • Searching the index

To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.

How fulltext indexing and querying works

Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.

The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.

So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.

Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.

Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.

Querying the data

Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.

Keeping the index up-to-date

If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.

There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.

Getting started

The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.

You can download this from http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.

I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.

Highlight: Plain

$ cd /usr/local/src

$ wget http://framework.zend.com/download/tgz

$ tar -zxf ZendFramework-0.1.3.tar.gz

$ mv ZendFramework-0.1.3/library /usr/local/lib/zend

So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php

Which now becomes:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend

Creating our first index

The basic process for creating an index is:

  1. Open the index
  2. Add each document

  1. Commit (save) the index

The index is stored on the filesystem, so when you open an index, you specify the path where the it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = newZend_Search_Lucene($indexPath, true);

?>

You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.

Adding a document to our index

Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:

Highlight: PHP

<?php

    $doc = newZend_Search_Lucene_Document();

?>

The next thing we must do is determine which fields we need to add to our index.

There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.

As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:

    • Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
    • Title – we’re definitely going to include the title in our results
    • Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
    • Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author

  • Created – We’ll also store a timestamp of when the article was created.

This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:

  • UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
  • UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)

  • Text – Data that is available for search and is stored in full (title and author)

There is also the Keyword and Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.

To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.

In other words, to create the title field data, we use:

Highlight: PHP

<?php

    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);

?>

Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.

So to add all the data with the field types we just worked out, we would use this:

Highlight: PHP

<?php
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>

Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.

Finally, we add the document to the index using addDocument():

Highlight: PHP

<?php
    $index->addDocument($doc);
?>

We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).

Committing / saving the index

Once all documents have been added, the index must be saved.

Highlight: PHP

<?php
    $index->commit();
?>

You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call commit once, after you’ve done what you need to do with the index.

If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.

Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.

Indexing all the articles on phpRiot

Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.

Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.

Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.

Extending Zend_Search_Lucene_Document

On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of using this class directly, we’re going to extend it so that it encapsulates the code that adds each field.

In other words, we’re going to move the calls to addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.

Highlight: PHP

<?php
    class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
    {
        /**
         * Constructor. Creates our indexable document and adds all
         * necessary fields to it using the passed in DatabaseObject
         * holding the article data.
         */
        public function __construct(&$document)
        {
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url', $document->generateUrl()));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
        }
    }
?>

As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).

Building the full index

Now that we have our class, we can create our index, loop over the documents, and then save our index:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');
    require_once('DatabaseObject/PhpriotDocument.class.php');

    // where to save our index
    $indexPath = '/var/www/phpriot.com/data/docindex';

    // get a list of all database ids for documents
    $doc_ids = PhpriotDocument::GetDocIds($db);

    // create our index
    $index = new Zend_Search_Lucene($indexPath, true);

    foreach ($doc_ids as $doc_id) {
        // load our databaseobject
        $document = new PhpriotDocument($db);
        $document->loadRecord($doc_id);

        // create our indexed document and add it to the index
        $index->addDocument(new PhpRiotIndexedDocument($document));
    }

    // write the index to disk
    $index->commit();
?>

The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).

How querying works in Zend_Search_Lucene

Now comes the most important part: actually finding stuff! There are quite a lot of options when it comes to querying your index, giving you and your users a lot of control over the returned results.

When we created the index, we added six fields, but only three of those were actually searchable: document title, document content, and the author.

Each of these items is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).

So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)

Likewise, to search in the title field for ‘php’, we would use title:php.

As we briefly mentioned earlier in this article, the default field that is searched is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.

Including and excluding terms

By default, all specified terms are searched on using a boolean ‘or’. This means that a document is returned if it contains any of the terms. To force results to have a particular term, the plus symbol is used. To force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
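If your search form has separate inputs for "must include" and "must not include", the rules above can be turned into a query string in code. The following is only a sketch under that assumption: buildQueryString() and its array format are hypothetical and not part of Zend_Search_Lucene, and it does no escaping of special characters in the terms.

```php
<?php
// Hypothetical helper, not part of Zend_Search_Lucene: builds a query string
// from three lists of terms. Each term is array('term' => ..., 'field' => ...),
// where 'field' is optional and falls back to the default contents field.
function buildQueryString(array $optional, array $required = array(), array $prohibited = array())
{
    $parts  = array();
    $groups = array('' => $optional, '+' => $required, '-' => $prohibited);

    foreach ($groups as $sign => $terms) {
        foreach ($terms as $t) {
            // prefix the field name only when one was given
            $term = isset($t['field']) ? $t['field'] . ':' . $t['term'] : $t['term'];
            $parts[] = $sign . $term;
        }
    }

    return implode(' ', $parts);
}

$query = buildQueryString(
    array(array('term' => 'php')),
    array(array('term' => 'Quentin', 'field' => 'author')),
    array(array('term' => 'ajax'))
);
// $query is now: php +author:Quentin -ajax
```

The resulting string can then be handed to the index in the same way as a query typed in by a user.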

Searching for phrases

It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including it in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.

Sample queries

Here are some queries you can pass to Zend_Search_Lucene and their meanings.

Highlight: Plain

php
    // search the index for any article with the word php

php -author:quentin
    // find any article with the word php not written by me

author:quentin
    // find all the articles by me

php -ajax
    // find all articles with the word php that don't have the word ajax

title:mysql
    // find all articles with MySQL in the title

title:mysql -author:quentin
    // find all articles with MySQL in the title not by me

And so on. Hopefully you get the idea.

Scoring of results

All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.

The results are ordered by their score, from highest to lowest.

I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.

You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
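Because the raw score is hard to interpret on its own, one option is to display it relative to the best match instead. This is not something the article's implementation does; it is a minimal sketch, and relativeScores() is a hypothetical helper. It only assumes that each hit exposes a numeric score property, so plain stdClass objects stand in for real hits here.

```php
<?php
// Hypothetical helper: express each hit's score as a percentage of the best
// score in the result set. $hits is any list of objects with a numeric
// "score" property, ordered or not.
function relativeScores(array $hits)
{
    $best = 0;
    foreach ($hits as $hit) {
        $best = max($best, $hit->score);
    }

    $percentages = array();
    foreach ($hits as $hit) {
        // guard against an empty or all-zero result set
        $percentages[] = $best > 0 ? (int) round($hit->score / $best * 100) : 0;
    }

    return $percentages;
}

// Stand-in objects so the sketch runs without an index.
$first = new stdClass();
$first->score = 0.8;

$second = new stdClass();
$second->score = 0.2;

$percentages = relativeScores(array($first, $second));
```

With this approach the top result always shows 100, which says nothing about how good that match is in absolute terms, only how the other results compare to it.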

Querying our index

On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.

Now we will look at actually pulling documents from our index using those queries.

There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.

In either case, you use the find() method on the index. The find() method returns a list of matches from your index.

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath);

    $hits = $index->find('php +author:Quentin');
?>

This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.

We could also manually build this same query with function calls like so:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath);

    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();
    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);

    $hits = $index->find($query);
?>

The second parameter of addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.

The second parameter of Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.

On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.

Dealing with returned results

The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.

Each of the stored fields is available as a property of each hit.

So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $query = 'php +author:Quentin';

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath);

    $hits = $index->find($query);
    $numHits = count($hits);
?>
<p>
    Found <?=$numHits?> result(s) for query <?=$query?>.
</p>

<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>

Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
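One thing the template above does not do is escape the stored fields before printing them. If a title or teaser could ever contain characters such as < or &, it is safer to escape on output. The sketch below shows one way to do that; escapeHtml() and renderHit() are hypothetical helpers, and a stdClass object stands in for a real hit so the example runs without an index.

```php
<?php
// Hypothetical helper: escape a value for safe use in HTML text or attributes.
function escapeHtml($value)
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Hypothetical helper: builds the HTML for one search hit, escaping every
// stored field. $hit is any object exposing title, author, teaser, url and
// score properties.
function renderHit($hit)
{
    return '<h3>' . escapeHtml($hit->title) . ' (score: ' . round($hit->score, 2) . ')</h3>' . "\n"
         . '<p>By ' . escapeHtml($hit->author) . '</p>' . "\n"
         . '<p>' . escapeHtml($hit->teaser) . '<br />'
         . '<a href="' . escapeHtml($hit->url) . '">Read more...</a></p>' . "\n";
}

// Stand-in hit with made-up values, used only to exercise the helper.
$hit = new stdClass();
$hit->title  = 'PHP & <MySQL>';
$hit->author = 'Quentin';
$hit->teaser = 'A short summary';
$hit->url    = '/articles/example?a=1&b=2';
$hit->score  = 0.5;

$html = renderHit($hit);
```

Inside the foreach loop you would then print renderHit($hit) instead of writing out each field by hand.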

Creating a simple search engine

Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:

Highlight: PHP

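<?php
    require_once('Zend/Search/Lucene.php');

    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath);

    if (strlen($query) > 0) {
        $hits = $index->find($query);
        $numHits = count($hits);
    }
?>
<form method="get" action="search.php">
    <input type="text" name="query" value="<?=htmlspecialchars($query)?>" />
    <input type="submit" value="Search" />
</form>

<?php if (strlen($query) > 0) { ?>
    <p>
        Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
    </p>

    <?php foreach ($hits as $hit) { ?>
        <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
        <p>
            By <?=$hit->author?>
        </p>
        <p>
            <?=$hit->teaser?><br />
            <a href="<?=$hit->url?>">Read more...</a>
        </p>
    <?php } ?>
<?php } ?>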

Error handling

The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.

Highlight: PHP

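<?php
    require_once('Zend/Search/Lucene.php');

    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath);

    try {
        $hits = $index->find($query);
    }
    catch (Zend_Search_Lucene_Exception $ex) {
        $hits = array();
    }

    $numHits = count($hits);
?>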

This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.

Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
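As a rough illustration of that second option, the sketch below wraps the search in a function that returns both the hits and any error message, so the page can decide what to show. runSearch() and FailingIndexStub are hypothetical names, and the stub's error text is invented for the example. Note also that this sketch catches the base Exception class so that it runs without the Zend Framework installed; in real code you would probably narrow this to Zend_Search_Lucene_Exception as described above.

```php
<?php
// Hypothetical wrapper: runs a search and reports failures instead of hiding
// them. $index is any object with a find() method, such as an opened
// Zend_Search_Lucene index.
function runSearch($index, $query)
{
    try {
        return array('hits' => $index->find($query), 'error' => null);
    } catch (Exception $ex) {
        // Zend_Search_Lucene_Exception extends Exception, so it is caught here too.
        return array('hits' => array(), 'error' => $ex->getMessage());
    }
}

// Stand-in index so the sketch runs without the Zend Framework installed.
class FailingIndexStub
{
    public function find($query)
    {
        throw new Exception('Syntax error in query: ' . $query);
    }
}

$result = runSearch(new FailingIndexStub(), 'title:');
```

If $result['error'] is not null, the page can print it (escaped) above the search form and still show zero hits.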

Keeping the index up-to-date

The other thing we haven’t yet dealt with is what to do when any of our documents are updated. There are several ways to handle this:

  • Update just the entry for the updated document straight away
  • Rebuild the entire index straight away when a document is updated
  • Rebuild the entire index at a certain time each day (or several times per day)

The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.

To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).

There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.

So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.

Extending Zend_Search_Lucene

There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:

  • A custom tokenizer for determining keywords in a document
  • Custom scoring algorithms to determine how well a document matches a search query
  • A custom storage method, so your index is stored however and wherever you please

A custom tokenizer

There are many reasons why a custom tokenizer can be useful. Here are some ideas:

  • PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
  • Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
  • HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML keywords but only the actual content. You could make further improvements on this also, such as finding all headings and giving them higher preference than the rest of the content.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.

Custom scoring algorithms

Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.

More information can be found on this at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.

Custom storage method

You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.

It may or may not be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.

Conclusion

In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.

We also looked briefly at some ways of extending the search capabilities.

Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.


There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.

As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:

    • Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
    • Title – we’re definitely going to include the title in our results
    • Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
    • Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author

  • Created – We’ll also store a timestamp of when the article was created.

This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:

  • UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
  • UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)

  • Text – Data that is available for search and is stored in full (title and author)

There is also the Keyword and Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.

To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.

In other words, to create the title field data, we use:

Highlight: PHP

<?php

    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);

?>

Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.

So to add all the data with the field types we just worked out, we would use this:

Highlight: PHP

<?php

    $doc = newZend_Search_Lucene_Document();

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));

    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));

    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));

    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));

?>

Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.

Finally, we add the document to the index using addDocument():

Highlight: PHP

<?php

    $index->addDocument($doc);

?>

We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).

Committing / saving the index

Once all documents have been added, the index must be saved.

Highlight: PHP

<?php

    $index->commit();

?>

You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in lamen’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.

If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.

Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.

Indexing all the articles on phpRiot

Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.

Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.

Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.

Extending Zend_Search_Lucene_Document

On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of calling this class directly, we’re going to extend this class to encapsulate all of the adding of data we also did.

In other words, we’re going to move the calls to addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.

Highlight: PHP

<?php

    classPhpRiotIndexedDocumentextendsZend_Search_Lucene_Document

    {

        /**

         * Constructor. Creates our indexable document and adds all

         * necessary fields to it using the passed in DatabaseObject

         * holding the article data.

         */

        publicfunction__construct(&$document)

        {

            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url',     $document->generateUrl()));

            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));

            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));

            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));

            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));

            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));

        }

    }

?>

As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).

Building the full index

Now that we have our class, we can create our index, loop over the documents, and then save our index:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    require_once('DatabaseObject/PhpriotDocument.class.php');

    // where to save our index

    $indexPath = '/var/www/phpriot.com/data/docindex';

    // get a list of all database ids for documents

    $doc_ids = PhpriotDocument::GetDocIds($db);

    // create our index

    $index = newZend_Search_Lucene($indexPath, true);

    foreach($doc_idsas$doc_id){

        // load our databaseobject

        $document = newPhpriotDocument($db);

        $document->loadRecord($doc_id);

        // create our indexed document and add it to the index

        $index->addDocument(newPhpRiotIndexedDocument($document));

    }

    // write the index to disk

    $index->commit();

?>

The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).

How querying works in Zend_Search_Lucene

Now comes the most important part: actually finding stuff! There are quite a lot of options when it comes to querying your index, giving you and your users a lot of control over the returned results.

When we created the index, we added six fields, but only three of those are actually searchable: the document title, the document content, and the author.

Each of these fields is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).

So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)

Likewise, to search in the title field for ‘php’, we would use title:php.

As we briefly mentioned earlier in this article, the default field that is searched is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.

Including and excluding terms

By default, all specified terms are combined using a boolean ‘or’. This means a document is returned if it contains any of the terms. To force results to have a particular term, the plus symbol is used. To force results not to have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
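
To make the plus and minus rules concrete, here is a small sketch in plain PHP that sorts the terms of a query string into required, prohibited and optional groups. It only illustrates the syntax described above and is not the parser Zend_Search_Lucene actually uses. The function name classifyTerms() is made up for this example, and it ignores phrases and any other advanced syntax.

```php
<?php
// Illustration only: a toy classifier for the +/- query syntax.
// classifyTerms() is a hypothetical helper, not part of Zend_Search_Lucene.
function classifyTerms($query)
{
    $groups = array(
        'required'   => array(),
        'prohibited' => array(),
        'optional'   => array(),
    );

    // split on whitespace, ignoring empty pieces
    $terms = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($terms as $term) {
        $field = 'contents'; // the default field
        $sign  = '';

        // the sign may come before the field name (+author:Quentin) ...
        if ($term[0] == '+' || $term[0] == '-') {
            $sign = $term[0];
            $term = (string) substr($term, 1);
        }

        // split off the field name, if there is one
        $pos = strpos($term, ':');
        if ($pos !== false && $pos > 0) {
            $field = substr($term, 0, $pos);
            $term  = (string) substr($term, $pos + 1);
        }

        // ... or before the term itself (author:+Quentin)
        if ($sign == '' && strlen($term) > 0 && ($term[0] == '+' || $term[0] == '-')) {
            $sign = $term[0];
            $term = (string) substr($term, 1);
        }

        // skip things like "title:" that have no term at all
        if ($term === '') {
            continue;
        }

        // searching is case-insensitive, so normalise the term
        $entry = $field . ':' . strtolower($term);

        if ($sign == '+') {
            $groups['required'][] = $entry;
        } else if ($sign == '-') {
            $groups['prohibited'][] = $entry;
        } else {
            $groups['optional'][] = $entry;
        }
    }

    return $groups;
}

$groups = classifyTerms('php +author:Quentin -ajax');
```

Here ‘php’ ends up in the optional group for the contents field, ‘quentin’ is required in the author field, and ‘ajax’ is prohibited in the contents field.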

Searching for phrases

It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including it in our examples or implementation. However, there is a lot of information about it in the Zend_Search_Lucene manual section on query types.

Sample queries

Here are some queries you can pass to Zend_Search_Lucene and their meanings.

Highlight: Plain

php
    // search the index for any article with the word php

php -author:quentin
    // find any article with the word php not written by me

author:quentin
    // find all the articles by me

php -ajax
    // find all articles with the word php that don't have the word ajax

title:mysql
    // find all articles with MySQL in the title

title:mysql -author:quentin
    // find all articles with MySQL in the title not by me

And so on. Hopefully you get the idea.

Scoring of results

All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.

The results are ordered by their score, from highest to lowest.

I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.

You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
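
As a rough guide to what a score like this usually reflects: Lucene-style engines generally reward a document when the term appears in it often (term frequency) and when the term is rare across the whole index (inverse document frequency). The sketch below is a toy version of that idea in plain PHP. The formulas and the function name toyScore() are assumptions chosen for illustration and are not taken from the Zend_Search_Lucene source, which may apply further factors.

```php
<?php
// Toy relevance score: NOT the library's real formula, just the general idea.
// $docs maps a document id to its list of (lowercased) words.
function toyScore($term, $docs)
{
    $numDocs = count($docs);

    // count how many documents contain the term at all
    $docFreq = 0;
    foreach ($docs as $words) {
        if (in_array($term, $words)) {
            $docFreq++;
        }
    }

    // rarer terms get a higher weight
    $idf = log($numDocs / ($docFreq + 1)) + 1;

    $scores = array();
    foreach ($docs as $id => $words) {
        $counts = array_count_values($words);
        $freq   = isset($counts[$term]) ? $counts[$term] : 0;

        if ($freq > 0) {
            // more occurrences raise the score, with diminishing returns
            $scores[$id] = sqrt($freq) * $idf;
        }
    }

    // highest score first, keeping the document ids as keys
    arsort($scores);

    return $scores;
}

$docs = array(
    'a' => array('php', 'and', 'mysql'),
    'b' => array('php', 'php', 'php', 'php'),
    'c' => array('ajax', 'tips'),
);

$scores = toyScore('php', $docs);
```

In this example document ‘b’ is ranked above ‘a’ because it mentions the term more often, and ‘c’ is left out because it doesn’t contain the term at all.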

Querying our index

On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.

Now we will look at actually pulling documents from our index using those queries.

There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.

In either case, you use the find() method on the index. The find() method returns a list of matches from your index.

Highlight: PHP

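<?php
    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    // open the existing index (no second parameter)
    $index = new Zend_Search_Lucene($indexPath);

    // let Zend_Search_Lucene parse the raw query string
    $hits = $index->find('php +author:Quentin');
?>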

This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.

We could also manually build this same query with function calls like so:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();

    // 'php' in the default field, neither required nor prohibited
    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);

    // 'Quentin' in the author field, required
    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);

    $hits = $index->find($query);
?>

The second parameter for addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.

The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.

On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.

Dealing with returned results

The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.

Each of the fields stored in the index is available as a property of each hit.

So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $query = 'php +author:Quentin';

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    $hits = $index->find($query);
    $numHits = count($hits);
?>
<p>
    Found <?=$numHits?> result(s) for query <?=$query?>.
</p>
<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>

Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
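
One thing the template above does not do is escape the stored fields before printing them. If a title or teaser ever contained HTML special characters, they would be output as-is. Below is a minimal sketch of a helper that escapes each field first. The name renderHit() is hypothetical, and a plain stdClass object stands in for a real hit so the snippet can run without an index.

```php
<?php
// Hypothetical helper: builds the HTML for one search hit,
// escaping every stored field before it is output.
function renderHit($hit)
{
    $html  = '<h3>' . htmlspecialchars($hit->title)
           . ' (score: ' . sprintf('%.2f', $hit->score) . ')</h3>' . "\n";

    $html .= '<p>By ' . htmlspecialchars($hit->author) . '</p>' . "\n";

    $html .= '<p>' . htmlspecialchars($hit->teaser) . '<br />'
           . '<a href="' . htmlspecialchars($hit->url) . '">Read more...</a>'
           . '</p>' . "\n";

    return $html;
}

// a stand-in for a real hit object, with made-up values
$hit = new stdClass();
$hit->title  = 'PHP & MySQL <basics>';
$hit->author = 'Quentin';
$hit->teaser = 'A short summary of the article.';
$hit->url    = '/d/articles/example?a=1&b=2';
$hit->score  = 0.75;

echo renderHit($hit);
```

In the results loop you would then output renderHit($hit) for each hit instead of printing the properties directly.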

Creating a simple search engine

Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    // read the submitted query, defaulting to an empty string
    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    if (strlen($query) > 0) {
        $hits = $index->find($query);
        $numHits = count($hits);
    }
?>
<form method="get" action="search.php">
    <input type="text" name="query" value="<?=htmlspecialchars($query)?>" />
    <input type="submit" value="Search" />
</form>
<?php if (strlen($query) > 0) { ?>
    <p>
        Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
    </p>
    <?php foreach ($hits as $hit) { ?>
        <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
        <p>
            By <?=$hit->author?>
        </p>
        <p>
            <?=$hit->teaser?><br />
            <a href="<?=$hit->url?>">Read more...</a>
        </p>
    <?php } ?>
<?php } ?>
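
Because the hits come back as an ordinary array, splitting a long result list into pages needs nothing more than array_slice(). The following is a minimal sketch of that idea. The function pageOfHits() and the page size are made up for this example, and plain numbers stand in for real hits.

```php
<?php
// Hypothetical helper: returns the hits belonging to one page of results.
// Pages are numbered from 1; anything lower is treated as page 1.
function pageOfHits($hits, $page, $perPage)
{
    $page   = max(1, (int) $page);
    $offset = ($page - 1) * $perPage;

    // array_slice() returns an empty array if the offset is past the end
    return array_slice($hits, $offset, $perPage);
}

// stand-in for a real list of hits
$allHits = range(1, 25);

// the third page of ten results holds the last five hits
$pageHits = pageOfHits($allHits, 3, 10);
```

In search.php the page number could come from another GET parameter, and the loop would then run over the returned slice rather than the full list.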

Error handling

One thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.

Highlight: PHP

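<?php
    require_once('Zend/Search/Lucene.php');

    // read the submitted query, defaulting to an empty string
    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);

    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);

    try {
        $hits = $index->find($query);
    }
    catch (Zend_Search_Lucene_Exception $ex) {
        // treat a failed search as a search with no results
        $hits = array();
    }

    $numHits = count($hits);
?>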

This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.

Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
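
If you would rather tell the user that their query was invalid, one option is to keep the exception message alongside the (empty) result list. The sketch below shows the shape of that in plain PHP. To keep it runnable without an index, the search itself is passed in as a callback and a generic Exception is caught; in the real script you would call the index directly and catch Zend_Search_Lucene_Exception. The names runSearch() and fakeSearch() are made up for this example.

```php
<?php
// Hypothetical wrapper: returns array($hits, $errorMessage).
// $errorMessage is null when the search ran without problems.
function runSearch($searchCallback, $query)
{
    try {
        $hits = call_user_func($searchCallback, $query);
        return array($hits, null);
    }
    catch (Exception $ex) {
        // no hits, but remember why so we can tell the user
        return array(array(), $ex->getMessage());
    }
}

// stand-in for the real index lookup: rejects a field with no term
function fakeSearch($query)
{
    if (substr($query, -1) == ':') {
        throw new Exception('Missing search term after field name');
    }

    return array('result for ' . $query);
}

list($hits, $error) = runSearch('fakeSearch', 'title:');

if ($error !== null) {
    echo 'Sorry, your search could not be run: ' . htmlspecialchars($error);
}
```

Whether you show the raw message or a friendlier one of your own is a matter of taste; either way the hit count is still zero, so the rest of the page works unchanged.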

Keeping the index up-to-date

The other thing we haven’t yet dealt with is if any of our documents are updated. There are several ways to handle this:

  • Update just the entry for the updated document straight away
  • Rebuild the entire index straight away when a document is updated
  • Rebuild the entire index at a certain time each day (or several times per day)

The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.

To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).

There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.

So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
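
One drawback of rebuilding in place is that the old index is thrown away as soon as the rebuild starts, so searches made while it is running may find little or nothing. A possible workaround is to build the new index in a separate directory and only swap it into place once it is complete. The sketch below shows that idea using nothing but filesystem functions. The helper names rebuildAndSwap() and removeDir() and the ‘.new’ / ‘.old’ suffixes are made up for this example, and the actual indexing is left to a callback, which in practice would create the index in the directory it is given, add the documents and commit.

```php
<?php
// Hypothetical helper: build a fresh index in a temporary directory, then
// swap it into place. $buildCallback receives the directory to build into.
function rebuildAndSwap($livePath, $buildCallback)
{
    $newPath = $livePath . '.new';
    $oldPath = $livePath . '.old';

    // clear out anything left over from an earlier failed run
    removeDir($newPath);
    removeDir($oldPath);

    // build the complete index away from the live one
    mkdir($newPath, 0755, true);
    call_user_func($buildCallback, $newPath);

    // swap: move the live index aside, then move the new one in
    if (is_dir($livePath)) {
        rename($livePath, $oldPath);
    }
    rename($newPath, $livePath);

    // finally throw the old index away
    removeDir($oldPath);
}

// Recursively deletes a directory; does nothing if it doesn't exist.
function removeDir($path)
{
    if (!is_dir($path)) {
        return;
    }

    foreach (scandir($path) as $entry) {
        if ($entry == '.' || $entry == '..') {
            continue;
        }

        $full = $path . '/' . $entry;
        if (is_dir($full)) {
            removeDir($full);
        } else {
            unlink($full);
        }
    }

    rmdir($path);
}
```

There is still a brief moment between the two rename() calls where no index exists at the live path, so the search script should be prepared for that.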

Extending Zend_Search_Lucene

There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:

  • A custom tokenizer for determining keywords in a document
  • Custom scoring algorithms to determine how well a document matches a search query
  • A custom storage method, so your index is stored however and wherever you please

A custom tokenizer

There are many reasons why a custom tokenizer can be useful. Here are some ideas:

  • PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
  • Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
  • HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML keywords but only the actual content. You could make further improvements on this also, such as finding all headings and giving them a higher weighting than the rest of the content.
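
To give a feel for the HTML case, here is a small sketch of the text-cleaning step such a tokenizer would need: removing the markup and splitting what remains into lowercase words. It does not implement the analyzer interface that Zend_Search_Lucene expects (see the manual link below for that); htmlToWords() is a made-up helper, and for simplicity it only recognises unaccented letters and digits.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to a list of lowercase words.
function htmlToWords($html)
{
    // replace each tag with a space so neighbouring words don't run together
    $text = preg_replace('/<[^>]*>/', ' ', $html);

    // turn entities such as &amp; back into ordinary characters
    $text = html_entity_decode($text);

    // split on anything that isn't a letter or digit (ASCII only)
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

$words = htmlToWords('<h1>PHP Search</h1><p>Fast &amp; simple.</p>');
```

Here $words holds php, search, fast and simple, while tag names and attributes never reach the index.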

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.

Custom scoring algorithms

Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
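
As a toy illustration of what favouring one field over another means, the sketch below counts how often a term appears in each field and multiplies the count by a per-field weight. It is plain PHP and has nothing to do with the library’s own scoring classes. The function weightedScore() and the weights are made up for this example.

```php
<?php
// Toy field weighting: hypothetical, unrelated to the library's scoring classes.
// $fields maps field names to their text, $weights maps field names to a weight.
function weightedScore($term, $fields, $weights)
{
    $score = 0;

    foreach ($fields as $name => $text) {
        // fields without an explicit weight count once
        $weight = isset($weights[$name]) ? $weights[$name] : 1;

        // count whole-word, case-insensitive matches in this field
        $pattern = '/\b' . preg_quote($term, '/') . '\b/i';
        $matches = preg_match_all($pattern, $text, $unused);

        $score += $weight * $matches;
    }

    return $score;
}

// a match in the title is worth three matches in the contents
$weights = array('title' => 3, 'contents' => 1);

$first = weightedScore('mysql', array(
    'title'    => 'MySQL tips',
    'contents' => 'Using mysql with PHP. MySQL is fast.',
), $weights);

$second = weightedScore('mysql', array(
    'title'    => 'PHP tips',
    'contents' => 'mysql mysql mysql',
), $weights);
```

Both documents mention the term three times in total, but the first scores higher because one of its matches is in the title.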

More information can be found on this at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.

Custom storage method

You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.

It may or may not be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.

Conclusion

In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.

We also looked briefly at some ways of extending the search capabilities.

Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.


GET['query']) ?
Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html

By Quentin Zervaas, 27 April 2006

This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.

There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).

It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.

In this article we will be covering the following:

  • How to index a document or series of documents
  • The different types of fields that can be indexed

  • Searching the index

To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.

How fulltext indexing and querying works

Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.

The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.

So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.

Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.

Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.

Querying the data

Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.

Keeping the index up-to-date

If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.

There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.

Getting started

The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.

You can download this from http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.

I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.

Highlight: Plain

$ cd /usr/local/src

$ wget http://framework.zend.com/download/tgz

$ tar -zxf ZendFramework-0.1.3.tar.gz

$ mv ZendFramework-0.1.3/library /usr/local/lib/zend

So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php

Which now becomes:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend

Creating our first index

The basic process for creating an index is:

  1. Open the index
  2. Add each document

  1. Commit (save) the index

The index is stored on the filesystem, so when you open an index, you specify the path where the it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = newZend_Search_Lucene($indexPath, true);

?>

You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.

Adding a document to our index

Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:

Highlight: PHP

<?php

    $doc = newZend_Search_Lucene_Document();

?>

The next thing we must do is determine which fields we need to add to our index.

There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.

As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:

    • Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
    • Title – we’re definitely going to include the title in our results
    • Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
    • Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author

  • Created – We’ll also store a timestamp of when the article was created.

This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:

  • UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
  • UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)

  • Text – Data that is available for search and is stored in full (title and author)

There is also the Keyword and Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.

To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.

In other words, to create the title field data, we use:

Highlight: PHP

<?php

    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);

?>

Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.

So to add all the data with the field types we just worked out, we would use this:

Highlight: PHP

<?php

    $doc = newZend_Search_Lucene_Document();

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));

    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));

    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));

    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));

?>

Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.

Finally, we add the document to the index using addDocument():

Highlight: PHP

<?php

    $index->addDocument($doc);

?>

We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).

Committing / saving the index

Once all documents have been added, the index must be saved.

Highlight: PHP

<?php

    $index->commit();

?>

You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in lamen’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.

If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.

Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.

Indexing all the articles on phpRiot

Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.

Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.

Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.

Extending Zend_Search_Lucene_Document

On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of calling this class directly, we’re going to extend this class to encapsulate all of the adding of data we also did.

In other words, we’re going to move the calls to addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.

Highlight: PHP

<?php

    classPhpRiotIndexedDocumentextendsZend_Search_Lucene_Document

    {

        /**

         * Constructor. Creates our indexable document and adds all

         * necessary fields to it using the passed in DatabaseObject

         * holding the article data.

         */

        publicfunction__construct(&$document)

        {

            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url',     $document->generateUrl()));

            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));

            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));

            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));

            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));

            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));

        }

    }

?>

As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).

Building the full index

Now that we have our class, we can create our index, loop over the documents, and then save our index:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    require_once('DatabaseObject/PhpriotDocument.class.php');

    // where to save our index

    $indexPath = '/var/www/phpriot.com/data/docindex';

    // get a list of all database ids for documents

    $doc_ids = PhpriotDocument::GetDocIds($db);

    // create our index

    $index = newZend_Search_Lucene($indexPath, true);

    foreach($doc_idsas$doc_id){

        // load our databaseobject

        $document = newPhpriotDocument($db);

        $document->loadRecord($doc_id);

        // create our indexed document and add it to the index

        $index->addDocument(newPhpRiotIndexedDocument($document));

    }

    // write the index to disk

    $index->commit();

?>

The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).

How querying works in Zend_Search_Lucene

Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.

When we created the indexed, we added six fields, but only three of those were actually searchable: document title, document content, and the author.

Each of these items are stored separately for each indexed documents, meaning you can search on them separately. The syntax used to search each section is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).

So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility)

Likewise, to search in the title field for ‘php’, we would use title:php.

As we briefly mentioned earlier in this article, the default section that is searched in is the field called contents. So if you wanted to search the document body for the world ‘google’, you could use contents:google or just google.

Including and excluding terms

By default, all specified terms are searched on using a boolean ‘or’. This means that any of the terms can exist for a document to be returned. To force results to have a particular term, the plus symbol is used. This force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or the term name. In other words, +author:Quentin and author:+Quentin are identical.

Searching for phrases

It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation, however, there is alot of information on this on the Zend_Search_Lucene manual section on query types.

Sample queries

Here are some queries you can pass to Zend_Search_Lucene and their meanings.

Highlight: Plain

php

    // search the index for any article with the word php

 

php -author:quentin

    // find any article with the word php not written by me

 

author:quentin

    // find all the articles by me

 

php -ajax

    // find all articles with the word php that don't have the word ajax

 

title:mysql

    // find all articles with MySQL in the title

 

title:mysql -author:quentin

    // find all articles with MySQL in the title not by me

And so on. Hopefully you get the idea.

Scoring of results

All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.

The results are ordered by their score, from highest to lowest.

I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.

You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.

Querying our index

On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.

Now we will look at actually pulling documents from our index using that term.

There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.

In either case, you use the find() method on the index. The find() method returns a list of matches from your index.

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = newZend_Search_Lucene($indexPath);

    $hits = $index->query('php +author:Quentin');

?>

This sample code searches our index by also articles containing ‘php’, written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.

We could also manually build this same query with function calls like so:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = newZend_Search_Lucene($indexPath);

    $query = newZend_Search_Lucene_Search_Query_MultiTerm();

    $query->addTerm(newZend_Search_Lucene_Index_Term('php'), null);

    $query->addTerm(newZend_Search_Lucene_Index_Term('Quentin', 'author'), true);

    $hits = $index->query($query);

?>

The second parameter for addTerm used determines whether or not a field is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), null means it isn’t required or prohibited.

The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search index. By default this is contents.

On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.

Dealing with returned results

The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.

Each of the indexed fields is available as a property of the hit object.

So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:

Highlight: PHP

<?php
    require_once('Zend/Search/Lucene.php');

    $query = 'php +author:Quentin';

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath);

    $hits = $index->find($query);

    // find() returns an array, so count() gives the number of hits
    $numHits = count($hits);
?>

<p>
    Found <?=$numHits?> result(s) for query <?=$query?>.
</p>

<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>

Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
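
Because the hits come back as a plain array that is already ordered by score, splitting them into pages needs nothing beyond array_slice(). The sketch below assumes a page size of 10. pageOfHits() and pageCount() are hypothetical helper names, not part of the library.

```php
<?php
/**
 * Hypothetical helper: return one page of hits.
 * $hits is the array returned by find(), already ordered by score.
 * Pages are numbered from 1; anything lower is treated as page 1.
 */
function pageOfHits(array $hits, $page, $perPage = 10)
{
    $page   = max(1, (int) $page);
    $offset = ($page - 1) * $perPage;

    return array_slice($hits, $offset, $perPage);
}

/**
 * Number of pages needed to show every hit.
 */
function pageCount(array $hits, $perPage = 10)
{
    return (int) ceil(count($hits) / $perPage);
}
```

A results page would then loop over pageOfHits($hits, $page) instead of $hits, where $page comes from the request.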

Creating a simple search engine

Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $query = isset($_GET['query']) ? $_GET['query'] : '';

    // ...

?>
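
As a rough sketch of how the form and the results loop can fit together in search.php, the function below builds the whole page as a string. renderSearchPage() is a hypothetical helper. The form markup is an assumption, and so is escaping every stored field with htmlspecialchars() (sensible if the index holds plain text).

```php
<?php
/**
 * Hypothetical helper: render the search form and, if a query was given,
 * the list of results. $index is anything with a find() method, normally
 * a Zend_Search_Lucene instance.
 */
function renderSearchPage($index, $query)
{
    $html  = '<form method="get" action="search.php">';
    $html .= '<input type="text" name="query" value="'
           . htmlspecialchars($query) . '" />';
    $html .= '<input type="submit" value="Search" />';
    $html .= '</form>';

    if (strlen($query) == 0) {
        return $html; // nothing to search for yet, show only the form
    }

    $hits = $index->find($query);

    $html .= '<p>Found ' . count($hits) . ' result(s) for query '
           . htmlspecialchars($query) . '.</p>';

    foreach ($hits as $hit) {
        $html .= '<h3>' . htmlspecialchars($hit->title)
               . ' (score: ' . $hit->score . ')</h3>';
        $html .= '<p>By ' . htmlspecialchars($hit->author) . '</p>';
        $html .= '<p>' . htmlspecialchars($hit->teaser) . '<br />';
        $html .= '<a href="' . htmlspecialchars($hit->url)
               . '">Read more...</a></p>';
    }

    return $html;
}

// Usage in search.php (requires the Zend Framework on the include path):
//   require_once('Zend/Search/Lucene.php');
//   $query = isset($_GET['query']) ? $_GET['query'] : '';
//   $index = new Zend_Search_Lucene('/var/www/phpriot.com/data/docindex');
//   echo renderSearchPage($index, $query);
```

Passing the index in as an argument keeps the rendering code independent of where the index lives on disk.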

Error handling

The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no query after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.


This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.

Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
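
The approach described above can be sketched as a small wrapper around find(). safeFind() is a hypothetical name. The only library details it relies on are the find() method and the Zend_Search_Lucene_Exception class mentioned in the text.

```php
<?php
/**
 * Hypothetical wrapper: run a search and treat a query error as "no hits".
 * $index is normally a Zend_Search_Lucene instance.
 */
function safeFind($index, $query)
{
    try {
        return $index->find($query);
    } catch (Zend_Search_Lucene_Exception $ex) {
        // For example, the query was 'title:' with nothing after the colon.
        // $ex->getMessage() holds the details if you would rather log or
        // display them than hide the problem.
        return array();
    }
}
```

Replacing $index->find($query) with safeFind($index, $query) in the search script leaves the rest of the page unchanged, since count() on an empty array is simply zero.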

Keeping the index up-to-date

The other thing we haven’t yet dealt with is what happens when any of our documents are updated. There are several ways to handle this:

  • Update just the entry for the updated document straight away
  • Rebuild the entire index straight away when a document is updated
  • Rebuild the entire index at a certain time each day (or several times per day)

The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.

To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).

There isn’t much documentation on this library yet, and nothing I could find on this point specifically. If anybody knows how, please drop me an email or submit a comment on this article.

So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
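
A rebuild from scratch might look like the sketch below. The shape of the $article array and the choice of field types are assumptions made for illustration, and rebuildIndex() and articleToDocument() are hypothetical helper names. The library calls used are the Zend_Search_Lucene_Field factory methods, addField(), addDocument() and commit().

```php
<?php
/**
 * Turn one article (an associative array, a made-up shape for this sketch)
 * into an index document. Only callable with the Zend Framework loaded.
 */
function articleToDocument(array $article)
{
    $doc = new Zend_Search_Lucene_Document();
    // Text: indexed and stored, so it can be searched and displayed
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $article['title']));
    $doc->addField(Zend_Search_Lucene_Field::Text('author', $article['author']));
    // UnIndexed: stored for display only
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $article['url']));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $article['teaser']));
    // UnStored: searchable, but the full text is not kept in the index
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $article['contents']));

    return $doc;
}

/**
 * Fill a freshly created (empty) index with every article and return how
 * many were added. $toDocument is a callback so the conversion can be
 * swapped out.
 */
function rebuildIndex($index, array $articles, $toDocument = 'articleToDocument')
{
    foreach ($articles as $article) {
        $index->addDocument(call_user_func($toDocument, $article));
    }
    $index->commit();

    return count($articles);
}

// Usage (assumes $articles was loaded from your database):
//   $index = new Zend_Search_Lucene($indexPath, true); // true = create anew
//   rebuildIndex($index, $articles);
```

Because the index is recreated each time, no document can end up in it twice, at the cost of re-indexing everything on every update.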

Extending Zend_Search_Lucene

There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:

  • A custom tokenizer for determining keywords in a document
  • Custom scoring algorithms to determine how well a document matches a search query
  • A custom storage method, so your index is stored however and wherever you please

A custom tokenizer

There are many reasons why a custom tokenizer can be useful. Here are some ideas:

  • PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
  • Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
  • HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML tags but only the actual content. You could make further improvements on this also, such as finding all headings and giving them higher preference than the rest of the content.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
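
Short of writing a real tokenizer, part of the HTML idea can be approximated by cleaning the markup before the text is indexed. The sketch below is not a Zend_Search_Lucene analyzer. htmlToIndexableText() is a hypothetical pre-processing function, it assumes UTF-8 input, and its regular expressions are only a rough stand-in for a proper HTML parser.

```php
<?php
/**
 * Hypothetical pre-processing step: reduce an HTML fragment to plain text
 * so that tag and attribute names are not indexed as keywords.
 */
function htmlToIndexableText($html)
{
    // Drop script and style blocks entirely, including their contents
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Put a space where each tag was so adjacent words do not run together
    $text = preg_replace('#<[^>]+>#', ' ', $text);

    // Turn entities such as &amp; back into the characters they stand for
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

The cleaned string would then be passed to whichever field holds the searchable body text when the document is indexed.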

Custom scoring algorithms

Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.

More information can be found on this at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.

Custom storage method

You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.

It may be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.

More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.

Conclusion

In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.

We also looked briefly at some ways of extending the search capabilities.

Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.


