django-sphinx search on different models and m2m fields

How to search on different models and m2m fields using sphinx.

Often we need a global search, that is, we want a high speed search on different models supporting pagination and relevance.
For this purpose we need a full-text search engine; there are a lot of good software out there but in this article we'll use sphinx.

Background

Say that we are developing a website which contains books and articles; a book can have multiple authors as well as an article.
Say that we want books, articles and authors as result of one global search.ΒΈ

Models

class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)
    about_me = models.TextField()

    def __unicode__(self):
        return u'%s %s' % (self.first_name, self.last_name)

class Book(models.Model):
    title = models.CharField(max_length=200)
    authors = models.ManyToManyField(Author)
    description = models.TextField()

    def __unicode__(self):
        return self.title

class Article(models.Model):
    title = models.CharField(max_length=200)
    authors = models.ManyToManyField(Author)
    excerpt = models.TextField()

    def __unicode__(self):
        return self.title

Initial steps

A few steps before configuring django-sphinx and our search:

Sphinx configuration

We have to create a simple config file that sphinx will use to index our models.
But how can we do that if we have different tables for different models?

The answer is simple, from sphinx point of view, the data it indexes is a set of structured documents, each of which has the same set of fields, sphinx doesn't care if you use a table, a view or an xml.

So the point is to create different selects (3 in our case) with the same number of columns.
Our virtual table will have a title, a content and a list of authors (which will be empty for the author of course).

As I'm lazy, I'll use the django-sphinx config generation.
So open the terminal, cd to your project folder and run the django shell:

python ./manage.py shell
(InteractiveConsole)
>>> import djangosphinx
>>> from reference.models import *
>>> output1 = djangosphinx.generate_config_for_model(Book)
>>> output2 = djangosphinx.generate_config_for_model(Article)
>>> output3 = djangosphinx.generate_config_for_model(Author)
>>> 
>>> print output1
source reference_book
{
    type                = mysql
    strip_html          = 0
    index_html_attrs    =
    sql_host            = 
    sql_user            = root
    sql_pass            = 
    sql_db              = myproject
    sql_port            = 
    log                 = /var/log/sphinx/searchd.log

    sql_query_pre       =
    sql_query_post      =
    sql_query           = \
        SELECT id, title, description \
        FROM reference_book
    sql_query_info      = SELECT * FROM `reference_book` WHERE `id` = $id
}

index reference_book
{
    source          = reference_book
    path            = /var/data/reference_book
    docinfo         = extern
    morphology      = none
    stopwords       =
    min_word_len    = 2
    charset_type    = sbcs
    min_prefix_len  = 0
    min_infix_len   = 0
}
>>> 
>>> print output2
source reference_article
{
    type                = mysql
    strip_html          = 0
    index_html_attrs    =
    sql_host            = 
    sql_user            = root
    sql_pass            = 
    sql_db              = myproject
    sql_port            = 
    log                 = /var/log/sphinx/searchd.log

    sql_query_pre       =
    sql_query_post      =
    sql_query           = \
        SELECT id, title, excerpt \
        FROM reference_article
    sql_query_info      = SELECT * FROM `reference_article` WHERE `id` = $id
}

index reference_article
{
    source          = reference_article
    path            = /var/data/reference_article
    docinfo         = extern
    morphology      = none
    stopwords       =
    min_word_len    = 2
    charset_type    = sbcs
    min_prefix_len  = 0
    min_infix_len   = 0
}
>>> 
>>> print output3
source reference_author
{
    type                = mysql
    strip_html          = 0
    index_html_attrs    =
    sql_host            = 
    sql_user            = root
    sql_pass            = 
    sql_db              = myproject
    sql_port            = 
    log                 = /var/log/sphinx/searchd.log

    sql_query_pre       =
    sql_query_post      =
    sql_query           = \
        SELECT id, first_name, last_name, about_me \
        FROM reference_author
    sql_query_info      = SELECT * FROM `reference_author` WHERE `id` = $id
}

index reference_author
{
    source          = reference_author
    path            = /var/data/reference_author
    docinfo         = extern
    morphology      = none
    stopwords       =
    min_word_len    = 2
    charset_type    = sbcs
    min_prefix_len  = 0
    min_infix_len   = 0
}

Save the output in a file (sphinx.conf in my case).
djangosphinx.generate_config_for_model generates a very basic configuration that we will change in a while.
Shortly, source is your data source whilst index tells sphinx how to index the document (our model).

All we need to do is change the selects in order to simulate a generic table.

Sphinx wants a unique unsigned non-zero integer document id as first column to index the rows; it can't be our model id, of course, because it would likely be the same for a book, an article or an author.
To sort out this problem I'll use a concatenation of model_id and content_type_id.

Note that it might create problems if your model ids don't have the same number of digits.

For example, if Book and Article have 10 and 110 as content_type_id and 11 and 1 as model_id you will have two different models with the same sphinx_id (CONCAT(11, 10) = 1110, CONCAT(1, 110) = 1110).

If this is your case, you will need to generate the sphinx_id in a different way.

Besides that, we have to include the content type of our model in order to let django-sphinx get the right one.

Now let's take a look at our selects.

Book

select cast(concat(a.id, c.id) as unsigned) as sphinx_id, a.id id, a.title as name, a.description as content, GROUP_CONCAT(distinct b.id) as author_ids, c.id as content_type
from reference_book as a inner join reference_book_authors as b on b.book_id = a.id, django_content_type as c
where c.model = 'book' and c.app_label = 'reference'
group by a.id

Article

select cast(concat(a.id, c.id) as unsigned) as sphinx_id, a.id id, a.title as name, a.excerpt as content, GROUP_CONCAT(distinct b.id) as author_ids, c.id as content_type
from reference_article as a inner join reference_article_authors as b on b.article_id = a.id, django_content_type as c
where c.model = 'article' and c.app_label = 'reference'
group by a.id

Author

select cast(concat(a.id, b.id) as unsigned) as sphinx_id, a.id id, CONCAT(a.first_name, " ", a.last_name) as name, a.about_me as content, 0 as author_ids, b.id as content_type
from reference_author as a, django_content_type as b
where b.model = 'author' and b.app_label = 'reference'

Our virtual table will have the following columns: sphinx_id, id, name, content, author_ids and content_type.
Note that we use GROUP_CONCAT to get the list of ids from another table using an inner join.

Next step is to change the sphinx.conf propertly.

reference_book

source reference_book
{
    type                = mysql
    sql_host            = 127.0.0.1
    sql_user            = root
    sql_pass            = 
    sql_db              = myproject
    sql_port            = 

    sql_query_pre       =
    sql_query_post      =
    sql_query           = \
        select cast(concat(a.id, c.id) as unsigned) as sphinx_id, a.id id, a.title as name, a.description as content, GROUP_CONCAT(distinct b.id) as author_ids, c.id as content_type \
        from reference_book as a inner join reference_book_authors as b on b.book_id = a.id, django_content_type as c \
        where c.model = 'book' and c.app_label = 'reference' \
        group by a.id
    sql_query_info      = \
        select cast(concat(a.id, c.id) as unsigned) as sphinx_id, a.id id, a.title as name, a.description as content, GROUP_CONCAT(distinct b.id) as author_ids, c.id as content_type \
        from reference_book as a inner join reference_book_authors as b on b.book_id = a.id, django_content_type as c \
        where c.model = 'book' and c.app_label = 'reference' and `id` = $id \
        group by a.id

    sql_attr_multi      = uint author_ids from field
    sql_attr_uint       = id
    sql_attr_uint       = content_type
}

index reference_book
{
    source          = reference_book
    path            = /var/data/reference_book
    docinfo         = extern
    morphology      = none
    stopwords       =
    min_word_len    = 2
    enable_star     = 1
    charset_type    = sbcs
    min_prefix_len  = 2
    min_infix_len   = 0
}

reference_article

source reference_article
{
    type                = mysql
    sql_host            = 127.0.0.1
    sql_user            = root
    sql_pass            = 
    sql_db              = myproject
    sql_port            = 

    sql_query_pre       =
    sql_query_post      =
    sql_query           = \
        select cast(concat(a.id, c.id) as unsigned) as sphinx_id, a.id id, a.title as name, a.excerpt as content, GROUP_CONCAT(distinct b.id) as author_ids, c.id as content_type \
        from reference_article as a inner join reference_article_authors as b on b.article_id = a.id, django_content_type as c \
        where c.model = 'article' and c.app_label = 'reference' \
        group by a.id
    sql_query_info      = \
        select cast(concat(a.id, c.id) as unsigned) as sphinx_id, a.id id, a.title as name, a.excerpt as content, GROUP_CONCAT(distinct b.id) as author_ids, c.id as content_type \
        from reference_article as a inner join reference_article_authors as b on b.article_id = a.id, django_content_type as c \
        where c.model = 'article' and c.app_label = 'reference' and `id` = $id \
        group by a.id

    sql_attr_multi      = uint author_ids from field
    sql_attr_uint       = id
    sql_attr_uint       = content_type
}

index reference_article
{
    source          = reference_article
    path            = /var/data/reference_article
    docinfo         = extern
    morphology      = none
    stopwords       =
    min_word_len    = 2
    enable_star     = 1
    charset_type    = sbcs
    min_prefix_len  = 2
    min_infix_len   = 0
}

reference_author

source reference_author
{
    type                = mysql
    sql_host            = 127.0.0.1
    sql_user            = root
    sql_pass            = 
    sql_db              = myproject
    sql_port            = 

    sql_query_pre       =
    sql_query_post      =
    sql_query           = \
        select cast(concat(a.id, b.id) as unsigned) as sphinx_id, a.id id, CONCAT(a.first_name, " ", a.last_name) as name, a.about_me as content, 0 as author_ids, b.id as content_type \
        from reference_author as a, django_content_type as b \
        where b.model = 'author' and b.app_label = 'reference'
    sql_query_info      = \
        select cast(concat(a.id, b.id) as unsigned) as sphinx_id, a.id id, CONCAT(a.first_name, " ", a.last_name) as name, a.about_me as content, 0 as author_ids, b.id as content_type \
        from reference_author as a, django_content_type as b \
        where b.model = 'author' and b.app_label = 'reference' and `id` = $id

    sql_attr_multi      = uint author_ids from field
    sql_attr_uint       = id
    sql_attr_uint       = content_type
}

index reference_author
{
    source          = reference_author
    path            = /var/data/reference_author
    docinfo         = extern
    morphology      = none
    stopwords       =
    min_word_len    = 2
    enable_star     = 1
    charset_type    = sbcs
    min_prefix_len  = 2
    min_infix_len   = 0
}

A few changes here, I've removed some lines as they were deprecated and added 3 more lines for each source:

sql_attr_multi      = uint author_ids from field
sql_attr_uint       = id
sql_attr_uint       = content_type

The first one tells sphinx that author_ids is a multi value attribute, that is, a list of attribute ids.
The last two define sphinx attributes which allow us to filter the result.

Besides, in order to support a LIKE search I've added enable_star and set min_prefix_len to 2.

Indexing

Next step is to index our models.

This is the easier part as we just have to run the indexer with the option --all (index all).
Note that if you have searchd running, you might need to either kill the process or run the indexer with the option --rotate.

$ sudo indexer --config sphinx.conf --all
Sphinx 0.9.8-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file 'sphinx.conf'...
indexing index 'reference_book'...
collected 2 docs, 0.0 MB
collected 0 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 0.0 Mhits, 100.0% done
total 2 docs, 491 bytes
total 0.010 sec, 49100.00 bytes/sec, 200.00 docs/sec
indexing index 'reference_article'...
collected 2 docs, 0.0 MB
collected 0 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 0.0 Mhits, 100.0% done
total 2 docs, 1061 bytes
total 0.014 sec, 77354.91 bytes/sec, 145.82 docs/sec
indexing index 'reference_author'...
collected 4 docs, 0.0 MB
collected 0 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 0.0 Mhits, 100.0% done
total 4 docs, 413 bytes
total 0.010 sec, 41300.00 bytes/sec, 400.00 docs/sec

Searching

Let's run the sphinx searchd and try out our configuration.

Add these lines to your sphinx.conf changing your settings properly.

searchd
{
    # IP address to bind on
    # optional, default is 0.0.0.0 (ie. listen on all interfaces)
    #
    # address               = 127.0.0.1
    # address               = 192.168.0.1


    # searchd TCP port number
    # mandatory, default is 3312
    port                = 3312

    # log file, searchd run info is logged here
    # optional, default is 'searchd.log'
    log                 = /var/log/sphinx/searchd.log

    # query log file, all search queries are logged here
    # optional, default is empty (do not log queries)
    query_log           = /var/log/sphinx/query.log

    # client read timeout, seconds
    # optional, default is 5
    read_timeout        = 5

    # maximum amount of children to fork (concurrent searches to run)
    # optional, default is 0 (unlimited)
    max_children        = 30

    # PID file, searchd process ID file name
    # mandatory
    pid_file            = /var/log/sphinx/searchd.pid

    # max amount of matches the daemon ever keeps in RAM, per-index
    # WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
    # default is 1000 (just like Google)
    max_matches         = 1000

    # seamless rotate, prevents rotate stalls if precaching huge datasets
    # optional, default is 1
    seamless_rotate     = 1

    # whether to forcibly preopen all indexes on startup
    # optional, default is 0 (do not preopen)
    preopen_indexes     = 0

    # whether to unlink .old index copies on succesful rotation.
    # optional, default is 1 (do unlink)
    unlink_old          = 1
}

run searchd as root:

$ sudo searchd --config sphinx.conf 
Sphinx 0.9.8-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file 'sphinx.conf'...

and open the python shell:


>>> import djangosphinx
>>> from djangosphinx.apis.current import SPH_MATCH_EXTENDED
>>> qs = djangosphinx.SphinxQuerySet(index='reference_book reference_article reference_author', mode=SPH_MATCH_EXTENDED)
>>> list(qs.query('dj*'))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/uber/workspaces/django/myproject/local_apps/djangosphinx/manager.py", line 236, in __iter__
    return iter(self._get_data())
  File "/Users/uber/workspaces/django/myproject/local_apps/djangosphinx/manager.py", line 397, in _get_data
    self._result_cache = list(self._get_results())
  File "/Users/uber/workspaces/django/myproject/local_apps/djangosphinx/manager.py", line 587, in _get_results
    objcache[ct][', '.join([unicode(p) for p in o.pks])] = o
AttributeError: 'Author' object has no attribute 'pks'

There is an annoying bug in django-sphinx when using multiple models, take a look at my ticket for the patch.

UPDATE: After a chat with David Cramer the bug has been fixed (thanks David).

Let's try now.

>>> import djangosphinx
>>> from djangosphinx.apis.current import SPH_MATCH_EXTENDED
>>> 
>>> qs = djangosphinx.SphinxQuerySet(index='reference_book reference_article reference_author', mode=SPH_MATCH_EXTENDED)
>>> 
>>> list(qs.query('dj*'))
[<Book: Practical Django Projects>, <Author: James Bennett>, <Article: So you want a dynamic form>]


>>> from reference.models import *
>>> 
>>> Book.objects.all()
[<Book: Practical Django Projects>, <Book: Python Cookbook>]
>>> 
>>> Article.objects.all()
[<Article: Design tools for the open web: reflections on the fixoutlook campaign>, <Article: So you want a dynamic form>]
>>> 
>>> Author.objects.all()
[<Author: James Bennett>, <Author: Alex Martelli>, <Author: Anna Ravenscroft>, <Author: David Ascher>]


>>> james_bennett = Author.objects.get(id=1)
>>> james_bennett
<Author: James Bennett>
>>> 
>>> james_bennett.book_set.all()
[<Book: Practical Django Projects>]
>>> 
>>> james_bennett.article_set.all()
[<Article: So you want a dynamic form>]

>>> list(qs.query('so*'))
[<Article: So you want a dynamic form>, <Author: James Bennett>, <Author: Alex Martelli>, <Book: Practical Django Projects>, <Article: Design tools for the open web: reflections on the fixoutlook campaign>, <Book: Python Cookbook>]
>>> 
>>> list(qs.query('so*').filter(author_ids__in=[1]))
[<Article: So you want a dynamic form>, <Book: Practical Django Projects>]

As you can see you can now search on different models with a single query, filter and order_by by attribute.

In conclusion

A single sphinx configuration file and a few lines of code is all you need to use a global search in your project.

31 Aug. 2009