debian-mirror-gitlab/elasticsearch-model/README.md
2019-12-04 22:21:29 +05:30

Elasticsearch::Model

The elasticsearch-model library builds on top of the elasticsearch library.

It aims to simplify integration of Ruby classes ("models"), commonly found e.g. in Ruby on Rails applications, with the Elasticsearch search and analytics engine.

The library is compatible with Ruby 1.9.3 and higher.

Installation

Install the package from Rubygems:

gem install elasticsearch-model

To use an unreleased version, either add it to your Gemfile for Bundler:

gem 'elasticsearch-model', git: 'https://github.com/elasticsearch/elasticsearch-rails.git'

or install it from a source code checkout:

git clone https://github.com/elasticsearch/elasticsearch-rails.git
cd elasticsearch-rails/elasticsearch-model
bundle install
rake install

Usage

Let's suppose you have an Article model:

require 'active_record'
ActiveRecord::Base.establish_connection( adapter: 'sqlite3', database: ":memory:" )
ActiveRecord::Schema.define(version: 1) { create_table(:articles) { |t| t.string :title } }

class Article < ActiveRecord::Base; end

Article.create title: 'Quick brown fox'
Article.create title: 'Fast black dogs'
Article.create title: 'Swift green frogs'

Setup

To add the Elasticsearch integration for this model, require elasticsearch/model and include the main module in your class:

require 'elasticsearch/model'

class Article < ActiveRecord::Base
  include Elasticsearch::Model
end

This will extend the model with functionality related to Elasticsearch.

Feature Extraction Pattern

Instead of including the Elasticsearch::Model module directly in your model, you can include it in a "concern" or "trait" module, a common pattern in Rails applications, using e.g. ActiveSupport::Concern as the instrumentation:

# In: app/models/concerns/searchable.rb
#
module Searchable
  extend ActiveSupport::Concern

  included do
    include Elasticsearch::Model

    mapping do
      # ...
    end

    def self.search(query)
      # ...
    end
  end
end

# In: app/models/article.rb
#
class Article
  include Searchable
end

The __elasticsearch__ Proxy

The Elasticsearch::Model module contains a large number of class and instance methods that provide its functionality. To avoid polluting your model namespace, this functionality is primarily available via the __elasticsearch__ class and instance level proxy methods; see the Elasticsearch::Model::Proxy class documentation for technical information.

The module will include important methods, such as search, into the class or module only when they haven't been defined already. The following two calls are thus functionally equivalent:

Article.__elasticsearch__.search 'fox'
Article.search 'fox'

See the Elasticsearch::Model module documentation for technical information.
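
The "include only when not already defined" behaviour can be pictured with a few lines of plain Ruby. This is a minimal sketch, not the library's source: SearchProxy, ClassProxy and __search_proxy__ are hypothetical names chosen for the illustration.

```ruby
# Hypothetical stand-in for the proxy idea; not the library's implementation.
module SearchProxy
  # Proxy object that holds all search-related behaviour.
  class ClassProxy
    def initialize(target)
      @target = target
    end

    def search(query)
      "proxy search for #{query.inspect} on #{@target.name}"
    end
  end

  def self.included(base)
    # The proxy is always reachable under a name that is unlikely to clash.
    base.define_singleton_method(:__search_proxy__) do
      @__search_proxy__ ||= ClassProxy.new(self)
    end

    # Add the convenience method only when the class has none of its own.
    unless base.respond_to?(:search)
      base.define_singleton_method(:search) do |query|
        __search_proxy__.search(query)
      end
    end
  end
end

class Plain
  include SearchProxy
end

class Custom
  def self.search(query)
    "custom search for #{query.inspect}"
  end
  include SearchProxy
end

puts Plain.search('fox')                    # delegated to the proxy
puts Custom.search('fox')                   # the class's own method is kept
puts Custom.__search_proxy__.search('fox')  # the proxy stays available
```

A class that already has its own search method keeps it, and the proxy remains the reliable way to reach the search functionality.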

The Elasticsearch client

The module will set up a client, connected to localhost:9200, by default. You can access and use it like any other Elasticsearch::Client:

Article.__elasticsearch__.client.cluster.health
# => { "cluster_name"=>"elasticsearch", "status"=>"yellow", ... }

To use a client with different configuration, just set up a client for the model:

Article.__elasticsearch__.client = Elasticsearch::Client.new host: 'api.server.org'

Or configure the client for all models:

Elasticsearch::Model.client = Elasticsearch::Client.new log: true

You might want to do this during your application bootstrap process, e.g. in a Rails initializer.

Please refer to the elasticsearch-transport library documentation for all the configuration options, and to the elasticsearch-api library documentation for information about the Ruby client API.

Importing the data

The first thing you'll want to do is import your data into the index:

Article.import
# => 0

It's possible to import only records from a specific scope or query, transform the batch with the transform and preprocess options, or re-create the index by deleting it and creating it with correct mapping with the force option -- look for examples in the method documentation.
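
To give a feel for what a transform might do, here is a plain-Ruby sketch. It assumes that a transform is a callable which turns one record into one bulk "index" action; that shape is my reading of the option, not something stated above, so check the method documentation before relying on it. Record and TO_BULK_ACTION are hypothetical names.

```ruby
# Plain-Ruby stand-in for a model record; `Record` is a hypothetical name.
Record = Struct.new(:id, :title) do
  def as_indexed_json
    { 'title' => title }
  end
end

# Assumed shape: one record in, one bulk "index" action out.
TO_BULK_ACTION = lambda do |record|
  { index: { _id: record.id, data: record.as_indexed_json } }
end

batch = [Record.new(1, 'Quick brown fox'), Record.new(2, 'Fast black dogs')]

# Transform the whole batch and print each resulting action.
batch.map(&TO_BULK_ACTION).each { |action| p action }
```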

No errors were reported during importing, so... let's search the index!

Searching

For starters, we can try the "simple" type of search:

response = Article.search 'fox dogs'

response.took
# => 3

response.results.total
# => 2

response.results.first._score
# => 0.02250402

response.results.first._source.title
# => "Quick brown fox"

Search results

The returned response object is a rich wrapper around the JSON returned from Elasticsearch, providing access to response metadata and the actual results ("hits").

Each "hit" is wrapped in the Result class, and provides method access to its properties via Hashie::Mash.

The results object supports the Enumerable interface:

response.results.map { |r| r._source.title }
# => ["Quick brown fox", "Fast black dogs"]

response.results.select { |r| r.title =~ /^Q/ }
# => [#<Elasticsearch::Model::Response::Result:0x007 ... "_source"=>{"title"=>"Quick brown fox"}}>]

In fact, the response object will delegate Enumerable methods to results:

response.any? { |r| r.title =~ /fox|dog/ }
# => true

To use Array's methods (including any ActiveSupport extensions), just call to_a on the object:

response.to_a.last.title
# => "Fast black dogs"
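
The method-style access to hit properties, including the shortcut from r.title to r._source.title used above, can be imitated in plain Ruby. The following is only a sketch of the idea; HitSketch is a hypothetical class and does not reproduce the Result class or Hashie::Mash.

```ruby
# Hypothetical stand-in for a search hit; not the library's Result class.
class HitSketch
  def initialize(attributes)
    @hit = attributes
  end

  # Look the name up on the hit itself first, then inside "_source".
  def method_missing(name, *args)
    key = name.to_s
    return wrap(@hit[key]) if @hit.key?(key)
    source = @hit['_source'] || {}
    return wrap(source[key]) if source.key?(key)
    super
  end

  def respond_to_missing?(name, include_private = false)
    key = name.to_s
    @hit.key?(key) || (@hit['_source'] || {}).key?(key) || super
  end

  private

  # Nested hashes become navigable objects as well.
  def wrap(value)
    value.is_a?(Hash) ? HitSketch.new(value) : value
  end
end

hit = HitSketch.new('_id' => '1', '_score' => 0.5,
                    '_source' => { 'title' => 'Quick brown fox' })
puts hit._score
puts hit._source.title
puts hit.title
```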

Search results as database records

Instead of returning documents from Elasticsearch, the records method will return a collection of model instances, fetched from the primary database, ordered by score:

response.records.to_a
# Article Load (0.3ms)  SELECT "articles".* FROM "articles" WHERE "articles"."id" IN (1, 2)
# => [#<Article id: 1, title: "Quick brown fox">, #<Article id: 2, title: "Fast black dogs">]

The returned object is the genuine collection of model instances returned by your database, i.e. ActiveRecord::Relation for ActiveRecord, or Mongoid::Criteria in case of MongoDB.

This allows you to chain other methods on top of search results, as you would normally do:

response.records.where(title: 'Quick brown fox').to_a
# Article Load (0.2ms)  SELECT "articles".* FROM "articles" WHERE "articles"."id" IN (1, 2) AND "articles"."title" = 'Quick brown fox'
# => [#<Article id: 1, title: "Quick brown fox">]

response.records.records.class
# => ActiveRecord::Relation::ActiveRecord_Relation_Article

The ordering of the records by score will be preserved, unless you explicitly specify a different order in your model query language:

response.records.order(:title).to_a
# Article Load (0.2ms)  SELECT "articles".* FROM "articles" WHERE "articles"."id" IN (1, 2) ORDER BY "articles".title ASC
# => [#<Article id: 2, title: "Fast black dogs">, #<Article id: 1, title: "Quick brown fox">]

The records method returns the real instances of your model, which is useful when you want to access your model methods -- at the expense of slowing down your application, of course. In most cases, working with results coming from Elasticsearch is sufficient, and much faster. See the elasticsearch-rails library for more information about compatibility with the Ruby on Rails framework.

When you want to access both the database records and search results, use the each_with_hit (or map_with_hit) iterator:

response.records.each_with_hit { |record, hit| puts "* #{record.title}: #{hit._score}" }
# * Quick brown fox: 0.02250402
# * Fast black dogs: 0.02250402
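
The relationship between score order and database order can be pictured without a database. In this sketch, rows come back in primary-key order, are re-sorted to follow the hits, and are then paired with them. Hit, Row and order_by_hits are hypothetical names, and the approach shown is an illustration rather than the library's implementation.

```ruby
Hit = Struct.new(:id, :score)
Row = Struct.new(:id, :title)

# Sort rows by the position of their ID among the score-ordered hits.
def order_by_hits(rows, hits)
  hit_ids = hits.map(&:id)
  rows.sort_by { |row| hit_ids.index(row.id.to_s) }
end

hits = [Hit.new('2', 0.9), Hit.new('1', 0.4)]                          # by score
rows = [Row.new(1, 'Quick brown fox'), Row.new(2, 'Fast black dogs')]  # by id

# Pair each record with its hit, in the spirit of each_with_hit.
order_by_hits(rows, hits).zip(hits).each do |row, hit|
  puts "* #{row.title}: #{hit.score}"
end
```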

Searching multiple models

It is possible to search across multiple models with the module method:

Elasticsearch::Model.search('fox', [Article, Comment]).results.to_a.map(&:to_hash)
# => [
#      {"_index"=>"articles", "_type"=>"article", "_id"=>"1", "_score"=>0.35136628, "_source"=>...},
#      {"_index"=>"comments", "_type"=>"comment", "_id"=>"1", "_score"=>0.35136628, "_source"=>...}
#    ]

Elasticsearch::Model.search('fox', [Article, Comment]).records.to_a
# Article Load (0.3ms)  SELECT "articles".* FROM "articles" WHERE "articles"."id" IN (1)
# Comment Load (0.2ms)  SELECT "comments".* FROM "comments" WHERE "comments"."id" IN (1,5)
# => [#<Article id: 1, title: "Quick brown fox">, #<Comment id: 1, body: "Fox News">,  ...]

By default, all models which include the Elasticsearch::Model module are searched.

NOTE: It is not possible to chain other methods on top of the records object, since it is a heterogeneous collection, with models potentially backed by different databases.
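
Why the mixed collection cannot be a single chainable relation becomes clearer when the loading is spelled out: each model type needs its own lookup, and the results are merged afterwards. The sketch below uses in-memory hashes in place of databases; load_in_hit_order and the tables argument are hypothetical and only illustrate the idea.

```ruby
# Load mixed records one type at a time, then restore the order of the hits.
# `tables` is a hypothetical stand-in for per-model database lookups.
def load_in_hit_order(hits, tables)
  loaded = {}
  hits.group_by { |hit| hit['_type'] }.each do |type, group|
    ids = group.map { |hit| hit['_id'] }
    # One lookup per type, mirroring one query per model.
    records = tables.fetch(type).values_at(*ids)
    ids.zip(records).each { |id, record| loaded[[type, id]] = record }
  end
  hits.map { |hit| loaded.fetch([hit['_type'], hit['_id']]) }
end

tables = {
  'article' => { '1' => 'Article 1' },
  'comment' => { '1' => 'Comment 1', '5' => 'Comment 5' }
}
hits = [
  { '_type' => 'comment', '_id' => '5' },
  { '_type' => 'article', '_id' => '1' },
  { '_type' => 'comment', '_id' => '1' }
]

puts load_in_hit_order(hits, tables)
```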

Pagination

You can implement pagination with the from and size search parameters. However, search results can be automatically paginated with the kaminari or will_paginate gems. (The pagination gems must be added before the Elasticsearch gems in your Gemfile, or loaded first in your application.)
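
For the manual route, the only arithmetic involved is turning a page number into an offset. A minimal sketch, where page_params is a hypothetical helper and not part of the library:

```ruby
# Translate a 1-based page number into from/size search parameters.
# `page_params` is a hypothetical helper, not part of the library.
def page_params(page, per_page = 10)
  page = [page.to_i, 1].max   # treat nil, 0 and negative values as page 1
  { from: (page - 1) * per_page, size: per_page }
end

p page_params(1)       # offset 0, ten hits
p page_params(3, 25)   # offset 50, i.e. hits 51 to 75
p page_params(nil)     # falls back to the first page
```

The resulting hash could be merged into the search definition passed to the search method.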

If Kaminari or WillPaginate is loaded, use the familiar paging methods:

response.page(2).results
response.page(2).records

In a Rails controller, use the params[:page] parameter to paginate through results:

@articles = Article.search(params[:q]).page(params[:page]).records

@articles.current_page
# => 2
@articles.next_page
# => 3

To initialize and include the Kaminari pagination support manually:

Kaminari::Hooks.init
Elasticsearch::Model::Response::Response.__send__ :include, Elasticsearch::Model::Response::Pagination::Kaminari

The Elasticsearch DSL

In most situations, you'll want to pass the search definition in the Elasticsearch domain-specific language to the client:

response = Article.search query:     { match:  { title: "Fox Dogs" } },
                          highlight: { fields: { title: {} } }

response.results.first.highlight.title
# => ["Quick brown <em>fox</em>"]

You can pass any object which implements a to_hash method, or you can use your favourite JSON builder to build the search definition as a JSON string:

require 'jbuilder'

query = Jbuilder.encode do |json|
  json.query do
    json.match do
      json.title do
        json.query "fox dogs"
      end
    end
  end
end

response = Article.search query
response.results.first.title
# => "Quick brown fox"
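
The other option mentioned above, an object that implements to_hash, could look like the following sketch. TitleMatch is a hypothetical class invented for this example; only the to_hash contract comes from the text.

```ruby
require 'json'

# Hypothetical query object; any class with a to_hash method would do.
class TitleMatch
  def initialize(text)
    @text = text
  end

  # The search definition as a plain Ruby hash.
  def to_hash
    { query: { match: { title: @text } } }
  end
end

definition = TitleMatch.new('fox dogs')
puts JSON.generate(definition.to_hash)
# An instance could then be handed to the model: Article.search(definition)
```

Wrapping a query in a small class like this keeps the search definition testable without a running cluster.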

Index Configuration

Good search results often depend on a properly configured index. The Elasticsearch::Model integration provides class methods to set up index settings and mappings.

class Article
  settings index: { number_of_shards: 1 } do
    mappings dynamic: 'false' do
      indexes :title, analyzer: 'english', index_options: 'offsets'
    end
  end
end

Article.mappings.to_hash
# => {
#      :article => {
#        :dynamic => "false",
#        :properties => {
#          :title => {
#            :type          => "string",
#            :analyzer      => "english",
#            :index_options => "offsets"
#          }
#        }
#      }
#    }

Article.settings.to_hash
# => { :index => { :number_of_shards => 1 } }

You can use the defined settings and mappings to create an index with desired configuration:

Article.__elasticsearch__.client.indices.delete index: Article.index_name rescue nil
Article.__elasticsearch__.client.indices.create \
  index: Article.index_name,
  body: { settings: Article.settings.to_hash, mappings: Article.mappings.to_hash }

There's a shortcut available for this common operation (convenient e.g. in tests):

Article.__elasticsearch__.create_index! force: true
Article.__elasticsearch__.refresh_index!

By default, the index name and document type are inferred from your class name. You can, however, set them explicitly:

class Article
  index_name    "articles-#{Rails.env}"
  document_type "post"
end

Updating the Documents in the Index

Usually, we need to update the Elasticsearch index when records in the database are created, updated or deleted; use the index_document, update_document and delete_document methods, respectively:

Article.first.__elasticsearch__.index_document
# => {"ok"=>true, ... "_version"=>2}

Automatic Callbacks

You can automatically update the index whenever the record changes, by including the Elasticsearch::Model::Callbacks module in your model:

class Article
  include Elasticsearch::Model
  include Elasticsearch::Model::Callbacks
end

Article.first.update_attribute :title, 'Updated!'

Article.search('*').map { |r| r.title }
# => ["Updated!", "Swift green frogs", "Fast black dogs"]

The automatic callback on record update keeps track of changes in your model (via ActiveModel::Dirty-compliant implementation), and performs a partial update when this support is available.
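
The idea behind a partial update is that only the attributes which actually changed are sent to the index. The sketch below imitates that with a tiny change tracker; TrackedRecord, write and partial_document are hypothetical names, and the class is a simplification rather than ActiveModel::Dirty or the library's update logic.

```ruby
# Hypothetical record with minimal change tracking; not ActiveModel::Dirty itself.
class TrackedRecord
  attr_reader :attributes, :changes

  def initialize(attributes)
    @attributes = attributes
    @changes = {}
  end

  def write(name, value)
    return if @attributes[name] == value
    # Keep the original value from the first change, stored as [old, new].
    original = @changes.key?(name) ? @changes[name].first : @attributes[name]
    @changes[name] = [original, value]
    @attributes[name] = value
  end

  # Only the changed attributes, with their new values, form the partial document.
  def partial_document
    @changes.map { |name, (_old, new_value)| [name, new_value] }.to_h
  end
end

record = TrackedRecord.new('title' => 'Quick brown fox', 'views' => 10)
record.write('title', 'Updated!')
p record.partial_document   # contains the title only, not the views
```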

The automatic callbacks are implemented in database adapters coming with Elasticsearch::Model. You can easily implement your own adapter: please see the relevant chapter below.

Custom Callbacks

If you need more control over the indexing process, you can implement these callbacks yourself by hooking into the after_create, after_save, after_update or after_destroy operations:

class Article
  include Elasticsearch::Model

  after_save    { logger.debug ["Updating document... ", index_document ].join }
  after_destroy { logger.debug ["Deleting document... ", delete_document].join }
end

For ActiveRecord-based models, use the after_commit callback to protect your data against inconsistencies caused by transaction rollbacks:

class Article < ActiveRecord::Base
  include Elasticsearch::Model

  after_commit on: [:create] do
    __elasticsearch__.index_document if self.published?
  end

  after_commit on: [:update] do
    __elasticsearch__.update_document if self.published?
  end

  after_commit on: [:destroy] do
    __elasticsearch__.delete_document if self.published?
  end
end

Asynchronous Callbacks

Of course, you're still performing an HTTP request during your database transaction, which is not optimal for large-scale applications. A better option is to process the index operations in the background, with a tool like Resque or Sidekiq:

class Article
  include Elasticsearch::Model

  after_save    { Indexer.perform_async(:index,  self.id) }
  after_destroy { Indexer.perform_async(:delete, self.id) }
end

An example implementation of the Indexer worker class could look like this:

class Indexer
  include Sidekiq::Worker
  sidekiq_options queue: 'elasticsearch', retry: false

  Logger = Sidekiq.logger.level == Logger::DEBUG ? Sidekiq.logger : nil
  Client = Elasticsearch::Client.new host: 'localhost:9200', logger: Logger

  def perform(operation, record_id)
    logger.debug [operation, "ID: #{record_id}"]

    case operation.to_s
      when /index/
        record = Article.find(record_id)
        Client.index  index: 'articles', type: 'article', id: record.id, body: record.as_indexed_json
      when /delete/
        Client.delete index: 'articles', type: 'article', id: record_id
      else raise ArgumentError, "Unknown operation '#{operation}'"
    end
  end
end

Start the Sidekiq workers with bundle exec sidekiq --queue elasticsearch --verbose and update a model:

Article.first.update_attribute :title, 'Updated'

You'll see the job being processed in the console where you started the Sidekiq worker:

Indexer JID-eb7e2daf389a1e5e83697128 DEBUG: ["index", "ID: 7"]
Indexer JID-eb7e2daf389a1e5e83697128 INFO: PUT http://localhost:9200/articles/article/1 [status:200, request:0.004s, query:n/a]
Indexer JID-eb7e2daf389a1e5e83697128 DEBUG: > {"id":1,"title":"Updated", ...}
Indexer JID-eb7e2daf389a1e5e83697128 DEBUG: < {"ok":true,"_index":"articles","_type":"article","_id":"1","_version":6}
Indexer JID-eb7e2daf389a1e5e83697128 INFO: done: 0.006 sec
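
The division of labour between the callback (enqueue only) and the worker (do the actual work) can be tried out without Sidekiq, using Ruby's built-in Queue and a thread. This is a toy sketch: process_in_background is a hypothetical helper, and the strings it collects stand in for the HTTP requests a real worker would make.

```ruby
# Run the given jobs on a worker thread and return what it processed.
# `process_in_background` is a hypothetical helper for illustration only.
def process_in_background(job_list)
  jobs = Queue.new
  processed = []

  worker = Thread.new do
    # Pop jobs until the :stop marker arrives.
    while (job = jobs.pop) != :stop
      operation, record_id = job
      case operation.to_s
      when /index/  then processed << "index #{record_id}"
      when /delete/ then processed << "delete #{record_id}"
      else raise ArgumentError, "Unknown operation '#{operation}'"
      end
    end
  end

  # The "callbacks" only enqueue; the caller does no indexing work itself.
  job_list.each { |job| jobs << job }
  jobs << :stop
  worker.join
  processed
end

puts process_in_background([[:index, 1], [:delete, 2]])
```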

Model Serialization

By default, the model instance will be serialized to JSON using the as_indexed_json method, which is defined automatically by the Elasticsearch::Model::Serializing module:

Article.first.__elasticsearch__.as_indexed_json
# => {"id"=>1, "title"=>"Quick brown fox"}

If you want to customize the serialization, just implement the as_indexed_json method yourself, for instance with the as_json method:

class Article
  include Elasticsearch::Model

  def as_indexed_json(options={})
    as_json(only: 'title')
  end
end

Article.first.as_indexed_json
# => {"title"=>"Quick brown fox"}

The re-defined method will be used in the indexing methods, such as index_document.

Please note that in Rails 3, you need to either set include_root_in_json: false, or prevent adding the "root" in the JSON representation with other means.

Relationships and Associations

When you have a more complicated structure/schema, you need to customize the as_indexed_json method, or perform the indexing separately, on your own. For example, consider an Article model which has_many Comments, Authors and Categories. We might want to define the serialization like this:

def as_indexed_json(options={})
  self.as_json(
    include: { categories: { only: :title},
               authors:    { methods: [:full_name], only: [:full_name] },
               comments:   { only: :text }
             })
end

Article.first.as_indexed_json
# => { "id"         => 1,
#      "title"      => "First Article",
#      "created_at" => 2013-12-03 13:39:02 UTC,
#      "updated_at" => 2013-12-03 13:39:02 UTC,
#      "categories" => [ { "title" => "One" } ],
#      "authors"    => [ { "full_name" => "John Smith" } ],
#      "comments"   => [ { "text" => "First comment" } ] }

Of course, when you want to use the automatic indexing callbacks, you need to hook into the appropriate ActiveRecord callbacks -- please see the full example in examples/activerecord_associations.rb.

Other ActiveModel Frameworks

The Elasticsearch::Model module is fully compatible with any ActiveModel-compatible model, such as Mongoid:

require 'mongoid'

Mongoid.connect_to 'articles'

class Article
  include Mongoid::Document

  field :id,    type: String
  field :title, type: String

  attr_accessible :id, :title, :published_at

  include Elasticsearch::Model

  def as_indexed_json(options={})
    as_json(except: [:id, :_id])
  end
end

Article.create id: '1', title: 'Quick brown fox'
Article.import

response = Article.search 'fox'
response.records.to_a
#  MOPED: 127.0.0.1:27017 QUERY        database=articles collection=articles selector={"_id"=>{"$in"=>["1"]}} ...
# => [#<Article _id: 1, id: nil, title: "Quick brown fox", published_at: nil>]

Full examples for CouchBase, DataMapper, Mongoid, Ohm and Riak models can be found in the examples folder.

Adapters

To support various "OxM" (object-relational- or object-document-mapper) implementations and frameworks, the Elasticsearch::Model integration supports an "adapter" concept.

An adapter provides implementations for common behaviour, such as fetching records from the database, hooking into model callbacks for automatic index updates, or efficient bulk loading from the database. The integration comes with adapters for ActiveRecord and Mongoid out of the box.

Writing an adapter for your favourite framework is straightforward -- let's see a simplified adapter for DataMapper:

module DataMapperAdapter

  # Implement the interface for fetching records
  #
  module Records
    def records
      klass.all(id: @ids)
    end

    # ...
  end
end

# Register the adapter
#
Elasticsearch::Model::Adapter.register(
  DataMapperAdapter,
  lambda { |klass| defined?(::DataMapper::Resource) and klass.ancestors.include?(::DataMapper::Resource) }
)

Require the adapter and include Elasticsearch::Model in the class:

require 'datamapper_adapter'

class Article
  include DataMapper::Resource
  include Elasticsearch::Model

  property :id,    Serial
  property :title, String
end

When accessing the records method of the response, for example, the implementation from our adapter will be used now:

response = Article.search 'foo'

response.records.to_a
# ~  (0.000057) SELECT "id", "title", "published_at" FROM "articles" WHERE "id" IN (3, 1) ORDER BY "id"
# => [#<Article @id=1 @title="Foo" @published_at=nil>, #<Article @id=3 @title="Foo Foo" @published_at=nil>]

response.records.records.class
# => DataMapper::Collection

More examples can be found in the examples folder. Please see the Elasticsearch::Model::Adapter module and its submodules for technical information.
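
The registration call above pairs an adapter with a condition lambda. How such a pairing can drive adapter selection is sketched below in plain Ruby. AdapterRegistry, lookup and the other names are hypothetical; the real lookup lives in Elasticsearch::Model::Adapter and may differ.

```ruby
# Hypothetical registry illustrating condition-based adapter lookup.
module AdapterRegistry
  @adapters = {}

  class << self
    # Remember an adapter together with the test that selects it.
    def register(adapter, condition)
      @adapters[adapter] = condition
    end

    # Return the first adapter whose condition accepts the class.
    def lookup(klass)
      match = @adapters.find { |_adapter, condition| condition.call(klass) }
      match ? match.first : DefaultAdapter
    end
  end
end

module DefaultAdapter; end
module DocumentAdapter; end
module DocumentBehaviour; end   # stands in for a framework mixin

AdapterRegistry.register(
  DocumentAdapter,
  lambda { |klass| klass.ancestors.include?(DocumentBehaviour) }
)

class Note
  include DocumentBehaviour
end

class Other; end

p AdapterRegistry.lookup(Note)    # matches the registered condition
p AdapterRegistry.lookup(Other)   # falls back to the default
```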

Development and Community

For local development, clone the repository and run bundle install. See rake -T for a list of available Rake tasks for running tests, generating documentation, starting a testing cluster, etc.

Bug fixes and features must be covered by unit tests.

GitHub pull requests and issues are used for communication, bug reports and code contributions.

To run all tests against a test Elasticsearch cluster, use a command like this:

curl -# https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.0.0.RC1.tar.gz | tar xz -C tmp/
SERVER=start TEST_CLUSTER_COMMAND=$PWD/tmp/elasticsearch-1.0.0.RC1/bin/elasticsearch bundle exec rake test:all

License

This software is licensed under the Apache 2 license, quoted below.

Copyright (c) 2014 Elasticsearch <http://www.elasticsearch.org>

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.