At Amberbit, we have a legacy project which uses MongoDB. We’ve been working on it for few years now. Through this time it obviously grow and changed a lot and at some point we started focusing more on it’s performance. The biggest issue we had was that we were using mongoid and mongoid/moped, which not only caused memory leaks (rack mini profiler) but also had an impact on performance as we needed to run really complicated queries that non-relational database couldn’t handle well. That’s why we agreed on switching to PostgreSQL.

Step by step, I want to show you how to migrate from MongoDB to PostgreSQL. You don’t need to use all of my solutions, but I promise they’ll save your time and make your code cleaner.

Repositories

The key is to make sure that you won’t lose any query, which could break some of your functionalities. First step is based on the recipe proposed by Andrzej Krzywda in Fearless Refactoring. He recommends creating repositories for each model you are using or at least for the biggest and most complicated ones. By moving all your queries there, you can easily modify them later and change their syntax in the Active Record way. Same strategy apply to your tests.

class UsersRepository
  def find(id, attributes: nil)
    users.only(attributes).where(id: id).first
  end

  def find_by_email(email, attributes: nil)
    users.only(attributes).where(email: email)
  end

  def all(attributes: nil)
    users.only(attributes).all
  end

  def all_by_name(name, attributes: nil)
    users.only(attributes).where(name: name)
  end

  private

  def users
    @users ||= User
  end
end

To speed up moving queries to the repository don’t forget to grep some commonly occurring patterns:

$ git grep ‘ User\.’
$ git grep ‘ user\.’
$ git grep ‘\.users

Remember to find places where model is called by associated model if it’s connected by has_many, has_one or belongs_to.

UserCommunity.find(id).users 

change it to

def all_by_user_community(id, attributes: nil)
  users.only(attributes).where(user_community_id: id)
end

You should also move all create, update, delete and even save calls to the Repository. It will help you later on, while working on callbacks. It will be much cleaner to place it there instead of using before_save actions inside the model.

def update(user, attrs={}l)
  user.update(attrs)
end

def save(user)
  user.save
end

attributes

As you might see, I’m passing an optional parameter attributes. It allows you to pass array of the names of fields you want to receive, which will speed up your query execution in situations when you don’t really need all fields of your model.

size, length, count

It would be also a good practise to move all .count calls to the repository you are creating. When using mongodb, there is no difference when using count, length or size. In postgres, there is a huge performance difference between them. Using length, will load the data each time and then execute the query. When you use count, it won’t load all data but just execute count query. Performance-wise is best to use size. It will load data only when they’re not already loaded otherwise it will just count them.

Setup Postgres

When you’re done with the first step, you can go ahead and start setting up postgres for your project. This is the simplest part. You need to add pg and active_record gems to your Gemfile, run bundle exec install and prepare your database.yml file to keep your db connection configuration. I was afraid I’ll have some problems with working on two different databases in one project, but it turns out that it’s much easier than I expected. As I said, it’s the easiest part :)

Generate postgres migration

Once you have your project set up with the proper gems and your model repositories are all prepared, you should tell your new database about those new models you are going to add. Creating migrations is really easy as you can use rails generate ...command. Although, you need to remember to define the type of a migration, as we have configured two different databases. The solution you need to apply is:

$ rails generate active_record:migration MigrationName

When you generate your migration file, I recommend using your mongoid Model file and copying all of the mongoid field to generated migration. Of course you’ll need to modify them to fit the proper Active Record migration syntax. You can also use Find/Replace tool (vim: :%s/pattern/patter/g).

field :name, String

will be

t.string :name

From my experience, you should also add mongo_id string field to each model, which may help you later with your existing data migration. Don’t forget about your model relations and to deal with all of the mongoid embedded documents by creating additional tables or moving some fields directly to the parent model. If necessary, create some indirect tables. After you finish, remember to run rake db:migrate. If something goes wrong, you can always do rake db:rollback to undo your last migration. I think, you know the drill :)

Prepare your ActiveRecord models

I believe that the best practice to create new models and a task for populating data from MongoDB to PostgreSQL, would be by creating additional directory like /models/mongodb. Next, copy there models you are going to migrate and modify them by adding namespace and name of the collection, which your model should be pointing at:

class Mongodb::User
  field :name
  field :email

  store_in collection: ‘users’  #remember to pluralize the name of your model
end

When you’re done with that, get back to your primary model file and modify it in the ActiveRecord way:

  • Remove all include Mongoid::... and field :name lines.

  • Add < ActiveRecord::Base after class Name line

  • Modify your repositories to use Active Record syntaxes (change all nin, in, eq, use joins, etc.)

  • If your file is referencing to some other models that you are not going to migrate now, you need to consider adding some custom methods that will make a proper connection and return the data from referenced objects. From my experience, it’s best not to split the migrations. It’s much easier to keep track of all model associations, and later to create script for populating data, when you already have all the referenced models created.

    def find(id, attributes: ‘*’)

    users.select(attributes).where(id: id).first

    end

Creating a task for data migration

The last step of the migration is to move all the data you have in your MongoDB to your new Postgres database. To do that I used few components:

  • Mongoid/Moped driver to get data directly from database in hash format.

  • Postgres transaction to be sure that the whole model migration will pass without any problems. Any brake or error during migration would trigger rollback, so we won’t need to check which records has already been created when trying to run it one more time after applying fixes.

  • Plain SQL to insert. It is much faster and cleaner way than using ActiveRecord. Also, if you already properly created all the records, you don’t need to validate them or run callbacks. You can use Bulk Insert gem that will help you generate all of the inserts in just one execution(it will speed it up by a lot).

    ActiveRecord::Base.transaction do

    attributes = User.attribute_names
    User.bulk_insert do |pg_model|
      Mongodb::User.collection.find().to_a.each do |user|
        attrs = user.slice(*attributes).merge(mongo_id: user['_id'].to_s)
        pg_model.add(attrs)
      end
    end

    end

If you are going to create some references, you should implement proper migration in the same transaction. It only depends, if the associated model, has been already created or not.

So, if you need to join user to the group, you’ll need to migrate the group first, and then, in user migration block, merge additional attributes with the id of the group in postgres database. Use mongo_id field to find the proper one:

attrs.merge(store_group_id: UserGroup.where(mongo_id: user[‘store_group_id’].to_s)
                                     .first.try(:id))

This is also the reason why I recommended creating additional mongo_id field in each model.

Also, remember about properly migrating many to many relations, which were arrays in mongo database, but now they need additional indirect table. After migrating both of the models(User and Tag), get one of them and create additional loop:

UserTag.bulk_insert do |postgres|
  Mongodb::Tag.collection.find().each do |tag|
    tag_id = Tag.where(mongo_id: record['_id'].to_s).first.try(:id)
    tag['user_ids'].each do |uid|
      postgres.add(user_id: User.where(mongo_id: sid.to_s).first.try(:id), tag_id: tag_id)
    end
  end
end

This is another situation where you definitely need your old mongo id value. This kind of migration will be probably the longest one. For me, running it on ~20k records takes around 3 min.

Conclusion

This transition was a time consuming process but we don’t regret it. It not only made our service faster but it also helped us to clear our code and make some good changes.

I bet there are more people who already had to deal with this problem so let me know about your experience and the solutions you worked out :)

Post by Rafał Maksymczuk

Rafał enjoys a range of technologies (Ruby, JavaScript), and has been working with us since 2013.