postgres

Handling Spelling Mistakes with Postgres Full Text Search

Background #

Postgres Full Text Search (FTS) is a great way to implement site search on a website running Postgres already, without requiring additional infrastructure.

On a recent engagement with a client, we were deciding between Postgres FTS and ElasticSearch. Ultimately we chose FTS because we could spin it up without having to add extra infrastructure, as we would with ElasticSearch.

Since the project was written in Ruby on Rails, we were able to use the excellent PgSearch gem to implement FTS in ActiveRecord.

Multisearch #

As we wanted a general site search, we needed to utilize multisearch. Multisearch combines multiple ActiveRecord models into one search 'document' table that you can search against. When a search is configured for multisearch, a user's search term is run against every model we mark as multisearchable at the same time. See the PgSearch documentation for more detail.
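As a rough sketch (the model names here are hypothetical, not from the project), marking models as multisearchable and running a multisearch looks something like this:

# Any model marked multisearchable has its content copied into the shared
# pg_search_documents table, which PgSearch.multisearch queries.
class Article < ApplicationRecord
  include PgSearch::Model
  multisearchable against: [:title, :body]
end

class Event < ApplicationRecord
  include PgSearch::Model
  multisearchable against: [:name, :description]
end

# One call searches every multisearchable model at once and returns
# PgSearch::Document records that point back at the original records.
results = PgSearch.multisearch("annual report")
results.map(&:searchable) # => matching Article and Event records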

Search Features #

PgSearch allows for different search features: tsearch, trigram, and dmetaphone. The default is tsearch, which uses the built-in Postgres Full Text Search.

This was great for our use case, since tsearch also comes with highlighting, a feature we required. The highlight is a field returned by Postgres FTS that contains the text surrounding the search term for context, with the matched terms bolded.
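As a rough sketch of how the highlighted text is read back (shown here with a per-model pg_search_scope for brevity; the model and column names are hypothetical, while the project itself used multisearch):

class Post < ApplicationRecord
  include PgSearch::Model
  pg_search_scope :search_content,
                  against: :content,
                  using: {
                    tsearch: { highlight: { MaxFragments: 1 } }
                  }
end

match = Post.search_content("trigram").with_pg_search_highlight.first
match.pg_search_highlight
# => "... we compare the <b>trigram</b> sets to one another ..."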

Spelling Mistakes #

Unfortunately, tsearch does not handle misspelled words. However, as I mentioned before, PgSearch allows for other search features!

Trigram search, enabled via the pg_trgm Postgres extension, does just that.
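If you are adding it to a Rails app, enabling the extension from a migration can be as small as this (a minimal sketch; the migration name and Rails version are placeholders):

class EnablePgTrgm < ActiveRecord::Migration[6.0]
  def change
    enable_extension "pg_trgm"
  end
end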

Trigram #

  • The idea behind trigram search is to split pieces of text into sets of three-letter segments and compare those sets to one another.
  • If two trigram sets are similar enough, we assume there was a spelling mistake, and return the document with the correctly-spelled term.
  • As a quick example (ignoring whitespace): consider the word Viget. Viget would make the trigrams:
[vig, ige, get]
  • Now, consider our evil twin agency, Qiget. They would make the trigrams:
[qig, ige, get]
  • The two trigram sets match very closely, with only one trigram differing. Thus, if we compared these with pg_trgm, we could reasonably tell that anyone typing 'Qiget' was actually looking for 'Viget' and just misspelled it. (A runnable version of this comparison follows below.)
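Here is what that comparison looks like when run for real from a Rails console, assuming the pg_trgm extension is installed. Note that pg_trgm also pads each word with spaces, so the real trigram sets are a bit larger than the simplified ones above:

conn = ActiveRecord::Base.connection

conn.select_value("SELECT show_trgm('viget')")
# => the six trigrams "  v", " vi", "vig", "ige", "get", "et "

conn.select_value("SELECT similarity('viget', 'qiget')")
# => about 0.33, above pg_trgm's default 0.3 threshold, so 'qiget' counts as a match for 'viget'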

Working Trigram into our existing solution #

PgSearch allows us to use multiple search features at once, so we can use tsearch and trigram side by side. Note that we cannot simply replace tsearch with trigram, because we still rely on features (such as highlighting) that are exclusive to tsearch. Here is what an example configuration might look like.

PgSearch.multisearch_options = {
  using: {
    tsearch: {
      prefix: true,
      highlight: {
        MaxFragments: 1
      }
    },
    trigram: { 
      only: [:content]
    }
  }
}

Trigram (and timelines) causing issues #

While it was easy to slot Trigram into our multisearch, it caused a pretty serious performance hit. We were seeing searches 50x-75x slower with both features combined than with tsearch alone. We needed to find a way to balance performance with handling misspellings.

At the point that handling misspellings became prioritized, the entire search feature was almost fully QA'd and about ready to go out. There wasn't much time left in the budget to find a good solution for the issue.

This thread from the PgSearch repo sums it up pretty well – multiple other users were (and are) having the same issues we were. The top-rated comment in the thread says the solution was to just use ElasticSearch ('top-rated' is doing a lot of heavy lifting here; it did have the most likes...at two). We needed some sort of middle-ground solution that we could act on quickly.

Postgres Documentation saves the day #

In the docs for the pg_trgm Postgres extension, the authors give an idea for using Trigram in conjunction with Full Text Search. The general idea is to create a separate words table with a Trigram index on it.

Something like this worked for us. Note that we added an extra step with a temporary table, which let us filter out words containing non-alphabet characters.

execute <<-SQL
  -- Temp table so we can filter out words containing non-alphabet characters (URLs, etc.)
  CREATE TEMP TABLE temp_words AS
    SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', content) FROM pg_search_documents');

  CREATE TABLE pg_search_words (
    id SERIAL PRIMARY KEY,
    word text
  );

  INSERT INTO pg_search_words (word)
    SELECT word
    FROM temp_words
    WHERE word ~ '^[a-zA-Z]+$';
  
  CREATE INDEX pg_words_idx ON pg_search_words USING GIN (word gin_trgm_ops);
  
  DROP TABLE temp_words;
SQL

This words table is therefore populated with every unique word that exists in your search content table. For us, this table was pretty large.

result = ActiveRecord::Base.connection.execute("SELECT COUNT(*) FROM pg_search_words").first['count']
puts result
# => 1118644

Keeping the words table up-to-date #

As mentioned in the docs, this table is separate from your search table. It therefore needs to be either periodically regenerated or kept in sync, so that any new words added to the search content are also added to the words table.

One way to achieve this is with a trigger (written here with the HairTrigger gem's create_trigger DSL), which adds the words from any row inserted into or updated in the documents table to the words table, still filtering out non-alphabet characters.

create_trigger("pg_search_documents_after_insert_update_row_tr", generated: true, compatibility: 1)
  .on("pg_search_documents")
  .after(:insert, :update) do
  <<-SQL_ACTIONS
    CREATE TEMP TABLE temp_words AS
      SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', ' || quote_literal(NEW.content) || ')');

    INSERT INTO pg_search_words (word)
      SELECT word
      FROM temp_words
      WHERE word ~ '^[a-zA-Z]+$';

    DROP TABLE temp_words;
  SQL_ACTIONS
end

Note that this does not handle records being deleted from the table – that would need to be something separate.
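One low-effort way to cover deletions (and stale words in general) is the "periodically regenerated" option mentioned earlier: rebuild the whole words table on a schedule. A hedged sketch, with an illustrative rake task name, might look like this:

namespace :search do
  desc "Rebuild pg_search_words from the current search documents"
  task rebuild_words: :environment do
    ActiveRecord::Base.connection.execute(<<-SQL)
      TRUNCATE pg_search_words;

      INSERT INTO pg_search_words (word)
        SELECT word
        FROM ts_stat('SELECT to_tsvector(''simple'', content) FROM pg_search_documents')
        -- same non-alphabet filter as the original migration, applied inline
        WHERE word ~ '^[a-zA-Z]+$';
    SQL
  end
end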

How we used the words table #

Assuming for simplicity that the user's search term is a single word: if the search returns no results, we compare the term's trigram set against the trigram-indexed words table and take the closest match.

Then we show that closest match in a "Did you mean {correctly-spelled word}?" prompt that links to a search for the correctly-spelled word.
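A hedged sketch of that lookup (the helper name and query shape are our illustration, not the project's exact code):

def closest_word(term)
  quoted = ActiveRecord::Base.connection.quote(term)

  ActiveRecord::Base.connection.select_value(<<-SQL)
    SELECT word
    FROM pg_search_words
    WHERE word % #{quoted}                     -- pg_trgm's similarity operator, able to use the GIN index
    ORDER BY similarity(word, #{quoted}) DESC  -- closest match first
    LIMIT 1
  SQL
end

# suggestion = closest_word(params[:query]) if results.empty?
# => e.g. "viget" for a search of "qiget", used to build the "Did you mean ...?" link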

Given more time, I would have liked to explore options to speed up the combined FTS and Trigram search. I'm certain we could have improved on the performance issues, but I can't say for sure that we could have gotten the search time down to a reasonable amount.

A future enhancement that would be pretty simple is to automatically search for that correctly-spelled word, removing the prompt to click the link. We could also change the text to something like "Showing results for {correctly-spelled word}".

Ultimately, I think with the situation at hand, we made the right call implementing Trigram this way. The search is just as fast as before, and now in the case of misspellings, a user just has to follow the link to the correctly-spelled word and they will see the results they wanted very quickly.




postgres

SE-Radio Episode 328: Bruce Momjian on the Postgres Query Planner

Postgres developer Bruce Momjian joins Robert Blumen for a discussion of the SQL query optimizer in the Postgres RDBMS. They delve into the internals of query planning and look at how developers can make it work for their apps.




postgres

SE-Radio Episode 362: Simon Riggs on Advanced Features of PostgreSQL

Simon Riggs, founder and CTO of 2nd Quadrant, discusses the advanced features of the Postgres database that allow developers to focus on applications while the database does the heavy lifting of handling large and diverse quantities of data.




postgres

Episode 454: Thomas Richter Postgres as an OLAP database

Thomas Richter is the founder of Swarm64, a company whose Postgres extension is designed to boost the performance of your Postgres instance. This episode examines the internals of Postgres, performance considerations, and relational database types.




postgres

Episode 496: Bruce Momjian on Multi-Version Concurrency Control in Postgres (MVCC)

This week, Postgres server developer Bruce Momjian joins host Robert Blumen for a discussion of multi-version concurrency control (MVCC) in the Postgres database. They begin with a discussion of the isolation requirement in database transactions (I in ACID); how isolation can be achieved with locking; limitations of locking; how locking limits concurrency and creates variability in query runtimes; multi-version concurrency control as a means to achieve isolation; how Postgres manages multiple versions of a row; snapshots; copy-on-write and snapshots; visibility; database transaction IDs; how tx ids, snapshots and versions interact; the need for locking when there are multiple writers; how MVCC was added to Postgres; and how to clean up unused space left over from aged-out versions.




postgres

Episode 511: Ant Wilson on Supabase (Postgres as a Service)

Ant Wilson of Supabase discusses building an open source alternative to Firebase with PostgreSQL. SE Radio host Jeremy Jung spoke with Wilson about how Supabase compares to Firebase, building an API layer with postgREST, authentication using GoTrue...




postgres

SE Radio 583: Lukas Fittl on Postgres Performance

Lukas Fittl of pganalyze discusses the performance of Postgres, one of the world’s most popular database systems. SE Radio host Philip Winston speaks with Fittl about database indexing, queries, maintenance, scaling, and stored procedures. They also discuss some features of pganalyze, such as the index and vacuum advisors.




postgres

PHP CRUD Operations with PostgreSQL Server

CRUD (Create, Read, Update, and Delete) operations are used in web applications to manipulate data in a database. These four basic operations are what let an application manage its data. We have already shared a tutorial covering create (insert), read (select), update, and delete operations in PHP CRUD Operations with MySQL. In this tutorial, we will build a PHP CRUD application with the PostgreSQL server. PostgreSQL, also known as Postgres, is a relational database management system (RDBMS) that is open-source and free to use. We will connect with the PostgreSQL Server

The post PHP CRUD Operations with PostgreSQL Server appeared first on CodexWorld.




postgres

Rails, Angular, Postgres, and Bootstrap : powerful, effective, and efficient full-stack web development

Location: Engineering Library- QA76.76.D47C6675 2016




postgres

Don't Do This - PostgreSQL wiki




postgres

What I Wish Someone Told Me About Postgres




postgres

Problem Notes for SAS®9 - 66505: The OBS= option does not generate a limit clause when you use SAS/ACCESS Interface to PostgreSQL to access a Yellowbrick database

When you use SAS/ACCESS Interface to PostgreSQL to query a Yellowbrick database, the SAS OBS= option does not generate a LIMIT clause on the query that is passed to the database.



postgres

Beekeeper Studio | Free SQL editor and database manager for MySQL, Postgres, SQLite, and SQL Server. Available for Windows, Mac, and Linux.




postgres

Wimpie Nortje: Database migration libraries for PostgreSQL.

It may be tempting at the start of a new project to create the first database tables manually, or to write SQL scripts that you run by hand, especially when using a migration library means first spending a significant amount of time sifting through all the available libraries, and then some more getting one working properly.

Going through this process did slow me down at the start of the project, but I was determined to use a migration tool: hunting inexplicable bugs that only happen in production, just to find out there is a definition mismatch between the production and development databases, is not fun. Using such a tool also motivates you to write both the setup and teardown steps for each table while the current design is still fresh in your mind.

At first I considered a standalone migration tool, because I expect such tools to be very good at that single task. However, learning the idiosyncrasies of a new tool and trying to make it fit seamlessly into my development workflow seemed like more trouble than it was worth.

I decided to stick with a Common Lisp library and found the following seven that work with PostgreSQL and/or Postmodern:

  • Crane
  • Mito
  • cl-mgr
  • Orizuru-orm
  • CL-migrations
  • Postmodern-passenger-pigeon
  • Database-migrations

I quickly discounted Crane and Mito because they are ORM (Object Relational Mapper) libraries, which are way more complex than a dedicated migration library. Development on Crane stalled some time ago and I don't feel it is mature enough for frictionless use yet. Mito declares itself as being in alpha state; also not mature enough yet.

I only stumbled onto cl-mgr and Orizuru-orm long after making my decision, so I did not investigate them seriously. Orizuru-orm is in any case an ORM, which I would have discounted because it is too complex for my needs. CL-mgr looks simple, which is a good thing. It is based on cl-dbi, which makes it a good candidate if you foresee switching databases, but even if I had discovered it sooner I would have discounted it for the same reason as CL-migrations.

CL-migrations looks very promising. It is a simple library focusing only on migrations. It uses clsql to interface with the database, which bothered me because I had already committed to using Postmodern and I try to avoid adding a lot of unused code to my projects. The positive side is that it interfaces with many different databases, so it is a good candidate if you are not committed to Postmodern. It is also a stable code base with no outstanding bug reports.

The two projects I focused on were Postmodern-passenger-pigeon and Database-migrations, because they both use Postmodern for a database interface.

Postmodern-passenger-pigeon was in active development at the time and it seemed safer to use than Database-migrations because it can do dry runs, which is a very nice feature when you are upgrading your production database and face the possibility of losing data when things go awry. Unfortunately I could not get it working within a reasonable amount of time.

I finally settled on Database-migrations. It is a small code base focused on one task, it is mature, and it uses Postmodern, so it does not pull a whole new database interface into my project. There are, however, some less positive issues.

The first issue is a hindrance during development. Every time the migrations ASDF system (or the file containing it, as ASDF prefers that all systems be defined in a single file) is recompiled, it adds all the defined migrations to the migrations list. Though each one will only be applied once to the DB, it is still bothersome. One can clear the list with (setf database-migrations::*migrations* nil), but then only newly modified migration files will be added back. The solution is to touch the .asd file after clearing the migrations list.

The second negative point is quite dangerous. The downgrade function takes a target version as parameter, with a default target of 0. This means that if you execute downgrade without specifying a target version you delete your whole database.

I am currently using Database-migrations and it works well for me. If for some reason I need to switch I will use cl-migrations.

Using Database-migrations

To address the danger of unintentionally deleting my database I created a wrapper function that does both upgrade and downgrade, and it requires a target version number.

Another practical issue I discovered is that upgrades and downgrades happen in the same order as they are defined in the migration file. If you create two tables in a single file where table 2 depends on table 1, then you cannot revert (downgrade), because Database-migrations will attempt to delete table 1 before table 2. The solution here is to use the def-queries-migration macro (instead of def-query-migration), which defines multiple queries simultaneously. If a single definition that defines multiple tables gets overwhelming, the other option is to stick with one migration definition per file.




postgres

An intro to making Postgres high availability on Kubernetes

#351 — April 15, 2020

Read on the Web

Postgres Weekly

A Detailed Look at pg_show_plans — A few issues ago we linked to a basic introduction to pg_show_plans – this goes a little further. pg_show_plans lets you look at the execution plans of slow queries in real time, as they're being executed, which can help when troubleshooting.

Kaarel Moppel

Intersecting GPS Tracks to Identify Infected Individuals — I’m not a huge fan of COVID-19 related content, but this is a pretty interesting technique with numerous use cases. Essentially it uses PostGIS to identify overlapping paths.

Florian Nadler

Online Training: Learn PostgreSQL from Home — The remote PostgreSQL Database Administration training course is available at a discounted rate & will be conducted in two different timezones. The course covers day-to-day DBA operations, monitoring, server configurations, and more.

2ndQuadrant PostgreSQL Training sponsor

PostgreSQL's 'Related Projects' — Thanks to Andreas Scherbaum for pointing out a new page on the Postgres site dedicated to projects related to Postgres like the code that runs the Postgres web site, mailing list, build farm, package management system, etc.

PostgreSQL Global Development Group

Authentication Configuration in Postgres (and CockroachDB) — In Postgres, client authentication can be controlled via a ‘HBA’ (host-based authentication) file. It’s not something we see covered very often, so you might find this interesting, particularly as it compares things against CockroachDB.

Raphael ‘kena’ Poss

▶  Easy And Correct High Availability Postgres with Kubernetes — A 50 minute talk from PostgresOpen 2019 that goes all the way ‘from containers up’ until actually doing stuff with Postgres.

Steven Pousty

How To Set Up an Express API Backend Project With Postgres — A pretty extensive walkthrough of creating an HTTP API using Express with Node.js and Postgres on the backend, then deploying it all on Heroku.

Chidi Orji

A Beginners Guide to Basic Indexing in Postgres

James Bannister

eBook: The Most Important Events to Monitor in Your Postgres Logs — In this eBook, we are looking at the Top 6 Postgres log events for monitoring query performance and preventing downtime.

pganalyze sponsor

Documenting the Citus Extension to Postgres: An Interview with Joe Nelson — Joe, a.k.a. begriffs, talks about why he works on documentation, why the multi-tenant and real-time analytics tutorials matter, the INSERT..SELECT with repartitioning feature, and what development platform Citus uses for docs.

Citus Data (Microsoft)

Procedural vs Query Approaches for Finding Packages — Explorations of a query that can be used to display which packages are available for a given FreeBSD port. Get your head around the data model and the ideas here apply to all sorts of situations.

Dan Langille

Upcoming Events

All in-person events we had listed are cancelled or postponed due to the COVID outbreak, so we're now linking to webinars, livestreams, and similar online events.

If you have any, just hit reply and if it's Postgres related (and either free or not too expensive) we'll include it in a future issue. Just one this week:


Seen on Twitter

Saw this tweet and thought it was a pretty neat reminder of the sorts of things we can do with Postgres. Justin kindly let us include it:

Click through to the original tweet if you want to see the code better. Neat use for a generated column!




postgres

Workloads, acceleration, and making Postgres better

#353 — April 29, 2020

Read on the Web

Postgres Weekly

7 Things That Could Be Improved in Postgres — As 1990s dance pop group D:Ream sang in 1994, Things Can Only Get Better... including Postgres. Luckily these are all 'nice to have's, but I dare say we'll see some of them (such as automatic tuning and auto-vacuuming improvements) appear over time.

Kaarel Moppel

How The Citus Distributed Query Executor Adapts to a Postgres Workload — Citus is the popular extension for horizontally scaling Postgres and its query executor has seen some huge updates lately.

Citus Data (Microsoft)

eBook: The Most Important Events to Monitor in Your Postgres Logs — In this pganalyze eBook, we are looking at the Top 6 Postgres log events for monitoring query performance and preventing downtime.

pganalyze sponsor

Swarm64 DA 4.0: A Database Acceleration Extension for Postgres — Swarm64 started life as a FPGA-driven way to accelerate Postgres performance, but can now work without FPGAs too. This is not a free product but if you want to give it a run, there’s a trial or it can be spun up from the AWS Marketplace.

Yana Krasteva

Postgres Performance Goalposts — An interesting heuristic from Bruce here on what to do if you expect your connections, queries, or write queries to be above/below certain levels.

Bruce Momjian

A Tale of Password Authentication Methods in Postgres — “Let’s say you want to implement a password authentication method in a client/server protocol..” Here’s the story of how Postgres came up with its approaches.

Peter Eisentraut

How to Set application_name When Using psql — As Craig says: “Setting your application name in Postgres is SO USEFUL. It will help a lot for debugging when you’ve got multiple different apps/services connecting to the same database.”

Denish Patel

How to Upgrade Postgres from v11 to v12 on Ubuntu 20.04 — Now that Ubuntu 20.04 is out, this might be on your mind!

Paolo Melchiorre

Working with Amazon Aurora PostgreSQL: What Happened to the Stats? — Apparently there’s a bug with numerous versions of Aurora PostgreSQL that causes certain stats to be lost on restart.

Michael Vitale

Postgres Vision 2020 - Free Online Conference (June 23-24) — Learn how today’s IT leaders are using Postgres. Join from anywhere in the world and hear from 30+ Postgres experts.

EnterpriseDB sponsor

A Deep Dive into PostGIS Nearest Neighbor Search — Take a deep dive into the Postgres and PostGIS internals to find out how K-nearest neighbor accelerates local search.

Martin Davis

My Favorite Postgres Extensions: Part One — A basic high level look at pg_partman and postgres_fdw.

Nawaz Ahmed

Kanel: Generate TypeScript Types from Postgres

Kristian Dupont

Postgres.app: The Easiest Way to Get Started with Postgres on the Mac — I’ve used this for years, it’s super popular, but if there’s just a handful of developers out there who’d benefit from it and don’t know about it, this reminder will be worth it :-) It continues to get very frequent updates.

Jakob Egger, Chris Pastl, and Mattt Thompson

Upcoming Events

All in-person events we had listed are cancelled or postponed due to the COVID outbreak, so we're now linking to webinars, livestreams, and similar online events.

If you have any, just hit reply and if it's Postgres related (and either free or not too expensive) we'll include it in a future issue. Just one this week:

  • Postgres Vision 2020 on June 23-24. A full attempt at an online Postgres conference across multiple days with multiple tracks.
