As the name suggests POST-GRES, Postgres was introduced in 1986 as a successor to Ingres database. It was the evolution of RDBMS (Relational Database Management Systems). In 1996 the name was changed from Postgres to PostgreSQL which reflects its support for SQL.
PostgreSQL over MySQL
Unlike MySQL which works primarily on relational database models, PostgreSQL is an Object-relational database management system and it offers support for SQL and NoSQL.
Postgres implements Multiversion Concurrency Control (MVCC) which means it supports parallel query plans and can handle complex queries better than MySQL. It also supports modern datatypes like JSON, XML, hstore whereas MySQL only supports JSON.
Today, PostgreSQL is being used by tech giants like Apple, Cisco, Instagram etc. Who says you can't use it with great success as well? Let's see which are the prerequisites for this first.
For this article we used these tools:
- PostgreSQL (12.5) on ubuntu 20.04;
- Explain.dalibo.com for visualizing Explain Analyze outputs;
- Fake2db for inserting fake data to database tables.
The setup is straight-forward: install PostgreSQL, then using the Fake2db tool we can insert fake data to our custom models or use the default model provided by them. We went with the default model, Customers, which has these attributes: name, address, country, city, register_date, birthdate, email, phone_number and locale. Also we added around 250,000 records in the database to be as much closer to the practical dataset as possible.
To optimize the database we first have to analyze the queries, know how exactly the query is fetching data from the database, and what index is being used. Then: how much time is being used by queries? What plan is being selected?
To answer all these questions, we will be using the EXPLAIN keyword while running the query and it will provide us with all the information about the query.
When appended before the query, this keyword gives us the information about the execution plan that the PostgreSQL planner generates.
There are different parameters that can also be used along with explain keyword like ANALYZE which gives us the actual time spent in execution of a query statement. VERBOSE displays additional information about plans. COSTS gives us the information about the startup and total cost of each node. Lastly BUFFERS gives us information about buffer usage. Now let's use the explain keyword and try to get all the records and analyze the information returned by it. Query is simple as you'd expect:
```````````````````EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) select * from customer;
This is the returned output:
Now let's see what each of these lines does and what are the components.
This plan does not have any nodes because it is the simplest query, but we will surely see them later in this article. For now, just remember that every line which starts from this symbol “->” is a node and the lines indented below are the information regarding that node. The node on the top will have cumulative information of all child nodes.
Seq Scan specifies that PostgreSQL will go read all the rows in a table and eliminate rows if not satisfied with filters, then return the result.
In our case, we fetched all the rows so none of them got discarded.
Cost has two numbers (i.e. number..number), the first number representing the startup cost of an operation. For us it is 0.00 because PostgreSQL started from reading the first page and then moved along.
If we were to add a condition like
order by, it will first have to sort all the data. After sorting and before fetching and returning the result, it had already worked on sorting some number in cost, which will show up as startup cost in plan.
This is a screenshot of the plan when we use order by in query. The second number in cost tells us about the total time it took to sort, fetch and return the data.
Note that now the startup cost is bumped up to 24756.25 from 0.00, when we used this query:
``````````EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) select * from customer order by name;
Next, we focus on
rows, which is the estimation of the number of rows that will be returned by the query. This estimate is done by PostgreSQL and it knows how many rows you have, or how many rows will (likely) match the criteria. It helps a lot when PostgreSQL is making decisions on which plan to use.
Width is the estimation about how many bytes of information will be returned from a single row on average.
Actual time is self-explanatory: it tells how many mili-seconds it took to complete the plan.
However, cost estimation is done in arbitrary units and it depends on different factors like
Rows in the actual section represents how many rows were returned actually in the result of the query and
loops shows how many times this operation was ran in total.
To see the query plan in an easier-to-understand format, check this resource. Just paste in the output of the query plan and the actual query you ran. Here is the screenshot of the second query and its query plan using this website:
Indexes are copies created on specific columns of table, that help you quickly find the desired record from table.
Utilizing that column, let’s say the query is that we want to find a specific customer by name. We run this query without using an index, then we will observe the changes after using one.
-----------EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, Format JSON) select * from customer where "name" ='Mary';
Total time it took to complete the query is 32.1 ms - this is without using any index.
Next we create an index on column 'name' and see if there are any improvements in overall execution time. Here is the syntax to create index:
------------------CREATE INDEX customer_name ON customer(name)
Here are the results of the same query but we created an index this time.
Voilà! The time of query is now 21.6ms, meaning 67% less than the previous run.
One more thing to look at: in the previous query, the database did parallel seq scan which is like going to the table and checking every entry in the database to see if it is what we searched for. When we have millions of records, you can just imagine what it means using this approach to query the database.
Now on to the index database: when you use it, it selects Bitmap Heap Scan which is like a well-planned blueprint of the data in the table.
Instead of going and searching for desired records in table one by one, using indexes database knows where a particular subset of data is located and jumps straight to it.
In our case, the database went straight to records that were in the 'M' subset which was the condition of our query.
Looking at this whole situation, you can get an idea of what it's like to have billions of records and not using indexes.
Indexes improve our query time, our hardware cost and saves us the waiting time for a query.
The next step from single column index is using multiple fields in an index, syntax is same:
------------------CREATE INDEX customer_name ON customer(name, email)
This index will further improve execution time of the query if we had last name also in where clause. The Pro tip is to have columns that you will put in a query to have indexes created on them.
Check and Balance
You have to understand that creating indexes costs us space on the hard drive, and having too many indexes can alter our database performance. Also the burden on the database is to update those indexes whenever there is an update, insert or delete operation on the database. So we will have to optimize this.
Not every join and index can be optimal in every condition or database.
For example, if we have 10 records in our database of customers, using an index is overkill because databases first have to fetch the index and then do the search. Instead, a simple sequential scan will give results faster in this case.
Like any other MySql database, Postgres supports joins between tables in a database. The joins supported by Postgres includes Cross, Inner, Left Outer, Right Outer and Full Outer Join. We are not going to go in depth of each join type as there are lots of articles explaining details about joins and when and how to use them. Rather, we will look over what scan types the database uses when joins are applied.
Nested Loops is a scan type used when we apply a join in a query. In this particular scan the inner table of the join is searched for by matching rows with the outer table. It is fast and is best for small tables. Used in Cross Joins.
Merge Joins is used in sorted data sets, it is best when we have large datasets in the database. However, it has a high startup cost (startup cost includes operations done on datasets before looking and returning the desired result, i.e. sorting).
Hash Join builds the hash for the inner table and then scans the outer table unlike nested loops. It also has a high startup cost due to creation of a hash for the inner table but its execution is less. Used in Inner joins.
Let’s join our tables customer and
detailed_registration using an inner join to get the customer's name and address. Here is the query:
```````EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON) select c.name, dr.adress from customer c inner join detailed_registration dr on c.email = dr.email ;````````` c.email = dr.email ;`````````
In the visualization of the explain keyword, we get the whole flow wherein the query was executed. It used
Hash Join which we learned about a few minutes ago. So hash was created on the customer table and then a detailed registration table was scanned.
Based on the size of the database, using index or not makes a lot of difference in regards to execution time. Next time you use an index or get stuck at joining the tables and querying the data, keep in mind these points and you’ll surely get better results.
Vacuum is a Postgres command just link EXPLAIN. In Postgres, updated key-value tuples are not removed from the table when updated or when changed.
To remove those tuples/records, we use the Vacuum command which kind of reclaims that space in the drive. By default, Vacuum is enabled in Postgres. If not, you can easily Vacuum a table using the command
Wrapping it up
Using database and web applications on the same server is a poor practice in multiple ways. The hardware most of the time fails to provide for both database and application and in the end we have a crashed database or web application.
I would suggest you use a standalone database when you are in a production environment and use this type of architecture only in testing. Sudden crashes in either database or application can corrupt data or even destroy the database.
We hope you enjoyed learning about PostgreSQL and how to optimize it. If you still have questions, we warmly welcome them in the comments section ☺️. And if you learned something in this article, maybe you want to learn some more by subscribing to our newsletter (check below).