Performance of Multiple Joins


Question


Greetings Overflowers,

I need to query against objects with many/complex spatial conditions. In relational databases that translates into many joins (possibly 10+). I'm new to this business and wondering whether to go with MS SQL Server 2008 R2, Oracle 11g, a document-based solution such as RavenDB, or simply a spatial (GIS) database...

Any thoughts?

Regards

UPDATE: Thank you all for your answers. Would anybody opt for document/spatial databases? My database would consist of tens of millions to a few billion records, mostly read-only. There are almost no updates except to correct mistakes in the input. Inserts happen overnight and are not that frequent. The join tables are predicted beforehand, but the number of self joins (tables joining themselves multiple times) is not. Small pages of results from such queries will be viewed on a highly interactive website, so response time is critical. Any predictions on how this would perform on MS SQL Server 2008 R2 or Oracle 11g? I'm also concerned about boosting performance by adding more servers; which one scales better? How about PostgreSQL?
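To make the shape of such a query concrete, here is a minimal sketch of one self join with a spatial-style predicate. The table and column names (features, region_id, category) are made up purely for illustration, and the paging clause is omitted because its syntax differs per engine (TOP, ROWNUM, LIMIT):

-- Hypothetical schema: one large table joined back to itself on a shared key.
SELECT a.feature_id,
       b.feature_id AS neighbour_id
FROM   features a
JOIN   features b
       ON  b.region_id  = a.region_id     -- self join on a predicted key
       AND b.feature_id <> a.feature_id   -- exclude the row itself
WHERE  a.category = 'road'
AND    b.category = 'building'
ORDER BY a.feature_id;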


Answer 1:


Build and test.

That's the only way to know whether your idea is going to work. There are free versions of Oracle, SQL Server, and Teradata available for downloading. PostgreSQL is free, period.

Database design help might not be free. SQL performance suffers from bad design more than any other single cause.

I did a test (proof of concept) yesterday (?? days are running together in my head) on 20 tables of 50 million rows, natural keys (no id numbers), 20 left joins, median access time of 40 milliseconds, using a commodity desktop computer with slow disks and 2 GB of RAM.
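A sketch of what that kind of query looks like, with made-up table and column names; the actual test joined 20 tables, all following the same left-join-on-natural-key pattern:

-- Driving table probed by its natural key, left joined to satellite tables.
SELECT o.order_no, o.order_date,
       c.customer_name,
       s.ship_method,
       w.warehouse_name
FROM   orders o
LEFT JOIN customers  c ON c.customer_no  = o.customer_no
LEFT JOIN shipments  s ON s.order_no     = o.order_no
LEFT JOIN warehouses w ON w.warehouse_no = s.warehouse_no
-- ... the remaining lookup tables follow the same pattern ...
WHERE  o.order_no = 'A123456';   -- single-key probe, hence the ~40 ms access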


Edit: It seems there's also a free, single-server version of Greenplum that's only constrained to two CPU sockets, no limitation on CPU cores. No limitation on database size, either. I'm feelin' the need to play with a couple of terabytes.




Answer 2:


It is much more common than you might think to perform 10+ joins on a set of tables in a practical application. The ramifications of inner vs. outer joins differ once you get that high, but I wouldn't be overly worried unless the amount of data you are outer joining on becomes very large. Databases are optimized for dealing with sets.

Example:

Just yesterday I wrote a query that performs 13 inner joins. It executes on a 50,000+ record set in less than a second.
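For illustration only (the table names below are invented, not from that query), a chain of inner joins on keyed lookup tables looks roughly like this:

SELECT f.sale_id, p.product_name, st.store_name, pr.promotion_name
FROM   sales f
JOIN   products   p  ON p.product_id    = f.product_id
JOIN   stores     st ON st.store_id     = f.store_id
JOIN   promotions pr ON pr.promotion_id = f.promotion_id
-- ... ten more joins of the same shape bring the total to 13 ...
WHERE  f.sale_date >= DATE '2011-01-01';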




Answer 3:


Agreed, it isn't so much the joins that are the problem as the amount of data being queried. Though I will admit that, working in an environment that uses MS SQL Server 2005, MS SQL Server 2008 R2, and Oracle 10g and 11g, our MS SQL databases do seem slightly more prone to deadlocks when large queries are run.




Answer 4:


One of the big unknowns in your question is how dynamic the SQL is and, for similar SQL statements, how often the values in the predicates change. Do the statements use bind parameters instead of inline values? (They should, where possible.) If there is a lot of opportunity for plan reuse, Oracle would be my choice.
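A minimal illustration of the difference (parcels and owner_id are hypothetical names; the bind syntax shown is the :name form, which drivers may expose as ? or @p1):

-- Inline literal: every distinct value looks like a brand-new statement
-- to the parser, so plans are harder to reuse.
SELECT parcel_id, area FROM parcels WHERE owner_id = 12345;

-- Bind parameter: the same parsed statement and plan can be reused
-- across executions with different values.
SELECT parcel_id, area FROM parcels WHERE owner_id = :owner_id;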

Regardless of the complexity of the SQL, Oracle has an array of features that can help. Materialized views and query rewrite can provide drastic performance benefits in cases where slightly stale results are acceptable instead of real-time results. 11g also adds result set caching.
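A sketch of how those Oracle features are typically used; the schema (features, region_id) is hypothetical, and the right refresh strategy would depend on your overnight load:

-- Precompute an expensive aggregate; ENABLE QUERY REWRITE lets the optimizer
-- transparently answer matching queries from the materialized view.
CREATE MATERIALIZED VIEW mv_region_counts
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND     -- refresh alongside the overnight inserts
  ENABLE QUERY REWRITE
AS
SELECT region_id, COUNT(*) AS feature_count
FROM   features
GROUP  BY region_id;

-- 11g result cache: cache the result set of a frequently repeated query.
SELECT /*+ RESULT_CACHE */ region_id, feature_count
FROM   mv_region_counts
WHERE  region_id = :region_id;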

Once the database chooses an execution plan, it is not so much the number of joins that matters as how well the database is tuned for those specific joins. Indexing, up-to-date statistics, and materialized views may be critical.
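For example (Oracle syntax, hypothetical table and column names), the basic tuning steps amount to indexing the join keys and keeping statistics fresh:

-- Index the columns the joins actually use.
CREATE INDEX ix_features_region ON features (region_id);

-- Keep optimizer statistics current so join orders stay sensible.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'FEATURES');
END;
/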




Answer 5:


Both MS SQL Server 2008 R2 and ORACLE 11g should be able to handle that without difficulty. In terms of expandability I would recommend Oracle 11g in a RAC environment. You could also do Microsoft clustering with MS SQL Server 2008 R2, but in my experience Oracle's RAC is a more solid solution.

At the same time, the applications that you plan to use with the database should also play a role in the decision. If you will be using MS SharePoint or other MS applications, then MS SQL Server 2008 R2 may be a better solution.

In terms of PostgreSQL, I don't have much experience with it, but I have heard nightmare stories from people who have used it in enterprise environments and large-business situations. From what I know it is not exactly scalability friendly. Personally, I think MySQL would be a better solution than PostgreSQL if you are looking for an open source option, but keep in mind that open source SQL solutions will not be the easiest when it comes to scalability or a high-availability environment, if that is your ultimate goal.



Source: https://stackoverflow.com/questions/4972444/performance-of-multiple-joins
