Disadvantages of table with too many columns

Posted by 旧时模样 on 2021-01-23 06:06:52

Question


I have some data that I need to put in a PostgreSQL database. The data relate to schools, so there are a LOT of attributes per school, mostly small integers, floats or short texts, and all of them change yearly. So I'm creating an entity called YearlyData and putting the attributes there. But the thing is, the number of attributes is around 50-60. These can't be normalized because they are plain attributes of the school itself, so I can't really justify splitting them into separate tables. But I'm not sure if this would hurt my performance.

I could try to categorize the data, put each category in a separate table, and point to them from the YearlyData table. But then searching schools by 20-30+ parameters would require an insane number of joins, I'm guessing. I'm also not sure if that would hurt performance.

Any expert advice?


Answer 1:


There are a few things to consider here:

  • Does the list of attributes change significantly over time?
  • Does the list of attributes require custom user-defined attributes?
  • Are there different attributes for different schools (i.e. many attributes only apply to one or a few schools)?

If any of these are true, you might think about a properties store approach like EAV, hstore, json fields, xml fields, etc.
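As a minimal sketch of the json-field option (table and attribute names here are illustrative, not from the question), PostgreSQL's jsonb type can hold the variable attributes alongside fixed columns:

```sql
-- Properties-store sketch: fixed columns for universal attributes,
-- a jsonb column for school-specific or custom ones.
CREATE TABLE yearly_data (
    yearly_data_id serial PRIMARY KEY,
    school_id      integer NOT NULL,
    year           integer NOT NULL,
    extra_attrs    jsonb            -- variable, per-school attributes
);

-- A GIN index makes containment searches on the jsonb efficient.
CREATE INDEX yearly_data_extra_attrs_idx
    ON yearly_data USING gin (extra_attrs);

-- Find rows where a given attribute has a given value:
SELECT *
FROM yearly_data
WHERE extra_attrs @> '{"has_gymnasium": true}';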

If not - if you have a fairly static list of properties where most of them make sense for most of the rows - then there's no real problem with having them as 60 individual columns. It'll be easier to add indexes for commonly searched sets of attributes, including partial and composite indexes, and searches - particularly those filtering on many different attributes - will be much faster.
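To sketch what that buys you (column names are hypothetical), composite and partial indexes can target the exact filter combinations you run most:

```sql
-- Composite index for a pair of filters that are often combined.
CREATE INDEX yearly_data_students_year_idx
    ON yearly_data (total_students, year);

-- Partial index: only index the rows a common query cares about,
-- keeping the index small and fast.
CREATE INDEX yearly_data_large_schools_idx
    ON yearly_data (school_id)
    WHERE total_students > 1000;
```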

See also: Database design - should I use 30 columns or 1 column with all data in form of JSON/XML?

There's also a compromise option available to you: A main table for the most important details you look up a lot, plus side-tables for logical groupings of attributes. Say:

yearly_summary (
    yearly_summary_id serial primary key,
    school_id integer,
    total_students integer,
    ...
) 

plus

yearly_student_stats(
    yearly_summary_id integer primary key references yearly_summary(yearly_summary_id) on delete cascade,
    ...
)

etc. The integer primary key that's also a foreign key means you have an enforced 1:1 (optional) relationship to the other table. This approach can be useful if you have a few logical groupings of attributes that you can cluster into side-tables.
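Fetching a school's full record then costs one join per side-table you actually touch, e.g. (assuming the sketch tables above; the school id is made up):

```sql
SELECT ys.school_id,
       ys.total_students,
       yss.*                -- the student-stats attribute group
FROM yearly_summary ys
LEFT JOIN yearly_student_stats yss
       ON yss.yearly_summary_id = ys.yearly_summary_id
WHERE ys.school_id = 42;    -- hypothetical school id
```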

I'd also be surprised if a little more thought didn't reveal things that do make sense to normalize. Do you have year7_blah, year8_blah, year9_blah etc columns? If so: Great candidate for normalization.
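A minimal sketch of that normalization (year7_blah is the hypothetical from the paragraph above; enrolment is an invented example attribute): instead of one column per year group, store one row per year group:

```sql
-- Before: year7_enrolment, year8_enrolment, ... as separate columns.
-- After: one row per year group, which extends without schema changes.
CREATE TABLE year_group_stats (
    yearly_summary_id integer REFERENCES yearly_summary(yearly_summary_id)
                              ON DELETE CASCADE,
    year_group        integer NOT NULL,    -- 7, 8, 9, ...
    enrolment         integer,
    PRIMARY KEY (yearly_summary_id, year_group)
);
```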




Answer 2:


PostgreSQL stores rows on what are called data pages, which are 8 kB in size. You can think of these as buckets with a finite size limit. The disadvantage of wide rows is that the database can fit fewer rows on each data page. It's faster for the database engine to bring back 1,000 rows from one page than 1,000 rows spread over many pages: one read versus up to 1,000, with disk I/O being your enemy. This is something to be aware of, not to avoid; wide tables are often needed, and you can live with the overhead. In your case the integer attributes alone will use roughly 240 bytes per row (4 bytes per integer × 60 columns).
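If you'd rather measure actual row width than estimate it, PostgreSQL can report it directly (the table name is illustrative):

```sql
-- On-disk size of each tuple, in bytes:
SELECT pg_column_size(t.*) AS row_bytes
FROM yearly_data t
LIMIT 10;

-- Average width of each column, as gathered by ANALYZE:
SELECT attname, avg_width
FROM pg_stats
WHERE tablename = 'yearly_data';
```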



来源:https://stackoverflow.com/questions/17901804/disadvantages-of-table-with-too-many-columns
