I have been assigned a task to create a (relatively) simple reporting system. In this system, the user will be shown a table as the result of a report. A table has some fields, and each field g…
Use MariaDB with its Dynamic Columns. Effectively, that lets you put all the miscellaneous columns into a single column, yet still gives you efficient access to them.
I would keep a few of the common fields in their own columns.
There is more discussion of EAV and further suggestions elsewhere, including how to do this without Dynamic Columns.
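As a rough illustration, here is a minimal sketch of how Dynamic Columns can hold the miscellaneous fields while the common ones stay as real columns; the table and field names are assumptions for illustration, not taken from the question:

CREATE TABLE report_row (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    report_id  INT UNSIGNED NOT NULL,  -- a common field kept as a real column
    created_at DATETIME     NOT NULL,  -- another common field
    extra      BLOB                    -- all miscellaneous fields packed in here
);

-- Store the miscellaneous fields with their types preserved.
INSERT INTO report_row (report_id, created_at, extra)
VALUES (1, NOW(),
        COLUMN_CREATE('date_of_birth', '2001-05-17' AS DATE,
                      'score', 42));

-- Read a dynamic column back with an explicit type, so comparisons stay type safe.
SELECT id,
       COLUMN_GET(extra, 'date_of_birth' AS DATE) AS date_of_birth
FROM   report_row
WHERE  COLUMN_GET(extra, 'date_of_birth' AS DATE) > DATE '2000-01-01';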
Avoid stringly-typed data by replacing VALUE with NUMBER_VALUE, DATE_VALUE, and STRING_VALUE. Those three types are good enough most of the time.
You can add XMLTYPE and other fancy columns later if they're needed. And for Oracle, use VARCHAR2 instead of CHAR to conserve space.
Always try to store values as the correct type. Native data types are faster, smaller, easier to use, and safer.
Oracle has a generic data type system (ANYTYPE, ANYDATA, and ANYDATASET), but those types are difficult to use and should be avoided in most cases.
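For example, a minimal Oracle sketch of that layout could look like the following; the exact shape of the ReportFieldValue table is an assumption, only the three typed columns come from the advice above:

CREATE TABLE ReportFieldValue (
    ReportFieldId NUMBER NOT NULL REFERENCES ReportField(id),
    number_value  NUMBER,
    date_value    DATE,
    string_value  VARCHAR2(4000),
    -- At most one of the typed columns should be populated per row.
    CONSTRAINT only_one_value CHECK (
          (CASE WHEN number_value IS NOT NULL THEN 1 ELSE 0 END)
        + (CASE WHEN date_value   IS NOT NULL THEN 1 ELSE 0 END)
        + (CASE WHEN string_value IS NOT NULL THEN 1 ELSE 0 END) <= 1
    )
);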
Architects often think using a single field for all data makes things easier. It makes it easier to generate pretty pictures of the data model but it makes everything else more difficult. Consider these issues:
Developing type-safe queries against stringly-typed data is painful. For example, let's say you want to find "Date of Birth" for people born in this millennium:
select *
from ReportFieldValue
join ReportField
    on ReportFieldValue.ReportFieldid = ReportField.id
where ReportField.name = 'Date of Birth'
    and to_date(value, 'YYYY-MM-DD') > date '2000-01-01';
Can you spot the bug? The problem is that Oracle does not guarantee the order in which predicates are evaluated: it is free to apply to_date(value, 'YYYY-MM-DD') to rows before the ReportField.name = 'Date of Birth' filter has removed the non-date values, which raises a date-conversion error even if you stored every date in the correct format. Very few developers know how to properly fix this, because Oracle's optimizations make it difficult to force a specific order of operations. You'll need a query like this to be safe:
select *
from
(
    select ReportFieldValue.*, ReportField.*
        --ROWNUM ensures type safety by preventing view merging and predicate pushing.
        ,rownum
    from ReportFieldValue
    join ReportField
        on ReportFieldValue.ReportFieldid = ReportField.id
    where ReportField.name = 'Date of Birth'
)
where to_date(value, 'YYYY-MM-DD') > date '2000-01-01';
You don't want to have to tell every developer to write their queries that way.
Your design is a variation of the Entity Attribute Value (EAV) data model, which is often regarded as an anti-pattern in database design.
Maybe a better approach for you would be to create a reporting values table with, say, 300 columns (NUMBER_VALUE_1 through NUMBER_VALUE_100, VARCHAR2_VALUE_1..100, and DATE_VALUE_1..100).
Then, design the rest of your data model around tracking which reports use which columns and what they use each column for.
This has two benefits: first, you are not storing dates and numbers in strings (the benefits of which have already been pointed out), and second, you avoid many of the performance and data integrity issues associated with the EAV model.
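As a rough sketch of that idea (trimmed to a few columns per type, and with hypothetical names), the model might look like this:

CREATE TABLE report_values (
    report_id        NUMBER NOT NULL,
    record_id        NUMBER NOT NULL,
    number_value_1   NUMBER,
    number_value_2   NUMBER,          -- ... up to NUMBER_VALUE_100
    varchar2_value_1 VARCHAR2(4000),
    varchar2_value_2 VARCHAR2(4000),  -- ... up to VARCHAR2_VALUE_100
    date_value_1     DATE,
    date_value_2     DATE,            -- ... up to DATE_VALUE_100
    CONSTRAINT report_values_pk PRIMARY KEY (report_id, record_id)
);

-- Tracks which physical column each report uses for which logical field.
CREATE TABLE report_column_usage (
    report_id   NUMBER        NOT NULL,
    field_name  VARCHAR2(100) NOT NULL,
    column_name VARCHAR2(30)  NOT NULL,  -- e.g. 'DATE_VALUE_1'
    CONSTRAINT report_column_usage_pk PRIMARY KEY (report_id, field_name)
);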
Using an Oracle 11gR2 database, I moved 30,000 records from one table into an EAV data model. I then queried the model to get those 30,000 records back.
SELECT SUM (header_id * LENGTH (ordered_item) * (SYSDATE - schedule_ship_date))
FROM   (SELECT rf.report_type_id,
               rv.report_header_id,
               rv.report_record_id,
               MAX (DECODE (rf.report_field_name, 'HEADER_ID', rv.number_value, NULL)) header_id,
               MAX (DECODE (rf.report_field_name, 'LINE_ID', rv.number_value, NULL)) line_id,
               MAX (DECODE (rf.report_field_name, 'ORDERED_ITEM', rv.char_value, NULL)) ordered_item,
               MAX (DECODE (rf.report_field_name, 'SCHEDULE_SHIP_DATE', rv.date_value, NULL)) schedule_ship_date
        FROM   eav_report_record_values rv
               INNER JOIN eav_report_fields rf ON rf.report_field_id = rv.report_field_id
        WHERE  rv.report_header_id = 20
        GROUP BY rf.report_type_id, rv.report_header_id, rv.report_record_id)
The results were:
1 row selected.
Elapsed: 00:00:22.62
Execution Plan
----------------------------------------------------------
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2026 | 53 (67)|
| 1 | SORT AGGREGATE | | 1 | 2026 | |
| 2 | VIEW | | 130K| 251M| 53 (67)|
| 3 | HASH GROUP BY | | 130K| 261M| 53 (67)|
| 4 | NESTED LOOPS | | | | |
| 5 | NESTED LOOPS | | 130K| 261M| 36 (50)|
| 6 | TABLE ACCESS FULL | EAV_REPORT_FIELDS | 350 | 15050 | 18 (0)|
|* 7 | INDEX RANGE SCAN | EAV_REPORT_RECORD_VALUES_N1 | 130K| | 0 (0)|
|* 8 | TABLE ACCESS BY INDEX ROWID| EAV_REPORT_RECORD_VALUES | 372 | 749K| 0 (0)|
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("RV"."REPORT_HEADER_ID"=20)
8 - filter("RF"."REPORT_FIELD_ID"="RV"."REPORT_FIELD_ID")
Note
-----
- 'PLAN_TABLE' is old version
Statistics
----------------------------------------------------------
4 recursive calls
0 db block gets
275480 consistent gets
465 physical reads
0 redo size
307 bytes sent via SQL*Net to client
252 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
That's 22 seconds to get 30,000 rows of 4 columns each. That is way too long. From a flat table we'd be looking at under 2 seconds, easy.
Well, you have a very good point about storing data in the correct data types.
And I agree that this does pose a problem for user-defined data systems.
One way of solving this problem is to add a table for each data type group (ints, floating points, strings, binary, and dates) instead of keeping the value in the ReportFieldValue table.
However, this will make your life harder since you will have to select and join multiple tables in order to get a single result.
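A hypothetical sketch of that variant (names are illustrative, not from the question's schema), along with the kind of multi-table query it forces on you:

CREATE TABLE ReportFieldValueNumber (
    ReportFieldId NUMBER NOT NULL REFERENCES ReportField(id),
    value         NUMBER NOT NULL
);

CREATE TABLE ReportFieldValueDate (
    ReportFieldId NUMBER NOT NULL REFERENCES ReportField(id),
    value         DATE   NOT NULL
);

CREATE TABLE ReportFieldValueString (
    ReportFieldId NUMBER NOT NULL REFERENCES ReportField(id),
    value         VARCHAR2(4000) NOT NULL
);

-- Pulling a mixed set of fields back means a UNION ALL (or several joins),
-- which is the extra work mentioned above.
SELECT rf.name, TO_CHAR(v.value) AS value_as_text
FROM   ReportField rf JOIN ReportFieldValueNumber v ON v.ReportFieldId = rf.id
UNION ALL
SELECT rf.name, TO_CHAR(v.value, 'YYYY-MM-DD')
FROM   ReportField rf JOIN ReportFieldValueDate v ON v.ReportFieldId = rf.id
UNION ALL
SELECT rf.name, v.value
FROM   ReportField rf JOIN ReportFieldValueString v ON v.ReportFieldId = rf.id;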
Another way would be to add a data type column to the ReportFieldValue table and create a user-defined function to dynamically cast the data from strings to the appropriate data type (using the value in the data type column), so that you can use that for sorting, searching, etc.
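For instance, here is a hypothetical Oracle-flavored sketch of that idea; the data_type column and the function name are illustrative assumptions:

-- Returns the string as a DATE only when the row is flagged as a date,
-- so the conversion is never attempted on non-date values.
CREATE OR REPLACE FUNCTION report_value_as_date (
    p_value     IN VARCHAR2,
    p_data_type IN VARCHAR2
) RETURN DATE
IS
BEGIN
    IF p_data_type = 'DATE' THEN
        RETURN TO_DATE(p_value, 'YYYY-MM-DD');
    END IF;
    RETURN NULL;
END;
/

-- Sort or filter on the typed result instead of the raw string.
SELECT *
FROM   ReportFieldValue
WHERE  report_value_as_date(value, data_type) > DATE '2000-01-01';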
SQL Server also has a data type called sql_variant that supports multiple types; I've never worked with it, but its documentation seems promising.
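For completeness, a minimal SQL Server sketch of sql_variant (the table name is hypothetical):

CREATE TABLE ReportFieldValueVariant (
    ReportFieldId INT         NOT NULL,
    value         SQL_VARIANT NULL
);

INSERT INTO ReportFieldValueVariant (ReportFieldId, value)
VALUES (1, CAST('2001-05-17' AS DATE)),  -- stored as a date, not a string
       (2, CAST(42 AS INT));             -- stored as an integer

-- SQL_VARIANT_PROPERTY reveals the underlying type of each stored value.
SELECT ReportFieldId,
       SQL_VARIANT_PROPERTY(value, 'BaseType') AS base_type
FROM   ReportFieldValueVariant;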