I have been assigned a task to create a (relatively) simple reporting system. In this system, the user will be shown a table as the result of a report. A table has some fields, and each field g…
Use MariaDB with its Dynamic Columns. Effectively, that lets you put all the miscellaneous columns into a single column, yet still gives you efficient access to them.
I would keep a few of the common fields in their own columns.
There is more discussion of EAV and further suggestions elsewhere, including how to do this without Dynamic Columns.
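As a rough illustration, here is a minimal sketch of how Dynamic Columns can hold the miscellaneous fields while the common ones stay as real columns; the table and field names are assumptions for illustration, not taken from the question:

CREATE TABLE report_row (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    report_id  INT UNSIGNED NOT NULL,  -- a common field kept as a real column
    created_at DATETIME     NOT NULL,  -- another common field
    extra      BLOB                    -- all miscellaneous fields packed in here
);

-- Store the miscellaneous fields with their types preserved.
INSERT INTO report_row (report_id, created_at, extra)
VALUES (1, NOW(),
        COLUMN_CREATE('date_of_birth', '2001-05-17' AS DATE,
                      'score', 42));

-- Read a dynamic column back with an explicit type, so comparisons stay type safe.
SELECT id,
       COLUMN_GET(extra, 'date_of_birth' AS DATE) AS date_of_birth
FROM   report_row
WHERE  COLUMN_GET(extra, 'date_of_birth' AS DATE) > DATE '2000-01-01';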
Avoid stringly-typed data by replacing VALUE with NUMBER_VALUE, DATE_VALUE, and STRING_VALUE. Those three types are good enough most of the time.
You can add XMLTYPE and other fancy columns later if they're needed. And for Oracle, use VARCHAR2 instead of CHAR to conserve space.
Always try to store values as the correct type. Native data types are faster, smaller, easier to use, and safer.
Oracle has a generic data type system (ANYTYPE, ANYDATA, and ANYDATASET), but those types are difficult to use and should be avoided in most cases.
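For example, a minimal Oracle sketch of that layout could look like the following; the exact shape of the ReportFieldValue table is an assumption, only the three typed columns come from the advice above:

CREATE TABLE ReportFieldValue (
    ReportFieldId NUMBER NOT NULL REFERENCES ReportField(id),
    number_value  NUMBER,
    date_value    DATE,
    string_value  VARCHAR2(4000),
    -- At most one of the typed columns should be populated per row.
    CONSTRAINT only_one_value CHECK (
          (CASE WHEN number_value IS NOT NULL THEN 1 ELSE 0 END)
        + (CASE WHEN date_value   IS NOT NULL THEN 1 ELSE 0 END)
        + (CASE WHEN string_value IS NOT NULL THEN 1 ELSE 0 END) <= 1
    )
);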
Architects often think using a single field for all data makes things easier. It makes it easier to generate pretty pictures of the data model but it makes everything else more difficult. Consider these issues:
Developing type-safe queries against stringly-typed data is painful. For example, let's say you want to find "Date of Birth" for people born in this millennium:
select *
from ReportFieldValue
join ReportField
    on ReportFieldValue.ReportFieldid = ReportField.id
where ReportField.name = 'Date of Birth'
    and to_date(value, 'YYYY-MM-DD') > date '2000-01-01';
Can you spot the bug? The problem is that Oracle does not guarantee the order in which predicates are evaluated: it is free to apply to_date(value, 'YYYY-MM-DD') to rows before the ReportField.name = 'Date of Birth' filter has removed the non-date values, which raises a date-conversion error even if you stored every date in the correct format. Very few developers know how to properly fix this, because Oracle's optimizations make it difficult to force a specific order of operations. You'll need a query like this to be safe:
select *
from
(
    select ReportFieldValue.*, ReportField.*
        --ROWNUM ensures type safety by preventing view merging and predicate pushing.
        ,rownum
    from ReportFieldValue
    join ReportField
        on ReportFieldValue.ReportFieldid = ReportField.id
    where ReportField.name = 'Date of Birth'
)
where to_date(value, 'YYYY-MM-DD') > date '2000-01-01';
You don't want to have to tell every developer to write their queries that way.
Your design is a variation of the Entity Attribute Value (EAV) data model, which is often regarded as an anti-pattern in database design.
Maybe a better approach for you would be to create a reporting values table with, say, 300 columns (NUMBER_VALUE_1 through NUMBER_VALUE_100, VARCHAR2_VALUE_1..100, and DATE_VALUE_1..100).
Then, design the rest of your data model around tracking which reports use which columns and what they use each column for.
This has two benefits: first, you are not storing dates and numbers in strings (the benefits of which have already been pointed out), and second, you avoid many of the performance and data integrity issues associated with the EAV model.
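As a rough sketch of that idea (trimmed to a few columns per type, and with hypothetical names), the model might look like this:

CREATE TABLE report_values (
    report_id        NUMBER NOT NULL,
    record_id        NUMBER NOT NULL,
    number_value_1   NUMBER,
    number_value_2   NUMBER,          -- ... up to NUMBER_VALUE_100
    varchar2_value_1 VARCHAR2(4000),
    varchar2_value_2 VARCHAR2(4000),  -- ... up to VARCHAR2_VALUE_100
    date_value_1     DATE,
    date_value_2     DATE,            -- ... up to DATE_VALUE_100
    CONSTRAINT report_values_pk PRIMARY KEY (report_id, record_id)
);

-- Tracks which physical column each report uses for which logical field.
CREATE TABLE report_column_usage (
    report_id   NUMBER        NOT NULL,
    field_name  VARCHAR2(100) NOT NULL,
    column_name VARCHAR2(30)  NOT NULL,  -- e.g. 'DATE_VALUE_1'
    CONSTRAINT report_column_usage_pk PRIMARY KEY (report_id, field_name)
);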
Using an Oracle 11gR2 database, I moved 30,000 records from one table into an EAV data model. I then queried the model to get those 30,000 records back.
SELECT SUM (header_id * LENGTH (ordered_item) * (SYSDATE - schedule_ship_date))
FROM   (SELECT rf.report_type_id,
               rv.report_header_id,
               rv.report_record_id,
               MAX (DECODE (rf.report_field_name, 'HEADER_ID', rv.number_value, NULL)) header_id,
               MAX (DECODE (rf.report_field_name, 'LINE_ID', rv.number_value, NULL)) line_id,
               MAX (DECODE (rf.report_field_name, 'ORDERED_ITEM', rv.char_value, NULL)) ordered_item,
               MAX (DECODE (rf.report_field_name, 'SCHEDULE_SHIP_DATE', rv.date_value, NULL)) schedule_ship_date
        FROM   eav_report_record_values rv
               INNER JOIN eav_report_fields rf ON rf.report_field_id = rv.report_field_id
        WHERE  rv.report_header_id = 20
        GROUP BY rf.report_type_id, rv.report_header_id, rv.report_record_id)
The results were:
1 row selected.
Elapsed: 00:00:22.62
Execution Plan
----------------------------------------------------------
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2026 | 53 (67)|
| 1 | SORT AGGREGATE | | 1 | 2026 | |
| 2 | VIEW | | 130K| 251M| 53 (67)|
| 3 | HASH GROUP BY | | 130K| 261M| 53 (67)|
| 4 | NESTED LOOPS | | | | |
| 5 | NESTED LOOPS | | 130K| 261M| 36 (50)|
| 6 | TABLE ACCESS FULL | EAV_REPORT_FIELDS | 350 | 15050 | 18 (0)|
|* 7 | INDEX RANGE SCAN | EAV_REPORT_RECORD_VALUES_N1 | 130K| | 0 (0)|
|* 8 | TABLE ACCESS BY INDEX ROWID| EAV_REPORT_RECORD_VALUES | 372 | 749K| 0 (0)|
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("RV"."REPORT_HEADER_ID"=20)
8 - filter("RF"."REPORT_FIELD_ID"="RV"."REPORT_FIELD_ID")
Note
-----
- 'PLAN_TABLE' is old version
Statistics
----------------------------------------------------------
4 recursive calls
0 db block gets
275480 consistent gets
465 physical reads
0 redo size
307 bytes sent via SQL*Net to client
252 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
That's 22 seconds to get 30,000 rows of 4 columns each. That is way too long. From a flat table we'd be looking at under 2 seconds, easy.
Well, you have a very good point about storing data in the correct data types.
And I agree that this does pose a problem for user-defined data systems.
One way of solving this problem is to add a table for each data type group (ints, floating points, strings, binary, and dates) instead of keeping the value in the ReportFieldValue table.
However, this will make your life harder since you will have to select and join multiple tables in order to get a single result.
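A hypothetical sketch of that variant (names are illustrative, not from the question's schema), along with the kind of multi-table query it forces on you:

CREATE TABLE ReportFieldValueNumber (
    ReportFieldId NUMBER NOT NULL REFERENCES ReportField(id),
    value         NUMBER NOT NULL
);

CREATE TABLE ReportFieldValueDate (
    ReportFieldId NUMBER NOT NULL REFERENCES ReportField(id),
    value         DATE   NOT NULL
);

CREATE TABLE ReportFieldValueString (
    ReportFieldId NUMBER NOT NULL REFERENCES ReportField(id),
    value         VARCHAR2(4000) NOT NULL
);

-- Pulling a mixed set of fields back means a UNION ALL (or several joins),
-- which is the extra work mentioned above.
SELECT rf.name, TO_CHAR(v.value) AS value_as_text
FROM   ReportField rf JOIN ReportFieldValueNumber v ON v.ReportFieldId = rf.id
UNION ALL
SELECT rf.name, TO_CHAR(v.value, 'YYYY-MM-DD')
FROM   ReportField rf JOIN ReportFieldValueDate v ON v.ReportFieldId = rf.id
UNION ALL
SELECT rf.name, v.value
FROM   ReportField rf JOIN ReportFieldValueString v ON v.ReportFieldId = rf.id;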
Another way would be to add a data type column to the ReportFieldValue table and create a user-defined function to dynamically cast the data from strings to the appropriate data type (using the value in the data type column), so that you can use that for sorting, searching, etc.
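For instance, here is a hypothetical Oracle-flavored sketch of that idea; the data_type column and the function name are illustrative assumptions:

-- Returns the string as a DATE only when the row is flagged as a date,
-- so the conversion is never attempted on non-date values.
CREATE OR REPLACE FUNCTION report_value_as_date (
    p_value     IN VARCHAR2,
    p_data_type IN VARCHAR2
) RETURN DATE
IS
BEGIN
    IF p_data_type = 'DATE' THEN
        RETURN TO_DATE(p_value, 'YYYY-MM-DD');
    END IF;
    RETURN NULL;
END;
/

-- Sort or filter on the typed result instead of the raw string.
SELECT *
FROM   ReportFieldValue
WHERE  report_value_as_date(value, data_type) > DATE '2000-01-01';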
SQL Server also has a data type called sql_variant that supports multiple types; I've never worked with it, but its documentation seems promising.
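For completeness, a minimal SQL Server sketch of sql_variant (the table name is hypothetical):

CREATE TABLE ReportFieldValueVariant (
    ReportFieldId INT         NOT NULL,
    value         SQL_VARIANT NULL
);

INSERT INTO ReportFieldValueVariant (ReportFieldId, value)
VALUES (1, CAST('2001-05-17' AS DATE)),  -- stored as a date, not a string
       (2, CAST(42 AS INT));             -- stored as an integer

-- SQL_VARIANT_PROPERTY reveals the underlying type of each stored value.
SELECT ReportFieldId,
       SQL_VARIANT_PROPERTY(value, 'BaseType') AS base_type
FROM   ReportFieldValueVariant;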