I am working on a prototype of a search system.
I have a table in Oracle with some fields. I generated realistic-looking data, around 300,000 rows. For example:
You can use all the insights already provided here. I wanted to share some additional points.
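Solr duplicates data to make searching the indexed data fast: the same field value can be kept in more than one structure, for example in the inverted index and again as a stored field. One important thing about Solr is that it stores all of this data in immutable data structures.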
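To make the duplication and immutability points concrete, here is a toy sketch in Python. It is not Solr or Lucene code: the names `Segment`, `build_segment` and `search` are made up for illustration, and the assumption is simply that an index is a list of read-only segments plus a set of deleted document ids.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Segment:
    """A read-only chunk of the index (hypothetical, for illustration only)."""
    stored: MappingProxyType    # doc id -> original text (first copy of the data)
    postings: MappingProxyType  # term -> tuple of doc ids (second copy, inverted)


def build_segment(docs: dict) -> Segment:
    """Build one segment; nothing in it can be changed afterwards."""
    postings = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            postings.setdefault(term, []).append(doc_id)
    return Segment(
        stored=MappingProxyType(dict(docs)),
        postings=MappingProxyType({t: tuple(ids) for t, ids in postings.items()}),
    )


def search(segments: list, deleted: set, term: str) -> list:
    """Look the term up in every segment and skip documents marked as deleted."""
    hits = []
    for segment in segments:
        for doc_id in segment.postings.get(term.lower(), ()):
            if doc_id not in deleted:
                hits.append(segment.stored[doc_id])
    return hits


# Initial index: one segment with two documents.
segments = [build_segment({1: "red apple", 2: "green apple"})]
deleted = set()

# "Updating" document 2 does not touch the old segment: the old version is
# marked as deleted and the new version goes into a new segment.
deleted.add(2)
segments.append(build_segment({3: "green pear"}))

print(search(segments, deleted, "apple"))
print(search(segments, deleted, "green"))
```

In this sketch the text of every document lives twice (once in `stored`, once spread over `postings`), and an update costs extra space until old segments are merged away. That is the general trade-off being described, under the simplifying assumptions above.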
If you are not using Solr's highlighting feature, you can disable document-level term vector storage.
Additionally, Solr uses different compression techniques for different types of data: bit packing and VInt compression for posting lists and numerical values, and LZ4 compression for stored fields and term vectors. It stores the term dictionary in an FST (finite state transducer), which is a specialized variant of the trie data structure.
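As a rough illustration of the VInt idea mentioned above, here is a minimal Python sketch. It assumes the usual scheme of 7 payload bits per byte with the high bit as a "more bytes follow" flag, combined with storing gaps between sorted document ids instead of the ids themselves. The function names are hypothetical, and the real on-disk format is more involved than this.

```python
def encode_vint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("vint encodes non-negative integers only")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)                      # last byte has the high bit clear
    return bytes(out)


def decode_vints(data: bytes) -> list:
    """Decode a byte string containing back-to-back vints."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def encode_postings(doc_ids: list) -> bytes:
    """Delta-encode a sorted list of doc ids, then vint-encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += encode_vint(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data: bytes) -> list:
    """Rebuild the doc ids by summing the decoded gaps."""
    doc_ids, running = [], 0
    for gap in decode_vints(data):
        running += gap
        doc_ids.append(running)
    return doc_ids


postings = [3, 130, 131, 100000]
packed = encode_postings(postings)
print(len(packed), "bytes instead of", 4 * len(postings))
```

Small gaps fit in a single byte, which is why sorted posting lists compress well with this kind of encoding compared to fixed 4-byte integers.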