Thoughts on minimize code and maximize data philosophy

小鲜肉 2021-02-12 19:24

I have heard of the concept of minimizing code and maximizing data, and was wondering what advice other people can give me on how/why I should do this when building my own systems?

6 answers
  •  一整个雨季
    2021-02-12 19:53

    Other answers have already dug into how you can often code complex behavior with simple code that just reacts to the pattern of its particular input. You can think of the data as a domain-specific language, and of your code as an interpreter (maybe a trivial one).
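    As a rough illustration of "data as a DSL, code as a trivial interpreter", here is a minimal sketch. The discount table, the thresholds, and the function names are all made up for this example; the point is only that the behavior lives in the table and the code never needs to change when the rules do.

```python
# Hypothetical pricing rules expressed as data instead of an if/elif chain.
# Each row is (minimum order total, discount rate), ordered from highest
# threshold to lowest so the first match wins.
RULES = [
    (100.0, 0.10),
    (50.0, 0.05),
    (0.0, 0.00),
]


def discount_rate(total, rules=RULES):
    """Interpret the rule table: return the rate of the first matching row."""
    for threshold, rate in rules:
        if total >= threshold:
            return rate
    return 0.0  # No row matched (e.g. a negative total).


def price_after_discount(total, rules=RULES):
    """Apply whatever discount the table prescribes for this total."""
    return round(total * (1 - discount_rate(total, rules)), 2)
```

    Adding a new discount tier means adding a row to RULES, not touching discount_rate. The same table could just as well be loaded from a config file or a database.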

    Given lots of data you can go further: the statistics can power decisions. Peter Norvig wrote a great chapter illustrating this theme in Beautiful Data, with text, code, and data all available online. (Disclosure: I'm thanked in the acknowledgements.) On pp. 238-239:

    How does the data-driven approach compare to a more traditional software development process wherein the programmer codes explicit rules? ... Clearly, the handwritten rules are difficult to develop and maintain. The big advantage of the data-driven method is that so much knowledge is encoded in the data, and new knowledge can be added just by collecting more data. But another advantage is that, while the data can be massive, the code is succinct—about 50 lines for correct, compared to over 1,500 for ht://Dig’s spelling code. ...

    Another issue is portability. If we wanted a Latvian spelling-corrector, the English metaphone rules would be of little use. To port the data-driven correct algorithm to another language, all we need is a large corpus of Latvian; the code remains unchanged.

    He shows this concretely with code in Python using a dataset collected at Google. Besides spelling correction, there's code to segment words and to decipher cryptograms -- in just a couple of pages, again, where Grady Booch's book spent dozens without even finishing it.
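    To give a feel for how little code the data-driven approach needs, here is a simplified sketch in the same spirit. It is not the code from the chapter: the tiny inline corpus is an assumption standing in for a large text collection, and it only considers candidates one edit away, ranked by raw word count.

```python
import re
from collections import Counter

# Toy corpus (an assumption for illustration). A real corrector would count
# words from a large body of text in the target language instead.
CORPUS = "the quick brown fox jumps over the lazy dog the dog barks spelling is hard"

WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts=WORDS):
    """Pick the most frequent known word within one edit, else give up."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word
```

    With this corpus, correct("teh") picks "the" and correct("lazzy") picks "lazy", while an unknown word with no close neighbor comes back unchanged. Porting to another language would mean swapping CORPUS (and ALPHABET); the functions stay as they are, which is the portability point made in the quote above.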

    "The Unreasonable Effectiveness of Data" develops the same theme more broadly, without all the nuts and bolts.

    I've taken this approach in my work for another search company and I think it's still underexploited compared to table-driven/DSL programming, because most of us weren't swimming in data so much until the last decade or two.
