Software projects and development in a research environment [closed]

…衆ロ難τιáo~ 提交于 2019-12-02 14:01:58
ire_and_curses

It's extremely difficult. The environment both you and Stefano Borini describe is very accurate. I think there are three key factors which propagate the situation.

  1. Short-term thinking
  2. Lack of formal training and experience
  3. Continuous turnover of grad students/postdocs to shoulder the brunt of new development

Short-term thinking. There are a few reasons that short-term thinking is the norm, most of them already well explained by Stefano. As well as the awful pressure to publish and the lack of recognition for software creation, I would emphasise the number of short-term contracts. There is simply very little advantage for more junior academics (PhD students and postdocs) to spend any time planning long-term software strategies, since contracts are 2-3 years. In the case of longer-term projects e.g. those based around the simulation code of a permanent member of staff, I have seen some applications of basic software engineering, things like simple version control, standard test cases, etc. However even in these cases, project management is extremely primitive.

Lack of formal training and experience. This is a serious handicap. In astronomy and astrophysics, programming is an essential tool, but understanding of the costs of development, particularly maintenance overheads, is extremely poor. Because scientists are normally smart people, there is a feeling that software engineering practices don't really apply to them, and that they can 'just make it work'. With more experience, most programmers realise that writing code that mostly works isn't the hard part; maintaining and extending it efficiently and safely is. Some scientific code is throwaway, and in these cases the quick and dirty approach is adequate. But all too often, the code will be used and reused for years to come, bringing consequent grief to all involved with it.

Continuous turnover of grad students/postdocs for new development. I think this is the key feature that allows the academic approach to software to continue to survive. If the code is horrendous and takes days to understand and debug, who pays that price? In general, it's not the original author (who has probably moved on). Nor is it the permanent member of staff, who is often only peripherally involved with new development. It is normally the graduate student who is implementing new algorithms, producing novel approaches, trying to extend the code in some way. Sometimes it will be a postdoc, hired specifically to work on adding some feature to an existing code, and contractually obliged to work on this area for some fraction of their time.

This model is hugely inefficient. I know a PhD student in astrophysics who spent over a year trying to implement a relatively basic piece of mathematics, only a few hundred lines of code, in an existing n-body code. Why did it take so long? Because she literally spent weeks trying to understand the existing, horribly written code, and how to add her calculation to it, and months more ineffectively debugging her problems due to the monolithic code structure, coupled with her own lack of experience. Note that there was almost no science involved in this process; just wasting time grappling with code. Who ultimately paid that price? Only her. She was the one who had to put more hours in to try and get enough results to make a PhD. Her supervisor will get another grad student after she's gone - and so the cycle continues.

The point I'm trying to make is that the problem with the software creation process in academia is endemic within the system itself, a function of the resources available and the type of work that is rewarded. The culture is deeply embedded throughout academia. I don't see any easy way of changing that culture through external resources or training. It's the system itself that needs to change, to reward people for writing substantial code, to place increased scrutiny on the correctness of results produced using scientific code, to recognise the importance of training and process in code, and to hold supervisors jointly responsible for wasting the time of the members of their research group.

I'll tell you my experience.

It is undoubt that a lot of software gets created and wasted in the academia. Fact is that it's difficult to adapt research software, purposely created for a specific research objective, to a more general environment. Also, the product of academia are scientific papers, not software. The value of software in academia is zero. The data you produce with that software is evaluated, once you write a paper on it (which takes a lot of editorial time).

In most cases, however, research groups have recognized frequent patterns, which can be polished, tested and archived as internal knowledge. This is what I do with my personal toolkit. I grow it according to my research needs, only with those features that are "cross-project". Developing a personal toolkit is almost a requirement, as your scientific needs are most likely unique for some verse (otherwise you would not be doing research) and you want to have as low amount of external dependencies as possible (since if something evolves and breaks your stuff, you will not have the time to fix it).

Everything else, however, is too specific for a given project to be crystallized. I therefore tend not to encapsulate something that is clearly a one-time solver. I do, however, go back and improve it if, later on, other projects require the same piece of code.

Short project span, and the heat of research (e.g. the publish or perish vision so central today), requires agile, quick languages, and in general, languages that can be grasped quickly. Ph.Ds in genomics and quantum chemistry don't have formal programming background. In some cases, they don't even like it. So the language must be quick, easy, clean, flexible, and easy to understand later on. The latter point is capital, as there's no time to produce documentation, and it's guaranteed that in academia, everyone will leave sooner or later, you burn the group experience to zero every three years or so. Academia is a high risk industry that periodically fires all their hard-formed executors, keeping only some managers. Having a code that is maintainable and can be easily grasped by someone else is therefore capital. Also, never underestimate the power of a google search to solve your problems. With a well deployed language you are more likely to find answers to gotchas and issues you can stumble on.

Management is a problem as well. Waterfall is out of discussion. There is no time for paperwork programming (requirements, specs, design). Spiral is quite ok, but as low paperwork as possible is clearly recommended. Fact is that anything that does not give you an article in academia is wasted time. If you spend one month writing specs, it's a month wasted, and your contract expires in 11 months. Moreover, that fatty document counts zero or close to zero for your career (as many other things: administration and teaching are two examples). Of course, Agile methods are also out of discussion. Most development is made by groups that are far, and in general have a bunch of other things to do as well. Coding concentration comes in brief bursts during "spare time" between articles, and before or after meetings. The bazaar is the most likely, but the bazaar carries a lot of issues as well.

So, to answer your question, the best strategy is "slow accumulation" of known good software, development in small bursts with a quick and agile method and language. Good coding practices need to be taught during lectures, as good laboratory practices are taught during practical courses (eg. never put water in sulphuric acid, always the opposite)

The hardest part is the transition between "this is just for a paper" and "we're really going to use this."

If you know that the code will only be for a paper, fine, take short cuts. Hardcode everything you can. Don't waste time on extensive validation if the programmer is the only one who will ever run the code. Etc. The problem is when someone says "Great! Now let's use this for real" or "Now let's use it for this entirely different scenario than what it was developed and tested for."

A related challenge is having to explain why the software isn't ready for prime time even though it obviously works, i.e. it's prototype quality and not production quality. What do you mean you need to rewrite it?

I would recommend that you/they read "Clean Code"

http://www.amazon.co.uk/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882/ref=sr_1_1?ie=UTF8&s=books&qid=1251633753&sr=8-1

The basic idea of this book is that if you do not keep the code "clean", eventually the mess will stop you from making any progress.

The kind of big science I do (particle physics) has a small number of large, long-running projects (ROOT and Geant4, for instance). These are developed mostly by actual programming professionals. Using processes that would be recognized by anyone else in the industry.

Then, each collaboration has a number of project-wide programs which are developed collaboratively under the direction of a small number of senior programming scientists. These use at least the basic tools (always version control, often some kind of bug tracking or automated builds).

Finally almost every scientist works on their own programs. Use of process on these programs is very spotty, and they often suffer from all the ills that others have discussed (short lifetimes, poor coding skills, no review, lots of serial maintainers, Not Invented Here Syndrome, etc. etc.). The only advantage that is available here compared to small group science, is that they work with the tools I talked about above, so there is something that you can point to and say "That is what you want to achieve.".

Don't really have that much more to add to what has already been said. It's a difficult balance to strike because our priorities are different - academia is all about discovering new things, software engineering is more about getting things done according to specifications.

The most important thing I can think of is to try and extricate yourself from the culture of in-house development that goes on in academia and try to maintain a disciplined approach to development, difficult as that may be in many cases owing to time restraints, lack of experience etc. This control-freakery sucks away at individual responsibility and decision-making and leaves it in the hands of a few who do not necessarily know best

Get a good book on software development, Code Complete already mention is excellent, as well as any respected book on algorithms and data structures. Read up on how you will need to manage your data eg do you need fast lookup / hash-tables / binary trees. Don't reinvent the wheel - use the libraries and things like STL otherwise you are likely to be wasting time. There is a vast amount on the web including this very fine blog.

Many academics, besides sometimes being primadonna-ish and precious about any approach seen as businesslike, tend to be quite vague in their objectives. To put it mildly. For this reason alone it is vital to build up your own software arsenal of helper functions and recipes, eventually, hopefully ending up with a kind of flexible experimental framework that enables you to try out any combination of things without being to restricted to any particular problem area. Strongly resist the temptation to just dive into the problem at hand.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!