Looking for a C++ implementation of the C4.5 algorithm

Submitted by 回眸只為那壹抹淺笑 on 2019-12-04 09:59:30

I may have found a possible C++ "implementation" of C5.0 (See5.0), but I haven't been able to dig into the source code enough to determine if it really works as advertised.

To reiterate my original concerns, the author of the port states the following about the C5.0 algorithm:

Another drawback with See5Sam [C5.0] is the impossibility to have more than one application tree at the same time. An application is read from files each time the executable is run and is stored in global variables here and there.

I will update my answer as soon as I get some time to look into the source code.

Update

It's looking pretty good. Here is the C++ interface:

class CMee5
{
  public:

    /**
      Create a See 5 engine from tree/rules files.
      \param pcFileStem The stem of the See 5 file system. The engine
             initialisation will look for the following files:
              - pcFileStem.names Vanilla See 5 names file (mandatory)
              - pcFileStem.tree or pcFileStem.rules Vanilla See 5 tree or rules
                file (mandatory)
              - pcFileStem.costs Vanilla See 5 costs file (mandatory)
    */
    inline CMee5(const char* pcFileStem, bool bUseRules);

    /**
      Release allocated memory for this engine.
    */
    inline ~CMee5();

    /**
      General classification routine accepting a data record.
    */
    inline unsigned int classifyDataRec(DataRec Case, float* pOutConfidence);

    /**
      Show rules that were used to classify the last case.
      Classify() will have set RulesUsed[] to
      number of active rules for trial 0,
      first active rule, second active rule, ..., last active rule,
      number of active rules for trial 1,
      first active rule, second active rule, ..., last active rule,
      and so on.
    */
    inline void showRules(int Spaces);

    /**
      Open file with given extension for read/write with the actual file stem.
    */
    inline FILE* GetFile(String Extension, String RW);

    /**
      Read a raw case from file Df.

      For each attribute, read the attribute value from the file.
      If it is a discrete valued attribute, find the associated no.
      of this attribute value (if the value is unknown this is 0).

      Returns the array of attribute values.
    */
    inline DataRec GetDataRec(FILE *Df, Boolean Train);
    inline DataRec GetDataRecFromVec(float* pfVals, Boolean Train);
    inline float TranslateStringField(int Att, const char* Name);

    inline void Error(int ErrNo, String S1, String S2);

    inline int getMaxClass() const;
    inline int getClassAtt() const;
    inline int getLabelAtt() const;
    inline int getCWtAtt() const;
    inline unsigned int getMaxAtt() const;
    inline const char* getClassName(int nClassNo) const;
    inline char* getIgnoredVals();

    inline void FreeLastCase(void* DVec);
};

I would say that this is the best alternative I've found so far.

A C++ implementation for C4.5 called YaDT is available here, in the "Decision Trees" section:
http://www.di.unipi.it/~ruggieri/software.html

This is the source code for the last version:
http://www.di.unipi.it/~ruggieri/YaDT/YaDT1.2.5.zip

From the paper where the tool is described:

[...] In this paper, we describe a new from-scratch C++ implementation of a decision tree induction algorithm, which yields entropy-based decision trees in the style of C4.5. The implementation is called YaDT, an acronym for Yet another Decision Tree builder. The intended contribution of this paper is to present the design principles of the implementation that allowed for obtaining a highly efficient system. We discuss our choices on memory representation and modelling of data and metadata, on the algorithmic optimizations and their effect on memory and time performances, and on the trade-off between efficiency and accuracy of pruning heuristics. [...]

The paper is available here.

If I'm reading this correctly...it appears not to be organized as a C API, but as a C program. A data set is fed in, then it runs an algorithm and gives you back some rule descriptions.

I'd think the path you should take depends on whether you:

  1. merely want a C++ interface for supplying data and retrieving rules from the existing engine, or...

  2. want a C++ implementation that you can tinker with in order to tweak the algorithm to your own ends

If what you want is (1) then you could really just spawn the program as a process, feed it input as strings, and take the output as strings. That would probably be the easiest and most future-proof way of developing a "wrapper", and then you'd only have to develop C++ classes to represent the inputs and model the rule results (or match existing classes to these abstractions).
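For what it's worth, here is a minimal sketch of what option (1) could look like. It assumes a POSIX system (popen and pclose are not standard C++), and the names runAndCapture and splitLines are made up for illustration; neither comes from C4.5 or See5. The command string you pass in is whatever invocation your build of the tool expects.

```cpp
#include <cassert>
#include <cstdio>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: runs `command` through the shell and returns
// everything the child process wrote to stdout.
// Assumes POSIX popen/pclose are available.
std::string runAndCapture(const std::string& command) {
    assert(!command.empty());
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) {
        throw std::runtime_error("popen failed for: " + command);
    }
    std::string output;
    char buffer[4096];
    std::size_t bytesRead = 0;
    while ((bytesRead = fread(buffer, 1, sizeof buffer, pipe)) > 0) {
        output.append(buffer, bytesRead);
    }
    // pclose returns -1 only if waiting on the child failed; a non-zero
    // exit status of the tool itself is not treated as an error here.
    if (pclose(pipe) == -1) {
        throw std::runtime_error("pclose failed for: " + command);
    }
    return output;
}

// Hypothetical helper: splits captured output into lines so that a
// rule parser (not shown) can work on one line at a time.
std::vector<std::string> splitLines(const std::string& text) {
    std::vector<std::string> lines;
    std::istringstream stream(text);
    std::string line;
    while (std::getline(stream, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

You would call runAndCapture with the tool's command line, pass the result to splitLines, and then map those lines onto your own rule classes. Since the child process is torn down after every run, its global state never leaks into your program.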

But if what you want is (2)...then I'd suggest trying whatever hacks you have in mind on top of the existing code in either C or Java--whichever you are most comfortable with. You'll get to know the code that way, and if you have any improvements you may be able to feed them upstream to the author. If you build a relationship over the longer term then maybe you could collaborate and bring the C codebase slowly forward to C++, one aspect at a time, as the language was designed for.

Guess I just think the "When in Rome" philosophy usually works better than Port-In-One-Go, especially at the outset.


RESPONSE TO UPDATE: Process isolation takes care of your global variable issue. As for performance and data set size, you only have as many cores/CPUs and as much memory as you have. Whether you're using processes or threads usually isn't the issue when you're talking about matters of scale at that level. The overhead you'd run into is the marshalling, if it turns out to be too expensive.

Prove the marshalling is the bottleneck, and to what extent... and you can build a case for why a process is a problem over a thread. But, there may be small tweaks to existing code to make marshalling cheaper which don't require a rewrite.
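One rough way to get that evidence is to time the marshalling step on its own and compare it with the time for a full round trip. The sketch below is only an illustration: averageSeconds and marshalRecord are hypothetical names, and the comma-separated line format is an assumption, not the actual input format of any particular tool.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: average wall-clock seconds per call of `work`,
// measured over `repeats` consecutive calls.
template <typename Work>
double averageSeconds(Work&& work, int repeats) {
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / repeats;
}

// Hypothetical marshalling step: turns one numeric record into a
// comma-separated text line (the format is an assumption).
std::string marshalRecord(const std::vector<float>& values) {
    std::ostringstream line;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i != 0) {
            line << ',';
        }
        line << values[i];
    }
    line << '\n';
    return line.str();
}
```

Pass a lambda that calls marshalRecord on a representative record to averageSeconds, then do the same with a lambda that performs the whole round trip to the external program. If the first number is a small fraction of the second, marshalling is probably not your bottleneck.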
