How to build a good training data set for machine learning and predictions?


Question


I have a school project to make a program that uses the Weka tools to make predictions on football (soccer) games.

Since the algorithm is already there (J48), I just need the data. I found a website that offers football match data for free and tried it in Weka, but the predictions were pretty bad, so I assume my data is not structured properly.

I need to extract the data from my source and reformat it in order to create new attributes and classes for my model. Does anyone know of a course/tutorial/guide on how to properly create attributes and classes for machine learning predictions? Is there a standard that describes the best way of choosing the attributes of a data set for training a machine learning algorithm? What is the right approach here?

Here's an example of the data I have at the moment: http://www.football-data.co.uk/mmz4281/1516/E0.csv

and here is what the columns mean: http://www.football-data.co.uk/notes.txt
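
(For reference, a minimal sketch of loading such a CSV into Weka and setting the full-time result column, FTR per the notes above, as the class attribute; the local file path is an assumption.)

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.CSVLoader;

    public class LoadFootballData {
        public static void main(String[] args) throws Exception {
            // Load the raw CSV from football-data.co.uk (local path is an assumption).
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("E0.csv"));
            Instances data = loader.getDataSet();

            // Use the full-time result (H/D/A) as the class attribute to predict.
            data.setClassIndex(data.attribute("FTR").index());

            System.out.println("Instances: " + data.numInstances()
                    + ", attributes: " + data.numAttributes());
        }
    }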


Answer 1:


The problem may be that the data set you have is too small. Suppose you have ten variables and each variable has a range of 10 values. There are 10^10 possible configurations of these variables. It is unlikely your data set will be that large, let alone cover all of the possible configurations. The trick is to narrow the variables down to the most relevant ones and so avoid this large potential search space.
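
One possible way to do that narrowing in Weka is to rank the attributes by information gain (an entropy-based measure, in line with how J48 itself splits) and keep only the top few. A rough sketch; the cut-off of ten attributes is arbitrary, and data is assumed to be an Instances object with its class index already set:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;

    public class AttributeRanking {
        // Returns the indices of the highest-ranked attributes (class index appended last).
        public static int[] rankAttributes(Instances data) throws Exception {
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new InfoGainAttributeEval()); // entropy-based relevance
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(10);                          // arbitrary cut-off
            selector.setSearch(ranker);
            selector.SelectAttributes(data);
            return selector.selectedAttributes();
        }
    }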

A second problem is that certain combinations of variables may be more significant than others.

The J48 algorithm attempts to find the most relevant variable using entropy at each level in the tree. Each path through the tree can be thought of as an AND condition: V1==a & V2==b ...

This covers the significance due to joint interactions. But what if the outcome is a result of A&B&C OR W&X&Y? The J48 algorithm will find only one of them, and it will be the one whose first selected variable has the most overall significance when considered alone.
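
You can see those AND-paths concretely by building a J48 tree in Weka and printing it; every root-to-leaf branch in the output is one conjunction of attribute tests. A minimal sketch, assuming data is already loaded with its class index set:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class TreeInspection {
        public static void printTree(Instances data) throws Exception {
            J48 tree = new J48();
            tree.buildClassifier(data);
            // Each root-to-leaf branch in the printout is one AND condition
            // (e.g. V1 = a AND V2 = b => class), chosen greedily by entropy.
            System.out.println(tree);
        }
    }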

So, to answer your question, you need not only to find a training set that covers the most common variable configurations in the "general" population, but also an algorithm that faithfully represents these training cases. Faithful meaning that it generalizes well to unseen cases.

It's not an easy task. Many people and much money are involved in sports betting. If it were as easy as selecting the proper training set, you can be sure someone would have done it by now.

EDIT:

It was asked in the comments how you find the proper algorithm. The answer is: the same way you find a needle in a haystack. There is no set rule. You may be lucky and stumble across it, but in a large search space you won't ever know if you have. This is the same problem as finding the optimum point in a very convoluted search space.

A short-term answer is to

  • Think about what the algorithm can really accomplish. The J48 (and similar) algorithms are best suited for classification where the influence of the variables on the result is well known and follows a hierarchy. Flower classification is one example where it will likely excel.

  • Check the model against the training set. If it does poorly on the training set, it will likely perform poorly on unseen data as well. In general, you should expect the model's performance on the training set to exceed its performance on unseen data.

  • The algorithm needs to be tested with data it has never seen. Testing against the training set, while a quick elimination test, will likely lead to overconfidence.
  • Reserve some of your data for testing. Weka provides a way to do this. The best-case scenario would be to build the model on all cases except one (the Leave-One-Out approach) and then see how the model performs on average across the held-out cases. A Weka sketch of both checks follows this list.

But this assumes the data at hand are not in some way biased.

A second pitfall is to let the test results bias the way you build the model, for example by trying different model parameters until you get an acceptable test response. With J48 it's not easy to let this bias creep in, but if it does, you have effectively used your test set as an auxiliary training set.

  • Continue collecting more data and testing for as long as possible. Even after all of the above, you still won't know how useful the algorithm is unless you can observe its performance on future cases. When what appears to be a good model starts behaving poorly, it's time to go back to the drawing board.
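
As promised above, a minimal Weka sketch of the two checks: evaluating on the training data itself, then cross-validating on data the model has not seen (set the fold count to the number of instances for Leave-One-Out). Here data is assumed to be loaded with its class index set:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class ModelChecks {
        public static void evaluate(Instances data) throws Exception {
            // 1. Performance on the training set itself: a quick elimination test only.
            J48 trained = new J48();
            trained.buildClassifier(data);
            Evaluation trainEval = new Evaluation(data);
            trainEval.evaluateModel(trained, data);
            System.out.println("On training data:\n" + trainEval.toSummaryString());

            // 2. Performance on held-out data via cross-validation.
            //    Use folds = data.numInstances() for Leave-One-Out.
            int folds = 10;
            Evaluation cvEval = new Evaluation(data);
            cvEval.crossValidateModel(new J48(), data, folds, new Random(1));
            System.out.println(folds + "-fold cross-validation:\n" + cvEval.toSummaryString());
        }
    }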

Surprisingly, there are a large number of fields (mostly in the soft sciences) which fail to see the need to verify the model with future data. But this is a matter better discussed elsewhere.

This may not be the answer you are looking for but it is the way things are.

In summary,

  1. The training data set should cover the 'significant' variable configurations
  2. You should verify the model against unseen data

Identifying (1) and doing (2) are the tricky bits. There is no cut-and-dried recipe to follow.



Source: https://stackoverflow.com/questions/36179454/how-to-build-a-good-training-data-set-for-machine-learning-and-predictions
