ml.net sentiment analysis warning about format errors & bad values

问题

I've been having a problem with my ml.net console app. This is my first time using ml.net in Visual Studio so I was following this tutorial from microsoft.com, which is a sentiment analysis using binary classification.

I'm trying to process some test data in the form of tsv files to get a positive or negative sentiment analysis, but in debugging I'm receiving warnings there being 1 format error and 2 bad values.

I decided to ask all you great devs here on Stack to see if anyone can help me find a solution.

Here's an image of the debugging below:

Here's the link to my test data:
wiki-data
wiki-test-data

Finally, here's my code for those who what to reproduce the problem:

There's 2 c# files: SentimentData.cs & Program.cs.

1 - SentimentData.cs:

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.ML.Runtime.Api;

namespace MachineLearningTut
{
 public class SentimentData
 {
    [Column(ordinal: "0")]
    public string SentimentText;
    [Column(ordinal: "1", name: "Label")]
    public float Sentiment;
 }

 public class SentimentPrediction
 {
    [ColumnName("PredictedLabel")]
    public bool Sentiment;
 }
}

2 - Program.cs:

using System;
using Microsoft.ML.Models;
using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using System.Threading.Tasks;

namespace MachineLearningTut
{
class Program
{
    const string _dataPath = @".\Data\wikipedia-detox-250-line-data.tsv";
    const string _testDataPath = @".\Data\wikipedia-detox-250-line-test.tsv";
    const string _modelpath = @".\Data\Model.zip";

    static async Task Main(string[] args)
    {
        var model = await TrainAsync();

        Evaluate(model);

        Predict(model);
    }

    public static async Task<PredictionModel<SentimentData, SentimentPrediction>> TrainAsync()
    {
        var pipeline = new LearningPipeline();

        pipeline.Add(new TextLoader (_dataPath).CreateFrom<SentimentData>());

        pipeline.Add(new TextFeaturizer("Features", "SentimentText"));

        pipeline.Add(new FastForestBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

        PredictionModel<SentimentData, SentimentPrediction> model = pipeline.Train<SentimentData, SentimentPrediction>();

        await model.WriteAsync(path: _modelpath);

        return model;
    }

    public static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model)
    {
        var testData = new TextLoader(_testDataPath).CreateFrom<SentimentData>();

        var evaluator = new BinaryClassificationEvaluator();

        BinaryClassificationMetrics metrics = evaluator.Evaluate(model, testData);

        Console.WriteLine();
        Console.WriteLine("PredictionModel quality metrics evaluation");
        Console.WriteLine("-------------------------------------");
        Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
        Console.WriteLine($"Auc: {metrics.Auc:P2}");
        Console.WriteLine($"F1Score: {metrics.F1Score:P2}");

    }

    public static void Predict(PredictionModel<SentimentData, SentimentPrediction> model)
    {
        IEnumerable<SentimentData> sentiments = new[]
        {
            new SentimentData
            {
                SentimentText = "Please refrain from adding nonsense to Wikipedia."
            },

            new SentimentData
            {
                SentimentText = "He is the best, and the article should say that."
            }
        };

        IEnumerable<SentimentPrediction> predictions = model.Predict(sentiments);

        Console.WriteLine();
        Console.WriteLine("Sentiment Predictions");
        Console.WriteLine("---------------------");

        var sentimentsAndPredictions = sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));

        foreach (var item in sentimentsAndPredictions)
        {
            Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
        }
        Console.WriteLine();
    }
}

}

If anyone would like to see the code or more details on the solution, ask me on the chat and I'll send it. Thanks in advance!!! [Throws a Thumbs Up]

回答1:

I think I got a fix for you. A couple of things to update:

First, I think you got your SentimentData properties switched to what the data has. Try changing it to

[Column(ordinal: "0", name: "Label")]
public float Sentiment;

[Column(ordinal: "1")]
public string SentimentText;

Second, use the useHeader parameter in the TextLoader.CreateFrom method. Don't forget to add that to the other one for the validation data, as well.

pipeline.Add(new TextLoader(_dataPath).CreateFrom<SentimentData>(useHeader: true));

With those two updates, I got the below output. Looks like a nice model with an AUC of 85%!

回答2:

Another thing that helps with text type datasets is indicating that the text has quotes:

TextLoader("someFile.txt").CreateFrom<Input>(useHeader: true, allowQuotedStrings: true)

回答3:

There is bad formated value on 252 and 253 rows. May me there fields wich contains the delimiter charachter. If you post code or sample data we can be more precise.

来源：https://stackoverflow.com/questions/50907009/ml-net-sentiment-analysis-warning-about-format-errors-bad-values

标签

visual-studio

ml.net