Language detection with data in PostgreSQL

我在风中等你 2020-12-31 15:17

I have a table in PostgreSQL with a text column. I need a library or tool that can identify the language of each text value, for testing purposes.

There is no need for

6 answers
  •  再見小時候
    2020-12-31 15:53

    You can use PL/Perl (CREATE FUNCTION langof(text) LANGUAGE plperlu AS ...) with the Lingua::Identify CPAN module.

    Perl script:

    #!/usr/bin/perl
    use Lingua::Identify qw(langof);
    undef $/;                             # slurp mode: read all input at once
    my $textstring = <>;                  # warning: slurps the whole file into memory
    my $lang = langof( $textstring );     # gives the most probable language
    print "$lang\n";
    

    And the function:

    create or replace function langof( text ) returns varchar(2)
    immutable returns null on null input
    language plperlu as $perlcode$
        use Lingua::Identify qw(langof);
        return langof( shift );
    $perlcode$;
    

    Works for me:

    filip@filip=# select langof('Pójdź, kiń-że tę chmurność w głąb flaszy');
     langof
    --------
     pl
    (1 row)
    
    Time: 1.801 ms
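
    To run this over a whole table rather than a single literal, something like the following should work. This is only a sketch: the table documents and its columns id and body are made-up names for illustration, so substitute your own.

    ```sql
    -- Hypothetical table "documents" with columns "id" and "body" (text).
    -- Show a short sample of each row next to its detected language.
    SELECT id,
           left(body, 40) AS sample,
           langof(body)   AS lang
    FROM   documents
    LIMIT  20;

    -- Count how many rows were assigned to each detected language.
    SELECT langof(body) AS lang,
           count(*)     AS n
    FROM   documents
    GROUP  BY 1
    ORDER  BY n DESC;
    ```

    Because the function is declared with "returns null on null input", rows where body is NULL come back with a NULL language instead of raising an error.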
    

    PL/Perl on Windows

    The PL/Perl language library (plperl.dll) comes preinstalled with the latest Windows installer for PostgreSQL.

    But to use PL/Perl, you also need the Perl interpreter itself, specifically Perl 5.14 (at the time of this writing). The most common installer is ActiveState's, but it's not free. A free one comes from Strawberry Perl. Make sure you have PERL514.DLL in place.

    After installing Perl, log in to your PostgreSQL database and try to run

    CREATE LANGUAGE plperlu;
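
    On PostgreSQL 9.1 and later, the extension form is the more usual way to install the language, and you can check the system catalog afterwards to confirm that it registered. A minimal sketch, assuming you are connected as a superuser (plperlu is an untrusted language):

    ```sql
    -- Extension form of the install; does nothing if it is already present.
    CREATE EXTENSION IF NOT EXISTS plperlu;

    -- Confirm the language is registered in this database.
    SELECT lanname, lanpltrusted
    FROM   pg_language
    WHERE  lanname = 'plperlu';
    ```

    If the SELECT returns no row, the language was not installed in the current database.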
    

    Language identification library

    If quality is your concern, you have some options: you can improve Lingua::Identify yourself (it's open source), or you could try another library. I found this one, which is commercial but looks promising.
