How does GitHub figure out a project's language?

Backend  Unresolved  5  1593
小蘑菇 2020-12-04 11:41

I was recently working on a GitHub project in both JavaScript and C++, and noticed that GitHub tagged the project as C++. If you have to pick a single language, this is prob

5 answers
  •  野趣味 (OP)
    2020-12-04 12:17

    First, know that you can override the language detected for files in your repository using Linguist overrides.

    Now, in a nutshell,

    1. Each repository is tagged with the first language from language statistics.
    2. Language statistics count the total size of files for each detected programming or markup language. Vendored, documentation, and generated files are not counted.
    3. The language of each file is detected by the open source project Linguist.
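    The three points above can be sketched in a few lines of Python. This is only an illustration, not Linguist's code: the tuple layout, the `counted` flag, and the function names `language_stats` and `primary_language` are assumptions made for the example.

```python
from collections import defaultdict

def language_stats(files):
    """Sum file sizes per language, skipping files that are not counted.

    `files` is a list of (path, language, size_in_bytes, counted) tuples.
    `counted` is False for vendored, documentation, and generated files.
    This layout is hypothetical and chosen only for the example.
    """
    totals = defaultdict(int)
    for path, language, size, counted in files:
        if counted and language is not None:
            totals[language] += size
    return dict(totals)

def primary_language(files):
    """Return the language with the largest total size, or None."""
    totals = language_stats(files)
    if not totals:
        return None
    return max(totals, key=totals.get)

files = [
    ("src/engine.cpp", "C++", 48_000, True),
    ("src/engine.h", "C++", 6_000, True),
    ("web/app.js", "JavaScript", 30_000, True),
    ("web/vendor/jquery.js", "JavaScript", 90_000, False),  # vendored, ignored
]
print(language_stats(files))
print(primary_language(files))
```

    With these made-up sizes the repository comes out as C++, because the large vendored JavaScript file is left out of the totals.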

    How does Linguist detect languages?

    Linguist relies on the following strategies, in order, and returns the language as soon as it finds a perfect match (a strategy that returns a single language).

    1. Look for Emacs and Vim modelines.
    2. Known filename. Some filenames are associated with specific languages (think Makefile).
    3. Look for a shebang. A file with a #!/bin/bash shebang will be classified as Shell.
    4. Known file extension. Languages have a set of extensions associated with them. There are, however, lots of conflicts with this strategy. The conflicting results (think C++, C and Objective-C for .h) are refined by the subsequent strategies.
    5. A set of heuristic rules. They usually rely on regular expressions over the content of files to try and identify the language (e.g., ^[^#]+:- for Prolog).
    6. A naive Bayesian classifier trained on sample files. Last strategy, lowest accuracy. The Bayesian classifier always takes a subset of languages as input; it is not meant to classify among all languages. The best match found by the classifier is returned.
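    As a rough sketch of how such a cascade can work, the Python below runs a few strategies in order, returns as soon as one yields a single language, and otherwise passes the narrowed candidate list to the next strategy. The lookup tables and function names are toy assumptions, modelines and the Bayesian classifier are left out, and unresolved files simply return None here.

```python
import os
import re

# Toy lookup tables, assumed for the example; Linguist's real data is far larger.
FILENAMES = {"Makefile": ["Makefile"]}
INTERPRETERS = {"bash": ["Shell"], "sh": ["Shell"], "python3": ["Python"]}
EXTENSIONS = {
    ".h": ["C", "C++", "Objective-C"],
    ".pl": ["Perl", "Prolog"],
    ".py": ["Python"],
}

def by_filename(name, content, candidates):
    return FILENAMES.get(os.path.basename(name), [])

def by_shebang(name, content, candidates):
    first = content.splitlines()[0] if content else ""
    m = re.match(r"#!\s*(\S+)(?:\s+(\S+))?", first)
    if not m:
        return []
    prog = os.path.basename(m.group(1))
    if prog == "env" and m.group(2):
        prog = m.group(2)  # "#!/usr/bin/env python3" -> "python3"
    return INTERPRETERS.get(prog, [])

def by_extension(name, content, candidates):
    return EXTENSIONS.get(os.path.splitext(name)[1], [])

def by_heuristics(name, content, candidates):
    # One illustrative rule, using the Prolog regex quoted in step 5.
    if "Prolog" in candidates and re.search(r"^[^#]+:-", content, re.MULTILINE):
        return ["Prolog"]
    return []

STRATEGIES = [by_filename, by_shebang, by_extension, by_heuristics]

def detect(name, content):
    candidates = []
    for strategy in STRATEGIES:
        result = strategy(name, content, candidates)
        if len(result) == 1:
            return result[0]      # perfect match: stop here
        if result:
            candidates = result   # several matches: let later strategies refine
    return None                   # the real tool would fall back to the classifier

print(detect("run", "#!/bin/bash\necho hi\n"))
print(detect("facts.pl", "parent(X, Y) :- father(X, Y).\n"))
print(detect("util.h", "int f(void);\n"))
```

    The `.pl` file shows the refinement step: the extension leaves two candidates, and the heuristic rule picks one of them. The `.h` file stays ambiguous in this sketch because no heuristic is defined for it.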

    What are vendored and documentation files?

    Linguist considers some files as vendored, meaning they are not included in language statistics. These include third-party libraries such as jQuery and are defined in the vendor.yml configuration file. You can also vendor or unvendor files in your repository using Linguist overrides.

    Similarly, documentation files are defined in documentation.yml and can be changed using Linguist overrides.
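    To make the idea concrete, here is a minimal Python sketch that treats both lists as regular expressions matched against file paths. The patterns below are hypothetical stand-ins written for this example, not copies of the entries in vendor.yml or documentation.yml.

```python
import re

# Hypothetical patterns in the spirit of vendor.yml and documentation.yml.
VENDOR_PATTERNS = [
    r"(^|/)vendor/",
    r"(^|/)node_modules/",
    r"(^|/)jquery([^.]*)\.js$",
]
DOC_PATTERNS = [
    r"(^|/)docs?/",
    r"(^|/)README(\.|$)",
]

def matches_any(path, patterns):
    return any(re.search(p, path) for p in patterns)

def is_counted(path):
    """True if the path is neither vendored nor documentation."""
    return not (matches_any(path, VENDOR_PATTERNS) or matches_any(path, DOC_PATTERNS))

for path in ["src/main.cpp", "web/vendor/lib.js", "js/jquery.js", "docs/index.md"]:
    print(path, is_counted(path))
```

    Only `src/main.cpp` is counted in this example; the other three paths match one of the assumed patterns.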

    How are generated files detected?

    Linguist relies on simple rules to detect generated files, using both the paths and the content of files. Generated files are not counted in language statistics and are not displayed in diffs on github.com.
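    A check that combines a path rule with a content rule might look like the Python sketch below. Both rules are assumptions picked for illustration (a minified-file suffix and a marker comment near the top of the file); Linguist's actual rule set is larger and may differ.

```python
def looks_generated(path, content):
    """Guess whether a file is generated, using one path rule and one content rule."""
    # Path-based rule (hypothetical): minified bundles.
    if path.endswith((".min.js", ".min.css")):
        return True
    # Content-based rule (hypothetical): a marker in the first few lines.
    head = "\n".join(content.splitlines()[:5]).lower()
    return "do not edit" in head or "generated by" in head

print(looks_generated("dist/app.min.js", ""))
print(looks_generated("api.pb.go", "// Code generated by protoc-gen-go. DO NOT EDIT.\n"))
print(looks_generated("main.go", "package main\n"))
```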

    What about programming and markup languages?

    In Linguist, each language is given a type. These types can be found in the main configuration file, languages.yml. Only the programming and markup languages are counted in statistics.
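    As a small illustration, the Python sketch below filters per-language sizes by type. The `LANGUAGE_TYPES` table is a tiny hand-written excerpt assumed for the example, and `counted_languages` is a hypothetical helper name.

```python
# Assumed excerpt of language types; the full list lives in languages.yml.
LANGUAGE_TYPES = {
    "C++": "programming",
    "JavaScript": "programming",
    "HTML": "markup",
    "JSON": "data",
    "Markdown": "prose",
}
COUNTED_TYPES = {"programming", "markup"}

def counted_languages(sizes):
    """Keep only languages whose type is counted in statistics."""
    return {
        lang: size
        for lang, size in sizes.items()
        if LANGUAGE_TYPES.get(lang) in COUNTED_TYPES
    }

sizes = {"C++": 54_000, "HTML": 2_000, "JSON": 120_000, "Markdown": 8_000}
print(counted_languages(sizes))
```

    Here the large JSON file and the Markdown file drop out, so they cannot affect which language the repository is tagged with.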
