How to decode HTML Entities in C?

广开言路 2020-11-30 11:29

I'm interested in unescaping text, for example: &#92; maps to \ in C. Does anyone know of a good library?

For reference, see Wikipedia's list of XML and HTML character entity references.

5 answers
  • 2020-11-30 11:42
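    #include <QMap>
    #include <QString>

    // Qt (C++) version. Decodes only the five entities in the table below;
    // any other text starting with '&' is copied through unchanged.
    QString UNESC(const QString &txt) {
        static const QChar AMP = '&', SCL = ';';
        static const QMap<QString, QString> dec = {
            {"&lt;", "<"}, {"&gt;", ">"}
          , {"&amp;", "&"}, {"&quot;", R"(")"}, {"&#039;", "'"} };

        if(!txt.contains(AMP)) { return txt; }

        QString out;
        int bgn = 0, pos = 0;
        while((pos = txt.indexOf(AMP, bgn)) != -1) {
            out += txt.mid(bgn, pos - bgn);   // literal text before the '&'

            int end = txt.indexOf(SCL, pos);  // -1 if no ';' follows
            // value() does not insert missing keys, unlike operator[]
            QString val = (end == -1) ? QString()
                                      : dec.value(txt.mid(pos, end - pos + 1));

            if(val.isEmpty()) {
                out += AMP;                   // unknown entity: keep the '&'
                bgn = pos + 1;                // and rescan right after it
            } else {
                out += val;
                bgn = end + 1;
            }
        }// while((pos = txt.indexOf(AMP, bgn)) != -1)

        out += txt.mid(bgn);                  // text after the last '&'
        return out;
    }// UNESC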
    
  • 2020-11-30 11:47

    I wrote my own unescape code; very simplified, but does the job: pn_util.c

  • 2020-11-30 11:51

    I had some free time today and wrote a decoder from scratch: entities.c, entities.h.

    The only function with external linkage is

    size_t decode_html_entities_utf8(char *dest, const char *src);
    

    If src is a null pointer, the string is taken from dest, i.e., the entities are decoded in place. Otherwise, the decoded string is written to dest, which must point to a buffer large enough to hold strlen(src) + 1 characters, and src is left unchanged.

    The function will return the length of the decoded string.

    Please note that I haven't done any extensive testing, so there's a high probability of bugs...
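
    To illustrate the calling convention described above (a null src means decode in place, and the return value is the decoded length), here is a minimal sketch. It is not the code from entities.c: the name decode_entities_sketch and the tiny entity table are assumptions made for this example, and it only handles five named entities plus numeric references below U+0080, so no UTF-8 encoding is needed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for illustration only, NOT the entities.c decoder.
   Decodes five named entities and ASCII-range numeric references.
   If src is NULL, dest is decoded in place. Returns the decoded length. */
size_t decode_entities_sketch(char *dest, const char *src)
{
    static const struct { const char *name; char ch; } named[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&amp;", '&'},
        {"&quot;", '"'}, {"&apos;", '\''}
    };
    const size_t count = sizeof named / sizeof named[0];
    char *out = dest;

    assert(dest != NULL);
    if (src == NULL)
        src = dest;                 /* in-place mode */

    while (*src != '\0') {
        if (*src == '&') {
            size_t k;
            for (k = 0; k < count; k++) {
                if (strncmp(src, named[k].name, strlen(named[k].name)) == 0)
                    break;
            }
            if (k < count) {        /* named entity found in the table */
                *out++ = named[k].ch;
                src += strlen(named[k].name);
                continue;
            }
            if (src[1] == '#') {    /* numeric reference: &#NN; or &#xHH; */
                int hex = (src[2] == 'x' || src[2] == 'X');
                const char *digits = src + (hex ? 3 : 2);
                if (hex ? isxdigit((unsigned char)*digits)
                        : isdigit((unsigned char)*digits)) {
                    char *end;
                    unsigned long cp = strtoul(digits, &end, hex ? 16 : 10);
                    if (*end == ';' && cp > 0 && cp < 0x80) {
                        *out++ = (char)cp;
                        src = end + 1;
                        continue;
                    }
                }
            }
        }
        *out++ = *src++;            /* anything else is copied unchanged */
    }
    *out = '\0';
    return (size_t)(out - dest);
}
```

    Decoding in place is safe here because every entity is at least as long as its replacement, so the write position never passes the read position. A caller with a writable buffer buf would use decode_entities_sketch(buf, NULL); with a read-only input it would pass a separate dest of at least strlen(src) + 1 bytes.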

  • 2020-11-30 11:51

    For another open-source reference in C for decoding these HTML entities, check out the command-line utility uni2ascii/ascii2uni. The relevant files are enttbl.{c,h} for entity lookup and putu8.c, which converts from UTF-32 down to UTF-8.

    uni2ascii
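
    The UTF-32 to UTF-8 step matters because an entity such as &#8364; resolves to a code point (U+20AC) that must be written out as several bytes. The sketch below shows the general encoding technique; it is not taken from putu8.c, and the function name utf8_encode is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration, not the uni2ascii code.
   Encodes one Unicode code point as UTF-8 into out, which must have room
   for 4 bytes. Returns the number of bytes written, or 0 if cp is a
   surrogate (U+D800..U+DFFF) or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

    A decoder would call this after looking up or parsing the code point and append the returned bytes to its output. Since the longest UTF-8 sequence is 4 bytes and the shortest entity form is also 4 characters (for example &lt; or &#9;), the decoded text never grows beyond the input length.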

  • 2020-11-30 11:55

    Function description: converts special HTML entities back to characters. You will need to modify it to fit your requirements.

    #include <string.h>

    /* The original post did not define TITLE_SIZE; this value is a placeholder. */
    #ifndef TITLE_SIZE
    #define TITLE_SIZE 256
    #endif

    char* HtmlSpecialChars_Decode(char* encodedHtmlSpecialEntities)
    {
        /* Static buffer: the result is overwritten by the next call,
           so the function is not reentrant or thread-safe. */
        static char decodedHtmlSpecialChars[TITLE_SIZE];

        /* This mapping table can be extended if necessary. */
        static const struct {
            const char* encodedEntity;
            const char decodedChar;
        } entityToChars[] = {
            {"&lt;", '<'},
            {"&gt;", '>'},
            {"&amp;", '&'},
            {"&quot;", '"'},
            {"&#039;", '\''},
        };
        const size_t escapeArrayLen = sizeof(entityToChars) / sizeof(entityToChars[0]);
        const char* in = encodedHtmlSpecialEntities;
        size_t out = 0;

        /* Nothing to decode: return the input itself. */
        if(strchr(in, '&') == NULL)
            return encodedHtmlSpecialEntities;

        /* Single pass: copy characters and replace each known entity.
           Output is truncated instead of overflowing the buffer. */
        while(*in != '\0' && out < TITLE_SIZE - 1)
        {
            size_t j = escapeArrayLen;

            if(*in == '&')
            {
                /* Potential encoded char: look for a matching entity. */
                for(j = 0; j < escapeArrayLen; j++)
                {
                    if(strncmp(in, entityToChars[j].encodedEntity, strlen(entityToChars[j].encodedEntity)) == 0)
                        break;
                }
            }

            if(j < escapeArrayLen)
            {
                decodedHtmlSpecialChars[out++] = entityToChars[j].decodedChar;
                in += strlen(entityToChars[j].encodedEntity);
            }
            else
                decodedHtmlSpecialChars[out++] = *in++;
        }

        decodedHtmlSpecialChars[out] = '\0';
        return decodedHtmlSpecialChars;
    }
