encoding

Java - read UTF-8 file with a single emoji symbol

女生的网名这么多〃 submitted on 2020-08-05 09:38:39
Question: I have a file with a single Unicode symbol. The file is encoded in UTF-8 and contains one symbol, U+1F60A, represented as the 4 bytes F0 9F 98 8A (https://www.fileformat.info/info/unicode/char/1f60a/index.htm). When I read the file I get two symbols/chars. The program below prints ? 2 ? ? 55357 56842 ====================================== �� 16 & ====================================== ? 2 ? ====================================== Is this normal, or a bug? Or am I misusing something? How…
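For context, the behaviour described is Java's UTF-16 string model rather than a bug: U+1F60A lies outside the Basic Multilingual Plane, so it is one code point but two Java chars (the surrogate pair 55357/56842 visible in the output). A minimal sketch, assuming a hypothetical file emoji.txt holding the 4 bytes above (the asker's own program is not shown):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmojiRead {
    public static void main(String[] args) throws Exception {
        // Read the whole file as UTF-8 (hypothetical file name).
        String s = Files.readString(Path.of("emoji.txt"), StandardCharsets.UTF_8);
        System.out.println(s.length());                       // 2: UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1: actual code points
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f60a
    }
}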

Use accent sensitive primary key in MySQL

时光怂恿深爱的人放手 submitted on 2020-08-05 06:01:36
Question: Desired result: have an accent-sensitive primary key in MySQL. I have a table of unique words, so I use the word itself as the primary key (by the way, if someone can give me advice about that, I have no idea whether it is good design/practice or not). I need that field to be accent (and, why not, case) sensitive, because it must distinguish between, for instance, 'demandé' and 'demande', two different inflections of the French verb "demander". I do not have any problem storing accented words in…
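A minimal sketch of one way to get this (table and sample data are illustrative): MySQL compares strings, and therefore enforces PRIMARY KEY uniqueness, according to the column's collation, so an accent- and case-sensitive collation is enough:

-- A binary collation compares code points directly, so the PRIMARY KEY
-- uniqueness check becomes accent- and case-sensitive.
CREATE TABLE words (
  word VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
  PRIMARY KEY (word)
);
-- On MySQL 8.0+, COLLATE utf8mb4_0900_as_cs is an alternative that is
-- accent- and case-sensitive without being fully binary.
INSERT INTO words (word) VALUES ('demande'), ('demandé');  -- two distinct rows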

How can I create and fit vocab.bpe file (GPT and GPT2 OpenAI models) with my own corpus text?

无人久伴 submitted on 2020-08-05 05:23:31
Question: This question is for those who are familiar with the GPT or GPT2 OpenAI models, in particular with the encoding task (Byte-Pair Encoding). This is my problem: I would like to know how I could create my own vocab.bpe file. I have a Spanish corpus text that I would like to use to fit my own BPE encoder. I have succeeded in creating the encoder.json with the python-bpe library, but I have no idea how to obtain the vocab.bpe file. I have reviewed the code in gpt-2/src/encoder.py but I have not…
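A sketch of one way to produce both files, swapping in the Hugging Face tokenizers package for python-bpe (an assumption, since the asker's setup is only partly shown). GPT-2's vocab.bpe is the list of BPE merge rules, which this library writes as merges.txt next to vocab.json:

from tokenizers import ByteLevelBPETokenizer

# Train a GPT-2-style byte-level BPE on the corpus (path is hypothetical).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["spanish_corpus.txt"],
    vocab_size=50257,      # GPT-2's vocabulary size
    min_frequency=2,
)
# Writes vocab.json (the encoder.json equivalent) and merges.txt (the
# merge-rule list that vocab.bpe contains) into the current directory.
tokenizer.save_model(".")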

Swift Decode [String: Any]

主宰稳场 submitted on 2020-08-03 06:01:09
Question: So I have this API that returns a dictionary of [String: Any]. I know that what comes as Any is Decodable or an array of Decodable; however, I can't for the life of me figure out how to take that dictionary and decode it into some struct. What I have goes basically like this: public func call<T: Codable>(completion handler: @escaping (T?) -> ()) { let promise = api.getPromise() promise.done(on: DispatchQueue.main, { (results: [String: Any]) in let decodedResults: T? = results.decode(as: T.self) /…
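The decode(as:) helper in the excerpt is not a standard API; a minimal sketch of how such a helper is usually written, round-tripping the dictionary through JSONSerialization so that JSONDecoder can handle it:

import Foundation

extension Dictionary where Key == String, Value == Any {
    // Serialize the dictionary back to JSON data, then let JSONDecoder
    // decode it into the requested Decodable type.
    func decode<T: Decodable>(as type: T.Type) -> T? {
        guard let data = try? JSONSerialization.data(withJSONObject: self) else {
            return nil
        }
        return try? JSONDecoder().decode(type, from: data)
    }
}

// Usage, with a struct that mirrors the (hypothetical) API payload:
struct User: Codable { let name: String }
let results: [String: Any] = ["name": "Ada"]
let user = results.decode(as: User.self)  // User(name: "Ada")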

How can I globally ignore invalid byte sequences in UTF-8 strings?

纵然是瞬间 submitted on 2020-08-01 01:05:22
Question: I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it, to keep backwards compatibility. I can't know the input encoding. Example: > "- Men\xFC -".split("n") ArgumentError: invalid byte sequence in UTF-8 from (irb):4:in `split' from (irb):4 from /home/fotanus/.rvm/rubies/ruby-2.0.0-rc2/bin/irb:16:in `<main>' I can overcome this problem in one line, by using the following, for example: > "- Men\xFC -".unpack(…
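For reference, the usual per-string one-liners (truly global behaviour would mean monkey-patching String itself, which is risky); String#scrub needs Ruby 2.1+, while the re-encoding round-trip also works on 2.0:

bad = "- Men\xFC -"

# Drop (or replace) the bytes that are invalid in UTF-8 (Ruby 2.1+):
bad.scrub("")                                          # => "- Men -"

# Round-trip through another encoding so invalid bytes must be handled:
bad.encode("UTF-16", invalid: :replace, replace: "").encode("UTF-8")  # => "- Men -"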

Wrong encoding when importing JSON with Polish characters in JavaScript

旧街凉风 submitted on 2020-07-23 06:36:44
Question: I have the below JSON file "locations.json": { "lubelskie": [ "abramów", "adamów", "aleksandrów", "annopol", "baranów", "batorz", "bełżec", "bełżyce" ] } I import the JSON into my class using the statement below: import locations from "./locations.json"; class areas { constructor() { console.log(locations); } } export default areas; The console output I get is below: { lubelskie: ["abramów", "adamów", "aleksandrów", "annopol", "baranów", "batorz", "bełżec", "bełżyce"] } The problem…
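A minimal sketch of the usual first check, assuming a Node.js context (the asker's bundler setup is not shown): read the file explicitly as UTF-8 instead of relying on import-time detection. If the file itself was saved in another encoding such as Windows-1250, it has to be re-saved as UTF-8; no read-side option recovers that losslessly.

import { readFileSync } from "node:fs";

// Parse the file with an explicit encoding.
const locations = JSON.parse(readFileSync("./locations.json", "utf8"));
console.log(locations.lubelskie[6]); // "bełżec" once the file is genuine UTF-8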

How to convert a UTF-8 file to CP1252 on Unix

南楼画角 submitted on 2020-07-22 12:47:05
Question: I'm trying to convert a txt file's encoding from UTF-8 to ANSI (CP1252). I need this because the file is used in a fixed-position Oracle import (external table) which apparently only supports CP1252; if I import a UTF-8 file, some special characters turn up as two incorrect characters instead. I'm working on a Unix machine (my OS is HP-UX). I have been looking for an answer on the web but I can't find any way to do this conversion. For example, the POSIX iconv command doesn't offer this choice,…
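A sketch with iconv, noting that codeset names vary by platform (HP-UX's native iconv may not ship a CP1252 table at all; iconv -l lists what is available, and GNU libiconv can be installed where it is missing):

# Convert, failing on any character that has no CP1252 equivalent:
iconv -f UTF-8 -t CP1252 input.txt > output.txt

# GNU iconv only: approximate untranslatable characters instead of failing.
iconv -f UTF-8 -t CP1252//TRANSLIT input.txt > output.txt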