encoding

Java - read UTF-8 file with a single emoji symbol

女生的网名这么多〃 submitted on 2020-08-05 09:38:39
Question: I have a file with a single Unicode symbol. The file is encoded in UTF-8 and contains one symbol, U+1F60A, represented as the 4 bytes F0 9F 98 8A (https://www.fileformat.info/info/unicode/char/1f60a/index.htm). When I read the file I get two symbols/chars. The program below prints ? 2 ? ? 55357 56842 ====================================== �� 16 & ====================================== ? 2 ? ====================================== Is this normal, or a bug? Or am I misusing something? How…
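For context, the behaviour described is Java's UTF-16 string model rather than a bug: U+1F60A lies outside the Basic Multilingual Plane, so it is one code point but two Java chars (the surrogate pair 55357/56842 visible in the output). A minimal sketch, assuming a hypothetical file emoji.txt holding the 4 bytes above (the asker's own program is not shown):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmojiRead {
    public static void main(String[] args) throws Exception {
        // Read the whole file as UTF-8 (hypothetical file name).
        String s = Files.readString(Path.of("emoji.txt"), StandardCharsets.UTF_8);
        System.out.println(s.length());                       // 2: UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1: actual code points
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f60a
    }
}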

Use accent sensitive primary key in MySQL

时光怂恿深爱的人放手 submitted on 2020-08-05 06:01:36
Question: Desired result: have an accent-sensitive primary key in MySQL. I have a table of unique words, so I use the word itself as the primary key (by the way, if someone can give me advice about that, I have no idea whether it is good design/practice or not). I need that field to be accent (and, why not, case) sensitive, because it must distinguish between, for instance, 'demandé' and 'demande', two different inflections of the French verb "demander". I do not have any problem storing accented words in…
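A minimal sketch of one way to get this (table and sample data are illustrative): MySQL compares strings, and therefore enforces PRIMARY KEY uniqueness, according to the column's collation, so an accent- and case-sensitive collation is enough:

-- A binary collation compares code points directly, so the PRIMARY KEY
-- uniqueness check becomes accent- and case-sensitive.
CREATE TABLE words (
  word VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
  PRIMARY KEY (word)
);
-- On MySQL 8.0+, COLLATE utf8mb4_0900_as_cs is an alternative that is
-- accent- and case-sensitive without being fully binary.
INSERT INTO words (word) VALUES ('demande'), ('demandé');  -- two distinct rows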

How can I create and fit vocab.bpe file (GPT and GPT2 OpenAI models) with my own corpus text?

无人久伴 submitted on 2020-08-05 05:23:31
Question: This question is for those who are familiar with the GPT or GPT2 OpenAI models, in particular with the encoding task (Byte-Pair Encoding). This is my problem: I would like to know how I could create my own vocab.bpe file. I have a Spanish corpus text that I would like to use to fit my own BPE encoder. I have succeeded in creating the encoder.json with the python-bpe library, but I have no idea how to obtain the vocab.bpe file. I have reviewed the code in gpt-2/src/encoder.py but I have not…
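A sketch of one way to produce both files, swapping in the Hugging Face tokenizers package for python-bpe (an assumption, since the asker's setup is only partly shown). GPT-2's vocab.bpe is the list of BPE merge rules, which this library writes as merges.txt next to vocab.json:

from tokenizers import ByteLevelBPETokenizer

# Train a GPT-2-style byte-level BPE on the corpus (path is hypothetical).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["spanish_corpus.txt"],
    vocab_size=50257,      # GPT-2's vocabulary size
    min_frequency=2,
)
# Writes vocab.json (the encoder.json equivalent) and merges.txt (the
# merge-rule list that vocab.bpe contains) into the current directory.
tokenizer.save_model(".")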

Swift Decode [String: Any]

主宰稳场 submitted on 2020-08-03 06:01:09
Question: So I have this API that returns a dictionary of [String: Any]. I know that what comes as Any is Decodable or an array of Decodable; however, I can't for the life of me figure out how to take that dictionary and decode it into some struct. What I have goes basically like this: public func call<T: Codable>(completion handler: @escaping (T?) -> ()) { let promise = api.getPromise() promise.done(on: DispatchQueue.main, { (results: [String: Any]) in let decodedResults: T? = results.decode(as: T.self) /…
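The decode(as:) helper in the excerpt is not a standard API; a minimal sketch of how such a helper is usually written, round-tripping the dictionary through JSONSerialization so that JSONDecoder can handle it:

import Foundation

extension Dictionary where Key == String, Value == Any {
    // Serialize the dictionary back to JSON data, then let JSONDecoder
    // decode it into the requested Decodable type.
    func decode<T: Decodable>(as type: T.Type) -> T? {
        guard let data = try? JSONSerialization.data(withJSONObject: self) else {
            return nil
        }
        return try? JSONDecoder().decode(type, from: data)
    }
}

// Usage, with a struct that mirrors the (hypothetical) API payload:
struct User: Codable { let name: String }
let results: [String: Any] = ["name": "Ada"]
let user = results.decode(as: User.self)  // User(name: "Ada")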

How can I globally ignore invalid byte sequences in UTF-8 strings?

纵然是瞬间 submitted on 2020-08-01 01:05:22
Question: I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it, to keep backwards compatibility. I can't know the input encoding. Example: > "- Men\xFC -".split("n") ArgumentError: invalid byte sequence in UTF-8 from (irb):4:in `split' from (irb):4 from /home/fotanus/.rvm/rubies/ruby-2.0.0-rc2/bin/irb:16:in `<main>' I can overcome this problem in one line, by using the following, for example: > "- Men\xFC -".unpack(…
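For reference, the usual per-string one-liners (truly global behaviour would mean monkey-patching String itself, which is risky); String#scrub needs Ruby 2.1+, while the re-encoding round-trip also works on 2.0:

bad = "- Men\xFC -"

# Drop (or replace) the bytes that are invalid in UTF-8 (Ruby 2.1+):
bad.scrub("")                                          # => "- Men -"

# Round-trip through another encoding so invalid bytes must be handled:
bad.encode("UTF-16", invalid: :replace, replace: "").encode("UTF-8")  # => "- Men -"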

Wrong encoding when importing JSON with Polish characters in JavaScript

旧街凉风 submitted on 2020-07-23 06:36:44
Question: I have the below JSON file "locations.json": { "lubelskie": [ "abramów", "adamów", "aleksandrów", "annopol", "baranów", "batorz", "bełżec", "bełżyce" ] } I import the JSON into my class using the statement below: import locations from "./locations.json"; class areas { constructor() { console.log(locations); } } export default areas; The console output I get is below: { lubelskie: ["abramów", "adamów", "aleksandrów", "annopol", "baranów", "batorz", "bełżec", "bełżyce"] } The problem…
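A minimal sketch of the usual first check, assuming a Node.js context (the asker's bundler setup is not shown): read the file explicitly as UTF-8 instead of relying on import-time detection. If the file itself was saved in another encoding such as Windows-1250, it has to be re-saved as UTF-8; no read-side option recovers that losslessly.

import { readFileSync } from "node:fs";

// Parse the file with an explicit encoding.
const locations = JSON.parse(readFileSync("./locations.json", "utf8"));
console.log(locations.lubelskie[6]); // "bełżec" once the file is genuine UTF-8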

How to convert a UTF-8 file to CP1252 on Unix

南楼画角 submitted on 2020-07-22 12:47:05
Question: I'm trying to convert a txt file's encoding from UTF-8 to ANSI (CP1252). I need this because the file is used in a fixed-position Oracle import (external table) which apparently only supports CP1252; if I import a UTF-8 file, some special characters turn up as two incorrect characters instead. I'm working on a Unix machine (my OS is HP-UX). I have been looking for an answer on the web but I can't find any way to do this conversion. For example, the POSIX iconv command doesn't offer this choice,…
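A sketch with iconv, noting that codeset names vary by platform (HP-UX's native iconv may not ship a CP1252 table at all; iconv -l lists what is available, and GNU libiconv can be installed where it is missing):

# Convert, failing on any character that has no CP1252 equivalent:
iconv -f UTF-8 -t CP1252 input.txt > output.txt

# GNU iconv only: approximate untranslatable characters instead of failing.
iconv -f UTF-8 -t CP1252//TRANSLIT input.txt > output.txt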