optimization

How to print the “actual” learning rate in Adadelta in pytorch

試著忘記壹切 submitted on 2021-01-27 12:41:08
Question: In short: I can't draw the lr/epoch curve when using the Adadelta optimizer in PyTorch, because optimizer.param_groups[0]['lr'] always returns the same value. In detail: Adadelta dynamically adapts over time using only first-order information and has minimal computational overhead beyond vanilla stochastic gradient descent [1]. In PyTorch, the source code of Adadelta is here: https://pytorch.org/docs/stable/_modules/torch/optim/adadelta.html#Adadelta Since it requires no manual tuning of learning
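For context (a sketch based on the Adadelta paper [1] and the linked PyTorch source, not part of the original excerpt): the value stored in param_groups[0]['lr'] is only a constant scale factor, so it never changes; the quantity that actually adapts lives in the per-parameter optimizer state (square_avg and acc_delta in PyTorch's implementation). In the paper's notation:

```latex
\begin{aligned}
E[g^2]_t        &= \rho\,E[g^2]_{t-1} + (1-\rho)\,g_t^2 \\
\Delta x_t      &= -\frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\; g_t \\
E[\Delta x^2]_t &= \rho\,E[\Delta x^2]_{t-1} + (1-\rho)\,(\Delta x_t)^2
\end{aligned}
```

So the per-parameter quantity that plays the role of a learning rate is lr * sqrt(acc_delta + eps) / sqrt(square_avg + eps), which has to be read out of optimizer.state rather than out of param_groups.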

What GCC optimization flags and techniques are safe across CPUs?

烈酒焚心 submitted on 2021-01-27 12:16:59
Question: When compiling/linking C/C++ libraries or programs that are meant to work on all implementations of an ISA (e.g. x86-64), which optimization flags are safe from the correctness and run-time performance perspectives? I want optimizations that yield correct results and won't be detrimental performance-wise on any particular CPU. E.g., I would like to avoid optimization flags that yield run-time performance improvements on an 8th-gen Intel Core i7, but result in performance degradation on an AMD
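For illustration (a sketch under assumptions, not taken from the question or its answers): a common pattern is to build everything with portable flags such as -O2 or -O3 plus the default -march=x86-64 -mtune=generic, and to confine CPU-specific code generation to hot functions via GCC's function multi-versioning, so one binary stays correct and reasonably fast on any x86-64 CPU:

```cpp
// Hypothetical example; the function and its parameters are made up.
// Whole-program flags stay portable, e.g.:  g++ -O3 -march=x86-64 -mtune=generic
#include <cstddef>

__attribute__((target_clones("default", "avx2")))
void scale(float* x, std::size_t n, float k) {
    // GCC emits a baseline x86-64 clone and an AVX2 clone of this function and
    // picks one at program load time (via an ifunc resolver), so CPUs without
    // AVX2 still run correct code while newer CPUs get the wider vectors.
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= k;
}
```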

Timeout for Z3 Optimize

守給你的承諾、 submitted on 2021-01-27 11:59:44
Question: How do you set a timeout for the Z3 optimizer such that it will give you the best known solution when it runs out of time? from z3 import * s = Optimize() # Hard Problem print(s.check()) print(s.model()) Follow-up question: can you set Z3 to do randomized hill climbing, or does it always perform a complete search? Answer 1: Long story short, you can't. That's simply not how the optimizer works. That is, it doesn't find a solution and then try to improve it. If you interrupt it or set a time-out,
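As an illustration of the timeout part (a sketch using the C++ API, with an assumed parameter name; the constraint and objective below are invented stand-ins for the "Hard Problem"): a timeout can be set in milliseconds, and when check() comes back unknown the bounds the optimizer has reached so far can still be queried, even though a best-known model is not guaranteed:

```cpp
// Sketch only: "timeout" (in ms) is passed as a solver parameter.
#include <iostream>
#include "z3++.h"

int main() {
    z3::context c;
    z3::optimize opt(c);

    z3::expr x = c.int_const("x");
    opt.add(x < 100);
    z3::optimize::handle h = opt.maximize(x);

    z3::params p(c);
    p.set("timeout", 5000u);              // give up after 5 seconds
    opt.set(p);

    std::cout << opt.check() << "\n";     // may print "unknown" on timeout
    std::cout << "lower bound so far: " << opt.lower(h) << "\n"
              << "upper bound so far: " << opt.upper(h) << "\n";
}
```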

Multiply-add vectorization slower with AVX than with SSE

雨燕双飞 submitted on 2021-01-27 06:01:40
Question: I have a piece of code that runs under a heavily contended lock, so it needs to be as fast as possible. The code is very simple: it's a basic multiply-add on a bunch of data which looks like this: for( int i = 0; i < size; i++ ) { c[i] += (double)a[i] * (double)b[i]; } Under -O3 with SSE support enabled, the code is vectorized as I would expect. However, with AVX code generation turned on I get about a 10-15% slowdown instead of a speedup, and I can't figure out why. Here's
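One thing worth ruling out in cases like this (an assumption for illustration, not the confirmed cause for this code): with 256-bit AVX vectors, buffers that are only 16-byte aligned make many loads and stores split across cache-line boundaries, which can erase the AVX gain. A sketch that removes that variable, assuming a and b are float and c is double:

```cpp
// Sketch: allocate the arrays 32-byte aligned (e.g. with std::aligned_alloc(32, bytes))
// and promise that alignment to the compiler, so 256-bit loads/stores do not
// straddle cache lines. Element types are assumed from the casts in the question.
#include <cstddef>

void muladd(const float* a, const float* b, double* c, int size) {
    const float* pa = static_cast<const float*>(__builtin_assume_aligned(a, 32));
    const float* pb = static_cast<const float*>(__builtin_assume_aligned(b, 32));
    double*      pc = static_cast<double*>(__builtin_assume_aligned(c, 32));
    for (int i = 0; i < size; i++)
        pc[i] += (double)pa[i] * (double)pb[i];
}
```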

Performance of gzipped json vs efficient binary serialization

依然范特西╮ submitted on 2021-01-27 04:59:42
Question: JSON plus gzip is a simple way to serialize data, and both are widely implemented across programming languages. This representation is also portable across systems (is it?). My question is whether json+gzip is good enough (less than 2x cost) compared to very efficient binary serialization methods. I'm looking at space and time costs while serializing various kinds of data. Answer 1: Serialising with json+gzip uses 25% more space than rawbytes+gzip for numbers and objects. For limited precision

Declaring an empty destructor prevents the compiler from calling memmove() for copying contiguous objects

别来无恙 submitted on 2021-01-27 04:55:26
Question: Consider the following definition of Foo: struct Foo { uint64_t data; }; Now, consider the following definition of Bar, which has the same data member as Foo, but has an empty user-declared destructor: struct Bar { ~Bar(){} // <-- empty user-declared dtor uint64_t data; }; Using gcc 8.2 with -O2, the function copy_foo(): void copy_foo(const Foo* src, Foo* dst, size_t len) { std::copy(src, src + len, dst); } results in the following assembly code: copy_foo(Foo const*, Foo*, size_t): salq
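For context (a sketch, using a hypothetical Baz variant that is not in the question): the empty user-provided destructor makes Bar non-trivially-copyable, and std::copy can only be lowered to memmove() for trivially copyable element types; defaulting the destructor on its first declaration keeps the type trivially copyable:

```cpp
#include <cstdint>
#include <type_traits>

struct Foo { std::uint64_t data; };                      // trivially copyable
struct Bar { ~Bar() {} std::uint64_t data; };            // user-provided dtor
struct Baz { ~Baz() = default; std::uint64_t data; };    // hypothetical variant

static_assert(std::is_trivially_copyable<Foo>::value, "std::copy may use memmove");
static_assert(!std::is_trivially_copyable<Bar>::value, "falls back to a copy loop");
static_assert(std::is_trivially_copyable<Baz>::value, "memmove is possible again");
```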

How does loop address alignment affect the speed on Intel x86_64?

一世执手 submitted on 2021-01-27 04:13:29
Question: I'm seeing a 15% performance degradation of the same C++ code compiled to exactly the same machine instructions but located at differently aligned addresses. When my tiny main loop starts at 0x415220 it's faster than when it is at 0x415250. I'm running this on an Intel Core 2 Duo. I use gcc 4.4.5 on x86_64 Ubuntu. Can anybody explain the cause of the slowdown and how I can force gcc to optimally align the loop? Here is the disassembly for both cases with profiler annotation: 415220 576 12.56%
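On the "how do I force gcc to align the loop" part (a sketch; the flag values and the toy loop below are illustrative, and whether they help on a given Core 2 has to be measured): GCC has -falign-loops=N and -falign-functions=N, so one experiment is to rebuild with those and time the same loop at the padded addresses:

```cpp
// Hypothetical experiment: compile once with plain -O2 and once with
//   g++ -O2 -falign-loops=16 -falign-functions=16
// then compare timings; the loop below is only a stand-in for the real hot loop.
#include <chrono>
#include <cstdio>

__attribute__((noinline))
long hot_loop(long n) {
    long acc = 0;
    for (long i = 0; i < n; ++i)      // the loop whose start address matters
        acc += i ^ (acc >> 3);
    return acc;
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    long r = hot_loop(200000000L);
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("result=%ld time=%lld ms\n", r, static_cast<long long>(ms));
}
```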