Atom is slow when used with a big map

Question

I have an ETL job written in Clojure. Each thread loads a different part of a file, and it also needs to look up a key from a business key. The business-key-to-key mapping is stored in a hash map like:

{"businessKey1" 1, 
 "businessKey2" 2,
 "businessKey3" 3, 
 "businessKey4" 4, 
 "businessKey5" 5 }

When the ETL loads data from the file, it parses each line into columns. If the business key column is found in the map, it just returns the key, e.g. if it finds businessKey1, it returns 1. But if it finds businessKey6, it needs to call a web service to create a new key. I planned to use an atom, so that when a thread finds a new key, it uses the atom to modify the map. But the performance is super bad. I tested the following code; it is very slow, and there is a lot of GC activity.

(def a (atom {}))
(map #(swap! a (partial merge {% 1})) (range 10000))
(println a)

What's the best solution for this? Should I use Java's ConcurrentHashMap?

Answer 1:

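The main source of the bad performance seems to be the use of (partial merge {% 1}). swap! passes the current map as the last argument, so this evaluates (merge {% 1} current-map), which adds every existing entry to a fresh one-entry map on each call.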

A more idiomatic form is the following:

(let [a (atom {})]
  (doall (map #(swap! a merge {% 1}) (range 10000)))
  (println @a))
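The difference between the two forms comes down to argument order. A minimal sketch with illustrative values (the name `big` is an assumption standing in for the map held by the atom):

```clojure
;; Illustrative values; `big` stands in for the map held by the atom.
(def big {:a 1 :b 2})

;; What swap! evaluates with (partial merge {:c 3}): the big map comes last,
;; so all of its entries are added to the small map.
(def via-partial ((partial merge {:c 3}) big))

;; What (swap! a merge {:c 3}) evaluates: one entry is added to the big map.
(def via-merge (merge big {:c 3}))

;; Same contents for new keys, but the roles are reversed, which also
;; changes which value wins when the key already exists.
(def collide-partial ((partial merge {:a 9}) big)) ; existing value kept
(def collide-merge (merge big {:a 9}))             ; new value wins
```

Both forms produce equal maps for new keys, so the swap in argument order is easy to miss; the cost and the behaviour on key collisions are what differ.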

Even faster is to use assoc, which avoids creating a temporary map every time:

(let [a (atom {})]
  (doall (map #(swap! a assoc % 1) (range 10000)))
  (println @a))

If you want to iterate over a seq for its side effects, it is better to use doseq:

(count (let [a (atom {})]
         (doseq [r (range 10000)] (swap! a assoc r 1))
         @a))

An atom is not necessary and what you want can be expressed as a reduction:

(count (reduce (fn [m r] (assoc m r 1)) {} (range 10000)))
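If the map does have to be shared between threads, as in the question's ETL, a get-or-create helper around the atom might look like the sketch below. The names `key-cache`, `create-key!` and `key-for` are hypothetical, and `create-key!` only stands in for the web service call. The call is made outside swap! because swap! may retry its function; the trade-off is that two threads racing on the same new business key could both call the service, with only the first result being kept.

```clojure
(def key-cache (atom {}))

;; Hypothetical stand-in for the web service that creates a new key.
(def next-id (atom 0))
(defn create-key! [business-key]
  (swap! next-id inc))

(defn key-for
  "Returns the key for business-key, creating it on a cache miss."
  [business-key]
  (or (get @key-cache business-key)
      (let [new-key (create-key! business-key)]
        ;; Keep an existing entry if another thread added one first.
        (-> (swap! key-cache
                   (fn [m]
                     (if (contains? m business-key)
                       m
                       (assoc m business-key new-key))))
            (get business-key)))))

(def first-call  (key-for "businessKey1"))
(def second-call (key-for "businessKey1")) ; served from the cache
(def other-call  (key-for "businessKey2"))
```

The common case, a key that is already present, is a plain deref and a map lookup, so it does not contend on the atom at all.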

Answer 2:

You can avoid using an atom here by using Clojure reducers:

(require '[clojure.core.reducers :as r])

(defn lookup [k]
  ; do remote call, here it just returns 1  
  1)

(defn f
  ([] {})
  ([acc k] (if (get acc k)
             acc
             (assoc acc k (lookup k)))))

(r/fold merge f (vec (range 10000)))

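clojure.core.reducers/fold will automatically run this in parallel and combine the results with merge. Note that each partition is reduced with its own accumulator, so a key that appears in more than one partition may be looked up more than once.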
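If each business key should reach the remote service only once even under concurrency, the ConcurrentHashMap raised in the question is one possible route. The following is a sketch via Java interop, not a recommendation over the approaches above; `key-map`, `calls`, `counting-lookup` and `cached-key` are hypothetical names, and `counting-lookup` merely stands in for the remote call. Note that computeIfAbsent can block other updates to the map while its function runs, so a slow remote call inside it may hurt throughput.

```clojure
(import '(java.util.concurrent ConcurrentHashMap)
        '(java.util.function Function))

(def ^ConcurrentHashMap key-map (ConcurrentHashMap.))

;; Hypothetical stand-in for the remote call; counts how often it runs.
(def calls (atom 0))
(defn counting-lookup [k]
  (swap! calls inc)
  1)

(defn cached-key [k]
  ;; computeIfAbsent applies the function at most once per absent key.
  (.computeIfAbsent key-map k
                    (reify Function
                      (apply [_ k] (counting-lookup k)))))

;; Several threads asking for the same 100 keys, 10000 requests in total.
(dorun (pmap cached-key (map #(mod % 100) (range 10000))))
```

After the run, `(.size key-map)` and `@calls` should both be 100, one lookup per distinct key.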

Source: https://stackoverflow.com/questions/27964880/atom-is-slow-when-using-it-with-big-map

Tags

clojure