Count the number of rows that contain a letter/number

此生再无相见时 提交于 2019-12-06 01:35:31

This query should do the job:

Test case:

CREATE TEMP TABLE films (id serial, film text);
INSERT INTO films (film) VALUES
 ('District 9')
,('Surrogates')
,('The Invention Of Lying')
,('Pandorum')
,('UP')
,('The Soloist')
,('Cloudy With A Chance Of Meatballs')
,('The Imaginarium of Doctor Parnassus')
,('Cirque du Freak: The Vampires Assistant')
,('Zombieland')
,('9')
,('The Men Who Stare At Goats')
,('A Christmas Carol')
,('Paranormal Activity');

Query:

SELECT l.letter, COALESCE(y.ct, 0) AS ct
FROM  (
    SELECT chr(generate_series(97, 122)) AS letter  -- a-z in UTF8!
    UNION ALL
    SELECT generate_series(0, 9)::text              -- 0-9
    ) l
LEFT JOIN (
    SELECT letter, count(id) AS ct
    FROM  (
        SELECT DISTINCT  -- count film once per letter
               id, unnest(string_to_array(lower(film), NULL)) AS letter
        FROM   films
        ) x
    GROUP  BY 1
    ) y  USING (letter)
ORDER  BY 1;

Change string_to_array() so a NULL separator splits the string into characters (Pavel Stehule)

Previously this returned a null value.

  • You can use regexp_split_to_table(lower(film), ''), instead of unnest(string_to_array(lower(film), NULL)) (works in versions pre-9.1!), but it is typically a bit slower and performance degrades with long strings.

  • I use generate_series() to produce the [a-z0-9] as individual rows. And LEFT JOIN to the query, so every letter is represented in the result.

  • Use DISTINCT to count every film once.

  • Never worry about 1000 rows. That is peanuts for modern day PostgreSQL on modern day hardware.

A fairly simple solution which only requires a single table scan would be the following.

SELECT 
    'a', SUM( (title ILIKE '%a%')::integer),
    'b', SUM( (title ILIKE '%b%')::integer),
    'c', SUM( (title ILIKE '%c%')::integer)
FROM film

I left the other 33 characters as a typing exercise for you :)

BTW 1000 rows is tiny for a postgresql database. It's beginning to get large when the DB is larger then the memory in your server.

edit: had a better idea

SELECT chars.c, COUNT(title)
FROM (VALUES ('a'), ('b'), ('c')) as chars(c)
    LEFT JOIN film ON title ILIKE ('%' || chars.c || '%')
GROUP BY chars.c
ORDER BY chars.c 

You could also replace the (VALUES ('a'), ('b'), ('c')) as chars(c) part with a reference to a table containing the list of characters you are interested in.

This will give you the result in a single row, with one column for each matching letter and digit.

SELECT
  SUM(CASE WHEN POSITION('a' IN filmname) > 0 THEN 1 ELSE 0 END) AS "A",
  SUM(CASE WHEN POSITION('b' IN filmname) > 0 THEN 1 ELSE 0 END) AS "B",
  SUM(CASE WHEN POSITION('c' IN filmname) > 0 THEN 1 ELSE 0 END) AS "C",
  ...
  SUM(CASE WHEN POSITION('z' IN filmname) > 0 THEN 1 ELSE 0 END) AS "Z",
  SUM(CASE WHEN POSITION('0' IN filmname) > 0 THEN 1 ELSE 0 END) AS "0",
  SUM(CASE WHEN POSITION('1' IN filmname) > 0 THEN 1 ELSE 0 END) AS "1",
  ...
  SUM(CASE WHEN POSITION('9' IN filmname) > 0 THEN 1 ELSE 0 END) AS "9"
FROM films;

A similar approach like Erwins, but maybe more comfortable in the long run:

Create a table with each character you're interested in:

CREATE TABLE char (name char (1), id serial);
INSERT INTO char (name) VALUES ('a');
INSERT INTO char (name) VALUES ('b');
INSERT INTO char (name) VALUES ('c');

Then grouping over it's values is easy:

SELECT char.name, COUNT(*) 
  FROM char, film 
  WHERE film.name ILIKE '%' || char.name || '%' 
  GROUP BY char.name 
  ORDER BY char.name;

Don't worry about ILIKE.

I'm not 100% happy about using the keyword 'char' as table title, but hadn't had bad experiences so far. On the other hand it is the natural name. Maybe if you translate it to another language - like 'zeichen' in German, you avoid ambiguities.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!