Issues with Wikipedia dump table pagelinks


Question


I downloaded the enwiki-latest-pagelinks.sql.gz dump from dumps.wikimedia.org/enwiki/latest/.

I unpacked the file; its uncompressed size is 37G.

The table structure is this:

SHOW CREATE TABLE wp_dump.pagelinks;

CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  `pl_from_namespace` int(11) NOT NULL DEFAULT '0',
  UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
  KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
  KEY `pl_backlinks_namespace` (`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary

I imported the table into a new, empty database:

mysql -D wp_dump -u root -p < enwiki-latest-pagelinks.sql

The computer I am running the task on has 16G of RAM, and the MySQL database is located on an SSD, so I assumed that despite the table's size the import would not take too long.

However, the import has now been running for over a day and is still not finished. There are no other processes accessing MySQL, and there is no other workload on the computer.

The database file itself is now 79G.

ls -lh

-rw-r----- 1 mysql mysql   65 May 11 17:40 db.opt
-rw-r----- 1 mysql mysql 8,6K May 12 07:06 pagelinks.frm
-rw-r----- 1 mysql mysql  79G May 13 16:59 pagelinks.ibd

The table now has over 500 million rows.

SELECT table_name, table_rows FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'wp_dump';

+------------+------------+
| table_name | table_rows |
+------------+------------+
| pagelinks  |  520919860 |
+------------+------------+
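
Note: for InnoDB, the table_rows value in INFORMATION_SCHEMA is only an estimate; an exact count needs a full scan of the primary key, which takes a while on a table this size:

SELECT COUNT(*) FROM wp_dump.pagelinks;  -- exact row count, but slow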

I am wondering:

Is the enwiki-latest-pagelinks table really over 79G on disk?

Does pagelinks really contain over 500 million rows?

Does it really take that long to import the pagelinks table?

Can you provide some metrics on the expected table size and row count, please?

Update, 14 May 2017:

The insert is still running; the pagelinks.ibd file is now 130G, and the number of rows is almost 700 million.

Update, 16 May 2017:

The insert is still running; the pagelinks.ibd file is now 204G, and the number of rows is over 1.2 billion.

I calculated the rows inserted per second over the last two days:

rows/sec = 3236

Also, each INSERT statement in the SQL script inserts many thousands of rows (head -41 enwiki-latest-pagelinks.sql | tail -1 | grep -o "(" | wc -l gives 30471).

So, my follow-up / modified questions:

Are the number of rows and the .ibd file size to be expected, given the SQL file size of 37G and the table structure (as listed above)?

Is 3236 rows/sec a good value (meaning that it will take a few days to insert the table)?

What may be the limiting factor for the speed, and how can I speed up the import?

  • Disable the indexes (and recalculate them after the insert)?
  • Optimize transaction handling (no explicit commits in the script; autocommit is currently ON)? See the sketch after this list.
  • Optimize variable settings (e.g. innodb_buffer_pool_size, currently 134217728, i.e. 128M)?
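
As a hedged sketch of the transaction idea: MySQL's bulk-loading advice for InnoDB boils down to a few session variables wrapped around the load, assuming the dump data can be trusted so that uniqueness checks may be skipped:

-- session-level settings that reduce per-row overhead during the bulk load
SET unique_checks = 0;        -- skip unique-index checks; the dump is assumed clean
SET foreign_key_checks = 0;   -- pagelinks has no FKs, but this skips the check machinery
SET autocommit = 0;           -- one big commit instead of a commit per statement
SOURCE enwiki-latest-pagelinks.sql;
COMMIT;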

Answer 1:


37GB of data --> 79GB of InnoDB table seems reasonable...

  • Title: 2 quotes and 1 comma --> 1 byte for length
  • Ints: several bytes in the dump, plus a comma --> 4 bytes for INT (regardless of the (...) after INT; see MEDIUMINT).
  • 20-30 bytes overhead per row
  • 20-40% overhead for BTrees.
  • UNIQUE index becomes PRIMARY KEY and clusters with data --> very little overhead.
  • The other two indexes: each is virtually the same size as the data. This more than accounts for the increased size.

Adding it all together, I would expect the table to be more than 120GB. So, there are probably some details missing. A guess: The dump is one row per INSERT, instead of the less verbose many-rows-per-INSERT.

As for performance, it all depends on the SELECTs. Set innodb_buffer_pool_size to somewhere around 11G. This may work efficiently enough for caching the 79G.

More

Change the UNIQUE key to a PRIMARY KEY, for clarity and because InnoDB really wants an explicit PK.
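
A minimal sketch of that change (same columns as the existing UNIQUE key, which InnoDB already uses as the clustered index here; on an already-loaded 79G table this rewrites the whole table, so it is cheaper to edit the CREATE TABLE statement in the dump before importing):

ALTER TABLE pagelinks
  DROP INDEX pl_from,
  ADD PRIMARY KEY (pl_from, pl_namespace, pl_title);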

Check the source data. Is it in (pl_from,pl_namespace,pl_title) order? If not, can you sort the file before loading? If you can, that, alone, should significantly help the speed.

128MB for the buffer_pool is also significantly hampering progress.
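
For reference, on MySQL 5.7.5 and later the buffer pool can be resized at runtime; on older versions the value has to go into the option file, followed by a restart. A sketch for the roughly 11G suggested above:

SET GLOBAL innodb_buffer_pool_size = 11 * 1024 * 1024 * 1024;  -- ~11G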




Answer 2:


@Sim Betren: I am currently importing the same table and get about 7700 rows/s, which means about 600,000,000 rows a day. Probably the most important thing is to get the right settings for InnoDB:

https://dba.stackexchange.com/questions/83125/mysql-any-way-to-import-a-huge-32-gb-sql-dump-faster

innodb_buffer_pool_size = 4G
innodb_log_buffer_size = 256M
innodb_log_file_size = 1G
innodb_write_io_threads = 16
innodb_flush_log_at_trx_commit = 0

Those settings work well. From what I read and tried, InnoDB loves high memory settings. Ideally, you would use a 16GB or even 32GB machine and increase these settings even more. But I got 7700 rows/s on a modest setup that is almost 10 years old:

  • Intel Q6700 quad
  • 8 Gb DDR2 memory

I combined that 10-year-old hardware with a 2017-model 500GB SSD, which is dedicated to the job and handles both the reading and the writing. The reason for using the older hardware is that the SSD is the most important part of the setup (because of IOPS); plus, by using older hardware I saved a bit of money. However, the hardware is limited to 8GB of DDR2. A newer dedicated machine with 32GB or 64GB of internal memory could really fly, I reckon.

Software setup:

  • Linux Mint 64bit
  • MySQL Server 5.7.18 for Ubuntu
  • MySQL Workbench for importing

I also tried this on Windows 10 and the speed is almost the same on both. So you could try Windows too.

Note: I did try changing the engine to MyISAM. MyISAM can be pretty fast too, around 8000 rows/sec or more, but the import always ended up corrupted for some reason. So I would stick with InnoDB.

Update, 17 June 2017:

Finished the import. The table "pagelinks" is about 214GB in size, with 1,200 million rows. About 112GB is raw data, 102GB is indexes. The original uncompressed file was about 37GB.
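
For anyone wanting the same data/index breakdown for their own import, InnoDB exposes (approximate) figures in INFORMATION_SCHEMA; a sketch using the schema name from the question:

SELECT table_name,
       ROUND(data_length  / POWER(1024, 3), 1) AS data_gb,
       ROUND(index_length / POWER(1024, 3), 1) AS index_gb
FROM INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'wp_dump' AND table_name = 'pagelinks';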

It took about 2 days and 6 hours to import, at an average speed of 5350 rows/second. With high-end equipment (lots of memory, preferably 64GB or more) and optimal settings, it can probably be done faster. But I let it run on a dedicated machine 24/7 and I wasn't in a hurry, so 2 days seems OK.

Update, 18 June 2017:

I also imported "page.sql", because it contains the names that go with the IDs. The uncompressed file is about 5GB and the import took 1 hour, which seems quick: the pagelinks file is about 37GB, only 7x bigger than "page.sql", yet it takes roughly 50x longer to import. So there are a few reasons why "pagelinks" takes so long: (A) it probably doesn't fit in memory, (B) the table structure and the large amount of data per insert, (C) the settings. But most likely it's the memory.

Conclusion: try to get a PC with 32GB or 64GB of internal memory, maybe even more, and use SSDs that can keep up with that memory, 500GB or more. The SSD is more important than the memory, so try that first.




Answer 3:


@Sim Betren:

I want to open a completely new answer, since I have discovered a new solution. Splitting up the file is probably the best answer. As discussed in the other answer, InnoDB works best when the entire model fits in memory; the delays start when it has to swap to disk. The pagelinks file is 37GB, and that is simply too big for most machines to fit easily into memory. Maybe a $1000+ dedicated machine with plenty of memory can do it, but most desktops can't. So here is what you can do:

  • The plan is to split the file. The first thing to do is to separate the SQL structure from the data.
  • There are probably better ways to do that, but a program I found is SqlDumpSplitter2.

  • The dump splitter program may be old, but it worked on pagelinks. It is Windows-only, though (a shell-based alternative is sketched below). I simply told it to split the unpacked 37GB file into 37 chunks of 1GB each, and it dutifully did. I checked the data and it seems to be working. You could also use 74 chunks of 500MB.

  • Importing each chunk takes maybe 10 to 20 minutes per 1GB.
  • Total time: about 1 to 2 hours for splitting the 37GB file and about 6 to 12 hours for importing. That easily beats the previous answer I gave.
  • When importing, use the same big-data settings as in the previous answer, and try to find a machine with plenty of memory, 16GB or 32GB preferred.

What is most important here is that it doesn't really matter much how you split it. Just split the file any way you can, then build it up by re-creating the structure and the data separately. That way the import can come down from 2 days to maybe as little as a few hours. On a big dedicated machine it can probably be done in just 1 to 6 hours.
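
Since SqlDumpSplitter2 is Windows-only, here is a rough shell equivalent for Linux (GNU coreutils). It assumes, as is the case for the Wikimedia dumps, that every INSERT sits on a single line, so splitting on line boundaries never cuts a statement in half:

# separate the schema from the data, then cut the data into ~1GB chunks of whole lines
grep -v '^INSERT INTO' enwiki-latest-pagelinks.sql > pagelinks-schema.sql
grep    '^INSERT INTO' enwiki-latest-pagelinks.sql > pagelinks-data.sql
split --line-bytes=1G --additional-suffix=.sql pagelinks-data.sql pagelinks-part-

# load the schema once, then each chunk in turn
# (put the password in ~/.my.cnf to avoid being prompted for every chunk)
mysql -D wp_dump -u root -p < pagelinks-schema.sql
for f in pagelinks-part-*.sql; do mysql -D wp_dump -u root -p < "$f"; done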



Source: https://stackoverflow.com/questions/43954631/issues-with-wikipedia-dump-table-pagelinks
