Storing Integer IP Addresses in MySQL

There was a brief discussion regarding the performance benefits of storing IP addresses in a database as an integer, rather than as a varchar. One commenter was making the argument that the numeric-to-string and string-to-numeric conversions cost involved in storing IPs as integers was more significant than the space savings.

While the space savings are easily apparent, and I’ve seen demonstrations of how much faster search operations are, I could not recall anyone ever doing an analysis of how long it takes to insert IPs as strings or integers into the database. Naturally, I had to determine an answer to this so I’d know for future reference.

My findings show that, beyond the disk space savings and the much faster index queries possible with integer-based IP addresses, even with the database doing the string-to-numeric conversion, it is 9-12% faster to store IP addresses into the database as integers, rather than strings.

Background

There are two ways you can store an IPv4 address in a database. In its native format, an IP address is a four-byte long integer, and can be represented as an int unsigned.

In its human-readable format, an IP address is a string, with a minimum length of 7 characters (0.0.0.0) and a maximum length of 15 (255.255.255.255). This gives it an average length (assuming uniform random distribution) of 13.28 characters. Accordingly, one would store it in a database field of type varchar(15). In order for the database to keep track of exactly how much data is in the column, an additional byte of data must be added to store the length of the string. This brings the actual data-storage costs of an IP represented as a string to an average of 14.28 bytes (assuming the characters can be represented by one byte per character, as with latin1 or utf8).

This means that storing an IP address as a string requires, on average, about 10 bytes of extra data. In an application that saves access logs, that 10 bytes of data will eventually add up, and that should be reason enough to store IPs as an integer, rather than a string.

Disk Space Is Cheap, So No Big Deal, Right?

There are other costs associated with having larger data fields. If the column is indexed, the index will be larger as well. Larger indexes tend to perform slower than smaller indexes. Additionally, while disk space is plentiful and cheap, RAM is considerably more limited, so more memory will be used to cache the data or indexes, potentially pushing other, more valuable content out of the cache.

Further, while disks have been gradually getting faster, it still takes a relatively long time to read data from (and even longer to write to) a disk, and CPUs have gotten faster much more quickly than disks. The more data that has to be moved around, the more time the CPU wastes moving that data instead of performing more interesting work.

So, What’s the Actual Cost?

I wrote a PHP script that generates and inserts 1,000,000 random IP addresses into four mysql tables and timed the results. All tests were done with PHP 5.3alpha2 and MySQL 5.0.67 on a 2.16 GHz MacBook Pro (Intel Core Duo), with MyISAM tables.

The test uses four tables as described below. Each table has three columns: an id int unsigned not null auto_increment primary key column, and ip and s as described below:

table ip s description
inetbench_long int unsigned not null char(1) not null fixed row length
inetbench_long_dynamic int unsigned not null varchar(1) not null dynamic row length
inetbench_varchar varchar(15) not null varchar(1) not null dynamic row length; IP stored as string
inetbench_varchar_utf8 varchar(15) not null varchar(1) not null same as inetbench_varchar, but charset = utf8

The PHP benchmark script generates 1 million random IP addresses, and then inserts that data into MySQL To make MySQL do as much work as possible, for the inetbench_long tables, the sql query string uses INET_ATON() to convert the IP to an integer, rather than attempting to do it in PHP.

The results:

table insert time avg row length data length index length total length
inetbench_long 132.35 sec 10 bytes 10,000,000 22,300,672 32,300,627
inetbench_long_dynamic 132.71 sec 20 bytes 20,000,000 22,300,672 42,300,627
inetbench_varchar 144.44 sec 24 bytes 24,504,148 36,341,760 60,845,908
inetbench_varchar_utf8 148.86 sec 24 bytes 24,504,148 36,341,760 60,845,908

MySQL is adding 1 byte of overhead per row to inetbench_long, 10 bytes/row to inetbench_long_dynamic, and an average of 5 bytes/row of overhead to inetbench_varchar and inetbench_varchar_utf8.

The s column was added to test the performance difference of fixed vs. dynamic row storage, but it turns out there’s not much difference.

You can download the code used to generate these results here.

Analysis

Storing IPs as a string, besides requiring more disk space, takes 9% longer than storing them as integers, even with the overhead of converting the IP from a string to an integer. If the table uses utf8 encoding, it’s 12% slower. (This should not be surprising: UTF-8 is inherently slower to process than a strictly 8-bit encoding.) Storing data as an integer in a table with a dynamic row length is not appreciably slower. The indexes on the string tables are 63% larger.

For good measure, I tested a few select queries against the tables. The results were actually somewhat interesting.

Doing a search where you’re looking for a specific IP, or a range of IPs that can be satisfied with a like clause resulted in no significant difference between the integer and varchar storage. (This was somewhat surprising to me.)

select benchmark(10000000, (select count(*) from inetbench_long where ip between inet_aton('172.0.0.0') and inet_aton('172.255.255.255'))); 1 row in set (0.41 sec)

select benchmark(10000000, (select count(*) from inetbench_varchar where ip like '172.%')); 1 row in set (0.42 sec)

However, as expected, when the search range can’t be represented with a simple like query, the speed difference between numeric and string indexes really show:

select benchmark(35000000, (select count(*) from inetbench_long where ip between inet_aton('172.16.0.0') and inet_aton('172.31.255.255'))); 1 row in set (1.43 sec)

select count(*) from inetbench_varchar where inet_aton(ip) between inet_aton('172.16.0.0') and inet_aton('172.31.255.255'); 1 row in set (1.47 sec)

select count(*) from inetbench_varchar_utf8 where inet_aton(ip) between inet_aton('172.16.0.0') and inet_aton('172.31.255.255'); 1 row in set (1.72 sec)

This results in an integer search that’s about 35 million times faster than the string search. Also, the utf8 table is about 17% slower than the latin1 table (which isn’t too surprising, given UTF-8’s overhead).

In Conclusion

With “only” a difference of 12 microseconds per insert query, it may not make sense to change an existing database if you’re not doing many queries against stored IPs. If you’re doing IP range queries, though, you probably want to convert your tables. Any new development should be storing IP addresses in the database as integers by default. The space and time savings are worth it.

Once IPv6 becomes more prevalent, the savings will only become larger: a 128-bit (16 byte) IPv6 address can be up to 39 characters long when represented in a “human readable” format. (Storing IPv6 addresses in the database is going to be a bit more difficult, as MySQL doesn’t have a native 16-byte-wide data type.)

The load on MySQL when inserting integer IPs could likely be slightly reduced by doing that conversion in your application, rather than using MySQL’s INET_ATON() function.

在看高性能MySQL第3版(4.1.7节)时,作者建议当存储IPv4地址时,应该使用32位的无符号整数(UNSIGNED INT)来存储IP地址,而不是使用字符串。 但是没有给出具体原因。为了搞清楚这个原因,查了一些资料,记录下来。

相对字符串存储,使用无符号整数来存储有如下的好处:

  • 节省空间,不管是数据存储空间,还是索引存储空间

  • 便于使用范围查询(BETWEEN…AND),且效率更高

通常,在保存IPv4地址时,一个IPv4最小需要7个字符,最大需要15个字符,所以,使用VARCHAR(15)即可。MySQL在保存变长的字符串时,还需要额外的一个字节来保存此字符串的长度。而如果使用无符号整数来存储,只需要4个字节即可。

另外还可以使用4个字段分别存储IPv4中的各部分,但是通常这不管是存储空间和查询效率应该都不是很高(可能有的场景适合使用这种方式存储)。

使用无符号整数来存储也有缺点:

  • 不便于阅读

  • 需要手动转换

对于转换来说,MySQL提供了相应的函数来把字符串格式的IP转换成整数INET_ATON,以及把整数格式的IP转换成字符串的INET_NTOA。如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
mysql> select inet_aton('192.168.0.1');
+--------------------------+
| inet_aton('192.168.0.1') |
+--------------------------+
| 3232235521 |
+--------------------------+
1 row in set (0.00 sec)

mysql> select inet_ntoa(3232235521);
+-----------------------+
| inet_ntoa(3232235521) |
+-----------------------+
| 192.168.0.1 |
+-----------------------+
1 row in set (0.00 sec)

对于IPv6来说,使用VARBINARY同样可获得相同的好处,同时MySQL也提供了相应的转换函数,即INET6_ATONINET6_NTOA

对于转换字符串IPv4和数值类型,可以放在应用层,下面是使用java代码来对二者转换:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
package com.mikan;

/**
* @author Mikan
*/
public class IpLongUtils {
/**
* 把字符串IP转换成long
*
* @param ipStr 字符串IP
* @return IP对应的long值
*/
public static long ip2Long(String ipStr) {
String[] ip = ipStr.split("\\.");
return (Long.valueOf(ip[0]) << 24) + (Long.valueOf(ip[1]) << 16)
+ (Long.valueOf(ip[2]) << 8) + Long.valueOf(ip[3]);
}

/**
* 把IP的long值转换成字符串
*
* @param ipLong IP的long值
* @return long值对应的字符串
*/
public static String long2Ip(long ipLong) {
StringBuilder ip = new StringBuilder();
ip.append(ipLong >>> 24).append(".");
ip.append((ipLong >>> 16) & 0xFF).append(".");
ip.append((ipLong >>> 8) & 0xFF).append(".");
ip.append(ipLong & 0xFF);
return ip.toString();
}

public static void main(String[] args) {
System.out.println(ip2Long("192.168.0.1"));
System.out.println(long2Ip(3232235521L));
System.out.println(ip2Long("10.0.0.1"));
}

}

输出结果为:

1
2
3
3232235521  
192.168.0.1
167772161