Continuing to talk about the design of Groups.io, today I’ll talk about our database design.
Groups.io is built on top of Postgresql. We use GORP to handle marshaling our database objects. We split our data over several separate databases. The databases are all currently running in one Postgresql instance, but this will allow us to easily split data over several physical databases as we scale up. A downside to this is that we end up having to manage more database connections now, and the code is more complicated, but we won’t have to change any code in the future when we split the databases over multiple machines (sharding is a whole other thing).
There are no joins in the Groups.io system and there are no foreign key constraints. We enforce constraints in an application layer. We did this for future scalability. It did require more work in the beginning and it remains to be seen if we engaged in an act of premature optimization. Every record in every table has a 64-bit integer primary key.
We have 3 database machines. DB01 is our main database machine. DB02 is a warm-standby, and DB03 is a hot-standby. We use wall-e to backup DB01’s database to S3. DB02 uses wall-e to pull its data from S3 to keep warm. All three machines also run Elasticsearch as part of a cluster. We run statistics on DB03.
Our data is segmented into the following main databases: userdb, archivedb, activitydb, deliverydb, integrationdb.
The userdb contains user, group and subscription records. Subscriptions provide a mapping from users to groups, and we copy down several bits of information from users and groups into the subscription records, to make some processing easier. Here are some of the copied down columns:
GroupName string // Group.Name Email string // User.Email UserName string // User.UserName FullName string // User.FullName UserStatus uint8 // User.Status Privacy uint8 // Group.Privacy
We maintain these columns in an application layer above the database. By duplicating this information in the subscription record, we greatly reduce the number of user and group record fetches we need to do throughout the system. These fields rarely change, so there’s not a large write penalty. There is definitely a memory penalty, with the expanded subscription record. But I figured that was a good trade off.
The archivedb stores everything related to message archives. The main tables are the thread table and the message table. We store every message in the message table, as raw compressed text, but before we insert each message, we strip out any attachments, and instead store them in Amazon’s S3. This reduces the average size of emails to a much more manageable level.
The activitydb stores activity logging records for each group.
The deliverydb stores bounce information for users.
The integrationdb stores information relating to the various integrations available in Groups.io
We use Elasticsearch for our search, and our indexes mirror the Postgresql tables. We have a Group index, a Thread index and a Message index. I tried a couple Go Elasticsearch libraries and didn’t like any of them, so I wrote my own simple library to talk to our cluster.
In future articles, I’ll talk about some aspects of the code itself. Are there any specific topics you’d like me to address? Please let me know.
Are you unhappy with Yahoo Groups or Google Groups? Or are you looking for an email groups service for your company? Please try Groups.io.