Groups.io Update

It’s been almost six months since I launched Groups.io and four months since I’ve talked about it here on the blog, so I figured it was time for an update. I’ve been heads down working on new features and bug fixes. Here’s a short list of the major features added during that time:

Slack Member Sync

Mailing lists and chat, like peanut butter and chocolate, go great together. Do you have a Slack Team? You can now link it with your Groups.io group. Our new Slack Member Sync feature lets you synchronize your Slack and Groups.io member lists. When someone joins your Groups.io group, they will automatically get an invite to join your Slack Team. And when someone joins your Slack Team, they’ll automatically get added to your Groups.io group. You can configure the sync to be automatic or you can sync members by hand. Access the new member sync area from the Settings page for your group.

As an aside, bacon and chocolate, another potentially great combination, do not go great together. Trust us, we’ve tried.

Google Log-in

You can now log into Groups.io using Google. This lets new users skip the confirmation email step, making it quicker and easier to join your groups.

Markdown and Syntax Highlighting Support

You can now post messages using Markdown and emoji characters. And we support syntax highlighting of code snippets.

Archive Management Tools

The heart of a group is the message archive. And nobody likes an unorganized archive. We’ve added the ability to split and merge threads. Has a thread changed topics halfway through? Split it into two threads. Or if two threads are talking about the same thing, you can merge them. You can also delete individual messages and change the subject of threads.

Subgroups

Groups.io now supports subgroups. A subgroup is a group within another group. When viewing your group on the website, you can create a subgroup by clicking the ‘Subgroup’ tab on the left side. The email address of a subgroup is of the form parentgroup+subgroup@groups.io

Subgroups have all the functionality of normal groups, with one exception. To be a member of a subgroup, you must be a member of the parent group. A subgroup can be open to all members of the parent group, or it can be restricted. Archives can be viewable by members of the parent group, or they can be private to the members of the subgroup. Subgroups are listed on the group home page, or they can be completely hidden.
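
To make the address form concrete, here is a minimal Go sketch that splits a group or subgroup address into its parts. The type and function names (GroupAddress, ParseGroupAddress) are invented for illustration and are not taken from the Groups.io code; only the parentgroup+subgroup@groups.io form comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// GroupAddress holds the parts of a group or subgroup email address.
// The type and its field names are illustrative assumptions.
type GroupAddress struct {
	Parent   string // parent group name
	Subgroup string // empty for a top-level group
	Domain   string
}

// ParseGroupAddress splits "parent+sub@domain" or "parent@domain".
func ParseGroupAddress(addr string) (GroupAddress, error) {
	local, domain, ok := strings.Cut(addr, "@")
	if !ok || local == "" || domain == "" {
		return GroupAddress{}, errors.New("not a valid address: " + addr)
	}
	parent, sub, hasSub := strings.Cut(local, "+")
	if parent == "" || (hasSub && sub == "") {
		return GroupAddress{}, errors.New("missing group or subgroup name: " + addr)
	}
	return GroupAddress{Parent: parent, Subgroup: sub, Domain: domain}, nil
}

func main() {
	a, err := ParseGroupAddress("parentgroup+subgroup@groups.io")
	fmt.Println(a.Parent, a.Subgroup, a.Domain, err)
}
```

A plain group address simply yields an empty Subgroup field, so the same parser can serve both kinds of group.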

Calendar, Files and Wiki

Every group now has a dedicated full-featured Calendar, Files section, and Wiki.

In other news, we also started an Easy Group Transfer program, for people who wish to move their groups from Yahoo or Google over to Groups.io.

Email groups are all about community, and I’m pleased that the Beta group has developed into a valuable community, helping define new features and scope out bugs. I’m working to be as transparent as possible about the development of Groups.io through that group, and through a dedicated Trello board which catalogs requested features and bug reports. If you’re interested, please join and help shape the future of Groups.io!


Groups.io Database Design

Continuing to talk about the design of Groups.io, today I’ll talk about our database design.

Database Design

Groups.io is built on top of PostgreSQL. We use GORP to handle marshaling our database objects. We split our data over several separate databases. The databases are all currently running in one PostgreSQL instance, but the split will allow us to easily move data onto several physical database servers as we scale up. A downside is that we have to manage more database connections now, and the code is more complicated, but we won’t have to change any code in the future when we split the databases over multiple machines (sharding is a whole other thing).

There are no joins in the Groups.io system and there are no foreign key constraints. We enforce constraints in an application layer. We did this for future scalability. It did require more work in the beginning and it remains to be seen if we engaged in an act of premature optimization. Every record in every table has a 64-bit integer primary key.
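
As an illustration of what enforcing a constraint in the application layer can look like, here is a minimal Go sketch. It uses in-memory maps as stand-ins for tables, and every name in it (Store, AddSubscription, ErrMissingRef) is hypothetical; it is not the Groups.io code, only one possible shape of the idea.

```go
package main

import (
	"errors"
	"fmt"
)

// Store is a hypothetical in-memory stand-in for the user, group and
// subscription tables. Every record is keyed by a 64-bit integer.
type Store struct {
	users  map[int64]bool
	groups map[int64]bool
	subs   map[int64][2]int64 // subscription ID -> (user ID, group ID)
	nextID int64
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{
		users:  map[int64]bool{},
		groups: map[int64]bool{},
		subs:   map[int64][2]int64{},
	}
}

// ErrMissingRef reports that a referenced record does not exist.
var ErrMissingRef = errors.New("referenced record does not exist")

// AddSubscription performs the check a foreign key constraint would
// otherwise perform: both referenced records must exist before the insert.
func (s *Store) AddSubscription(userID, groupID int64) (int64, error) {
	if !s.users[userID] || !s.groups[groupID] {
		return 0, ErrMissingRef
	}
	s.nextID++
	s.subs[s.nextID] = [2]int64{userID, groupID}
	return s.nextID, nil
}

func main() {
	s := NewStore()
	s.users[1] = true
	s.groups[10] = true
	fmt.Println(s.AddSubscription(1, 10)) // both records exist
	fmt.Println(s.AddSubscription(2, 10)) // user 2 does not exist
}
```

Against a real database the existence check and the insert are separate statements, so a sketch like this would also need to decide how to handle a record being deleted between the two.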

We have three database machines. DB01 is our main database machine. DB02 is a warm standby, and DB03 is a hot standby. We use WAL-E to back up DB01’s database to S3. DB02 uses WAL-E to pull its data from S3 to keep warm. All three machines also run Elasticsearch as part of a cluster. We run statistics on DB03.

Our data is segmented into the following main databases: userdb, archivedb, activitydb, deliverydb, integrationdb.

Userdb

The userdb contains user, group and subscription records. Subscriptions provide a mapping from users to groups, and we copy down several bits of information from users and groups into the subscription records, to make some processing easier. Here are some of the copied down columns:

GroupName string // Group.Name
Email string // User.Email
UserName string // User.UserName
FullName string // User.FullName
UserStatus uint8 // User.Status
Privacy uint8 // Group.Privacy

We maintain these columns in an application layer above the database. By duplicating this information in the subscription record, we greatly reduce the number of user and group record fetches we need to do throughout the system. These fields rarely change, so there’s not a large write penalty. There is definitely a memory penalty, with the expanded subscription record. But I figured that was a good trade-off.
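
The copy-down idea can be sketched in Go as follows. This is a simplified illustration, not the real schema or code: the structs carry only one copied column (Email), the slice stands in for the subscription table, and the function name UpdateUserEmail is an assumption.

```go
package main

import "fmt"

// User is a cut-down user record.
type User struct {
	ID    int64
	Email string
}

// Subscription maps a user to a group and carries a copy of User.Email.
type Subscription struct {
	ID      int64
	UserID  int64
	GroupID int64
	Email   string // copied down from User.Email
}

// UpdateUserEmail changes the user's email and rewrites the copy held on
// each of that user's subscriptions. It returns how many subscription
// records were touched, which is the write penalty of the denormalization.
func UpdateUserEmail(u *User, subs []Subscription, email string) int {
	u.Email = email
	touched := 0
	for i := range subs {
		if subs[i].UserID == u.ID {
			subs[i].Email = email
			touched++
		}
	}
	return touched
}

func main() {
	u := &User{ID: 1, Email: "old@example.com"}
	subs := []Subscription{
		{ID: 100, UserID: 1, GroupID: 10, Email: "old@example.com"},
		{ID: 101, UserID: 2, GroupID: 10, Email: "other@example.com"},
		{ID: 102, UserID: 1, GroupID: 11, Email: "old@example.com"},
	}
	fmt.Println(UpdateUserEmail(u, subs, "new@example.com"), subs)
}
```

Reads that only need the email for a delivery can then work from the subscription records alone, at the cost of the extra writes shown here whenever a user record changes.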

Archivedb

The archivedb stores everything related to message archives. The main tables are the thread table and the message table. We store every message in the message table, as raw compressed text, but before we insert each message, we strip out any attachments, and instead store them in Amazon’s S3. This reduces the average size of emails to a much more manageable level.
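
For illustration, here is a small Go sketch of the compression step only. The post does not say which compression format is used, so gzip from the standard library is an assumption here, as are the function names. Attachment stripping and the S3 upload are left out.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// Compress gzips a raw message before it is stored.
// gzip is an assumed choice; the real format is not stated.
func Compress(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	// Close flushes the remaining data and writes the gzip trailer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// Decompress reverses Compress when a message is read from the archive.
func Decompress(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	raw := []byte("Subject: hello\r\n\r\n" + "This is the message body. ")
	packed, err := Compress(raw)
	if err != nil {
		panic(err)
	}
	unpacked, err := Decompress(packed)
	fmt.Println(len(raw), len(packed), string(unpacked) == string(raw), err)
}
```

Very short messages can come out larger than they went in because of the fixed gzip header and trailer, so the savings show up mainly on longer bodies.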

Activitydb

The activitydb stores activity logging records for each group.

Deliverydb

The deliverydb stores bounce information for users.

Integrationdb

The integrationdb stores information relating to the various integrations available in Groups.io.

Search

We use Elasticsearch for our search, and our indexes mirror the PostgreSQL tables. We have a Group index, a Thread index and a Message index. I tried a couple of Go Elasticsearch libraries and didn’t like any of them, so I wrote my own simple library to talk to our cluster.

Next Time

In future articles, I’ll talk about some aspects of the code itself. Are there any specific topics you’d like me to address? Please let me know.

Are you unhappy with Yahoo Groups or Google Groups? Or are you looking for an email groups service for your company? Please try Groups.io.

What Runs Groups.io

I always appreciate when people talk about how they’ve built a particular piece of software or a web service, so I thought I’d talk about some of the architecture choices I made when building Groups.io, my recently launched email groups service. This will be a multi-part series.

Go

One of the goals I had when I first started working on Groups.io was to use it as an opportunity to learn the new language Go. Groups.io is written completely in Go and is my first project in the language. As a diehard C programmer (ONElist was written in C, and Bloglines was written in C++), I needed very little time to get up to speed on Go, and I now consider myself a huge fan of the language. There are many reasons why I like to code in Go. It’s compiled, so it’s fast and you get all the code checks you miss from interpreted languages. It generates standalone binaries, which is great for distributing to production machines. It’s got a great standard library. It’s easy to write concurrent code (using lightweight goroutines rather than raw threads). The documentation system is good. But besides all that, the philosophy behind Go just fits my mental model better than any other language I’ve worked in. It all combines to make programming in Go the most fun I’ve had coding in a very long time.

Components

Groups.io consists of several components that interact with each other. All interactions are done using JSON over HTTP.
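
As a rough picture of what JSON over HTTP between two components can look like in Go, here is a minimal sketch using only the standard library. The payload types (PostRequest, PostReply), the handler and the Call helper are all hypothetical; the actual endpoints and message shapes used by Groups.io are not described in this post.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// PostRequest and PostReply are hypothetical payloads exchanged by two components.
type PostRequest struct {
	Group   string `json:"group"`
	Subject string `json:"subject"`
}

type PostReply struct {
	Accepted bool `json:"accepted"`
}

// handlePost plays the receiving component: decode JSON, reply with JSON.
func handlePost(w http.ResponseWriter, r *http.Request) {
	var req PostRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(PostReply{Accepted: req.Group != ""})
}

// Call plays the sending component: post req as JSON and decode the reply.
func Call(url string, req PostRequest) (PostReply, error) {
	var reply PostReply
	body, err := json.Marshal(req)
	if err != nil {
		return reply, err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return reply, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return reply, fmt.Errorf("unexpected status %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&reply)
	return reply, err
}

// Demo wires both sides together on a throwaway local test server.
func Demo() (PostReply, error) {
	srv := httptest.NewServer(http.HandlerFunc(handlePost))
	defer srv.Close()
	return Call(srv.URL, PostRequest{Group: "beta", Subject: "hi"})
}

func main() {
	fmt.Println(Demo())
}
```

One appeal of this style is that any component can be exercised in isolation with a test server or a plain HTTP client.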

Web

The web server handles all web traffic, naturally. It is proxied behind nginx, because I believe that makes for a more flexible and slightly more secure system. Nginx terminates the encrypted HTTPS traffic and passes the unencrypted traffic to the web process. We use the standard Go HTML template system for our web templates, and we use several parts of the Gorilla web toolkit. We use Bootstrap for our HTML framework.

Smtpd

The smtpd daemon handles incoming SMTP traffic for the groups.io domain. It is also proxied behind nginx. The email it handles consists mainly of group messages, although there are some other messages as well, including bounce messages. It sends group and bounce messages to the messageserver for processing. Other messages are forwarded, using a set of rules, to other email addresses. We based smtpd heavily on Go-Guerrilla’s SMTPd.

Messageserver

The messageserver daemon processes group messages, bounce messages and email commands. For group messages, it verifies that the poster is subscribed and has permission to post to the group, then archives the message and sends it out to the group subscribers, using Karl to send the messages. It also sends the messages to our Elasticsearch cluster. Bounce and email command messages are processed as well. All group messages are processed through the messageserver, whether they arrive through the smtpd or were posted through the web site.

Karl

Karl, named after Karl ‘The Mailman’ Malone, is our email sending process. It is responsible for all emails originating from the groups.io domain. It is passed an email message, a footer template, a sender, and a set of data about each receiver the message should be sent to. For each receiver, it evaluates the template, inserting subscriber-specific information, and then merges it with the email message before sending it out. It also handles DKIM signing of emails. It stores all emails using Google’s LevelDB database until they are successfully sent.
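
The per-receiver template step can be sketched in Go with the standard text/template package. This is an illustration under assumptions, not Karl itself: the Receiver fields and the Render function are invented, the merge is a plain append to a text body, and queuing, DKIM signing and delivery are left out.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Receiver holds per-subscriber data for the footer. Fields are hypothetical.
type Receiver struct {
	Email          string
	UnsubscribeURL string
}

// Render evaluates the footer template once per receiver and appends the
// result to the shared message body, returning one message per receiver.
func Render(body, footer string, receivers []Receiver) ([]string, error) {
	tmpl, err := template.New("footer").Parse(footer)
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, len(receivers))
	for _, r := range receivers {
		var buf bytes.Buffer
		buf.WriteString(body)
		if err := tmpl.Execute(&buf, r); err != nil {
			return nil, err
		}
		out = append(out, buf.String())
	}
	return out, nil
}

func main() {
	msgs, err := Render(
		"Hello, group!\n",
		"--\nSent to {{.Email}}. Unsubscribe: {{.UnsubscribeURL}}\n",
		[]Receiver{
			{Email: "a@example.com", UnsubscribeURL: "https://example.com/u/1"},
			{Email: "b@example.com", UnsubscribeURL: "https://example.com/u/2"},
		},
	)
	fmt.Println(len(msgs), err)
	for _, m := range msgs {
		fmt.Print(m)
	}
}
```

Parsing the template once and executing it per receiver keeps the per-message work small, which matters when one post fans out to many subscribers. Merging a footer into a real MIME message takes more care than the string append used here.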

A reasonable question to ask is why didn’t I outsource the email delivery part of the service. There are several companies that provide email delivery outsourcing. In general, outsourcing is a way to save development time. But when I thought about it, I did not think I’d be able to save much time by outsourcing; I’d still have to connect our data with whatever templating system the email delivery service used. And Karl did not take very long to write. But more importantly, email delivery is a core competency of our service and I believe we have to own that.

Errord

Errord is a simple logging process, used to log error messages and stack traces from any core dumps in any of the other processes. I can look at the errord log and instantly see if anything in the system has crashed and where it crashed.
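
One way a Go process could capture a crash and hand it to a logger like errord is to recover from the panic and record the stack trace. The sketch below shows that pattern with the standard runtime/debug package. Whether Groups.io does it this way is an assumption, and the Report type, the Guard function and the send callback are all hypothetical.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// Report is a hypothetical error record sent to the logging process.
type Report struct {
	Process string
	Message string
	Stack   string
}

// Guard runs fn. If fn panics, Guard recovers, builds a Report with the
// panic value and the current stack trace, and passes it to send.
func Guard(process string, send func(Report), fn func()) {
	defer func() {
		if r := recover(); r != nil {
			send(Report{
				Process: process,
				Message: fmt.Sprint(r),
				Stack:   string(debug.Stack()),
			})
		}
	}()
	fn()
}

func main() {
	// In a real system send would deliver the report to the logger;
	// here it only prints a summary.
	send := func(r Report) {
		fmt.Printf("%s crashed: %s (%d bytes of stack)\n", r.Process, r.Message, len(r.Stack))
	}
	Guard("messageserver", send, func() {
		var m map[string]int
		m["x"] = 1 // writing to a nil map panics
	})
}
```

Because debug.Stack is called inside the deferred function while the panic is still unwinding, the trace includes the frames that led to the crash.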

Rsscrawler, Instagramcrawler

Rsscrawler and instagramcrawler are cronjobs that deal with the Feed and Instagram integrations, respectively. Rsscrawler looks for updates in feeds that are integrated with our groups, and instagramcrawler does the same for Instagram accounts. They’re currently run twice an hour. If they find an update, they generate a group message and pass it along to the messageserver.

Bouncer

Bouncer is a cronjob that is run once a day to manage bouncing users.

Expirethreads

Expirethreads is a cronjob that’s run twice an hour to expire threads that are tagged with hashtags that have an expiration.

Senddigests

Senddigests is a cronjob that’s run once a night, to generate digest emails for users with digest subscriptions.

Next Time

In future articles, I’ll talk about the machine cluster running Groups.io, the database design behind the service, and some aspects of the code itself. Are there any specific topics you’d like me to address? Please let me know.


Software Development Hypothesis

Hypothesis: The inter-team communication requirements when doing distributed software development force better communication habits upon everyone, which can lead to an overall better development process.
Explanation: When a group is working together in an office, a majority of communication happens verbally and generally informally (i.e., talking in the halls or in meetings). These communications are generally not recorded and archived. Knowledge is lost and/or spread unevenly among the group. With a modern distributed development group, the majority of communication is forced to be text, through email, IM, chat rooms and wikis. The knowledge is (within the limit of the tools) easily accessible later, by the entire group. What would seem a disadvantage, that of not all being together in the same building (or even in the same time zone), ends up being an advantage.
Agree or Disagree?