Should Old Acquaintance be Forgot: Tidying up Mac Mail
As the year is closing down, why not spend some of the free time to explore your email data using R and the tidyverse? When I learned that Mac OS Mail stores its internal data in a SQLite database file I was hooked. A quick dive in your email archive might uncover some of your old acquaintances. Let’s take a peak.
Obviously, the below is only applicable when you are a regular user of the Mail app for Mac OS. As a first step, you need to locate the file Envelope Index
that tends to be located in ~/Library/Mail/V6/MailData/
. Copy this file somewhere and adjust the path provided in mail_db
to point to this copy. Do not work with the original file.
library(DBI)
library(tidyverse)
library(lubridate)
library(ExPanDaR)
library(ggridges)
mail_db <- "data/EI"
con <- dbConnect(RSQLite::SQLite(), mail_db)
Now you have established a database connection to your copy of Mac OS Mail’s internal data. I you receive an error message, check whether you have the required packages (including RSQLite
) installed. With this established connection, we can now see what the database has in store for us.
kable(dbListTables(con), col.names = "List of Tables")
List of Tables |
---|
action_ews_messages |
action_imap_messages |
action_labels |
action_messages |
addresses |
attachments |
duplicates_unread_count |
events |
ews_copy_action_messages |
ews_folders |
imap_copy_action_messages |
imap_labels |
imap_messages |
labels |
last_spotlight_check_date |
local_message_actions |
mailbox_actions |
mailboxes |
messages |
properties |
recipients |
sqlite_sequence |
subjects |
threads |
Hey, this is an impressive list of tables. For this blog post, I am mostly interested in exploring the development of my email activity in terms of senders and receivers over time. Thus, I focus on the relations messages
, addresses
and recipients
.
messages <- dbListFields(con, "messages")
recipients <- dbListFields(con, "recipients")
addresses <- dbListFields(con, "addresses")
max_members <- max(length(messages),
length(recipients),
length(addresses))
length(messages) <- max_members
length(recipients) <- max_members
length(addresses) <- max_members
df <- data.frame(messages, recipients, addresses, stringsAsFactors = FALSE)
df[is.na(df)] <- ""
kable(df)
messages | recipients | addresses |
---|---|---|
ROWID | ROWID | ROWID |
message_id | message_id | address |
document_id | type | comment |
in_reply_to | address_id | |
remote_id | position | |
sender | ||
subject_prefix | ||
subject | ||
date_sent | ||
date_received | ||
date_created | ||
date_last_viewed | ||
mailbox | ||
remote_mailbox | ||
flags | ||
read | ||
flagged | ||
size | ||
color | ||
type | ||
conversation_id | ||
snippet | ||
fuzzy_ancestor | ||
automated_conversation | ||
root_status | ||
conversation_position | ||
deleted |
OK. These relations provide enough data to play. Let’s create a table that is organized by message and contains sender and receiver info as well as sending/receiving time.
sql <- paste("SELECT messages.ROWID as message_id, date_sent,",
"date_received, a1.address as sender_address,",
"a1.comment as sender_comment, a2.address as recipient_address,",
"a2.comment as recipient_comment, snippet",
"FROM messages left join addresses AS a1 on messages.sender = a1.ROWID",
"LEFT JOIN recipients on recipients.message_id = messages.ROWID",
"LEFT JOIN addresses AS a2 on recipients.address_id = a2.ROWID")
res <- dbSendQuery(con, sql)
df <- dbFetch(res)
dbClearResult(res)
dbDisconnect(con)
df[,c("date_sent", "date_received")] <- lapply(df[,c("date_sent", "date_received")],
function(x) as.POSIXct(x, origin = "1970-01-01"))
The resulting data frame contains all messages, including messages sent to multiple recipients (with me on the sending or receiving end). To limit the messages to the ones where I am involved, I match sender_address
and receiver_address
to a vector my_adresses
containing my email addresses (not disclosed here for privacy reasons).
df %>%
filter(tolower(sender_address) %in% my_addresses |
tolower(recipient_address) %in% my_addresses) %>%
distinct(.keep_all = TRUE) %>%
arrange(date_received) -> emails
In addition, I prepare a panel dataset that contains data at email address year level.
emails %>%
filter(!tolower(sender_address) %in% my_addresses) %>%
mutate(address = tolower(sender_address),
year = year(date_received)) %>%
group_by(year, address) %>%
summarise(emails_received_from = n()) -> emails_received
emails %>%
filter(!tolower(recipient_address) %in% my_addresses) %>%
mutate(address = tolower(recipient_address),
year = year(date_received)) %>%
group_by(year, address) %>%
summarise(emails_sent_to = n()) -> emails_sent
panel <- full_join(emails_received, emails_sent) %>%
replace_na(list(emails_sent_to = 0, emails_received_from = 0)) %>%
arrange(year, -emails_sent_to, -emails_received_from)
Time for a first analysis. How does my email in- and outflow develop over the years?
panel %>%
group_by(year) %>%
summarise(sent_mails = sum(emails_sent_to),
received_mails = sum(emails_received_from)) %>%
gather(key = "var", value = "count", -year) %>%
ggplot(aes(year, count)) +
geom_bar(aes(fill = var), position = "dodge", stat="identity") +
scale_fill_discrete(name="Direction",
labels=c("Emails received", "Emails sent")) +
xlab("Year") +
ylab("Number of emails") +
theme_minimal() +
theme(legend.position=c(.3, .8))
Hmm. I have no interest in extrapolating this time trend… Just for fun and giggles: When do I send emails (by time of day)?
emails %>%
filter(sender_address %in% my_addresses) %>%
mutate(sent_tod = hour(date_sent)) %>%
ggplot() +
geom_histogram(aes(sent_tod), binwidth=1, fill = "#00376C") +
xlab("Time of day [24 hours]") +
ylab("Number of emails") +
theme_minimal()
No real surprises here (at least for me), besides that I seem to take a dip in the early afternoon. Does this sending behavior exhibit any interesting time trends?
emails %>%
filter(sender_address %in% my_addresses) %>%
mutate(sent_tod = hour(date_sent) + minute(date_sent)/60 + second(date_sent)/3600,
year = as.factor(year(date_sent))) %>%
ggplot(aes(x = sent_tod, y = year, fill = ..x..)) +
geom_density_ridges_gradient(scale =2, rel_min_height = 0.01) +
ylab("Year") +
xlab("Time of day [24 hours]") +
theme_minimal() +
theme(legend.position = "none")
2002 and 2003 stand out (I was based in the U.S. for most of this period) and it seems as if I am gradually becoming more of an early starter.
But enough on this. This is supposed to be about old acquaintances to fit this special end-of-the-year mood. Let’s dive more into the personal sphere. For this, I define an “email contact” as an email address with an email exchange in a given year, meaning that I was both, at the receiving and sending end, at least once.
panel %>%
group_by(year) %>%
filter(emails_sent_to > 0,
emails_received_from > 0) %>%
summarise(nr_email_contacts = n()) %>%
ggplot(aes(year, nr_email_contacts)) +
geom_bar(stat="identity", fill = "#00376C") +
xlab("Year") +
ylab("Head count email contacts") +
theme_minimal()
I would never have never guessed that I am exchanging emails with that many people (No, I am not a spammer …, I hope). Who are my “top contacts”?
panel %>%
filter(emails_sent_to > 0,
emails_received_from > 0) %>%
group_by(address) %>%
summarise(email_contacts = sum(emails_sent_to) + sum(emails_received_from),
sent_rec_ratio = sum(emails_sent_to) / sum(emails_received_from),
emails_sent_to = sum(emails_sent_to),
emails_received_from = sum(emails_received_from)) %>%
arrange(-email_contacts) -> email_contacts
# print(email_contacts, n = 50)
I am not including this output here for obvious privacy reasons. If you are interested to follow your email contacts over time, you can use my ExPanD shiny app for quick exploration.
panel %>%
filter(emails_sent_to > 0,
emails_received_from > 0) %>%
ExPanD(cs_id = "address", ts_id = "year", components = c(missing_values = FALSE,
by_group_violin_graph = FALSE,
by_group_bar_graph = FALSE))
Using ExPanD, I encountered a reasonable amount of my “old acquaintances” when focusing on earlier years of the panel. To do this a little bit more systematically, I prepare a linking table, linking the most prominent email contact addresses in the panel file to real people names in an n(email addresses) to 1(person) look-up table. This table has the name email_contacts_names
and the fields address
and name
. Based on this data I produce a new panel sample containing only those email contacts linked to actual persons that I care about.
panel %>%
left_join(email_contacts_names) %>%
filter(!is.na(name)) %>%
group_by(year, name) %>%
summarise(emails_sent_to = sum(emails_sent_to),
emails_received_from = sum(emails_received_from)) %>%
filter(emails_sent_to > 0,
emails_received_from > 0) %>%
mutate(email_contacts = emails_sent_to + emails_received_from,
sent_rec_ratio = emails_sent_to / emails_received_from) %>%
arrange(year, -email_contacts) -> panel_email_contacts
Using this data, you can now easily explore how your contacts changed over time. As one example I prepared a graph highlighting the (relative) number of email contacts with a selected group of people over time. The vector people
contains the names of these people.
single_out_people <- function(df, names, relative = FALSE) {
df$relative <- relative
palette <- function(n) {
hues = seq(15, 375, length = n)
c("#B0B0B0", hcl(h = hues, l = 65, c = 100)[1:n])
}
df %>%
mutate(name = ifelse(name %in% names, name, "Other")) %>%
group_by(year) %>%
mutate(val = ifelse(relative,
email_contacts/sum(email_contacts),
email_contacts)) %>%
select(year, name, val) %>%
group_by(year, name) %>%
summarise(val = sum(val)) %>%
spread(key = name, value = val, fill = 0) %>%
gather(key = name, value = val, -year) %>%
mutate(name = factor(name, levels = c("Other", rev(names)))) %>%
arrange(year, name) %>%
ggplot(aes(x = year, y = val, fill = name)) +
geom_area() +
scale_fill_manual(values = palette(length(names))) +
xlab("Year") +
ylab(ifelse(relative,
"Share of annual email contacts",
"Annual email contacts")) +
theme_minimal() -> p
if (relative) p + scale_y_continuous(labels = scales::percent) else p
}
single_out_people(panel_email_contacts, people, FALSE) + theme(legend.position="none")
single_out_people(panel_email_contacts, people, TRUE) + theme(legend.position="none")
The grayish areas marks “Other”. With legends excluded for privacy reasons, you can follow old and new friends through time with this little display.
While the Envelope Index database seems to offer much more data that is worth exploring, I will it leave it here. Happy New Year everybody and if you are in the mood: Auld Lang Syne.
Enjoy!