Mbox Data Extraction¶
Mbox is a file format used to store mailbox data on Unix operating systems. The tidyextractors.tidymbox
submodule lets you extract user data from Mbox files with minimal effort. This page will guide you through the process.
A Minimal Code Example¶
from tidyextractors.tidymbox as MboxExtractor
# Extracts all mbox files in this directory.
mx = MboxExtractor('./your/mbox/dir/')
# Email messages in a Pandas DataFrame.
email_df = mx.emails(drop_collections=True)
# MessageID/receiver keyed Pandas DataFrame.
sends_df = mx.sends()
Step 1: Prepare Your Mbox Files¶
You can extract data from a single Mbox file, or multiple Mbox files. However, all these files must be in a single directory:
ls -1 ./your/mbox/dir/
file1.mbox
file2.mbox
file3.mbox
Step 2: Extract Data¶
Once you have consolidated your Mbox files, you can extract data from them using the MboxExtractor
:
from tidyextractors.tidymbox as MboxExtractor
# All mbox files in the directory
mx = MboxExtractor('./your/mbox/dir/')
# Only one mbox file
mx = MboxExtractor('./your/mbox/dir/file1.mbox')
You may need to wait while the data is being extracted, but all the data is now stored inside the extractor object. You just need a bit more code to get it in your preferred format.
Step 3: Get Pandas Data¶
Now, you can call an MboxExtractor
method to return data in a Pandas DataFrame.
# Email messages in a Pandas DataFrame.
email_df = mx.emails(drop_collections=True)
# MessageID/receiver keyed Pandas DataFrame.
sends_df = mx.sends()
Note
MboxExtractor.emails()
drops columns with collections of data in cells (i.e. list
, set
, and dicts
) because “tidy data” requires only atomic values in cells.
If you don’t want data dropped, change the optional drop_collections
argument to false.
Note
This submodule’s internals were adapted from Phil Deutsch’s mbox-to-pandas script with his permission.