BaseExtractor

The BaseExtractor is the foundation for all data extractors in this module.

Warning

The BaseExtractor should never be instantiated directly. This page is intended for people who want to contribute to the tidyextractors project or to understand some of what is going on under the hood.

For developers, note that the internal interface is not documented on this site. See the source code for the full documentation of the internal interface.

class tidyextractors.BaseExtractor(source, auto_extract=True, *args, **kwargs)

BaseExtractor defines a basic interface, initialization routine, and data manipulation tools for extractor subclasses.

expand_on(col1, col2, rename1=None, rename2=None, drop=[], drop_collections=False)

Returns a reshaped version of the extractor’s data, in which each unique combination of values from col1 and col2 is given its own row.

Example function call from tidymbox:

self.expand_on('From', 'To', rename1='From', rename2='Recipient', drop=['MessageID', 'Recipient'])

Columns to be expanded upon should contain either atomic values or dictionaries of dictionaries. For example:

Input Data:

col1 (Atomic)    col2 (Dict of Dict)
value1           {valueA: {attr1: X1, attr2: Y1}, valueB: {attr1: X2, attr2: Y2}}
value2           {valueC: {attr1: X3, attr2: Y3}, valueD: {attr1: X4, attr2: Y4}}

Output Data:

col1_extended    col2_extended    attr1    attr2
value1           valueA           X1       Y1
value1           valueB           X2       Y2
value2           valueC           X3       Y3
value2           valueD           X4       Y4

Parameters:
  • col1 (str) – Name of the first column to expand on. Its values may be atomic or dicts of dicts.
  • col2 (str) – Name of the second column to expand on. Its values may be atomic or dicts of dicts.
  • rename1 (str) – The name for col1 after expansion. Defaults to col1_extended.
  • rename2 (str) – The name for col2 after expansion. Defaults to col2_extended.
  • drop (list) – Column names to be dropped from the output.
  • drop_collections (bool) – Defaults to False. Indicates whether columns with compound values (lists/dicts/sets) will be dropped.
Returns:

pandas.DataFrame
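
The reshaping described above can be illustrated with plain pandas. The following is only a minimal sketch of the idea for one atomic column and one dict-of-dict column, using the input data from the example. The helper name expand_dict_column is hypothetical and is not part of tidyextractors; the library's own implementation may differ.

```python
import pandas as pd


def expand_dict_column(df, atomic_col, dict_col, rename1=None, rename2=None):
    """Hypothetical helper (not tidyextractors code): one output row per
    (atomic value, outer dict key) pair, with inner dict keys as columns."""
    # Mirror the documented default names: <col>_extended.
    rename1 = rename1 or atomic_col + "_extended"
    rename2 = rename2 or dict_col + "_extended"

    rows = []
    for _, record in df.iterrows():
        # Each outer key of the dict-of-dict becomes its own row.
        for key, attrs in record[dict_col].items():
            row = {rename1: record[atomic_col], rename2: key}
            row.update(attrs)  # inner dict entries become attribute columns
            rows.append(row)
    return pd.DataFrame(rows)


data = pd.DataFrame({
    "col1": ["value1", "value2"],
    "col2": [
        {"valueA": {"attr1": "X1", "attr2": "Y1"},
         "valueB": {"attr1": "X2", "attr2": "Y2"}},
        {"valueC": {"attr1": "X3", "attr2": "Y3"},
         "valueD": {"attr1": "X4", "attr2": "Y4"}},
    ],
})

expanded = expand_dict_column(data, "col1", "col2")
print(expanded)
```

Each row of the input yields one output row per outer key in col2, so the two input rows above become four output rows.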

raw(drop_collections=False)

Returns the extractor object’s data as it is stored internally.

Parameters: drop_collections (bool) – Defaults to False. Indicates whether columns with lists/dicts/sets will be dropped.
Returns: pandas.DataFrame
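
As a rough illustration of what dropping collection columns could look like, the sketch below removes every column that holds a list, dict, or set. The helper drop_collection_columns and the sample column names are assumptions made for this example; they do not show how tidyextractors implements drop_collections internally.

```python
import pandas as pd


def drop_collection_columns(df):
    """Hypothetical helper (not tidyextractors code): keep only columns
    whose values contain no list, dict, or set."""
    def has_collection(series):
        return series.map(lambda v: isinstance(v, (list, dict, set))).any()

    keep = [col for col in df.columns if not has_collection(df[col])]
    return df[keep]


# Illustrative data: one atomic column and one column of lists.
frame = pd.DataFrame({
    "MessageID": [1, 2],
    "To": [["a@example.com"], ["b@example.com", "c@example.com"]],
})

flat = drop_collection_columns(frame)
print(list(flat.columns))
```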