`BaseExtractor`

The BaseExtractor is the foundation for all data extractors in this module.

Warning

The BaseExtractor should never be instantiated directly; use one of its subclasses instead. This page is intended for people who want to contribute to the tidyextractors project or who want to understand what is going on under the hood.

For developers, note that the internal interface is not documented on this site. See the source code for the full documentation of the internal interface.

class tidyextractors.BaseExtractor(source, auto_extract=True, *args, **kwargs)

BaseExtractor defines a basic interface, initialization routine, and data manipulation tools for extractor subclasses.

expand_on(col1, col2, rename1=None, rename2=None, drop=[], drop_collections=False)

Returns a reshaped version of the extractor's data, in which each unique combination of values from col1 and col2 is given its own row.

Example function call, adapted from tidymbox:

self.expand_on('From', 'To', rename1='From', rename2='Recipient')

Columns to be expanded upon should be either atomic values or dictionaries of dictionaries. For example:

Input Data:

col1 (Atomic)	col2 (Dict of Dict)
value1	{valueA: {attr1: X1, attr2: Y1}, valueB: {attr1: X2, attr2: Y2}}
value2	{valueC: {attr1: X3, attr2: Y3}, valueD: {attr1: X4, attr2: Y4}}

Output Data:

col1_extended	col2_extended	attr1	attr2
value1	valueA	X1	Y1
value1	valueB	X2	Y2
value2	valueC	X3	Y3
value2	valueD	X4	Y4

Parameters:

col1 (str) – The first column to expand on. May be an atomic value, or a dict of dict.
col2 (str) – The second column to expand on. May be an atomic value, or a dict of dict.
rename1 (str) – The name for col1 after expansion. Defaults to col1_extended.
rename2 (str) – The name for col2 after expansion. Defaults to col2_extended.
drop (list) – Column names to be dropped from output.
drop_collections (bool) – If True, columns with compound values (lists, dicts, or sets) are dropped. Defaults to False.

Returns:

pandas.DataFrame
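
The reshaping that expand_on describes can be illustrated with a minimal sketch. It is written in plain Python so that it runs without any dependencies, and it is not the library's implementation: the function name expand_sketch and the sample records are hypothetical, and only the case of an atomic col1 with a dict-of-dict col2 is covered. The real method returns a pandas.DataFrame rather than a list of dicts.

```python
# Hypothetical records mirroring the col1 / col2 example for expand_on.
records = [
    {"col1": "value1",
     "col2": {"valueA": {"attr1": "X1", "attr2": "Y1"},
              "valueB": {"attr1": "X2", "attr2": "Y2"}}},
    {"col1": "value2",
     "col2": {"valueC": {"attr1": "X3", "attr2": "Y3"},
              "valueD": {"attr1": "X4", "attr2": "Y4"}}},
]


def expand_sketch(rows, atomic_col, dict_col):
    """Illustrative only: emit one row per (atomic value, outer dict key)."""
    expanded = []
    for row in rows:
        for key, attrs in row[dict_col].items():
            new_row = {
                atomic_col + "_extended": row[atomic_col],
                dict_col + "_extended": key,
            }
            # Entries of the inner dict become columns of their own.
            new_row.update(attrs)
            expanded.append(new_row)
    return expanded


expanded = expand_sketch(records, "col1", "col2")
for row in expanded:
    print(row)
```

Each outer key of the dict column becomes a row, and the inner attributes become columns, which is why two input rows with two keys each yield four output rows.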

raw(drop_collections=False)

Produces the extractor object’s data as it is stored internally.

Parameters:

drop_collections (bool) – Indicates whether columns with lists/dicts/sets will be dropped. Defaults to False.

Returns:

pandas.DataFrame
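
The effect of drop_collections can be sketched as follows. This is an assumption about the behaviour based on the parameter description, not the library's code: drop_collection_columns is a hypothetical helper, the sample values are invented, and plain Python records stand in for the pandas.DataFrame that raw actually returns.

```python
# Types treated as "collections", per the drop_collections description.
COLLECTION_TYPES = (list, dict, set)


def drop_collection_columns(rows):
    """Illustrative only: remove any column holding a list/dict/set in some row."""
    compound = {
        key
        for row in rows
        for key, value in row.items()
        if isinstance(value, COLLECTION_TYPES)
    }
    return [{k: v for k, v in row.items() if k not in compound} for row in rows]


# Hypothetical rows; the 'To' column holds a dict, so it is dropped.
rows = [
    {"MessageID": 1, "From": "a@example.com", "To": {"b@example.com": {}}},
    {"MessageID": 2, "From": "c@example.com", "To": {}},
]

flat = drop_collection_columns(rows)
print(flat)
```

A column is removed as a whole if any row holds a compound value in it, so the remaining columns contain only atomic values.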
