BaseExtractor
The BaseExtractor is the foundation for all data extractors in this module.
Warning
The BaseExtractor should never be instantiated directly. This page is intended for people who want to contribute to the tidyextractors project or who want to understand some of what is going on under the hood.
For developers, note that the internal interface is not documented on this site. See the source code for the full documentation of the internal interface.
class tidyextractors.BaseExtractor(source, auto_extract=True, *args, **kwargs)
BaseExtractor defines a basic interface, initialization routine, and data manipulation tools for extractor subclasses.
expand_on(col1, col2, rename1=None, rename2=None, drop=[], drop_collections=False)
Returns a reshaped version of the extractor’s data, where each unique combination of values from col1 and col2 is given its own row.
Example function call from tidymbox:
self.expand_on('From', 'To', ['MessageID', 'Recipient'], rename1='From', rename2='Recipient')
Columns to be expanded upon should be either atomic values or dictionaries of dictionaries. For example:
Input Data:
col1 (Atomic)    col2 (Dict of Dict)
value1           {valueA: {attr1: X1, attr2: Y1}, valueB: {attr1: X2, attr2: Y2}}
value2           {valueC: {attr1: X3, attr2: Y3}, valueD: {attr1: X4, attr2: Y4}}
Output Data:
col1_extended    col2_extended    attr1    attr2
value1           valueA           X1       Y1
value1           valueB           X2       Y2
value2           valueC           X3       Y3
value2           valueD           X4       Y4
Parameters:
- col1 (str) – The first column to expand on. May be an atomic value, or a dict of dict.
- col2 (str) – The second column to expand on. May be an atomic value, or a dict of dict.
- rename1 (str) – The name for col1 after expansion. Defaults to col1_extended.
- rename2 (str) – The name for col2 after expansion. Defaults to col2_extended.
- drop (list) – Column names to be dropped from output.
- drop_collections (bool) – Should columns with compound values be dropped?
Returns: pandas.DataFrame
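The reshaping described above can be illustrated with plain pandas. The following is a minimal sketch, not the library's implementation: the helper name expand_dict_of_dict is hypothetical, and it handles only the case of one atomic column and one dict-of-dict column, using the sample data from the example.

```python
import pandas as pd


def expand_dict_of_dict(df, atomic_col, nested_col, rename1=None, rename2=None):
    """Hypothetical helper: one output row per (atomic value, outer key) pair."""
    # Default output names follow the documented "<col>_extended" convention.
    rename1 = rename1 or atomic_col + "_extended"
    rename2 = rename2 or nested_col + "_extended"
    rows = []
    for _, record in df.iterrows():
        for outer_key, attrs in record[nested_col].items():
            row = {rename1: record[atomic_col], rename2: outer_key}
            row.update(attrs)  # keys of the inner dict become columns
            rows.append(row)
    return pd.DataFrame(rows)


# Sample data mirroring the input table in the example.
data = pd.DataFrame({
    "col1": ["value1", "value2"],
    "col2": [
        {"valueA": {"attr1": "X1", "attr2": "Y1"},
         "valueB": {"attr1": "X2", "attr2": "Y2"}},
        {"valueC": {"attr1": "X3", "attr2": "Y3"},
         "valueD": {"attr1": "X4", "attr2": "Y4"}},
    ],
})

expanded = expand_dict_of_dict(data, "col1", "col2")
print(expanded)
```

Each outer key of the nested dictionary yields one row, so the two input rows become four output rows with the columns col1_extended, col2_extended, attr1, and attr2.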
raw(drop_collections=False)
Produces the extractor object’s data as it is stored internally.
Parameters:
- drop_collections (bool) – Indicates whether columns with lists/dicts/sets will be dropped. Defaults to False.
Returns: pandas.DataFrame
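To make the effect of drop_collections concrete, here is a rough sketch of how columns holding lists, dicts, or sets could be filtered out of a DataFrame. It is an assumption about the behaviour described above, not the extractor's actual code; the helper drop_collection_columns and the sample column names are hypothetical.

```python
import pandas as pd


def drop_collection_columns(df):
    """Hypothetical helper: return df without columns holding a list, dict, or set."""
    def has_collection(series):
        # True if at least one cell in the column is a compound value.
        return series.map(lambda v: isinstance(v, (list, dict, set))).any()

    keep = [name for name in df.columns if not has_collection(df[name])]
    return df[keep]


# Illustrative data: "To" holds lists, "MessageID" holds atomic values.
frame = pd.DataFrame({
    "MessageID": [1, 2],
    "To": [["a@example.com"], ["b@example.com", "c@example.com"]],
})

flat = drop_collection_columns(frame)
print(list(flat.columns))
```

The original frame is left unchanged; only the returned copy omits the compound-valued column.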