expressionable package

Submodules

expressionable.expressionable module

class expressionable.expressionable.ExpressionAble(file_path, file_type=None)

Bases: object

Creates an ExpressionAble object, which represents a file to be transformed.

Parameters:
  • file_path (str) – Path of the file to read and perform operations on.
  • file_type (str, default None) – Name of the type of file that is being read.

export_filter_results(output_file_path, data_type=None, filters=None, columns=[], transpose=False, include_all_columns=False, gzip_results=False, index=None)

Filters and then exports data to a file.

Parameters:
  • output_file_path (str) – Name of the file that results will be saved to.
  • data_type (str, default None) – Name of the file format results will be saved to. If None, the type will be inferred from the file path.
  • filters (str, default None) – Query or filter to apply to the data set, written as a Python logical expression.
  • columns (list of str, default []) – Names of columns to include in the output. If empty and no filter is specified, all columns will be included.
  • transpose (bool, default False) – If True, index and columns will be transposed in the output file.
  • include_all_columns (bool, default False) – Indicates whether to include all columns in the output. If True, overrides columns.
  • gzip_results (bool, default False) – Indicates whether the resulting file will be gzipped.
  • index (str, default None) – Name of the column to be set as index.
Returns:

None
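As a rough illustration of what a filter "written in Python logic" means, the sketch below evaluates a filter string against each row of a small in-memory table using only the standard library. The helper filter_rows and the sample data are hypothetical and are not part of expressionable; the library's own evaluation of filters may differ.

```python
def filter_rows(rows, filters=None):
    """Return the rows for which the filter expression is true.

    Hypothetical helper for illustration only; not part of expressionable.
    Column names in the expression are resolved against each row.
    """
    if filters is None:
        return list(rows)
    code = compile(filters, "<filter>", "eval")
    # Builtins are emptied so the expression can only refer to column values.
    # eval is still unsafe on untrusted input.
    return [row for row in rows if eval(code, {"__builtins__": {}}, dict(row))]


samples = [
    {"Sample": "A", "Sex": "M", "Age": 50},
    {"Sample": "B", "Sex": "F", "Age": 61},
    {"Sample": "C", "Sex": "M", "Age": 33},
]
kept = filter_rows(samples, "Sex == 'M' and Age > 46")
print([row["Sample"] for row in kept])
```

Only sample A satisfies both conditions here. Passing no filter returns every row unchanged.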

export_query_results(out_file_path, out_file_type=None, columns=[], continuous_queries=[], discrete_queries=[], transpose=False, include_all_columns=False, gzip_results=False)

Filters and exports data to a file. Similar to export_filter_results, but takes filters in the form of ContinuousQuery and DiscreteQuery objects and is slightly less flexible.

Parameters:
  • out_file_path (str) – Name of the file that results will be saved to.
  • out_file_type (str, default None) – Name of the file format results will be saved to. If None, the type will be inferred from the file path.
  • columns (list of str, default []) – Names of columns to include in the output. If empty and no filter is specified, all columns will be included.
  • continuous_queries (list of ContinuousQuery) – Objects representing queries on a column of continuous data.
  • discrete_queries (list of DiscreteQuery) – Objects representing queries on a column of discrete data.
  • transpose (bool, default False) – If True, index and columns will be transposed in the output file.
  • include_all_columns (bool, default False) – Indicates whether to include all columns in the output. If True, overrides columns.
  • gzip_results (bool, default False) – Indicates whether the resulting file will be gzipped.
Returns:

None
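The idea of filtering with query objects rather than a filter string can be sketched as follows. This reference does not show the constructors of ContinuousQuery and DiscreteQuery, so the classes RangeQuery and ValueQuery below are hypothetical stand-ins, and combining the queries with a logical AND is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RangeQuery:
    """Hypothetical stand-in for a query on a continuous column."""
    column: str
    low: float
    high: float

    def matches(self, row: dict) -> bool:
        return self.low <= row[self.column] <= self.high


@dataclass
class ValueQuery:
    """Hypothetical stand-in for a query on a discrete column."""
    column: str
    allowed: List[str]

    def matches(self, row: dict) -> bool:
        return row[self.column] in self.allowed


def apply_queries(rows, continuous_queries=(), discrete_queries=()):
    # Assumption: a row is kept only if it satisfies every query.
    queries = list(continuous_queries) + list(discrete_queries)
    return [row for row in rows if all(q.matches(row) for q in queries)]


rows = [
    {"Sample": "A", "Age": 50, "Stage": "II"},
    {"Sample": "B", "Age": 61, "Stage": "IV"},
    {"Sample": "C", "Age": 33, "Stage": "II"},
]
selected = apply_queries(
    rows,
    continuous_queries=[RangeQuery("Age", 40, 70)],
    discrete_queries=[ValueQuery("Stage", ["II", "III"])],
)
print([row["Sample"] for row in selected])
```

Keeping the two kinds of query separate mirrors the two parameters of export_query_results: a numeric range on one side, a set of allowed values on the other.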

get_all_columns_info()

Retrieves the column name, data type, and all unique values from every column in a file.

Returns:

Name, data type (continuous/discrete), and unique values from every column.

Return type:

Dictionary mapping each column name to a ColumnInfo object containing that column's name, data type (continuous/discrete), and unique values

get_column_info(columnName: str, sizeLimit: int = None)

Retrieves a specified column’s name, data type, and all its unique values from a file.

Parameters:
  • columnName (str) – The name of the column about which information is being obtained.
  • sizeLimit (int, default None) – Maximum number of unique values to return.
Returns:

Name, data type (continuous/discrete), and unique values from specified column

Return type:

ColumnInfo object
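A minimal sketch of the kind of summary described above, using only the standard library. ColumnSummary is a hypothetical stand-in for ColumnInfo, whose actual attributes are not listed in this reference, and the rule used to tell continuous from discrete columns (all values numeric) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnSummary:
    """Hypothetical stand-in for the ColumnInfo object described above."""
    name: str
    data_type: str  # "continuous" or "discrete"
    unique_values: List


def summarize_column(name, values, size_limit: Optional[int] = None):
    # Assumption: a column counts as continuous when every value is numeric.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(values))
    if size_limit is not None:
        # Mirrors sizeLimit: return no more than this many unique values.
        unique = unique[:size_limit]
    return ColumnSummary(name, "continuous" if is_numeric else "discrete", unique)


stage = summarize_column("Stage", ["II", "IV", "II", "I", "III"], size_limit=3)
age = summarize_column("Age", [50, 61, 33, 50])
print(stage)
print(age)
```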

get_column_names() → list

Retrieves all column names from a dataset stored in a parquet file.

Returns:

All column names

Return type:

list

get_filtered_samples(continuous_queries, discrete_queries)

merge_files(files_to_merge, out_file_path, files_to_merge_types=[], out_file_type=None, gzip_results=False, on=None, how='inner')

Merges multiple ExpressionAble-compatible files into a single file.

Parameters:
  • files_to_merge (list of str) – File paths representing files that will be merged with the file in this ExpressionAble object.
  • out_file_path (str) – File path where the output of merging the files will be stored.
  • files_to_merge_types (list of str) – List of file types corresponding to files_to_merge. If the list is empty, types will be inferred from file extensions. If the list has one value, that will be the type of every file in files_to_merge. If the list has the same number of items as files_to_merge, the types will correspond to the files in files_to_merge.
  • out_file_type (str, default None) – Name of the file format that results will be saved to. If None, the type will be inferred from the file path.
  • gzip_results (bool, default False) – Indicates whether the resulting file will be gzipped.
  • on (str, default None) – Column or index level names to join on. These must be found in all files. If on is None and the merge is not on indexes, this defaults to the intersection of the columns in all files.
Returns:

None
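The three cases described for files_to_merge_types can be sketched as a small helper. resolve_file_types is hypothetical and not part of expressionable. Using the bare file extension as the type name, and raising ValueError for any other list length, are assumptions made for illustration.

```python
import os


def resolve_file_types(files_to_merge, files_to_merge_types=None):
    """Return one file type per input file.

    Hypothetical helper; the extension-based inference is an assumption.
    """
    types = list(files_to_merge_types or [])
    if not types:
        # Empty list: fall back to each file's extension, without the dot.
        return [
            os.path.splitext(path)[1].lstrip(".").lower()
            for path in files_to_merge
        ]
    if len(types) == 1:
        # One value: the same type applies to every file.
        return types * len(files_to_merge)
    if len(types) == len(files_to_merge):
        # Same length: types and files correspond one to one.
        return types
    raise ValueError(
        "files_to_merge_types must be empty, hold one item, "
        "or match files_to_merge in length"
    )


files = ["clinical.tsv", "expression.csv", "notes.json"]
print(resolve_file_types(files))
print(resolve_file_types(files, ["csv"]))
print(resolve_file_types(files, ["tsv", "csv", "json"]))
```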

peek(numRows=10, numCols=10)

Takes a look at the first few rows and columns of a parquet file and returns a Pandas DataFrame with the requested number of rows and columns.

Parameters:
  • numRows (int, default 10) – The number of rows the returned Pandas DataFrame will contain.
  • numCols (int, default 10) – The number of columns the returned Pandas DataFrame will contain.
Returns:

The first numRows rows and numCols columns of the given parquet file

Return type:

Pandas DataFrame
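The slicing that peek describes can be illustrated on a plain in-memory table. The function peek_table is a hypothetical stand-in that returns Python lists instead of a Pandas DataFrame. It shows only how the two limits apply, not how the library reads the file.

```python
def peek_table(header, rows, num_rows=10, num_cols=10):
    """Return the first num_rows rows and num_cols columns of a table.

    Hypothetical helper for illustration only; not part of expressionable.
    """
    return header[:num_cols], [row[:num_cols] for row in rows[:num_rows]]


header = ["Sample", "Age", "Sex", "Stage"]
rows = [
    ["A", 50, "M", "II"],
    ["B", 61, "F", "IV"],
    ["C", 33, "M", "II"],
]
top_header, top_rows = peek_table(header, rows, num_rows=2, num_cols=3)
print(top_header)
print(top_rows)
```

When the table is smaller than the requested limits, slicing simply returns everything that is available.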

peek_by_column_names(listOfColumnNames, numRows=10, indexCol='Sample')

Takes a look at a portion of the file by showing only the requested columns.

Parameters:
  • listOfColumnNames (list of str) – Names of columns that will be given in the output.
  • numRows (int, default 10) – The number of rows that will be shown with the requested columns in the output.
  • indexCol (str, default 'Sample') – Name of the column that will be the index column in the DataFrame.
Returns:

Pandas DataFrame with only the requested columns and number of rows.
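A comparable sketch for selecting columns by name, again on plain Python data. peek_columns is hypothetical, and it returns a dictionary keyed by the index column rather than a Pandas DataFrame.

```python
def peek_columns(rows, column_names, num_rows=10, index_col="Sample"):
    """Return {index value: {column: value}} for the first num_rows rows.

    Hypothetical helper for illustration only; not part of expressionable.
    """
    return {
        row[index_col]: {name: row[name] for name in column_names}
        for row in rows[:num_rows]
    }


records = [
    {"Sample": "A", "Age": 50, "Sex": "M", "Stage": "II"},
    {"Sample": "B", "Age": 61, "Sex": "F", "Stage": "IV"},
    {"Sample": "C", "Age": 33, "Sex": "M", "Stage": "II"},
]
preview = peek_columns(records, ["Age", "Stage"], num_rows=2)
print(preview)
```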

Module contents

class expressionable.ExpressionAble(file_path, file_type=None)

Bases: object

The constructor parameters and methods are identical to those of expressionable.expressionable.ExpressionAble, documented above.