API

DyeScore

class dye_score.dye_score.DyeScore(config_file_path, validate_config=True, print_config=False, sc=None)[source]
Parameters:
  • config_file_path (str) –

    The path of your config file that is used for dye score to interact with your environment. Holds references to file paths and private data such as AWS API keys. Expects a YAML file with the following keys:

    • INPUT_PARQUET_LOCATION - the location of the raw or sampled OpenWPM input parquet folder
    • DYESCORE_DATA_DIR - location where you would like dye score to store data assets
    • DYESCORE_RESULTS_DIR - location where you would like dye score to store results assets
    • USE_AWS - default False - set to True if the data store is AWS
    • AWS_ACCESS_KEY_ID - optional - for storing and retrieving data on AWS
    • AWS_SECRET_ACCESS_KEY - optional - for storing and retrieving data on AWS
    • SPARK_S3_PROTOCOL - default ‘s3’ - only ‘s3’ or ‘s3a’ may be used
    • PARQUET_ENGINE - default ‘pyarrow’ - pyarrow or fastparquet

    Locations can be a local file path or a bucket.

  • validate_config (bool, optional) – Run DyeScore.validate_config method. Defaults to True.
  • print_config (bool, optional) – Print out config once saved. Defaults to False.
  • sc (SparkContext, optional) – If accessing s3 via s3a, pass the spark context so AWS credentials can be set.
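
A minimal instantiation sketch, assuming a local setup without AWS; the file name example_config.yaml and all paths shown are placeholders, not library defaults:

    import yaml

    from dye_score.dye_score import DyeScore

    # Write a minimal config file - all paths here are placeholders for your environment
    config = {
        'INPUT_PARQUET_LOCATION': '/data/openwpm/sample.parquet',
        'DYESCORE_DATA_DIR': '/data/dye_score/data',
        'DYESCORE_RESULTS_DIR': '/data/dye_score/results',
        'USE_AWS': False,
    }
    with open('example_config.yaml', 'w') as f:
        yaml.safe_dump(config, f)

    ds = DyeScore('example_config.yaml', validate_config=True, print_config=False)
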
build_plot_data_for_thresholds(compare_list, thresholds, leaky_threshold, filename_suffix='dye_snippets', override=False)[source]

Builds a dataframe for evaluation

Contains, for each threshold, the recall against compare_list for scripts under that threshold.

Parameters:
  • compare_list (list) – List of dye scripts to compare for recall.
  • thresholds (list) – List of distances to compute snippet scores for, e.g. [0.23, 0.24, 0.25]
  • leaky_threshold (float) – Remove all snippets which dye more than this fraction of all other snippets.
  • filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets
  • override (bool, optional) – Override output files. Defaults to False.
Returns:

list. Paths results were written to
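
A hedged calling sketch; it assumes the distance and snippet-scoring steps described further down have already been run under the default filename_suffix, and the compare_list entries (script identifiers treated as ground truth), thresholds, and leaky_threshold shown are illustrative values only:

    from dye_score.dye_score import DyeScore

    ds = DyeScore('example_config.yaml')  # config file from the instantiation example above

    # Illustrative ground-truth scripts and threshold values - substitute your own
    compare_list = ['https://tracker.example.com/track.js', 'https://cdn.example.net/analytics.js']
    result_paths = ds.build_plot_data_for_thresholds(
        compare_list=compare_list,
        thresholds=[0.23, 0.24, 0.25],
        leaky_threshold=0.1,
        override=True,
    )
    print(result_paths)  # list of paths the evaluation data was written to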

build_raw_snippet_df(override=False, snippet_func=None)[source]

Builds raw_snippets from input data

Default snippet function is script_url.netloc||script_url.path_end||func_name. If script_url is missing, location is used.

Parameters:
  • override (bool) – True to replace any existing outputs
  • snippet_func (function) – Function that accepts a row of data as input and computes the snippet value; a default is provided. See the sketch after this entry.
Returns:

str. The file path where output is saved
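
A sketch of a custom snippet_func, assuming each row exposes script_url, location, and func_name fields with dictionary-style access (the row interface is an assumption here, inferred from the default snippet description above):

    from urllib.parse import urlparse

    from dye_score.dye_score import DyeScore

    def snippet_by_netloc_and_func(row):
        # Hypothetical alternative snippet: group by script host and function name only,
        # dropping the path component used by the default snippet function.
        netloc = urlparse(row['script_url']).netloc if row['script_url'] else row['location']
        return '{}||{}'.format(netloc, row['func_name'])

    ds = DyeScore('example_config.yaml')  # config file from the instantiation example above
    path = ds.build_raw_snippet_df(snippet_func=snippet_by_netloc_and_func, override=True)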

build_snippet_map(override=False)[source]

Builds snippet ids and saves map of ids to raw snippets

xarray cannot handle arbitrary-length string indexes, so we need to build a set of unique ids to reference snippets. This method creates the ids and saves the map of ids to raw snippets.

Parameters:override (bool, optional) – True to replace any existing outputs. Defaults to False
Returns:str. The file path where output is saved
build_snippet_snippet_dyeing_map(spark, override=False)[source]

Builds the file used to join snippets to data for dyeing.

Adds clean_script field to dataset. Saves parquet file with:
  • snippet - the int version, not raw_snippet
  • document_url
  • script_url
  • clean_script
Parameters:
  • spark (pyspark.sql.session.SparkSession) – spark instance
  • override (bool, optional) – True to replace any existing outputs. Defaults to False
Returns:

str. The file path where output is saved

build_snippets(spark, na_value=0, override=False)[source]

Builds row-normalized snippet dataset

  • Dimensions are n snippets x s unique symbols in dataset.
  • Data is output in zarr format with processing by spark, dask, and xarray.
  • Creates an intermediate tmp file when converting from spark to dask.
  • Slow running operation - follow spark and dask status to see progress

We use spark here because dask cannot compute a pivot table memory-efficiently. This is the only function for which we need a spark context.

Parameters:
  • spark (pyspark.sql.session.SparkSession) – spark instance
  • na_value (int, optional) – The value to fill vector where there’s no call. Defaults to 0.
  • override (bool, optional) – True to replace any existing outputs. Defaults to False
Returns:

str. The file path where output is saved
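
A hedged end-to-end sketch of the build steps; the SparkSession settings are placeholders, and the ordering shown is one plausible sequence inferred from the descriptions above rather than a documented requirement:

    from pyspark.sql import SparkSession

    from dye_score.dye_score import DyeScore

    # Placeholder local Spark session - tune master and memory settings for your environment
    spark = (
        SparkSession.builder
        .master('local[*]')
        .appName('dye_score_build')
        .getOrCreate()
    )

    ds = DyeScore('example_config.yaml')  # config file from the instantiation example above
    ds.build_raw_snippet_df(override=True)                      # raw snippets from input data
    ds.build_snippet_map(override=True)                         # int ids for raw snippets
    ds.build_snippets(spark, na_value=0, override=True)         # row-normalized zarr dataset
    ds.build_snippet_snippet_dyeing_map(spark, override=True)   # join file used for dyeing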

compute_distances_for_dye_snippets(dye_snippets, filename_suffix='dye_snippets', snippet_chunksize=1000, dye_snippet_chunksize=1000, distance_function='chebyshev', override=False, **kwargs)[source]

Computes all pairwise distances from dye snippets to all other snippets.

  • Expects snippets file to exist.
  • Writes results to zarr with name snippets_dye_distances_from_{filename_suffix}
  • This is a long-running function - see dask for progress
Parameters:
  • dye_snippets (np.array) – Numpy array of snippets to be dyed. Must be a subset of snippets index.
  • filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets
  • snippet_chunksize (int, optional) – Set the chunk size for the snippet xarray input, along the snippet dimension (not the symbol dimension). Defaults to 1000.
  • dye_snippet_chunksize (int, optional) – Set the chunk size for dye snippet xarray input, along the snippet dimension (not the symbol dimension). Defaults to 1000.
  • distance_function (string or function, optional) – Provide a function to compute distances, or a string to use a built-in distance function. See dye_score.distances for example distance functions to use as templates. Default is "chebyshev". Alternatives are cosine, jaccard, and cityblock.
  • override (bool, optional) – Override output files. Defaults to False.
  • kwargs – kwargs to pass to the distance function if required, e.g. mahalanobis requires vi
Returns:

str. Path results were written to
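
A hedged sketch of computing distances for a set of dye snippets; the snippet ids and the filename_suffix 'suspected_trackers' are hypothetical, and real ids must be drawn from your snippets index (for example, the snippets belonging to scripts you already believe are trackers):

    import numpy as np

    from dye_score.dye_score import DyeScore

    ds = DyeScore('example_config.yaml')  # config file from the instantiation example above

    # Example ids only - real values must be a subset of the snippets index
    dye_snippets = np.array([1021, 5733, 80412])

    distances_path = ds.compute_distances_for_dye_snippets(
        dye_snippets,
        filename_suffix='suspected_trackers',  # hypothetical label for this dye set
        snippet_chunksize=1000,
        dye_snippet_chunksize=1000,
        distance_function='chebyshev',
        override=True,
    )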

compute_dye_scores_for_thresholds(thresholds, leaky_threshold, filename_suffix='dye_snippets', override=False)[source]

Get dye scores for a range of distance thresholds.

  • Uses results from compute_snippets_scores_for_thresholds
  • Writes results to gzipped csv files with name dye_score_from_{filename_suffix}_{threshold}.csv.gz
Parameters:
  • thresholds (list) – List of distances to compute snippet scores for, e.g. [0.23, 0.24, 0.25]
  • leaky_threshold (float) – Remove all snippets which dye more than this fraction of all other snippets.
  • filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets
  • override (bool, optional) – Override output files. Defaults to False.
Returns:

list. Paths results were written to

compute_leaky_snippet_data(thresholds_to_test, filename_suffix='dye_snippets', override=False)[source]

Compute leaky percentages for a range of thresholds. This enables the user to select the “leaky threshold” for following rounds.

  • Writes results to parquet files with name leak_test_{filename_suffix}_{threshold}
Parameters:
  • thresholds_to_test (list) – List of distances at which to compute the percentage of snippets dyed, e.g. [0.23, 0.24, 0.25]
  • filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets
  • override (bool, optional) – Override output files. Defaults to False.
Returns:

list. Paths results were written to

compute_snippets_scores_for_thresholds(thresholds, leaky_threshold, filename_suffix='dye_snippets', override=False)[source]

Get score for snippets for a range of distance thresholds.

  • Writes results to parquet files with name snippets_score_from_{filename_suffix}_{threshold}
Parameters:
  • thresholds (list) – List of distances to compute snippet scores for, e.g. [0.23, 0.24, 0.25]
  • leaky_threshold (float) – Remove all snippets which dye more than this fraction of all other snippets.
  • filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets
  • override (bool, optional) – Override output files. Defaults to False.
Returns:

list. Paths results were written to
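
A hedged sketch of the threshold sweep, reusing the hypothetical 'suspected_trackers' suffix from the distance example above; the thresholds and the leaky_threshold value are illustrative, and choosing leaky_threshold from the leak-test output is left to you:

    from dye_score.dye_score import DyeScore

    ds = DyeScore('example_config.yaml')  # config file from the instantiation example above

    thresholds = [0.23, 0.24, 0.25]
    suffix = 'suspected_trackers'  # hypothetical suffix used when computing distances

    # 1. Inspect how leaky each candidate threshold is, then choose a leaky_threshold
    ds.compute_leaky_snippet_data(thresholds, filename_suffix=suffix, override=True)
    leaky_threshold = 0.1  # illustrative value, picked after reviewing the leak-test output

    # 2. Score snippets for each threshold, then compute dye scores from those snippet scores
    ds.compute_snippets_scores_for_thresholds(
        thresholds, leaky_threshold, filename_suffix=suffix, override=True)
    score_paths = ds.compute_dye_scores_for_thresholds(
        thresholds, leaky_threshold, filename_suffix=suffix, override=True)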

config(option)[source]

Method to retrieve config values

Parameters:option (str) – The desired config option key
Returns:The config option value
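
For example, with ds the DyeScore instance from the instantiation example above and a key from the config list above:

    data_dir = ds.config('DYESCORE_DATA_DIR')
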
dye_score_data_file(filename)[source]

Helper function to return standardized filename.

DyeScore class holds a dictionary to standardize the file names that DyeScore saves. This method looks up filenames by their short name.

Parameters:filename (str) – data file name
Returns:str. The path where the data file should reside
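
A usage sketch; ds is the DyeScore instance from the example above, and the short name passed here is hypothetical since the available keys live in the class's internal dictionary and are not listed in this documentation:

    # 'raw_snippets' is a hypothetical short name - substitute a key defined by the class
    path = ds.dye_score_data_file('raw_snippets')
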
file_in_validation(inpath)[source]

Check path exists.

Raises ValueError if not. Used for input files, as these must exist to proceed.

Parameters:inpath (str) – Path of input file

file_out_validation(outpath, override)[source]

Check path exists. Raises ValueError if it does and override is False; otherwise removes the existing file.

Parameters:
  • outpath (str) – Path of output file.
  • override (bool) – Whether to raise an error or remove existing data.

from_parquet_opts

Options used when reading from parquet.

get_input_df(columns=None)[source]

Helper function to return the input dataframe.

Parameters:columns (list, optional) – List of columns to retrieve. If None, all columns are returned.
Returns:dask.DataFrame. Input dataframe with subset of columns requested.
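
For example, with ds the DyeScore instance from the example above; the column names shown are illustrative OpenWPM-style names and depend on your input data:

    # Column names are illustrative - use columns that exist in your input parquet
    df = ds.get_input_df(columns=['script_url', 'func_name'])
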
s3_storage_options

s3 storage options built from config

Returns:dict or None. If USE_AWS is True, returns s3 options as a dict, else None.
to_parquet_opts

Options used when saving to parquet.

validate_config()[source]

Validate the config data. Currently only checks that the AWS-related values are valid.

Raises AssertionError if values are incorrect.

validate_input_data()[source]

Checks for expected columns and types in input data.

Plotting utils

The following plotting utils can be used directly, or may be useful as template code for reviewing your results.

dye_score.plotting.get_plots_for_thresholds(ds, thresholds, leaky_threshold, n_scripts_range, filename_suffix='dye_snippets', y_range=(0, 1), recall_color='black', n_scripts_color='firebrick', **extra_plot_opts)[source]
dye_score.plotting.get_pr_plot(pr_df, title, n_scripts_range, y_range=(0, 1), recall_color='black', n_scripts_color='firebrick', **extra_plot_opts)[source]

Example code for plotting dye score threshold plots

dye_score.plotting.get_threshold_summary_plot(ds)[source]
dye_score.plotting.plot_hist(title, hist, edges, y_axis_type='linear', bottom=0)[source]
dye_score.plotting.plot_key_leaky(percent_to_dye, key, y_axis_type='linear', bottom=0, bins=40)[source]