API
DyeScore
class dye_score.dye_score.DyeScore(config_file_path, validate_config=True, print_config=False, sc=None) [source]

Parameters:
- config_file_path (str) – The path of the config file that dye score uses to interact with your environment. Holds references to file paths and private data such as AWS API keys. Expects a YAML file with the following keys:
  - INPUT_PARQUET_LOCATION – the location of the raw or sampled OpenWPM input parquet folder
  - DYESCORE_DATA_DIR – the location where you would like dye score to store data assets
  - DYESCORE_RESULTS_DIR – the location where you would like dye score to store results assets
  - USE_AWS – default False – set to true if the data store is AWS
  - AWS_ACCESS_KEY_ID – optional – for storing and retrieving data on AWS
  - AWS_SECRET_ACCESS_KEY – optional – for storing and retrieving data on AWS
  - SPARK_S3_PROTOCOL – default 's3' – only s3 or s3a are used
  - PARQUET_ENGINE – default 'pyarrow' – pyarrow or fastparquet
  Locations can be a local file path or a bucket.
- validate_config (bool, optional) – Run the DyeScore.validate_config method. Defaults to True.
- print_config (bool, optional) – Print the config once saved. Defaults to False.
- sc (SparkContext, optional) – If accessing s3 via s3a, pass a spark context to set AWS credentials.
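A minimal local (non-AWS) config file might look like the following sketch; all paths are illustrative placeholders, not real values:

```yaml
# Example dye score config (illustrative values only)
INPUT_PARQUET_LOCATION: /data/openwpm/sample.parquet
DYESCORE_DATA_DIR: /data/dye_score/data
DYESCORE_RESULTS_DIR: /data/dye_score/results
USE_AWS: false
SPARK_S3_PROTOCOL: s3
PARQUET_ENGINE: pyarrow
```

The AWS keys are only needed when USE_AWS is true.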
build_plot_data_for_thresholds(compare_list, thresholds, leaky_threshold, filename_suffix='dye_snippets', override=False) [source]

Builds a dataframe for evaluation. Contains the recall compared to the compare_list for scripts under the threshold.

Parameters:
- compare_list (list) – List of dye scripts to compare against for recall.
- thresholds (list) – List of distances to compute snippet scores for, e.g. [0.23, 0.24, 0.25].
- leaky_threshold (float) – Remove all snippets which dye more than this fraction of all other snippets.
- filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets.
- override (bool, optional) – Override output files. Defaults to False.

Returns: list. Paths results were written to.
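The recall reported here can be understood as the fraction of the comparison set recovered by the scripts flagged at a given threshold. A minimal sketch, where `flagged_scripts` is a hypothetical stand-in for the scripts under the threshold:

```python
def recall(flagged_scripts, compare_list):
    """Fraction of the comparison set recovered by the flagged scripts."""
    flagged = set(flagged_scripts)
    compare = set(compare_list)
    return len(flagged & compare) / len(compare)

# Scripts flagged at some distance threshold vs. a known comparison list
print(recall(
    ["a.com/t.js", "b.com/x.js", "c.com/y.js"],
    ["a.com/t.js", "b.com/x.js", "d.com/z.js"],
))  # 2 of 3 comparison scripts recovered
```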
build_raw_snippet_df(override=False, snippet_func=None) [source]

Builds raw_snippets from input data. The default snippet function is script_url.netloc||script_url.path_end||func_name. If script_url is missing, location is used.

Parameters:
- override (bool) – True to replace any existing outputs.
- snippet_func (function) – Function that accepts a row of data as input and computes the snippet value. Default provided.

Returns: str. The file path where output is saved.
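The default snippet construction can be sketched as follows. This is an illustration, not the library's exact implementation; in particular, `path_end` is assumed here to mean the last component of the URL path:

```python
from urllib.parse import urlparse

def default_snippet(row):
    """Build netloc||path_end||func_name; fall back to location if script_url is missing."""
    url = row.get("script_url") or row.get("location")
    parsed = urlparse(url)
    path_end = parsed.path.rstrip("/").split("/")[-1]
    return "||".join([parsed.netloc, path_end, row["func_name"]])

row = {"script_url": "https://cdn.example.com/js/tracker.js", "func_name": "sendBeacon"}
print(default_snippet(row))  # cdn.example.com||tracker.js||sendBeacon
```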
build_snippet_map(override=False) [source]

Builds snippet ids and saves a map of ids to raw snippets. xarray cannot handle arbitrary-length string indexes, so we need to build a set of unique ids to reference snippets. This method creates the ids and saves the map of ids to raw snippets.

Parameters: override (bool, optional) – True to replace any existing outputs. Defaults to False.
Returns: str. The file path where output is saved.
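The id mapping itself is simple; a minimal sketch of the idea, assigning a stable integer id to each unique raw snippet:

```python
raw_snippets = [
    "cdn.example.com||tracker.js||sendBeacon",
    "cdn.example.com||tracker.js||getCookie",
    "static.example.org||app.js||render",
]

# Map integer ids to unique raw snippets (sorted for a deterministic order)
snippet_map = {i: s for i, s in enumerate(sorted(set(raw_snippets)))}
print(len(snippet_map))  # 3
```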
build_snippet_snippet_dyeing_map(spark, override=False) [source]

Builds the file used to join snippets to data for dyeing. Adds a clean_script field to the dataset. Saves a parquet file with:
- snippet – the int version, not raw_snippet
- document_url
- script_url
- clean_script

Parameters:
- spark (pyspark.sql.session.SparkSession) – spark instance
- override (bool, optional) – True to replace any existing outputs. Defaults to False.

Returns: str. The file path where output is saved.
build_snippets(spark, na_value=0, override=False) [source]

Builds a row-normalized snippet dataset.
- Dimensions are n snippets x s unique symbols in the dataset.
- Data is output in zarr format, with processing by spark, dask, and xarray.
- Creates an intermediate tmp file when converting from spark to dask.
- Slow-running operation – follow spark and dask status to see progress.

We use spark here because dask cannot memory-efficiently compute a pivot table. This is the only function we need a spark context for.

Parameters:
- spark (pyspark.sql.session.SparkSession) – spark instance
- na_value (int, optional) – The value to fill the vector with where there's no call. Defaults to 0.
- override (bool, optional) – True to replace any existing outputs. Defaults to False.

Returns: str. The file path where output is saved.
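Conceptually, the output is a snippet-by-symbol matrix of call counts with each row normalized to sum to 1. A small pure-Python sketch of that transformation (the real implementation runs the pivot in spark and the normalization over chunked dask/xarray data):

```python
# (snippet_id, symbol) call records
calls = [(0, "sendBeacon"), (0, "getCookie"), (0, "getCookie"), (1, "render")]

symbols = sorted({sym for _, sym in calls})      # column order
snippet_ids = sorted({sid for sid, _ in calls})  # row order

# Pivot: count calls per (snippet, symbol); missing cells get na_value=0
matrix = [[sum(1 for s, sym in calls if s == sid and sym == col) for col in symbols]
          for sid in snippet_ids]

# Row-normalize so each snippet's counts sum to 1
normalized = [[v / sum(row) for v in row] for row in matrix]
print(normalized)
```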
compute_distances_for_dye_snippets(dye_snippets, filename_suffix='dye_snippets', snippet_chunksize=1000, dye_snippet_chunksize=1000, distance_function='chebyshev', override=False, **kwargs) [source]

Computes all pairwise distances from dye snippets to all other snippets.
- Expects the snippets file to exist.
- Writes results to zarr with the name snippets_dye_distances_from_{filename_suffix}.
- This is a long-running function – see dask for progress.

Parameters:
- dye_snippets (np.array) – Numpy array of snippets to be dyed. Must be a subset of the snippets index.
- filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets.
- snippet_chunksize (int, optional) – Set the chunk size for the snippet xarray input, along the snippet dimension (not the symbol dimension). Defaults to 1000.
- dye_snippet_chunksize (int, optional) – Set the chunk size for the dye snippet xarray input, along the snippet dimension (not the symbol dimension). Defaults to 1000.
- distance_function (string or function, optional) – Provide a function to compute distances, or a string to use a built-in distance function. See dye_score.distances.py for example distance functions to use as templates. Default is "chebyshev". Alternatives are cosine, jaccard, and cityblock.
- override (bool, optional) – Override output files. Defaults to False.
- kwargs – kwargs to pass to the distance function if required, e.g. mahalanobis requires vi.

Returns: str. Path results were written to.
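The default chebyshev distance between two row-normalized snippet vectors is simply the largest absolute per-symbol difference. A minimal pure-Python sketch (the library computes this over chunked xarray data):

```python
def chebyshev(u, v):
    """Chebyshev distance: the largest absolute coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))

dye_vec = [2/3, 0.0, 1/3]   # a dyed snippet's normalized symbol vector
other   = [0.5, 0.0, 0.5]
print(chebyshev(dye_vec, other))
```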
compute_dye_scores_for_thresholds(thresholds, leaky_threshold, filename_suffix='dye_snippets', override=False) [source]

Gets dye scores for a range of distance thresholds.
- Uses results from compute_snippets_scores_for_thresholds.
- Writes results to gzipped csv files with the name dye_score_from_{filename_suffix}_{threshold}.csv.gz.

Parameters:
- thresholds (list) – List of distances to compute snippet scores for, e.g. [0.23, 0.24, 0.25].
- leaky_threshold (float) – Remove all snippets which dye more than this fraction of all other snippets.
- filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets.
- override (bool, optional) – Override output files. Defaults to False.

Returns: list. Paths results were written to.
compute_leaky_snippet_data(thresholds_to_test, filename_suffix='dye_snippets', override=False) [source]

Computes leaky percentages for a range of thresholds. This enables the user to select the "leaky threshold" for following rounds.
- Writes results to parquet files with the name leak_test_{filename_suffix}_{threshold}.

Parameters:
- thresholds_to_test (list) – List of distances to compute the percentage of snippets dyed at, e.g. [0.23, 0.24, 0.25].
- filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets.
- override (bool, optional) – Override output files. Defaults to False.

Returns: list. Paths results were written to.
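A snippet's leaky percentage at a threshold can be read as the fraction of all other snippets it would dye at that distance. A hedged sketch of the idea, using hypothetical precomputed distances:

```python
# Distances from one dye snippet to every other snippet (illustrative values)
distances = [0.10, 0.22, 0.24, 0.30, 0.55, 0.80]

def leaky_percentage(distances, threshold):
    """Fraction of snippets within `threshold` of the dye snippet."""
    return sum(1 for d in distances if d <= threshold) / len(distances)

for t in [0.23, 0.24, 0.25]:
    print(t, leaky_percentage(distances, t))
```

A dye snippet whose leaky percentage exceeds the chosen leaky_threshold would be dropped in later rounds.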
compute_snippets_scores_for_thresholds(thresholds, leaky_threshold, filename_suffix='dye_snippets', override=False) [source]

Gets scores for snippets for a range of distance thresholds.
- Writes results to parquet files with the name snippets_score_from_{filename_suffix}_{threshold}.

Parameters:
- thresholds (list) – List of distances to compute snippet scores for, e.g. [0.23, 0.24, 0.25].
- leaky_threshold (float) – Remove all snippets which dye more than this fraction of all other snippets.
- filename_suffix (str, optional) – Change to differentiate between dye_snippet sets. Defaults to dye_snippets.
- override (bool, optional) – Override output files. Defaults to False.

Returns: list. Paths results were written to.
config(option) [source]

Method to retrieve config values.

Parameters: option (str) – The desired config option key.
Returns: The config option value.
dye_score_data_file(filename) [source]

Helper function to return a standardized filename. The DyeScore class holds a dictionary to standardize the file names that DyeScore saves. This method looks up filenames by their short name.

Parameters: filename (str) – data file name.
Returns: str. The path where the data file should reside.
file_in_validation(inpath) [source]

Checks that the path exists; raises ValueError if not. Used for input files, as these must exist to proceed.

Parameters: inpath (str) – Path of input file.
file_out_validation(outpath, override) [source]

Checks whether the path exists. If it does, raises ValueError when override is False; otherwise removes the existing file.

Parameters:
- outpath (str) – Path of output file.
- override (bool) – Whether to raise an error or remove existing data.
from_parquet_opts

Options used when reading from parquet.
get_input_df(columns=None) [source]

Helper function to return the input dataframe.

Parameters: columns (list, optional) – List of columns to retrieve. If None, all columns are returned.
Returns: dask.DataFrame. The input dataframe with the subset of columns requested.
s3_storage_options

s3 storage options built from config.

Returns: dict. If USE_AWS is True, returns s3 options as a dict; else None.
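The shape of these options follows the storage_options convention used by dask and s3fs. A hedged sketch of what this property plausibly returns (the key names here are assumptions, not the library's exact output):

```python
def s3_storage_options(config):
    """Build s3fs-style storage options from a config dict; None when AWS is unused."""
    if not config.get("USE_AWS"):
        return None
    return {
        "key": config["AWS_ACCESS_KEY_ID"],       # assumed s3fs credential key name
        "secret": config["AWS_SECRET_ACCESS_KEY"],  # assumed s3fs credential key name
    }

print(s3_storage_options({"USE_AWS": False}))  # None
```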
to_parquet_opts

Options used when saving to parquet.
Plotting utils
The following plotting utils can be used directly, or may be useful as template code for reviewing your results.
dye_score.plotting.get_plots_for_thresholds(ds, thresholds, leaky_threshold, n_scripts_range, filename_suffix='dye_snippets', y_range=(0, 1), recall_color='black', n_scripts_color='firebrick', **extra_plot_opts) [source]