Usage
This notebook walks through using the dye score library and methodology to score scripts.
The input data is generated by OpenWPM. A dataset that has been used with the dye score is available at github.com/mozilla/overscripted.
This notebook was run on a small sample.
Dye Score expects a Spark context to be available for the initial data processing steps.
Additionally, set up a Dask Client however you choose. The cell below was generated by Dask's JupyterLab extension.
Note that the warning in the output is known to the Dask team (https://github.com/dask/distributed/issues/2564).
[1]:
from dask.distributed import Client
client = Client("tcp://127.0.0.1:32829")
client
/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/dask/config.py:168: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
data = yaml.load(f.read()) or {}
/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/distributed/config.py:20: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
defaults = yaml.load(f)
[2]:
import dask.dataframe as dd
import numpy as np
from dye_score import DyeScore
[3]:
ds = DyeScore('config.yaml', print_config=False)
[4]:
ds.validate_input_data()
[4]:
True
[5]:
df = ds.get_input_df()
df.head()
[5]:
| | top_level_url | script_url | func_name | symbol |
|---|---|---|---|---|
| 0 | https://7ero.org/ | https://forsiteid6346.tech/convert/scripts/cre... | b.exec | CanvasRenderingContext2D.fillStyle |
| 1 | https://www.stevinsonhyundai.com/ | https://tag.contactatonce.com/le_secure_storag... | r | window.Storage.setItem |
| 2 | https://www.thecircle.com/us/ | https://www.thecircle.com/k3/ruxitagentjs_ICA2... | fc | window.document.cookie |
| 3 | https://www.jcpportraits.com/ | https://cdn.optimizely.com/js/8447592883.js | be/< | window.Storage.length |
| 4 | https://www.technik-profis.de/ | https://cdn.optimizely.com/js/8323142798.js | t.getUserAgent | window.navigator.userAgent |
[6]:
print(f'This sample is {len(df):,} rows')
This sample is 2,312,697 rows
Data Preparation
[7]:
%time ds.build_raw_snippet_df()
top_level_url \
0 https://7ero.org/
1 https://www.stevinsonhyundai.com/
2 https://www.thecircle.com/us/
3 https://www.jcpportraits.com/
4 https://www.technik-profis.de/
script_url func_name \
0 https://forsiteid6346.tech/convert/scripts/cre... b.exec
1 https://tag.contactatonce.com/le_secure_storag... r
2 https://www.thecircle.com/k3/ruxitagentjs_ICA2... fc
3 https://cdn.optimizely.com/js/8447592883.js be/<
4 https://cdn.optimizely.com/js/8323142798.js t.getUserAgent
symbol \
0 CanvasRenderingContext2D.fillStyle
1 window.Storage.setItem
2 window.document.cookie
3 window.Storage.length
4 window.navigator.userAgent
raw_snippet called
0 forsiteid6346.tech||createjs-2015.11.26.min.js... 1
1 tag.contactatonce.com||storage.secure.min.html||r 1
2 www.thecircle.com||ruxitagentjs_ICA27SVfhjoqrx... 1
3 cdn.optimizely.com||8447592883.js||be/< 1
4 cdn.optimizely.com||8323142798.js||t.getUserAgent 1
CPU times: user 121 ms, sys: 43 ms, total: 164 ms
Wall time: 39.7 s
[7]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_data/raw_snippet_call_df.parquet'
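Judging from the complete rows in the output above, the raw_snippet column appears to join the script host, the script file name, and the function name with `||`. The sketch below reproduces that pattern with the standard library only. The helper name `make_raw_snippet` is hypothetical, and the parsing rules are an assumption inferred from the printed rows, not the library's actual implementation.

```python
from urllib.parse import urlparse


def make_raw_snippet(script_url: str, func_name: str) -> str:
    """Join host, file name and function name with '||' (assumed format)."""
    parsed = urlparse(script_url)
    # Take the last path segment; this is empty when the URL ends with "/".
    filename = parsed.path.rsplit("/", 1)[-1]
    return "||".join([parsed.netloc, filename, func_name])


# Two rows taken from the sample output above.
example_a = make_raw_snippet("https://cdn.optimizely.com/js/8447592883.js", "be/<")
example_b = make_raw_snippet(
    "https://cdn.optimizely.com/js/8323142798.js", "t.getUserAgent"
)
print(example_a)
print(example_b)
```

Grouping calls by such a string treats every function in every script file as its own unit, which is what the later steps refer to as a snippet.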
[8]:
%time ds.build_snippet_map()
raw_snippet snippet
0 forsiteid6346.tech||createjs-2015.11.26.min.js... 792826184637634903
1 tag.contactatonce.com||storage.secure.min.html||r -3182365903651065472
2 www.thecircle.com||ruxitagentjs_ICA27SVfhjoqrx... -9027005229756292155
3 cdn.optimizely.com||8447592883.js||be/< 2248811367515630966
4 cdn.optimizely.com||8323142798.js||t.getUserAgent -6265856453346281252
CPU times: user 81 ms, sys: 10.6 ms, total: 91.6 ms
Wall time: 1.71 s
[8]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_data/snippet_lookup.parquet'
The next two methods require a Spark context, which is passed to them as an argument.
[9]:
%time ds.build_snippets(spark)
/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
warnings.warn("pyarrow.open_stream is deprecated, please use "
Dataset has 216 unique symbols
<xarray.DataArray (snippet: 231057, symbol: 216)>
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
Coordinates:
* snippet (snippet) object '-1000043381057326421' ... '999736522860943363'
* symbol (symbol) object 'AnalyserNode.channelCount' ... 'window.sessionStorage'
CPU times: user 13.6 s, sys: 859 ms, total: 14.5 s
Wall time: 4min
[9]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_data/snippets.zarr'
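The array printed above has one row per snippet and one column per symbol. As a rough mental model, and only as an assumption about what the cells hold, the sketch below sums a `called` count per (snippet, symbol) pair into a dense matrix in plain Python. The function name `build_snippet_symbol_matrix` and the toy rows are hypothetical; whether the library stores raw counts, normalized values, or something else is not shown in this notebook.

```python
from collections import defaultdict


def build_snippet_symbol_matrix(rows):
    """Build a dense snippet x symbol matrix from (snippet, symbol, called) rows."""
    counts = defaultdict(float)
    for snippet, symbol, called in rows:
        counts[(snippet, symbol)] += called
    # Sorted labels give a stable row and column order.
    snippets = sorted({s for s, _ in counts})
    symbols = sorted({sym for _, sym in counts})
    matrix = [[counts.get((s, sym), 0.0) for sym in symbols] for s in snippets]
    return snippets, symbols, matrix


# Toy input: snippet ids "s1" and "s2" are made up for illustration.
rows = [
    ("s1", "window.document.cookie", 1),
    ("s1", "window.document.cookie", 1),
    ("s1", "window.navigator.userAgent", 1),
    ("s2", "window.Storage.setItem", 1),
]
snippets, symbols, matrix = build_snippet_symbol_matrix(rows)
for label, row in zip(snippets, matrix):
    print(label, row)
```

With 231,057 snippets and 216 symbols the real array is large and mostly zeros, which is one reason the notebook hands this step to Spark and stores the result as zarr.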
[12]:
%time ds.build_snippet_snippet_dyeing_map(spark)
CPU times: user 39.1 ms, sys: 51.7 ms, total: 90.8 ms
Wall time: 9.8 s
[12]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_data/snippet_dyeing_map.parquet'
No further functions depend on Spark.
Dyeing
Building the list of dye snippets is up to the user. Here we show an example using a keyword search for fingerprint.
[4]:
snippet_dyeing_map_file = ds.dye_score_data_file('snippet_dyeing_map')
snippet_data = dd.read_parquet(snippet_dyeing_map_file, engine='pyarrow')
snippet_data.head()
[4]:
| | top_level_url | script_url | func_name | snippet | clean_script |
|---|---|---|---|---|---|
| 0 | http://narvalife.ucoz.net/ | https://usocial.pro/usocial/fingerprint2.min.js | e.prototype.getNavigatorPlatform | 4996125033026346492 | usocial.pro/usocial/fingerprint2.min.js |
| 1 | https://sletaem.by/ | https://sletaem.by/ | updateTimer | -6846198680163094774 | sletaem.by/ |
| 2 | http://realcoco.com/ | http://fs.bizspring.net/fsn/bstrk.1.js | _trkdp_getCookie | 2578583411096044764 | fs.bizspring.net/fsn/bstrk.1.js |
| 3 | https://www.trendydiscount.shop/ | https://www.google-analytics.com/analytics.js | zc | 1695113790766404014 | www.google-analytics.com/analytics.js |
| 4 | https://www.liveaquaria.com/ | https://www.youtube.com/yts/jsbin/player-vflYg... | hE | 2066756695033721030 | www.youtube.com/yts/jsbin/player-vflYgf3QU/en_... |
[4]:
key = 'fingerprint'
filename_suffix = f'{key}_keyword'
thresholds = [0.15, 0.2, 0.23, 0.24, 0.25, 0.26, 0.3, 0.35]
[6]:
script_snippets = snippet_data[snippet_data.clean_script.str.contains(key, case=False)].snippet.unique().astype(str)
funcname_snippets = snippet_data[snippet_data.func_name.str.contains(key, case=False)].snippet.unique().astype(str)
dye_snippets = np.unique(np.append(script_snippets, funcname_snippets))
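The selection logic in the cell above can be mirrored on a few in-memory rows, which may help when adapting it to a different keyword. This is a minimal sketch in plain Python, not a replacement for the Dask version: the second row (`example.test/app.js`, `getFingerPrint`) is invented for illustration, and the snippet ids are shortened placeholders.

```python
key = "fingerprint"

# Toy rows with the same three columns the Dask query uses.
rows = [
    {"snippet": "1", "clean_script": "usocial.pro/usocial/fingerprint2.min.js",
     "func_name": "e.prototype.getNavigatorPlatform"},
    {"snippet": "2", "clean_script": "example.test/app.js",
     "func_name": "getFingerPrint"},
    {"snippet": "3", "clean_script": "www.google-analytics.com/analytics.js",
     "func_name": "zc"},
]

# A snippet is selected if the keyword appears, case-insensitively,
# in either its script name or its function name.
toy_dye_snippets = sorted(
    {
        row["snippet"]
        for row in rows
        if key in row["clean_script"].lower() or key in row["func_name"].lower()
    }
)
print(toy_dye_snippets)
```

The set comprehension plays the role of `np.unique(np.append(...))`: a snippet matched by both the script name and the function name is kept once.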
With the dye snippets in hand we can now use the DyeScore library to compute the dye scores for a range of thresholds.
[8]:
%time ds.compute_distances_for_dye_snippets(dye_snippets=dye_snippets, filename_suffix=filename_suffix)
/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/dask/array/blockwise.py:204: UserWarning: The da.atop function has moved to da.blockwise
warnings.warn("The da.atop function has moved to da.blockwise")
<xarray.DataArray 'data' (snippet: 231057, dye_snippet: 553)>
dask.array<shape=(231057, 553), dtype=float64, chunksize=(10000, 100)>
Coordinates:
* snippet (snippet) object '-1000043381057326421' ... '999736522860943363'
* dye_snippet (dye_snippet) object '-1006661115172174629' ... '917589267078160730'
CPU times: user 453 ms, sys: 120 ms, total: 573 ms
Wall time: 1min 6s
[8]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_dye_distances_from_fingerprint_keyword'
[9]:
%time ds.compute_snippets_scores_for_thresholds(thresholds, filename_suffix=filename_suffix)
CPU times: user 1.55 s, sys: 253 ms, total: 1.8 s
Wall time: 13.1 s
[9]:
['/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.15',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.2',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.23',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.24',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.25',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.26',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.3',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/snippets_score_from_fingerprint_keyword_0.35']
[5]:
%time ds.compute_dye_scores_for_thresholds(thresholds, filename_suffix=filename_suffix)
CPU times: user 10.3 s, sys: 415 ms, total: 10.7 s
Wall time: 57 s
[5]:
['/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.15.csv.gz',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.2.csv.gz',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.23.csv.gz',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.24.csv.gz',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.25.csv.gz',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.26.csv.gz',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.3.csv.gz',
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_from_fingerprint_keyword_0.35.csv.gz']
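To build intuition for what a distance threshold does, the sketch below marks a snippet as dyed when it lies within the threshold of at least one dye snippet. Both the Chebyshev distance and the "any dye snippet within the threshold" rule are assumptions chosen for illustration; this notebook does not state which metric or rule DyeScore uses internally. The function names and toy vectors are hypothetical.

```python
def chebyshev(a, b):
    """Largest absolute difference between two equal-length vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def dyed_snippets(vectors, dye_ids, threshold):
    """Return ids of snippets closer than `threshold` to any dye snippet."""
    dyed = set()
    for snippet_id, vec in vectors.items():
        if any(chebyshev(vec, vectors[d]) < threshold for d in dye_ids):
            dyed.add(snippet_id)
    return dyed


# Toy symbol-usage vectors: "b" is close to "a", "c" is far from it.
vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 0.0, 1.0],
}

loose = dyed_snippets(vectors, ["a"], threshold=0.2)
strict = dyed_snippets(vectors, ["a"], threshold=0.05)
print(sorted(loose), sorted(strict))
```

A larger threshold dyes more snippets, which is why the notebook computes results for a range of thresholds and compares them in the next section.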
Evaluate scores
We now manually review the dye scores against the input dye list in order to select the best distance threshold.
The review process needs a list of scripts, identified by their clean_script value, to compare against the dye score list and produce the plot at the end of this section. How this list is produced depends on how the dye snippets list was prepared.
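The comparison between a scored list and a reference list is commonly summarized as precision and recall. The sketch below uses the textbook definitions on two small sets of script names. How `get_pr_plot` derives its curves is not shown in this notebook, so treat this as background for reading the plots rather than a description of the library. The function name and the script names are hypothetical.

```python
def precision_recall(flagged, known):
    """Textbook precision and recall of `flagged` against the reference `known`."""
    flagged, known = set(flagged), set(known)
    true_positives = len(flagged & known)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Scripts flagged at some threshold versus the reference compare list.
flagged_scripts = {"a.js", "b.js", "c.js", "d.js"}
compare_scripts = {"a.js", "b.js", "e.js"}

precision, recall = precision_recall(flagged_scripts, compare_scripts)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A loose threshold tends to raise recall at the cost of precision, so the preferred threshold is usually the one that balances the two for the task at hand.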
[10]:
import pandas as pd
from bokeh.io import export_png, show
from bokeh.layouts import gridplot
from dye_score.plotting import get_pr_plot
from IPython.display import Image
[8]:
snippet_data.head()
[8]:
| | top_level_url | script_url | func_name | snippet | clean_script |
|---|---|---|---|---|---|
| 0 | http://narvalife.ucoz.net/ | https://usocial.pro/usocial/fingerprint2.min.js | e.prototype.getNavigatorPlatform | 4996125033026346492 | usocial.pro/usocial/fingerprint2.min.js |
| 1 | https://sletaem.by/ | https://sletaem.by/ | updateTimer | -6846198680163094774 | sletaem.by/ |
| 2 | http://realcoco.com/ | http://fs.bizspring.net/fsn/bstrk.1.js | _trkdp_getCookie | 2578583411096044764 | fs.bizspring.net/fsn/bstrk.1.js |
| 3 | https://www.trendydiscount.shop/ | https://www.google-analytics.com/analytics.js | zc | 1695113790766404014 | www.google-analytics.com/analytics.js |
| 4 | https://www.liveaquaria.com/ | https://www.youtube.com/yts/jsbin/player-vflYg... | hE | 2066756695033721030 | www.youtube.com/yts/jsbin/player-vflYgf3QU/en_... |
[9]:
compare_list = snippet_data[snippet_data.snippet.isin(dye_snippets)].clean_script.unique().compute()
compare_list.head()
[9]:
0 usocial.pro/usocial/fingerprint2.min.js
1 script.hotjar.com/modules-ab5ba0ccf53ded68dfc9...
2 www.convertthepdf.co/js/landing.js
3 track.adabra.com/sbn_fingerprint.v1.16.47.min.js
4 www.bestwestern.com.br/modules/mod_rewards_but...
Name: clean_script, dtype: object
[10]:
%time plot_df_paths = ds.build_plot_data_for_thresholds(compare_list, thresholds, filename_suffix=filename_suffix)
CPU times: user 39.7 s, sys: 148 ms, total: 39.9 s
Wall time: 39.9 s
[7]:
plot_df_paths[0]
[7]:
'/home/bird/Dev/mozilla/overscripted-clustering/new_data_dye_package/test_dyescore_results/dye_score_plot_data_from_fingerprint_keyword_0.15.csv.gz'
[11]:
plots = []
plot_opts = dict(tools='', toolbar_location=None, width=300, height=200)
for threshold, pr_df_path in zip(thresholds, plot_df_paths):
    pr_df = pd.read_csv(pr_df_path)
    plots.append(get_pr_plot(pr_df, title=f'{threshold}', plot_opts=plot_opts))
Image(export_png(gridplot(plots, ncols=3, toolbar_location=None)))
[11]:
The remaining analysis is up to the user, based on their preferred distance threshold.