Links that we found helpful in developing and deploying Tutela on EC2. Tutela uses a PostgreSQL database to store clusters.
Add the code to your Python path:
source init_env.sh
[10.28.21] Added Redis to cache queries so that we don't need to continually query Firebase. However, this is only a band-aid, as it does not support fast querying.
[10.29.21] Switching to a SQL model for fast queries. This removes the need for Firebase and Redis.
[11.9.21] It is important to keep uploading to the database and creating the cluster files as separate steps. We also assume that the graph will be small enough to fit into memory, though there is some doubt here.
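As a rough illustration of the last note, here is a minimal sketch of building clusters from an in-memory edge list and writing them to a file, with the database upload left as a separate step. It assumes that clusters are the connected components of an address graph, which is our reading of the notes and not a description of the actual scripts. It uses a plain union-find in place of a graph library, and the names build_clusters and write_clusters are hypothetical.

```python
import json


def find(parent, x):
    """Return the root of x, compressing the path along the way."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def build_clusters(edges):
    """Group nodes into connected components; the whole graph is held in memory."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two components

    clusters = {}
    for node in parent:
        clusters.setdefault(find(parent, node), []).append(node)
    return [sorted(members) for members in clusters.values()]


def write_clusters(clusters, path):
    """Write the clusters to a JSON file; uploading to the database happens elsewhere."""
    with open(path, "w") as f:
        json.dump(clusters, f)
```

Memory use grows with the number of distinct addresses, so this only works while the in-memory assumption holds. Because the cluster file is written first, a failed upload can be retried without recomputing the clusters.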
The steps below compute "deposit address reuse" clusters. This involves processing CSVs of more than 1 TB, so take care not to exceed storage and RAM limits.
1. Run scripts/dl_bucket.py.
2. Run scripts/sort_big_csv.py.
3. Run scripts/run_deposit.py to generate data.csv and metadata.csv.
4. Prune data.csv -> data-pruned.csv using scripts/prune_data.csv.
5. Prune metadata.csv -> metadata-pruned.csv using scripts/metadata-pruned.csv.
6. Run scripts/run_nx.py to generate user_clusters.json and exchange_clusters.json.
7. Run combine_metadata.py to generate metadata-final.csv.
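Since the procedure above sorts a CSV that is too large for RAM, here is a minimal sketch of one common way to do that: an external merge sort that sorts fixed-size chunks in memory, writes them to temporary files, and then merges them lazily. This is not necessarily what scripts/sort_big_csv.py does. The function signature, the key_col and chunk_rows parameters, and the assumption of a headerless CSV sorted lexicographically by one column are all ours.

```python
import csv
import heapq
import os
import tempfile
from itertools import islice


def sort_big_csv(in_path, out_path, key_col=0, chunk_rows=100_000):
    """Sort a headerless CSV by one column (as strings) using bounded memory."""
    chunk_paths = []
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        while True:
            # Hold at most chunk_rows rows in memory at a time.
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                break
            chunk.sort(key=lambda row: row[key_col])
            fd, path = tempfile.mkstemp(suffix=".csv")
            with os.fdopen(fd, "w", newline="") as tmp:
                csv.writer(tmp).writerows(chunk)
            chunk_paths.append(path)

    # Merge the sorted chunks lazily; heapq.merge reads rows on demand.
    files = [open(p, newline="") for p in chunk_paths]
    try:
        readers = [csv.reader(f) for f in files]
        with open(out_path, "w", newline="") as dst:
            merged = heapq.merge(*readers, key=lambda row: row[key_col])
            csv.writer(dst).writerows(merged)
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

The temporary chunks take roughly as much disk space as the input, so the storage budget needs room for the input, the chunks, and the output at the same time. A small chunk_rows also means many open files during the merge, which can hit the operating system's open-file limit.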