Introduction
Title: Introduction to Elliptic Data Set - a graph network of Bitcoin transactions with handcrafted features.
Keywords: Anomaly Detection, Bitcoin, Anti-Money Laundering
Definitions of licit and illicit
Licit: exchanges, wallet providers, miners, licit services, etc.
wallet provider: any natural or legal person or other legal form that provides private cryptographic key protection services on behalf of its clients for the possession, storage, and transfer of virtual currencies.
Illicit: scams, malware, terrorist organizations, ransomware, Ponzi schemes, etc.
Graph Construction
Nodes
Nodes represent transactions. The data set contains 203,769 transaction nodes: 4,545 labelled illicit, 42,019 labelled licit, and 157,205 unlabelled (unknown).
Applicable model category: semi-supervised learning, because most nodes are unlabelled.
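The label counts above can be turned into class shares to see how sparse and imbalanced the labels are. The following is a minimal sketch that uses only the figures quoted in these notes; the variable names are illustrative.

```python
# Label counts as quoted in the notes above.
label_counts = {"illicit": 4545, "licit": 42019, "unknown": 157205}

total = sum(label_counts.values())
labelled = label_counts["illicit"] + label_counts["licit"]

# Share of each class among all nodes.
shares = {name: count / total for name, count in label_counts.items()}

# Share of illicit nodes among the labelled nodes only.
illicit_among_labelled = label_counts["illicit"] / labelled

print(f"total nodes: {total}")
for name, share in shares.items():
    print(f"{name:>8}: {share:.1%}")
print(f"illicit among labelled: {illicit_among_labelled:.1%}")
```

Roughly three quarters of the nodes carry no label, and illicit transactions are a small minority of the labelled ones, which is why both semi-supervised and anomaly-detection framings are natural here.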
Node Feature
Due to intellectual property issues, we cannot provide an exact description of all the features in the dataset.
Each node has 166 associated features. The first 94 features represent local information about the transaction. The remaining 72 features, called aggregated features, are obtained by aggregating transaction information one hop backward/forward from the center node.
- 94 local features:
Time step: see Temporal Information below.
Number of inputs/outputs:
Inputs: the Bitcoin addresses holding the bitcoin that the sender (Alice) wants to send. More precisely, these are the addresses at which Alice previously received bitcoin that she now wants to spend. Outputs: the public key or Bitcoin address of the recipient (Bob).
Transaction fee: mathematically, the transaction fee is the difference between the amount of bitcoin sent and the amount received. Conceptually, the fee reflects how quickly a user wants the transaction validated on the blockchain.
Output volume and aggregated figures, such as the average BTC received (spent) by the inputs/outputs and the average number of incoming (outgoing) transactions associated with the inputs/outputs.
- 72 aggregated features:
Obtained by aggregating transaction information one hop backward/forward from the center node: the maximum, minimum, standard deviation, and correlation coefficients of the neighbouring transactions for the same information (number of inputs/outputs, transaction fee, etc.).
- Temporal Information
A time step from 1 to 49 is associated with each node. It represents an estimate of when the Bitcoin network confirmed the transaction. The time steps are evenly spaced at intervals of about two weeks, and each one contains a single connected component of transactions that appeared on the blockchain within less than three hours of each other. The data set can therefore be viewed as 49 directed acyclic graphs, each associated with a different sequential moment in time.
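Since every node carries a time step, one natural way to work with the data is to group transactions by time step and to split them chronologically, so that a model is evaluated on later steps than it was trained on. The sketch below illustrates that idea under stated assumptions: the helper names `group_by_time_step` and `temporal_split` are hypothetical, the cutoff value is arbitrary, and the toy mapping is synthetic rather than taken from the data set.

```python
from collections import defaultdict


def group_by_time_step(node_time_steps):
    """Map each time step to the list of transaction ids observed in it."""
    groups = defaultdict(list)
    for tx_id, step in node_time_steps.items():
        groups[step].append(tx_id)
    return dict(groups)


def temporal_split(groups, cutoff):
    """Put time steps <= cutoff into train and later time steps into test."""
    train = [tx for step, txs in groups.items() if step <= cutoff for tx in txs]
    test = [tx for step, txs in groups.items() if step > cutoff for tx in txs]
    return train, test


# Synthetic toy data: transaction id -> time step (not from the real data set).
toy = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 3}

groups = group_by_time_step(toy)
train, test = temporal_split(groups, cutoff=2)  # cutoff chosen arbitrarily
print(groups)
print(train, test)
```

A chronological split avoids training on transactions that occur after the ones being predicted, which a random split would allow.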
Edges
Edges represent the flow of Bitcoin currency (BTC) from one transaction to the next. There are 234,355 directed edges, each representing a payment flow.
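With directed edges in hand, the one-hop backward/forward aggregation described under Node Feature can be illustrated. The exact procedure used to build the real aggregated features is not disclosed, so the following is only a sketch under assumptions: the function names are hypothetical, the population standard deviation is an arbitrary choice, correlation coefficients are omitted, and the edges and fee values are synthetic.

```python
from collections import defaultdict
from statistics import pstdev


def build_neighbours(edges):
    """Collect the one-hop neighbours (backward and forward) of each node."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)  # one hop forward from src
        neighbours[dst].add(src)  # one hop backward from dst
    return neighbours


def aggregate(node, neighbours, feature):
    """Summarise one local feature over the one-hop neighbours of `node`."""
    values = [feature[n] for n in neighbours.get(node, ())]
    if not values:
        return None  # isolated node: nothing to aggregate
    return {"max": max(values), "min": min(values), "std": pstdev(values)}


# Synthetic toy graph: t1 -> t2 -> t3 and t2 -> t4.
edges = [("t1", "t2"), ("t2", "t3"), ("t2", "t4")]
fee = {"t1": 1.0, "t2": 2.0, "t3": 3.0, "t4": 5.0}  # made-up fee values

neighbours = build_neighbours(edges)
summary = aggregate("t2", neighbours, fee)
print(summary)
```

Here the neighbours of `t2` are `t1`, `t3`, and `t4`, so the summary describes their fees rather than the fee of `t2` itself.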
Actual Data Document Format
The data set contains 3 .csv files:
- elliptic_txs_classes
Label information: 1 for illicit, 2 for licit, and unknown for unlabelled transactions.
- elliptic_txs_edgelist
Node connection (edge) information.
- elliptic_txs_features
Node (transaction) feature information (the values have been encrypted).
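The three files can be parsed with the standard library alone. The sketch below makes several assumptions that should be checked against the actual files: the column names `txId`, `class`, `txId1`, and `txId2`, and a features file without a header row whose first column is the transaction id and whose second column is the time step. The reader functions are hypothetical helpers, and the in-memory samples are synthetic stand-ins for the real files.

```python
import csv
import io


def read_classes(fh):
    """Read txId -> label, mapping '1' to illicit and '2' to licit."""
    names = {"1": "illicit", "2": "licit"}
    return {row["txId"]: names.get(row["class"], "unknown")
            for row in csv.DictReader(fh)}


def read_edges(fh):
    """Read directed edges as (source txId, destination txId) pairs."""
    return [(row["txId1"], row["txId2"]) for row in csv.DictReader(fh)]


def read_features(fh):
    """Read headerless rows: txId, time step, then the remaining features."""
    features = {}
    for row in csv.reader(fh):
        time_step = int(float(row[1]))  # tolerate "1" as well as "1.0"
        features[row[0]] = (time_step, [float(v) for v in row[2:]])
    return features


# Synthetic samples standing in for the three files (assumed layout).
classes_csv = io.StringIO("txId,class\n100,1\n200,2\n300,unknown\n")
edges_csv = io.StringIO("txId1,txId2\n100,200\n200,300\n")
features_csv = io.StringIO("100,1,0.5,-0.2\n200,1,1.5,0.0\n300,2,-1.0,0.3\n")

classes = read_classes(classes_csv)
edges = read_edges(edges_csv)
features = read_features(features_csv)
print(classes)
print(edges)
print(features["300"])
```

To read the real files, pass an opened file object, for example `open(path, newline="")`, in place of the `io.StringIO` samples.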