Introduction

Building and testing a real-time fraud detection application requires a continuous stream of realistic data. But generating that data can be a challenge. That's why we recently created the Datagen CLI, a simple tool that helps you create believable fake data using the FakerJS API.

In this blog post, we'll explore how to use the Datagen CLI to simulate a streaming data use case such as a fraud detection app. We'll walk through installing and configuring the tool, creating a schema for the data, and sending the data to a Kafka topic. By the end of this tutorial, you'll be able to generate your own realistic streaming data for testing and development purposes.

Prerequisites

Basic knowledge of Kafka and streaming data
Node.js and npm installed on your machine
A Kafka cluster set up and running

Installation and Setup

First, install the Datagen CLI using npm:

npm install -g @materializeinc/datagen

Create a .env file in your working directory with the necessary Kafka and Schema Registry environment variables. Replace the placeholder values with your actual settings:

# Kafka Brokers
KAFKA_BROKERS=

# For Kafka SASL Authentication:
SASL_USERNAME=
SASL_PASSWORD=
SASL_MECHANISM=
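
Before running the tool, it can help to confirm that none of these values were left blank. The following is a minimal sketch of such a check in Python; it is not part of the Datagen CLI. The helper names parse_env and missing_keys are hypothetical, and treating all four variables as required is an assumption (drop the SASL entries if your cluster does not use SASL authentication).

```python
# Hypothetical pre-flight check for the .env file; not part of the Datagen CLI.

# Variable names taken from the .env template above. Treating every one of
# them as required is an assumption made for this sketch.
REQUIRED_KEYS = ("KAFKA_BROKERS", "SASL_USERNAME", "SASL_PASSWORD", "SASL_MECHANISM")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values):
    """Return the required keys that are absent or left empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

To use it, read the file contents (for example with pathlib.Path(".env").read_text()) and pass them to parse_env; an empty list from missing_keys means every variable has a value.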

Creating a Schema for Fraud Detection Data

To generate realistic data for a fraud detection app, we need to define a schema that includes relevant fields like transaction ID, user ID, timestamp, and transaction amount. Let's create a JSON schema called transactions.json:

[
  {
    "_meta": {
      "topic": "transactions"
    },
    "transaction_id": "faker.datatype.uuid()",
    "user_id": "faker.datatype.number({min: 1, max: 10000})",
    "timestamp": "faker.date.between('2023-01-01', '2023-12-31')",
    "amount": "faker.finance.amount(0, 10000, 2)",
    "is_fraud": "faker.datatype.boolean()"
  }
]

This schema generates a stream of transaction data with random transaction IDs, user IDs, timestamps, and amounts. We've also added a field called is_fraud that randomly marks transactions as fraudulent.

Generating and Sending Data to Kafka

Now that we have our schema, we can use the Datagen CLI to generate data and send it to a Kafka topic. Use the following command to generate an infinite stream of transactions in JSON format:

datagen \
  -s transactions.json \
  -f json \
  -n -1 \
  -dr

The -n flag specifies the number of messages to generate. We've set it to -1 to generate an infinite stream of data. The -dr flag enables dry run mode, which prints the data to the console instead of sending it to Kafka. This is useful for testing and debugging. Once the output looks right, remove the -dr flag to produce the records to the Kafka cluster configured in your .env file.

Example output:

✔  Dry run: Skipping record production...  
  Topic: transactions 
  Record key: null 
  Payload: {"transaction_id":"b86d1d57-a650-4680-843d-06179f1c4c2e","user_id":5127,"timestamp":"2023-09-02T03:26:28.194Z","amount":"6904.40","is_fraud":false}


✔  Dry run: Skipping record production...  
  Topic: transactions 
  Record key: null 
  Payload: {"transaction_id":"719fe62a-322c-4b58-89f9-e380e2f3552d","user_id":2757,"timestamp":"2023-09-30T06:40:37.378Z","amount":"3375.15","is_fraud":true}

Press Ctrl+C to stop producing data.
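
If you want to inspect dry-run records programmatically, the Payload lines can be pulled out of the console text. The sketch below assumes the output has already been captured as a string and that each record appears on a line starting with "Payload:", as in the example above. The function name parse_dry_run is hypothetical, and converting amount and timestamp to richer types is a choice made here, not something the tool does.

```python
import json
from datetime import datetime
from decimal import Decimal

PAYLOAD_PREFIX = "Payload:"


def parse_dry_run(output):
    """Extract the JSON payloads from captured dry-run console output."""
    records = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith(PAYLOAD_PREFIX):
            continue
        record = json.loads(line[len(PAYLOAD_PREFIX):])
        # The amount arrives as a string such as "6904.40"; Decimal keeps it exact.
        record["amount"] = Decimal(record["amount"])
        # Swap the trailing "Z" for an explicit offset so that
        # datetime.fromisoformat also accepts it on Python versions before 3.11.
        record["timestamp"] = datetime.fromisoformat(
            record["timestamp"].replace("Z", "+00:00")
        )
        records.append(record)
    return records


# One record copied from the example output above.
sample = """
  Topic: transactions
  Record key: null
  Payload: {"transaction_id":"b86d1d57-a650-4680-843d-06179f1c4c2e","user_id":5127,"timestamp":"2023-09-02T03:26:28.194Z","amount":"6904.40","is_fraud":false}
"""

records = parse_dry_run(sample)
print(records[0]["user_id"], records[0]["amount"])  # prints: 5127 6904.40
```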

Enriching the Data

To enrich the datagen schema example, we can add more fields related to geolocation and other attributes that can be useful for fraud detection. Update the JSON input schema as follows:

[
  {
    "_meta": {
      "topic": "transactions",
      "key": "id"
    },
    "id": "faker.datatype.uuid()",
    "user_id": "faker.datatype.number({min: 1, max: 1000})",
    "amount": "faker.finance.amount(1, 5000, 2)",
    "currency": "faker.finance.currencyCode()",
    "timestamp": "faker.date.past(1, '2023-01-01').getTime()",
    "is_fraud": "faker.datatype.boolean({likelihood: 5})",
    "ip_address": "faker.internet.ip()",
    "location": {
      "latitude": "faker.address.latitude()",
      "longitude": "faker.address.longitude()"
    },
    "device": {
      "id": "faker.datatype.uuid()",
      "type": "faker.helpers.arrayElement(['mobile', 'tablet', 'desktop'])",
      "os": "faker.helpers.arrayElement(['ios', 'android', 'windows', 'macos', 'linux', 'other'])"
    },
    "merchant_id": "faker.datatype.number({min: 1, max: 500})"
  }
]

In this enriched schema, we've added:

currency: An ISO 4217 currency code for the transaction.
ip_address: An IP address related to the transaction.
location: An object containing latitude and longitude.
device: An object containing device information such as ID, type, and operating system.
merchant_id: An ID representing the merchant involved in the transaction (a random number between 1 and 500, so values repeat across transactions).
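
To illustrate why the location fields matter for fraud detection, here is a minimal sketch of an "impossible travel" feature: the speed a user would need to get from one transaction's location to the next. The function names are hypothetical, and the sketch assumes timestamp is in epoch milliseconds, as produced by getTime() in the schema. Because the generated coordinates are random, nearly every pair of transactions will look implausible; the sketch is useful for exercising the code path, not for judging the data.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; float() accepts strings or numbers."""
    lat1, lon1, lat2, lon2 = (radians(float(v)) for v in (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def implied_speed_kmh(prev, curr):
    """Speed needed to travel between two transactions of the same user."""
    distance = haversine_km(
        prev["location"]["latitude"], prev["location"]["longitude"],
        curr["location"]["latitude"], curr["location"]["longitude"],
    )
    # Timestamps are assumed to be epoch milliseconds.
    hours = (curr["timestamp"] - prev["timestamp"]) / 3_600_000
    return float("inf") if hours <= 0 else distance / hours
```

A consumer could compare the result against a threshold of its own choosing, for example the cruising speed of a commercial aircraft, and treat anything faster as a signal worth reviewing.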

Relationship between Transactions and Users

The Datagen CLI can also generate data for related entities.

For example, we can extend the schema to include a users topic that contains user information. We can then use the user_id field in the transactions topic to join the two topics together:

[
  {
    "_meta": {
      "topic": "users",
      "key": "id",
      "relationships": [
        {
          "topic": "transactions",
          "parent_field": "id",
          "child_field": "user_id",
          "records_per": 10
        }
      ]
    },
    "id": "faker.datatype.number({min: 1, max: 1000})",
    "name": "faker.name.fullName()",
    "email": "faker.internet.email()",
    "registered_at": "faker.date.past(5, '2023-01-01').getTime()"
  },
  {
    "_meta": {
      "topic": "transactions",
      "key": "id"
    },
    "id": "faker.datatype.uuid()",
    "user_id": "faker.datatype.number(100)",
    "amount": "faker.finance.amount(1, 5000, 2)",
    "currency": "faker.finance.currencyCode()",
    "timestamp": "faker.date.between('relationship.registered_at', new Date()).getTime()",
    "is_fraud": "faker.datatype.boolean({likelihood: 5})",
    "ip_address": "faker.internet.ip()",
    "location": {
      "latitude": "faker.address.latitude()",
      "longitude": "faker.address.longitude()"
    },
    "device": {
      "id": "faker.datatype.uuid()",
      "type": "faker.helpers.arrayElement(['mobile', 'tablet', 'desktop'])",
      "os": "faker.helpers.arrayElement(['ios', 'android', 'windows', 'macos', 'linux', 'other'])"
    },
    "merchant_id": "faker.datatype.number({min: 1, max: 500})"
  }
]

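The data will be produced to the users and transactions topics. For each user record, the relationship tells the Datagen CLI to generate 10 transactions (records_per) whose user_id is set to that user's id, so the two topics can be joined on this field. Note that transaction records do not carry a registered_at field of their own, as the example output below shows; the schema only references the user's registered_at value inside the timestamp expression.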

An example of the data produced:

...
  Topic: users 
  Record key: 602 
  Payload: {"id":602,"name":"Mr. Jennie Prohaska","email":"[email protected]","registered_at":1591058898886}


  Topic: transactions 
  Record key: 417f6a6b-d7c5-47a0-a013-93f2aee94941 
  Payload: {"user_id":602,"id":"417f6a6b-d7c5-47a0-a013-93f2aee94941","amount":"1760.29","currency":"MZN","timestamp":1680516946423,"is_fraud":true,"ip_address":"240.254.28.18","location":{"latitude":"60.3920","longitude":"11.9718"},"device":{"id":"1e57d9d2-de48-4bf2-9131-e70ceb5f4fee","type":"mobile","os":"macos"},"merchant_id":20}
...
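
As a rough illustration of the join that user_id makes possible, the sketch below groups records from both topics by user in memory. It assumes the records are already available as (topic, payload) pairs; the function name join_by_user is hypothetical, and the payloads are trimmed from the example output above to the fields used here. A real consumer would do this against the Kafka topics or in a streaming database rather than in a Python dictionary.

```python
import json
from collections import defaultdict


def join_by_user(records):
    """Group (topic, payload) pairs into {user_id: {"user": ..., "transactions": [...]}}."""
    joined = defaultdict(lambda: {"user": None, "transactions": []})
    for topic, payload in records:
        record = json.loads(payload)
        if topic == "users":
            joined[record["id"]]["user"] = record
        elif topic == "transactions":
            joined[record["user_id"]]["transactions"].append(record)
    return dict(joined)


# Payloads trimmed from the example output above to the fields used here.
records = [
    ("users", '{"id":602,"registered_at":1591058898886}'),
    ("transactions", '{"user_id":602,"id":"417f6a6b-d7c5-47a0-a013-93f2aee94941","amount":"1760.29","timestamp":1680516946423}'),
]

joined = join_by_user(records)
entry = joined[602]
# Check that every transaction happened after the user registered.
after_signup = all(
    t["timestamp"] >= entry["user"]["registered_at"] for t in entry["transactions"]
)
print(len(entry["transactions"]), after_signup)  # prints: 1 True
```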

Testing Your Fraud Detection App

With realistic streaming data available, you can now test your fraud detection app using the data generated by the Datagen CLI. Consume the data from the transactions Kafka topic and implement your fraud detection logic, which may involve analyzing transaction patterns, comparing with historical data, or applying machine learning models.
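
As one example of "comparing with historical data", here is a minimal sketch of a rule that flags a transaction when its amount is far above that user's running average. It assumes transactions have already been consumed and decoded into dictionaries; connecting to Kafka is left out. The class name SpendingBaseline and both thresholds are hypothetical and chosen only for illustration. Since the generated amounts are uniformly random, the rule will fire fairly often on this data; it is a starting point for testing a pipeline, not a tuned detector.

```python
from collections import defaultdict
from decimal import Decimal

MIN_HISTORY = 3               # hypothetical: transactions needed before judging a user
SPIKE_FACTOR = Decimal("3")   # hypothetical: flag amounts above 3x the running average


class SpendingBaseline:
    """Tracks a per-user running average and flags unusually large amounts."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(Decimal)  # Decimal() starts each total at 0

    def is_suspicious(self, txn):
        user = txn["user_id"]
        amount = Decimal(txn["amount"])  # amounts arrive as strings like "1760.29"
        count, total = self._count[user], self._total[user]
        # Compare against the average of earlier transactions only.
        suspicious = count >= MIN_HISTORY and amount > SPIKE_FACTOR * (total / count)
        self._count[user] += 1
        self._total[user] += amount
        return suspicious


baseline = SpendingBaseline()
stream = [{"user_id": 602, "amount": a} for a in ("100.00", "120.00", "80.00", "1760.29")]
flags = [baseline.is_suspicious(txn) for txn in stream]
print(flags)  # prints: [False, False, False, True]
```

The rule uses arrival order rather than the timestamp field, because the generated timestamps are random and not guaranteed to be in order.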

As a next step, you can use Materialize to create a materialized view of the transactions data. This will allow you to query the data in real time and build a fraud detection dashboard.

Conclusion

The Datagen CLI is a simple tool for generating realistic streaming data for testing and development purposes. In this tutorial, we showed how to use the Datagen CLI to simulate a fraud detection use case. With this knowledge, you can create your own schemas and generate data for various streaming data applications.
