# Multi-Modal Lie Detection with GSPO-enhanced ReAct Reasoning

This notebook demonstrates a multi-modal deception detection system that integrates multiple data sources (video, audio, text, and more) with an advanced reasoning framework. The system uses **GSPO-enhanced ReAct** reasoning, combining self-play reinforcement learning and a reasoning-action loop for improved decision-making. It emphasizes transparency, explainability, and ethical considerations in AI-driven lie detection.

## 1. Installation & Setup
In this section, we install all required libraries and set up the environment.
We'll use `pip` to install necessary packages and mount Google Drive to access datasets like the **Strawberry-Phi** deception dataset.

#### Dependencies:
- `torch` for deep learning model implementation (CNNs, LSTMs, transformers).
- `transformers` for the text model and NLP tasks.
- `opencv-python` for video processing (facial cues from images).
- `librosa` for audio signal processing (extracting voice features).
- `shap` and `lime` for explainable AI (interpret model decisions).
- `scikit-learn` for evaluation metrics and possibly simple model components.
- `ipywidgets` for interactive UI elements (uploading files, toggling options).

We'll also mount Google Drive to load the **Strawberry-Phi** dataset for fine-tuning later.

In [None]:
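# pandas and pyarrow are needed later to read the Strawberry-Phi JSONL/Parquet files
!pip install torch transformers opencv-python librosa shap lime scikit-learn ipywidgets pandas pyarrow

# Mount Google Drive (only when running in Colab)
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    print("Not running in Colab; skipping Google Drive mount.")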

## 2. Project Overview
**Multi-Modal Deception Detection** involves analyzing multiple data streams (like facial expressions, voice, text, and physiological signals) to determine if a subject is being deceptive. By combining modalities, we can improve accuracy since deceit often manifests through subtle cues in different channels.

**ReAct Reasoning Framework**: The ReAct (Reason + Act) framework interleaves logical reasoning with actionable operations. Instead of making predictions blindly, the system generates a reasoning trace (chain-of-thought) and uses that to inform its actions. This combined approach has been shown to improve decision-making and interpretability. In practice, the agent will reason about the inputs (e.g., "The subject is fidgeting and voice pitch is high, which often indicates stress") and take actions (e.g., flag as potential lie) in a loop.

We also integrate **GSPO (Generative Self-Play Optimization)** with ReAct. GSPO uses self-play reinforcement learning: the model can simulate conversations or scenarios with itself to improve its lie-detection policy over time. This optional module lets the system learn from hypothetical scenarios, gradually refining its decision boundaries.

#### Ethical AI Considerations:
- **Transparency**: Our system provides reasoning traces and uses explainability tools (LIME, SHAP) so users can understand *why* a decision was made, addressing the "lack of explainability" concern in AI lie detection.
- **Bias Mitigation**: We must ensure the models do not overfit to demographic features (e.g., avoiding predictions based on gender or ethnicity). Training on diverse data and testing for bias helps create fair outcomes.
- **Privacy**: All processing is done locally (no data is sent to external servers). We avoid storing sensitive personal data and only use the inputs for real-time analysis.
- **Responsible Use**: Lie detection AI can be misused. This notebook is for research and educational purposes. Any real-world deployment should comply with legal standards and consider the potential for false positives/negatives.


## 3. Model Implementations
We implement separate models for each modality. Each model outputs a confidence score or decision about deception for its modality. Later, we'll fuse these results.

The models will be simple prototypes (not fully trained) to illustrate the architecture:
- **Vision Model**: A CNN for facial expression and micro-expression analysis from video frames or images.
- **Audio Model**: An LSTM (or GRU) for vocal analysis, capturing stress or pitch anomalies in speech.
- **Text Model**: A Transformer (e.g., BERT) for analyzing textual statements for linguistic cues of deception.
- **Physiological Model (Optional)**: Placeholder for processing signals like heart rate or skin conductance.


In [None]:
# Vision Model: CNN-based facial analysis
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionCNN(nn.Module):
    def __init__(self):
        super(VisionCNN, self).__init__()
        # Simple CNN: 2 conv layers + FC
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # Assuming input images are 64x64, after 2 pools -> 16x16
        self.fc1 = nn.Linear(32 * 16 * 16, 2)  # output: [lie_score, truth_score]

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = self.fc1(x)
        return x

# Instantiate the vision model (untrained for now)
vision_model = VisionCNN()
print(vision_model)
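
The `32 * 16 * 16` input size of `fc1` follows from the layer shapes: each 3x3 convolution with stride 1 and padding 1 preserves the spatial size, and each 2x2 max-pool halves it. The sketch below walks through that arithmetic for the 64x64 input assumed in the code comment. The helper `conv_out` is hypothetical (it is not part of the notebook or of PyTorch) and ignores dilation.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor division, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 64                                   # assumed input height/width
size = conv_out(size, kernel=3, padding=1)  # conv1 keeps 64
size = conv_out(size, kernel=2, stride=2)   # first pool -> 32
size = conv_out(size, kernel=3, padding=1)  # conv2 keeps 32
size = conv_out(size, kernel=2, stride=2)   # second pool -> 16
flat_features = 32 * size * size            # 32 channels after conv2
print(size, flat_features)  # 16 8192
```

If the input resolution changes, `fc1` would need the recomputed `flat_features` value, otherwise the `view` and linear layer shapes no longer match.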

In [None]:
# Audio Model: LSTM-based vocal stress analysis
import numpy as np
import torch.nn.utils.rnn as rnn_utils

class AudioLSTM(nn.Module):
    def __init__(self, input_size=13, hidden_size=32, num_layers=1):
        super(AudioLSTM, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 2)  # 2 classes: lie or truth

    def forward(self, x, lengths=None):
        # x: batch of padded sequences (batch, seq_len, features)
        # lengths: optional true sequence lengths (list or 1D tensor)
        if lengths is not None:
            # pack_padded_sequence expects lengths as a 1D int64 CPU tensor
            lengths = torch.as_tensor(lengths, dtype=torch.int64).cpu()
            x = rnn_utils.pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
        lstm_out, _ = self.lstm(x)
        if lengths is not None:
            lstm_out, _ = rnn_utils.pad_packed_sequence(lstm_out, batch_first=True)
            # Gather the output at each sequence's last valid time step
            idx = (lengths - 1).to(lstm_out.device).view(-1, 1, 1).expand(-1, 1, lstm_out.size(2))
            last_outputs = lstm_out.gather(1, idx).squeeze(1)
        else:
            # No lengths given: assume all sequences use the full length
            last_outputs = lstm_out[:, -1, :]
        out = self.fc(last_outputs)
        return out

# Instantiate the audio model (untrained placeholder)
audio_model = AudioLSTM()
print(audio_model)

In [None]:
# Text Model: Transformer-based deception analysis
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# We use a pre-trained BERT encoder for binary classification (index 0 = lie, index 1 = truth).
# Note: the classification head is newly initialized, so outputs are not meaningful until fine-tuned.
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
text_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
text_model.eval()  # disable dropout for inference

# Function to get prediction from text model
def text_model_predict(text):
    # text: a string or a list of strings
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():  # no gradients needed for inference
        logits = text_model(**inputs).logits
    probs = F.softmax(logits, dim=1)
    # probs has shape (batch_size, 2)
    return probs.cpu().numpy()

# Example usage (with dummy text)
example_text = "I absolutely did not take the money."  # a deceptive statement example
probs = text_model_predict([example_text])
print(f"Predicted probabilities (lie/truth) for example text: {probs}")

In [None]:
# Physiological Model (Optional): Placeholder for biometric data analysis
# Example of physiological signals: heart rate, skin conductance, blood pressure, etc.
# We'll create a simple placeholder class that could be extended for real sensor input.

class PhysiologicalModel:
    def __init__(self):
        # No actual model, just a placeholder
        self.name = 'PhysioModel'
    def predict(self, data):
        # data could be a dictionary of sensor readings
        # Here we return a dummy neutral prediction
        return np.array([0.5, 0.5])  # equal probability of lie/truth

physio_model = PhysiologicalModel()
print("Physiological model ready (placeholder):", physio_model.name)

## 4. GSPO Integration
Here we integrate **Generative Self-Play Optimization (GSPO)** to enhance the model's decision-making through reinforcement learning. In GSPO, the system can create simulated scenarios and learn from them (like an agent playing against itself to improve skill).

- **Self-Play Reinforcement Learning**: The model (as an agent) plays both roles in a deception scenario (questioner and responder). For example, it might simulate asking a question and then answering either truthfully or deceptively. The agent then tries to predict deception on these simulated answers, receiving a reward for correct detection. Over many iterations, this self-play helps the agent refine its policy for detecting lies.
- This approach is inspired by how game-playing AIs train via self-play (e.g., AlphaGo Zero using self-play to surpass human performance). It allows the model to explore a wide range of scenarios beyond the initial dataset.

- **Optional Learning Toggle**: We implement GSPO in a modular way. Users can turn this self-play learning on or off (for example, to compare performance with/without reinforcement learning). By default, the system won't do self-play unless explicitly enabled, to avoid long training times in this demo.

- **Fine-Tuning with Strawberry-Phi Dataset**: We incorporate a fine-tuning phase using the `strawberry-phi` dataset, which is assumed to contain recorded deception instances (possibly multi-modal). Fine-tuning on real or richly simulated data like Strawberry-Phi ensures the models align better with actual deception cues.


In [None]:
# GSPO Self-Play Reinforcement Learning (simplified simulation)
import random

class SelfPlayAgent:
    def __init__(self, detector_model):
        self.model = detector_model  # could be a combined model or policy
        self.learning = False
        self.training_history = []

    def enable_learning(self, flag=True):
        self.learning = flag

    def simulate_scenario(self):
        """Simulate a deception scenario. Returns (input_data, is_deceptive)."""
        # For simplicity, random simulation: generate a random outcome
        # In practice, this could use a generative model to create realistic scenarios
        is_deceptive = random.choice([0, 1])  # 0 = truth, 1 = lie
        simulated_data = {
            'video': None,  # no actual video in this simulation
            'audio': None,
            'text': "simulated statement",
            'physio': None
        }
        return simulated_data, is_deceptive

    def train_self_play(self, episodes=5):
        if not self.learning:
            print("Self-play learning is disabled. Skipping training.")
            return
        for ep in range(episodes):
            data, truth_label = self.simulate_scenario()
            # Here we would run the detection model on the simulated data
            # and get a prediction (e.g., 1 for lie, 0 for truth)
            # We'll simulate prediction randomly for this demo:
            pred_label = random.choice([0, 1])
            reward = 1 if pred_label == truth_label else -1
            # In a real scenario, use this reward to update model (e.g., policy gradient)
            self.training_history.append(reward)
            print(f"Episode {ep+1}: truth={truth_label}, pred={pred_label}, reward={reward}")

# Initialize a self-play agent (using text model as base for simplicity)
agent = SelfPlayAgent(text_model)
agent.enable_learning(flag=False)  # Disabled by default
agent.train_self_play(episodes=3)
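
The demo above records rewards but never uses them. As a rough illustration of the "policy gradient" update mentioned in the comment, the sketch below trains a tiny logistic policy with REINFORCE on a toy simulator. Everything here is an assumption made for illustration: the single "stress" feature, the Gaussian simulator, the learning rate, and the function names are hypothetical and unrelated to the notebook's real models.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def simulate_scenario():
    """Hypothetical simulator: returns (stress_feature, is_deceptive)."""
    is_deceptive = rng.choice([0, 1])        # 0 = truth, 1 = lie
    mean = 1.0 if is_deceptive else -1.0     # assumption: lies raise the feature
    return rng.gauss(mean, 0.5), is_deceptive

def lie_probability(w, b, x):
    """Logistic policy: probability of choosing the action 'lie'."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    x, label = simulate_scenario()
    p = lie_probability(w, b, x)
    action = 1 if rng.random() < p else 0    # sample an action from the policy
    reward = 1 if action == label else -1    # same reward scheme as the demo above
    # REINFORCE: gradient of log pi(action | x) w.r.t. the logit is (action - p)
    w += lr * reward * (action - p) * x
    b += lr * reward * (action - p)

# Greedy evaluation on fresh simulated scenarios
test_cases = [simulate_scenario() for _ in range(500)]
accuracy = sum((lie_probability(w, b, x) >= 0.5) == bool(y)
               for x, y in test_cases) / len(test_cases)
print(f"w={w:.2f}, b={b:.2f}, greedy accuracy={accuracy:.2f}")
```

In a real system the scalar feature would be replaced by the detector's inputs and the two parameters by the model's weights; the update rule keeps the same form.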

In [None]:
# Fine-tuning with Strawberry-Phi dataset (placeholder)
import pandas as pd
phi_data = None
try:
    # Attempt to load JSONL
    phi_data = pd.read_json('/content/drive/MyDrive/strawberry-phi.jsonl', lines=True)
except Exception:
    try:
        # Fall back to Parquet
        phi_data = pd.read_parquet('/content/drive/MyDrive/strawberry-phi.parquet')
    except Exception as e:
        print(f"Could not load the Strawberry-Phi dataset ({e}). Please upload it to Google Drive.")

if phi_data is not None:
    print("Strawberry-Phi data loaded. Rows:", len(phi_data))
    # TODO: process the dataset, e.g., extract features, train models
else:
    print("Proceeding without Strawberry-Phi fine-tuning.")

## 5. Fusion Model
After obtaining results from each modality-specific model, we need to combine them into a final decision. This is handled by a **Fusion Model** or strategy.

Common fusion approaches:
- **Majority Voting**: Each modality votes truth or lie, and the majority wins. This is simple and robust to one model's errors.
- **Weighted Ensemble**: Assign weights to each modality based on confidence or accuracy, then compute a weighted sum of lie probabilities.
- **Learned Fusion (Meta-Model)**: Train a separate classifier that takes each model's output (or confidence) as input features and outputs the final decision. This could be a small neural network or logistic regression trained on a validation set.

For our system, we'll implement a simple weighted approach. We assume each model outputs a probability of deception (lie). We'll average these probabilities (or give higher weight to modalities we trust more) and then apply a threshold.


In [None]:
# Fusion function for combining modality outputs
def fuse_outputs(results, weights=None):
    """
    results: list with one entry per modality. Each entry is a dict with a 'lie' or
             'lie_score' key, a [lie, truth] sequence, or a bare lie probability.
    weights: optional list of weights for each modality.
    returns: final decision ('lie' or 'truth') and combined score.
    """
    if weights is None:
        weights = [1] * len(results)
    total_weight = sum(weights)
    if not results or total_weight == 0:
        raise ValueError("fuse_outputs needs at least one result and a non-zero total weight")
    # weighted sum of lie probabilities
    combined_score = 0.0
    for res, w in zip(results, weights):
        if isinstance(res, dict):
            # Check keys explicitly: chaining with `or` would skip a valid 0.0 probability
            lie_prob = res['lie'] if 'lie' in res else res['lie_score']
        elif isinstance(res, (list, tuple, np.ndarray)):
            lie_prob = res[0]  # [lie, truth] ordering, as in the model outputs above
        else:
            lie_prob = res
        combined_score += w * float(lie_prob)
    combined_score /= total_weight
    decision = 'lie' if combined_score >= 0.5 else 'truth'
    return decision, combined_score

# Example: fuse dummy outputs from the models
vision_out = {'lie': 0.7, 'truth': 0.3}
audio_out = {'lie': 0.4, 'truth': 0.6}
text_out = {'lie': 0.9, 'truth': 0.1}
physio_out = {'lie': 0.5, 'truth': 0.5}
final_decision, score = fuse_outputs([vision_out, audio_out, text_out, physio_out])
print(f"Final decision: {final_decision} (lie probability = {score:.2f})")
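
For comparison with the weighted average, majority voting from the list of fusion approaches can be sketched as follows. The function name `majority_vote`, the 0.5 per-modality threshold, and the tie-break toward "truth" are assumptions made for this illustration; the tie-break is chosen conservatively because of the cost of a false accusation.

```python
from collections import Counter

def majority_vote(lie_probs, threshold=0.5, tie_break='truth'):
    """Each modality votes 'lie' if its lie probability is >= threshold."""
    votes = Counter('lie' if p >= threshold else 'truth' for p in lie_probs)
    if votes['lie'] == votes['truth']:
        return tie_break  # assumption: ties resolve to the less harmful label
    return 'lie' if votes['lie'] > votes['truth'] else 'truth'

# Same dummy lie probabilities as the fusion example (vision, audio, text, physio)
print(majority_vote([0.7, 0.4, 0.9, 0.5]))  # lie (3 of 4 modalities vote 'lie')
```

Unlike the weighted average, voting discards how confident each modality is, so a single very confident model cannot outweigh two weakly opposed ones.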

## 6. ReAct Agent
The ReAct agent is responsible for the reasoning-action loop. It should mimic how an expert would analyze evidence step-by-step, and justify each conclusion with reasoning before making the next move (action). Our ReAct agent will use the outputs from the above models and reason about them interactively.

Key aspects of our ReAct implementation:
- The agent will gather observations from each modality (e.g., *"Vision model sees nervous facial expression."*).
- It will reason about these observations (*"Nervous face + high voice pitch = likely stress from lying"*).
- Based on reasoning, it may decide an action, such as concluding "lie" or maybe asking for more input if uncertain.
- The loop continues if more reasoning or data is needed. For simplicity, our agent will do one pass of reasoning and then decide.

The agent's decision-making process (as pseudocode):
1. **Observe**: Get inputs from modalities.
2. **Reason**: Form a narrative like "The text content contradicts known facts and the speaker's voice is shaky.".
3. **Act**: Decide on an output (lie or truth) or ask for more data if needed.
4. **Explain**: Provide the reasoning trace to the user for transparency.
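
The four steps can be sketched as a single function that also covers the "ask for more data" branch. This is a minimal illustration, not the notebook's agent: `react_step`, the dict input format, and the 0.4 to 0.6 uncertainty band are all assumptions chosen for the example.

```python
def react_step(observations, uncertain_band=(0.4, 0.6)):
    """One Observe -> Reason -> Act -> Explain pass.

    observations: dict mapping modality name -> lie probability (assumed format).
    Returns (action, trace); action is 'lie', 'truth', or 'request_more_data'.
    """
    if not observations:
        return 'request_more_data', ["Thought: no observations available."]
    trace = []
    # 1. Observe: record what each modality reports
    for name, prob in observations.items():
        trace.append(f"Observation: {name} lie probability = {prob:.2f}")
    # 2. Reason: summarize the evidence (here, a plain mean)
    mean_prob = sum(observations.values()) / len(observations)
    trace.append(f"Thought: mean lie probability across modalities is {mean_prob:.2f}")
    # 3. Act: decide, or ask for more data when the evidence is inconclusive
    low, high = uncertain_band
    if low < mean_prob < high:
        action = 'request_more_data'
    else:
        action = 'lie' if mean_prob >= high else 'truth'
    trace.append(f"Action: {action}")
    # 4. Explain: the trace is what would be shown to the user
    return action, trace

action, trace = react_step({'vision': 0.8, 'audio': 0.7, 'text': 0.9})
print("\n".join(trace))
```

A full loop would call `react_step` again after new data arrives, stopping once the action is no longer `'request_more_data'`.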


In [None]:
# ReAct Agent Implementation (simplified reasoning loop)
def react_agent_decision(video=None, audio=None, text=None, physio=None):
    reasoning_trace = []
    modality_results = []
    # 1. Observe from each modality if available
    if video is not None:
        # Use vision model to get lie probability
        # (Here we simulate by random since we don't have actual video frames)
        vision_prob = random.random()
        modality_results.append({'lie': vision_prob, 'truth': 1-vision_prob})
        reasoning_trace.append(f"Vision analysis suggests lie probability {vision_prob:.2f}.")
    if audio is not None:
        audio_prob = random.random()
        modality_results.append({'lie': audio_prob, 'truth': 1-audio_prob})
        reasoning_trace.append(f"Audio analysis suggests lie probability {audio_prob:.2f}.")
    if text is not None:
        # Use text model
        probs = text_model_predict([text])  # get [ [lie_prob, truth_prob] ]
        lie_prob = float(probs[0][0])
        modality_results.append({'lie': lie_prob, 'truth': float(probs[0][1])})
        reasoning_trace.append(f"Text analysis suggests lie probability {lie_prob:.2f} for the statement.")
    if physio is not None:
        physio_prob = random.random()
        modality_results.append({'lie': physio_prob, 'truth': 1-physio_prob})
        reasoning_trace.append(f"Physiological analysis suggests lie probability {physio_prob:.2f}.")
    
    if not modality_results:
        return "No input provided", None
    # 2. Reason: (In a more complex system, we could add additional logical rules or ask follow-up questions.)
    if len(modality_results) > 1:
        reasoning_trace.append("Combining all modalities to form a conclusion.")
    else:
        reasoning_trace.append("Single modality provided, basing conclusion on that alone.")
    
    # 3. Act: fuse results to get final decision
    decision, score = fuse_outputs(modality_results)
    reasoning_trace.append(f"Final decision: {decision.upper()} (fused lie probability {score:.2f}).")
    
    return "\n".join(reasoning_trace), decision

# Example usage of ReAct agent:
reasoning, decision = react_agent_decision(video=True, audio=True, text="I am telling the truth.")
print("Reasoning Trace:\n" + reasoning)
print("Decision:", decision)

## 7. Interactive Features
To make the system interactive, we include features that allow user input and involvement:

- **File Uploads**: Users can upload video, audio, or text for analysis. We use `ipywidgets` to provide UI elements (like file upload buttons) in Colab.
- **Human-in-the-loop Validation**: After the model makes a decision, the user can review the reasoning and provide feedback or corrections. For example, if the model is wrong, the user could label the instance, which could be logged for further training.
- **Explainability Tools**: We integrate LIME and SHAP to explain model predictions. For example, LIME can highlight which words in the text most influenced the prediction, or SHAP can indicate which facial features contributed to the vision model's output.

These features help users trust and verify the system's outputs, turning the detection process into a cooperative effort between AI and human.
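
One way the human-in-the-loop feedback could be logged for later training is an append-only JSONL file. The sketch below is hypothetical: the function `log_feedback`, the file name `feedback_log.jsonl`, and the record fields are assumptions, and it stores only a summary of the inputs rather than raw video, audio, or text, in line with the privacy goals stated earlier.

```python
import json
import time
from pathlib import Path

def log_feedback(path, inputs_summary, model_decision, user_label, note=""):
    """Append one human review to a JSONL file and return the stored record."""
    record = {
        "timestamp": time.time(),
        "inputs": inputs_summary,          # e.g. which modalities were used, not raw data
        "model_decision": model_decision,  # 'lie' or 'truth'
        "user_label": user_label,          # reviewer's confirmation or correction
        "agrees": model_decision == user_label,
        "note": note,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = log_feedback("feedback_log.jsonl", {"modalities": ["text"]}, "lie", "truth",
                   note="Reviewer judged the statement truthful.")
print(rec["agrees"])  # False
```

Records where `agrees` is false are the natural candidates for a later fine-tuning or error-analysis pass.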


In [None]:
# Interactive widget for file upload
import ipywidgets as widgets
from IPython.display import display  # explicit import; also available implicitly in notebooks

# Create upload widgets for video, audio, text
video_upload = widgets.FileUpload(accept=".mp4,.mov,.avi", description="Upload Video", multiple=False)
audio_upload = widgets.FileUpload(accept=".wav,.mp3", description="Upload Audio", multiple=False)
text_input = widgets.Textarea(placeholder='Enter text to analyze', description='Text:')

# Display widgets
display(video_upload)
display(audio_upload)
display(text_input)

# Button to trigger analysis
analyze_button = widgets.Button(description="Analyze")
output_area = widgets.Output()

def first_upload(upload_widget):
    """Return the first uploaded file or None (ipywidgets 7 gives a dict, 8 gives a tuple)."""
    value = upload_widget.value
    if not value:
        return None
    files = list(value.values()) if isinstance(value, dict) else list(value)
    return files[0]

def on_analyze_clicked(b):
    with output_area:
        output_area.clear_output()
        vid_file = first_upload(video_upload)
        aud_file = first_upload(audio_upload)
        txt = text_input.value if text_input.value else None
        reasoning, decision = react_agent_decision(video=vid_file, audio=aud_file, text=txt)
        print("Reasoning:\n" + reasoning)
        print("Decision:", decision)

analyze_button.on_click(on_analyze_clicked)
display(analyze_button)
display(output_area)

In [None]:
# Explainability Example with LIME (for text model)
from lime.lime_text import LimeTextExplainer

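# Class order must match text_model_predict's columns: index 0 = lie, index 1 = truth
explainer = LimeTextExplainer(class_names=["Lie", "Truth"])
# We'll use the text model's predict function for probabilities
if 'text_model_predict' in globals():
    exp = explainer.explain_instance("I swear I didn't do it",
                                     text_model_predict,
                                     labels=(0,),      # explain the "Lie" class
                                     num_features=5,
                                     num_samples=500)  # fewer perturbations than the default 5000, to keep BERT inference fast
    # Display the explanation in notebook (as text)
    explanation = exp.as_list(label=0)
    print("Top influences on the text model's 'Lie' probability:")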
    for word, score in explanation:
        print(f"{word}: {score:.3f}")
else:
    print("Text model not available for explanation.")

## 8. Inference & Real-Time Processing
Now that we have the components in place, we can use the system for inference on new data. This can be done offline, one recorded input at a time, or in real time.

For **real-time processing**, imagine a scenario like a live interview or interrogation. The system would continuously capture video frames and audio snippets, run them through the respective models, and update its deception probability in real-time. The ReAct agent can continuously reason over the new data.

In this notebook setting, we'll simulate real-time processing by iterating through some data or using a loop with delays. In a real deployment, one could use threads or async processes to handle streaming data from a webcam and microphone.

*Note:* Real-time use requires efficient processing and possibly hardware acceleration (GPU) to keep up with live data. There's also a need to smooth predictions over time to avoid jitter (e.g., using a rolling average of recent outputs).
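
The rolling average mentioned in the note can be sketched with a fixed-size window. The class name `RollingAverage`, the window of 3, and the sample probabilities are assumptions chosen for illustration.

```python
from collections import deque

class RollingAverage:
    """Smooth a stream of lie probabilities over the last `window` values."""
    def __init__(self, window=3):
        self.values = deque(maxlen=window)  # old values drop out automatically

    def update(self, prob):
        self.values.append(prob)
        return sum(self.values) / len(self.values)

smoother = RollingAverage(window=3)
raw = [0.2, 0.9, 0.3, 0.4]  # made-up per-segment lie probabilities
smoothed = [smoother.update(p) for p in raw]
print([round(s, 2) for s in smoothed])  # [0.2, 0.55, 0.47, 0.53]
```

The single spike of 0.9 is damped to 0.55, so one noisy segment is less likely to flip the displayed decision. A larger window smooths more but reacts more slowly to genuine changes.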


In [None]:
# Simulated real-time processing
import time

# Suppose we have a list of incoming text segments (as an example of streaming data)
streaming_texts = [
    "Hello, I'm happy to talk to you.",
    "I have nothing to hide.",
    "(nervous laugh) Sure, ask me anything...",
    "I already told you everything I know."
]

print("Starting live analysis loop...\n")
for segment in streaming_texts:
    # Simulate delay as if processing streaming input
    time.sleep(1)
    reasoning, decision = react_agent_decision(text=segment)
    print(f"Input: {segment}\nDecision: {decision.upper()}\n")

## 9. Testing & Evaluation
To ensure our system works as expected, we include testing and evaluation steps:

- **Unit Tests**: Simple tests can check each component (e.g., that the vision model outputs the correct shape, or that the fusion function behaves correctly). In Python, one could use the `unittest` framework or plain `assert` statements. The cell below tests only the fusion function.
- **Performance Evaluation**: If we have labeled test data, we can measure accuracy, F1-score, AUC, etc. Here we'll simulate predictions and compute a confusion matrix and classification report using scikit-learn.
- **Fairness Assessments**: It's important to test the model for bias. If we had data tagged with demographics, we could check performance separately for each group to ensure consistency. We might also use techniques like counterfactual testing (e.g., swapping gender-specific words in text to see if prediction changes) to identify bias.
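
The two fairness checks described above can be sketched in a few lines. This is an illustration under stated assumptions: the word list `SWAPS` is deliberately tiny and ambiguous in places (for example, "her" is always mapped to "him"), the group tags and labels are made up, and `predict_lie_prob` stands for any function that maps a text to a lie probability.

```python
import re
from collections import defaultdict

# Illustrative swap list; a real audit would need a much fuller vocabulary.
SWAPS = {"he": "she", "she": "he", "him": "her", "his": "her",
         "her": "him", "man": "woman", "woman": "man"}

def gender_swap(text):
    """Return text with gendered words swapped (lowercase matches only)."""
    return re.sub(r"\b\w+\b", lambda m: SWAPS.get(m.group(0), m.group(0)), text)

def counterfactual_gap(predict_lie_prob, text):
    """Absolute change in lie probability when gendered words are swapped."""
    return abs(predict_lie_prob(text) - predict_lie_prob(gender_swap(text)))

def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each demographic group label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        hits[g] += int(t == p)
    return {g: hits[g] / totals[g] for g in totals}

print(gender_swap("he said his bag was stolen"))  # she said her bag was stolen
print(accuracy_by_group([0, 0, 1, 1], [0, 1, 1, 1], ["A", "A", "B", "B"]))  # {'A': 0.5, 'B': 1.0}
```

A large `counterfactual_gap` for a statement, or a large accuracy difference between groups, would be a signal to investigate the model and its training data further.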


In [None]:
# Simple Unit Test for Fusion Function
assert fuse_outputs([{'lie':0.8,'truth':0.2}, {'lie':0.8,'truth':0.2}])[0] == 'lie', "Fusion failed for obvious lie case"
assert fuse_outputs([{'lie':0.1,'truth':0.9}, {'lie':0.2,'truth':0.8}])[0] == 'truth', "Fusion failed for obvious truth case"
print("Fusion function unit tests passed!")

# Simulated Performance Evaluation
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
# Simulate some ground truth labels and predictions (1=lie, 0=truth)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred, average='binary'))
print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred, target_names=["Truth","Lie"]))

## 10. Ethical Considerations
Building a lie detection system raises important ethical questions. We conclude by addressing these aspects:

- **Privacy**: Deception detection can be very invasive. Video and audio analysis might reveal sensitive information. It's crucial to obtain informed consent from individuals being analyzed and ensure data is stored securely (or not at all, in our design).
- **Bias and Fairness**: As noted earlier, AI models can inadvertently learn biases. For example, certain facial expressions might be more common in some cultures but not indicate lying. We should continuously test for and mitigate bias. Techniques include balanced training data, bias correction algorithms, and human review of contentious cases.
- **False Accusations**: No lie detector is 100% accurate, and even humans are fallible. AI predictions should not be taken as absolute truth. The system should ideally express uncertainty (e.g., a confidence score) and allow for an appeal or secondary review process. The cost of wrongly accusing someone is high, so the threshold for calling something a lie should be chosen carefully.
- **Legal Compliance**: Different jurisdictions have laws about recording conversations, biometric data use, and the admissibility of lie detection in court. Any deployment of this technology must comply with privacy laws (like GDPR) and regulations regarding such tools. Also, organizations like the APA have ethical guidelines on lie detection usage.
- **Responsible Deployment**: We emphasize that this project is a prototype. In practice, one should involve ethicists, legal experts, and psychologists before using an AI lie detection system in real-world situations. It should augment human judgment, not replace it.

By considering these factors, developers and users of lie detection AI can aim to minimize harm and maximize the benefits of the technology.