User-Generated Content Safety Policy

Effective: Conflab v0.1.12+

This document defines Conflab's posture on user-generated content: what we accept, how we moderate it, and what recourse exists for users and operators.

UGC Surfaces

Conflab accepts user-generated content on these surfaces:

Surface | Resource | Constraint | Since
--- | --- | --- | ---
Reviews | Review | One per user per entry | ST0067/WP-08
Ratings | Rating | One per user per entry (1-5) | ST0067/WP-08
Published lenses and shapes | Entry + EntryContent | Via publish flow | ST0067/WP-11
Flags | Flag | One per user per target | ST0072/WP-02
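
The "one per user per entry" constraint in the table above can be illustrated with a small sketch. This is not Conflab's implementation: `ReviewStore` and `DuplicateSubmission` are hypothetical names, and a real deployment would more likely enforce the rule with a database-level unique identity than with an in-memory dictionary.

```python
class DuplicateSubmission(Exception):
    """Raised when a user already has a review for the same entry."""


class ReviewStore:
    """Hypothetical in-memory store; reviews are keyed by (user_id, entry_id)."""

    def __init__(self):
        self._reviews = {}

    def create(self, user_id, entry_id, body):
        key = (user_id, entry_id)
        # The composite key is the identity: a second review by the same
        # user on the same entry is rejected rather than appended.
        if key in self._reviews:
            raise DuplicateSubmission(
                f"user {user_id} already reviewed entry {entry_id}"
            )
        self._reviews[key] = body
        return key
```

The same composite-key idea applies to ratings (user, entry) and flags (user, target).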

Future surfaces (threaded comments, reply-to-review, review-on-theme) will follow the same moderation primitives.

Threat Model

T1: Review Spam

Scenario: A user posts promotional or nonsense reviews on popular entries to drive traffic or degrade quality.

Mitigations: One-review-per-entry identity constraint (exists). Per-user daily review creation quota of 20/day (WP-04). Community flagging with threshold-based hiding (WP-02). Admin moderation queue (WP-05).

T2: Review Bombing

Scenario: Coordinated users post many negative reviews on a single entry to suppress its rating.

Mitigations: One-review-per-entry constraint limits each attacker to one review. Flag system lets legitimate users flag abusive reviews (WP-02). Admins can clear flags on incorrectly suppressed reviews. The aggregate rating is decoupled from review text, so hiding a review does not remove its rating contribution; if needed, an admin can bulk-delete ratings from flagged accounts.

T3: Coordinated Flag Abuse

Scenario: A group flags legitimate reviews or entries to silence them.

Mitigations: Admin moderation queue (WP-05) provides human oversight. Per-user daily flag quota of 50/day (WP-04). Flag counts are visible to admins, so patterns of coordinated flagging are detectable. Unflag action allows admins to clear flags. Future: automated flag-abuse detection (deferred).

T4: Spam Publishes

Scenario: A user publishes many low-quality or spam entries to pollute the catalog.

Mitigations: Publish quarantine (WP-03) -- publishes from untrusted users are created with moderation_status: :pending and stay invisible in the public catalog until an admin approves them. Per-user daily publish quota of 10/day (WP-04). Trusted users bypass quarantine (see Trusted User Criteria below).

T5: Malicious Lens Content

Scenario: A published lens contains Lua code designed to exfiltrate data, harvest credentials, or abuse the LLM context.

Mitigations: conflabd's Lua runtime sandbox constrains execution (no filesystem access, no network access, no system calls). Publish quarantine (WP-03) adds a human checkpoint before untrusted content enters the catalog. Content scanning at publish time is deferred (see Scope below).

T6: PII Leakage

Scenario: A user inadvertently includes personal information in a review body or lens content.

Mitigations: No automated PII detection in this phase. Admin moderation queue (WP-05) provides a manual checkpoint. The privacy policy covers user responsibility for content they submit.

Lifecycle States

Flag States (Reviews and Entries)

Flags use dynamic threshold semantics: a target is considered flagged when flag_count >= threshold. Unflagging (withdrawing a flag) decreases the count and can restore visibility. There is no sticky "flagged" state -- the system recalculates based on current flag count.

Rationale: Dynamic semantics are simpler, more forgiving of false flags, and avoid the need for admin intervention on every threshold crossing. The trade-off (a coordinated group could flag and unflag repeatedly) is acceptable at current scale and mitigated by per-user flag quotas.

  • flag_count < threshold -- visible in public reads
  • flag_count >= threshold -- hidden from public reads, visible in admin moderation queue
  • Admin can override: explicitly approve (resets flags) or remove content
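
A minimal sketch of the dynamic threshold semantics described above. The threshold value of 3 and the names `FlagTarget` and `is_publicly_visible` are assumptions for illustration; the policy does not state the actual threshold.

```python
from dataclasses import dataclass, field

FLAG_THRESHOLD = 3  # assumed value; not specified by the policy


@dataclass
class FlagTarget:
    """A review or entry, tracking which users currently flag it."""

    flaggers: set = field(default_factory=set)

    @property
    def flag_count(self):
        return len(self.flaggers)

    def flag(self, user_id):
        # A set gives "one flag per user per target" for free.
        self.flaggers.add(user_id)

    def unflag(self, user_id):
        # Withdrawing a flag lowers the count; nothing sticky remains.
        self.flaggers.discard(user_id)

    def admin_approve(self):
        # Admin override: explicit approval resets all flags.
        self.flaggers.clear()


def is_publicly_visible(target, threshold=FLAG_THRESHOLD):
    """Visibility is recomputed from the current count on every read."""
    return target.flag_count < threshold
```

Because visibility is derived rather than stored, a target that drops back below the threshold becomes visible again without any admin action.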

Moderation States (Published Entries)

moderation_status is a separate attribute from visibility, because they represent orthogonal concerns:

  • visibility is the author's intent: :private, :unlisted, :public
  • moderation_status is the platform's decision: :approved, :pending, :rejected

An entry is publicly discoverable only when visibility == :public AND moderation_status == :approved.

State | Meaning | Visible in catalog?
--- | --- | ---
:approved | Passed moderation (or auto-approved for trusted users) | Yes (if visibility is :public)
:pending | Awaiting moderation review | No
:rejected | Rejected by moderator, with reason | No

Seed/curated entries are created with moderation_status: :approved. User-published entries default to :pending unless the author is trusted.
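
The discoverability rule can be written as a single predicate over the two attributes. The sketch below uses Python strings in place of the atoms; the function and constant names are hypothetical.

```python
VISIBILITIES = {"private", "unlisted", "public"}         # author's intent
MODERATION_STATUSES = {"approved", "pending", "rejected"}  # platform's decision


def is_publicly_discoverable(visibility, moderation_status):
    """True only when the entry is both public and approved."""
    if visibility not in VISIBILITIES:
        raise ValueError(f"unknown visibility: {visibility!r}")
    if moderation_status not in MODERATION_STATUSES:
        raise ValueError(f"unknown moderation status: {moderation_status!r}")
    return visibility == "public" and moderation_status == "approved"
```

Keeping the two attributes separate means a moderator's rejection never overwrites what the author asked for, and an author switching to :private never erases a moderation decision.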

Flag Reasons

Flags carry a reason enum to support triage in the moderation queue:

  • :spam -- promotional, off-platform advertising, SEO manipulation
  • :abuse -- harassment, hate speech, threats, personal attacks
  • :off_topic -- content unrelated to the entry being reviewed
  • :malicious -- suspected malware, credential harvesting, data exfiltration
  • :other -- anything not covered above (requires note text)
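
A sketch of how the reason enum might be validated, including the rule that :other requires note text. The function name and the error strings are assumptions for illustration.

```python
FLAG_REASONS = {"spam", "abuse", "off_topic", "malicious", "other"}


def validate_flag(reason, note=None):
    """Return (True, None) for a valid flag, or (False, message) otherwise."""
    if reason not in FLAG_REASONS:
        return (False, f"unknown reason: {reason}")
    # "other" carries no triage signal on its own, so a note is mandatory.
    if reason == "other" and not (note and note.strip()):
        return (False, "note is required when reason is 'other'")
    return (True, None)
```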

Trusted User Criteria

A user is considered "trusted" for the purpose of publish auto-approval when either of the following is true:

  1. User has role :admin or :superadmin
  2. User has verified_publisher: true (set by admin)

The verified_publisher flag is the primary trust signal. It is deliberately manual -- an admin must grant it. Automated trust escalation (N prior approved publishes, account age, etc.) is deferred until operational data justifies the thresholds.

Rationale: At launch scale, the number of publishers is small enough that manual verification is practical. Premature automation risks creating a trust ladder that attackers can climb.
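
The trust check and its effect on a new publish can be sketched as follows. The `User` shape and the default role name "user" are hypothetical; only the :admin and :superadmin roles and the verified_publisher flag come from this policy.

```python
from dataclasses import dataclass

TRUSTED_ROLES = {"admin", "superadmin"}


@dataclass
class User:
    role: str = "user"  # hypothetical default role name
    verified_publisher: bool = False  # granted manually by an admin


def is_trusted(user):
    """Either criterion alone is sufficient."""
    return user.role in TRUSTED_ROLES or user.verified_publisher


def initial_moderation_status(user):
    """Trusted authors skip quarantine; everyone else starts as pending."""
    return "approved" if is_trusted(user) else "pending"
```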

Self-Flag Policy

Users can flag their own reviews. This serves as a self-retraction mechanism: a user who regrets a review can flag it rather than deleting it, which preserves the audit trail. Self-flagging an entry is also permitted for the same reason, though the more common path for entry authors is to set visibility to :private.

Rate Limits

Per-user creation quotas enforced at the service layer:

Action | Limit | Window
--- | --- | ---
Review creation | 20 | 24 hours
Flag creation | 50 | 24 hours
Lens/shape publish | 10 | 24 hours

Rate limit hits return {:error, :rate_limited} and surface as a toast notification in the UI. Limits are configurable via application config. Admin and superadmin users are exempt.

Backend: Hammer (in-memory, ETS-backed). Chosen over Oban-backed counters because rate limiting is a hot path that benefits from in-memory speed, and the deployment target is single-node for the foreseeable future. Rate limit state is lost on restart, which is acceptable -- counters reset to zero, so the worst case is that a user gets a fresh quota after a deploy.
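
To make the quota behaviour concrete, here is a small in-memory limiter sketch. It does not use or imitate Hammer's API: the class, the action keys, and the sliding-window strategy are all assumptions chosen for illustration, and the return values only mirror the shape of {:error, :rate_limited}. The clock is passed in as `now` (seconds) so the behaviour is easy to reason about.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60
# Limits from the table above; the action keys are hypothetical names.
LIMITS = {"review_create": 20, "flag_create": 50, "publish": 10}
EXEMPT_ROLES = {"admin", "superadmin"}


class RateLimiter:
    """Per-user, per-action limiter; all state lives in process memory."""

    def __init__(self, limits=LIMITS, window=WINDOW_SECONDS):
        self.limits = dict(limits)
        self.window = window
        self._hits = defaultdict(deque)  # (user_id, action) -> timestamps

    def check(self, user_id, role, action, now):
        """Return ("ok", remaining) or ("error", "rate_limited")."""
        if role in EXEMPT_ROLES:
            return ("ok", None)  # exempt users are never counted
        hits = self._hits[(user_id, action)]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limits[action]:
            return ("error", "rate_limited")
        hits.append(now)
        return ("ok", self.limits[action] - len(hits))
```

Constructing a new `RateLimiter` discards all counters, which corresponds to the restart behaviour described above.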

Notification Channel

Moderation state changes (approval, rejection) notify the publisher via in-app notification only in this phase. Email notifications are deferred until the platform has a transactional email pipeline.

Rejection notifications include the moderator's reason text.

Scope Exclusions

The following are explicitly out of scope for the initial UGC hardening and will be addressed in dedicated steel threads if operational data warrants:

  • Automated content classification (LLM-based spam/abuse/PII detection)
  • Community trust levels (Discourse-style new/member/regular/leader)
  • Structured appeals flow for rejected content
  • Image/attachment moderation
  • Federated reputation or external trust signals
  • Lua content analysis at publish time (conflabd sandbox is the safety layer)
  • Automated flag-abuse detection