
Company Description
Aetosky develops secure software platforms designed for defense and dual-use institutions to harness geospatial data for critical decision-making. By providing interoperable tools tailored to mission-critical environments, Aetosky supports operations such as battlefield intelligence, infrastructure protection, disaster response, and climate security. Focused on real-time operations and strategic foresight, our technologies empower partners to act with precision, speed, and confidence in sensitive, air-gapped environments. We collaborate with government and enterprise customers to advance geospatial intelligence capabilities in modern defense and multi-domain operations.
About the role
The Data & NLP/AI Engineer owns the full data journey within Aetosky's Multi-INT Fusion Platform - from scraping raw open-source content off the internet, through statistical filtering and semantic analysis, to orchestrating LLM-powered deep intelligence processing. This is a combined Data Engineering and NLP/AI Engineering role with end-to-end ownership: you build the ingestion infrastructure, deploy the vector database, implement anomaly detection and clustering algorithms, and design the prompt orchestration layer for agentic AI analysis. AI-assisted development (GitHub Copilot, Cursor, Claude Code, or equivalent) is the standard workflow - not optional - and will be directly assessed during the hiring process.
Responsibilities
Data Infrastructure Responsibilities
Design and build automated data collection pipelines (web scrapers, API integrations) for target platforms including X, Facebook, local forums, Instagram, TikTok, and Reddit.
Deploy and manage the vector database (PostgreSQL with the pgvector extension) with indexing optimized for semantic similarity search at scale (see the indexing sketch after this list).
Implement pipeline monitoring and alerting: heartbeat checks, record-count validation, dead-letter queues, and golden-record unit tests to prevent silent data loss (a monitoring sketch follows this list).
Manage infrastructure scaling during surge events (sudden data volume spikes during geopolitical crises).
Assess and select a secure enclave provider against target clients' security requirements.
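To make the vector-database responsibility concrete, here is a minimal pgvector sketch. The documents table layout, the 384-dimension embedding column, and the HNSW index with cosine distance are illustrative assumptions; the posting itself specifies only PostgreSQL with pgvector.

```python
# Minimal pgvector deployment sketch. Table layout, 384-dim embeddings, and
# the HNSW/cosine index are assumptions for illustration, not Aetosky specifics.
import psycopg  # psycopg 3

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (
    id        BIGSERIAL PRIMARY KEY,
    source    TEXT NOT NULL,
    content   TEXT NOT NULL,
    embedding vector(384)
);

-- HNSW generally holds up better than IVFFlat on large, frequently
-- updated corpora, at the cost of slower index builds.
CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops);
"""

def nearest_neighbors(conn: psycopg.Connection, query_vec: list[float], k: int = 10):
    """Return the k documents most semantically similar to query_vec."""
    literal = "[" + ",".join(map(str, query_vec)) + "]"  # pgvector text format
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, content, embedding <=> %s::vector AS distance "
            "FROM documents ORDER BY distance LIMIT %s",
            (literal, k),
        )
        return cur.fetchall()
```

The `<=>` operator is pgvector's cosine-distance operator, which pairs with the `vector_cosine_ops` index class above.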
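The monitoring bullet above also maps to a fairly standard pattern. A sketch follows; the Prometheus metric names, batch-size threshold, and file-based dead-letter queue are assumptions for illustration.

```python
# Sketch of record-count validation, heartbeat, and dead-letter routing.
# Metric names, thresholds, and the file-based DLQ are illustrative.
import json
import time
from prometheus_client import Counter, Gauge

records_ingested = Counter(
    "pipeline_records_ingested_total", "Records accepted", ["source"])
records_dead_lettered = Counter(
    "pipeline_records_dead_lettered_total", "Records rejected", ["source"])
last_heartbeat = Gauge(
    "pipeline_last_heartbeat_timestamp", "Unix time of last good batch", ["source"])

def process_batch(source: str, raw_records: list[dict],
                  expected_min: int, dead_letter_path: str) -> None:
    """Validate a batch; route malformed records to a dead-letter file."""
    if len(raw_records) < expected_min:
        # Silent upstream failures usually surface as shrunken batches;
        # alerting rules can also fire when the heartbeat gauge goes stale.
        raise RuntimeError(
            f"{source}: got {len(raw_records)} records, expected >= {expected_min}")
    with open(dead_letter_path, "a", encoding="utf-8") as dlq:
        for rec in raw_records:
            if "id" in rec and "text" in rec:  # golden-record schema check
                records_ingested.labels(source=source).inc()
            else:
                records_dead_lettered.labels(source=source).inc()
                dlq.write(json.dumps(rec) + "\n")
    last_heartbeat.labels(source=source).set(time.time())
```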
NLP / AI Engineering Responsibilities
Implement the first-stage statistical filter using TF-IDF with configurable anomaly thresholds against 30-day rolling baselines (a filter sketch follows this list).
Build semantic clustering using lightweight vector embedding models, grouping near-duplicate content into representative cluster centroids for efficient analyst review (see the clustering sketch below).
Implement bot-detection tripwires: velocity anomaly detection (timing-based coordinated inauthentic behavior) and lexical duplication detection (copy-paste spam arrays); a tripwire sketch follows this list.
Design and manage the prompt orchestration layer for the second-stage LLM processor: intent extraction, relationship mapping, and structured output generation within a secure cloud enclave (see the orchestration sketch below).
Implement cost-cap logic with graceful degradation: dynamic threshold escalation at budget warning levels, automated pause at cap, and manual triage fallback (see the cost-governor sketch below).
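One plausible reading of the first-stage filter described above: score each day's aggregate text with TF-IDF and flag terms whose score jumps well above their trailing-window statistics. The window length, warm-up period, and z-score threshold below are illustrative parameters, not values from this posting.

```python
# Rolling TF-IDF anomaly filter sketch. Window size, warm-up length, and
# z-score threshold are illustrative assumptions.
from collections import deque
from sklearn.feature_extraction.text import TfidfVectorizer

class RollingTfidfFilter:
    def __init__(self, window_days: int = 30, z_threshold: float = 3.0):
        self.window: deque[str] = deque(maxlen=window_days)  # one doc per day
        self.z_threshold = z_threshold

    def flag_anomalous_terms(self, todays_text: str) -> list[str]:
        if len(self.window) < 7:  # accumulate a minimal baseline first
            self.window.append(todays_text)
            return []
        corpus = list(self.window) + [todays_text]
        vec = TfidfVectorizer(stop_words="english")
        matrix = vec.fit_transform(corpus).toarray()
        baseline, today = matrix[:-1], matrix[-1]
        # z-score of each term's TF-IDF weight against its rolling baseline
        z = (today - baseline.mean(axis=0)) / (baseline.std(axis=0) + 1e-9)
        self.window.append(todays_text)
        return [term for term, score in zip(vec.get_feature_names_out(), z)
                if score > self.z_threshold]
```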
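For the near-duplicate grouping bullet, a sketch using a small sentence-transformers checkpoint and agglomerative clustering; the model name and the distance threshold are assumptions, and any lightweight embedding model would slot in.

```python
# Near-duplicate grouping sketch: embed, cluster by cosine distance, and
# surface one representative per cluster for analyst review.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def cluster_to_centroids(texts: list[str], distance_threshold: float = 0.35):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
    emb = model.encode(texts, normalize_embeddings=True)
    # The threshold controls how aggressively near-duplicates collapse.
    labels = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=distance_threshold,
    ).fit_predict(emb)
    representatives = {}
    for label in set(labels):
        idx = np.where(labels == label)[0]
        centroid = emb[idx].mean(axis=0)
        # The member closest to the centroid stands in for the cluster.
        best = idx[np.argmax(emb[idx] @ centroid)]
        representatives[label] = texts[best]
    return labels, representatives
```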
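The two bot-detection tripwires can be sketched as a sliding-window velocity check and a normalized-text duplication check; the window size and account thresholds below are illustrative.

```python
# Tripwire sketches: burst detection over timestamps, and detection of
# identical text posted by many accounts. Thresholds are assumptions.
from hashlib import sha1

def velocity_anomaly(timestamps: list[float],
                     window_s: float = 60.0, max_posts: int = 20) -> bool:
    """Flag coordinated bursts: too many posts inside any sliding window."""
    ts = sorted(timestamps)
    left = 0
    for right in range(len(ts)):
        while ts[right] - ts[left] > window_s:
            left += 1
        if right - left + 1 > max_posts:
            return True
    return False

def lexical_duplication(posts: list[tuple[str, str]],
                        min_accounts: int = 5) -> list[str]:
    """Flag texts posted verbatim by many distinct accounts.

    posts: (account_id, text) pairs; whitespace/case normalization keeps
    the check cheap while catching trivial copy-paste variants.
    """
    accounts_per_text: dict[str, set] = {}
    for account, text in posts:
        key = sha1(" ".join(text.lower().split()).encode()).hexdigest()
        accounts_per_text.setdefault(key, set()).add(account)
    return [k for k, accts in accounts_per_text.items()
            if len(accts) >= min_accounts]
```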
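For the prompt orchestration layer, an API-agnostic sketch: a template covering intent extraction and relationship mapping, plus strict validation of the structured output. The schema fields and the truncation limit are assumptions for illustration, not Aetosky's actual schema.

```python
# Orchestration sketch: prompt template plus structured-output validation.
# Schema fields and the context-window guard are illustrative assumptions.
import json

PROMPT_TEMPLATE = """You are an intelligence analysis assistant.
Analyze the content cluster below and respond with JSON only, in this shape:
{{"intent": "<apparent intent>",
  "entities": ["<named entities>"],
  "relationships": [{{"source": "", "relation": "", "target": ""}}],
  "confidence": 0.0}}

Content:
{content}
"""

REQUIRED_KEYS = {"intent", "entities", "relationships", "confidence"}

def build_prompt(cluster_text: str, max_chars: int = 12_000) -> str:
    # Hard truncation is a blunt but predictable context-window guard.
    return PROMPT_TEMPLATE.format(content=cluster_text[:max_chars])

def parse_structured_output(raw: str) -> dict:
    """Reject anything that is not valid JSON with the expected keys."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"LLM output missing keys: {missing}")
    return data
```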
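Finally, the cost-cap ladder (warn, escalate, pause, manual triage) might look like the following; the budget figures and escalation factor are placeholders.

```python
# Cost-governor sketch for graceful degradation. Cap, warning fraction,
# and escalation multiplier are placeholder values.
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"      # default anomaly threshold
    DEGRADED = "degraded"  # escalated threshold, fewer items reach the LLM
    PAUSED = "paused"      # LLM calls stopped; items queue for manual triage

class CostGovernor:
    """Tracks spend against a daily cap and reports the pipeline mode."""

    def __init__(self, daily_cap_usd: float = 500.0, warn_fraction: float = 0.8):
        self.cap = daily_cap_usd
        self.warn = daily_cap_usd * warn_fraction
        self.spent = 0.0

    def record(self, cost_usd: float) -> Mode:
        self.spent += cost_usd
        return self.mode

    @property
    def mode(self) -> Mode:
        if self.spent >= self.cap:
            return Mode.PAUSED
        if self.spent >= self.warn:
            return Mode.DEGRADED
        return Mode.NORMAL

    def effective_threshold(self, base: float) -> float:
        """Escalate the anomaly threshold as budget tightens; inf = paused."""
        return {Mode.NORMAL: base,
                Mode.DEGRADED: base * 1.5,
                Mode.PAUSED: float("inf")}[self.mode]
```

A real deployment would persist spend across workers and reset it on a daily schedule; this sketch keeps the counter in-process for clarity.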
Collaboration & Tuning Responsibilities
Collaborate with the Full-Stack Software Developer on data contracts, API schemas, and query optimization for frontend consumption.
Lead the daily filter tuning cycle during the post-launch stabilization period (first 30-60 days): analyze false-positive rates, processing costs, and output quality metrics.
Document pipeline architecture, filter logic, and prompt templates to enable future team onboarding and sovereign AI transition.
Qualifications
Required
3+ years of combined experience spanning data engineering and applied NLP/machine learning.
Demonstrated daily proficiency with AI-assisted development tools (GitHub Copilot, Cursor, Claude Code, or equivalent) - this will be assessed in the technical evaluation.
Strong Python and SQL skills with hands-on experience in PostgreSQL (pgvector a plus), Elasticsearch, or similar.
Experience building web scrapers that handle anti-bot protections, rate limiting, proxy rotation, and DOM structure changes.
Hands-on experience with text embedding models (sentence-transformers, OpenAI embeddings, or equivalent), vector similarity search, and clustering algorithms.
Demonstrated LLM prompt engineering: designing prompts, managing context windows, evaluating output quality, and controlling inference costs.
Familiarity with monitoring and observability tools (Prometheus, Grafana, Datadog, or equivalent).
Preferred
Experience with multilingual NLP.
Experience with real-time data streaming technologies (Kafka, Redis Streams, or similar).
Background in influence operation detection, disinformation analysis, or social media intelligence.
Demonstrated LLM cost optimization techniques (batching, caching, token management).
Job ID: 144259371