
aetosky

Data Engineer

3-5 Years
Posted 2 hours ago

Job Description

Company Description

Aetosky develops secure software platforms designed for defense and dual-use institutions to harness geospatial data for critical decision-making. By providing interoperable tools tailored to mission-critical environments, Aetosky supports operations such as battlefield intelligence, infrastructure protection, disaster response, and climate security. Focused on real-time operations and strategic foresight, our technologies empower partners to act with precision, speed, and confidence in sensitive, air-gapped environments. We collaborate with government and enterprise customers to advance geospatial intelligence capabilities in modern defense and multi-domain operations.

About the role

The Data & NLP/AI Engineer owns the full data journey within Aetosky's Multi-INT Fusion Platform: from scraping raw open-source content off the internet, through statistical filtering and semantic analysis, to orchestrating LLM-powered deep intelligence processing. This is a combined Data Engineering and NLP/AI Engineering role with end-to-end ownership: you build the ingestion infrastructure, deploy the vector database, implement anomaly detection and clustering algorithms, and design the prompt orchestration layer for agentic AI analysis. AI-assisted development (GitHub Copilot, Cursor, Claude Code, or equivalent) is the standard workflow, not optional, and will be directly assessed during the hiring process.

Responsibilities

Data Infrastructure Responsibilities

• Design and build automated data collection pipelines (web scrapers, API integrations) for target platforms including X, Facebook, local forums, Instagram, TikTok, and Reddit.

• Deploy and manage the vector database (PostgreSQL with pgvector extension) with indexing optimized for semantic similarity search at scale.

• Implement pipeline monitoring and alerting: heartbeat checks, record-count validation, dead-letter queues, and golden-record unit tests to prevent silent data loss.
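The "no silent data loss" guarantee in the bullet above can be sketched in a few lines: every record either lands in the output or in a dead-letter queue, and a record-count check confirms nothing vanished. This is a minimal illustrative sketch; function and field names are assumptions, not part of Aetosky's actual codebase.

```python
def run_with_dlq(records, process):
    """Apply `process` to each record; route failures to a dead-letter queue."""
    processed, dead_letters = [], []
    for record in records:
        try:
            processed.append(process(record))
        except Exception as exc:
            # Keep the failing record plus the error for later replay/triage.
            dead_letters.append({"record": record, "error": repr(exc)})
    # Record-count validation: input must equal output + dead-lettered.
    assert len(records) == len(processed) + len(dead_letters), "silent data loss"
    return processed, dead_letters

# Example: a parser that rejects malformed rows.
rows = ["42", "7", "not-a-number", "13"]
ok, dlq = run_with_dlq(rows, int)
```

In production the dead-letter queue would be a durable store (e.g. a database table or message queue topic) rather than an in-memory list, so failed records survive restarts and can be replayed.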

• Manage infrastructure scaling during surge events (sudden data volume spikes during geopolitical crises).

• Complete secure enclave provider assessment based on target client security requirements.

NLP / AI Engineering Responsibilities

• Implement the first-stage statistical filter using TF-IDF with configurable anomaly thresholds against 30-day rolling baselines.
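One way to read this bullet: score each day's terms with TF-IDF, where the IDF baseline comes from a 30-day rolling window of prior documents, and flag terms above a configurable threshold. The stdlib-only sketch below illustrates that shape; class name, smoothing, corpus, and threshold values are all illustrative assumptions.

```python
import math
from collections import Counter, deque

class RollingTfidfFilter:
    def __init__(self, window_days=30, threshold=0.5):
        self.window = deque(maxlen=window_days)  # one list of token-sets per day
        self.threshold = threshold

    def add_day(self, docs):
        """Record one day's documents (each doc is a list of tokens)."""
        self.window.append([set(d) for d in docs])

    def anomalous_terms(self, todays_tokens):
        """Return terms whose TF-IDF vs. the rolling baseline exceeds threshold."""
        baseline_docs = [d for day in self.window for d in day]
        n = len(baseline_docs) or 1
        tf = Counter(todays_tokens)
        total = sum(tf.values()) or 1
        flagged = {}
        for term, count in tf.items():
            df = sum(1 for d in baseline_docs if term in d)
            idf = math.log((1 + n) / (1 + df)) + 1  # smoothed IDF
            score = (count / total) * idf
            if score >= self.threshold:
                flagged[term] = round(score, 3)
        return flagged

# A term absent from the baseline ("convoy") scores high; a common
# baseline term ("border") stays below the threshold.
f = RollingTfidfFilter(window_days=30, threshold=0.5)
f.add_day([["border", "traffic"], ["border", "weather"]])
spikes = f.anomalous_terms(["convoy", "convoy", "border"])
```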

• Build semantic clustering using lightweight vector embedding models, grouping near-duplicate content into representative cluster centroids for efficient analyst review.
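The centroid-grouping idea can be sketched with a greedy single-pass pass over embedding vectors: each vector joins the first cluster whose running centroid it is cosine-similar to, otherwise it starts a new cluster. The toy 2-D vectors below stand in for real embedding-model output; the similarity threshold is an assumed parameter.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def greedy_cluster(vectors, sim_threshold=0.9):
    """Greedy single-pass clustering: join a vector to the first cluster
    whose running centroid is within the similarity threshold."""
    clusters = []  # each: {"members": [index...], "centroid": [float...]}
    for i, v in enumerate(vectors):
        for c in clusters:
            if cosine(v, c["centroid"]) >= sim_threshold:
                c["members"].append(i)
                # Update the running centroid (mean of member vectors).
                k = len(c["members"])
                c["centroid"] = [(cv * (k - 1) + x) / k
                                 for cv, x in zip(c["centroid"], v)]
                break
        else:
            clusters.append({"members": [i], "centroid": list(v)})
    return clusters

# Two near-duplicate vectors collapse into one cluster; the third stands alone.
vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
clusters = greedy_cluster(vecs, sim_threshold=0.95)
```

An analyst then reviews one representative per cluster (e.g. the member closest to the centroid) rather than every near-duplicate post.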

• Implement bot-detection tripwires: velocity anomaly detection (timing-based coordinated inauthentic behavior) and lexical duplication detection (copy-paste spam arrays).
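Both tripwires named in this bullet have simple cores, sketched below under assumed parameters: a velocity check flags too many posts inside any sliding time window, and a duplication check flags identical normalized text posted by several distinct accounts. Window size and account count are illustrative.

```python
from collections import defaultdict

def velocity_tripwire(timestamps, window_s=60, max_posts=5):
    """Flag if any sliding window of `window_s` seconds holds more than
    `max_posts` posts: a crude timing-based coordination signal."""
    ts = sorted(timestamps)
    lo = 0
    for hi in range(len(ts)):
        while ts[hi] - ts[lo] > window_s:
            lo += 1
        if hi - lo + 1 > max_posts:
            return True
    return False

def duplication_tripwire(posts, min_accounts=3):
    """Flag normalized texts repeated verbatim by many distinct accounts
    (copy-paste spam arrays). `posts` is a list of (account, text) pairs."""
    accounts_by_text = defaultdict(set)
    for account, text in posts:
        # Normalize case and whitespace so trivial edits don't evade detection.
        accounts_by_text[" ".join(text.lower().split())].add(account)
    return {t for t, accs in accounts_by_text.items() if len(accs) >= min_accounts}

burst = velocity_tripwire([0, 1, 2, 3, 4, 5, 300], window_s=60, max_posts=5)
spam = duplication_tripwire(
    [("a", "Vote NOW"), ("b", "vote  now"), ("c", "VOTE now"), ("d", "hello")]
)
```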

• Design and manage the prompt orchestration layer for the second-stage LLM processor: intent extraction, relationship mapping, and structured output generation within a secure cloud enclave.
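At its simplest, the orchestration layer fills a prompt template (respecting context-window limits) and validates that the model's reply matches the structured schema downstream stages expect. The template wording, key names, and the canned `reply` below are illustrative assumptions, not Aetosky's actual prompts.

```python
import json

PROMPT_TEMPLATE = """You are an intelligence analyst. For the post below,
return ONLY a JSON object with keys: intent, entities, relationships.

POST:
{post}
"""

def build_prompt(post, max_chars=4000):
    """Truncate the post to respect the context window, then fill the template."""
    return PROMPT_TEMPLATE.format(post=post[:max_chars])

def parse_structured_output(raw):
    """Validate the LLM reply into the structured record the pipeline expects."""
    data = json.loads(raw)
    missing = {"intent", "entities", "relationships"} - data.keys()
    if missing:
        raise ValueError(f"LLM output missing keys: {sorted(missing)}")
    return data

# A canned model reply stands in for the real LLM call inside the enclave.
reply = '{"intent": "mobilization", "entities": ["convoy"], "relationships": []}'
record = parse_structured_output(reply)
```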

• Implement cost-cap logic with graceful degradation: dynamic threshold escalation at budget warning levels, automated pause at cap, and manual triage fallback.
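The graceful-degradation behavior described here reduces to a small state machine: normal operation below a warning level, a raised filter threshold (so fewer items reach the LLM) above it, and a full pause at the cap with manual triage taking over. The budget figures and thresholds in this sketch are illustrative assumptions.

```python
class CostGovernor:
    """Sketch of the cost-cap behavior: escalate the filter threshold once
    spend passes a warning fraction of the budget, pause automated LLM calls
    entirely at the cap."""

    def __init__(self, budget_usd, warn_fraction=0.8,
                 base_threshold=0.5, escalated_threshold=0.8):
        self.budget = budget_usd
        self.warn_at = budget_usd * warn_fraction
        self.base_threshold = base_threshold
        self.escalated_threshold = escalated_threshold
        self.spent = 0.0

    def record_spend(self, usd):
        self.spent += usd

    def mode(self):
        if self.spent >= self.budget:
            return "paused"      # automated processing stops; manual triage
        if self.spent >= self.warn_at:
            return "escalated"   # only higher-scoring items reach the LLM
        return "normal"

    def current_threshold(self):
        if self.mode() == "escalated":
            return self.escalated_threshold
        return self.base_threshold

gov = CostGovernor(budget_usd=100.0)
gov.record_spend(85.0)
mode_at_warning = gov.mode()   # past the 80% warning level
gov.record_spend(20.0)
mode_at_cap = gov.mode()       # past the cap
```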

Collaboration & Tuning Responsibilities

• Collaborate with the Full-Stack Software Developer on data contracts, API schemas, and query optimization for frontend consumption.

• Lead the daily filter tuning cycle during the post-launch stabilization period (first 30–60 days): analyze false positive rates, processing costs, and output quality metrics.

• Document pipeline architecture, filter logic, and prompt templates to enable future team onboarding and sovereign AI transition.

Qualifications

Required

• 3+ years of combined experience spanning data engineering and applied NLP/machine learning.

• Demonstrated daily proficiency with AI-assisted development tools (GitHub Copilot, Cursor, Claude Code, or equivalent); this will be assessed in the technical evaluation.

• Strong Python and SQL skills with hands-on experience in PostgreSQL (pgvector a plus), Elasticsearch, or similar.

• Experience building web scrapers that handle anti-bot protections, rate limiting, proxy rotation, and DOM structure changes.

• Hands-on experience with text embedding models (sentence-transformers, OpenAI embeddings, or equivalent), vector similarity search, and clustering algorithms.

• Demonstrated LLM prompt engineering: designing prompts, managing context windows, evaluating output quality, and controlling inference costs.

• Familiarity with monitoring and observability tools (Prometheus, Grafana, Datadog, or equivalent).

Preferred

• Experience with multilingual NLP.

• Experience with real-time data streaming technologies (Kafka, Redis Streams, or similar).

• Background in influence operation detection, disinformation analysis, or social media intelligence.

• Demonstrated LLM cost optimization techniques (batching, caching, token management).

• Familiarity with government cloud environments (FedRAMP, ISO 27001, or equivalent regional certifications).


Job ID: 146136333
