Meta Data Engineering Interview Questions: Top Topics, Problems & Solutions
Meta data engineering interviews (and similar big-tech loops) usually lean heavily on SQL—often PostgreSQL-style—and on Python for problems involving hash maps, counting, arrays, strings, and streaming or interval-style logic. Interviewers care that you can write correct queries, pick the right grain, and explain trade-offs—not just memorize syntax. This guide introduces the topics that show up most often in that kind of prep: aggregations, joins, windows, dates, null-safe reporting, set logic, and representative Python patterns. The sample questions here are original teaching examples. # Topic Why it matters on Meta-style sets 1 Aggregation & GROUP BY / HAVING Summaries by brand, seller, day, etc.; filter groups after summing. 2 Filtering (WHERE) Keep the right rows before you aggregate. 3 Joins & deduplication Combine tables without inflating row counts. 4 Window functions & ranking Running totals, top-N per group, ties. 5 Subqueries & CTEs Multi-step logic readable in one script. 6 Dates & time-series Daily revenue, latency, cohort-style buckets. 7 NULL & safe percentages Correct numerators/denominators; avoid silent wrong rates. 8 Set-style logic (overlap / both / except) Customers in both stores, A-not-B, etc. 9 Python: hash maps & counting Dicts, Counter, frequency and “most common.” 10 Python: streaming & intervals Update state as events arrive; merge intervals. If you are new to SQL: In most databases the engine processes a query roughly in this order: FROM / joins → WHERE (filter rows) → GROUP BY → compute aggregates → HAVING (filter groups) → window functions → SELECT / ORDER BY. When in doubt, ask: “Am I filtering one row at a time (WHERE) or a whole group after summing (HAVING)?” Picture a table with many detail rows—for example one row per order. Aggregation means: “turn lots of rows into one summary value (or a few values) per bucket.” The bucket is whatever the question cares about: per user, per day, per campaign, and so on. 
GROUP BY defines the bucket: "put all rows with the same user_id together," or "the same (store_id, day) together." Every distinct combination of the GROUP BY columns is one group; the database runs your aggregate functions separately inside each group.

SUM(col)
- What it does: adds all numeric values of col in the bucket.
- NULL behavior: NULL cells are ignored (they are not treated as 0). If every value is NULL, SUM is usually NULL, not 0—say that in an interview if the edge case matters.
- Typical use: revenue totals, quantities, scores summed per seller or per day.
- Worked example: in one group, amount values 10, NULL, 30 → SUM(amount) = 40 (only 10 + 30).

AVG(col)
- What it does: average of non-NULL values: sum of non-null values divided by count of non-null values.
- NULL behavior: rows where col is NULL do not enter the numerator or the denominator. If you need "average where missing means 0," use AVG(COALESCE(col, 0)) (only if the business defines it that way).
- Typical use: average order value per user, average latency per region.
- Worked example: same three rows 10, NULL, 30 → AVG(amount) = 20, because (10 + 30) / 2; the NULL row is not counted in the average.

COUNT(*)
- What it does: counts how many rows are in the bucket—every row counts, even if some columns are NULL.
- Typical use: "How many orders per customer?", "How many events in this hour?"—when each row is one event.
- Worked example: same three rows → COUNT(*) = 3 (the NULL row still counts as a row).

COUNT(col)
- What it does: counts rows where col is not NULL. Differs from COUNT(*) as soon as col has nulls.
- Example intuition: COUNT(user_id) might count rows with a known user; COUNT(*) counts all rows in the group.
- Related: COUNT(DISTINCT col) counts unique non-null values in the bucket—essential after joins when you must count people, not multiplied rows (see section 3).
- Worked example: same three rows → COUNT(amount) = 2.
If the third row were 50 instead of NULL, COUNT(DISTINCT amount) with values 10, 30, 50 would be 3; with 10, 30, 10 it would be 2.

MIN(col) and MAX(col)
- What they do: return the smallest or largest value of col in the bucket. Works on orderable types: numbers, dates/timestamps, strings (lexicographic order).
- NULL behavior: NULLs are skipped. If all values are NULL, the result is NULL.
- Typical use: latest (MAX(ts)), earliest (MIN(day)), cheapest product in a category (MIN(price)).
- Worked example: amounts 10, NULL, 30 → MIN(amount) = 10, MAX(amount) = 30. For strings 'apple', 'banana' in one group → MIN = 'apple' (lexicographic).

Worked example — one dataset, several aggregates

Suppose orders looks like this:

| order_id | user_id | amount |
| --- | --- | --- |
| 101 | u1 | 20.00 |
| 102 | u1 | NULL |
| 103 | u1 | 40.00 |
| 104 | u2 | 100.00 |

Run:

```sql
SELECT user_id,
       SUM(amount)   AS sum_amt,
       AVG(amount)   AS avg_amt,
       COUNT(*)      AS n_rows,
       COUNT(amount) AS n_known_amt,
       MIN(amount)   AS min_amt,
       MAX(amount)   AS max_amt
FROM orders
GROUP BY user_id
ORDER BY user_id;
```

You should get:

| user_id | sum_amt | avg_amt | n_rows | n_known_amt | min_amt | max_amt |
| --- | --- | --- | --- | --- | --- | --- |
| u1 | 60.00 | 30.0000… | 3 | 2 | 20.00 | 40.00 |
| u2 | 100.00 | 100.0000… | 1 | 1 | 100.00 | 100.00 |

For u1, the row with amount NULL still counts in COUNT(*) (3 rows) but not in SUM / AVG / COUNT(amount) (only the 20 and 40 matter). That single picture is how most interviews test whether you understand NULL with aggregates.

COUNT(DISTINCT) mini-example: if clicks has two rows for the same user_id (double click), COUNT(*) is 2 but COUNT(DISTINCT user_id) is 1 for that user's bucket.

Conditional aggregation (CASE inside aggregates)

Idea: count or sum only some rows in the group without splitting into multiple queries—put the condition inside the aggregate.

Patterns:
- SUM(CASE WHEN condition THEN col ELSE 0 END)
- COUNT(CASE WHEN … THEN 1 END) (or SUM(CASE WHEN … THEN 1 ELSE 0 END))
- AVG(CASE WHEN … THEN col END) (averages only non-null branches)
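The aggregate behaviors above (NULL-skipping SUM/AVG, COUNT(*) vs COUNT(col), and a CASE inside an aggregate) can be checked in a tiny in-memory database. This is a sketch using Python's stdlib sqlite3 with made-up data matching the worked example; SQLite's aggregate NULL semantics match the PostgreSQL behavior described here.

```python
import sqlite3

# Hypothetical orders table matching the worked example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(101, "u1", 20.0), (102, "u1", None), (103, "u1", 40.0), (104, "u2", 100.0)],
)
rows = conn.execute(
    """
    SELECT user_id,
           SUM(amount)   AS sum_amt,      -- NULLs skipped
           AVG(amount)   AS avg_amt,      -- NULLs skipped in numerator and denominator
           COUNT(*)      AS n_rows,       -- every row counts
           COUNT(amount) AS n_known_amt,  -- only non-NULL amounts
           SUM(CASE WHEN amount > 25 THEN 1 ELSE 0 END) AS big_orders  -- conditional count
    FROM orders
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()
print(rows)  # [('u1', 60.0, 30.0, 3, 2, 1), ('u2', 100.0, 100.0, 1, 1, 1)]
```

Note that the NULL row falls into the ELSE 0 branch of the CASE, so it never inflates the conditional count.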
Worked example — events per user by type

| user_id | event_type |
| --- | --- |
| u1 | view |
| u1 | purchase |
| u1 | view |
| u2 | view |

```sql
SELECT user_id,
       COUNT(*) AS total_events,
       SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases,
       SUM(CASE WHEN event_type = 'view'     THEN 1 ELSE 0 END) AS views
FROM events
GROUP BY user_id;
```

Result:

| user_id | total_events | purchases | views |
| --- | --- | --- | --- |
| u1 | 3 | 1 | 2 |
| u2 | 1 | 0 | 1 |

This is the same "one GROUP BY, many metrics" style as COUNT(*) FILTER (WHERE …) in PostgreSQL—portable warehouses use CASE heavily.

GROUP BY and HAVING (how they fit together)

HAVING is WHERE for buckets: it runs after grouping, so you can filter on AVG(amount), COUNT(*), etc. WHERE runs before grouping and only sees raw row columns—so WHERE AVG(amount) > 50 is invalid: the average does not exist until after GROUP BY.

Worked example — WHERE vs HAVING

| order_id | brand_id | amount |
| --- | --- | --- |
| 1 | A | 40 |
| 2 | A | 70 |
| 3 | A | 20 |
| 4 | B | 200 |

WHERE amount > 30 drops row 3 before grouping. Then GROUP BY brand_id with SUM(amount) gives A = 110 (40 + 70), B = 200. HAVING SUM(amount) > 100 runs after grouping on the unfiltered table: A's sum is 130, B's is 200—both pass. If you needed "brands with at least 3 orders and sum > 100," you would use HAVING COUNT(*) >= 3 AND SUM(amount) > 100.

Worked example — same orders as above, show HAVING output

```sql
SELECT brand_id, SUM(amount) AS total_amt, COUNT(*) AS n_orders
FROM orders
GROUP BY brand_id
HAVING SUM(amount) > 100;
```

Result:

| brand_id | total_amt | n_orders |
| --- | --- | --- |
| A | 130 | 3 |
| B | 200 | 1 |

If you add AND COUNT(*) >= 2, only A remains (B has a single order).

Rule of thumb: if the condition uses SUM / COUNT / AVG / … of the group, use HAVING. If it only uses this row's columns, use WHERE (and put it first—it usually makes the query faster too). If a column appears in SELECT and is not inside an aggregate, it must appear in GROUP BY (in strict SQL); otherwise the database does not know which row's value to show for that column inside a bucket.

Common beginner mistakes
- Putting AVG(amount) > 50 in WHERE → use HAVING after GROUP BY.
- Forgetting a column in GROUP BY when it appears in SELECT without an aggregate → invalid query in strict SQL.
- Answering at the wrong grain (e.g. one row per ad_id when the question asked per campaign).

Sample problem: table orders(order_id, user_id, amount) lists purchases. Return each user_id whose average amount is greater than 50 and who has at least 3 orders.

```sql
SELECT user_id,
       AVG(amount) AS avg_amount,
       COUNT(*)    AS order_cnt
FROM orders
GROUP BY user_id
HAVING AVG(amount) > 50
   AND COUNT(*) >= 3;
```

Why this works: we group all orders by user_id, compute average and count per user, then keep only groups that pass both conditions. Those conditions depend on aggregates, so they belong in HAVING.

Practice: SQL · Topic — Aggregation problems (all companies) · COMPANY · Meta — aggregation (Meta-tagged aggregation)

2. Data Filtering and WHERE Clause in SQL

Filtering Data Using the WHERE Clause in SQL

WHERE is the row filter: after FROM and joins, each row is tested once. If the condition is true, the row stays and can be grouped, counted, or shown in the result; if false, the row is dropped and never enters aggregates. Rows removed here are invisible to GROUP BY and to COUNT(*) on the remaining set—so push the filters that narrow the problem early.

Comparison operators (=, <>, <, >, <=, >=)

Compare a column to a literal or another column: amount > 100, ts >= TIMESTAMP '2026-01-01'. Strings compare in the database's collation order unless you use explicit functions. Watch types: comparing a DATE to a TIMESTAMP may require casting so you do not accidentally exclude boundary instants.

Worked example

| id | amount | status |
| --- | --- | --- |
| 1 | 60 | paid |
| 2 | 40 | refunded |
| 3 | 90 | paid |

WHERE amount > 50 AND status <> 'refunded' keeps rows 1 and 3 only.

BETWEEN a AND b is inclusive on both ends. For timestamps, prefer a half-open range (ts >= start AND ts < end) so boundary instants are counted exactly once.

Scalar subqueries: WHERE amount > (SELECT AVG(amount) FROM …) — the inner query must return at most one row and one column.

Worked example: orders(id, dept_id, amount) — WHERE amount > (SELECT AVG(amount) FROM orders o2 WHERE o2.dept_id = orders.dept_id) keeps rows above that row's department average (correlated scalar subquery).
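The correlated scalar subquery pattern just described can be run end to end with Python's stdlib sqlite3. This is a sketch with a made-up orders table; the inner query recomputes the department average for each outer row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, dept_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "A", 80), (2, "A", 60), (3, "B", 90), (4, "B", 10)],
)
# Dept A average = 70, dept B average = 50; keep rows strictly above
# their own department's average.
rows = conn.execute(
    """
    SELECT id, dept_id, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount)
                    FROM orders o2
                    WHERE o2.dept_id = orders.dept_id)
    ORDER BY id
    """
).fetchall()
print(rows)  # [(1, 'A', 80.0), (3, 'B', 90.0)]
```

On large tables the same logic is usually rewritten as a join against a pre-aggregated per-department table (the CTE version shown later in this guide).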
Worked example — IN (membership)

products sku: A, B, C — discontinued sku: B. SELECT sku FROM products WHERE sku IN (SELECT sku FROM discontinued) → B only.

Worked example — EXISTS (existence; no need to return columns from the inner query)

Same tables. SELECT sku FROM products p WHERE EXISTS (SELECT 1 FROM discontinued d WHERE d.sku = p.sku) → B—the same result as IN here; EXISTS often reads better when the inner query is large or correlated.

Subqueries in the SELECT list

SELECT id, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_cnt — runs per outer row; fine in interviews; on big data often rewritten as a join for performance.

Worked example: users Ann (id=1), Bo (id=2); orders has two rows for Ann, none for Bo. SELECT u.id, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) returns (1, 2) and (2, 0).

Correlated subqueries

The inner query references the outer row (e.g. WHERE o.dept_id = e.dept_id with e from outside). Think: "for each outer row, evaluate this."

Worked example: employees (id, dept_id, salary): (1, A, 80k), (2, A, 60k), (3, B, 90k). For row 1, AVG(salary) WHERE dept_id = 'A' is 70k; WHERE salary > that avg keeps row 1 only among dept A.

CTEs (WITH)

- Readability: steps read top to bottom like a pipeline: clean → aggregate → join.
- Reuse: the same CTE name can appear multiple times in the final query; subqueries in FROM must be duplicated or wrapped.
- Chaining: WITH a AS (…), b AS (SELECT … FROM a …) SELECT … FROM b — b can use a.
- Recursive CTEs (WITH RECURSIVE): for trees/org charts; specialty syntax—learn it when you hit that problem class.

Worked example:

```sql
WITH per_day AS (
    SELECT day, SUM(amount) AS revenue
    FROM orders
    GROUP BY day
)
SELECT day, revenue
FROM per_day
WHERE revenue > 100;
```

Step 1 builds a daily revenue table; step 2 filters it—the same logic as an inline subquery, but easier to read.
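The per_day CTE above can be tried in a few lines with Python's stdlib sqlite3 (SQLite supports WITH). The table and values here are made up to mirror the worked example: two orders on 2026-01-01 summing past the 100 threshold, one small order on 2026-01-02.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2026-01-01", 60), ("2026-01-01", 70), ("2026-01-02", 40)],
)
rows = conn.execute(
    """
    WITH per_day AS (
        SELECT day, SUM(amount) AS revenue
        FROM orders
        GROUP BY day
    )
    SELECT day, revenue
    FROM per_day
    WHERE revenue > 100   -- filters the aggregated step, like HAVING would
    """
).fetchall()
print(rows)  # [('2026-01-01', 130.0)]
```

Only 2026-01-01 survives: its revenue (60 + 70 = 130) clears the threshold, while 2026-01-02 (40) does not.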
Subquery vs CTE (quick compare):

| | Subquery | CTE (WITH) |
| --- | --- | --- |
| Readability | Fine for one small nest; deep nesting gets hard to read | Often easier to read—steps read top to bottom |
| Reuse | Repeat the whole nest if you need it twice | Same CTE name can be referenced multiple times in one query |
| Style | "Inline" | "Named pipeline" |

Common beginner mistakes
- Expecting a subquery to return one value when it returns many rows—use IN, EXISTS, or a join.
- Nesting many subqueries when WITH would make the steps obvious.

Sample problem: from employees(id, dept_id, salary), list employees who earn more than their department's average salary.

```sql
WITH dept_avg AS (
    SELECT dept_id, AVG(salary) AS avg_sal
    FROM employees
    GROUP BY dept_id
)
SELECT e.id, e.dept_id, e.salary
FROM employees e
JOIN dept_avg d ON d.dept_id = e.dept_id
WHERE e.salary > d.avg_sal;
```

Why this works: dept_avg is a small table of one row per department; we join each employee to their department's average and filter.

Practice: SQL · Topic — Subqueries & CTEs · COMPANY · Meta — CTE (Meta-tagged CTEs)

6. Date Handling and Time-Series Data Concepts

Date and Time-Series Handling in SQL

Time-series questions almost always mean: "put each event in a time bucket (day, hour, week), then aggregate."

date_trunc (PostgreSQL-style)

date_trunc('day', ts) snaps every timestamp in that calendar day to the same instant (midnight at the start of the day). Common grains: hour, week, month. After truncating, GROUP BY the truncated value (or cast to date) and apply SUM, COUNT, etc.

Worked example

| created_at (UTC) | amount |
| --- | --- |
| 2026-01-01 22:00 | 10 |
| 2026-01-02 03:00 | 20 |

The two rows fall on different calendar days in UTC, so GROUP BY date_trunc('day', created_at)::date gives two buckets: 2026-01-01 → 10, 2026-01-02 → 20.

INTERVAL and relative windows

CURRENT_DATE - INTERVAL '7 days' or NOW() - INTERVAL '1 hour' expresses sliding windows without hard-coding calendar dates. Pair with WHERE ts >= … (and usually ts < …), e.g. WHERE ts >= NOW() - INTERVAL '24 hours'.
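The day-bucketing idea behind date_trunc translates directly to Python when a problem hands you raw rows instead of a database. This is a sketch with hypothetical (timestamp, amount) tuples: taking the 'YYYY-MM-DD' prefix of an ISO timestamp plays the role of date_trunc('day', ts).

```python
from collections import defaultdict

# Hypothetical event rows: (ISO timestamp string, amount).
rows = [("2026-01-01 22:00", 10), ("2026-01-02 03:00", 20)]

daily = defaultdict(int)
for ts, amount in rows:
    day = ts[:10]        # 'YYYY-MM-DD' prefix ~ date_trunc('day', ts)
    daily[day] += amount  # SUM(amount) per bucket

print(sorted(daily.items()))  # [('2026-01-01', 10), ('2026-01-02', 20)]
```

As in the SQL version, each timestamp snaps to its calendar day and the amounts accumulate per bucket; parse with datetime if you need real timezone handling rather than string prefixes.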
EXTRACT / date_part

Pull hour of day, dow (day of week), month, etc. for "volume by hour" or seasonality slices: EXTRACT(HOUR FROM ts).

Worked example: ts values 2026-01-01 08:30 and 2026-01-01 09:15 — GROUP BY EXTRACT(HOUR FROM ts) puts them in hour buckets 8 and 9.

Durations

end_ts - start_ts (a PostgreSQL interval) or DATEDIFF-style functions in other engines—useful for durations and "time between events."

Worked example: ordered_at 2026-01-01 10:00, shipped_at 2026-01-01 16:00 — shipped_at - ordered_at is a 6-hour interval (cast to minutes/seconds if the question asks for a number).

Rolling windows

SUM(amount) OVER (PARTITION BY store_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) = a 7-day trailing sum per store, with one row per day still present. Change ROWS to RANGE only when you understand your SQL dialect's semantics for gaps in dates.

Worked example (7-day trailing sum, one row per day stays)

| store_id | day | amount |
| --- | --- | --- |
| S1 | 2026-01-01 | 10 |
| S1 | 2026-01-02 | 5 |
| S1 | 2026-01-03 | 20 |
| S1 | 2026-01-04 | 0 |
| S1 | 2026-01-05 | 15 |
| S1 | 2026-01-06 | 10 |
| S1 | 2026-01-07 | 5 |

SUM(amount) OVER (PARTITION BY store_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) on 2026-01-07 is 65 (the sum of all seven days)—the same pattern as section 4's shorter 2-day illustration, stretched to a week of daily facts.

Avoid BETWEEN for timestamps: prefer the half-open form ts >= start_ts AND ts < end_ts, e.g. ts >= TIMESTAMP '2026-01-01' AND ts < TIMESTAMP '2026-01-02', so boundary instants are counted exactly once.

Sample problem: daily revenue for the last 7 days from orders(created_at, amount).

```sql
SELECT date_trunc('day', created_at)::date AS day,
       SUM(amount) AS revenue
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1;
```

Why this works: we filter to the rolling window, truncate each created_at to midnight, and sum per day.

Practice: SQL · Topic — Date functions & time buckets · COMPANY · Meta — dates (Meta-tagged date functions)

7. Handling NULL Values and Safe Calculations

NULL Handling and Safe Calculations in SQL

NULL means "we don't know the value"—not zero, not "false," not an empty string.

WHERE and expressions

WHERE col = NULL is invalid / wrong; use IS NULL or IS NOT NULL.
COALESCE(a, b) returns the first non-null argument—use it when the business says "treat unknown as default X."

Worked example: rows (id=1, discount_pct=NULL) and (id=2, discount_pct=10). SELECT COALESCE(discount_pct, 0) → 0 and 10. WHERE discount_pct IS NULL returns row 1 only.

Aggregates and NULL (per bucket)

- SUM / AVG / MIN / MAX: NULL inputs are skipped; if there is nothing left to aggregate, the result is often NULL.
- COUNT(*): counts rows, regardless of nulls in individual columns.
- COUNT(col): counts rows where col is not NULL.
- COUNT(DISTINCT col): counts distinct non-null values.

Worked example (one group, three rows): amount values 10, NULL, 30 → SUM = 40, AVG = 20, COUNT(*) = 3, COUNT(amount) = 2.

Safe rates and percentages

Write the definition explicitly: numerator = rows (or sum) matching success; denominator = all rows in scope (or a filtered population). "Completion rate" changes if the denominator is "all tasks" vs "tasks that started." COUNT(*) FILTER (WHERE condition) (PostgreSQL) builds numerators/denominators in one grouped query without subqueries. SUM(CASE WHEN condition THEN 1 ELSE 0 END) * 1.0 / COUNT(*) is the portable analog.

Worked example: tasks (1, done), (2, open), (3, done). Done fraction = 2 / 3, i.e. about 0.667. Query: COUNT(*) FILTER (WHERE status = 'done')::numeric / NULLIF(COUNT(*), 0) → 0.666…

If the denominator can be 0, use NULLIF(denominator, 0) so the division yields NULL instead of an error—then handle it in the app or an outer query.

Worked example: a player row with hits = 0 and at_bats = 0 — hits::numeric / NULLIF(at_bats, 0) → NULL, not a runtime error.

No rows after WHERE: aggregates like COUNT(*) → 0, SUM → NULL (typical)—state assumptions out loud in an interview. Worked example: SELECT SUM(amount) FROM orders WHERE FALSE — no rows match → SUM is NULL in PostgreSQL; COUNT(*) would be 0.

Common beginner mistakes
- Treating NULL like 0 in business logic.
- Using the wrong denominator for a rate.
- Dividing without guarding zero with NULLIF.

Sample problem: tasks(task_id, status) where status is 'done' or 'open'. What fraction of tasks are done?
```sql
SELECT COUNT(*) FILTER (WHERE status = 'done')::numeric
       / NULLIF(COUNT(*), 0) AS done_fraction
FROM tasks;
```

Why this works: numerator = done rows; denominator = all tasks; NULLIF prevents divide-by-zero.

Practice: SQL · Topic — NULL handling & safe rates · COMPANY · Meta — hub (Meta SQL & Python, includes null-edge drills)

8. Set Operations and Data Comparison Techniques

Set Operations in SQL (UNION, INTERSECT, EXCEPT)

Set problems sound like: "users in both A and B," "customers who bought X but not Y," "combine two lists of ids." You are doing intersection, difference, or union on keys (usually user_id). Set operators require both branches to return the same number of columns with compatible types.

INTERSECT
- Result: only rows that appear in both SELECTs (duplicate handling depends on dialect; often distinct rows).
- Interview mapping: "users who did both action A and action B" when each branch returns the same key column.
- Worked example: mobile has user_ids u1, u2; web has u2, u3 — INTERSECT → u2 only.

EXCEPT (some engines: MINUS)
- Result: rows in the first query not present in the second.
- Interview mapping: "signed up but never purchased," "in feed A but not in feed B."
- Worked example: signups u1, u2; buyers u2 — EXCEPT → u1 ("signed up, never bought" if buyers means "ever purchased").

UNION
- Result: stack the two result sets and deduplicate rows (can sort + dedupe—often more expensive than UNION ALL).
- Worked example: SELECT id FROM a yields 1, 1, 2; SELECT id FROM b yields 2, 3 — UNION → 1, 2, 3 (unique).

UNION ALL
- Result: stack results keeping all duplicates—preferred when you know duplicates are impossible or when you want repeated rows (e.g. concatenating event streams).
- Worked example: same inputs → UNION ALL → 1, 1, 2, 2, 3 (five rows).

Join and EXISTS equivalents
- Intersection on keys: INNER JOIN on user_id from two deduped subqueries—or WHERE EXISTS for semi-join style.
- Difference (A not B): anti-join: LEFT JOIN B ON … WHERE B.key IS NULL.
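The set-operator and anti-join patterns above can be tried with Python's stdlib sqlite3, which supports INTERSECT and EXCEPT. The signups/buyers tables and their rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signups (user_id TEXT);
    CREATE TABLE buyers  (user_id TEXT);
    INSERT INTO signups VALUES ('u1'), ('u2');
    INSERT INTO buyers  VALUES ('u2');
""")

# Intersection: signed up AND bought.
both = conn.execute(
    "SELECT user_id FROM signups INTERSECT SELECT user_id FROM buyers"
).fetchall()

# Difference: signed up but never bought.
never_bought = conn.execute(
    "SELECT user_id FROM signups EXCEPT SELECT user_id FROM buyers"
).fetchall()

# Same difference expressed as an anti-join.
anti_join = conn.execute("""
    SELECT s.user_id
    FROM signups s
    LEFT JOIN buyers b ON b.user_id = s.user_id
    WHERE b.user_id IS NULL
""").fetchall()

print(both, never_bought, anti_join)  # [('u2',)] [('u1',)] [('u1',)]
```

EXCEPT and the anti-join agree here; the anti-join form becomes the safer habit once NULL keys can appear in the right-hand table.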
NOT EXISTS (SELECT 1 FROM B WHERE …) is often safer than NOT IN when B can produce NULL keys.

Worked example (semi-join with IN): customers 1, 2, 3; vip has customer_id 2 only. SELECT id FROM customers WHERE id IN (SELECT customer_id FROM vip) → 2.

Worked example (anti-join): users 1, 2; buyers has only user 2. FROM users u LEFT JOIN buyers b ON u.id = b.user_id WHERE b.user_id IS NULL → user 1 (not in buyers).

GROUP BY + HAVING for "both" conditions

HAVING COUNT(DISTINCT CASE WHEN … THEN tag END) = 2 (or two boolean conditions) can express "user did both activities" in one fact table without set operators—useful when set SQL is awkward or slow.

Worked example: purchases rows (u1, BOOK), (u1, PEN), (u2, BOOK). GROUP BY user_id HAVING MAX(CASE WHEN product_code = 'BOOK' THEN 1 ELSE 0 END) = 1 AND MAX(CASE WHEN product_code = 'PEN' THEN 1 ELSE 0 END) = 1 → u1 only (has both).

Common beginner mistakes
- Using OR when the problem needs both conditions on the same user (not "either event").
- NOT IN (subquery) when the subquery can return NULL—use NOT EXISTS instead.

Sample problem: purchases(user_id, product_code). Find user_ids who bought BOOK and PEN (possibly on different rows).

```sql
SELECT user_id FROM purchases WHERE product_code = 'BOOK'
INTERSECT
SELECT user_id FROM purchases WHERE product_code = 'PEN';
```

Why this works: the first query is the set of users with BOOK; the second is the set with PEN; INTERSECT returns users in both sets.

Practice: SQL · Topic — Set operations (INTERSECT, EXCEPT, …) · COMPANY · Meta — sets (Meta-tagged set logic)

9. Hash Maps and Counting Techniques in Python

Hash Maps and Counting in Python for Data Processing

A dict (hash map) maps keys to values. Average-time lookup, insert, and update are O(1) in the usual amortized sense—much faster than rescanning a whole list for every key.

dict and .get
- Frequency: counts[k] = counts.get(k, 0) + 1 avoids KeyError on the first sighting of k.
- Grouping into lists: d.setdefault(k, []).append(item), or use defaultdict below.
Worked example: words = ["cat", "dog", "cat"] — after the counting loop (freq[w] = freq.get(w, 0) + 1 for each word), freq is {"cat": 2, "dog": 1}.

collections.Counter

Built for frequency counts: Counter(iterable), .most_common(k) for top-k. Good when the problem is "how many of each label?" with minimal code.

Worked example:

```python
Counter(["err", "ok", "err"]).most_common(1)  # [('err', 2)]
```

defaultdict from collections
- defaultdict(int) — the same ergonomics as counting with a 0 default.
- defaultdict(list) — d[user_id].append(event) mirrors GROUP BY user_id "collect all rows in a bucket."
- defaultdict(set) — handy for unique neighbors per key.

Worked example: rows ("u1", "click"), ("u1", "view") — dd["u1"] becomes ["click", "view"] with defaultdict(list). With defaultdict(set): edges ("u1", "u2"), ("u1", "u3"), ("u1", "u2") — g["u1"].add(...) yields {"u2", "u3"} (duplicates collapsed).

Multi-column GROUP BY keys

Use a tuple key: key = (country, day) as the dict key when the bucket is more than one dimension.

Worked example:

```python
totals = {}
totals[("US", "2026-01-01")] = totals.get(("US", "2026-01-01"), 0) + 50
```

→ one bucket per (country, day) pair.

| SQL idea | Python sketch |
| --- | --- |
| COUNT(*) per key | counts[k] += 1 or Counter |
| SUM(amount) per key | sums[k] = sums.get(k, 0) + amount |
| AVG(amount) per key | Store (sum, count) per key, divide at the end |
| COUNT(DISTINCT x) per key | d[k].add(x) with a set per key |

Worked example: rows = [{"user": "a", "amt": 10}, {"user": "a", "amt": 20}, {"user": "b", "amt": 5}] — one pass gives sums["a"] = 30, sums["b"] = 5.

Worked example — distinct per key: rows = [{"user": "a", "sku": "X"}, {"user": "a", "sku": "Y"}, {"user": "a", "sku": "X"}] — after uniq["a"].add(sku) with defaultdict(set), len(uniq["a"]) == 2 (matches COUNT(DISTINCT sku) GROUP BY user).

Complexity

"For each distinct key, loop the entire list" is O(n²). One pass with a dict is typically O(n) time and O(distinct keys) space—what interviewers expect.
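The SQL-to-Python mappings in the table above can be combined in one pass over the rows. This is a minimal sketch using the hypothetical rows from the worked example: sums and counts accumulate per key, and the average is derived at the end.

```python
from collections import defaultdict

rows = [{"user": "a", "amt": 10}, {"user": "a", "amt": 20}, {"user": "b", "amt": 5}]

sums = defaultdict(int)    # SUM(amt) per user
counts = defaultdict(int)  # COUNT(*) per user
for r in rows:             # single O(n) pass over the data
    sums[r["user"]] += r["amt"]
    counts[r["user"]] += 1

# AVG = sum / count, computed once per key at the end.
avgs = {u: sums[u] / counts[u] for u in sums}
print(dict(sums), avgs)  # {'a': 30, 'b': 5} {'a': 15.0, 'b': 5.0}
```

Space stays O(distinct users) regardless of how many rows stream through, which is the property interviewers look for.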
Worked example: with about 10,000 rows and 1,000 distinct keys, rescanning all rows per key is on the order of ten million operations; one dict pass stays on the order of ten thousand.

heapq for top-k (when k is small)

heapq.nlargest(k, iterable) / nsmallest return the k best items without sorting the whole list (O(n log k)). For "top k keys by count," you can also push (-count, key) into a min-heap of size k while streaming counts—useful when memory must stay O(k).

Worked example:

```python
import heapq

counts = {"a": 5, "b": 2, "c": 9, "d": 1}
top2 = heapq.nlargest(2, counts.items(), key=lambda x: x[1])
# [('c', 9), ('a', 5)]
```

Common beginner mistakes
- Re-scanning the whole list inside a loop over unique keys.
- Forgetting tie-breaking (e.g. lexicographically smallest among max frequency).
- Not handling missing keys—use .get(key, 0) or Counter.

Sample problem: given a list of words, return the word with the highest count (break ties by lexicographically smallest word).

```python
from collections import Counter

def top_word(words: list[str]) -> str:
    cnt = Counter(words)
    max_freq = max(cnt.values())
    candidates = [w for w, c in cnt.items() if c == max_freq]
    return min(candidates)

# Example: top_word(["apple", "banana", "apple"]) -> "apple"
```

Why this works: Counter gets frequencies; we keep all words at the max count; min picks the tie-breaker.

Practice: PYTHON · Topic — Hash tables & counting · COMPANY · Meta — hash table (Meta-tagged hash tables)

10. Streaming Data and Interval Processing in Python

Streaming Data Processing in Python

Streaming means: events arrive one after another (or in time order). You keep a small piece of state—last timestamp seen, counts in the current window, "open" orders in a dict—and update it when the next event arrives. You should not rescan the entire history for each new line if you want an efficient solution.

Sorting events

Most simulations assume events.sort() by timestamp (with stable tie-breaking: e.g. process end before start at the same time if "touching" is not overlap—it depends on the problem statement).

Worked example: events = [(10, "start"), (5, "end"), (7, "start")] — sort by time → (5, end), (7, start), (10, start), so the processing order matches the real timeline.

Half-open intervals [start, end)

Start included, end excluded—standard for "busy from second start up to but not including end." Overlap test for [a1, a2) and [b1, b2): a1 < b2 AND b1 < a2. (Closed intervals use different inequalities—match the prompt.)

Worked example: [1, 5) and [5, 8) — they touch at 5 but do not overlap (half-open).

Merging intervals

Sort by start, then sweep: if the next interval overlaps the current merged block, extend the block's end with max(end1, end2); else start a new block. Used for "total covered time" after merging.

Worked example: [1, 4) and [3, 6) merge to [1, 6) (covered length 5).

Sweep line for peak concurrency

Expand each interval to (start, +1) and (end, -1); sort all points; walk while tracking a running balance of active intervals; the max of the balance is the peak concurrency (see the sample below).

Worked example: intervals [0, 3) and [2, 5) — at time 2 both are active → peak concurrency 2 (see the full solution below).

Per-entity state machines

state[order_id] = (stage, last_ts) (or similar): on each log line, update the entity's state and maybe accumulate the duration now - last_ts for the previous stage—the pattern for marketplace order timelines.

Worked example: log lines order_A placed at t=0, order_A shipped at t=10. After the second line, time-in-placed = 10 − 0; update state["order_A"] = ("shipped", 10).

Sliding windows over time

Keep a deque or index of events within the last N seconds; drop expired items as time advances—the pattern for "active users in last 5 minutes" style prompts.

Worked example: times 100, 250, 400 (seconds), window 300s.
When processing time 400, drop timestamps older than 400 − 300 = 100; the window then holds 250 and 400.

Sample problem: given a list of half-open intervals [start, end), return the peak number of intervals active at the same time.

```python
def max_concurrent(intervals) -> int:
    events = []
    for s, e in intervals:
        events.append((s, 1))   # start: +1 concurrent
        events.append((e, -1))  # end: -1 concurrent
    events.sort()               # at equal times, -1 sorts before +1: ends close first
    cur = best = 0
    for _, delta in events:
        cur += delta
        best = max(best, cur)
    return best
```

Why this works: at any time, cur is how many intervals are active; we record the peak.

Practice: PYTHON · Topic — Streaming & interval-style problems · COMPANY · Meta — streaming (Meta-tagged streaming)

Tips to Crack Meta Data Engineering Interviews

These tips will help you confidently crack Meta data engineering interviews by focusing on the technical and problem-solving skills interviewers actually score—not textbook definitions. Solid data engineering interview preparation blends SQL preparation for interviews with typed Python practice; cracking a data engineering interview at this level is mostly repetition with feedback, not passive reading. The Meta data engineer interview tips below are practical: habits, patterns, and where to drill on PipeCode. They do not re-explain sections 1–10—they tell you what to do with that material.

Quick checklist (prep habits):
- Drill SQL daily: joins, GROUP BY, window functions, dates, nulls, and set-style logic.
- Rehearse data pipeline and ETL thinking: sources, transforms, delivery, and what breaks at scale.
- Be ready to discuss data-oriented system design: schemas, data flow, and reliability—not just a single query.
- Strengthen Python for data processing: dictionaries, counting, streaming, and interval-style problems.
- Work real problems on the Meta company practice hub with tests, not theory alone.

Strong SQL is non-negotiable. SQL preparation for interviews should emphasize correctness first, then clarity: name the grain (what one row means), say WHERE vs HAVING, and sanity-check join fan-out before you optimize.
Timed reps beat rereading notes—use Meta · aggregation, Meta · filtering, Meta · joins, Meta · window functions, Meta · subqueries, and Meta · CTE for company-scoped drills; add Topic · null handling when you practice NULL-safe reporting.

Many data engineering interview loops expect you to reason about pipelines: ingestion, transformation, and serving—often with ETL-style trade-offs (batch vs incremental, idempotency, late data). You do not need a slide deck—you need vocabulary: what is upstream, what is idempotent, what schema does the consumer need? On PipeCode, warm up with Meta · ETL, Meta · dimensional modeling, Meta · event modeling, and the broader topic · ETL hub.

System design for data engineering is usually data-centric: sketch components (ingest, store, process, serve), data flow, failure modes, and scale (partitioning, backfill, duplicates). If you only prepare SQL in isolation, practice one whiteboard-style walkthrough per week: inputs, outputs, and where quality is enforced. Deep dives and Explore courses can complement problem reps when you want structured depth.

Python screens favor clear code over clever tricks. Prioritize Meta · hash table and topic · hash table for frequency and dict patterns; use Meta · streaming with topic · sliding window and topic · intervals for window and sweep-line style tasks. State time and space complexity when the interviewer signals they care.

To crack data engineering interview problems faster, recognize the pattern before you code—aggregation, filtering, joins, windows, CTEs, dates, sets, hash maps, streaming. When a prompt feels new, map it to one of those shapes (see sections 1–10), then pick the smallest example and trace it by hand.

Primary loop: Meta company practice hub. Browse all practice topics and company hubs; full library: Explore practice. Commitment: Subscribe when you want full access.
| Skill lane | Where to practice on PipeCode |
| --- | --- |
| Aggregations & GROUP BY / HAVING | Meta · aggregation, Meta · grouping, Meta · having clause |
| Filtering | Meta · filtering |
| Joins & dedupe | Meta · joins, Meta · join |
| Windows & ranking | Meta · window functions, Meta · ranking |
| Subqueries & CTEs | Meta · subqueries, Meta · CTE |
| Dates & time-series | Meta · date functions, Meta · time-series |
| Set-style logic | Meta · set |

Interview habit: say WHERE vs HAVING and the join grain out loud before you type. State assumptions (grain, nulls, ties), sketch a tiny test case, then code—interviewers reward data engineering judgment, not just syntax.

Loops often stress SQL (aggregations, joins, windows, dates, nulls, sets) and Python (hash maps, counting, streaming-style intervals). Exact questions vary by team and level; this guide teaches those topic types with original examples aligned to PipeCode's Meta skill tags.

Read the first teaching subheading under each topic (the SQL or Python explainer block), type the sample solutions, then practice on PipeCode's Meta hub. Usually SQL first, then Python; 450+ problems are available for reps.

Yes—many candidates see SQL and Python exercises. This guide matches that mix with worked examples in sections 1–10, not a copy of any single live question.

Expect shapes like aggregation and GROUP BY, joins, window functions, dates, NULL-safe rates, and set-style logic. See sections 1–8; examples are PostgreSQL-style unless noted.

Common patterns include summaries by grain, join fan-out, ranking and windows, time buckets, safe percentages, and Python frequency or interval sweeps. The topic table near the top lists them.

Strong SQL (aggregations, joins, windows, dates), solid Python for dicts, counts, and streaming-style logic, plus clear communication under time pressure. PipeCode offers Meta-tagged practice across these topics.

PipeCode pairs company-tagged Meta problems with tests and feedback so you move from reading solutions to typing your own.
Pipecode.ai is Leetcode for Data Engineering.
