Stop Using Regex for Invoices: Use AI to Extract Line-Items in Seconds

DEV Community
Peter Njuguna

The Nightmare of Parsing Invoices

If you’ve ever tried to extract structured data from an invoice or receipt, you know exactly how painful it is. You write a perfect regular expression to extract the total amount from one vendor. It works beautifully. Then a new vendor comes along with a slightly different format, your regex quietly stops matching, your pipeline breaks, and you are left cleaning up messy data.

Invoices are inherently unpredictable. They contain:

- Different date formats (DD/MM/YYYY vs MM/DD/YYYY).
- Tabular data represented as raw, unstructured text.
- Varied terminology ("Qty", "Units", "Quantity").
- Chaotic text generated by OCR (Optical Character Recognition) scanners.

Trying to parse this with traditional code is a never-ending game of whack-a-mole. In this guide, we'll look at a much better way: using a specialized AI extraction API to turn messy invoice text into clean, structured JSON in a single request.

Instead of trying to match patterns with text coordinates or regex, modern workflows pass the unstructured text directly to an LLM-backed API. The AI reads the document in context, identifies the merchant, isolates the line items, and returns the result in one consistent JSON schema.

Let’s see how to implement this using Python. To follow along, you will need:

- Python installed on your machine.
- The requests library (pip install requests).
- A free API key from the Invoice and Receipt Extractor API on RapidAPI.

Let's assume you have an OCR scanner or a script that has extracted raw text from a messy PDF invoice. Here is what that unstructured text looks like:

    Coast View Investments.ltd
    N0 PARTICULARS QTTY UNITS UNIT PRICE COST
    1 POLES 150 PIECES 50 7500
    TOTAL. 7500

Now, let's write a Python script to send this data to the API and parse it automatically. Create a file named extract.py and add the following code:

    import json
    import requests

    # 1. Define the API endpoint and your RapidAPI credentials
    url = "https://invoice-and-receipt-extractor.p.rapidapi.com/v1/extract"
    headers = {
        "Content-Type": "application/json",
        "x-rapidapi-key": "YOUR_RAPIDAPI_KEY",  # Replace with your actual RapidAPI key
        "x-rapidapi-host": "invoice-and-receipt-extractor.p.rapidapi.com",
    }

    # 2. Add the raw invoice text you want to parse
    payload = {
        "text_content": "Coast View Investments.ltd\nN0 PARTICULARS QTTY UNITS UNIT PRICE COST\n1 POLES 150 PIECES 50 7500\nTOTAL. 7500"
    }

    print("⏳ Extracting data via AI...")

    # 3. Make the API request (the timeout keeps the script from hanging forever)
    try:
        response = requests.post(url, json=payload, headers=headers, timeout=30)
        response.raise_for_status()

        # 4. Print the clean JSON output
        structured_data = response.json()
        print("\n✅ Success! Clean structured data received:\n")
        print(json.dumps(structured_data, indent=2))
    # RequestException covers HTTP errors as well as connection failures and timeouts
    except requests.exceptions.RequestException as err:
        print(f"❌ API Error: {err}")

When you run the script, the API processes the messy text and returns the data as structured JSON with predictable field names:

    {
      "merchant_name": "Coast View Investments.ltd",
      "date_of_issue": null,
      "invoice_number": null,
      "line_items": [
        {
          "description": "POLES",
          "quantity": 150.0,
          "unit_price": 50.0,
          "total_price": 7500.0
        }
      ],
      "subtotal": 7500.0,
      "tax_amount": 0.0,
      "currency": "USD",
      "grand_total": 7500.0
    }

Fields the invoice does not contain, such as the issue date and invoice number here, come back as null. Now, instead of writing dozens of custom parsing rules, you can map this JSON straight into your accounting software, database, or ERP.

In 2026, building fragile data pipelines around regular expressions doesn't make sense anymore. By using a specialized AI extraction API, you save hours of development time and build a pipeline that is far less likely to break when a merchant updates their document layout.

If you want to try this out yourself:

- Check out the Invoice and Receipt Extractor API on RapidAPI.
- Sign up for the free tier (10 requests/month) to start testing it in your own projects today.
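
Before mapping the extracted JSON into a database or ERP, it may be worth checking that the numbers agree with each other, since any extraction step can misread a digit. The sketch below is one possible check, not part of the API: the helper name find_inconsistencies and the 0.01 tolerance are assumptions made for illustration, the field names come from the sample response above, and it assumes the simple case where subtotal plus tax equals the grand total (no discounts or fees).

```python
import math

# Sample response copied from the example output in this article.
SAMPLE = {
    "merchant_name": "Coast View Investments.ltd",
    "date_of_issue": None,
    "invoice_number": None,
    "line_items": [
        {"description": "POLES", "quantity": 150.0,
         "unit_price": 50.0, "total_price": 7500.0}
    ],
    "subtotal": 7500.0,
    "tax_amount": 0.0,
    "currency": "USD",
    "grand_total": 7500.0,
}


def find_inconsistencies(invoice, tol=0.01):
    """Hypothetical helper: return a list of problems; an empty list means the totals add up."""
    problems = []
    line_items = invoice.get("line_items") or []

    # Each line should satisfy quantity * unit_price == total_price.
    for i, item in enumerate(line_items):
        qty = item.get("quantity")
        price = item.get("unit_price")
        total = item.get("total_price")
        if qty is None or price is None or total is None:
            problems.append(f"line {i}: missing quantity, unit_price, or total_price")
        elif not math.isclose(qty * price, total, abs_tol=tol):
            problems.append(f"line {i}: {qty} x {price} does not equal {total}")

    # The line totals should add up to the subtotal.
    subtotal = invoice.get("subtotal")
    line_sum = sum(item.get("total_price") or 0.0 for item in line_items)
    if subtotal is not None and not math.isclose(line_sum, subtotal, abs_tol=tol):
        problems.append(f"line totals sum to {line_sum}, but subtotal is {subtotal}")

    # Assumption: grand_total == subtotal + tax_amount (no discounts or fees).
    tax = invoice.get("tax_amount") or 0.0
    grand_total = invoice.get("grand_total")
    if subtotal is not None and grand_total is not None:
        if not math.isclose(subtotal + tax, grand_total, abs_tol=tol):
            problems.append(f"subtotal + tax is {subtotal + tax}, but grand_total is {grand_total}")

    return problems


print(find_inconsistencies(SAMPLE))  # prints []
```

In extract.py, the same function could be called on structured_data right after response.json(), and any invoice with a non-empty problem list could be routed to manual review instead of being written to the database.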