When companies in Singapore face a tough challenge and want to explore using artificial intelligence (AI) to achieve their organisation’s goals, they turn to AI Singapore (AISG). Launched to build the AI capabilities that will power Singapore’s future digital economy, the national programme brings together all Singapore-based research institutions and the vibrant ecosystem of AI start-ups and companies developing AI products. Together they perform use-inspired research, grow the knowledge, create the tools, and develop the talent to power Singapore’s AI efforts.
Siavash Sakhavi is Assistant Head of the flagship 100 Experiments (100E) programme, which helps organisations solve their AI problem statements and assists them with building their own AI teams. An organisation may propose a 100E problem statement when no commercial off-the-shelf (COTS) AI solution exists and the problem can potentially be solved by Singapore’s ecosystem of researchers and AI Singapore’s engineering team within 9 to 18 months.
Sakhavi’s team recently faced a challenge on a corporate sustainability reporting project with a large, multinational financial services client. The client had difficulty extracting text from disparate reports and brochures that came from various sources as PDF documents. The project team, consisting of several AI, data, and platform engineers, plus AISG apprentices, wanted to feed the extracted information into a natural language processing (NLP) classification pipeline. However, the team noticed the pipeline was not performing as expected because the PDF extractor tools they were using returned large volumes of unstructured, gibberish text.
For typical projects, the team is able to develop models within one to two months. In this case, the team was already on sprint six of ten, so pressure was building to produce results. Fortunately, they heard about Adobe PDF Extract API, which at the time was moving from beta to general availability. A new web service from Adobe, PDF Extract API parses data and context from native and scanned PDF files, extracting text, table, and image elements into a structured JSON file.
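To illustrate why structured output matters for a downstream NLP classifier, here is a minimal sketch of pulling only heading and paragraph text out of a structured JSON document. It is not the project team's code. The sample data, the field names (`elements`, `Path`, `Text`, `filePaths`) and the helper `collect_text` are assumptions made for illustration, so check them against the actual service output before relying on them.

```python
import json

# Hypothetical sample mimicking a structured JSON extraction result.
# The field names ("elements", "Path", "Text", "filePaths") are assumptions.
SAMPLE = json.loads("""
{
  "elements": [
    {"Path": "//Document/H1", "Text": "Sustainability Report"},
    {"Path": "//Document/P", "Text": "We reduced emissions across our operations."},
    {"Path": "//Document/Table", "filePaths": ["tables/fileoutpart0.csv"]},
    {"Path": "//Document/Figure", "filePaths": ["figures/fileoutpart1.png"]},
    {"Path": "//Document/P[2]", "Text": "Our targets are reviewed annually."}
  ]
}
""")


def collect_text(document, wanted_tags=("H1", "H2", "H3", "P")):
    """Return the text of elements whose last path segment is a wanted tag."""
    texts = []
    for element in document.get("elements", []):
        text = element.get("Text")
        if not text:
            continue  # tables and figures carry no inline text in this sketch
        last_segment = element.get("Path", "").rsplit("/", 1)[-1]
        tag = last_segment.split("[", 1)[0]  # drop an index suffix such as "[2]"
        if tag in wanted_tags:
            texts.append(text.strip())
    return texts


# Clean, ordered text that could be passed on to a classification pipeline.
paragraphs = collect_text(SAMPLE)
```

Because each element carries its structural role, a pipeline can select headings and paragraphs and skip tables and figures, rather than receiving one undifferentiated stream of text.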