I'd like to share my experence of How to Extract and integrate thousands of unstructured messy Excel tables
1 Description of the problem:
In the audit project, it is necessary to merge, extract, and integrate tens of thousands of unstructured Excel quotation files with irregular headers and contents.
The extraction target is to contain nearly 100 fields in a specified range such as product name, specification model, price, etc. The extracted fields are required to be as complete as possible.
2 Difficulties:
Irregular layout: Each Excel file may contain dozens of Sheets (up to 4-50 sheets), and each Sheet may contain unfixed tables with unfixed positions and unfixed headers (the same header field may have multiple For example, the header of the product name may include: product name, product name, ProductName, name, commodity name, MeteralName, etc.).
Inconsistent headers: merged cell headers, multi-column, multi-row headers, inconsistent order of headers, inconsistent description of headers
Irregular content: There are multiple ways of writing and describing the same company name, project name, product name, product specification or attribute
3 solutions:
Arranging this data based on manual copy-paste is almost impossible to complete the task.
Finally, based on the tablemerge.com unstructured document structured extraction model, use the SQL language to run SQL extraction on tens of thousands of sheets:
Automatically identify complex table headers, rows and columns: through the customized SQL parsing engine and regular expression parsing engine, regular expressions are embedded in SQL to achieve extraction
Automatically identify merged cells, multi-column, multi-row headers
Automatically analyze document layout and layout, automatically divide documents into blocks and extract tables
One SQL runs on tens of thousands of non-standard unstructured Excel documents
Export as structured data
4 development process:
It only takes three steps to extract tens of thousands of non-standard unstructured Excel documents into structured data. When developing, you only need to write a SQL and hand it over to the customized unstructured SQL processing engine.
Developers only need to be familiar with SQL and regular expressions to preview, develop, track and debug, and export results on tens of thousands of unstructured documents.
Automatic layout analysis, table area segmentation and recognition, merged cell recognition, multi-row multi-column header recognition, structured row and column recognition.
Export structured data to database or Excel.
In the end, it only took a few days to extract and integrate clean and consistent data tables with millions of records for auditing business.
Bookmarks