Classification of Layout vs Data Tables published!

Our work on classifying tables as layout and relational tables on the web with a machine learning approach is published at Tweb!

Waqar Haider and Yeliz Yesilada. Classification of Layout vs Relational Tables on the Web: Machine Learning with Rendered Pages. ACM Transactions on the Web, 2022, Volume 17, Issue 1, Article No.: 1, pp 1–23. https://doi.org/10.1145/3555349

Abstract: Table mining on the web is an open problem, and none of the previously proposed techniques provides a complete solution. Most research focuses on the structure of the HTML document, but because of the nature and structure of the web, it is still a challenging problem to detect relational tables. Web Content Accessibility Guidelines (WCAG) also cover a wide range of recommendations for making tables accessible, but our previous work shows that these recommendations are also not followed; therefore, tables are still inaccessible to disabled people and automated processing. We propose a new approach to table mining by not looking at the HTML structure, but rather, the rendered pages by the browser. The first task in table mining on the web is to classify relational vs. layout tables, and here, we propose two alternative approaches for that task. We first introduce our dataset, which includes 725 web pages with 9,957 extracted tables. Our first approach extracts features from a page after being rendered by the browser, then applies several machine learning algorithms in classifying the layout vs. relational tables. The best result is with Random Forest with the accuracy of 97.2% (F1-score: 0.955) with 10-fold cross-validation. Our second approach classifies tables using images taken from the same sources using Convolutional Neural Network (CNN), which gives an accuracy of 95% (F1-score: 0.95). Our work here shows that the web’s true essence comes after it goes through a browser and using the rendered pages and tables, the classification is more accurate compared to literature and paves the way in making the tables more accessible.