Guide To TAPAS (TAble PArSing) – A technique to retrieve information from Tabular Data using NLP


This post was originally published by Nikita Shiledarbaxi at Analytics India Magazine

One of the most common forms of data that exists today is tabular (structured) data. To extract information from tabular data, you would typically use Python libraries like Pandas or SQL-like query languages. Google has recently open-sourced one of their models called ‘TAPAS’ (for TAble PArSing), which lets you ask questions about your tabular data in natural language.

TAPAS is essentially a BERT-based approach to question answering over tables. However, instead of the conventional NLP approach of handling natural language questions as a semantic parsing task based on logical forms (precisely specified semantic versions of the syntactic text), it is a weak-supervision technique relying on denotations (i.e. the literal or primary meaning of the words, not the underlying idea or emotion). It predicts the denotation by selecting table cells, optionally applies a corresponding aggregation operator to that selection, and makes its predictions end to end.
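To make this concrete, here is a toy illustration of what “cell selection plus aggregation” means (this is not the TAPAS API; the table and numbers are invented for demonstration):

 # Toy illustration of denotation prediction (not the TAPAS API;
 # the table and numbers are invented for demonstration).
 table = {
     "Player": ["Player A", "Player B"],
     "Runs": [100, 200],
 }

 # For "what is the total number of runs?", a TAPAS-style model selects
 # the cells of the "Runs" column and predicts the SUM operator.
 selected_cells = table["Runs"]      # predicted cell selection
 aggregation = sum                   # predicted aggregation operator
 print(aggregation(selected_cells))  # denotation: 300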

TAPAS was introduced by researchers at Google Research and Tel-Aviv University, namely Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.

Before going into the details of TAPAS, let us understand some of the relevant terminologies.

What is weak supervision?

Weak supervision is an area of machine learning where noisy, limited, or imprecise sources are used to provide supervision signals for labelling large amounts of training data in a supervised learning task. It is a way of using lower-quality labels more efficiently and/or at a higher level of abstraction, eliminating the need to obtain hand-labelled data sets, which can be expensive or impractical. The cheaper weak labels, though imperfect, can be used to create a strong predictive model.

The weak labels can be of various forms such as,

  • Labels created from existing resources such as knowledge bases or pre-trained models
  • Imprecise or inexact labels
  • Inaccurate labels
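As a concrete (hypothetical) illustration, weak labels are often produced by simple heuristic labelling functions, as in the following minimal sketch:

 # A minimal sketch of weak supervision (hypothetical example): cheap,
 # noisy heuristics produce labels that a supervised model can then be
 # trained on, in place of expensive hand-labelled data.
 def lf_keyword(text):
   # Inexact heuristic: flag any text mentioning "refund" as negative.
   return "negative" if "refund" in text.lower() else None

 def lf_lexicon(text):
   # Label derived from an existing resource: a tiny sentiment lexicon.
   positive_words = {"great", "excellent", "love"}
   return "positive" if positive_words & set(text.lower().split()) else None

 texts = ["I love this phone", "I want a refund", "It arrived on time"]
 for t in texts:
   labels = [lf(t) for lf in (lf_keyword, lf_lexicon) if lf(t) is not None]
   print(t, "->", labels if labels else ["abstain"])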

Visit this page to learn more about the weak-supervision paradigm.

What is the BERT model?

BERT (Bidirectional Encoder Representations from Transformers) is an NLP model introduced by Google Research in 2018. It pre-trains deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all the network layers. The pre-trained BERT model can then be fine-tuned by adding only one additional output layer. It has proved useful for several Natural Language Processing (NLP) and Natural Language Understanding (NLU) tasks, such as question answering and language inference, without requiring significant task-specific modifications to the model architecture.
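As a minimal sketch of that fine-tuning pattern (this assumes the Hugging Face transformers library, which is not otherwise used in this article), a single classification layer is placed on top of the pre-trained encoder:

 # Minimal sketch: fine-tuning BERT amounts to adding one output layer on
 # top of the pre-trained encoder (assumes the Hugging Face `transformers`
 # and `torch` packages; not part of the TAPAS demo below).
 from transformers import BertForSequenceClassification, BertTokenizer

 tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
 model = BertForSequenceClassification.from_pretrained(
     "bert-base-uncased", num_labels=2)  # encoder + a fresh classifier head

 inputs = tokenizer("TAPAS extends BERT to tables.", return_tensors="pt")
 logits = model(**inputs).logits
 print(logits.shape)  # torch.Size([1, 2]): one score per class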

Read the BERT research paper here to understand the detailed workings of the model.

Overview of TAPAS

To answer natural language questions over semi-structured tables using semantic parsing, the question is first translated into a logical form. The logical form can then be executed against the table to retrieve the true denotations.
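For example (using an invented, SQL-like notation; real semantic parsers use more expressive formalisms), executing a logical form against a toy table might look like this:

 # Hypothetical illustration: a logical form is an executable meaning
 # representation. Here the invented form SELECT Runs WHERE Player = X
 # is executed against a toy table to retrieve the denotation.
 toy_table = [{"Player": "Player A", "Runs": 100},
              {"Player": "Player B", "Runs": 200}]

 def execute_select(table, column, where_column, where_value):
   # Executes: SELECT <column> WHERE <where_column> = <where_value>
   return [row[column] for row in table if row[where_column] == where_value]

 # "how many runs has Player B scored?"
 #   -> logical form: SELECT Runs WHERE Player = 'Player B'
 print(execute_select(toy_table, "Runs", "Player", "Player B"))  # [200]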

It is expensive to annotate supervised training data that pairs natural language questions with their logical forms. Moreover, semantic parsing applications use the generated logical form only as an intermediate step for retrieving the answer. Generating these logical forms also gives rise to challenges such as maintaining a logical formalism with enough expressivity, obeying decoding constraints (e.g. well-formedness), and so on.

For example, given a table of world cup results, if we use synonyms of the expected keywords, say “world soccer tournament” instead of “world cup”, a system based on logical forms may fail to recognise the question. All such shortcomings can be resolved using TAPAS.

TAPAS is a weakly supervised question answering model. It reasons over tables without generating logical forms. It predicts a minimal program by selecting a subset of the table cells and a possible aggregation operation to be executed on top of them. Consequently, it can learn operations from natural language, without specifying them in some formalism.

TAPAS is an extension of BERT’s architecture, with additional embeddings that capture tabular structure and two classification layers on top: one for selecting cells and the other for predicting the corresponding aggregation operator. It flattens the tabular data into a sequence of words, splits those words into word pieces called ‘tokens’, and concatenates the question tokens before the table tokens.

It adds a separator token between the question and the table, but not between cells or rows. The token embeddings are combined with table-aware positional embeddings before being fed to the model. The types of positional embeddings used by TAPAS are as follows (a simplified sketch of this flattening appears after the list):

  • Position ID
  • Segment ID
  • Column/Row ID
  • Rank ID
  • Previous Answer
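The sketch below shows, in simplified form, how a question and a table are flattened into a single token sequence with column and row IDs (illustrative only; the real model uses WordPiece tokenization and also assigns rank and previous-answer IDs):

 # Simplified sketch of TAPAS input linearization (illustrative only).
 question = "how many runs has player b scored ?"
 table = [["Player", "Runs"],   # header row
          ["Player A", "100"],
          ["Player B", "200"]]

 tokens = ["[CLS]"] + question.split() + ["[SEP]"]
 column_ids = [0] * len(tokens)  # 0 marks question tokens
 row_ids = [0] * len(tokens)

 # Table tokens follow the question; no separator between cells or rows.
 for row_idx, row in enumerate(table):
   for col_idx, cell in enumerate(row):
     for tok in cell.lower().split():
       tokens.append(tok)
       column_ids.append(col_idx + 1)  # columns are numbered from 1
       row_ids.append(row_idx)         # the header row gets row ID 0

 for tok, col, row in zip(tokens, column_ids, row_ids):
   print(f"{tok:10s} col={col} row={row}")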

Read the research paper to dive deeper into the working of TAPAS.

Practical implementation of TAPAS

Here’s a demonstration of TAPAS applied to a table containing data about some international cricketers, e.g. the team they belong to, career span, runs scored, number of innings played and so on. The code was run on a GPU in Google Colab. The link to the notebook can be found at the end of the code explanation.

Clone the GitHub repository:

! git clone https://github.com/google-research/tapas.git

Installation:

! pip install ./tapas

Restart the runtime to use the newly installed versions.

Download the pre-trained checkpoint from Google Storage. For the sake of speed, a base-sized model trained on SQA is used here. However, the best results in the paper were obtained with a larger model having 24 layers instead of 12.

! gsutil cp gs://tapas_models/2020_04_21/tapas_sqa_base.zip . && unzip tapas_sqa_base.zip

Import the necessary modules

 import tensorflow.compat.v1 as tf
 import os 
 import shutil
 import csv
 import pandas as pd
 import IPython
 tf.get_logger().setLevel('ERROR')
 from tapas.utils import tf_example_utils
 from tapas.protos import interaction_pb2
 from tapas.utils import number_annotation_utils
 from tapas.scripts import prediction_utils 

Set up the checkpoint for prediction by copying the pre-trained model files into the model directory:

 os.makedirs('results/sqa/tf_examples', exist_ok=True)
 os.makedirs('results/sqa/model', exist_ok=True)
 with open('results/sqa/model/checkpoint', 'w') as f:
   f.write('model_checkpoint_path: "model.ckpt-0"')
 for suffix in ['.data-00000-of-00001', '.index', '.meta']:
   shutil.copyfile(f'tapas_sqa_base/model.ckpt{suffix}', 
   f'results/sqa/model/model.ckpt-0{suffix}') 

Load the tabular dataset

df = pd.read_csv("DATASET_PATH")

Before passing the table as input, all the columns in the table are required to have string type values. Perform the datatype conversion.

df = df.astype(str)

View the dataset


df

On executing the above line of code, the tabular data will be displayed as follows:

Input data

Convert the data frame into a list of lists. The first element of the list of lists should be the column names of df.

 list_of_list = [[]]
 list_of_list[0] = list(df.columns)
 list_of_list.extend(df.values.tolist()) 

Prediction code

Note: Since TAPAS is basically a BERT model, training it with sequence lengths greater than 512 will require a TPU. You can use the max_seq_length option to create shorter sequences; this will reduce the model’s accuracy but makes it possible to train the model on GPUs.

 max_seq_length = 512
 vocab_file = "tapas_sqa_base/vocab.txt"
 config = tf_example_utils.ClassifierConversionConfig(
     vocab_file=vocab_file,
     max_seq_length=max_seq_length,
     max_column_id=max_seq_length,
     max_row_id=max_seq_length,
     strip_column_names=False,
     add_aggregation_candidates=False,
 )
 converter = tf_example_utils.ToClassifierTensorflowExample(config)
 def convert_interactions_to_examples(tables_and_queries):
   # Calls the TAPAS converter to turn each interaction into an example
   for idx, (table, queries) in enumerate(tables_and_queries):
     interaction = interaction_pb2.Interaction()
     for position, query in enumerate(queries):
       question = interaction.questions.add()
       question.original_text = query
       question.id = f"{idx}-0_{position}"
     for header in table[0]:
       interaction.table.columns.add().text = header
     for line in table[1:]:
       row = interaction.table.rows.add()
       for cell in line:
         row.cells.add().text = cell
     number_annotation_utils.add_numeric_values(interaction)
     for i in range(len(interaction.questions)):
       try:
         yield converter.convert(interaction, i)
       except ValueError as e:
         print(f"Can't convert interaction: {interaction.id} error: {e}")
 def write_tf_example(filename, examples):
   with tf.io.TFRecordWriter(filename) as writer:
     for example in examples:
       writer.write(example.SerializeToString())
 def predict(table_data, queries):
   table = table_data
   examples = convert_interactions_to_examples([(table, queries)])
   write_tf_example("results/sqa/tf_examples/test.tfrecord", examples)
   write_tf_example("results/sqa/tf_examples/random-split-1-dev.tfrecord", 
   [])
   ! python tapas/tapas/run_task_main.py \
     --task="SQA" \
     --output_dir="results" \
     --noloop_predict \
     --test_batch_size={len(queries)} \
     --tapas_verbosity="ERROR" \
     --compression_type= \
     --init_checkpoint="tapas_sqa_base/model.ckpt" \
     --bert_config_file="tapas_sqa_base/bert_config.json" \
     --mode="predict" 2> error
   results_path = "results/sqa/model/test_sequence.tsv"
   all_coordinates = []
   df = pd.DataFrame(table[1:], columns=table[0])
   display(IPython.display.HTML(df.to_html(index=False)))
   print()
   with open(results_path) as csvfile:
     reader = csv.DictReader(csvfile, delimiter='\t')
     for row in reader:
       coordinates = prediction_utils.parse_coordinates(row["answer_coordinates"])
       all_coordinates.append(coordinates)
       answers = ', '.join([table[row + 1][col] for row, col in coordinates])
       position = int(row['position'])
       print(">", queries[position])
       print(answers)
   return all_coordinates 

Make predictions

 result = predict(list_of_list, ["what was the player's name?",
       "of these, which team did Ricky Ponting play for?",
       "what is his highest score?",
       "how many runs has Saurav Ganguly scored?"])

The arguments to the predict() function include the list of lists to be fed as the input and a list of questions to be answered.

Output:

 > what was the player’s name?
 Sachin Tendulkar, Rahul Dravid, Jacques Kallis, Saurav Ganguly, Inzamam-ul-Haq, Sanath Jaysuriya, Ricky Ponting, Virat Kohli, Mahela Jayawardene, Kumar Sangakkara
 > of these, which team did Ricky Ponting play for?
 Australia
 > what is his highest score?
 164
 > how many runs has Saurav Ganguly scored?
 11363 

Some of the noticeable points in the above output are as follows:

  • TAPAS could understand whom we wanted to refer to when we used words like “these” and “his” while asking the questions.
  • After the first three questions, when we changed the context and asked about a different cricketer in the fourth question, TAPAS could still fetch the relevant information from the table.

Note: The data used for the above implementation may not be up-to-date. It has just been used for demonstration purposes. (Data source)

Code source: GitHub

The Google Colab notebook of the above demo can be found here.


