File size: 6,963 Bytes
3b857bf
 
0f67940
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3b857bf
 
0f67940
3b857bf
0f67940
3b857bf
0f67940
3b857bf
0f67940
3b857bf
 
0f67940
 
 
3b857bf
0f67940
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3b857bf
0f67940
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
46e8b0e
0f67940
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
46e8b0e
0f67940
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
---
library_name: transformers
tags:
- text-to-SQL
- SQL
- code-generation
- NLQ-to-SQL
- text2SQL
inference:
  parameters:
    max_length: 200
widget:
- text: |-
    CREATE TABLE Loans 
    { 
        loan_id number, 
        client_id number, 
        budget real, 
        duration number, 
        interest real, 
        status varchar 
    } 
     CREATE TABLE Clients 
    { 
        client_id number, 
        first_name varchar, 
        last_name varchar, 
        email varchar, 
        city varchar, 
        year_of_birth number 
    } 
     CREATE TABLE Accounts 
    { 
        account_id number, 
        client_id number, 
        balance real, 
        type varchar 
    } 
     CREATE TABLE Deposits 
    { 
        deposit_id number, 
        account_id number, 
        source varchar, 
        amount real 
    } 
     -- Using valid SQLite, answer the following question for the tables provided above. 
     -- What is the duration and budget of the loan id 16342? 
     SELECT 
  example_title: Loan duration
- text: |-
    CREATE TABLE Transactions 
    { 
        transaction_id number, 
        timestamp_id number, 
        primary_contract_id number, 
        client_id number, 
        beneficiary_id number, 
        transaction_amount real, 
        is_fraudulent boolean, 
        product_family_code varchar, 
        amount_currency varchar 
    } 
     CREATE TABLE Beneficiary 
    { 
        beneficiary_id number, 
        bank_branch_id number, 
        country_name varchar, 
        country_code varchar 
    } 
     CREATE TABLE Source 
    { 
        primary_contract_id number, 
        client_id number, 
        counterparty_bank_branch_id number, 
        counterparty_donor_id number 
    } 
     CREATE TABLE Time 
    { 
        timestamp_id number, 
        week_number number, 
        day_number number, 
        hour_number number, 
        day_name varchar, 
        year number, 
        month_number number 
    } 
     -- Using valid SQLite, answer the following question for the tables provided above. 
     -- How many transactions for the client id 15482? 
     SELECT 
  example_title: Client Transactions
datasets:
- salmane11/BanQies
language:
- en
base_model:
- bigcode/starcoderbase-1b
---

# BanQL-1B

## Model Description

BanQL is a family of Code LLMs dedicated solely for the text-to-SQL task in the Financial domain. 

The checkpoint included in this repository is based on [bigcode/starcoderbase](https://huggingface.co/bigcode/starcoderbase) and further finetuned on [BanQies](https://huggingface.co/datasets/salmane11/BanQies), a dataset generated using [SelectCraft](https://github.com/ezzini/SelectCraft) compose of NLQ-SQL pairs in the financial domain.


## Finetuning Procedure

BanQL was fine-tuned using PEFT (Parameter-Efficient Fine-Tuning) techniques, specifically LoRA (Low-Rank Adaptation) adapters. 

## Intended Use and Limitations

The model was designed as a use case to prove the efficiency of SelectCraft in generating large-scale good quality domain-specific text-to-SQL datasets. The model is mainly finetuned on the database schemas displayed above. The prompt format is defined below.

## How to Use

Example 1: Loans_DB

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

device="cuda"
tokenizer = AutoTokenizer.from_pretrained("salmane11/BanQL-1b")
model = AutoModelForCausalLM.from_pretrained("salmane11/BanQL-1b").to(device)

input_text = """
      CREATE TABLE Loans {
            loan_id number, 
            client_id number, 
            budget real, 
            duration number, 
            interest real, 
            status varchar
        }

        CREATE TABLE Clients {
            client_id number, 
            first_name varchar, 
            last_name varchar, 
            email varchar, 
            city varchar, 
            year_of_birth number
        }

        CREATE TABLE Accounts {
            account_id number, 
            client_id number, 
            balance real, 
            type varchar
        }

        CREATE TABLE Deposits{
            deposit_id number, 
            account_id number, 
            source varchar, 
            amount real
        }

        -- Using valid SQLite, answer the following question for the tables provided above.

        -- What is the duration and budget of the loan id 16342?

        SELECT"""

encoding = tokenizer.encode_plus(input_text, return_tensors="pt").to(device)
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)
    

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=512,
    do_sample=True,
    top_k=120,
    top_p=0.95,
    early_stopping=True,
)
line = tokenizer.decode(outputs[0], skip_special_tokens=True,clean_up_tokenization_spaces=True)
query_begining = line.find("SELECT")
print(line[query_begining:])
```

Example 2: Transactions_DB

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

device="cuda"
tokenizer = AutoTokenizer.from_pretrained("salmane11/BanQL-1b")
model = AutoModelForCausalLM.from_pretrained("salmane11/BanQL-1b").to(device)

input_text = """
        CREATE TABLE Transactions {
            transaction_id number, 
            timestamp_id number, 
            primary_contract_id number, 
            client_id number, 
            beneficiary_id number, 
            transaction_amount real, 
            is_fraudulent boolean, 
            product_family_code varchar, 
            amount_currency varchar
        }

        CREATE TABLE Beneficiary {
            beneficiary_id number, 
            bank_branch_id number, 
            country_name varchar, 
            country_code varchar, 
        }

        CREATE TABLE Source {
            primary_contract_id number, 
            client_id number, 
            counterparty_bank_branch_id number, 
            counterparty_donor_id number, 
        }

        CREATE TABLE Time{
            timestamp_id number, 
            week_number number, 
            day_number number, 
            hour_number number, 
            day_name varchar, 
            year number,
            month_number number
        }

        -- Using valid SQLite, answer the following question for the tables provided above.

        -- How many transactions for the client id 15482?

        SELECT"""


encoding = tokenizer.encode_plus(input_text, return_tensors="pt").to(device)
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)
    

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=512,
    do_sample=True,
    top_k=120,
    top_p=0.95,
    early_stopping=True,
)
line = tokenizer.decode(outputs[0], skip_special_tokens=True,clean_up_tokenization_spaces=True)
query_begining = line.find("SELECT")
print(line[query_begining:])
```



## Cite our work

Citation