Some of my other projects
December 30, 2023
I have also been working on a few other small projects recently. Here is a quick overview of them.
Function Calling with openai and open-source LLMs
Dec 2023
This project was actually the result of an assignment: scrape data from a website and convert it to structured data by passing it as input to the API call of some LLM, without using an external framework like instructor, langchain, llamaindex, or the others.
I took heavy inspiration from instructor's implementation of function calling, especially its handling of the Pydantic schema and its wrapper around the chat-completion response from the openai package. However, having no prior experience with these frameworks, I soon ran into an issue: apart from OpenAI's GPT-series models, none of the open-source models on the endpoints I was using actually supported function calling.
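For context, here is a minimal sketch of what that instructor-style flow looks like when the model does support function calling. I am assuming the v1 openai Python client and Pydantic v2 here; the Article schema, the extract_article name, and the model string are placeholders for illustration, not the assignment's actual details:

```python
from openai import OpenAI
from pydantic import BaseModel

# Hypothetical target schema; the fields are placeholders.
class Article(BaseModel):
    title: str
    author: str
    published: str

scraped_text = "..."  # whatever was scraped from the target website

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"Extract the article metadata:\n\n{scraped_text}"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_article",
            "description": "Extract structured metadata from a web page.",
            "parameters": Article.model_json_schema(),  # Pydantic model as JSON schema
        },
    }],
    tool_choice={"type": "function", "function": {"name": "extract_article"}},
)

# The model returns its arguments as a JSON string; Pydantic validates it.
arguments = response.choices[0].message.tool_calls[0].function.arguments
article = Article.model_validate_json(arguments)
```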
Hence, improvisation: I resorted to providing the structured output definitions in the prompts themselves and trying out the results. Here I observed that while the Llama2 and Mistral-chat models were not returning output structured exactly enough to be parsed into a Pydantic object without hitches, they were actually providing the correct answers, just wrapped in some additional text (like "here is your solution ...").
That said, I did find one model whose output, from prompts alone, was exact enough to parse easily. All in all, a really fun short experiment.
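This is roughly what the prompt-only fallback looked like. The regex-based JSON extraction is a reconstruction of the idea rather than my exact code, and scraped_text is again a placeholder:

```python
import json
import re
from pydantic import BaseModel

# Same hypothetical schema as in the previous sketch.
class Article(BaseModel):
    title: str
    author: str
    published: str

scraped_text = "..."  # placeholder for the scraped page text

# Embed the JSON schema directly in the prompt instead of a `tools` field.
prompt = (
    "Return ONLY a JSON object matching this schema, with no other text:\n"
    f"{json.dumps(Article.model_json_schema())}\n\n"
    f"Text:\n{scraped_text}"
)

def parse_model_output(raw: str) -> Article:
    """Strip chatty prefixes like 'here is your solution ...' and parse the JSON."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # outermost brace-delimited span
    if match is None:
        raise ValueError("no JSON object found in model output")
    return Article.model_validate_json(match.group(0))
```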
Building a 4-gram language model and word vectors
Aug 2023 - Nov 2023
This was a small academic project in which I delved into building a language model (LM) and word vectors using a downsized version of the English-Wikipedia corpus. The primary goals were to construct an n-gram language model and to generate word vectors from this corpus. The project had three main parts.
- The first part was cleaning and tokenizing the corpus. Tokenization was done with the SentencePiece tokenizer (a subword tokenizer). Empirical laws such as Zipf's law were also verified in the process (see the first sketch after this list).
- Next, I built a simple 4-gram probabilistic language model with next-word prediction and sentence-generation capabilities (sketched below). I also attempted next-word prediction conditioned on PoS tags.
- The final part focused on training the word vectors based on the Correlated Occurrence Analogue to Lexical Semantics, or COALS (see the last sketch below).
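To make the first part concrete, here is a rough sketch of training a SentencePiece tokenizer and eyeballing Zipf's law. File names and the vocabulary size are placeholders, not the actual values from the project:

```python
import collections
import sentencepiece as spm

# Train a subword tokenizer on the cleaned corpus (path/vocab size are placeholders).
spm.SentencePieceTrainer.train(
    input="wiki_clean.txt", model_prefix="wiki", vocab_size=16000
)
sp = spm.SentencePieceProcessor(model_file="wiki.model")

# Tokenize the corpus and count token frequencies.
counts = collections.Counter()
with open("wiki_clean.txt") as f:
    for line in f:
        counts.update(sp.encode(line, out_type=str))

# Zipf's law: frequency ~ C / rank, so freq * rank should stay roughly constant.
for rank, (token, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, token, freq, freq * rank)
```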
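The 4-gram model itself is just conditional counts. A toy version of the idea, leaving out the smoothing and backoff a real model would need:

```python
import collections
import random

def train_4gram(tokens):
    """Count 4-grams: P(w4 | w1 w2 w3) is proportional to count(w1 w2 w3 w4)."""
    model = collections.defaultdict(collections.Counter)
    for i in range(len(tokens) - 3):
        model[tuple(tokens[i:i + 3])][tokens[i + 3]] += 1
    return model

def predict_next(model, context):
    """Most likely next token given a 3-token context (no smoothing or backoff)."""
    counts = model.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

def generate(model, context, n=10):
    """Sample a continuation, choosing tokens proportionally to 4-gram counts."""
    out = list(context)
    for _ in range(n):
        counts = model.get(tuple(out[-3:]))
        if not counts:
            break
        tokens, weights = zip(*counts.items())
        out.append(random.choices(tokens, weights=weights)[0])
    return out
```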
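And finally, the core COALS transform, as I understand it from Rohde et al.'s paper: build a ramped co-occurrence matrix, convert the counts to correlations, discard negative values, and take square roots. This sketch leaves out the usual SVD step for dimensionality reduction:

```python
import numpy as np

def coals_vectors(tokens, window=4):
    """Build COALS-style word vectors from a token stream."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))

    # Ramped co-occurrence window: closer neighbours get higher weight.
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                weight = window + 1 - d
                counts[index[w], index[tokens[i + d]]] += weight
                counts[index[tokens[i + d]], index[w]] += weight

    # Convert raw counts to (Pearson-style) correlations.
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    denom = np.sqrt(row * (total - row) * col * (total - col)) + 1e-12
    corr = (total * counts - row * col) / denom

    # COALS: clamp negative correlations to zero, then take square roots.
    return vocab, np.sqrt(np.maximum(corr, 0))
```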