Some of my other projects

December 30, 2023

I have also been working on some other small projects recently. Here is a brief overview of them.

Function Calling with OpenAI and open-source LLMs

Dec 2023

This project was actually the result of an assignment. The goal was to scrape data from a website and convert it into structured data by passing it as input to an LLM API call, without using an external framework like instructor, LangChain, LlamaIndex, or the others.

I took heavy inspiration from instructor's implementation of function calling, especially its use of Pydantic schemas and its wrapper around the chat completion response from the openai package. However, having no prior experience with these frameworks, I soon ran into an issue: apart from OpenAI's GPT-series models, none of the open-source models on the endpoints I was using actually supported function calling.
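To make the pattern concrete, here is a minimal sketch of Pydantic-schema-driven function calling with the openai v1 client. The Article model, the model name, and the scraped page_text are hypothetical stand-ins, not the assignment's actual schema:

```python
# Minimal sketch: turn a Pydantic model into an OpenAI "tool" and validate
# the model's reply back into that Pydantic model. Article is a placeholder.
from openai import OpenAI
from pydantic import BaseModel

class Article(BaseModel):
    title: str
    author: str
    summary: str

client = OpenAI()   # reads OPENAI_API_KEY from the environment
page_text = "..."   # text scraped from the website goes here

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"Extract the article details:\n{page_text}"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "Article",
            "description": "Structured representation of the scraped article",
            "parameters": Article.model_json_schema(),  # Pydantic model -> JSON schema
        },
    }],
    # Force the model to call our one tool instead of answering in free text.
    tool_choice={"type": "function", "function": {"name": "Article"}},
)

# The tool call carries the arguments as a JSON string; validate it.
args = response.choices[0].message.tool_calls[0].function.arguments
article = Article.model_validate_json(args)
print(article)
```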

Hence, improvisation: I resorted to providing the structured output definitions in the prompts themselves and trying out the results. I observed that while the Llama 2 and Mistral chat models were not returning output that could be parsed into a Pydantic object without any hitches, they were actually providing the correct content, just wrapped in some additional text (like "here is your solution …").
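Roughly, the workaround looked like the sketch below: embed the JSON schema in the prompt, then salvage the JSON object from a reply that may include extra chatter. The Article model and the parse_noisy_reply helper are my illustrative stand-ins, not the project's exact code:

```python
# Sketch of the prompt-based workaround: ask for JSON matching the schema,
# then strip the chatty preamble before parsing into the Pydantic model.
import json
import re

from pydantic import BaseModel

class Article(BaseModel):  # same hypothetical model as above
    title: str
    author: str
    summary: str

page_text = "..."  # scraped text goes here
prompt = (
    "Return ONLY a JSON object matching this schema, with no other text:\n"
    f"{json.dumps(Article.model_json_schema(), indent=2)}\n\n"
    f"Input:\n{page_text}"
)

def parse_noisy_reply(reply: str) -> Article:
    # Chat models often wrap the JSON in text like "Sure, here is your
    # solution: { ... }", so grab the outermost pair of braces.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    return Article.model_validate_json(match.group(0))

# Example of the kind of reply this salvages:
reply = 'Sure, here is your solution: {"title": "t", "author": "a", "summary": "s"}'
print(parse_noisy_reply(reply))
```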

That said, I did find one model that produced exact output, easily parsed from prompts alone. All in all, a really fun short experiment.

Building a 4-gram language model and word vectors

Aug 2023 - Nov 2023

This was a small academic project in which I delved into the creation of a language model (LM) and word vectors using a downsized version of the English-Wikipedia corpus. The primary goals were to construct an n-gram language model and to generate word vectors from this corpus. The project had three primary subparts.

  • The first part was cleaning and tokenization of the corpus. The tokenization was done with the SentencePiece tokenizer (a subword tokenizer). Empirical laws such as Zipf's law were also verified in the process (see the first sketch after this list).
  • Next, I built a simple 4-gram probabilistic language model with next-word prediction and sentence-generation capabilities (second sketch below). I also attempted next-word prediction conditioned on PoS tags.
  • The final part focused on training word vectors based on the Correlated Occurrence Analogue to Lexical Semantics, or COALS (third sketch below).
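Here is a minimal sketch of the tokenization-and-Zipf step, assuming the sentencepiece package; the file names and vocabulary size are placeholders, not the project's actual settings:

```python
# Sketch: train a SentencePiece tokenizer on the corpus, then eyeball Zipf's
# law (the r-th most frequent token has frequency roughly proportional to 1/r).
from collections import Counter

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="wiki_corpus.txt",   # cleaned corpus, one sentence per line
    model_prefix="wiki_sp",
    vocab_size=8000,
    model_type="unigram",      # subword tokenization
)
sp = spm.SentencePieceProcessor(model_file="wiki_sp.model")

counts = Counter()
with open("wiki_corpus.txt") as f:
    for line in f:
        counts.update(sp.encode(line, out_type=str))

# Under Zipf's law, rank * frequency stays roughly constant across ranks.
for rank, (token, freq) in enumerate(counts.most_common(20), start=1):
    print(rank, token, freq, rank * freq)
```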
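Next, a sketch of the 4-gram model itself: gather counts over three-token contexts, then predict and generate by sampling proportional to those counts. This is a plain count-based version without smoothing, which the actual project may have handled differently:

```python
# Sketch of a 4-gram LM: next-token prediction conditioned on the previous
# three tokens, with counts gathered from a tokenized corpus.
import random
from collections import Counter, defaultdict

def train_4gram(sentences):
    counts = defaultdict(Counter)  # (w1, w2, w3) -> Counter of next words
    for sent in sentences:
        tokens = ["<s>", "<s>", "<s>"] + sent + ["</s>"]
        for i in range(len(tokens) - 3):
            counts[tuple(tokens[i:i + 3])][tokens[i + 3]] += 1
    return counts

def next_word(counts, context):
    # Sample the next word proportional to its 4-gram count in this context.
    options = counts[tuple(context[-3:])]
    return random.choices(list(options), weights=list(options.values()))[0]

def generate(counts, max_len=20):
    tokens = ["<s>", "<s>", "<s>"]
    while len(tokens) < max_len + 3:
        word = next_word(counts, tokens)
        if word == "</s>":
            break
        tokens.append(word)
    return " ".join(tokens[3:])

# Toy usage; the real model was trained on the tokenized Wikipedia corpus.
counts = train_4gram([["the", "cat", "sat", "on", "the", "mat"]])
print(generate(counts))
```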
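Finally, a rough sketch of the COALS idea as I understand it: ramped co-occurrence counts, a correlation transform, negatives clamped to zero, and square roots of the positives. The full method also restricts the columns to the most frequent words and can apply SVD for dimensionality reduction; this toy version skips both, and the corpus and window size are illustrative:

```python
# Sketch of COALS-style word vectors from ramped co-occurrence counts.
import numpy as np

def coals_vectors(sentences, window=4):
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for i, w in enumerate(sent):
            for d in range(1, window + 1):    # ramped weighting: nearer
                for j in (i - d, i + d):      # neighbours count more
                    if 0 <= j < len(sent):
                        M[idx[w], idx[sent[j]]] += window + 1 - d
    T = M.sum()
    r = M.sum(axis=1, keepdims=True)          # row totals
    c = M.sum(axis=0, keepdims=True)          # column totals
    # Pairwise correlation of each (word, context) cell against expectation.
    corr = (T * M - r * c) / np.sqrt(r * c * (T - r) * (T - c))
    # Clamp negative correlations to zero, square-root the positives.
    return vocab, np.sqrt(np.maximum(corr, 0))

sents = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
vocab, V = coals_vectors(sents)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

i, j = vocab.index("cat"), vocab.index("dog")
print(cosine(V[i], V[j]))  # cat and dog share contexts, so this comes out positive
```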