<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Fasih Khatib</title>
  
  <subtitle>☕</subtitle>
  <link href="/atom.xml" rel="self"/>
  
  <link href="http://fasihkhatib.com/"/>
  <updated>2025-10-20T06:19:37.631Z</updated>
  <id>http://fasihkhatib.com/</id>
  
  <author>
    <name>Fasih Khatib</name>
    
  </author>
  
  <generator uri="http://hexo.io/">Hexo</generator>
  
  <entry>
    <title>Writing a PRD - Curated Menus at Restaurants</title>
    <link href="http://fasihkhatib.com/2025/10/17/Writing-a-PRD-Restaurant-Reservations/"/>
    <id>http://fasihkhatib.com/2025/10/17/Writing-a-PRD-Restaurant-Reservations/</id>
    <published>2025-10-17T13:47:27.000Z</published>
    <updated>2025-10-20T06:19:37.631Z</updated>
    
    <content type="html"><![CDATA[<p>One of my favorite food delivery apps recently launched a feature where they spotlight restaurants. Upon clicking the icon which shows you more about the restaurant, you’re presented with a list of all the outlets they have in the city. As someone who eats out frequently and enjoys a variety of cuisines, this got me thinking. What if we could elevate this experience? While previously I’d written about a feature I’d love to see from a programmer’s perspective, I’ll write this post from a product manager’s perspective. In essence, I’ll attempt to write a Product Requirements Document (PRD). I’ll follow the <a href="https://www.notion.com/blog/how-to-write-a-prd">“How to write a PRD”</a> guide by Notion. Here it goes.</p><h2 id="Step-1-Define-the-Product"><a href="#Step-1-Define-the-Product" class="headerlink" title="Step 1 - Define the Product."></a>Step 1 - Define the Product.</h2><p>The aim of the product is to elevate a diner’s experience of a restaurant by presenting them with curated menus. Curated menus allow the diners to experience the best of what the restaurants have to offer by letting them focus more on what makes eating out an experience — the cuisine, the company, and the conversations.   </p><p>There are various personas who will benefit from such a product. These personas are defined below.  </p><h3 id="Persona-1-—-The-Explorer"><a href="#Persona-1-—-The-Explorer" class="headerlink" title="Persona 1 — The Explorer."></a>Persona 1 — The Explorer.</h3><p>People who like eating out frequently and trying new places and cuisines can benefit from handpicked dishes presented in the curated menu. This allows them to experience new tastes and cultures while enjoying a tailored experience; every dish in the menu is chosen to delight the diner by highlighting the best of what the culture has to offer.  
</p><h3 id="Persona-2-—-The-Executive"><a href="#Persona-2-—-The-Executive" class="headerlink" title="Persona 2 — The Executive."></a>Persona 2 — The Executive.</h3><p>Curated menus also make dining in a business setting easier by letting the diners focus more on enjoying the cuisine and having business discussions than on choosing dishes from a menu. When the diners come from various countries, a curated menu also allows the host to highlight the cuisine of the host country.</p><h3 id="Persona-3-—-The-Lover"><a href="#Persona-3-—-The-Lover" class="headerlink" title="Persona 3 — The Lover."></a>Persona 3 — The Lover.</h3><p>Curated menus make going out on dates a more memorable experience by allowing partners to bond over meals shared together. The menu takes the guesswork out of choosing the right dish, as each one is chosen to delight the diner. This makes going out and choosing the right restaurant for the evening an easier experience.</p><h2 id="Step-2-Determine-Goals"><a href="#Step-2-Determine-Goals" class="headerlink" title="Step 2 - Determine Goals."></a>Step 2 - Determine Goals.</h2><p>The following is the list of SMART goals that will enable the release of the first iteration of the product.</p><ol><li>Identify the top three cities where users eat out the most.  </li><li>Identify the five restaurants in these cities that would participate.  </li><li>Create three curated menus for each of these restaurants.</li><li>Get 100 bookings for each menu in the first six months.  </li></ol><p>The KPIs for these goals are listed below.  </p><ol><li>The number of people viewing curated menus.  </li><li>The number of people buying curated menus.</li><li>The number of people adding curated menus to their favorites.</li></ol><h2 id="Step-3-—-Constraints-and-Limitations"><a href="#Step-3-—-Constraints-and-Limitations" class="headerlink" title="Step 3 — Constraints and Limitations."></a>Step 3 — Constraints and Limitations.</h2><p>Curated menus take time to create. 
Getting each restaurant on board, creating the menus, presenting them to users, and getting users to choose a menu can be a time-consuming process. Additionally, factoring in people’s dietary choices means that some menus will have limited appeal, reducing the number of people who choose them. Another factor is pricing: menus need to be created at different price points.  </p><h2 id="Step-4-—-Scope"><a href="#Step-4-—-Scope" class="headerlink" title="Step 4 — Scope."></a>Step 4 — Scope.</h2><p>The initial pilot of the feature should include only a limited number of restaurants and menus. From the point-of-view of functionality, the engineering effort required should focus on enabling the goals outlined in Step 2 — adding curated menus in the system, displaying them to users, allowing users to buy the menu, notifying the restaurants of the purchase, and analytics to track the KPIs.</p><h2 id="Step-5-—-Features"><a href="#Step-5-—-Features" class="headerlink" title="Step 5 — Features."></a>Step 5 — Features.</h2><img src="/2025/10/17/Writing-a-PRD-Restaurant-Reservations/App.png" class="">  <p>Upon viewing a restaurant, the user will be presented with the list of curated menus for that restaurant along with the price. They will be able to buy it within the app by choosing a date and time, and paying the displayed amount. This will reserve a table for them and notify the restaurant that a curated experience is expected at the chosen time.  </p><h2 id="Step-6-—-Release-Criteria"><a href="#Step-6-—-Release-Criteria" class="headerlink" title="Step 6 — Release Criteria."></a>Step 6 — Release Criteria.</h2><p>The criteria for success would include trial runs at restaurants with a chosen group of users and ensuring that the functionality performs as expected. Once past this phase, the feature can be released generally to more users. Concretely, the following criteria would help check the readiness of the product.  
</p><ol><li>Add the participating restaurants into the system.  </li><li>Add the curated menus into the system.  </li><li>Ensure that the menus show up in the app as expected.  </li><li>Purchase a menu and check the end-to-end flow.  </li><li>Release it to an early group of users.  </li></ol><p>Finito.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;One of my favorite food delivery apps recently launched a feature where they spotlight restaurants. Upon clicking the icon which shows yo
      
    
    </summary>
    
    
      <category term="PRD" scheme="http://fasihkhatib.com/tags/PRD/"/>
    
  </entry>
  
  <entry>
    <title>Real-time Restaurant Recommendations</title>
    <link href="http://fasihkhatib.com/2025/06/25/Real-time-restaurant-recommendations/"/>
    <id>http://fasihkhatib.com/2025/06/25/Real-time-restaurant-recommendations/</id>
    <published>2025-06-25T13:57:28.000Z</published>
    <updated>2025-06-25T13:57:28.405Z</updated>
    
    <content type="html"><![CDATA[<p>One of my favorite food delivery apps recently launched a feature where your friends can recommend dishes to you. This got me thinking. What if we could provide recommendations from more than just friends? What if recommendations came from the people around you as they browse restaurants and place orders? In this post we’ll take a look at how we can go about building such a system.  </p><h2 id="Getting-Started"><a href="#Getting-Started" class="headerlink" title="Getting Started"></a>Getting Started</h2><p>Imagine opening your favorite food delivery app and being presented with the trending searches and restaurants. For example, something like the mockup shown below. It shows three trending searches and a few trending restaurants. It turns out that tracking user activity from search to purchase allows us to find out what’s gaining traction. Let’s design such a system.</p><img src="/2025/06/25/Real-time-restaurant-recommendations/Group.png" class="">  <p>We’ll begin by tracking a user’s journey from search to purchase.  </p><p>Let’s say that the user begins their food ordering journey by searching for “pizza”. They’re presented with a list of restaurants that sell pizza. They click on a few restaurants, browse the menu, and finally add a large pizza to their cart from one of the restaurants. Finally, they place the order by making the payment. If we were to track each interaction of the user as an event, it would look something like this.  </p><p>Search for “pizza”.<br>Present a list of restaurants.<br>Click on a restaurant.<br>Click on a restaurant.<br>Add pizza to cart.<br>Add a bottle of cola to cart.<br>Purchase.  </p><p>As we collect more events, we’ll see patterns emerge. For example, we may discover that pizza is one of the trending searches since it’s being searched more often. Let’s see how we can do all of this programmatically. It all starts with tracking the user’s search.  
</p><p>Let every search made by the user be tagged with a unique identifier. This identifier, which we’ll call <code>query_id</code>, stays the same throughout the journey that follows from that search. Every new search generates a new <code>query_id</code>, while all the events that stem from the same search share its <code>query_id</code>. Similarly, we’ll track a user with a unique identifier called <code>user_id</code>. Let’s model the user’s search as an event as follows.  </p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">{</span><br><span class="line">    <span class="string">"query_id"</span>: query_id,</span><br><span class="line">    <span class="string">"user_id"</span>: user_id,</span><br><span class="line">    <span class="string">"type"</span>: <span class="string">"query"</span>,</span><br><span class="line">    <span class="string">"target"</span>: <span class="string">"large pizza"</span>,</span><br><span class="line">    <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">}</span><br></pre></td></tr></table></figure>  <p>In this event, which is of type “query”, the user searched for “large pizza”. Similarly, we’ll model the results that were displayed to them as an event.  
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">{</span><br><span class="line">    <span class="string">"query_id"</span>: query_id,</span><br><span class="line">    <span class="string">"user_id"</span>: user_id,</span><br><span class="line">    <span class="string">"type"</span>: <span class="string">"results"</span>,</span><br><span class="line">    <span class="string">"target"</span>: <span class="string">"La Pizzeria, Freddy's"</span>,</span><br><span class="line">    <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">}</span><br></pre></td></tr></table></figure>  <p>In this event, which is of type “results”, we see that the user was presented with two restaurants. Next, let’s model the user clicking on one of the restaurants. 
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">{</span><br><span class="line">    <span class="string">"query_id"</span>: query_id,</span><br><span class="line">    <span class="string">"user_id"</span>: user_id,</span><br><span class="line">    <span class="string">"type"</span>: <span class="string">"click"</span>,</span><br><span class="line">    <span class="string">"target"</span>: <span class="string">"La Pizzeria"</span>,</span><br><span class="line">    <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">}</span><br></pre></td></tr></table></figure>  <p>In this event, which is of type “click”, the user clicked on the restaurant called “La Pizzeria”. Next, let’s model the user adding two items to their cart.  
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">{</span><br><span class="line">    <span class="string">"query_id"</span>: query_id,</span><br><span class="line">    <span class="string">"user_id"</span>: user_id,</span><br><span class="line">    <span class="string">"type"</span>: <span class="string">"add_to_cart"</span>,</span><br><span class="line">    <span class="string">"target"</span>: <span class="string">"16 inch Margherita"</span>,</span><br><span class="line">    <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">},</span><br><span class="line">{</span><br><span class="line">    <span class="string">"query_id"</span>: query_id,</span><br><span class="line">    <span class="string">"user_id"</span>: user_id,</span><br><span class="line">    <span class="string">"type"</span>: <span class="string">"add_to_cart"</span>,</span><br><span class="line">    <span class="string">"target"</span>: <span class="string">"Cola 1L"</span>,</span><br><span class="line">    <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">}</span><br></pre></td></tr></table></figure>  <p>In the events, which are of type “add_to_cart”, we see that the user added two items to their cart. Finally, let’s model the purchase event.  
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">{</span><br><span class="line">    <span class="string">"query_id"</span>: query_id,</span><br><span class="line">    <span class="string">"user_id"</span>: user_id,</span><br><span class="line">    <span class="string">"type"</span>: <span class="string">"purchase"</span>,</span><br><span class="line">    <span class="string">"target"</span>: <span class="string">"Cola 1L"</span>,</span><br><span class="line">    <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">},</span><br><span class="line">{</span><br><span class="line">    <span class="string">"query_id"</span>: query_id,</span><br><span class="line">    <span class="string">"user_id"</span>: user_id,</span><br><span class="line">    <span class="string">"type"</span>: <span class="string">"purchase"</span>,</span><br><span class="line">    <span class="string">"target"</span>: <span class="string">"16 inch Margherita"</span>,</span><br><span class="line">    <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">}</span><br></pre></td></tr></table></figure>  <p>In the events, which are of type “purchase”, we see that the user purchased the two items.  </p><p>Having seen how to track user journey from search to purchase using events, let’s see how we can mine these events for patterns. 
Let’s write a simple Python script which demonstrates this using the events we’ve seen above.  </p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span 
class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> datetime</span><br><span class="line"><span class="keyword">import</span> uuid</span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> pandas <span class="keyword">as</span> pd</span><br><span class="line"></span><br><span class="line">query_id = <span class="built_in">str</span>(uuid.uuid4())</span><br><span class="line">user_id = <span class="built_in">str</span>(uuid.uuid4())</span><br><span class="line"></span><br><span class="line">events = [</span><br><span class="line">    {</span><br><span class="line">        <span class="string">"query_id"</span>: query_id,</span><br><span class="line">        <span class="string">"user_id"</span>: user_id,</span><br><span class="line">        <span class="string">"type"</span>: <span class="string">"query"</span>,</span><br><span class="line">        <span class="string">"target"</span>: <span class="string">"large pizza"</span>,</span><br><span class="line">        <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">    },</span><br><span class="line">    {</span><br><span class="line">        <span class="string">"query_id"</span>: query_id,</span><br><span class="line">        <span class="string">"user_id"</span>: user_id,</span><br><span class="line">        <span class="string">"type"</span>: <span class="string">"results"</span>,</span><br><span class="line">        <span class="string">"target"</span>: <span class="string">"La Pizzeria, Freddy's"</span>,</span><br><span 
class="line">        <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">    },</span><br><span class="line">    {</span><br><span class="line">        <span class="string">"query_id"</span>: query_id,</span><br><span class="line">        <span class="string">"user_id"</span>: user_id,</span><br><span class="line">        <span class="string">"type"</span>: <span class="string">"click"</span>,</span><br><span class="line">        <span class="string">"target"</span>: <span class="string">"La Pizzeria"</span>,</span><br><span class="line">        <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">    },</span><br><span class="line">    {</span><br><span class="line">        <span class="string">"query_id"</span>: query_id,</span><br><span class="line">        <span class="string">"user_id"</span>: user_id,</span><br><span class="line">        <span class="string">"type"</span>: <span class="string">"add_to_cart"</span>,</span><br><span class="line">        <span class="string">"target"</span>: <span class="string">"16 inch Margherita"</span>,</span><br><span class="line">        <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">    },</span><br><span class="line">    {</span><br><span class="line">        <span class="string">"query_id"</span>: query_id,</span><br><span class="line">        <span class="string">"user_id"</span>: user_id,</span><br><span class="line">        <span class="string">"type"</span>: <span class="string">"add_to_cart"</span>,</span><br><span class="line">        <span class="string">"target"</span>: <span class="string">"Cola 1L"</span>,</span><br><span class="line">        <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">    },</span><br><span class="line">    {</span><br><span class="line">        <span class="string">"query_id"</span>: query_id,</span><br><span 
class="line">        <span class="string">"user_id"</span>: user_id,</span><br><span class="line">        <span class="string">"type"</span>: <span class="string">"purchase"</span>,</span><br><span class="line">        <span class="string">"target"</span>: <span class="string">"Cola 1L"</span>,</span><br><span class="line">        <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">    },</span><br><span class="line">    {</span><br><span class="line">        <span class="string">"query_id"</span>: query_id,</span><br><span class="line">        <span class="string">"user_id"</span>: user_id,</span><br><span class="line">        <span class="string">"type"</span>: <span class="string">"purchase"</span>,</span><br><span class="line">        <span class="string">"target"</span>: <span class="string">"16 inch Margherita"</span>,</span><br><span class="line">        <span class="string">"created_at"</span>: datetime.datetime.now()</span><br><span class="line">    }</span><br><span class="line">]</span><br><span class="line"></span><br><span class="line">df = pd.DataFrame.from_records(data=events)</span><br><span class="line"></span><br><span class="line">metrics = df.merge(df, left_on=<span class="string">"query_id"</span>, right_on=<span class="string">"query_id"</span>)</span><br><span class="line">metrics = metrics[(metrics[<span class="string">"type_x"</span>] == <span class="string">"purchase"</span>) &amp; (metrics[<span class="string">"type_y"</span>] == <span class="string">"query"</span>)]</span><br><span class="line">metrics = metrics.groupby([<span class="string">"target_y"</span>, <span class="string">"target_x"</span>])[<span class="string">"user_id_x"</span>].count().reset_index().rename(</span><br><span class="line">    columns={<span class="string">"target_y"</span>: <span class="string">"query"</span>, <span class="string">"target_x"</span>: <span class="string">"item"</span>, <span 
class="string">"user_id_x"</span>: <span class="string">"count"</span>}</span><br><span class="line">)</span><br><span class="line"></span><br><span class="line"><span class="built_in">print</span>(metrics.head())</span><br></pre></td></tr></table></figure>  <p>Running the script produces the following output.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">         query                item  count</span><br><span class="line">0  large pizza  16 inch Margherita      1</span><br><span class="line">1  large pizza             Cola 1L      1</span><br></pre></td></tr></table></figure>  <p>In the output, we see that the query “large pizza” resulted in the purchase of one pizza and one bottle of cola.  </p><p>What the script does is tie the purchase to its original query using the <code>query_id</code>. Finding the trending phrases and keywords is now simply a matter of grouping on the <code>query</code>.  
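</p><p>The same join can also be sketched without pandas, which may make the role of <code>query_id</code> easier to see. This is only a minimal illustration using the Python standard library. The helper name <code>purchases_per_query</code> and the hard-coded <code>query_id</code> values are assumptions made for this example and are not part of the script above.</p>
<pre>
```python
from collections import Counter


def purchases_per_query(events):
    """Count purchased items per search phrase by joining on query_id."""
    # Map each query_id to the phrase the user searched for.
    phrase_by_query_id = {
        event["query_id"]: event["target"]
        for event in events
        if event["type"] == "query"
    }

    counts = Counter()
    for event in events:
        if event["type"] != "purchase":
            continue
        # Tie the purchase back to its originating search.
        phrase = phrase_by_query_id.get(event["query_id"])
        if phrase is not None:
            counts[(phrase, event["target"])] += 1
    return counts


# Hypothetical events: two separate searches for the same phrase.
sample_events = [
    {"query_id": "q1", "type": "query", "target": "large pizza"},
    {"query_id": "q1", "type": "purchase", "target": "16 inch Margherita"},
    {"query_id": "q1", "type": "purchase", "target": "Cola 1L"},
    {"query_id": "q2", "type": "query", "target": "large pizza"},
    {"query_id": "q2", "type": "purchase", "target": "16 inch Margherita"},
]

per_item = purchases_per_query(sample_events)
for (phrase, item), count in sorted(per_item.items()):
    print(phrase, "|", item, "|", count)
```
</pre>
<p>With per-item counts like these in hand, the trending phrases come from summing the counts for each query. In pandas, using the <code>metrics</code> frame from the script, that looks as follows.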
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">trending = metrics.groupby([<span class="string">"query"</span>])[<span class="string">"count"</span>].<span class="built_in">sum</span>().reset_index().sort_values([<span class="string">"count"</span>], ascending=<span class="literal">False</span>)</span><br><span class="line"><span class="built_in">print</span>(trending)</span><br></pre></td></tr></table></figure>  <p>This gives us the following output, which shows that the query “large pizza” resulted in the purchase of two items.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">         query  count</span><br><span class="line">0  large pizza      2</span><br></pre></td></tr></table></figure>  <p>As we collect more events, we’ll see more query phrases emerge. These can then be presented to the user in the application. Similarly, tying the click events to queries will show us what query phrases bring the user to a particular restaurant. Finally, aggregating the click or purchase events will show us which restaurants are trending.</p><p>This small example shows how tracking user interactions with events can be used to find which query phrases are trending.   </p><p>In a production setting, and to make this real-time, events would be emitted to a log such as Kafka and processed with a stream processing framework such as Spark or Flink. Once the aggregated metrics are generated, they’d be written to a database from which they can be served to the user.  </p><p>Finito.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;One of my favorite food delivery apps recently launched a feature where your friends can recommend dishes to you. This got me thinking. W
      
    
    </summary>
    
    
      <category term="architecture" scheme="http://fasihkhatib.com/tags/architecture/"/>
    
  </entry>
  
  <entry>
    <title>Creating a realtime data platform - indexing</title>
    <link href="http://fasihkhatib.com/2025/01/12/Creating-a-realtime-data-platform-indexing/"/>
    <id>http://fasihkhatib.com/2025/01/12/Creating-a-realtime-data-platform-indexing/</id>
    <published>2025-01-12T16:01:29.000Z</published>
    <updated>2025-01-14T15:18:33.342Z</updated>
    
    <content type="html"><![CDATA[<p>In the <a href="/2025/01/10/Creating-a-realtime-data-platform-embedding/">previous post</a> we saw how we can embed a Superset dashboard in a webpage using a Flask app. In this post we’ll look at creating indexes on the tables that are stored in Pinot. We’ll look at the various types of indexes that can be created and then create indexes of our own to speed up the queries.  </p><h2 id="Getting-started"><a href="#Getting-started" class="headerlink" title="Getting started"></a>Getting started</h2><p>Very briefly, a database index is a data structure that speeds up retrieving data from tables. It is used to quickly find the rows that we’re interested in without having to scan the entire table. Pinot supports different types of indexes that can be used to speed up different types of queries. Knowing the access patterns of the queries helps us decide which indexes we’d like to create. We’ll quickly look at each type of index that Pinot provides, write a few queries, and then create indexes to speed them up. Let’s start by looking at the various index types that are available in Pinot.  </p><p>Pinot supports the following types of indexes — bloom filter index, forward index, FST index, geospatial index, inverted index, JSON index, range index, star-tree index, text search index, and timestamp index. In the sections that follow, we’ll look at how to create an inverted index and a range index to speed up our queries. Let’s start with the inverted index.  </p><p>Let’s say we’d like to find out the sum of the amounts of orders placed on January 1st. One way to write this query would be to convert the <code>created_at</code> time to a date on the fly. This would require us to look at every row we have in the table and compare it against the date we’re looking for. Another way to write this query would be to convert <code>created_at</code> to a date at ingestion time and create an inverted index on it. 
Creating an inverted index would store a mapping of each date to the row IDs with the same date. This will enable us to filter only those rows which have the date we’re looking for instead of having to look at each one of them.</p><p>We’d have to update our table and schema for the <code>orders</code> table. We’ll start by adding a column to our schema definition which will store the date as a string.  </p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">  <span class="attr">"name"</span><span class="punctuation">:</span> <span class="string">"date"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"dataType"</span><span class="punctuation">:</span> <span class="string">"STRING"</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure><p>Next, we’ll convert the <code>created_at</code> time to a date and store it in the <code>date</code> column. To do this, we’ll create a field-level transformation in the table definition.   
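</p><p>To make the idea concrete before looking at the Pinot config, here is a rough Python sketch of what the date conversion and the inverted index amount to. It assumes that <code>created_at</code> holds milliseconds since the Unix epoch and that dates are taken in UTC. The sample rows and the helper names <code>to_date</code> and <code>build_inverted_index</code> are made up for illustration, and the sketch says nothing about how Pinot implements its indexes internally.</p>
<pre>
```python
from collections import defaultdict
from datetime import datetime, timezone


def to_date(created_at_ms):
    """Format epoch milliseconds as a yyyy-MM-dd string in UTC."""
    moment = datetime.fromtimestamp(created_at_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


def build_inverted_index(rows):
    """Map each date string to the list of row ids that carry it."""
    index = defaultdict(list)
    for row_id, row in enumerate(rows):
        index[to_date(row["created_at"])].append(row_id)
    return index


# Hypothetical orders; created_at is assumed to be epoch milliseconds.
orders = [
    {"created_at": 1704067200000, "amount": 10.0},  # 2024-01-01 00:00 UTC
    {"created_at": 1704110400000, "amount": 5.5},   # 2024-01-01 12:00 UTC
    {"created_at": 1704153600000, "amount": 7.0},   # 2024-01-02 00:00 UTC
]

index = build_inverted_index(orders)

# Only the rows listed under the date are read; the others are skipped.
matching_ids = index["2024-01-01"]
total = sum(orders[row_id]["amount"] for row_id in matching_ids)
print(matching_ids, total)
```
</pre>
<p>In Pinot itself, none of this is written by hand; the conversion is declared in the table definition as shown next.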
</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">  <span class="attr">"columnName"</span><span class="punctuation">:</span> <span class="string">"date"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"transformFunction"</span><span class="punctuation">:</span> <span class="string">"toDateTime(created_at, 'yyyy-MM-dd')"</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure><p>Next, we’ll enable the inverted index on the <code>date</code> column by adding the following to the table definition.</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">  <span class="attr">"tableIndexConfig"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line">    <span class="attr">"invertedIndexColumns"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">      <span class="string">"date"</span></span><br><span class="line">    <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"createInvertedIndexDuringSegmentGeneration"</span><span class="punctuation">:</span> <span class="literal"><span class="keyword">true</span></span></span><br><span class="line">  <span class="punctuation">}</span></span><br><span class="line"><span 
class="punctuation">}</span></span><br></pre></td></tr></table></figure><p>Once we PUT these configs, we’ll reload the segments so that the index is created. We can now write our query in the Pinot console.  </p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> "date", <span class="built_in">SUM</span>(amount) <span class="keyword">as</span> total</span><br><span class="line"><span class="keyword">FROM</span> orders</span><br><span class="line"><span class="keyword">WHERE</span> "date" <span class="operator">=</span> <span class="string">'2024-01-01'</span></span><br><span class="line"><span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span>; </span><br></pre></td></tr></table></figure> <p>We can verify that the index was used by looking at the explain plan for the query.  </p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">EXPLAIN PLAN <span class="keyword">FOR</span></span><br><span class="line"><span class="keyword">SELECT</span> "date", <span class="built_in">SUM</span>(amount) <span class="keyword">as</span> total</span><br><span class="line"><span class="keyword">FROM</span> orders</span><br><span class="line"><span class="keyword">WHERE</span> "date" <span class="operator">=</span> <span class="string">'2024-01-01'</span></span><br><span class="line"><span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span>;</span><br></pre></td></tr></table></figure>  <p>This produces the following result.  
</p><img src="/2025/01/12/Creating-a-realtime-data-platform-indexing/screen_1.png" class="">  <p>Looking at the explain plan, we find the following line. It tells us that an index lookup, rather than a full scan, was used to find the date ‘2024-01-01’. Note that the plan reports <code>sorted_index</code>: Pinot can use a sorted index lookup in place of the inverted index when the column’s values are already sorted within a segment.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">FILTER_SORTED_INDEX(indexLookUp:sorted_index,operator:EQ,predicate:date = '2024-01-01')</span><br></pre></td></tr></table></figure><p>Let’s modify the query slightly and look for the daily total for the last 90 days. It looks as follows.  </p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> "date", <span class="built_in">SUM</span>(amount) <span class="keyword">as</span> total</span><br><span class="line"><span class="keyword">FROM</span> orders</span><br><span class="line"><span class="keyword">WHERE</span> created_at <span class="operator">&gt;=</span> ago(<span class="string">'P90D'</span>)</span><br><span class="line"><span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span></span><br><span class="line"><span class="keyword">ORDER</span> <span class="keyword">BY</span> <span class="number">1</span></span><br></pre></td></tr></table></figure><p>The <code>ago()</code> function takes a duration string as its argument and returns milliseconds since epoch. We compare it against <code>created_at</code>, which is also expressed as milliseconds since epoch, to get the final result. Let’s look at the explain plan for the query.  </p><img src="/2025/01/12/Creating-a-realtime-data-platform-indexing/screen_2.png" class="">  <p>We find the following line which indicates that an index was not used.  
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">FILTER_FULL_SCAN(operator:RANGE,predicate:created_at &gt;= '1728886039203')</span><br></pre></td></tr></table></figure><p>We can improve the performance of the query by using a range index which allows us to efficiently query over a range of values. This helps us speed up queries which involve comparison operators like less than, greater than, etc. To create a range index, we’ll have to update the table definition and specify the column on which we’d like to create the index. We’ll add the following to the table definition.  </p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">  <span class="attr">"tableIndexConfig"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line">    <span class="attr">"rangeIndexColumns"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">      <span class="string">"created_at"</span></span><br><span class="line">    <span class="punctuation">]</span></span><br><span class="line">  <span class="punctuation">}</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure><p>Like we did previously, we’ll PUT this config and reload the segments. We’ll once again look at the explain plan for the query above. </p><img src="/2025/01/12/Creating-a-realtime-data-platform-indexing/screen_3.png" class=""><p>We now find the following line which indicates that the range index was used.  
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">FILTER_RANGE_INDEX(indexLookUp:range_index,operator:RANGE,predicate:created_at &gt;= '1728887647851')</span><br></pre></td></tr></table></figure><p>This was a quick overview of using indexes to speed up queries on Pinot. We looked at the inverted index and the range index. In the next post, we’ll look at the star-tree index which lets us compute pre-aggregations on columns and speeds up the result of operations like SUM and COUNT. We’ll also look at stream processing to generate new events from the ones emitted by Debezium.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;In the &lt;a href=&quot;/2025/01/10/Creating-a-realtime-data-platform-embedding/&quot;&gt;previous post&lt;/a&gt; we saw how we can embed a Superset dashboard 
      
    
    </summary>
    
    
      <category term="architecture" scheme="http://fasihkhatib.com/tags/architecture/"/>
    
  </entry>
  
  <entry>
    <title>Creating a realtime data platform - embedding</title>
    <link href="http://fasihkhatib.com/2025/01/10/Creating-a-realtime-data-platform-embedding/"/>
    <id>http://fasihkhatib.com/2025/01/10/Creating-a-realtime-data-platform-embedding/</id>
    <published>2025-01-10T12:21:10.000Z</published>
    <updated>2025-01-11T11:41:42.021Z</updated>
    
    <content type="html"><![CDATA[<p>In the <a href="/2025/01/06/Creating-a-realtime-data-platform-visualization/">previous post</a> we looked at how to visualize our data using Superset. In this post we’ll look at embedding our dashboards. Embedding allows us to display a dashboard outside of Superset and within a webpage. This lets us blend analytics seamlessly into the user’s workflow. We’ll tweak a few Superset settings to allow us to embed a dashboard, and then build a Flask application which will render it in a webpage. We’ll also look at row-level security which lets us limit a user’s access to only those rows that they are allowed access to.  </p><h2 id="Getting-started"><a href="#Getting-started" class="headerlink" title="Getting started"></a>Getting started</h2><p>Let’s say we’d like to create a dashboard that consists of a line chart and a table. The line chart plots the number of orders the user has placed daily. The table shows the cafes where the user frequently orders from. We’ll start by writing a query which will enable us to create the line chart. This will be stored as a view so that rendering the data produces realtime results. The query is as follows.  
</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">CREATE</span> <span class="keyword">VIEW</span> hive.views.daily_orders_by_users <span class="keyword">AS</span></span><br><span class="line"><span class="keyword">SELECT</span> user_id,</span><br><span class="line">       <span class="built_in">CAST</span>(FROM_UNIXTIME(<span class="built_in">CAST</span>(created_at <span class="keyword">AS</span> <span class="keyword">DOUBLE</span>) <span class="operator">/</span> <span class="number">1e3</span>) <span class="keyword">AS</span> <span class="type">DATE</span>) <span class="keyword">AS</span> dt,</span><br><span class="line">       <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">AS</span> count</span><br><span class="line"><span class="keyword">FROM</span> pinot.default.orders</span><br><span class="line"><span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span>, <span class="number">2</span>;</span><br></pre></td></tr></table></figure><p>Next, we’ll create the view which will tell us the cafes the user orders from. It is as follows.  
</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">CREATE</span> <span class="keyword">VIEW</span> hive.views.frequent_cafes <span class="keyword">AS</span></span><br><span class="line"><span class="keyword">SELECT</span> o.user_id,</span><br><span class="line">       c.name <span class="keyword">AS</span> cafe,</span><br><span class="line">       <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">AS</span> count,</span><br><span class="line">       <span class="built_in">RANK</span>() <span class="keyword">OVER</span>(<span class="keyword">PARTITION</span> <span class="keyword">BY</span> o.user_id <span class="keyword">ORDER</span> <span class="keyword">BY</span> <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">DESC</span>) <span class="keyword">AS</span> rank</span><br><span class="line"><span class="keyword">FROM</span> pinot.default.orders <span class="keyword">AS</span> o</span><br><span class="line">  <span class="keyword">INNER</span> <span class="keyword">JOIN</span> pinot.default.cafe <span class="keyword">AS</span> c</span><br><span class="line">  <span class="keyword">ON</span> <span class="built_in">CAST</span>(o.cafe_id <span class="keyword">AS</span> <span class="type">VARCHAR</span>) <span class="operator">=</span> c.id</span><br><span class="line"><span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span>, <span class="number">2</span></span><br><span class="line"><span class="keyword">ORDER</span> <span class="keyword">BY</span> <span 
class="number">1</span> <span class="keyword">ASC</span>, <span class="number">3</span> <span class="keyword">DESC</span>;</span><br></pre></td></tr></table></figure><p>Once the views have been constructed, we can use them in Superset to build our dashboard. To enable dashboard embedding, we must first update some Superset settings. If you have any Superset containers running from the last post, start by shutting them down. Then, run the following command to remove any Docker images linked to Superset; we’ll rebuild them with the new settings.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker images | grep superset | awk '{print $3}' | xargs docker rmi</span><br></pre></td></tr></table></figure><p>Let’s begin editing the settings. Navigate to the <code>superset</code> directory within the <code>superset</code> repository and open the <code>config.py</code> file. This contains the configuration that will be used by the Superset application when it runs. We’ll edit this line-by-line and see why these changes are required. First, we’ll disable the Talisman library used by Superset. Find the variable <code>TALISMAN_ENABLED</code> and update it to the following.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">TALISMAN_ENABLED = <span class="literal">False</span></span><br></pre></td></tr></table></figure><p>Talisman is a Python library that protects Flask against some of the common web application security issues. Since we’ll be running this locally over HTTP, we can disable this to allow the embedded dashboard to render within the webpage. Next, we’ll disable CSRF protection. Again, all of these settings are to make things run locally over an HTTP connection. You should leave these protections enabled in a production deployment. 
Find the variable <code>WTF_CSRF_ENABLED</code> and set it to <code>False</code>. </p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">WTF_CSRF_ENABLED = <span class="literal">False</span></span><br></pre></td></tr></table></figure><p>Next, enable dashboard embedding. This is disabled by default. Find the <code>EMBEDDED_SUPERSET</code> key in the feature flags dictionary and set it to <code>True</code>.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="string">"EMBEDDED_SUPERSET"</span>: <span class="literal">True</span></span><br></pre></td></tr></table></figure><p>Finally, we’ll elevate the permissions of the Guest user to enable us to render dashboards in an iframe. Set <code>GUEST_ROLE_NAME</code> to <code>Gamma</code>.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">GUEST_ROLE_NAME = <span class="string">"Gamma"</span></span><br></pre></td></tr></table></figure>  <p>Superset has some predefined roles with permissions attached to them. The Gamma role is the one for data consumers and has limited access. Assigning the Gamma role to the guest user lets us embed the dashboard within a web application. As we’ll see shortly, we’ll generate a guest token which we’ll use when embedding a dashboard.  </p><p>With these changes made, we can rebuild the Superset images. Run the following command to build them. </p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">TAG=4.1.1 docker compose -f docker-compose-non-dev.yml build</span><br></pre></td></tr></table></figure><p>This will take a while to build. In the meantime, we’ll start writing our Flask application. 
It’ll be a simple application that renders a Jinja template. In that template we’ll add the code to display the embedded dashboard. Let’s see what the template looks like.  </p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag">&lt;<span class="name">title</span>&gt;</span>Dashboard<span class="tag">&lt;/<span class="name">title</span>&gt;</span></span><br><span class="line"></span><br><span class="line"><span class="tag">&lt;<span class="name">style</span>&gt;</span><span class="language-css"></span></span><br><span class="line"><span class="language-css">  <span class="selector-tag">body</span>, <span class="selector-tag">div</span> {</span></span><br><span class="line"><span class="language-css">      <span class="attribute">width</span>: <span class="number">100vw</span>;</span></span><br><span class="line"><span class="language-css">      <span class="attribute">height</span>: <span class="number">100vh</span>;</span></span><br><span class="line"><span class="language-css">  }</span></span><br><span 
class="line"><span class="language-css"></span><span class="tag">&lt;/<span class="name">style</span>&gt;</span></span><br><span class="line"></span><br><span class="line"><span class="tag">&lt;<span class="name">div</span> <span class="attr">id</span>=<span class="string">"chart"</span>&gt;</span><span class="tag">&lt;/<span class="name">div</span>&gt;</span></span><br><span class="line"></span><br><span class="line"><span class="tag">&lt;<span class="name">script</span> <span class="attr">src</span>=<span class="string">"https://unpkg.com/@superset-ui/embedded-sdk"</span>&gt;</span><span class="tag">&lt;/<span class="name">script</span>&gt;</span></span><br><span class="line"><span class="tag">&lt;<span class="name">script</span>&gt;</span><span class="language-javascript"></span></span><br><span class="line"><span class="language-javascript">  supersetEmbeddedSdk.<span class="title function_">embedDashboard</span>({</span></span><br><span class="line"><span class="language-javascript">      <span class="attr">id</span>: <span class="string">'{{ chart_id }}'</span>,</span></span><br><span class="line"><span class="language-javascript">      <span class="attr">supersetDomain</span>: <span class="string">'http://192.168.0.103:8088'</span>,</span></span><br><span class="line"><span class="language-javascript">      <span class="attr">mountPoint</span>: <span class="variable language_">document</span>.<span class="title function_">getElementById</span>(<span class="string">"chart"</span>),</span></span><br><span class="line"><span class="language-javascript">      <span class="attr">fetchGuestToken</span>: <span class="function">() =&gt;</span> <span class="string">"{{ guest_token  }}"</span>,</span></span><br><span class="line"><span class="language-javascript">      <span class="attr">dashboardUiConfig</span>: {</span></span><br><span class="line"><span class="language-javascript">        <span class="attr">hideTitle</span>: <span 
class="literal">true</span></span></span><br><span class="line"><span class="language-javascript">      },</span></span><br><span class="line"><span class="language-javascript">      <span class="attr">iframeSandboxExtras</span>: []</span></span><br><span class="line"><span class="language-javascript">  })</span></span><br><span class="line"><span class="language-javascript"></span></span><br><span class="line"><span class="language-javascript">  <span class="comment">// This is a hack to make the iframe bigger.</span></span></span><br><span class="line"><span class="language-javascript">  <span class="variable language_">document</span>.<span class="title function_">getElementById</span>(<span class="string">"chart"</span>).<span class="property">children</span>[<span class="number">0</span>].<span class="property">width</span>=<span class="string">"100%"</span>;</span></span><br><span class="line"><span class="language-javascript">  <span class="variable language_">document</span>.<span class="title function_">getElementById</span>(<span class="string">"chart"</span>).<span class="property">children</span>[<span class="number">0</span>].<span class="property">height</span>=<span class="string">"100%"</span>;</span></span><br><span class="line"><span class="language-javascript"></span><span class="tag">&lt;/<span class="name">script</span>&gt;</span></span><br></pre></td></tr></table></figure><p>Let’s unpack what’s going on. The embedded dashboard is rendered within an iframe and we need a container element to hold it. This is what the <code>div</code> is for; it’ll hold the iframe. </p><p>Next, we load the Superset embed SDK from the CDN. This makes the <code>supersetEmbeddedSdk</code> variable globally available. We call the <code>embedDashboard</code> method on it to embed the dashboard. This method takes an object which contains the information needed for embedding. The first piece of information we pass is the <code>id</code> of the chart. 
As we’ll see shortly, we get this from the Superset UI. We’re using a template variable here and we’ll replace it with its actual value when we render the webpage.  </p><p>Next, we specify the address of our Superset instance in the <code>supersetDomain</code> field. Here I’ve used the IP address of my local machine to point to the Docker containers running Superset.  </p><p>Next, we specify the mount point. The <code>mountPoint</code> is the element within the page where the chart will be rendered. We’re retrieving the <code>div</code> using its ID.   </p><p>Next, we specify the <code>fetchGuestToken</code> function. This function retrieves the guest token from the backend. Since we’re rendering the template from a Flask application, we’ll fetch the guest token on the server side. Therefore, we simply return the guest token from the function. We’ve used a Jinja variable <code>guest_token</code> which we’ll replace with its actual value when we render the template.  </p><p>Next, we specify some configuration information. In our example, we’ve hidden the title of the dashboard.  </p><p>Finally, we increase the size of the iframe so that it fills the screen.  </p><p>Having written the template, we’ll move on to writing the Flask web application. It’s a single file with one endpoint to render the chart. We’ll write functions to see how we can log into Superset using its API and then fetch a guest token. The complete code for the app is given below.  
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> flask <span class="keyword">import</span> Flask, render_template</span><br><span class="line"><span class="keyword">import</span> requests</span><br><span 
class="line"></span><br><span class="line">app = Flask(__name__)</span><br><span class="line">chart_id = <span class="string">"2e8635f3-349a-4c91-bb6e-ff4883a543cc"</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">get_access_token</span>() -&gt; <span class="built_in">str</span>:</span><br><span class="line">    json = {<span class="string">"username"</span>: <span class="string">"admin"</span>, <span class="string">"password"</span>: <span class="string">"admin"</span>, <span class="string">"provider"</span>: <span class="string">"db"</span>, <span class="string">"refresh"</span>: <span class="literal">True</span>}</span><br><span class="line"></span><br><span class="line">    response = requests.post(</span><br><span class="line">        <span class="string">"http://192.168.0.103:8088/api/v1/security/login"</span>,</span><br><span class="line">        headers={</span><br><span class="line">            <span class="string">"Content-Type"</span>: <span class="string">"application/json"</span>,</span><br><span class="line">            <span class="string">"Accept"</span>: <span class="string">"application/json"</span>,</span><br><span class="line">        },</span><br><span class="line">        json=json,</span><br><span class="line">    )</span><br><span class="line"></span><br><span class="line">    response.raise_for_status()</span><br><span class="line">    <span class="keyword">return</span> response.json()[<span class="string">"access_token"</span>]</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">get_guest_token</span>(<span class="params">user_id: <span class="built_in">int</span></span>) -&gt; <span class="built_in">str</span>:</span><br><span class="line">    access_token = get_access_token()</span><br><span class="line"></span><br><span 
class="line">    response = requests.post(</span><br><span class="line">        <span class="string">"http://192.168.0.103:8088/api/v1/security/guest_token"</span>,</span><br><span class="line">        headers={</span><br><span class="line">            <span class="string">"Content-Type"</span>: <span class="string">"application/json"</span>,</span><br><span class="line">            <span class="string">"Accept"</span>: <span class="string">"application/json"</span>,</span><br><span class="line">            <span class="string">"Authorization"</span>: <span class="string">f"Bearer <span class="subst">{access_token}</span>"</span>,</span><br><span class="line">        },</span><br><span class="line">        json={</span><br><span class="line">            <span class="string">"resources"</span>: [{<span class="string">"id"</span>: chart_id, <span class="string">"type"</span>: <span class="string">"dashboard"</span>}],</span><br><span class="line">            <span class="string">"rls"</span>: [{<span class="string">"clause"</span>: <span class="string">f"user_id=<span class="subst">{user_id}</span>"</span>}],</span><br><span class="line">            <span class="string">"user"</span>: {<span class="string">"first_name"</span>: <span class="string">"..."</span>, <span class="string">"last_name"</span>: <span class="string">"..."</span>, <span class="string">"username"</span>: <span class="string">"..."</span>},</span><br><span class="line">        },</span><br><span class="line">    )</span><br><span class="line"></span><br><span class="line">    response.raise_for_status()</span><br><span class="line">    <span class="keyword">return</span> response.json()[<span class="string">"token"</span>]</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="meta">@app.route(<span class="params"><span class="string">"/chart/&lt;int:user_id&gt;"</span></span>)</span></span><br><span class="line"><span class="keyword">def</span> <span 
class="title function_">chart</span>(<span class="params">user_id: <span class="built_in">int</span></span>):</span><br><span class="line">    guest_token = get_guest_token(user_id)</span><br><span class="line">    <span class="keyword">return</span> render_template(<span class="string">"chart.html"</span>, guest_token=guest_token, chart_id=chart_id)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">"__main__"</span>:</span><br><span class="line">    app.run(host=<span class="string">"0.0.0.0"</span>, port=<span class="number">5555</span>, debug=<span class="literal">True</span>)</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Let’s step through the code. The <code>chart_id</code> is the unique identifier for the chart we’re trying to embed. As we’ll see shortly, this comes from the Superset UI.   </p><p>Next, we define the <code>get_access_token</code> function which retrieves an access token. To get this, we need to log into Superset. We use the default username and password which we POST to the login endpoint and extract the token from the JSON that’s returned. This token lets us fetch a guest token which is required for embedding the dashboard.  </p><p>Next, we define the <code>get_guest_token</code> function which retrieves the guest token. It takes the ID of the user as an argument so that it can apply row-level security to the dataset that powers the dashboard. Row-level security, often abbreviated as RLS, is a mechanism which restricts a user’s access to only those rows that they have the permission to access. If we look at the view we’ve created, it contains the data for all the users. Applying row-level security allows us to display only those rows which pertain to the given user. The <code>rls</code> field in the JSON body contains the clause which limits the access of the user. 
The clause in the <code>rls</code> field is applied as a part of the <code>WHERE</code> clause and filters the rows in the dataset. The <code>resources</code> field contains the ID of the chart that we’d like to embed. The <code>user</code> field contains the details of the guest user.  </p><p>Next, we define the <code>chart</code> function which actually renders the template. The template is rendered by calling the <code>render_template</code> function which takes the guest token and chart ID as arguments. This generates the final HTML which is returned to the user. </p><p>We’ll run this app in a separate terminal by executing the following command.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python run_app.py</span><br></pre></td></tr></table></figure><p>After waiting for a while to let the Superset images build, we can go ahead and bring up its containers.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">TAG=4.1.1 docker compose -f docker-compose-non-dev.yml up -d</span><br></pre></td></tr></table></figure><p>After opening the Superset UI, we’ll begin by creating a dashboard. We’ll add a chart to the dashboard which is backed by the <code>daily_orders_by_users</code> view which we created earlier. We’ll add an area chart where we have date on the x-axis and the count on the y-axis with the ID of the user being the dimension. The screenshot below shows what it looks like.  </p><img src="/2025/01/10/Creating-a-realtime-data-platform-embedding/screen_1.png" class=""><p>The chart looks cluttered because it is displaying the data of every user. When we embed this chart, we’ll rely on row-level security to display only the data that belongs to a particular user. Similarly, we’ll add a table backed by the <code>frequent_cafes</code> view and display it in the dashboard. 
Remember that the filtering is applied to the dataset backing the chart. This means that we can create the table and exclude the <code>user_id</code> column from display.  </p><img src="/2025/01/10/Creating-a-realtime-data-platform-embedding/screen_2.png" class=""><p>Having added all our charts to the dashboard, we can embed it in the webapp. Begin by saving the dashboard. Then, click on the three dots on the top-right and click “Embed dashboard”. You should see a dialog box pop up which allows you to list the domains from which the embedded dashboard can be accessed. We’ll leave this empty to allow embedding from all domains and click “Enable Embedding”. From the next dialog box, we’ll copy the ID of the dashboard and then click the X button on the top-right to close it.</p><img src="/2025/01/10/Creating-a-realtime-data-platform-embedding/screen_3.png" class=""><p>We’ll set the <code>chart_id</code> variable in our web application to this ID and then run it. It will start a Flask application that listens on port 5555. Since the route is defined as <code>/chart/&lt;int:user_id&gt;</code>, we’ll navigate to <a href="http://localhost:5555/chart/1">http://localhost:5555/chart/1</a> which will display the chart and the table for the user with ID 1. The embedded dashboard will apply the row-level security clause for this user ID and only display their data. The dashboard looks as follows.  </p><img src="/2025/01/10/Creating-a-realtime-data-platform-embedding/screen_5.png" class=""> <p>That’s it. That’s how to embed a Superset dashboard.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;In the &lt;a href=&quot;/2025/01/06/Creating-a-realtime-data-platform-visualization/&quot;&gt;previous post&lt;/a&gt; we looked at how to visualize our data us
      
    
    </summary>
    
    
      <category term="architecture" scheme="http://fasihkhatib.com/tags/architecture/"/>
    
  </entry>
  
  <entry>
    <title>Creating a realtime data platform - visualization</title>
    <link href="http://fasihkhatib.com/2025/01/06/Creating-a-realtime-data-platform-visualization/"/>
    <id>http://fasihkhatib.com/2025/01/06/Creating-a-realtime-data-platform-visualization/</id>
    <published>2025-01-06T07:15:17.000Z</published>
    <updated>2025-01-06T07:15:17.606Z</updated>
    
    <content type="html"><![CDATA[<p>In the <a href="/2025/01/04/Creating-a-realtime-data-platform-orchestration/">previous post</a> we saw how we can use Airflow to create datasets on top of the data that’s stored in Pinot. Once a dataset is created, we may need to present it as a report or a dashboard. In this post we’ll look at how to use Superset to create visualizations on top of the datasets that we’ve created. We’ll create a dashboard to display the daily change in the number of orders placed.</p><h2 id="Getting-started"><a href="#Getting-started" class="headerlink" title="Getting started"></a>Getting started</h2><p>In an <a href="/2024/02/08/Setting-up-a-SQL-IDE-with-Apache-Superset/">earlier post</a> I’d written about how to run Superset locally. In a nutshell, running Superset using Docker Compose requires cloning the repository and checking out the version of the project you’d like to run. In that post I’d also shown how to install additional packages so that we can add the ability to connect to another database. For the sake of brevity, I’ll simply repeat the steps here and refer you to the earlier post for more details. </p><p>Let’s start by cloning the repo and navigating to it. Since a shallow clone only fetches a single ref, we’ll point it at the tag we’d like to run.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">git clone https://github.com/apache/superset.git --depth 1 --branch 4.1.1</span><br><span class="line">cd superset</span><br></pre></td></tr></table></figure><p>Next, we’ll check out that tag so that we can build from it.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">git checkout 4.1.1</span><br></pre></td></tr></table></figure> <p>Next, we’ll add the Python package which will let us connect to Pinot.  
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">echo "pinotdb" &gt;&gt; ./docker/requirements-local.txt</span><br></pre></td></tr></table></figure><p>Finally, we’ll bring up the containers for this specific tag.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">TAG=4.1.1 docker-compose -f docker-compose-non-dev.yml up -d</span><br></pre></td></tr></table></figure> <p>This will build the Docker images for Superset and run them as containers. Once the containers are running, Superset will be available on <a href="http://localhost:8088">http://localhost:8088</a>. You can use <code>admin</code> as both the username and password to log in. Now that we’re logged in, we can begin creating charts and dashboards. Let’s start by connecting to Trino.  </p><p>Click on “Settings” on the top-right corner. Click on “Database Connections”. Finally, click on “+ Database”. You should see the following screen pop up.  </p><img src="/2025/01/06/Creating-a-realtime-data-platform-visualization/screen_1.png" class=""><p>From the supported databases, select “Trino”. Once selected, you’ll be asked to enter the connection string to connect to it. Since the Superset Docker containers are running separately from the ones where Trino is running, we’ll have to use the IP address of the local machine to make the connection between Superset and Trino. The steps to find the IP address of your machine vary depending on your operating system. Go ahead and find it. I’ll continue with the address of my machine. We’ll have to enter the connection string that SQLAlchemy expects. It looks as follows. 
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">trino://admin:@192.168.0.103:9080</span><br></pre></td></tr></table></figure><p>Notice how we’ve specified <code>admin</code> as the user and a blank password. I’m also specifying the port since I’ve mapped Trino’s port 8080 to 9080 locally. Once we’ve entered the details, we can click on “Test Connection” to make sure we’re able to connect. Finally, if the connection succeeds, we’ll click on “Connect” to save the connection. We can now proceed to creating charts and dashboards.  </p><p>Click on “Dashboards” on the top-left and then click on “+ Dashboard”. This will create a new draft dashboard in which we’ll visualize datasets. Change its name to “Orders” and click “Save”. You should see a screen similar to the one shown below.  </p><img src="/2025/01/06/Creating-a-realtime-data-platform-visualization/screen_2.png" class=""> <p>Click on the “Create a new chart” button in the middle of the screen. From there, click on “Add a dataset”. Every chart needs to be associated with a dataset. We’ll create them based on the ones we’ve stored in Trino. You should see the following screen.   </p><img src="/2025/01/06/Creating-a-realtime-data-platform-visualization/screen_3.png" class="">  <p>From the drop-down on the left, select “Trino” as the database. Select the “views” schema. Finally, select “daily_orders” as the table. Once done, click on the “Create Dataset and Create Chart” option on the bottom-right. In the screen that shows, select “Area Chart” and click on “Create New Chart”. This screen is shown below.  </p><img src="/2025/01/06/Creating-a-realtime-data-platform-visualization/screen_4.png" class="">  <p>Once on the screen to create the chart, we’ll add the <code>dt</code> column to the X-axis and <code>SUM(pct)</code> to the Y-axis. 
Since there’s only one value for each day, <code>SUM</code> will return the same value. Clicking on “Update Chart” on the bottom causes a query to be fired to Trino and the result to be rendered as a line chart. It looks as follows.  </p><img src="/2025/01/06/Creating-a-realtime-data-platform-visualization/screen_5.png" class="">  <p>Click on “Save” on the top-right. Give the chart a name and associate it with the dashboard we just created. It’ll take you to the dashboard with the chart rendered in it. Clicking on “Edit Dashboard” will allow you to change the size of the chart. Go ahead and increase its width. Saving the chart will make your dashboard look as follows.  </p><img src="/2025/01/06/Creating-a-realtime-data-platform-visualization/screen_6.png" class="">  <p>At this point, we can create more datasets and charts. I’ve created two more - one for displaying the data for the line chart as a table, and another for displaying the top customers. My final dashboard looks as follows.  </p><img src="/2025/01/06/Creating-a-realtime-data-platform-visualization/screen_7.png" class=""><p>The table and the line-chart are created from a view we’d created on top of Trino. Superset caches the results of the query for faster rendering. We can refresh the charts to see the latest data. Clicking on the three dots on the top-right of any chart will show the “Force refresh” option. Clicking on it will cause Superset to query Trino again to fetch the latest data. Since Pinot is getting updated in realtime and we have a view backing the chart, we’ll get the realtime numbers in the dashboard. We can make this happen automatically. Let’s go ahead and do that.  </p><p>Click on “Edit dashboard”. Click on the three dots on the top-right and click on “Set auto-refresh interval”. In the pop-up that shows, select a frequency. Click “Save” to save this setting. Click “Save” one more time to save the dashboard. 
This will make the chart update in realtime without requiring the user to manually refresh it.  </p><p>That’s it. That’s how you can create a realtime dashboard on top of Pinot using Superset and Trino.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;In the &lt;a href=&quot;/2025/01/04/Creating-a-realtime-data-platform-orchestration/&quot;&gt;previous post&lt;/a&gt; we saw how we can use Airflow to create d
      
    
    </summary>
    
    
      <category term="architecture" scheme="http://fasihkhatib.com/tags/architecture/"/>
    
  </entry>
  
  <entry>
    <title>Creating a realtime data platform - orchestration</title>
    <link href="http://fasihkhatib.com/2025/01/04/Creating-a-realtime-data-platform-orchestration/"/>
    <id>http://fasihkhatib.com/2025/01/04/Creating-a-realtime-data-platform-orchestration/</id>
    <published>2025-01-04T14:06:02.000Z</published>
    <updated>2025-01-06T16:06:51.667Z</updated>
    
    <content type="html"><![CDATA[<p>In the <a href="/2025/01/02/Creating-a-realtime-data-platform-SQL/">previous post</a> we looked at how to query Pinot. We queried it using its REST API, the console, and Trino. In this post we’re going to look at how to use Apache Airflow to periodically create datasets on top of the data that’s been stored in Pinot. Creating datasets allows us to reference them in data visualization tools for quicker rendering. We’ll leverage Trino’s query federation to store the resultant dataset in S3 so that it can be queried using the Hive connector.</p><h2 id="Getting-started"><a href="#Getting-started" class="headerlink" title="Getting started"></a>Getting started</h2><p>Let’s say that the marketing team wants to run email campaigns where the users who actively place orders are given a promotional discount. The dataset they need contains the details of the user like their name and email so that the communication sent out can be personalised. We can write the following query in the source Postgres database to see what the dataset would look like.  
</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span></span><br><span class="line">    u.id,</span><br><span class="line">    u.first_name,</span><br><span class="line">    u.email,</span><br><span class="line">    <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">AS</span> COUNT</span><br><span class="line"><span class="keyword">FROM</span></span><br><span class="line">    public.user <span class="keyword">AS</span> u</span><br><span class="line">    <span class="keyword">INNER</span> <span class="keyword">JOIN</span> orders <span class="keyword">AS</span> o <span class="keyword">ON</span> u.id <span class="operator">=</span> o.user_id</span><br><span class="line"><span class="keyword">WHERE</span></span><br><span class="line">    o.created_at <span class="operator">&gt;=</span> NOW() <span class="operator">-</span> <span class="type">INTERVAL</span> <span class="string">'30'</span> <span class="keyword">DAY</span></span><br><span class="line"><span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span></span><br><span class="line"><span class="keyword">ORDER</span> <span class="keyword">BY</span> <span class="number">4</span> <span class="keyword">DESC</span>;</span><br></pre></td></tr></table></figure><p>This gives us the following result. 
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">| id | first_name | email                       | count |</span><br><span class="line">|----|------------|-----------------------------|-------|</span><br><span class="line">|  5 | Michelle   | gilljasmine@example.com     |  4906 |</span><br><span class="line">|  1 | Alejandra  | wilcoxstephanie@example.org |  4904 |</span><br><span class="line">|  3 | Hailey     | james97@example.com         |  4877 |</span><br><span class="line">|  4 | Michelle   | ivillanueva@example.com     |  4872 |</span><br><span class="line">|  2 | Brandon    | julie33@example.com         |  4846 |</span><br></pre></td></tr></table></figure><p>For us to create this dataset using Trino, we’ll have to ingest the <code>user</code> table into Pinot. Like we did in the earlier posts, we’ll use Debezium to stream the rows. We’ll skip repeating the steps here since we’ve already seen them. Instead, we’ll move to writing the query in Trino. Once we’ve written the query, we’ll look at how to use Airflow to run it periodically. I’ve also updated the <code>orders</code> table to extract the <code>user_id</code> column out of the <code>source</code> payload. Let’s translate the query written for Postgres to Trino.  
</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> u.id,</span><br><span class="line">       u.first_name,</span><br><span class="line">       u.email,</span><br><span class="line">       <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">AS</span> count</span><br><span class="line"><span class="keyword">FROM</span> pinot.default.user <span class="keyword">AS</span> u</span><br><span class="line">     <span class="keyword">INNER</span> <span class="keyword">JOIN</span> pinot.default.orders <span class="keyword">AS</span> o</span><br><span class="line">     <span class="keyword">ON</span> o.user_id <span class="operator">=</span> u.id</span><br><span class="line"><span class="keyword">WHERE</span> FROM_UNIXTIME(<span class="built_in">CAST</span>(o.created_at <span class="keyword">AS</span> <span class="keyword">DOUBLE</span>) <span class="operator">/</span> <span class="number">1e3</span>) <span class="operator">&gt;=</span> NOW() <span class="operator">-</span> <span class="type">INTERVAL</span> <span class="string">'30'</span> <span class="keyword">DAY</span></span><br><span class="line"><span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span></span><br><span class="line"><span class="keyword">ORDER</span> <span class="keyword">BY</span> <span class="number">4</span> <span class="keyword">DESC</span>;</span><br></pre></td></tr></table></figure><p>We saw earlier that we’d like to save the result of this query so that 
the marketing team could use it. We can do that by storing the results in a table created using the above <code>SELECT</code> statement. Let’s create a schema in Hive called <code>datasets</code> where we’ll store the results. The following query creates the schema.  </p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">CREATE</span> SCHEMA hive.datasets</span><br><span class="line"><span class="keyword">WITH</span> (</span><br><span class="line">    "location" <span class="operator">=</span> <span class="string">'s3://apache-pinot-hive/datasets'</span></span><br><span class="line">);</span><br></pre></td></tr></table></figure><p>We can now create the table using the above <code>SELECT</code> statement.  </p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">CREATE</span> <span class="keyword">TABLE</span> hive.datasets.top_users <span class="keyword">AS</span> </span><br><span class="line"><span class="keyword">SELECT</span> u.id,</span><br><span class="line">       u.first_name,</span><br><span class="line">       u.email,</span><br><span class="line">       <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">AS</span> count</span><br><span class="line"><span class="keyword">FROM</span> pinot.default.user <span class="keyword">AS</span> u</span><br><span class="line">     <span class="keyword">INNER</span> <span 
class="keyword">JOIN</span> pinot.default.orders <span class="keyword">AS</span> o</span><br><span class="line">     <span class="keyword">ON</span> o.user_id <span class="operator">=</span> u.id</span><br><span class="line"><span class="keyword">WHERE</span> FROM_UNIXTIME(<span class="built_in">CAST</span>(o.created_at <span class="keyword">AS</span> <span class="keyword">DOUBLE</span>) <span class="operator">/</span> <span class="number">1e3</span>) <span class="operator">&gt;=</span> NOW() <span class="operator">-</span> <span class="type">INTERVAL</span> <span class="string">'30'</span> <span class="keyword">DAY</span></span><br><span class="line"><span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span>;</span><br></pre></td></tr></table></figure><p>We can now query the table using the query that follows.  </p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> <span class="operator">*</span></span><br><span class="line"><span class="keyword">FROM</span> hive.datasets.top_users;</span><br></pre></td></tr></table></figure><p>Having seen how to create datasets as tables using the Trino CLI, let’s see how we can do the same using Airflow. Let’s say that the marketing team requires the data to be regenerated every day. We can schedule an Airflow DAG to run daily to recreate this dataset. Briefly, Airflow is a task orchestrator. An orchestrator allows creating and executing workflows expressed as directed acyclic graphs (DAGs). Each workflow consists of multiple tasks, which form the nodes in the graph, and the edges between the tasks indicate the order in which they run.  </p><p>We’ll create a workflow which recreates the dataset daily. 
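</p><p>To make the idea of a DAG concrete before we write the Airflow version, here is a minimal sketch in plain Python that uses the standard library’s <code>graphlib</code> module to order the two tasks of our workflow. This is only an illustration and not how Airflow is implemented; the <code>DEPENDENCIES</code> mapping and the <code>run_order</code> helper are made up for this example.</p>

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key maps to the set of tasks it depends on.
# The names mirror the workflow described in this post but are only illustrative.
DEPENDENCIES = {
    "drop_top_users": set(),
    "create_top_users": {"drop_top_users"},
}


def run_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for task in run_order(DEPENDENCIES):
        print(f"running {task}")
```

<p>The dependency of one task on the other is an edge in the graph, so the table is always dropped before it is created again.</p><p>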
In a nutshell, the workflow first drops the older table and then recreates a newer one. This is because the Trino connector does not allow creating or replacing a table as a single atomic operation. We’ll leverage Airflow’s SQL operator and templating mechanism to create and execute queries. The following shows the files and folders we’ll be working with as we create the DAG.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">$ tree airflow/dags</span><br><span class="line">airflow/dags</span><br><span class="line">├── __init__.py</span><br><span class="line">├── create_top_users_dataset.py</span><br><span class="line">└── sql</span><br><span class="line">    ├── common</span><br><span class="line">    │   └── drop_table.sql</span><br><span class="line">    └── datasets</span><br><span class="line">        └── top_users.sql</span><br></pre></td></tr></table></figure><p>Let’s begin with the <code>drop_table.sql</code> query. This is a templated query which allows dropping a table. Writing queries as templates allows us to leverage the Jinja2 templating engine that comes with Airflow. We can create queries depending on the parameters passed to the operator. This allows reusing the same template across multiple tasks. The variables are enclosed in two pairs of braces and the name of the variable is written between them. The content of the file is shown below.  
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">DROP TABLE IF EXISTS {{ params.name }} </span><br></pre></td></tr></table></figure><p>As we’ll see shortly, we’ll pass parameters to the Airflow task executing this query which are available in the <code>params</code> dictionary in the template. In the query above, we’d have to pass the <code>name</code> variable which contains the name of the table we’d like to drop. The <code>top_users.sql</code> file contains the same query we’ve seen above so we’ll move on to setting up the DAG.  </p><p>The DAG is written in the file <code>create_top_users_dataset.py</code> and contains two tasks. First, to drop the table, which uses the templated query above. Second, to create the dataset. Both of these tasks use the <code>SQLExecuteQueryOperator</code>. Its parameters include a connection ID, which is used to connect to the database, the path to the SQL file containing the templated query to execute, and the values for the parameters. The contents of the file are shown below.  
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> pendulum</span><br><span class="line"><span class="keyword">from</span> airflow.providers.common.sql.operators.sql <span class="keyword">import</span> SQLExecuteQueryOperator</span><br><span class="line"></span><br><span class="line"><span class="keyword">from</span> airflow <span class="keyword">import</span> DAG</span><br><span class="line"></span><br><span class="line">dag = DAG(</span><br><span class="line">    dag_id=<span class="string">"create_daily_datasets"</span>,</span><br><span class="line">    catchup=<span class="literal">False</span>,</span><br><span class="line">    schedule=<span class="string">"@daily"</span>,</span><br><span class="line">    start_date=pendulum.now(<span class="string">"GMT"</span>),</span><br><span class="line">)</span><br><span class="line"></span><br><span class="line">drop_table = SQLExecuteQueryOperator(</span><br><span class="line">    task_id=<span 
class="string">"drop_top_users"</span>,</span><br><span class="line">    conn_id=<span class="string">"trino"</span>,</span><br><span class="line">    params={<span class="string">"name"</span>: <span class="string">"hive.datasets.top_users"</span>},</span><br><span class="line">    sql=<span class="string">"sql/common/drop_table.sql"</span>,</span><br><span class="line">    dag=dag,</span><br><span class="line">)</span><br><span class="line"></span><br><span class="line">create_table = SQLExecuteQueryOperator(</span><br><span class="line">    task_id=<span class="string">"create_top_users"</span>,</span><br><span class="line">    conn_id=<span class="string">"trino"</span>,</span><br><span class="line">    sql=<span class="string">"sql/datasets/top_users.sql"</span>,</span><br><span class="line">    dag=dag,</span><br><span class="line">)</span><br><span class="line"></span><br><span class="line"><span class="comment"># -- Dependencies between tasks</span></span><br><span class="line">drop_table &gt;&gt; create_table</span><br></pre></td></tr></table></figure><p>We define the DAG on line 6 and specify that it will execute daily. Lines 13 and 21 define the tasks within the DAG. The first task drops the table, if it exists. The second task creates the table again. On line 29 we define the relationship between the tasks. Both of the tasks are of type <code>SQLExecuteQueryOperator</code>. This operator allows executing arbitrary SQL queries by connecting to a database. The connection to the database is specified as the <code>conn_id</code>; we’ve specified it as <code>trino</code>. As we’ll see next, we need to create the connection using the Airflow console. Once we create the connection, we can execute the DAG.</p><p>The Airflow UI is available on <a href="http://localhost:8080">http://localhost:8080</a>. From there we’ll click on ‘Admin’ up top, and then ‘Connections’. From there we’ll click on the ‘+’ icon to create a connection. 
The screenshot below shows what we need to fill in to make the connection. In my setup, Trino runs with the hostname <code>trino</code> and you’ll have to replace this to match what you have. Once the details are entered, we’ll click the ‘Save’ button at the bottom to create the connection.</p><img src="/2025/01/04/Creating-a-realtime-data-platform-orchestration/conn_1.png" class=""><p>Finally, we can trigger the DAG. We’ll click on the ‘DAGs’ option on the top-left side of the screen. This will show us the list of DAGs available. From there, we’ll click the play button for the <code>create_daily_datasets</code> DAG. This will trigger and run the DAG. We’ll wait for the DAG to finish running. Assuming everything works correctly, we will have created a table in S3. Leaving the DAG in the enabled state causes it to run on the specified schedule; in this case it is daily. As the DAG continues to run, it’ll drop and recreate the table daily.  </p><p>That’s it on orchestrating tasks to create datasets on top of Pinot.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;In the &lt;a href=&quot;/2025/01/02/Creating-a-realtime-data-platform-SQL/&quot;&gt;previous post&lt;/a&gt; we looked at how to query Pinot. We queried it usin
      
    
    </summary>
    
    
      <category term="architecture" scheme="http://fasihkhatib.com/tags/architecture/"/>
    
  </entry>
  
  <entry>
    <title>Creating a realtime data platform - SQL</title>
    <link href="http://fasihkhatib.com/2025/01/02/Creating-a-realtime-data-platform-SQL/"/>
    <id>http://fasihkhatib.com/2025/01/02/Creating-a-realtime-data-platform-SQL/</id>
    <published>2025-01-02T07:06:42.000Z</published>
    <updated>2025-01-04T02:51:03.350Z</updated>
    
    <content type="html"><![CDATA[<p>In the <a href="/2024/12/27/Creating-a-realtime-data-platform-nullability/">previous post</a> we looked at nullability and how Pinot requires that a default value be specified in place of an actual null. In this post we’ll begin looking at how to query data stored in Pinot. We’ll begin by querying Pinot using its API and query console. Then, we’ll query Pinot using Trino. We’ll use Trino’s query federation capability to create views and tables on top of the data that’s stored in Pinot.</p><h2 id="Getting-started"><a href="#Getting-started" class="headerlink" title="Getting started"></a>Getting started</h2><p>Pinot provides an SQL interface for writing queries that’s built on top of the Apache Calcite query parser. It ships with two query engines - the single-stage engine called v1 and the multi-stage engine called v2. The single-stage engine allows writing simpler SQL queries that do not involve joins or window functions. The multi-stage query engine allows more complex queries involving joins on distributed tables, window functions, common table expressions, and much more. It’s optimized for in-memory processing latency. Queries can be submitted to Pinot using the query console, REST API, or Trino. When querying Pinot using either the API or query console, we need to explicitly enable the multi-stage engine.  </p><p>We’ll begin by writing SQL queries and submitting them first using the API and then using the query console. For queries that require a lot of data shuffling or data that spills to disk, it is recommended to use Presto or Trino. Let’s start by writing a simple SQL query that retrieves the user agent from the orders table. The SQL query to do this is given below.  
</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> user_agent <span class="keyword">FROM</span> orders LIMIT <span class="number">1</span>;</span><br></pre></td></tr></table></figure><p>We’ll create a file called <code>query.json</code> which contains the payload that we’ll POST to Pinot. It contains the SQL query and options to indicate to Pinot that we’d like to use the multi-stage engine to execute the query. The content of the file is given below.  </p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">    <span class="attr">"sql"</span><span class="punctuation">:</span> <span class="string">"SELECT user_agent FROM orders LIMIT 1;"</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"trace"</span><span class="punctuation">:</span> <span class="literal"><span class="keyword">false</span></span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"queryOptions"</span><span class="punctuation">:</span> <span class="string">"useMultistageEngine=true"</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure>  <p>We can now POST this payload to the appropriate endpoint using curl.  
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl -s -d @query.json localhost:9000/sql | jq ".resultTable"</span><br></pre></td></tr></table></figure>  <p>The response is returned as a large JSON object but the part we’re interested in is stored in the key called <code>resultTable</code>. It contains the names of the columns returned, the values of the columns, and their datatypes. The following shows the result returned for the query that we’ve submitted above.  </p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">  <span class="attr">"dataSchema"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line">    <span class="attr">"columnNames"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">      <span class="string">"user_agent"</span></span><br><span class="line">    <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"columnDataTypes"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">      <span class="string">"STRING"</span></span><br><span class="line">    <span class="punctuation">]</span></span><br><span class="line">  <span 
class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"rows"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">[</span></span><br><span class="line">      <span class="string">"Mozilla/5.0 (Android 8.1.0; Mobile; rv:123.0) Gecko/123.0 Firefox/123.0"</span></span><br><span class="line">    <span class="punctuation">]</span></span><br><span class="line">  <span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure><p>We’ll now look at writing SQL queries using the query console. Let’s write a SQL query which counts the number of orders placed each day. To do this, we’d have to convert the <code>created_at</code> column from milliseconds to date and then run a group by to find the count of the orders that have been placed. The following query gives us the desired result.  
</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> </span><br><span class="line">    TODATETIME(created_at, <span class="string">'yyyy-MM-dd'</span>) <span class="keyword">AS</span> dt,</span><br><span class="line">    <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">AS</span> count</span><br><span class="line"><span class="keyword">FROM</span> orders</span><br><span class="line"><span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span></span><br><span class="line"><span class="keyword">ORDER</span> <span class="keyword">BY</span> <span class="number">1</span> <span class="keyword">ASC</span>;</span><br></pre></td></tr></table></figure><p>We can run this in the console using the single-stage engine since it is one of the simpler queries. To do this, we’ll paste the query in the query console and leave the “Use Multi-Stage Engine” checkbox unchecked. The result of running the query is shown in the screenshot below.  </p><img src="/2025/01/02/Creating-a-realtime-data-platform-SQL/query_1.png" class="">  <p>We’ll now write a query that requires the multi-stage engine. SQL features such as window functions, common table expressions, and joins can only be executed using the multi-stage engine. We’ll write a query which finds the top five user agents and ranks them. This requires a common table expression and a window function, which makes it a good candidate for the multi-stage engine. The query is shown below. 
</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">WITH</span> ua <span class="keyword">AS</span> (</span><br><span class="line">    <span class="keyword">SELECT</span> </span><br><span class="line">        user_agent, </span><br><span class="line">        <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">AS</span> count,</span><br><span class="line">        <span class="built_in">RANK</span>() <span class="keyword">OVER</span> (<span class="keyword">ORDER</span> <span class="keyword">BY</span> <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">DESC</span>) <span class="keyword">AS</span> rank</span><br><span class="line">    <span class="keyword">FROM</span> orders</span><br><span class="line">    <span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span></span><br><span class="line">)</span><br><span class="line"><span class="keyword">SELECT</span> user_agent, count</span><br><span class="line"><span class="keyword">FROM</span> ua</span><br><span class="line"><span class="keyword">WHERE</span> rank <span class="operator">&lt;=</span> <span class="number">5</span></span><br><span class="line"><span class="keyword">ORDER</span> <span class="keyword">BY</span> rank;</span><br></pre></td></tr></table></figure><p>The result of running this query is shown below. Notice how the checkbox to use the multi-stage engine is checked.  
</p><img src="/2025/01/02/Creating-a-realtime-data-platform-SQL/query_2.png" class=""><p>Having seen how to query Pinot using the API and the query console, with either the single-stage or the multi-stage engine, we’ll move on to querying Pinot using Trino. We’ll begin by connecting to Trino using its CLI utility and creating a catalog which connects to our Pinot instance. Then, we’ll run queries using Trino. We’ll also see how we can leverage the query federation provided by Trino to connect to AWS Glue and create views and tables on top of the data stored in Pinot.  </p><p>Let’s start by connecting to Trino. The following command shows how to connect to the Trino instance running as a Docker container using its command-line utility. You can follow the <a href="https://trino.io/docs/current/client/cli.html">instructions mentioned in the official documentation</a> to set up the CLI.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">./trino http://localhost:9080</span><br></pre></td></tr></table></figure><p>Every database in Trino that we’d like to connect to is configured using a catalog. A catalog is a collection of properties that specify how to connect to the database. We’ll begin by creating a catalog which allows us to query Pinot.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">CREATE CATALOG pinot USING pinot </span><br><span class="line">WITH (</span><br><span class="line">    "pinot.controller-urls" = 'pinot-controller:9000'</span><br><span class="line">);</span><br></pre></td></tr></table></figure><p>The <code>CREATE CATALOG</code> command creates a catalog. 
It takes the name of the catalog, which we’ve specified as <code>pinot</code>, and the name of the connector which connects to the database, which we’ve also specified as <code>pinot</code>. The <code>WITH</code> section specifies properties that are required to connect to the database. We’ve specified the URL of the controller. Once the catalog is created, we can begin querying the tables in the database. The tables in Pinot are stored in the <code>default</code> schema of the <code>pinot</code> connector. To be able to query these, we’ll have to <code>USE</code> the catalog and schema. The following command sets the schema for the current session.  </p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">USE pinot.default;</span><br></pre></td></tr></table></figure><p>To view the tables in the current schema, we’ll execute the <code>SHOW TABLES</code> command.  </p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">trino:<span class="keyword">default</span><span class="operator">&gt;</span> <span class="keyword">SHOW</span> TABLES;</span><br><span class="line"> <span class="keyword">Table</span></span><br><span class="line"><span class="comment">--------</span></span><br><span class="line"> orders</span><br><span class="line">(<span class="number">1</span> <span class="type">row</span>)</span><br></pre></td></tr></table></figure><p>Let’s build upon the query that we wrote previously which calculates the count of the orders placed on a given day. Let’s say we’d like to find the percentage change between the number of orders placed on a given day and the day prior. 
We can do this using the <code>LAG</code> window function which will allow us to access the value of the prior row. The following query shows how to calculate this.  </p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">WITH</span> <span class="keyword">FUNCTION</span> div(x <span class="keyword">DOUBLE</span>, y <span class="keyword">DOUBLE</span>)</span><br><span class="line">  <span class="keyword">RETURNS</span> <span class="keyword">DOUBLE</span></span><br><span class="line">  <span class="keyword">RETURN</span> x <span class="operator">/</span> y</span><br><span class="line"><span class="keyword">WITH</span> ua <span class="keyword">AS</span> (</span><br><span class="line">  <span class="keyword">SELECT</span> <span class="built_in">CAST</span>(FROM_UNIXTIME(created_at <span class="operator">/</span> <span class="number">1e3</span>) <span class="keyword">AS</span> <span class="type">DATE</span>) <span class="keyword">AS</span> dt,</span><br><span class="line">         <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">AS</span> count</span><br><span class="line">  <span class="keyword">FROM</span> orders</span><br><span class="line">  <span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span>     </span><br><span class="line">)</span><br><span class="line"><span class="keyword">SELECT</span> dt,</span><br><span class="line">       count,</span><br><span 
class="line">       <span class="built_in">LAG</span>(count, <span class="number">1</span>) <span class="keyword">OVER</span> (<span class="keyword">ORDER</span> <span class="keyword">BY</span> dt <span class="keyword">ASC</span>) <span class="keyword">AS</span> prev_count,</span><br><span class="line">       ROUND(DIV(count,  <span class="built_in">LAG</span>(count, <span class="number">1</span>) <span class="keyword">OVER</span> (<span class="keyword">ORDER</span> <span class="keyword">BY</span> dt <span class="keyword">ASC</span>)), <span class="number">2</span>) <span class="keyword">AS</span> pct</span><br><span class="line"><span class="keyword">FROM</span> ua;</span><br></pre></td></tr></table></figure><p>There’s a lot going on in the query above so let’s break it down. We begin by defining an inline function called <code>DIV</code> which performs division on two numbers. The reason for writing this function is that dividing one integer by another returns only the integer part of the quotient. To get the quotient as a decimal, we’d have to cast the values to <code>DOUBLE</code>. The function does just that by declaring both of its parameters as <code>DOUBLE</code>. In the common table expression, we calculate the number of orders placed each day. Finally, in the <code>SELECT</code> statement, we find the number of orders placed on the day prior using the <code>LAG</code> window function and divide the current day’s count by it to get <code>pct</code>.  </p><p>Running the query gives us the following result.  
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">     dt     | count | prev_count | pct</span><br><span class="line">------------+-------+------------+------</span><br><span class="line"> 2024-12-29 | 11610 |       NULL | NULL</span><br><span class="line"> 2024-12-30 | 13849 |      11610 | 1.19</span><br><span class="line"> 2024-12-31 | 13649 |      13849 | 0.99</span><br></pre></td></tr></table></figure><p>Let’s now say that we’d like to run the query frequently. Perhaps we’d like to display the table in a visualization tool. One way to do this would be to store the query as a view. However, Pinot does not support views. We’d have to work around this by relying on Trino’s query federation to write the result to another data store which supports creating views. For the sake of this post, we’ll use AWS Glue as a replacement for HDFS.</p><p>Let’s start with creating a Hive catalog in which we use S3 as the backing store.  
</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">CREATE</span> CATALOG hive <span class="keyword">USING</span> hive</span><br><span class="line"><span class="keyword">WITH</span> (</span><br><span class="line">    "hive.metastore" <span class="operator">=</span> <span class="string">'glue'</span>,</span><br><span class="line">    "hive.recursive-directories" <span class="operator">=</span> <span class="string">'true'</span>,</span><br><span class="line">    "hive.storage-format" <span class="operator">=</span> <span class="string">'PARQUET'</span>,</span><br><span class="line">    "hive.insert-existing-partitions-behavior" <span class="operator">=</span> <span class="string">'APPEND'</span>,</span><br><span class="line">    "fs.native-s3.enabled" <span class="operator">=</span> <span class="string">'true'</span>,</span><br><span class="line">    "s3.endpoint" <span class="operator">=</span> <span class="string">'https://s3.us-east-1.amazonaws.com'</span>,</span><br><span class="line">    "s3.region" <span class="operator">=</span> <span class="string">'us-east-1'</span>,</span><br><span class="line">    "s3.aws-access-key" <span class="operator">=</span> <span class="string">'...'</span>,</span><br><span class="line">    "s3.aws-secret-key" <span class="operator">=</span> <span class="string">'...'</span></span><br><span class="line">);</span><br></pre></td></tr></table></figure>  <p>In the query above, we’re creating the Hive catalog which we’ll use to create and store datasets. 
Since we’re storing files in S3, we need to specify the region and its S3 endpoint. You’d have to replace the keys with your own if you’re running the example locally.  </p><p>Once we create the catalog, we’ll have to create the schema where datasets will be stored. Schemas are created with the <code>CREATE SCHEMA</code> command and require that we provide a path in an S3 bucket where the files will be stored. The query below shows how to create a schema named <code>views</code>.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">CREATE SCHEMA hive.views</span><br><span class="line">WITH (</span><br><span class="line">    "location" = 's3://apache-pinot-hive/views'</span><br><span class="line">);</span><br></pre></td></tr></table></figure><p>Once the schema is created, we can persist the query or its results for quicker access. 
In the query that follows, we create a view on top of Pinot.</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">CREATE</span> <span class="keyword">VIEW</span> hive.views.daily_orders <span class="keyword">AS</span></span><br><span class="line"><span class="keyword">WITH</span> ua <span class="keyword">AS</span> (</span><br><span class="line">  <span class="keyword">SELECT</span> <span class="built_in">CAST</span>(FROM_UNIXTIME(created_at <span class="operator">/</span> <span class="built_in">CAST</span>(<span class="number">1e3</span> <span class="keyword">AS</span> <span class="keyword">DOUBLE</span>)) <span class="keyword">AS</span> <span class="type">DATE</span>) <span class="keyword">AS</span> dt,</span><br><span class="line">         <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">AS</span> count</span><br><span class="line">  <span class="keyword">FROM</span> pinot.default.orders</span><br><span class="line">  <span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span>     </span><br><span class="line">)</span><br><span class="line"><span class="keyword">SELECT</span> dt,</span><br><span class="line">       count,</span><br><span class="line">       <span class="built_in">LAG</span>(count, <span class="number">1</span>) <span class="keyword">OVER</span> (<span class="keyword">ORDER</span> <span class="keyword">BY</span> dt <span class="keyword">ASC</span>) <span class="keyword">AS</span> prev_count,</span><br><span 
class="line">       ROUND(<span class="built_in">CAST</span>(count <span class="keyword">AS</span> <span class="keyword">DOUBLE</span>) <span class="operator">/</span> <span class="built_in">LAG</span>(count, <span class="number">1</span>) <span class="keyword">OVER</span> (<span class="keyword">ORDER</span> <span class="keyword">BY</span> dt <span class="keyword">ASC</span>), <span class="number">2</span>) <span class="keyword">AS</span> pct</span><br><span class="line"><span class="keyword">FROM</span> ua;</span><br></pre></td></tr></table></figure><p>We can query the view once it is created. This allows us to save the queries so that they can be referenced in data visualization tools. In the view above, we’ll get the realtime difference between orders placed since Pinot will be queried every time we select from the view. A small caveat to note is that views cannot store inline functions so we had to cast one of the operands in the division operation to double manually.  </p><p>To see this in action, we’ll run the data generation script one more time and then query the view. We can see that the counts for orders placed today, yesterday, and day before yesterday increase after the script runs.  
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">trino&gt; SELECT * FROM hive.views.daily_orders;</span><br><span class="line">     dt     | count | prev_count | pct</span><br><span class="line">------------+-------+------------+------</span><br><span class="line"> 2024-12-29 | 11610 |       NULL | NULL</span><br><span class="line"> 2024-12-30 | 13849 |      11610 | 1.19</span><br><span class="line"> 2024-12-31 | 26281 |      13849 |  1.9</span><br><span class="line"> 2025-01-01 | 15033 |      26281 | 0.57</span><br><span class="line"> 2025-01-02 | 12266 |      15033 | 0.82</span><br></pre></td></tr></table></figure><p>Creating views like this provides us with a way to save queries that we’d like to run frequently. As we’ll see in later posts, these can also be referenced in data visualization tools to provide realtime analytics.</p><p>That’s it on how to run SQL on Pinot.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;In the &lt;a href=&quot;/2024/12/27/Creating-a-realtime-data-platform-nullability/&quot;&gt;previous post&lt;/a&gt; we looked at nullability and how Pinot requ
      
    
    </summary>
    
    
      <category term="architecture" scheme="http://fasihkhatib.com/tags/architecture/"/>
    
  </entry>
  
  <entry>
    <title>Creating a realtime data platform - nullability</title>
    <link href="http://fasihkhatib.com/2024/12/27/Creating-a-realtime-data-platform-nullability/"/>
    <id>http://fasihkhatib.com/2024/12/27/Creating-a-realtime-data-platform-nullability/</id>
    <published>2024-12-27T13:46:15.000Z</published>
    <updated>2024-12-27T13:46:15.464Z</updated>
    
    <content type="html"><![CDATA[<p>In the <a href="/2024/12/26/Creating-a-realtime-data-platform-evolving-the-schema/">previous post</a> we looked at evolving the schema. We briefly discussed handling null values in columns when we added the <code>user_agent</code> column. It allows null values since <code>enableColumnBasedNullHandling</code> is set to true. However, we weren’t able to see nullability in action since that column always had values in it. In this post we’ll evolve the schema one more time and add columns that have null values in them. We’ll see how to handle null values in Pinot queries, and how they differ from nulls in other databases. Let’s dive right in.</p><h2 id="Getting-started"><a href="#Getting-started" class="headerlink" title="Getting started"></a>Getting started</h2><p>We’ll begin by looking at the <code>source</code> payload that we’ve stored in Pinot.</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">    <span class="attr">"user_id"</span><span class="punctuation">:</span> <span class="number">4</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"cafe_id"</span><span class="punctuation">:</span> <span class="number">27</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"address_id"</span><span class="punctuation">:</span> <span class="number">4</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"created_at"</span><span class="punctuation">:</span> <span class="number">1735211553094</span><span 
class="punctuation">,</span></span><br><span class="line">    <span class="attr">"id"</span><span class="punctuation">:</span> <span class="number">1</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"user_agent"</span><span class="punctuation">:</span> <span class="string">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"status"</span><span class="punctuation">:</span> <span class="number">0</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure>  <p>In the payload above, we notice that there’s <code>created_at</code> but no <code>updated_at</code> or <code>deleted_at</code>. That’s because these have null values in the source table in Postgres. Let’s update the schema and table definitions to store these fields.   </p><p>To update the schema, we’ll add the following to <code>dateTimeFieldSpecs</code>.  
</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">  <span class="attr">"name"</span><span class="punctuation">:</span> <span class="string">"updated_at"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"dataType"</span><span class="punctuation">:</span> <span class="string">"LONG"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"format"</span><span class="punctuation">:</span> <span class="string">"1:MILLISECONDS:EPOCH"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"granularity"</span><span class="punctuation">:</span> <span class="string">"1:MILLISECONDS"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"defaultNullValue"</span><span class="punctuation">:</span> <span class="string">"-1"</span></span><br><span class="line"><span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"><span class="punctuation">{</span></span><br><span class="line">  <span class="attr">"name"</span><span class="punctuation">:</span> <span class="string">"deleted_at"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"dataType"</span><span class="punctuation">:</span> <span class="string">"LONG"</span><span class="punctuation">,</span></span><br><span class="line">  <span 
class="attr">"format"</span><span class="punctuation">:</span> <span class="string">"1:MILLISECONDS:EPOCH"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"granularity"</span><span class="punctuation">:</span> <span class="string">"1:MILLISECONDS"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"defaultNullValue"</span><span class="punctuation">:</span> <span class="string">"-1"</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure>   <p>In the JSON above, we specify the usual fields just as we did for <code>created_at</code>. We also set <code>defaultNullValue</code>. This value will be used instead of null when these fields are extracted from the <code>source</code> payload. This is different from what you’d usually observe in a database that supports null values. The reason for this is that Pinot uses a <a href="https://docs.pinot.apache.org/basics/indexing/forward-index">forward index</a> to store the values of each column. This index does not support storing null values and instead requires that a value be provided which will be stored in place of null. In our case, we’ve specified <code>-1</code>. The value that we specify as a default must be of the same data type as the column. Since the two fields are of type <code>LONG</code>, specifying <code>-1</code> suffices.  </p><p>We’ll PUT this schema using the following curl command.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl -XPUT -F schemaName=@tables/002-orders/orders_schema.json localhost:9000/schemas/orders | jq .</span><br></pre></td></tr></table></figure><p>Next, we’ll update the table definition by adding a couple of field-level transformations.  
</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">    <span class="attr">"columnName"</span><span class="punctuation">:</span> <span class="string">"updated_at"</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"transformFunction"</span><span class="punctuation">:</span> <span class="string">"jsonPath(source, '$.updated_at')"</span></span><br><span class="line"><span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line"><span class="punctuation">{</span></span><br><span class="line">    <span class="attr">"columnName"</span><span class="punctuation">:</span> <span class="string">"deleted_at"</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"transformFunction"</span><span class="punctuation">:</span> <span class="string">"jsonPath(source, '$.deleted_at')"</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure>  <p>And PUT it using the following curl command.   </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl -XPUT -H 'Content-Type: application/json' -d @tables/002-orders/orders_table.json localhost:9000/tables/orders | jq .</span><br></pre></td></tr></table></figure>  <p>Like we did last time, we’ll reload all the segments using the following curl command.  
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl -XPOST localhost:9000/segments/orders/reload | jq .</span><br></pre></td></tr></table></figure>  <p>Now when we open the query console, we’ll see the table with <code>updated_at</code> and <code>deleted_at</code> fields with their values set to <code>-1</code>.  </p><img src="/2024/12/27/Creating-a-realtime-data-platform-nullability/fields.png" class="">  <p>We know that we have a total of 5000 rows where the <code>deleted_at</code> field is set to null. This can be verified by running a count query in Pinot. This shows that although the values in the column are set to <code>-1</code>, Pinot identifies them as null and returns the correct result.</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> <span class="built_in">COUNT</span>(<span class="operator">*</span>)</span><br><span class="line"><span class="keyword">FROM</span> orders</span><br><span class="line"><span class="keyword">WHERE</span> deleted_at <span class="keyword">IS</span> <span class="keyword">NULL</span>;</span><br></pre></td></tr></table></figure><img src="/2024/12/27/Creating-a-realtime-data-platform-nullability/count.png" class="">   <p><a href="https://docs.pinot.apache.org/developers/advanced/null-value-support#appendix-workarounds-to-handle-null-values-without-storing-nulls">A workaround suggested in the documentation</a> is to use comparison operators to compare against the value used in place of null. For example, the following query will produce the same result as the one shown above.   
</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> <span class="built_in">COUNT</span>(<span class="operator">*</span>)</span><br><span class="line"><span class="keyword">FROM</span> orders</span><br><span class="line"><span class="keyword">WHERE</span> deleted_at <span class="operator">=</span> <span class="number">-1</span>;</span><br></pre></td></tr></table></figure>  <img src="/2024/12/27/Creating-a-realtime-data-platform-nullability/alt.png" class="">  <p>That’s it on how to handle null values in Pinot.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;In the &lt;a href=&quot;/2024/12/26/Creating-a-realtime-data-platform-evolving-the-schema/&quot;&gt;previous post&lt;/a&gt; we looked at evolving the schema. W
      
    
    </summary>
    
    
      <category term="architecture" scheme="http://fasihkhatib.com/tags/architecture/"/>
    
  </entry>
  
  <entry>
    <title>Creating a realtime data platform - evolving the schema</title>
    <link href="http://fasihkhatib.com/2024/12/26/Creating-a-realtime-data-platform-evolving-the-schema/"/>
    <id>http://fasihkhatib.com/2024/12/26/Creating-a-realtime-data-platform-evolving-the-schema/</id>
    <published>2024-12-26T05:15:10.000Z</published>
    <updated>2024-12-26T05:15:10.806Z</updated>
    
    <content type="html"><![CDATA[<p>In the <a href="/2024/12/24/Creating-a-realtime-data-platform-bringing-data-in/">previous post</a> we saw how to ingest the data into Pinot using Debezium. In this post we’re going to see how to evolve the schema of the tables stored in Pinot. We’ll begin with a simple query which computes an aggregate on the user agent column stored within the <code>source</code> payload. Then, we’ll extract the value of user agent column out of the <code>source</code> payload into a column of its own.  </p><h2 id="Getting-started"><a href="#Getting-started" class="headerlink" title="Getting started"></a>Getting started</h2><p>Let’s start with a simple query. Let’s count the number of times each user agent appears in the table and sort it in descending order. We can do this using the following SQL query.</p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> JSON_EXTRACT_SCALAR(source, <span class="string">'$.user_agent'</span>, <span class="string">'STRING'</span>, <span class="string">'...'</span>) <span class="keyword">AS</span> user_agent, </span><br><span class="line">       <span class="built_in">COUNT</span>(<span class="operator">*</span>) <span class="keyword">AS</span> count </span><br><span class="line"><span class="keyword">FROM</span> orders </span><br><span class="line"><span class="keyword">GROUP</span> <span class="keyword">BY</span> <span class="number">1</span> </span><br><span class="line"><span class="keyword">ORDER</span> <span class="keyword">BY</span> <span class="number">2</span> <span class="keyword">DESC</span>;</span><br></pre></td></tr></table></figure>  <p>The result of running this query is shown below.  
</p><img src="/2024/12/26/Creating-a-realtime-data-platform-evolving-the-schema/query_1.png" class="">  <p>Let’s take a closer look at the query. To extract values out of the <code>source</code> column, we use the <code>JSON_EXTRACT_SCALAR()</code> function. It takes the name of the column containing the JSON payload, the path of the value within the payload, the datatype of the value returned, and the value to be used as a replacement for null.  </p><p>For a simple query like this, using <code>JSON_EXTRACT_SCALAR()</code> works. However, it becomes unwieldy when there is more than one column to extract or when writing ad-hoc business analytics queries that join multiple tables on values present within a JSON column. Writing SQL would be easier if we could extract the value out of the JSON payload into a column of its own.  </p><p>To extract values out of the <code>source</code> payload into columns of their own, we’ll have to update the schema and table definitions. We’ll update the schema definition to add new columns, and we’ll update the table definition to extract fields out of the <code>source</code> column using field-level transformations.  </p><p>Let’s begin by updating the schema.  
</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">  <span class="attr">"schemaName"</span><span class="punctuation">:</span> <span class="string">"orders"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"enableColumnBasedNullHandling"</span><span class="punctuation">:</span> <span class="literal"><span class="keyword">true</span></span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"dimensionFieldSpecs"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">{</span></span><br><span class="line">      <span class="attr">"name"</span><span class="punctuation">:</span> <span class="string">"id"</span><span class="punctuation">,</span></span><br><span 
class="line">      <span class="attr">"dataType"</span><span class="punctuation">:</span> <span class="string">"STRING"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"notNull"</span><span class="punctuation">:</span> <span class="literal"><span class="keyword">true</span></span></span><br><span class="line">    <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">{</span></span><br><span class="line">      <span class="attr">"name"</span><span class="punctuation">:</span> <span class="string">"source"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"dataType"</span><span class="punctuation">:</span> <span class="string">"JSON"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"notNull"</span><span class="punctuation">:</span> <span class="literal"><span class="keyword">true</span></span></span><br><span class="line">    <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">{</span></span><br><span class="line">      <span class="attr">"name"</span><span class="punctuation">:</span> <span class="string">"user_agent"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"dataType"</span><span class="punctuation">:</span> <span class="string">"STRING"</span></span><br><span class="line">    <span class="punctuation">}</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"dateTimeFieldSpecs"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">{</span></span><br><span class="line">      <span class="attr">"name"</span><span class="punctuation">:</span> <span 
class="string">"created_at"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"dataType"</span><span class="punctuation">:</span> <span class="string">"LONG"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"format"</span><span class="punctuation">:</span> <span class="string">"1:MILLISECONDS:EPOCH"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"granularity"</span><span class="punctuation">:</span> <span class="string">"1:MILLISECONDS"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"notNull"</span><span class="punctuation">:</span> <span class="literal"><span class="keyword">true</span></span></span><br><span class="line">    <span class="punctuation">}</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"primaryKeyColumns"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="string">"id"</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"metricFieldSpecs"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure>  <p>We’ve added <code>user_agent</code> as a new column under <code>dimensionFieldSpecs</code>. Notice that we’ve set <code>enableColumnBasedNullHandling</code> to true. This allows columns to store null values in them. In Pinot, allowing or disallowing null values is configured per-table. The recommended way is to use column-based null handling where each column is configured to allow or disallow null values. 
This is what we’ve used in our schema above. The <code>id</code>, <code>source</code>, and <code>created_at</code> columns do not allow null values in them since they have <code>notNull</code> set to true. The <code>user_agent</code> column allows null values in it since it is implicitly nullable.  </p><p>We’ll PUT the updated schema using curl.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl -XPUT -F schemaName=@tables/002-orders/orders_schema.json localhost:9000/schemas/orders | jq .</span><br></pre></td></tr></table></figure>  <p>Upon opening the query console we find that there’s an error message. This message indicates that the segments are invalid because they were created using an older version of the schema. We can reload all the segments to fix this error but we’ll get to that in a minute. We’ll first update the table definition.  </p><img src="/2024/12/26/Creating-a-realtime-data-platform-evolving-the-schema/invalid.png" class=""><p>To update the table definition, we’ll add the following field-level transformation to the <code>transformConfigs</code>.  
</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">  <span class="attr">"columnName"</span><span class="punctuation">:</span> <span class="string">"user_agent"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"transformFunction"</span><span class="punctuation">:</span> <span class="string">"jsonPath(source, '$.user_agent')"</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure>  <p>In this transformation we’re extracting the <code>user_agent</code> field into a column with the same name. Notice how we’re referencing the <code>source</code> column instead of the <code>payload</code> emitted by Debezium to get the value. Once we’ve made this change we’ll PUT the new table definition using curl.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl -XPUT -H 'Content-Type: application/json' -d @tables/002-orders/orders_table.json localhost:9000/tables/orders | jq .</span><br></pre></td></tr></table></figure>  <p>Finally, we’ll reload all the segments for this table using the following curl command.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl -XPOST localhost:9000/segments/orders/reload | jq .</span><br></pre></td></tr></table></figure>  <p>Upon opening the query console, we find that a new <code>user_agent</code> column has been added to the table.  
</p><img src="/2024/12/26/Creating-a-realtime-data-platform-evolving-the-schema/segment.png" class="">  <p>It’s common for the table to change with time as new columns are added. Consequently, the schema and table definitions will evolve in Pinot. As we update the schema in Pinot, we have to keep in mind that columns can only be added and not removed. In other words, the schema needs to remain backwards compatible. If you’d like to drop a column or rename it, you’ll have to recreate the table.  </p><p>That’s it for how to evolve schema in Pinot.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;In the &lt;a href=&quot;/2024/12/24/Creating-a-realtime-data-platform-bringing-data-in/&quot;&gt;previous post&lt;/a&gt; we saw how to ingest the data into Pin
      
    
    </summary>
    
    
      <category term="architecture" scheme="http://fasihkhatib.com/tags/architecture/"/>
    
  </entry>
  
  <entry>
    <title>Creating a realtime data platform - bringing data in</title>
    <link href="http://fasihkhatib.com/2024/12/24/Creating-a-realtime-data-platform-bringing-data-in/"/>
    <id>http://fasihkhatib.com/2024/12/24/Creating-a-realtime-data-platform-bringing-data-in/</id>
    <published>2024-12-24T13:45:02.000Z</published>
    <updated>2024-12-25T12:56:47.360Z</updated>
    
    <content type="html"><![CDATA[<p>In the first part we saw the overall design of the system. In the second part we created a dataset that we can work with. In this post we’ll look at the first category of components: the ones that bring the data into the platform. We’ll see how we can stream data from the database using Debezium and store it in Pinot realtime tables.  </p><h2 id="Before-we-begin"><a href="#Before-we-begin" class="headerlink" title="Before we begin"></a>Before we begin</h2><p>The setup is still Dockerized and now has containers for Debezium, Kafka, and Pinot. In a nutshell, we’ll stream data from the Postgres instance into Kafka using Debezium and then write it to Pinot tables.  </p><h2 id="Getting-started"><a href="#Getting-started" class="headerlink" title="Getting started"></a>Getting started</h2><p>In the first part of the series we briefly looked at Debezium. To recap, Debezium is a platform for change data capture. It consists of connectors which capture change data from the database and emit them as events into Kafka. Which database tables to monitor and which Kafka topic to write them to are specified as a part of the connector’s configuration. This configuration is written as a JSON object and sent to a specific endpoint to spawn a new connector.  </p><p>We’ll begin by creating the configuration for a connector which will monitor all the tables in the database and route each of them to a dedicated Kafka topic.  
</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">    <span class="attr">"name"</span><span class="punctuation">:</span> <span class="string">"order_service"</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"config"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line">        <span class="attr">"connector.class"</span><span class="punctuation">:</span> <span class="string">"io.debezium.connector.postgresql.PostgresConnector"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"database.hostname"</span><span class="punctuation">:</span> <span class="string">"db"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"database.user"</span><span class="punctuation">:</span> <span class="string">"postgres"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"database.password"</span><span 
class="punctuation">:</span> <span class="string">"my-secret-pw"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"database.dbname"</span><span class="punctuation">:</span> <span class="string">"postgres"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"database.server.name"</span><span class="punctuation">:</span> <span class="string">"postgres"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"plugin.name"</span><span class="punctuation">:</span> <span class="string">"pgoutput"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"publication.autocreate.mode"</span><span class="punctuation">:</span> <span class="string">"filtered"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"time.precision.mode"</span><span class="punctuation">:</span> <span class="string">"connect"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"tombstones.on.delete"</span><span class="punctuation">:</span> <span class="string">"false"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"snapshot.mode"</span><span class="punctuation">:</span> <span class="string">"no_data"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"heartbeat.interval.ms"</span><span class="punctuation">:</span> <span class="string">"1000"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"transforms"</span><span class="punctuation">:</span> <span class="string">"route"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"transforms.route.type"</span><span class="punctuation">:</span> <span 
class="string">"org.apache.kafka.connect.transforms.RegexRouter"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"transforms.route.regex"</span><span class="punctuation">:</span> <span class="string">"([^.]+)\\.([^.]+)\\.([^.]+)"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"transforms.route.replacement"</span><span class="punctuation">:</span> <span class="string">"$3"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"event.processing.failure.handling.mode"</span><span class="punctuation">:</span> <span class="string">"skip"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"producer.override.compression.type"</span><span class="punctuation">:</span> <span class="string">"snappy"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"signal.data.collection"</span><span class="punctuation">:</span> <span class="string">"debezium.signal"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"topic.prefix"</span><span class="punctuation">:</span> <span class="string">"microservice"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"decimal.handling.mode"</span><span class="punctuation">:</span> <span class="string">"float"</span></span><br><span class="line">    <span class="punctuation">}</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure>  <p>There are two main parts to this configuration - <code>name</code> and <code>config</code>. The <code>name</code> is the name we’ve given to the connector. The <code>config</code> contains the actual configuration of the connector. We specify quite a few things in the <code>config</code> object. 
We specify the class of the connector which is the fully qualified name of the Java class, the credentials to connect to the database, whether or not to take a snapshot, how to route the data to the appropriate Kafka topics, and how to pass signals to Debezium.  </p><p>While most of the configuration is self-explanatory, we’ll look closely at the ones related to snapshot, signalling, and routing. We set the snapshot mode to <code>no_data</code> which means that the connector will not stream historical rows from the database. The only rows that will be emitted are the ones created or updated after the connector began running. We’ll use this setting in conjunction with signals to incrementally snapshot the tables we’re interested in. Signals are a way to modify the behavior of the connector, or to trigger a one-time action like taking an ad-hoc snapshot. When we combine <code>no_data</code> with signals, we can tell Debezium to selectively snapshot the tables we’re interested in. The <code>signal.data.collection</code> property specifies the name of the table which the connector will monitor for any signals that are sent to it.</p><p>Finally, we specify a route transform. We do this by writing a regex which matches against the fully qualified name of the table, and extracts only the table name. This allows us to send the data from every table into a dedicated Kafka topic of its own.  </p><p>Notice how we’ve not specified which tables to monitor. Since it is a Postgres database, the connector will monitor all the tables in all the schemas within the database and stream them. Now that the configuration is created, we’ll POST it to the appropriate endpoint to create the connector.   
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl -H "Content-Type: application/json" -XPOST -d @tables/002-orders/debezium.json localhost:8083/connectors | jq .</span><br></pre></td></tr></table></figure>  <p>Now that the connector is created, we will signal it to initiate a snapshot. Signals are sent to the connector using rows inserted into the table. We’ll execute the following <code>INSERT</code> query to tell the connector to take a snapshot of the <code>orders</code> table.  </p><figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">INSERT</span> <span class="keyword">INTO</span> debezium.signal </span><br><span class="line"><span class="keyword">VALUES</span> (</span><br><span class="line">    gen_random_uuid()::TEXT,</span><br><span class="line">    <span class="string">'execute-snapshot'</span>,</span><br><span class="line">    <span class="string">'{"data-collections": [".*\\.orders"], "type": "incremental"}'</span></span><br><span class="line">);</span><br></pre></td></tr></table></figure>  <p>The row tells the connector to initiate a snapshot, as indicated by <code>execute-snapshot</code>, and stream historical rows from the <code>orders</code> table in all the schemas within the database. It is an incremental snapshot so it will happen in batches. If we <code>docker exec</code> into the Kafka container and use the console consumer, we’ll find that all the rows eventually get streamed to the topic. The command to show it is given below.  
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">[kafka@kafka ~]$ kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic orders --from-beginning | wc -l</span><br><span class="line">^CProcessed a total of 5000 messages</span><br></pre></td></tr></table></figure>  <p>We can compare this with the row count in the table using the following SQL command.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">SELECT COUNT(*) FROM public.orders;</span><br><span class="line">| count |</span><br><span class="line">|-------|</span><br><span class="line">|  5000 |</span><br></pre></td></tr></table></figure>  <p>Now that the data is in Kafka, we’ll move on to how to stream it into a Pinot table. Before we get to that, we’ll look at what a table and schema are in Pinot.  </p><p>A table in Pinot is similar to a table in a relational database. It has rows and columns where each column has a datatype. Tables are where data is stored in Pinot. Every table in Pinot has an associated schema and it is in the schema where the columns and their datatypes are defined. Tables can be realtime, where they store data from a streaming source such as Kafka. They can be offline, where they load data from batch sources. Or they can be hybrid, where they load data from both a batch source and a streaming source. Both the schema and table are defined as JSON.  </p><p>Let’s start by creating the schema.  
</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">  <span class="attr">"schemaName"</span><span class="punctuation">:</span> <span class="string">"orders"</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"enableColumnBasedNullHandling"</span><span class="punctuation">:</span> <span class="literal"><span class="keyword">true</span></span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"dimensionFieldSpecs"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">{</span></span><br><span class="line">      <span class="attr">"name"</span><span class="punctuation">:</span> <span class="string">"id"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"dataType"</span><span class="punctuation">:</span> <span class="string">"STRING"</span></span><br><span class="line">    <span class="punctuation">}</span><span 
class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">{</span></span><br><span class="line">      <span class="attr">"name"</span><span class="punctuation">:</span> <span class="string">"source"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"dataType"</span><span class="punctuation">:</span> <span class="string">"JSON"</span></span><br><span class="line">    <span class="punctuation">}</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"dateTimeFieldSpecs"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">{</span></span><br><span class="line">      <span class="attr">"name"</span><span class="punctuation">:</span> <span class="string">"created_at"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"dataType"</span><span class="punctuation">:</span> <span class="string">"LONG"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"format"</span><span class="punctuation">:</span> <span class="string">"1:MILLISECONDS:EPOCH"</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">"granularity"</span><span class="punctuation">:</span> <span class="string">"1:MILLISECONDS"</span></span><br><span class="line">    <span class="punctuation">}</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">"primaryKeyColumns"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="string">"id"</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">  
<span class="attr">"metricFieldSpecs"</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure>  <p>The schema defines a few things. It defines the name of the schema. This will also become the name of the table. Next, it defines the fields that will be present in the table. We’ve defined <code>id</code>, <code>source</code>, and <code>created_at</code>. The first two are specified in <code>dimensionFieldSpecs</code>, so each becomes a dimension column for any metric. The <code>created_at</code> field is specified in <code>dateTimeFieldSpecs</code> since it specifies a time column; Debezium will send timestamp columns as milliseconds since epoch. We’ve specified <code>id</code> as the primary key. Finally, <code>enableColumnBasedNullHandling</code> allows columns to have null values in them.</p><p>Once the schema is defined, we can create the table configuration.  </p><img src="/2024/12/24/Creating-a-realtime-data-platform-bringing-data-in/table.png" class="">  <p>The configuration of the table is more involved than the schema so we’ll go over it one key at a time. We begin by specifying the <code>tableName</code> as “orders”. This matches the name of the schema. We specify <code>tableType</code> as “REALTIME” since the data we’re going to ingest comes from a Kafka topic. The <code>query</code> key specifies properties related to query execution. The <code>segmentsConfig</code> key specifies properties related to segments like the time column to use for creating a segment. The <code>tenants</code> key specifies the tenants for this table. A tenant is a logical namespace which restricts where the cluster processes queries on the table. The <code>tableIndexConfig</code> defines the indexing-related information for the table. The <code>metadata</code> key specifies the metadata for this table. 
The <code>upsertConfig</code> key specifies configuration for upserting into the table. The <code>ingestionConfig</code> key defines where we’d be ingesting data from and what field-level transformations we’d like to apply. The <code>routing</code> key defines properties that determine how the broker selects the servers to route queries to.  </p><p>The parts of the configuration we’ll specifically look at are the <code>ingestionConfig</code> and <code>upsertConfig</code>. First, <code>ingestionConfig</code>.</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">{</span></span><br><span class="line">  <span class="attr">"ingestionConfig"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span 
class="line">    <span class="attr">"streamIngestionConfig"</span><span class="punctuation">:</span> <span class="punctuation">{</span></span><br><span class="line">      <span class="attr">"streamConfigMaps"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">{</span></span><br><span class="line">          <span class="attr">"realtime.segment.flush.threshold.rows"</span><span class="punctuation">:</span> <span class="string">"0"</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">"stream.kafka.decoder.prop.format"</span><span class="punctuation">:</span> <span class="string">"JSON"</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">"key.serializer"</span><span class="punctuation">:</span> <span class="string">"org.apache.kafka.common.serialization.ByteArraySerializer"</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">"stream.kafka.decoder.class.name"</span><span class="punctuation">:</span> <span class="string">"org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">"streamType"</span><span class="punctuation">:</span> <span class="string">"kafka"</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">"value.serializer"</span><span class="punctuation">:</span> <span class="string">"org.apache.kafka.common.serialization.ByteArraySerializer"</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">"stream.kafka.consumer.type"</span><span class="punctuation">:</span> <span class="string">"LOWLEVEL"</span><span class="punctuation">,</span></span><br><span class="line">          <span 
class="attr">"realtime.segment.flush.threshold.segment.rows"</span><span class="punctuation">:</span> <span class="string">"50000"</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">"stream.kafka.broker.list"</span><span class="punctuation">:</span> <span class="string">"kafka:9092"</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">"realtime.segment.flush.threshold.time"</span><span class="punctuation">:</span> <span class="string">"3600000"</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">"stream.kafka.consumer.factory.class.name"</span><span class="punctuation">:</span> <span class="string">"org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory"</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">"stream.kafka.consumer.prop.auto.offset.reset"</span><span class="punctuation">:</span> <span class="string">"smallest"</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">"stream.kafka.topic.name"</span><span class="punctuation">:</span> <span class="string">"orders"</span></span><br><span class="line">        <span class="punctuation">}</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">"transformConfigs"</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">      <span class="punctuation">{</span></span><br><span class="line">        <span class="attr">"columnName"</span><span class="punctuation">:</span> <span class="string">"id"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"transformFunction"</span><span class="punctuation">:</span> <span 
class="string">"jsonPath(payload, '$.after.id')"</span></span><br><span class="line">      <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line">      <span class="punctuation">{</span></span><br><span class="line">        <span class="attr">"columnName"</span><span class="punctuation">:</span> <span class="string">"source"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"transformFunction"</span><span class="punctuation">:</span> <span class="string">"jsonPath(payload, '$.after')"</span></span><br><span class="line">      <span class="punctuation">}</span><span class="punctuation">,</span></span><br><span class="line">      <span class="punctuation">{</span></span><br><span class="line">        <span class="attr">"columnName"</span><span class="punctuation">:</span> <span class="string">"created_at"</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">"transformFunction"</span><span class="punctuation">:</span> <span class="string">"jsonPath(payload, '$.after.created_at')"</span></span><br><span class="line">      <span class="punctuation">}</span></span><br><span class="line">    <span class="punctuation">]</span></span><br><span class="line">  <span class="punctuation">}</span></span><br><span class="line"><span class="punctuation">}</span></span><br></pre></td></tr></table></figure>  <p>In the <code>ingestionConfig</code> we specify the Kafka topics to read from. In the snippet above, we’ve specified the “orders” topic. We also specify field-level transformations in <code>transformConfigs</code>. Here we extract the <code>id</code>, <code>source</code>, and <code>created_at</code> fields from the JSON payload generated by Debezium.  </p><p>With the schema and table defined, we’ll POST them to the appropriate endpoints using curl. The following two commands create the schema followed by the table.  
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">curl -F schemaName=@tables/002-orders/orders_schema.json localhost:9000/schemas | jq .</span><br><span class="line">curl -XPOST -H 'Content-Type: application/json' -d @tables/002-orders/orders_table.json localhost:9000/tables | jq .</span><br></pre></td></tr></table></figure>  <p>Once the table is created, it will begin ingesting data from the “orders” Kafka topic. We can view this data by opening the Pinot query console. Notice how the <code>source</code> column contains the entire “after” payload generated by Debezium.</p><img src="/2024/12/24/Creating-a-realtime-data-platform-bringing-data-in/pinot.png" class="">  <p>That’s it. That’s how to stream data using Debezium into Pinot.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;In the first part we saw the overall design of the system. In the second part we created a dataset that we can work with. In this post we
      
    
    </summary>
    
    
      <category term="architecture" scheme="http://fasihkhatib.com/tags/architecture/"/>
    
  </entry>
  
  <entry>
    <title>Creating a realtime data platform - creating the data</title>
    <link href="http://fasihkhatib.com/2024/12/22/Creating-a-realtime-data-platform-creating-the-data/"/>
    <id>http://fasihkhatib.com/2024/12/22/Creating-a-realtime-data-platform-creating-the-data/</id>
    <published>2024-12-22T07:03:43.000Z</published>
    <updated>2024-12-22T07:03:43.977Z</updated>
    
    <content type="html"><![CDATA[<p>In the <a href="/2024/12/18/Creating-a-data-wrehouse-with-Apache-Pinot-and-Debezium/">previous post</a> we saw the overall design of the platform. We saw how the components of the system are divided into three separate categories: those that bring the data in, those that create datasets on this data, and those that display visualizations. Starting from this post, we’re going to start building the system. We’ll work with the data of a fictitious online cafe that we’ll populate using a Python script. In subsequent posts, we’ll ingest this data into the platform and create visualizations on top of it.</p><h2 id="Before-we-begin"><a href="#Before-we-begin" class="headerlink" title="Before we begin"></a>Before we begin</h2><p>The setup, for this post, consists of a Docker container for Postgres which is a part of the compose file. We’ll bring up the container before we begin populating the database.</p><h2 id="The-data"><a href="#The-data" class="headerlink" title="The data"></a>The data</h2><p>We’ll create and populate a table which stores the orders placed by the customer. The table contains, among other fields, the id of the user who placed the order, the id of the address where the order needs to be delivered, the status of the order, and the user agent of the device used to place the order. 
The code snippet below shows how the model is represented as a Python class.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">Order</span>(peewee.Model):</span><br><span class="line">    <span class="string">"""An order placed by the customer."""</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">class</span> <span class="title class_">Meta</span>:</span><br><span class="line">        table_name = <span class="string">"orders"</span></span><br><span class="line">        database = database</span><br><span class="line"></span><br><span class="line">    <span class="built_in">id</span> = peewee.BigAutoField()</span><br><span class="line">    user_id = peewee.IntegerField()</span><br><span class="line">    address_id = peewee.IntegerField()</span><br><span class="line">    cafe_id = peewee.IntegerField()</span><br><span class="line">    partner_id = peewee.IntegerField(null=<span class="literal">True</span>)</span><br><span class="line">    created_at = peewee.DateTimeField(default=datetime.datetime.now)</span><br><span class="line">    updated_at = peewee.DateTimeField(null=<span class="literal">True</span>)</span><br><span class="line">    deleted_at = peewee.DateTimeField(null=<span class="literal">True</span>)</span><br><span class="line">    status = 
peewee.IntegerField(default=<span class="number">0</span>)</span><br><span class="line">    user_agent = peewee.TextField()</span><br></pre></td></tr></table></figure>  <p>Once we’ve created this class, we’ll write a function which creates instances of this class and persists them in the database. There are classes representing the cafe, the addresses saved by the user, and the delivery partner who will be assigned to deliver the order. However, these have been left out for the sake of brevity. The code snippet below shows this function.  </p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">create_orders</span>(<span class="params"></span></span><br><span class="line"><span class="params">    users: <span class="built_in">list</span>[User],</span></span><br><span class="line"><span class="params">    addresses: <span class="built_in">list</span>[Address],</span></span><br><span class="line"><span class="params">    cafes: <span 
class="built_in">list</span>[Cafe],</span></span><br><span class="line"><span class="params">    partners: <span class="built_in">list</span>[Partner],</span></span><br><span class="line"><span class="params">    n: <span class="built_in">int</span> = <span class="number">100</span>,</span></span><br><span class="line"><span class="params"></span>) -&gt; <span class="built_in">list</span>[Order]:</span><br><span class="line">    ua = UserAgent()</span><br><span class="line">    orders = []</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">base_order</span>() -&gt; <span class="built_in">dict</span>:</span><br><span class="line">        cafe = cafes[random.randint(<span class="number">0</span>, <span class="built_in">len</span>(cafes) - <span class="number">1</span>)]</span><br><span class="line">        user = users[random.randint(<span class="number">0</span>, <span class="built_in">len</span>(users) - <span class="number">1</span>)]</span><br><span class="line">        addr = [_ <span class="keyword">for</span> _ <span class="keyword">in</span> addresses <span class="keyword">if</span> _.user_id == user.<span class="built_in">id</span>][<span class="number">0</span>]</span><br><span class="line">        user_agent = ua.random</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> {</span><br><span class="line">            <span class="string">"user_id"</span>: user.<span class="built_in">id</span>,</span><br><span class="line">            <span class="string">"address_id"</span>: addr.<span class="built_in">id</span>,</span><br><span class="line">            <span class="string">"cafe_id"</span>: cafe.<span class="built_in">id</span>,</span><br><span class="line">            <span class="string">"user_agent"</span>: user_agent,</span><br><span class="line">        }</span><br><span class="line"></span><br><span class="line">    <span 
class="keyword">for</span> _ <span class="keyword">in</span> <span class="built_in">range</span>(n):</span><br><span class="line">        data = {**base_order(), <span class="string">"status"</span>: OrderStatus.PLACED.value}</span><br><span class="line">        order = Order.create(**data)</span><br><span class="line">        orders.append(order)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> orders</span><br></pre></td></tr></table></figure>  <p>Once we have all our classes and functions in place, we’ll run the script which populates the data.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python faker/data.py</span><br></pre></td></tr></table></figure>  <p>We can now query the database to see our data.  </p><img src="/2024/12/22/Creating-a-realtime-data-platform-creating-the-data/data.png" class="">  <p>This is it for the second part of the series.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;In the &lt;a href=&quot;/2024/12/18/Creating-a-data-wrehouse-with-Apache-Pinot-and-Debezium/&quot;&gt;previous post&lt;/a&gt; we saw the overall design of the 
      
    
    </summary>
    
    
      <category term="architecture" scheme="http://fasihkhatib.com/tags/architecture/"/>
    
  </entry>
  
  <entry>
    <title>Creating a realtime data platform - the design</title>
    <link href="http://fasihkhatib.com/2024/12/18/Creating-a-data-wrehouse-with-Apache-Pinot-and-Debezium/"/>
    <id>http://fasihkhatib.com/2024/12/18/Creating-a-data-wrehouse-with-Apache-Pinot-and-Debezium/</id>
    <published>2024-12-18T02:13:04.000Z</published>
    <updated>2024-12-22T03:44:08.124Z</updated>
    
    <content type="html"><![CDATA[<p>I’d previously written about <a href="/2024/06/20/Creating-a-realtime-data-platform-with-Pinot-Airflow-Trino-and-Debezium/">creating a data platform using Pinot, Trino, Airflow, and Debezium</a>. It was a quick how-to that showed how to glue the pieces together to create a data platform. In this post we’ll go deeper into the design of the system and look at building the system in the posts that follow.</p><h2 id="The-design"><a href="#The-design" class="headerlink" title="The design"></a>The design</h2><p>A common requirement for data engineering teams is to move data stored within the databases owned by various microservices into a central data warehouse. One of the ways to move this data is by loading it incrementally. In this approach, once the data has been loaded fully, subsequent loads are done in smaller increments. These contain rows that have changed since the last time the warehouse was loaded. This brings the data into the warehouse periodically as the loads are run on a specified schedule.  </p><p>Recently the shift has been towards moving data in realtime so that analytics can be derived quickly. Change data capture allows capturing row-level changes as they happen as a result of inserts, updates, and deletes in the tables. Responding to these events allows us to load the warehouse in realtime.  </p><p>The diagram below shows how we can combine Pinot, Trino, Airflow, Debezium, and Superset to create a realtime data platform.</p><img src="/2024/12/18/Creating-a-data-wrehouse-with-Apache-Pinot-and-Debezium/Diagram2.png" class="">  <p>The components of the system can be divided into three broad categories. The first category is those that bring data into the platform and are shown in dark green. This category consists of the source database system, Debezium, and Pinot. Debezium reads the stream of changes happening in the database and writes them into Pinot. 
The second category is those that create datasets on top of the data ingested into Pinot and are shown in dark grey. This category consists of Airflow, Trino, and HDFS. Airflow uses Trino to create tables and views in HDFS on top of the data stored in Pinot. Finally, the last category is those that consume the datasets and present them to the end user. This category consists of data visualization tools like Superset.  </p><p>Let’s discuss each of these components in more detail.  </p><p>Debezium is a platform for change data capture. It consists of connectors which monitor the database tables for inserts, updates, and deletes and emit events into Kafka. These events can then be written into the data warehouse to create an up-to-date version of the table in the upstream database. We’ll run Debezium as a Docker container. When run like this, the connectors are available as a part of the image and can be configured using a REST API. To configure the connector we’ll send a JSON object to a specific endpoint. This object contains information such as the credentials of the database, the databases or tables we’d like to monitor, any transformations we’d like to apply to this data, and so on. As we’ll see when we begin building the system, we can monitor all of our tables for changes happening in them.  </p><p>Pinot is an OLAP datastore that is built for real-time analytics. It supports creating tables that consume data in realtime so that insights can be derived quickly. Pinot, when combined with Debezium, allows us to ingest row-level changes as they happen in the source table. Configurations for Pinot tables and schemas are written in JSON and sent to their respective endpoints to create them. We’ll create a realtime table which ingests events emitted by Debezium. Using the upsert functionality provided by Pinot, we’ll keep only the latest state of the row in the table. This makes it easier to create reports or do ad-hoc analysis.  
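</p><p>To make the upsert behaviour concrete, here is a minimal sketch of the upsert-related keys in a Pinot realtime table configuration. It is an illustration under assumptions rather than the configuration used later in this series: it relies on Pinot’s documented <code>upsertConfig.mode</code> and <code>routing.instanceSelectorType</code> settings, and it assumes the schema declares the row’s primary key in <code>primaryKeyColumns</code>.</p><figure class="highlight json"><table><tr><td class="code"><pre>{
  "upsertConfig": {
    "mode": "FULL"
  },
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  }
}</pre></td></tr></table></figure>  <p>With full upserts, a newer event for a primary key replaces the row that was previously visible for that key, so queries see only the latest state of each source row.  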
</p><p>Trino provides query federation by allowing us to query multiple data sources with a unified SQL interface. It is fast and distributed, which means we can use it to query large amounts of data. We’ll use it in conjunction with Pinot since the latter does not yet provide full SQL capabilities. Trino allows connecting to a database by creating a catalog. As we can see from the diagram, we’ll need two catalogs: Pinot and HDFS. Since it is currently not possible to create views, materialized views, or tables from select statements in Pinot, we’ll create them in HDFS using Trino. This allows us to speed up the reports and dashboards since all of the required data will be precomputed and available in HDFS as either a materialized view or a table.  </p><p>Airflow is an orchestrator that allows creating complex workflows. These workflows are created as Python scripts that define a directed acyclic graph (DAG) of tasks. Airflow then schedules these tasks for execution at defined intervals. Tasks are defined using operators. For example, to execute a Python function one would use the PythonOperator. Similarly, there are operators to execute SQL queries. We’ll use these operators to query Trino and create the datasets that are needed for reporting and dashboards. Periodically regenerating these datasets would allow us to provide reports that present the latest data.  </p><p>Superset is a data visualization tool. We’ll connect Superset to Trino so that we can visualize the datasets that we’ve created in HDFS.   </p><p>Having discussed the various components of the design, let’s look at the design goals it achieves. First, the system is designed to be realtime. With change data capture using Debezium, we can respond to every insert, update, and delete happening in the source table as soon as it happens. Second, the system is designed with open-source technologies. 
This allows us to benefit from the experience of the collaborators and community behind each of these projects. Finally, the system is designed to be as close to self-service as possible. As we’ll see, the design of the system significantly reduces the dependency of the downstream business analytics and data science teams on the data engineering team.  </p><p>This is it for the first part of the series.</p><link rel="stylesheet" href="/css/spoiler.css" type="text/css"><script src="/js/spoiler.js" type="text/javascript" async></script>]]></content>
    
    <summary type="html">
    
      
      
        &lt;p&gt;I’d previously written about &lt;a href=&quot;/2024/06/20/Creating-a-realtime-data-platform-with-Pinot-Airflow-Trino-and-Debezium/&quot;&gt;creating a da
      
    
    </summary>
    
    
      <category term="architecture" scheme="http://fasihkhatib.com/tags/architecture/"/>
    
  </entry>
  
</feed>
