<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://willayd.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://willayd.com/" rel="alternate" type="text/html" /><updated>2025-01-14T17:01:21+00:00</updated><id>https://willayd.com/feed.xml</id><title type="html">Will Ayd | Personal Blog</title><subtitle>Will Ayd is a data consultant / solutions architect by day, doubling as a programmer by night. In his personal blog he discusses and explains all of the latest trends within data and analytics.</subtitle><author><name>Will Ayd</name></author><entry><title type="html">15 Years of Data - From Closed to Open Source</title><link href="https://willayd.com/15-years-in-data.html" rel="alternate" type="text/html" title="15 Years of Data - From Closed to Open Source" /><published>2025-01-08T00:00:00+00:00</published><updated>2025-01-08T00:00:00+00:00</updated><id>https://willayd.com/15-years-in-data</id><content type="html" xml:base="https://willayd.com/15-years-in-data.html"><![CDATA[<p>It feels almost surreal to take a step back and recognize that I have now spent 15 years of my life working professionally in the field of data.</p>

<p>Over this time, I have experienced a monumental shift in how organizations configure their reporting platforms. What was once a field dominated by add-ons provided by corporate B2B titans like SAP, Oracle, and IBM, has evolved into a field where open source solutions provide far superior options for organizations to utilize.</p>

<p>In this post I’ll share some of my experiences that have coincided with that shift, while providing anecdotes of how open source tools have changed the landscape for the better. I’ll also add in my thoughts on where open source tools are going to take us over the next few years.</p>

<h1 id="industry-experience-in-the-early-2010s">Industry experience in the early 2010s</h1>

<p>My first professional job was as a “Business Intelligence Analyst” at the company Under Armour, which at the time was growing rather rapidly. Under Armour had a huge partnership with SAP to provide their technological solutions, which included not just their flagship ERP solution, but a proliferation of analytical tools as well.</p>

<p>I didn’t realize it at the time, but the SAP analytical tools were downright…awful. To load data, we were forced to use SAP’s own proprietary programming language <a href="https://en.wikipedia.org/wiki/ABAP">ABAP</a>, which was very poorly documented and understood. It is highly likely that we wrote poor ABAP, but given its closed nature and lack of community, there was no way to really tell.</p>

<p>The data extraction jobs that we wrote in this language exhibited awful performance and were horribly unstable, even by 2010 standards. The vast majority of our business critical data loads from the ERP system ran anywhere from 4-8 hours in batch every night. I would estimate that on 60-70% of the nights we had a job failure that required the on call person to wake up at 3 AM and log into the system to restart it.</p>

<p>If we were lucky enough to have data loaded, the only interface through which we could then access it was a tool called BEx Analyzer. I’ll let you google images of this tool, but needless to say, it was a glorified pivot table. SAP’s solution to visualization was implemented in another product they acquired called BusinessObjects. The tools in this suite were fine at best, but, particularly for interactive visualizations, they lagged far behind a tool like Tableau, which at the time was considered first-in-class for drag and drop visualization.</p>

<p>Another “selling point” to using these vendor-provided tools was that we had an official support contract. Unfortunately, a support contract with a company like SAP is just a game of cat and mouse, with the ultimate goal of discouraging you as a customer from using the contract in the future. There were many instances where we would open a high priority ticket that impacted business operations. In turn, we would get connected with support “experts” who had very little knowledge of the inner workings of their tool.</p>

<p>The fact that the first few layers of support did not have much knowledge of the tools they supported is not an admonition of the people; rather, it is a rebuke of the closed source model where, even within an organization, only a select few are allowed to see the inner workings of a tool. The only thing initial lines of support possessed was a private collection of internal notes for common issues. Think of a site like StackOverflow, but instead of being freely accessible, you pay for someone else to have exclusive access to it and they just tell you what they see.</p>

<p>Of course, the notes that were collected did not cover many of the issues we would face. There were many times where we would be engaging support for multiple hours, only for the support team to say “sorry our workday has finished, we will share our findings with associates in the next time zone that will help you.” Rarely ever were findings shared, so we ended up in the support Twilight Zone until something by chance resolved itself. In extreme cases, this would take days, and really no one learned anything from it - everyone was just relieved that they could close the ticket and move on until the next one.</p>

<p>In all fairness to SAP, I have had similar experiences with Microsoft and their Power Platform support contracts. SAP is not unique in offering poor value in exchange for these contracts; generally, closed-source B2B enterprises can make a large amount of profit off of these contracts while offering customers very poor ROI in return. In the early 2010s, this was common practice.</p>

<p>The cherry on top of poor ROI would manifest itself every few years when it was time to upgrade systems. In the early 2010s many customers had to manage their stack and avoid obsolescence (cloud, PaaS, and SaaS were not yet as common as they are today).</p>

<p>For the proprietary reporting platform provider like SAP, this presented a great opportunity to double dip on profits. By providing customers with subpar tools and not having accountability to fix issues, the solution to many unresolved issues would be “hey, you just need to upgrade.” Upgraded software came with a licensing fee, would be bundled with resold hardware, and would require the hiring of consultants to just get you onto a more recent version of the tool. Needless to say, this process was really expensive. At times, you would have to pay millions if not more to simply avoid obsolescence. With no alternatives, many customers had no choice but to follow this path.</p>

<h1 id="open-source-tools-to-the-rescue">Open Source Tools to the Rescue</h1>

<p>My first serious exposure to open source data tools was back in 2017. It was not something that was promoted by the industry I worked in, but rather a curiosity of mine that led to some awesome discoveries.</p>

<p>Speaking briefly to my first use case, Under Armour had no standardization as to how their vendors were packing certain types of products into boxes. That might not sound like a big deal, but it has measurable costs when boxes break, or your warehouse needs to repack goods.</p>

<p>The dataset I was working with was a few million records of vendor shipments, and the SAP-provided tools we had were of no use with data even at this scale. Out of curiosity, I fired up pandas and used the seaborn library to generate a violin plot of this data. If you don’t know what a violin plot is, here’s one that I created in my recent book, the <a href="https://www.amazon.com/Pandas-Cookbook-Practical-scientific-exploratory/dp/1836205872">Pandas Cookbook, Third Edition</a>:</p>

<p><img src="assets/images/violin_plot.png" alt="A violin plot from the Pandas Cookbook, Third Edition" /></p>

<p>This plot shows the distribution of IMDB scores across different decades, and you can rather easily see trends not only in terms of averages but in the distributions of data as well. Imagine replacing the years on the y-axis with different product types and the x-axis values with the number of units packed into a carton, and you have an idea of what I was able to visualize in Python. The fact that I could do this with freely available tools in a matter of seconds, and in a way that was highly auditable and reproducible, was downright…amazing!</p>

<p>Shortly after I started working with open source tools in corporate settings, I moved into the world of consulting, first at a small practice and then independently. During that time, I was able to contribute back to the open source ecosystem, and in doing so felt like I stumbled upon a gold mine. The first project I contributed to was <a href="https://pandas.pydata.org/">pandas</a> (of which I am a maintainer today), and that unlocked interactions with users in <a href="https://scikit-learn.org/stable/index.html">scikit-learn</a>, <a href="https://scipy.org/">SciPy</a>, <a href="https://numpy.org/">NumPy</a>, and many other great tools. I also followed a lot of the work that <a href="https://wesmckinney.com/">Wes McKinney</a> did with his move from pandas to <a href="https://arrow.apache.org/">Apache Arrow</a>, a project which I also became a Committer to in 2024.</p>

<p>The work I put into open source at this time didn’t pay the bills, but it offered me a huge network and ecosystem to work with in my consulting engagements. It also taught me best practices in software engineering, which can be translated over into data engineering to drive real scalability.</p>

<p>Instead of using reporting tools provided by the SAPs and Oracles of the world, the stacks I build at clients today typically use some combination of:</p>

<ul>
  <li><a href="https://www.terraform.io/">Terraform</a>, for Infrastructure as Code</li>
  <li><a href="https://airflow.apache.org/">Apache Airflow</a>, for Job Orchestration</li>
  <li><a href="https://www.python.org/">Python</a> as a glue language for ETL</li>
  <li><a href="https://docs.getdbt.com/docs/core/installation-overview">dbt-core</a> for transformations and testing</li>
  <li><a href="https://git-scm.com/">git</a> for version control</li>
  <li><a href="https://github.com/">GitHub Actions</a> for CI/CD</li>
</ul>

<p>…and more. At a minimum these tools are entirely free to use, but if you wanted to go more of a hosted solution route you can find many third party PaaS and SaaS providers for them. The costs of these services are very affordable compared to the reporting platforms of the early 2010s.</p>

<p>Perhaps the most important thing to building a stack like this is that you limit vendor lock-in. Gone are the days where a B2B provider can provide a subpar solution and charge you more money to upgrade it - you ultimately have control over how your data is managed, maintained, and evolved.</p>

<p>Granted, you need some degree of technical inclination to maintain a stack like this, but keep in mind that these tools are being taught and used at universities, and can be used at home free-of-cost by those willing to learn. That alone is a huge advantage compared to traditional closed-source reporting architectures. Also keep in mind that as LLMs continue to become more powerful, the amount of open information for these tools is a huge asset. You can ask popular chatbots questions about these tools and get much higher quality information than if you were to ask about an SAP, Oracle, or IBM closed-source tool.</p>

<p>Open platforms save companies <strong>significant</strong> amounts of money. I have not seen anywhere near the amount of money invested to implement, upgrade, and maintain open source reporting platforms as compared to traditional providers. Large project implementations are reduced from tens to hundreds of millions of dollars down to less than a million.</p>

<p>Of course, there are still some gaps in the open source space. You may have noticed that I didn’t list a visualization platform in my tools above. With the companies I’ve partnered with, Power BI and Tableau tend to dominate that space, with Looker not far behind. <a href="https://hex.tech/">Hex</a> is a tool that I will be following closely as well, given it integrates well into the open source ecosystem, but generally there is not an open source visualization tool I know of today that can compete in this space. Open source markets itself to technical audiences, but visualization tools are a bridge for non-technical audiences into the technical world. To build a big community in that space around an open source tool might be challenging, but who knows - maybe one day it will happen.</p>

<h1 id="where-are-we-headed">Where are We Headed</h1>

<p>Throughout this blog post I’ve highlighted my experiences moving from closed to open source reporting tools, and I think that has been largely reflective of the industry as a whole. However, open source reporting architecture still needs to capture more market share. From both a technical and economical perspective, I see no reason why this won’t continue, but it takes time to shift industries. If you happen to work at a company that still has not yet adopted open source tools, shoot me an email at <a href="mailto:will_ayd@innobi.io">will_ayd@innobi.io</a> - I’d love to chat about how the right platform will drive down your operational costs and improve the value of your data.</p>

<p>For companies that have already adopted an open source architecture, there is still a lot that the open source community itself can do to improve interoperability. If you have followed the dataframe space for the past few years, you have seen tools like <a href="https://pola.rs/">Polars</a>, <a href="https://datafusion.apache.org/">Apache DataFusion</a>, and <a href="https://duckdb.org/">DuckDB</a> work their way into a space that was once dominated by tools like pandas or R. As a maintainer of pandas, I ultimately think this choice is a good thing. Tools should be a means to an end, not the end goal of expression. Find whatever works best for you and let the open source community optimize it.</p>

<p>The idea that you can plug and play different open source tools into your stack is a core tenet of the “Composable Data System.” Wes McKinney and many other titans of the industry have espoused this idea in their own writing; see Wes’ own <a href="https://wesmckinney.com/blog/looking-back-15-years/">15 year reflection</a> and <a href="https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf">The Composable Data Management System Manifesto</a>.</p>

<p>To think about what that means non-technically, just imagine that you are a company that uses SAP’s ERP solution and Microsoft Power BI for reporting. Those are both closed source tools that implement data storage in their own ways, so every time you move data from one to the other you have to pay some kind of cost. That cost can manifest itself in compute time, data loss, or in the troubleshooting of job failures.</p>

<p>In the open source space, the Apache Arrow project gives all of the tools the ability to “speak the same language.” So if you have a dataframe library like pandas and want to create visualizations in a tool like <a href="https://vega.github.io/vega/">Vega</a>, there is no additional cost to using those two together. Pandas can store data in a way that Vega understands; the two communicate and operate on the same data in memory, so your ETL costs go down to zero, you have no data loss, and there are no job failures to troubleshoot.</p>

<p>Apache Arrow will continue to help a lot in terms of exchanging data. If you’ve used the Apache Parquet format for exchanging data, you have already seen this. That format in particular enables highly efficient, lossless storage between dataframe libraries, databases, and visualization tools. It drastically improved a space that struggled mightily when the de facto method of data exchange was through CSV files.</p>
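<p>To make the CSV problem concrete, here is a minimal sketch using only the Python standard library. The shipment records and column names are hypothetical, made up purely for illustration. It shows the type information that CSV throws away on a round trip, which is the kind of loss a typed format like Parquet is designed to avoid; it does not exercise Parquet or Arrow themselves.</p>

```python
import csv
import io
from datetime import date

# Hypothetical shipment records: an integer, a date, and a missing value.
records = [
    {"carton_id": 1001, "ship_date": date(2017, 3, 1), "units": 24},
    {"carton_id": 1002, "ship_date": date(2017, 3, 2), "units": None},
]

# Write the records to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["carton_id", "ship_date", "units"])
writer.writeheader()
writer.writerows(records)

# Read them back: every value now comes back as a plain string.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))

# Compare the type of each value before and after the round trip.
for original, restored in zip(records, round_tripped):
    for column, value in original.items():
        print(column, type(value).__name__, "->", type(restored[column]).__name__)
```

<p>Every consumer of that CSV file has to guess which columns were integers or dates, and whether an empty string meant “missing.” A format that stores the schema alongside the data removes that guesswork.</p>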

<p>As successful as that has been, there are other methods of data exchange that can still improve, with database communication being a prime example. To that end, the Apache Arrow project has developed the <a href="https://arrow.apache.org/docs/format/ADBC.html">Arrow Database Connectivity (ADBC)</a> standard. While a relatively young project, it has made it immensely more efficient to exchange data between dataframes and databases like PostgreSQL, SQLite, Snowflake, and BigQuery. If you haven’t yet seen my PyData 2023 talk on ADBC - <a href="https://youtu.be/XhnfybpWOgA?si=uGVbQhVLurh_Lxoi">check it out</a>. Simply put, ADBC makes it much faster, cheaper, and safer to exchange data with databases. I hope to see adoption of ADBC continue to grow with more databases in the next few years.</p>

<p>There’s also the ability to exchange Arrow data over HTTP, which can solve a rather significant throughput problem. In today’s world, companies often partner with a multitude of SaaS solutions that refuse to provide lower level access to their database, instead giving customers a REST API through which they can access the data. This checks the box for saying “hey we can share your data with you,” but different consumers need data shared in a different way. Usually, the data exchange mechanisms provided by SaaS solutions have been sufficient for creating reactionary web hooks, but struggle to exchange data in bulk. Reporting platforms often need the latter, so I’m hoping to see more adoption of this practice in the next few years to solve real scalability problems that companies face today.</p>]]></content><author><name>Will Ayd</name></author><category term="data management" /><category term="data" /><summary type="html"><![CDATA[In this article I discuss my shift from closed to open source reporting solutions over the past 15 years, while also offering a brief glimpse into future opportunities.]]></summary></entry><entry><title type="html">The 5 Data Mistakes Every Apparel Company Makes</title><link href="https://willayd.com/apparel-data-mistakes.html" rel="alternate" type="text/html" title="The 5 Data Mistakes Every Apparel Company Makes" /><published>2024-11-19T00:00:00+00:00</published><updated>2024-11-19T00:00:00+00:00</updated><id>https://willayd.com/apparel-data-mistakes</id><content type="html" xml:base="https://willayd.com/apparel-data-mistakes.html"><![CDATA[<p>Over the course of 15 years, I have had the pleasure of working at and partnering with some amazing companies in the Apparel space. While I am not qualified to design product, I’ve been fortunate to work with many of the different business areas that make up an Apparel organization. 
As a data practitioner, this has been immensely helpful when thinking about crafting data strategies that work for the entire organization.</p>

<p>Interestingly, I found that the companies I worked at repeated the same data mistakes over and over again. In this article, I highlight five major mistakes to avoid; doing so will pay immense dividends in terms of how much your organization has to spend to manage data, and in turn how effectively your organization can use data.</p>

<h1 id="plm-system-overreach">PLM System Overreach</h1>

<p>The first major mistake I see Apparel companies make is to try and solve all of their product management issues with a PLM system.</p>

<p>Companies have a real need to categorize products, plan, merchandise, and manage changes throughout the product’s lifecycle. While in theory a PLM system could help you manage all of these, in practice I think PLM systems become far less useful the further downstream you get from managing the technical aspects of a product.</p>

<p>For example, I’ve seen companies try to implement planning solutions inside of their PLM. The logic that leads you to this solution usually goes:</p>

<ol>
  <li>We have a lot of plans that we manage in Excel. We need to store them somewhere else</li>
  <li>We do a lot of data entry into our PLM system already</li>
  <li>Product line managers might want to know how well a product is planned to craft their line</li>
</ol>

<p>Sure, this is all well intentioned, but assuming your company has gone down this path, the question you must ask yourself after putting planning data into your PLM is <em>now what?</em> I’ve yet to come across a PLM system that offers any type of robust planning solution (ironically, Excel has more to offer). By going down this path, your company has done nothing but “shuffle sand.” It may feel good that you got your data into a “real system,” but that system offers no capabilities to augment the existence of that data, and you’ve actually created more busy work for people to store, transmit, and analyze plans.</p>

<p>Another common culprit for abuse of the PLM system is Master Data Management (MDM). Once again, there’s some pretty logical thinking that deceives you into thinking your PLM system is the right place to try and manage this. Here’s the reasoning:</p>

<ol>
  <li>To achieve good MDM, we need to fix problems at the source</li>
  <li>Our product line is developed in the PLM system</li>
  <li>Product line managers should own the quality of their data</li>
</ol>

<p>There are at least two major issues with this logic. First, the odds that your PLM system will completely encompass your product line offering are low. In the direct-to-consumer line of business, it is common to partner with 3rd parties on licensing agreements that allow you to cross-sell products. If you try to build these 3rd party products into your PLM system, you’ll likely violate all of the assumptions your system has in place about how products should be managed. Although very few PLM implementations actually enforce automated rules, just these few extra styles will weaken the overall way that you use your PLM to manage products that your company actually designs.</p>

<p>The other main issue is that product line managers often do not have complete control over their product. Sure, they manage a lot, but teams downstream from them almost assuredly will have their own custom product categorizations. To illustrate, let’s assume that your supply planning team wants to codify how a product should be stocked. If your goal is to have this information in your PLM system, then either your supply planning teams or product line management teams have to enter it somehow.</p>

<p>In the case that your supply planning teammates maintain this data, you’d have to train them up on a system to leverage only a very small portion of it. This is a waste of your supply planning teams’ time, and many PLM systems have per-user licensing agreements that make this unnecessarily expensive. If, on the other hand, you ask the product line manager to maintain this, your data quality and maintenance are surely going to suffer. You’ve asked your product line manager to own data that has nothing to do with their core function, which is a recipe for disaster.</p>

<h1 id="erp-system-overreach">ERP System Overreach</h1>

<p>Now that we’ve cast some doubts on the flexibility offered by PLM systems, let’s move our focus onto the next biggest systems culprit - the ERP system.</p>

<p>Towards the beginning part of this millennium, the concept of partnering with an ERP system provider was strongly rooted at many Apparel companies. Not only did ERP providers want to help you manage and track inventory, they wanted to sell you a reporting platform and anything under the sun that your IT organization wanted.</p>

<p>While still strongly rooted, I think this model has started to scale back in recent years. Instead of relying on the Oracles and SAPs of the world for reporting solutions, many organizations have started investing in specialized reporting platform providers like Snowflake and Databricks to handle their data needs. Others may be perfectly content to build their own reporting solution, using a variety of cloud-based offerings and open source tools that have come into vogue within the past 10 years.</p>

<p>However, old habits die hard, and there is a large consulting business within the Apparel space that will still push you towards leveraging ERP systems and the tools that their providers offer for any customizations. ERP systems are inherently inflexible, and their paired reporting solutions are extremely subpar. If you go this route, you are going to use proprietary, poorly understood ERP-provided tools (does anyone still write ABAP?!?). This will make the Oracle and SAP consultants of the world very happy (and rich!), but you’ll continue along with an inflexible, non-specialized solution.</p>

<p>But what if you’ve seen the writing on the wall and avoided buying more than an ERP from ERP vendors for the past decade? Well done for being ahead of the curve, but you still need to be careful when customizing your ERP system to add more data than it is designed to manage.</p>

<p>I usually see this customization manifest itself through custom SQL scripts that modify the underlying database. In rarer cases, this has the downside of locking you into the database underlying your ERP. Unless you strictly write ANSI SQL, you’ve just added to your workload for upgrading or replacing your database.</p>

<p>Another issue is that ERP providers are very strict about what they support when it comes to customization. If you make a mistake and modify the wrong table or schema, you’ve opened yourself up to the possibility that the ERP provider will claim their expensive support agreement as having been violated.</p>

<p>Finally, it’s worth noting that this is a pretty archaic way of managing systems. There are very few other modern day applications where database-level customization is considered a best practice; more commonly, you’d have some type of middleware that affords you safety and flexibility when augmenting your system. Directly modifying the underlying application database is so 90s.</p>

<p>To be clear, I don’t want this section to read as saying that there is no value to ERP systems. Smaller organizations tend to forgo traditional ERP systems for SaaS solutions to manage production, shipping and inventory. These solutions work up to a certain extent, but they become very difficult to scale. Having a centralized ERP system alongside many standard integrations (vendor management portals, warehouse management modules, point-of-sale systems, etc…) can be of immense value to an organization.</p>

<p>However, you should keep in mind that the benefits of an ERP are mainly relevant to tracking and creating a transactional record of events. When it comes to data and analytics, open-source tooling has completely disrupted the field that ERP vendors tried to dominate in the early 2000s. Open-source tooling is better, and hiring associates with that skill-set is significantly easier. The odds of a young data professional coming out of college knowing how to use SAP or Oracle are pretty low, yet the odds of them knowing how to use tools like Python are high.</p>

<h1 id="ceding-data-to-3rd-parties">Ceding Data to 3rd Parties</h1>

<p>As a data professional, this one hurts to see. Once again, let’s talk about the logical steps that get you to this point:</p>

<ol>
  <li>Your marketing team uses a 3rd party tool that is great for &lt;customer tracking, email marketing, etc…&gt;</li>
  <li>The 3rd party tool promises you they can build you a “single view of the customer”</li>
  <li>You start copying all of your data to the third party tool, so you don’t have to manage it yourself</li>
</ol>

<p>There are a few issues with this. For starters, I’ve yet to come across a 3rd party SaaS provider that effectively manages an organization’s data. Sure, they can likely produce some attractive reports on top of the data that they created, but they simply do not have the capabilities to cover your entire organization. Managing data quality, ensuring systems talk to one another in meaningful ways, and aligning systems with business processes is an insanely complex task. If you think your email marketing platform that you pay for on a subscription basis is going to solve that for you…well I’ve got some snake oil to sell you too!</p>

<p>If you go this route, you need to be aware that you are giving away one of the greatest assets that your organization has. Within the past decade, we’ve continued to see data become more valuable by the year. Whether you are using data for analytical purposes, or you are collecting data that one day may help you augment an AI model for your organization, <em>you</em> should own that.</p>

<p>Of course a 3rd party SaaS provider would love to take this from you. Storage is exceptionally cheap, and, save having some very low-level integration with your third party, the amount of data that you send to them is going to cost peanuts. This will make your finance team happy, but it won’t take long to find that not only have you ceded control of your invaluable treasure chest of data, you’ve locked yourself into a 3rd party provider.</p>

<p>You should also consider that third party SaaS providers are being bought, sold, and acquired at a profound pace; any of these events can greatly change the priorities of that provider. Even large players like Google have had to make sweeping architectural changes to how they solution their products (Universal Analytics -&gt; GA4, anyone?).</p>

<p>Your organization needs your data and they need it to be comprehensive, accurate, and insightful. Before you give your data away to 3rd parties, ask yourself if you truly think that they are going to forever architect a comprehensive solution for your organization, with limited interaction with your business and at the cost of a subscription. Odds are low, so that’s a huge risk to take on.</p>

<h1 id="no-processes-for-data-quality-control">No Processes for Data Quality Control</h1>

<p>This definitely falls under the purview of MDM, but is critical enough to report here as we talk about data in more general terms. Please DO NOT allow your teams to categorize their data differently across business units. This is a short-term win for the business unit that feels like it needs more flexibility to manage its view of products and consumers, but you are proverbially “robbing Peter to pay Paul” when you do this.</p>

<p>This is a top-down failure within your data organization. Each analyst may be happy to have this freedom, but undoubtedly your data and communications will need to cross multiple parts of your organization. Can you imagine trying to cobble together a spreadsheet from three different business units that each refer to your product as:</p>

<ul>
  <li>Awesome Women’s T</li>
  <li>Awesome W’s T</li>
  <li>Awesome Womens T</li>
</ul>

<p>As humans, we easily recognize these as the same thing. Computers are pretty stupid though (yes, even in the age of AI) so you’ve introduced work for someone somewhere to try and clean up this mess manually.</p>

<p><em>What a huge waste of time</em>. I can’t stress this enough. Your poor data quality analyst is going to have to go through, reclassify these, communicate with people to try and establish a best practice, update a multitude of Excel files, etc…all to ultimately produce a non-reproducible report that leaves some ambiguity as to how well this product is being managed.</p>

<p>You may have less control over this on the consumer side and have to pay third party services to help cleanse your data (addresses come to mind), but those services tend to be relatively affordable. On the product side, setting standards up front and measuring adherence to those standards is entirely within your control. Please do this and save your organization from all of the non-value added data cleansing activities downstream.</p>
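<p>As a rough sketch of what “measuring adherence” could look like, the snippet below compares reported product names against an approved master list. Both the master list and the <code>adherence_report</code> helper are hypothetical and exist only for this illustration; a real check would read from wherever your product master data lives.</p>

```python
from collections import Counter

# Hypothetical master list of approved product names (the "standard").
APPROVED_NAMES = {"Awesome Women's T"}

# Hypothetical names as entered by three different business units.
reported_names = [
    "Awesome Women's T",
    "Awesome W's T",
    "Awesome Womens T",
]


def adherence_report(names, approved):
    """Return the share of names that match the standard, plus the offenders."""
    offenders = Counter(name for name in names if name not in approved)
    matched = len(names) - sum(offenders.values())
    rate = matched / len(names) if names else 1.0
    return rate, offenders


rate, offenders = adherence_report(reported_names, APPROVED_NAMES)
print(f"Adherence: {rate:.0%}")
for name, count in offenders.items():
    print(f"Non-standard name: {name!r} ({count} record(s))")
```

<p>Running a check like this on a schedule turns data quality into a number you can track over time, rather than a cleanup project someone discovers when a report doesn’t tie out.</p>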

<h1 id="undervaluing-manufacturing-data">Undervaluing Manufacturing Data</h1>

<p>Very few large brands in Apparel do their own manufacturing at scale. Historically, large brands in the U.S. have partnered with overseas factories to produce goods at very low prices. As the world has changed and the geopolitical climate evolves, it is hard to say how the future of this will shake out, but I sincerely doubt that there will be a radical, overnight shift away from this setup.</p>

<p>With that being the case, manufacturing partners are rarely ever managed within a comprehensive data platform. At a minimum, I’ve seen Excel be the system of choice to transmit data from the manufacturing company to the Apparel brand. If you wanted to get fancy, you might have a quality control system and an ERP integration with your manufacturer, but these aren’t typically technologically savvy integrations.</p>

<p>Is that the best we can do with manufacturing data? Of course not! Think about the potential for automation - wouldn’t it be cool to have more AI trained to automate tasks like folding and sewing? You can find any number of conceptual inventions on that front - here’s one as an example:</p>

<p><a href="https://blogs.nvidia.com/blog/hugging-face-lerobot-open-source-robotics/">https://blogs.nvidia.com/blog/hugging-face-lerobot-open-source-robotics/</a></p>

<p>Sure, that’s a very crude folding method and it probably won’t change the labor landscape in the coming months. But how fast can we evolve that space? I would imagine that 99.9% of manufacturing data points are simply lost to time because we don’t track them, and we don’t have the incentives to do so. Maybe the next big thing in Apparel comes from having enough video data to train the robotics to perform these tasks at scale and efficiently?</p>

<p>Outside of theoretical future applications, there’s still so much more that can be done in this space right now. If Apparel companies can apply more technology to manufacturing, they could do things like:</p>

<ol>
  <li>Deeply analyze flaws with the production process design</li>
  <li>Get near real time updates into supply chain bottlenecks</li>
  <li>Send customers ultra-detailed tracking information</li>
</ol>

<p>If you are vertically integrated, data collection in your manufacturing process may be a killer feature for your organization. Even if you aren’t vertically integrated, an investment now in partners that are inclined down this path may pay dividends later. Like manufacturing in many other spaces, automation is the future. Don’t be naive and think that Apparel will always be made the same way it is today!</p>

<h1 id="how-can-you-fix-these-problems">How can you fix these problems?</h1>

<h2 id="lean-into-open-source-software">Lean into open source software</h2>

<p>When I first started my career, the idea of using open source software to run an organization’s data stack was tantamount to heresy. Thankfully, the perception of open-source software has changed alongside the evolution of cloud offerings; these two things pair well together.</p>

<p>If done correctly, you can create an extremely robust, resilient, and highly performant data architecture <em>that you own</em>. You are no longer beholden to how a SaaS or ERP provider thinks you should run your business, and, quite frankly, open source tooling is light years ahead of any one-stop shop that I’ve found.</p>

<p>For instance, here’s a stack that I’ve found personal success with:</p>

<ul>
  <li>Infrastructure Management through Terraform</li>
  <li>Job orchestration through Airflow</li>
  <li>Data Modeling through dbt</li>
  <li>Data Quality / Testing through dbt</li>
  <li>Python as a glue language</li>
</ul>

<p>I’d be lying if I said that this is all a “one-click” deployment, but I don’t think the bar to implement part or all of this stack is all that high either.</p>

<p>The trend of open source software is something that affects more than just the Apparel industry as well; essentially, deploying a stack like the above just keeps you in line with larger shifts in the data space. Just recently, some of the greatest minds in data teamed up to write <a href="https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf">The Composable Data Management System Manifesto</a>. For those with a keen interest in data systems, this is a must read.</p>

<h2 id="treat-data-quality-as-a-shared-goal">Treat data quality as a shared goal</h2>

<p>The problem with data quality is that traditional corporate stacks have done a poor job of automation. QA and MDM were often both treated as manual exercises for people to manage, which quickly become unwieldy at large organizations. Fortunately, and thanks to the ever-increasing role of open-source software within organizations, there are very capable tools that can help you manage data quality much more effectively than ever before.</p>

<p>With the tool of your choice, you should be able to orchestrate automated jobs that immediately flag and capture data quality issues. When you find them, you should immediately forward them to the appropriate parties to manage. This affords your organization the flexibility to have different people weave together parts of data that make up the larger picture. Done right, you’ll find far less “busy work” is needed to achieve this goal.</p>

<p>Also be sure to organizationally assign ownership to different pieces of data. If data quality issues arise with fields A, B, and C in your data warehouse, the organization should know who is responsible for maintaining those fields.</p>
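<p>As a rough sketch of how flagging and ownership can fit together, consider the snippet below. The <code class="language-plaintext highlighter-rouge">FIELD_OWNERS</code> map, the team names, and the single “missing value” rule are all assumptions for illustration; a tool like dbt would express the checks themselves in its own way.</p>

```python
from dataclasses import dataclass

# Hypothetical ownership map: warehouse field -> responsible team.
FIELD_OWNERS = {"field_a": "merchandising", "field_b": "supply-chain"}


@dataclass
class Issue:
    field: str
    row: int
    owner: str


def find_missing_values(rows: list[dict]) -> list[Issue]:
    """Flag empty values in owned fields and attach the owner to notify."""
    issues = []
    for i, row in enumerate(rows):
        for field, owner in FIELD_OWNERS.items():
            if row.get(field) in (None, ""):
                issues.append(Issue(field, i, owner))
    return issues


issues = find_missing_values(
    [{"field_a": "SKU-1", "field_b": ""}, {"field_a": None, "field_b": "DC-7"}]
)
```

<p>Every issue carries the team that owns the field, so an orchestrated job can forward it to the right people instead of dropping it on a single data quality analyst.</p>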

<p>Overall the process for improving this is not complicated, but rather historically neglected. With modern tooling, you can turn that history on its head and reap higher quality, more trustworthy reporting.</p>

<h2 id="embrace-technology-in-manufacturing">Embrace Technology in Manufacturing</h2>

<p>This last solution is a bit more open-ended because it is highly dependent upon how your company is organized and how it manages any potential manufacturing partnerships. So without offering a blanket solution, I’ll bet your company can be doing more to become tech-forward with its development. If you are of the belief that Apparel will be manufactured the same way in years’ time, I think you’ll find many others in the industry looking to disrupt that thought. As AI becomes more accessible and storage costs go down, the barriers to solving the automation problem are breaking down as well.</p>

<p>So please, do whatever you can on this front to not fall behind. If you are vertically integrated, make sure you have robust production tracking software. If you think your company wants to automate more, start taking videos of your production process to use as training data for AI models. On the flip side, if you rely on third-party manufacturers, make sure you value their technological investments as part of your sourcing strategy. Sure, it may be difficult for them to compete on price with a manufacturer that has no technological investments today, but your supply chain and data management are going to pay those costs in the long run.</p>]]></content><author><name>Will Ayd</name></author><category term="data management" /><category term="data" /><summary type="html"><![CDATA[In this article, we discuss the 5 data mistakes that every Apparel company makes, provide some context as to why, and discuss ways to solve them.]]></summary></entry><entry><title type="html">Leveraging the Arrow C Data Interface</title><link href="https://willayd.com/leveraging-the-arrow-c-data-interface.html" rel="alternate" type="text/html" title="Leveraging the Arrow C Data Interface" /><published>2024-02-20T00:00:00+00:00</published><updated>2024-02-20T00:00:00+00:00</updated><id>https://willayd.com/leveraging-the-arrow-c-data-interface</id><content type="html" xml:base="https://willayd.com/leveraging-the-arrow-c-data-interface.html"><![CDATA[<p>The <a href="https://arrow.apache.org/docs/format/CDataInterface.html">Arrow C Data Interface</a> is an amazing tool, and while it documents its own potential use cases I wanted to dedicate a blog post to my personal experience using it.</p>

<h2 id="problem-statement">Problem Statement</h2>

<p>Transferring data across systems and libraries is difficult and time-consuming. This statement applies not only to compute time but perhaps more importantly to developer time as well.</p>

<p>I first ran into this issue over 5 years ago when I started a library called <a href="https://pantab.readthedocs.io/en/latest/">pantab</a>. At the time, I had just become a core developer of <a href="https://pandas.pydata.org/">pandas</a>, and through consulting work had been dealing a lot with <a href="https://www.tableau.com/">Tableau</a>. Tableau had just released their <a href="https://www.tableau.com/developer/learning/tableau-hyper-api">Hyper API</a>, which is a way to exchange data to/from their proprietary Hyper database.</p>

<p><em>Great…</em>, I said to myself, <em>I know a lot of pandas internals and I think writing a DataFrame to a Hyper database will be easier than any other option</em>. Hence, pantab was created.</p>

<p>As you may or may not already be aware, most high-performance Python libraries in the analytics space get their performance from implementing parts of their code base in <em>lower-level</em> languages like C/C++/Rust. So with pantab I set out to do the same thing.</p>

<p>The problem, however, is that pandas did NOT expose any of its internal data structures to other libraries. pantab was forced to hack a lot of things to make this integration “work”, but in a way that was very fragile across pandas releases.</p>

<p>Late in 2023 I decided that pantab was due for a rewrite. Hacking into the pandas internals was not going to work any more, especially as the number of data types that pandas supported started to grow. What pantab needed was an agreement with a library like pandas as to how to exchange low-level data at an extremely high level of performance.</p>

<p>Fortunately, I wasn’t the only person with that idea. Data interchange libraries that weren’t even a thought when pantab started were now a reality, so it was time to test those out.</p>

<h2 id="status-quo">Status Quo</h2>

<p>pantab initially used <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.itertuples.html">pandas.DataFrame.itertuples</a> to loop over every row and every element within a DataFrame before writing it out to a Hyper file. While this worked and was faster than what most users would write by hand, it still really wasn’t that fast.</p>

<p>Here is a high level overview of that process, with heavy Python runtime interactions highlighted in red:</p>

<!--
digraph G {
  node [
    shape=box
    style=filled
    color=black
    fillcolor=white
  ]

  rawdata [
    label = "df.itertuples()"
    color="#b20100"
    fillcolor="#edd5d5"
  ]
  df -> rawdata;

  forloop [
    label = "Python for loop"
    color="#b20100"
    fillcolor="#edd5d5"
  ]
  rawdata -> forloop;

  convert [
    label = "PyObject -> primitive"
    color="#b20100"
    fillcolor="#edd5d5"
  ]
  forloop -> convert;

  write [
    label = "Database write"
  ]
  convert -> write;
}
-->

<svg width="180pt" height="332pt" viewBox="0.00 0.00 180.00 332.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 328)">
<title>G</title>
<polygon fill="white" stroke="transparent" points="-4,4 -4,-328 176,-328 176,4 -4,4" />
<!-- rawdata -->
<g id="node1" class="node">
<title>rawdata</title>
<polygon fill="#edd5d5" stroke="#b20100" points="143.5,-252 28.5,-252 28.5,-216 143.5,-216 143.5,-252" />
<text text-anchor="middle" x="86" y="-230.3" font-family="Times,serif" font-size="14.00">df.itertuples()</text>
</g>
<!-- forloop -->
<g id="node3" class="node">
<title>forloop</title>
<polygon fill="#edd5d5" stroke="#b20100" points="149,-180 23,-180 23,-144 149,-144 149,-180" />
<text text-anchor="middle" x="86" y="-158.3" font-family="Times,serif" font-size="14.00">Python for loop</text>
</g>
<!-- rawdata&#45;&gt;forloop -->
<g id="edge2" class="edge">
<title>rawdata&#45;&gt;forloop</title>
<path fill="none" stroke="black" d="M86,-215.7C86,-207.98 86,-198.71 86,-190.11" />
<polygon fill="black" stroke="black" points="89.5,-190.1 86,-180.1 82.5,-190.1 89.5,-190.1" />
</g>
<!-- df -->
<g id="node2" class="node">
<title>df</title>
<polygon fill="white" stroke="black" points="113,-324 59,-324 59,-288 113,-288 113,-324" />
<text text-anchor="middle" x="86" y="-302.3" font-family="Times,serif" font-size="14.00">df</text>
</g>
<!-- df&#45;&gt;rawdata -->
<g id="edge1" class="edge">
<title>df&#45;&gt;rawdata</title>
<path fill="none" stroke="black" d="M86,-287.7C86,-279.98 86,-270.71 86,-262.11" />
<polygon fill="black" stroke="black" points="89.5,-262.1 86,-252.1 82.5,-262.1 89.5,-262.1" />
</g>
<!-- convert -->
<g id="node4" class="node">
<title>convert</title>
<polygon fill="#edd5d5" stroke="#b20100" points="172,-108 0,-108 0,-72 172,-72 172,-108" />
<text text-anchor="middle" x="86" y="-86.3" font-family="Times,serif" font-size="14.00">PyObject &#45;&gt; primitive</text>
</g>
<!-- forloop&#45;&gt;convert -->
<g id="edge3" class="edge">
<title>forloop&#45;&gt;convert</title>
<path fill="none" stroke="black" d="M86,-143.7C86,-135.98 86,-126.71 86,-118.11" />
<polygon fill="black" stroke="black" points="89.5,-118.1 86,-108.1 82.5,-118.1 89.5,-118.1" />
</g>
<!-- write -->
<g id="node5" class="node">
<title>write</title>
<polygon fill="white" stroke="black" points="148.5,-36 23.5,-36 23.5,0 148.5,0 148.5,-36" />
<text text-anchor="middle" x="86" y="-14.3" font-family="Times,serif" font-size="14.00">Database write</text>
</g>
<!-- convert&#45;&gt;write -->
<g id="edge4" class="edge">
<title>convert&#45;&gt;write</title>
<path fill="none" stroke="black" d="M86,-71.7C86,-63.98 86,-54.71 86,-46.11" />
<polygon fill="black" stroke="black" points="89.5,-46.1 86,-36.1 82.5,-46.1 89.5,-46.1" />
</g>
</g>
</svg>
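<p>To give a feel for why the red boxes in the diagram above hurt, here is a small sketch that uses only the standard library. It is not pantab’s actual code: <code class="language-plaintext highlighter-rouge">struct</code> and <code class="language-plaintext highlighter-rouge">array</code> stand in for the database writer, and the list of tuples stands in for what <code class="language-plaintext highlighter-rouge">df.itertuples()</code> would yield.</p>

```python
import struct
from array import array

rows = [(1, 10), (2, 20), (3, 30)]  # stand-in for df.itertuples(index=False)


def write_row_by_row(rows) -> bytes:
    """Visit every element as a Python object and convert it individually."""
    out = bytearray()
    for row in rows:  # Python for loop
        for value in row:  # one PyObject per element
            out += struct.pack("=q", value)  # PyObject -> 64-bit primitive
    return bytes(out)


def write_from_buffer(column: array) -> bytes:
    """Hand over a contiguous buffer of primitives in one call."""
    return column.tobytes()


row_bytes = write_row_by_row(rows)
col_bytes = write_from_buffer(array("q", [1, 2, 3]))
```

<p>The first function touches the Python runtime once per element, while the second hands over an entire column without looking at individual values. That gap is what the later designs in this post try to close.</p>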

<p>A later version of pantab which required a minimum of pandas 1.3 ended up hacking into the internals of pandas, calling something like <code class="language-plaintext highlighter-rouge">df._mgr.column_arrays</code> to get a <code class="language-plaintext highlighter-rouge">NumPy</code> array for each column in the DataFrame. Combined with the <a href="https://numpy.org/doc/stable/reference/c-api/iterator.html">NumPy Array Iterator API</a>, pantab could iterate over raw NumPy arrays instead of doing a loop in Python.</p>

<!--
digraph G {
  node [
    shape=box
    style=filled
    color=black
    fillcolor=white
  ]

  rawdata [
    label = "df._mgr.column_arrays"
    color="#b20100"
    fillcolor="#edd5d5"
  ]
  df -> rawdata;

  forloop [
    label = "NumPy Array Iterator API"
  ]
  rawdata -> forloop;

  string [
    label = "Is string?"
    shape = diamond
    color=black
    fillcolor=white
  ]

  forloop -> string;

  convert [
    label = "PyObject -> primitive"
    color="#b20100"
    fillcolor="#edd5d5"
  ]
  string -> convert [
    label="yes"
  ]

  write [
    label = "Database write"
  ]
  string -> write [
    label="no"
  ]
  convert -> write;
}
-->

<svg width="257pt" height="423pt" viewBox="0.00 0.00 256.50 423.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 419)">
<title>G</title>
<polygon fill="white" stroke="transparent" points="-4,4 -4,-419 252.5,-419 252.5,4 -4,4" />
<!-- rawdata -->
<g id="node1" class="node">
<title>rawdata</title>
<polygon fill="#edd5d5" stroke="#b20100" points="236,-342 58,-342 58,-306 236,-306 236,-342" />
<text text-anchor="middle" x="147" y="-320.3" font-family="Times,serif" font-size="14.00">df._mgr.column_arrays</text>
</g>
<!-- forloop -->
<g id="node3" class="node">
<title>forloop</title>
<polygon fill="white" stroke="black" points="248.5,-269 45.5,-269 45.5,-233 248.5,-233 248.5,-269" />
<text text-anchor="middle" x="147" y="-247.3" font-family="Times,serif" font-size="14.00">NumPy Array Iterator API</text>
</g>
<!-- rawdata&#45;&gt;forloop -->
<g id="edge2" class="edge">
<title>rawdata&#45;&gt;forloop</title>
<path fill="none" stroke="black" d="M147,-305.81C147,-297.79 147,-288.05 147,-279.07" />
<polygon fill="black" stroke="black" points="150.5,-279.03 147,-269.03 143.5,-279.03 150.5,-279.03" />
</g>
<!-- df -->
<g id="node2" class="node">
<title>df</title>
<polygon fill="white" stroke="black" points="174,-415 120,-415 120,-379 174,-379 174,-415" />
<text text-anchor="middle" x="147" y="-393.3" font-family="Times,serif" font-size="14.00">df</text>
</g>
<!-- df&#45;&gt;rawdata -->
<g id="edge1" class="edge">
<title>df&#45;&gt;rawdata</title>
<path fill="none" stroke="black" d="M147,-378.81C147,-370.79 147,-361.05 147,-352.07" />
<polygon fill="black" stroke="black" points="150.5,-352.03 147,-342.03 143.5,-352.03 150.5,-352.03" />
</g>
<!-- string -->
<g id="node4" class="node">
<title>string</title>
<polygon fill="white" stroke="black" points="147,-196 69.58,-178 147,-160 224.42,-178 147,-196" />
<text text-anchor="middle" x="147" y="-174.3" font-family="Times,serif" font-size="14.00">Is string?</text>
</g>
<!-- forloop&#45;&gt;string -->
<g id="edge3" class="edge">
<title>forloop&#45;&gt;string</title>
<path fill="none" stroke="black" d="M147,-232.81C147,-224.79 147,-215.05 147,-206.07" />
<polygon fill="black" stroke="black" points="150.5,-206.03 147,-196.03 143.5,-206.03 150.5,-206.03" />
</g>
<!-- convert -->
<g id="node5" class="node">
<title>convert</title>
<polygon fill="#edd5d5" stroke="#b20100" points="172,-109 0,-109 0,-73 172,-73 172,-109" />
<text text-anchor="middle" x="86" y="-87.3" font-family="Times,serif" font-size="14.00">PyObject &#45;&gt; primitive</text>
</g>
<!-- string&#45;&gt;convert -->
<g id="edge4" class="edge">
<title>string&#45;&gt;convert</title>
<path fill="none" stroke="black" d="M136.37,-162.19C127.53,-149.87 114.73,-132.04 104.25,-117.43" />
<polygon fill="black" stroke="black" points="107.07,-115.36 98.39,-109.27 101.38,-119.44 107.07,-115.36" />
<text text-anchor="middle" x="133.5" y="-130.8" font-family="Times,serif" font-size="14.00">yes</text>
</g>
<!-- write -->
<g id="node6" class="node">
<title>write</title>
<polygon fill="white" stroke="black" points="209.5,-36 84.5,-36 84.5,0 209.5,0 209.5,-36" />
<text text-anchor="middle" x="147" y="-14.3" font-family="Times,serif" font-size="14.00">Database write</text>
</g>
<!-- string&#45;&gt;write -->
<g id="edge5" class="edge">
<title>string&#45;&gt;write</title>
<path fill="none" stroke="black" d="M157.6,-162.29C170.72,-142.23 190.27,-105.12 181,-73 178.13,-63.05 172.82,-53.16 167.23,-44.62" />
<polygon fill="black" stroke="black" points="169.99,-42.46 161.4,-36.26 164.25,-46.46 169.99,-42.46" />
<text text-anchor="middle" x="192" y="-87.3" font-family="Times,serif" font-size="14.00">no</text>
</g>
<!-- convert&#45;&gt;write -->
<g id="edge6" class="edge">
<title>convert&#45;&gt;write</title>
<path fill="none" stroke="black" d="M100.77,-72.81C108.26,-64.09 117.5,-53.34 125.74,-43.75" />
<polygon fill="black" stroke="black" points="128.51,-45.89 132.37,-36.03 123.2,-41.33 128.51,-45.89" />
</g>
</g>
</svg>

<p>This helped a lot with performance, and while the NumPy Array Iterator API was solid, the pandas internals <a href="https://github.com/innobi/pantab/issues/190">would change across releases</a>, so it took a lot of developer time to maintain.</p>

<p>The images and comments above assume we are writing a DataFrame to a Hyper file. Going the other way around, pantab would create a Python list of PyObjects and convert to more appropriate data types after everything was read. If we were to graph that process, it would be even more red - not good!</p>

<h2 id="initial-redesign-attempt---python-dataframe-interchange-protocol">Initial Redesign Attempt - Python DataFrame Interchange Protocol</h2>

<p>Before I ever considered the Arrow C Data Interface, my first try at getting high performance and easy data exchange from pandas to Hyper was through the <a href="https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html">Python DataFrame interchange protocol</a>. While initially promising, this soon became problematic.</p>

<p>For starters, <em>Memory ownership and lifetime</em> is listed as something in scope of the protocol, but is not actually defined. Implementers are free to choose how long a particular buffer should last, and it is up to the client to just know this. After many unexpected segfaults, I started to grow weary of this solution.</p>

<p>Another major issue for the interchange protocol is that <em>Non-Python API standardization (e.g., C/C++ APIs)</em> is explicitly a non-goal. With pantab being a consumer of raw data, this meant I had to know how to manage those raw buffers for every type I wished to consume.  While that may not be a huge deal for simple primitive types like sized integers, it leaves much to be desired when you try to work with more complex types like decimals.</p>

<p>Next topic - nullability! Here is the enumeration the protocol specified:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ColumnNullType</span><span class="p">(</span><span class="n">enum</span><span class="p">.</span><span class="n">IntEnum</span><span class="p">):</span>
    <span class="s">"""
    Integer enum for null type representation.

    Attributes
    ----------
    NON_NULLABLE : int
        Non-nullable column.
    USE_NAN : int
        Use explicit float NaN value.
    USE_SENTINEL : int
        Sentinel value besides NaN.
    USE_BITMASK : int
        The bit is set/unset representing a null on a certain position.
    USE_BYTEMASK : int
        The byte is set/unset representing a null on a certain position.
    """</span>

    <span class="n">NON_NULLABLE</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">USE_NAN</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="n">USE_SENTINEL</span> <span class="o">=</span> <span class="mi">2</span>
    <span class="n">USE_BITMASK</span> <span class="o">=</span> <span class="mi">3</span>
    <span class="n">USE_BYTEMASK</span> <span class="o">=</span> <span class="mi">4</span>
</code></pre></div></div>

<p>The way the DataFrame Interchange Protocol decided to handle nullability is an area where trying to be inclusive of many different strategies ended up as a detriment to all. Requiring developers to integrate all of these methods across any type they may consume is a lot of effort (particularly for <code class="language-plaintext highlighter-rouge">USE_SENTINEL</code>).</p>
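<p>To illustrate the burden, here is a sketch of the branching a consumer ends up writing just to answer “is element <code class="language-plaintext highlighter-rouge">i</code> missing?”. The <code class="language-plaintext highlighter-rouge">is_null</code> helper and its arguments are hypothetical, the enum is repeated so the snippet runs on its own, and the bit ordering in the bitmask branch is an assumption on my part.</p>

```python
import enum
import math


class ColumnNullType(enum.IntEnum):
    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4


def is_null(kind, null_value, values, mask, i) -> bool:
    """Decide whether element ``i`` is missing for a given null strategy.

    ``null_value`` is the sentinel for USE_SENTINEL, or the 0/1 mask value
    that means "missing" for the two mask kinds.
    """
    if kind == ColumnNullType.NON_NULLABLE:
        return False
    if kind == ColumnNullType.USE_NAN:
        return math.isnan(values[i])
    if kind == ColumnNullType.USE_SENTINEL:
        return values[i] == null_value
    if kind == ColumnNullType.USE_BYTEMASK:
        return mask[i] == null_value
    if kind == ColumnNullType.USE_BITMASK:
        # Assumption: bits are numbered least-significant first within a byte.
        bit = (mask[i // 8] >> (i % 8)) & 1
        return bit == null_value
    raise ValueError(f"unknown null kind: {kind}")
```

<p>That is five code paths in Python alone. In a C or C++ extension, each of them also has to be repeated for every data type you want to support.</p>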

<p>Another limitation with the DataFrame Interchange Protocol is the fact that it only talks about how to consume data, but offers no guidance on how to produce it. If starting from your extension, you have no tools or library to manually build buffers. Much like the <a href="#status-quo">status quo</a>, this meant reading from a Hyper database to a pandas DataFrame would likely be going through Python objects.</p>

<p>Finally, and related to all of the issues above, the pandas implementation of the DataFrame Interchange Protocol left a lot to be desired. While started with good intentions, it never got the attention needed to make it really effective. I already mentioned the lifetime issues across various data types, but nullability handling was all over the place across types. Metadata was often passed along incorrectly from pandas down through the interface…essentially making it a very high effort for consumers to try and use it.</p>

<h2 id="arrow-c-data-interface-to-the-rescue">Arrow C Data Interface to the Rescue</h2>

<p>After stumbling around the DataFrame Interchange Protocol for a few weeks, <a href="https://jorisvandenbossche.github.io/pages/about.html">Joris Van den Bossche</a> asked me why I didn’t look at the <a href="https://arrow.apache.org/docs/format/CDataInterface.html">Arrow C Data Interface</a>. The answer of course was that I was just not very familiar with it. Joris knows a ton about pandas and Arrow, so I figured it best to take his word for it and try it out.</p>

<p>Almost immediately my issues went away. To wit:</p>

<ol>
  <li>Memory ownership and lifetime - <a href="https://arrow.apache.org/docs/format/CDataInterface.html#memory-management">well defined</a> at low levels</li>
  <li>Non-Python API - for this there is <a href="https://arrow.apache.org/nanoarrow/latest/index.html">nanoarrow</a></li>
  <li>Nullability handling - uses Arrow bitmasks</li>
  <li>Producing buffers - can create (not just read) data</li>
  <li>pandas implementation - it <em>just works</em> via <a href="https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#arrowstream-export">PyCapsules</a></li>
</ol>

<p>With well defined memory semantics, a low-level API and clean nullability handling, the amount of extension code I had to write was drastically reduced. I felt more confident in the implementation and had to deal with less memory corruption / crashes than before. And, perhaps most importantly, I saved a lot of time.</p>
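<p>For comparison, this is roughly all the null handling an Arrow consumer needs. The sketch is plain Python rather than the C code you would write against nanoarrow, the helper name is made up, and it ignores array offsets for brevity.</p>

```python
def arrow_is_valid(validity, i: int) -> bool:
    """Check slot ``i`` against an Arrow validity bitmap.

    Arrow uses least-significant-bit numbering, and a set bit means the
    value is valid (non-null). A missing bitmap means no nulls at all.
    """
    if validity is None:
        return True
    return bool((validity[i // 8] >> (i % 8)) & 1)


# Bitmap 0b00000101 describes the array [1, null, 3].
bitmap = bytes([0b00000101])
valid = [arrow_is_valid(bitmap, i) for i in range(3)]
```

<p>There is one representation to handle regardless of the data type, which is a big part of why the extension code shrank.</p>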

<p>See the image below for a high level overview of the process. Note the lack of any red compared to the <a href="#status-quo">status quo</a> - this has a very limited interaction with the Python runtime:</p>

<!--
digraph G {
  node [
    shape=box
    style=filled
    color=black
    fillcolor=white
  ]

  rawdata [
    label = "df.__arrow_c_stream__()"
  ]
  df -> rawdata;

  forloop [
    label = "Arrow C API / nanoarrow"
  ]
  rawdata -> forloop;

  write [
    label = "Database write"
  ]
  forloop -> write
}
-->

<svg width="202pt" height="260pt" viewBox="0.00 0.00 202.00 260.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 256)">
<title>G</title>
<polygon fill="white" stroke="transparent" points="-4,4 -4,-256 198,-256 198,4 -4,4" />
<!-- rawdata -->
<g id="node1" class="node">
<title>rawdata</title>
<polygon fill="white" stroke="black" points="189.5,-180 4.5,-180 4.5,-144 189.5,-144 189.5,-180" />
<text text-anchor="middle" x="97" y="-158.3" font-family="Times,serif" font-size="14.00">df.__arrow_c_stream__()</text>
</g>
<!-- forloop -->
<g id="node3" class="node">
<title>forloop</title>
<polygon fill="white" stroke="black" points="194,-108 0,-108 0,-72 194,-72 194,-108" />
<text text-anchor="middle" x="97" y="-86.3" font-family="Times,serif" font-size="14.00">Arrow C API / nanoarrow</text>
</g>
<!-- rawdata&#45;&gt;forloop -->
<g id="edge2" class="edge">
<title>rawdata&#45;&gt;forloop</title>
<path fill="none" stroke="black" d="M97,-143.7C97,-135.98 97,-126.71 97,-118.11" />
<polygon fill="black" stroke="black" points="100.5,-118.1 97,-108.1 93.5,-118.1 100.5,-118.1" />
</g>
<!-- df -->
<g id="node2" class="node">
<title>df</title>
<polygon fill="white" stroke="black" points="124,-252 70,-252 70,-216 124,-216 124,-252" />
<text text-anchor="middle" x="97" y="-230.3" font-family="Times,serif" font-size="14.00">df</text>
</g>
<!-- df&#45;&gt;rawdata -->
<g id="edge1" class="edge">
<title>df&#45;&gt;rawdata</title>
<path fill="none" stroke="black" d="M97,-215.7C97,-207.98 97,-198.71 97,-190.11" />
<polygon fill="black" stroke="black" points="100.5,-190.1 97,-180.1 93.5,-190.1 100.5,-190.1" />
</g>
<!-- write -->
<g id="node4" class="node">
<title>write</title>
<polygon fill="white" stroke="black" points="159.5,-36 34.5,-36 34.5,0 159.5,0 159.5,-36" />
<text text-anchor="middle" x="97" y="-14.3" font-family="Times,serif" font-size="14.00">Database write</text>
</g>
<!-- forloop&#45;&gt;write -->
<g id="edge3" class="edge">
<title>forloop&#45;&gt;write</title>
<path fill="none" stroke="black" d="M97,-71.7C97,-63.98 97,-54.71 97,-46.11" />
<polygon fill="black" stroke="black" points="100.5,-46.1 97,-36.1 93.5,-46.1 100.5,-46.1" />
</g>
</g>
</svg>

<p>Without going too deep in the benchmarks game, the Arrow C Data Interface implementation yielded a 25% performance improvement for me when writing strings. When reading data, it was more like a 500% improvement over what had previously been implemented. Not bad…</p>

<p>My code is no longer tied to the potentially fragile internals of pandas, and with the stability of the Arrow C Data Interface things are far less likely to break when new versions are released.</p>

<h2 id="bonus-feature---bring-your-own-library">Bonus Feature - Bring Your Own Library</h2>

<p>While it wasn’t my goal at the outset, implementing the Arrow C Data Interface had the benefit of decoupling a dependency on pandas. pandas was the de facto library when pantab was first written, but since then many high quality Arrow-based libraries have popped up.</p>

<p>With the Arrow C Data Interface, pantab now has a <em>bring your own DataFrame library mentality</em>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">pantab</span> <span class="k">as</span> <span class="n">pt</span>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"col"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]})</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">pt</span><span class="p">.</span><span class="n">frame_to_hyper</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s">"example.hyper"</span><span class="p">,</span> <span class="n">table</span><span class="o">=</span><span class="s">"test"</span><span class="p">)</span>

<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"col"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]})</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">pt</span><span class="p">.</span><span class="n">frame_to_hyper</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s">"example.hyper"</span><span class="p">,</span> <span class="n">table</span><span class="o">=</span><span class="s">"test"</span><span class="p">)</span>

<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="n">pa</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">tbl</span> <span class="o">=</span> <span class="n">pa</span><span class="p">.</span><span class="n">Table</span><span class="p">.</span><span class="n">from_pydict</span><span class="p">({</span><span class="s">"col"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]})</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">pt</span><span class="p">.</span><span class="n">frame_to_hyper</span><span class="p">(</span><span class="n">tbl</span><span class="p">,</span> <span class="s">"example.hyper"</span><span class="p">,</span> <span class="n">table</span><span class="o">=</span><span class="s">"test"</span><span class="p">)</span>
</code></pre></div></div>

<p>These all produce the same results, and as the author of pantab I did not have to do anything extra to accommodate the various libraries - everything <em>just works</em>.</p>
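
<p>As a rough illustration of why no per-library code is needed: recent versions of pandas, polars, and pyarrow all expose the Arrow PyCapsule method <code class="language-plaintext highlighter-rouge">__arrow_c_stream__</code>, so a consumer only has to look for that one method. The sketch below is not pantab’s implementation; <code class="language-plaintext highlighter-rouge">FakeFrame</code> and <code class="language-plaintext highlighter-rouge">describe_input</code> are hypothetical names, and the stand-in returns a placeholder string where a real library would return a PyCapsule.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class FakeFrame:
    """Hypothetical stand-in for a dataframe from any Arrow-aware library."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real library returns a PyCapsule named "arrow_array_stream" here;
        # this placeholder only keeps the sketch dependency-free.
        return "placeholder-for-capsule"


def describe_input(obj):
    """Report whether obj can hand over its data as an Arrow C stream."""
    export = getattr(obj, "__arrow_c_stream__", None)
    if callable(export):
        return "arrow stream available"
    return "unsupported input"


print(describe_input(FakeFrame()))
print(describe_input([1, 2, 3]))
</code></pre></div></div>

<p>The consumer never asks which library produced the object, which is what keeps the integration code small.</p>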

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>The Arrow specification is simply put…awesome. While initiatives like the Python DataFrame Protocol have tried to solve the issue of interchange, I don’t believe that goal was ever achieved…until now. The Arrow C Data Interface is the tool developers have always needed to make analytics integrations <em>easy</em>.</p>

<p>pantab is not the first library to take advantage of these features. The Arrow ADBC drivers I <a href="/leveraging-the-adbc-driver-in-analytics-workflows.html">previously blogged about</a> are also huge users of nanoarrow / the Arrow C Data Interface, and heavily influenced the design of pantab. The <a href="https://arrow.apache.org/powered_by/">Powered By Apache Arrow</a> project page is the best resource to find others as they get developed in the future.</p>

<p>I, for one, am excited to see Arrow-based tooling grow and make open-source data integrations more powerful than ever before.</p>]]></content><author><name>Will Ayd</name></author><category term="arrow" /><category term="python" /><category term="arrow" /><summary type="html"><![CDATA[This blog post describes how the Arrow C Data interface works, as witnessed through transformation of the pantab library.]]></summary></entry><entry><title type="html">Leveraging the ADBC driver in Analytics Workflows</title><link href="https://willayd.com/leveraging-the-adbc-driver-in-analytics-workflows.html" rel="alternate" type="text/html" title="Leveraging the ADBC driver in Analytics Workflows" /><published>2023-06-16T00:00:00+00:00</published><updated>2023-06-16T00:00:00+00:00</updated><id>https://willayd.com/leveraging-the-adbc-driver-in-analytics-workflows</id><content type="html" xml:base="https://willayd.com/leveraging-the-adbc-driver-in-analytics-workflows.html"><![CDATA[<p>The <a href="https://arrow.apache.org/docs/format/ADBC.html">ADBC: Arrow Database Connectivity</a> client API is a new standard <a href="https://arrow.apache.org/blog/2023/01/05/introducing-arrow-adbc/">introduced in January 2023</a>. Sparing some technical details, traditional standards like ODBC/JDBC have operated on data in a <em>row-oriented</em> manner. This made sense at the time those standards were created (in the 1990s) as the databases they targeted were predominantly row-oriented as well. The past decade of analytics has shown a strong inclination towards <em>column-oriented</em> database storage, so using ODBC/JDBC to transfer data means you always have to spend resources translating between row- and column-oriented formats.</p>
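
<p>To make that translation cost concrete, here is a small sketch in plain Python. The <code class="language-plaintext highlighter-rouge">rows</code> and <code class="language-plaintext highlighter-rouge">names</code> values are made up for illustration. A row-oriented driver hands back one tuple per record, and a columnar consumer has to regroup every value by column before it can use them.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Row-oriented result set: one tuple per record, as a DBAPI-style
# fetchall() would return it (values invented for illustration).
rows = [
    (1, "a", 1.5),
    (2, "b", 2.5),
    (3, "c", 3.5),
]
names = ["id", "label", "score"]

# Regroup every value by column; this touches each cell once, which is
# the extra work a columnar consumer pays on every transfer.
columns = {name: list(values) for name, values in zip(names, zip(*rows))}

print(columns["id"])
print(columns["score"])
</code></pre></div></div>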

<p>Many column databases solve the row-&gt;column transposition issue by ingesting or exporting columnar file formats like <a href="https://parquet.apache.org/docs/file-format/">Apache Parquet</a>. This can be an indispensable tool for achieving high throughput, but in going this route you often sacrifice the ecosystem benefits of standard tooling like ODBC/JDBC. Using pandas as an example, I can very easily read/write from almost any database using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html">pd.read_sql</a> and <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html">pd.DataFrame.to_sql</a>. This works well for smallish datasets, but when you run into scalability issues you often end up exporting/importing via CSV/parquet, adding more potential points of breakage to your pipelines.</p>

<p>It’s worth noting that even if your source/target database is not columnar, ADBC has the advantage of being implemented at a low level. ADBC is tightly integrated with the <a href="https://arrow.apache.org/docs/format/Columnar.html">Arrow Columnar Format</a>, which essentially means that ADBC can optimally work with the data using its primitive layout in memory. Pandas by contrast does NOT have this, so all of the <code class="language-plaintext highlighter-rouge">to_sql</code> and <code class="language-plaintext highlighter-rouge">read_sql</code> calls you make in pandas have to do a lot of extra work at runtime to have database communications fit into the pandas data model. This is by no means free and is one of the reasons why SQL interaction in pandas is slow, not to mention all the extra hoops pandas has to jump through to (oftentimes unsuccessfully) manage data types.</p>

<p>To see how much ADBC could help my workflows I decided to test things out against the Python ADBC Postgres Driver and compare it to the functional equivalent in pandas. As of writing the ADBC Postgres driver is still <a href="https://arrow.apache.org/adbc/main/driver/status.html">considered experimental</a>, but I encourage you to <a href="https://arrow.apache.org/adbc/main/driver/installation.html">install it on your own</a> and try it out!</p>

<h2 id="performance-benchmarking">Performance Benchmarking</h2>

<p>The following code serves as a crude benchmark for performance. If you’d like to run this on your end, simply tweak <code class="language-plaintext highlighter-rouge">PG_URI</code> to match your database configuration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">functools</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">from</span> <span class="nn">collections.abc</span> <span class="kn">import</span> <span class="n">Callable</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="n">pa</span>
<span class="kn">import</span> <span class="nn">sqlalchemy</span> <span class="k">as</span> <span class="n">sa</span>
<span class="kn">from</span> <span class="nn">adbc_driver_postgresql</span> <span class="kn">import</span> <span class="n">dbapi</span>


<span class="n">PG_URI</span> <span class="o">=</span> <span class="s">"postgresql://"</span>


<span class="k">def</span> <span class="nf">print_runtime</span><span class="p">(</span><span class="n">func</span><span class="p">:</span> <span class="n">Callable</span><span class="p">):</span>

    <span class="o">@</span><span class="n">functools</span><span class="p">.</span><span class="n">wraps</span><span class="p">(</span><span class="n">func</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">wrapper</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">func</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="n">end</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
        <span class="n">runtime</span> <span class="o">=</span> <span class="n">end</span> <span class="o">-</span> <span class="n">start</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"function </span><span class="si">{</span><span class="n">func</span><span class="p">.</span><span class="n">__name__</span><span class="si">}</span><span class="s"> took </span><span class="si">{</span><span class="n">runtime</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">result</span>

    <span class="k">return</span> <span class="n">wrapper</span>


<span class="o">@</span><span class="n">print_runtime</span>
<span class="k">def</span> <span class="nf">write_pandas</span><span class="p">(</span><span class="n">df</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">):</span>
    <span class="n">table_name</span> <span class="o">=</span> <span class="s">"pandas_data"</span>
    <span class="n">engine</span> <span class="o">=</span> <span class="n">sa</span><span class="p">.</span><span class="n">create_engine</span><span class="p">(</span><span class="n">PG_URI</span><span class="p">)</span>
    <span class="n">df</span><span class="p">.</span><span class="n">to_sql</span><span class="p">(</span><span class="n">table_name</span><span class="p">,</span> <span class="n">engine</span><span class="p">,</span> <span class="n">if_exists</span><span class="o">=</span><span class="s">"replace"</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">"multi"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>


<span class="o">@</span><span class="n">print_runtime</span>
<span class="k">def</span> <span class="nf">write_arrow</span><span class="p">(</span><span class="n">tbl</span><span class="p">:</span> <span class="n">pa</span><span class="p">.</span><span class="n">Table</span><span class="p">):</span>
    <span class="n">table_name</span> <span class="o">=</span> <span class="s">"arrow_data"</span>
    <span class="k">with</span> <span class="n">dbapi</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">PG_URI</span><span class="p">)</span> <span class="k">as</span> <span class="n">conn</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">conn</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span> <span class="k">as</span> <span class="n">cur</span><span class="p">:</span>
            <span class="n">cur</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="sa">f</span><span class="s">"DROP TABLE IF EXISTS </span><span class="si">{</span><span class="n">table_name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="k">with</span> <span class="n">conn</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span> <span class="k">as</span> <span class="n">cur</span><span class="p">:</span>
            <span class="n">cur</span><span class="p">.</span><span class="n">adbc_ingest</span><span class="p">(</span><span class="n">table_name</span><span class="p">,</span> <span class="n">tbl</span><span class="p">)</span>

        <span class="n">conn</span><span class="p">.</span><span class="n">commit</span><span class="p">()</span>

<span class="o">@</span><span class="n">print_runtime</span>
<span class="k">def</span> <span class="nf">read_pandas</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">table_name</span> <span class="o">=</span> <span class="s">"pandas_data"</span>
    <span class="n">engine</span> <span class="o">=</span> <span class="n">sa</span><span class="p">.</span><span class="n">create_engine</span><span class="p">(</span><span class="n">PG_URI</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql</span><span class="p">(</span><span class="sa">f</span><span class="s">"SELECT * FROM </span><span class="si">{</span><span class="n">table_name</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">engine</span><span class="p">)</span>


<span class="o">@</span><span class="n">print_runtime</span>
<span class="k">def</span> <span class="nf">read_arrow</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="n">pa</span><span class="p">.</span><span class="n">Table</span><span class="p">:</span>
    <span class="n">table_name</span> <span class="o">=</span> <span class="s">"arrow_data"</span>
    <span class="k">with</span> <span class="n">dbapi</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">PG_URI</span><span class="p">)</span> <span class="k">as</span> <span class="n">conn</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">conn</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span> <span class="k">as</span> <span class="n">cur</span><span class="p">:</span>
            <span class="n">cur</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="sa">f</span><span class="s">"SELECT * FROM </span><span class="si">{</span><span class="n">table_name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">cur</span><span class="p">.</span><span class="n">fetch_arrow_table</span><span class="p">()</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">10_000</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">100_000</span><span class="p">,</span> <span class="mi">10</span><span class="p">)),</span> <span class="n">columns</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="s">"abcdefghij"</span><span class="p">))</span>
    <span class="n">tbl</span> <span class="o">=</span> <span class="n">pa</span><span class="p">.</span><span class="n">Table</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>

    <span class="n">write_pandas</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
    <span class="n">write_arrow</span><span class="p">(</span><span class="n">tbl</span><span class="p">)</span>

    <span class="n">df_new</span> <span class="o">=</span> <span class="n">read_pandas</span><span class="p">()</span>
    <span class="n">tbl_new</span> <span class="o">=</span> <span class="n">read_arrow</span><span class="p">()</span>
</code></pre></div></div>

<p>Executing this very unscientific benchmark yields the following results on my machine:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function </span>write_pandas took 11.065816879272461
<span class="k">function </span>write_arrow took 1.1672894954681396
<span class="k">function </span>read_pandas took 0.2586965560913086
<span class="k">function </span>read_arrow took 0.0703287124633789
</code></pre></div></div>

<p>From this we can see the ADBC connector is significantly faster on both read and write. Keep in mind that Postgres is a <em>row-oriented</em> database; my expectation is that the performance benefits would be even bigger for a <em>column-oriented</em> database!</p>
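
<p>Because each function above is timed once with <code class="language-plaintext highlighter-rouge">time.time()</code>, the numbers include one-off noise. A slightly less crude variant, sketched here under the assumption that re-running each function a few times is acceptable (the writes above replace or drop their table first, so they are repeatable), uses <code class="language-plaintext highlighter-rouge">time.perf_counter()</code> and reports the best of several runs. The names <code class="language-plaintext highlighter-rouge">best_of</code>, <code class="language-plaintext highlighter-rouge">repeats</code>, and <code class="language-plaintext highlighter-rouge">busy_work</code> are hypothetical and not part of the original script.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import functools
import time


def best_of(repeats=3):
    """Decorator factory: run func `repeats` times, print the fastest run."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            timings = []
            for _ in range(repeats):
                start = time.perf_counter()
                result = func(*args, **kwargs)
                timings.append(time.perf_counter() - start)
            print(f"function {func.__name__} best of {repeats}: {min(timings):.6f}s")
            # Return the result of the last run.
            return result

        return wrapper

    return decorator


@best_of(repeats=5)
def busy_work():
    # Placeholder workload standing in for a database call.
    return sum(range(100_000))


total = busy_work()
</code></pre></div></div>

<p>Swapping <code class="language-plaintext highlighter-rouge">@print_runtime</code> for <code class="language-plaintext highlighter-rouge">@best_of()</code> on the four functions would be one way to apply it, though a proper benchmark would also control for caching on the database side.</p>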

<h2 id="better-data-types">Better Data Types</h2>

<p>If you’ve worked with pandas in an ETL workflow, chances are high that you’ve had to do some post-processing on numeric data. This happens often with nullable integral data (which the NumPy backend to pandas physically cannot store), but can also happen for many other reasons that differ across databases / driver implementations. For the sake of illustration, let’s append a row of NULL values to our tables.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">arrow_data</span> <span class="k">VALUES</span> <span class="p">(</span><span class="k">NULL</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">pandas_data</span> <span class="k">VALUES</span> <span class="p">(</span><span class="k">NULL</span><span class="p">);</span>
</code></pre></div></div>

<p>This has no impact on the Arrow code we wrote previously:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">tbl_new</span> <span class="o">=</span> <span class="n">read_arrow</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">tbl</span><span class="p">.</span><span class="n">schema</span> <span class="o">==</span> <span class="n">tbl_new</span><span class="p">.</span><span class="n">schema</span>
<span class="bp">True</span>
</code></pre></div></div>

<p>But it will impact the pandas code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">df_new</span> <span class="o">=</span> <span class="n">read_pandas</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">dtypes</span> <span class="o">==</span> <span class="n">df_new</span><span class="p">.</span><span class="n">dtypes</span><span class="p">).</span><span class="nb">all</span><span class="p">()</span>
<span class="bp">False</span>
</code></pre></div></div>

<p>Even though nothing changed with the data type in the database, we’ve gone from using integral data in pandas / postgres to now introducing float data in pandas, solely due to the introduction of <code class="language-plaintext highlighter-rouge">NULL</code> values in postgres. This can come up unexpectedly and be very surprising. To prevent this on the pandas side, you will see things like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">df_new</span> <span class="o">=</span> <span class="n">df_new</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="s">"Int32"</span><span class="p">)</span>
</code></pre></div></div>

<p>OR</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">table_name</span> <span class="o">=</span> <span class="s">"pandas_data"</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">engine</span> <span class="o">=</span> <span class="n">sa</span><span class="p">.</span><span class="n">create_engine</span><span class="p">(</span><span class="n">PG_URI</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">df_new</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql_query</span><span class="p">(</span><span class="sa">f</span><span class="s">"SELECT * FROM </span><span class="si">{</span><span class="n">table_name</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">engine</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s">"Int32"</span><span class="p">)</span>
</code></pre></div></div>

<p>OR (starting in pandas 2.0)</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">table_name</span> <span class="o">=</span> <span class="s">"pandas_data"</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">engine</span> <span class="o">=</span> <span class="n">sa</span><span class="p">.</span><span class="n">create_engine</span><span class="p">(</span><span class="n">PG_URI</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">df_new</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql</span><span class="p">(</span><span class="sa">f</span><span class="s">"SELECT * FROM </span><span class="si">{</span><span class="n">table_name</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">engine</span><span class="p">,</span>
<span class="p">...</span>   <span class="n">dtype_backend</span><span class="o">=</span><span class="s">"pyarrow"</span><span class="p">)</span>
</code></pre></div></div>

<p>These are 3 different ways to solve the problem, each introducing its own nuances. If you already knew about the issues with nullable integral data and the NumPy backend in pandas then maybe this isn’t surprising, but not every user has or needs to have that low-level an understanding of pandas. This was also a controlled example; in the real world you either need to be overly defensive or open to surprise when minor changes in your database data change your pandas data types and subsequent workflows. With the ADBC driver you do not have this issue; the data type you read is simply inferred from the database metadata.</p>
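
<p>The root cause can be shown without pandas. The sketch below uses the standard-library <code class="language-plaintext highlighter-rouge">array</code> module as a stand-in for a fixed-width int64 buffer; that is an assumption for illustration, not how pandas or Arrow are implemented internally. An int64 buffer has no spare bit pattern for “missing”, so the fallback is a float column with NaN, while Arrow keeps the integers and tracks nulls in a separate validity bitmap.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from array import array

# A fixed-width int64 buffer cannot hold a missing value.
try:
    array("q", [1, 2, None])
    stored_none = True
except TypeError:
    stored_none = False
print(stored_none)

# The NumPy-style fallback: switch to float64 and use NaN for the gap.
as_float = array("d", [1.0, 2.0, float("nan")])

# Floats cannot represent every int64 exactly, so large values can change.
big = 2**53 + 1
print(int(float(big)) == big)

# The Arrow-style alternative: keep the integers, and record validity
# separately (a list of booleans here; Arrow packs this into a bitmap).
values = array("q", [1, 2, 0])
validity = [True, True, False]
restored = [v if ok else None for v, ok in zip(values, validity)]
print(restored)
</code></pre></div></div>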

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>I for one am really excited to see how ADBC continues to evolve. Moving data from one database to another takes up a significant amount of my time as a data engineer, and the ability to do that faster with cleaner data types will be powerful. As more databases (particularly columnar ones) implement <a href="https://arrow.apache.org/docs/format/FlightSql.html">Arrow Flight SQL</a> or at least provide ADBC clients I expect a lot of ETL tools to start leveraging ADBC drivers in turn.</p>]]></content><author><name>Will Ayd</name></author><category term="performance" /><category term="python" /><category term="adbc" /><summary type="html"><![CDATA[This blog post describes the actively developed ADBC driver and what it means for libraries like pandas.]]></summary></entry><entry><title type="html">Comparing Cython to Rust - Evaluating Python Extensions</title><link href="https://willayd.com/comparing-cython-to-rust-evaluating-python-extensions.html" rel="alternate" type="text/html" title="Comparing Cython to Rust - Evaluating Python Extensions" /><published>2023-05-17T00:00:00+00:00</published><updated>2023-05-17T00:00:00+00:00</updated><id>https://willayd.com/comparing-cython-to-rust-evaluating-python-extensions</id><content type="html" xml:base="https://willayd.com/comparing-cython-to-rust-evaluating-python-extensions.html"><![CDATA[<p><a href="https://www.rust-lang.org/">Rust</a> as a language has had tremendous growth in recent years. With no intention of starting a language war, Rust has a much stronger type checking system than a language like C, and arguably feels more approachable than a language like C++. It also includes thread safety as part of the language, which can be immensely useful for those looking to optimize their system.</p>

<p>Rust is also growing in usage as an extension language for Python. <a href="https://github.com/PyO3/pyo3">PyO3</a> makes writing extensions relatively easy, especially when compared to the same toolchain(s) for C/C++ extensions. While not as “pythonic” as Cython, you can argue that Rust is more approachable to Python developers than C/C++ are as languages. To see it in action, let’s compare a Cython-written extension to a Rust-written extension.</p>

<p>For demonstration purposes we are taking a trivial example of a custom-implemented <code class="language-plaintext highlighter-rouge">max</code> function along the columns of a NumPy array. The example is admittedly naive (NumPy natively can handle this), but as a developer you may find yourself following a similar pattern for custom algorithms.</p>

<p>The source code for these exercises is available on my <a href="https://github.com/WillAyd/rustpy">GitHub</a>.</p>

<h2 id="coding-the-example-in-cython">Coding the example in Cython</h2>

<p>Here is our <code class="language-plaintext highlighter-rouge">find_max</code> function with a relatively optimized Cython implementation. Within a <code class="language-plaintext highlighter-rouge">cdef</code> function, we determine the bounds of a 2D int64 array, loop over the columns / rows and evaluate each member of the array, looking for the largest value in each column.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cimport</span> <span class="n">cython</span>
<span class="k">from</span> <span class="n">libc</span><span class="p">.</span><span class="n">limits</span> <span class="n">cimport</span> <span class="n">LLONG_MIN</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="k">from</span> <span class="n">numpy</span> <span class="n">cimport</span> <span class="n">ndarray</span><span class="p">,</span> <span class="n">int64_t</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="o">@</span><span class="n">cython</span><span class="p">.</span><span class="n">boundscheck</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
<span class="o">@</span><span class="n">cython</span><span class="p">.</span><span class="n">wraparound</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
<span class="n">cdef</span> <span class="n">ndarray</span><span class="p">[</span><span class="n">int64_t</span><span class="p">,</span> <span class="n">ndim</span><span class="o">=</span><span class="mi">1</span><span class="p">]</span> <span class="n">_find_max</span><span class="p">(</span><span class="n">ndarray</span><span class="p">[</span><span class="n">int64_t</span><span class="p">,</span> <span class="n">ndim</span><span class="o">=</span><span class="mi">2</span><span class="p">]</span> <span class="n">values</span><span class="p">):</span>
    <span class="n">cdef</span><span class="p">:</span>
        <span class="n">ndarray</span><span class="p">[</span><span class="n">int64_t</span><span class="p">,</span> <span class="n">ndim</span><span class="o">=</span><span class="mi">1</span><span class="p">]</span> <span class="n">out</span>
        <span class="n">int64_t</span> <span class="n">val</span><span class="p">,</span> <span class="n">colnum</span><span class="p">,</span> <span class="n">rownum</span><span class="p">,</span> <span class="n">new_val</span>
        <span class="n">Py_ssize_t</span> <span class="n">N</span><span class="p">,</span> <span class="n">K</span>

    <span class="n">N</span><span class="p">,</span> <span class="n">K</span> <span class="o">=</span> <span class="p">(</span><span class="o">&lt;</span><span class="nb">object</span><span class="o">&gt;</span><span class="n">values</span><span class="p">).</span><span class="n">shape</span>
    <span class="n">out</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">int64</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">colnum</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">K</span><span class="p">):</span>
        <span class="n">val</span> <span class="o">=</span> <span class="n">LLONG_MIN</span>  <span class="c1"># imperfect assumption, but no INT64_T_MIN from numpy
</span>        <span class="k">for</span> <span class="n">rownum</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N</span><span class="p">):</span>
            <span class="n">new_val</span> <span class="o">=</span> <span class="n">values</span><span class="p">[</span><span class="n">rownum</span><span class="p">,</span> <span class="n">colnum</span><span class="p">]</span>
            <span class="k">if</span> <span class="n">val</span> <span class="o">&lt;</span> <span class="n">new_val</span><span class="p">:</span>
                <span class="n">val</span> <span class="o">=</span> <span class="n">new_val</span>

        <span class="n">out</span><span class="p">[</span><span class="n">colnum</span><span class="p">]</span> <span class="o">=</span> <span class="n">val</span>

    <span class="k">return</span> <span class="n">out</span>


<span class="k">def</span> <span class="nf">find_max</span><span class="p">(</span><span class="n">ndarray</span><span class="p">[</span><span class="n">int64_t</span><span class="p">,</span> <span class="n">ndim</span><span class="o">=</span><span class="mi">2</span><span class="p">]</span> <span class="n">values</span><span class="p">):</span>
    <span class="n">cdef</span> <span class="n">ndarray</span><span class="p">[</span><span class="n">int64_t</span><span class="p">,</span> <span class="n">ndim</span><span class="o">=</span><span class="mi">1</span><span class="p">]</span> <span class="n">result</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time_ns</span><span class="p">()</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">_find_max</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
    <span class="n">end</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time_ns</span><span class="p">()</span>
    <span class="n">duration</span> <span class="o">=</span> <span class="p">(</span><span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span> <span class="o">/</span> <span class="mi">1_000_000</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"cypy took </span><span class="si">{</span><span class="n">duration</span><span class="si">}</span><span class="s"> milliseconds"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">result</span>
</code></pre></div></div>
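
<p>For sanity-checking either extension, the same loop can be written in pure Python over nested lists. This is only a reference sketch: <code class="language-plaintext highlighter-rouge">find_max_reference</code> is a hypothetical name, it assumes a non-empty, rectangular input, and it will be far slower than the compiled versions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def find_max_reference(values):
    """Column-wise max of a rectangular list of rows, mirroring _find_max."""
    n_cols = len(values[0])
    out = []
    for colnum in range(n_cols):
        # Seed with the first row instead of a sentinel such as LLONG_MIN.
        val = values[0][colnum]
        for row in values[1:]:
            val = max(val, row[colnum])
        out.append(val)
    return out


print(find_max_reference([[1, 9, 3], [7, 2, 8]]))
</code></pre></div></div>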

<p>For brevity I won’t be listing out the instructions to cythonize and build a shared library, but if you need you can follow similar instructions from the previous article on <a href="/fundamental-python-debugging-part-3-cython-extensions.html">debugging Cython extensions with gdb</a>. For this article, assume that this gets built to a shared library named <code class="language-plaintext highlighter-rouge">cypy</code>.</p>

<h2 id="building-the-same-in-rust">Building the same in Rust</h2>

<p>PyO3 will be our tool for setting up Rust &lt;&gt; Python interoperability. Per their <a href="https://pyo3.rs/v0.18.3/module">documentation on building modules</a> we could choose to build manually or use <a href="https://github.com/PyO3/maturin">maturin</a>. For ease of demonstration we will use the latter.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>maturin new rustpy
<span class="nv">$ </span><span class="nb">cd </span>rustpy
</code></pre></div></div>

<p>Within our newly created project, add <code class="language-plaintext highlighter-rouge">numpy = "0.18"</code> to the <code class="language-plaintext highlighter-rouge">[dependencies]</code> section of <code class="language-plaintext highlighter-rouge">Cargo.toml</code>. This will let us use the <a href="https://github.com/PyO3/rust-numpy">rust-numpy</a> crate to pass NumPy arrays between Python and Rust. Afterwards, open <code class="language-plaintext highlighter-rouge">src/lib.rs</code> and insert the following code:</p>

<div class="language-rs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">numpy</span><span class="p">::</span><span class="nn">ndarray</span><span class="p">::{</span><span class="n">Array1</span><span class="p">,</span> <span class="n">ArrayView2</span><span class="p">,</span> <span class="n">Axis</span><span class="p">};</span>
<span class="k">use</span> <span class="nn">numpy</span><span class="p">::{</span><span class="n">IntoPyArray</span><span class="p">,</span> <span class="n">PyArray1</span><span class="p">,</span> <span class="n">PyReadonlyArray2</span><span class="p">};</span>
<span class="k">use</span> <span class="nn">pyo3</span><span class="p">::{</span><span class="n">pymodule</span><span class="p">,</span> <span class="nn">types</span><span class="p">::</span><span class="n">PyModule</span><span class="p">,</span> <span class="n">PyResult</span><span class="p">,</span> <span class="n">Python</span><span class="p">};</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">time</span><span class="p">::</span><span class="n">SystemTime</span><span class="p">;</span>

<span class="nd">#[pymodule]</span>
<span class="nd">#[pyo3(name</span> <span class="nd">=</span> <span class="s">"rustpy"</span><span class="nd">)]</span>
<span class="k">fn</span> <span class="nf">rust_ext</span><span class="p">(</span><span class="n">_py</span><span class="p">:</span> <span class="n">Python</span><span class="o">&lt;</span><span class="nv">'_</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">m</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">PyModule</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">PyResult</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">find_max</span><span class="p">(</span><span class="n">arr</span><span class="p">:</span> <span class="n">ArrayView2</span><span class="o">&lt;</span><span class="nv">'_</span><span class="p">,</span> <span class="nb">i64</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">Array1</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="k">let</span> <span class="k">mut</span> <span class="n">out</span> <span class="o">=</span> <span class="nn">Array1</span><span class="p">::</span><span class="nf">default</span><span class="p">(</span><span class="n">arr</span><span class="nf">.ncols</span><span class="p">());</span>

        <span class="k">for</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">col</span><span class="p">)</span> <span class="k">in</span> <span class="n">arr</span><span class="nf">.axis_iter</span><span class="p">(</span><span class="nf">Axis</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span><span class="nf">.enumerate</span><span class="p">()</span> <span class="p">{</span>
            <span class="k">let</span> <span class="k">mut</span> <span class="n">val</span> <span class="o">=</span> <span class="nn">i64</span><span class="p">::</span><span class="n">MIN</span><span class="p">;</span>
            <span class="k">for</span> <span class="n">x</span> <span class="k">in</span> <span class="n">col</span> <span class="p">{</span>
                <span class="k">if</span> <span class="n">val</span> <span class="o">&lt;</span> <span class="o">*</span><span class="n">x</span> <span class="p">{</span>
                    <span class="n">val</span> <span class="o">=</span> <span class="o">*</span><span class="n">x</span><span class="p">;</span>
                <span class="p">}</span>
            <span class="p">}</span>

            <span class="n">out</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="n">out</span>
    <span class="p">}</span>

    <span class="nd">#[pyfn(m)]</span>
    <span class="nd">#[pyo3(name</span> <span class="nd">=</span> <span class="s">"find_max"</span><span class="nd">)]</span>
    <span class="k">fn</span> <span class="n">find_max_py</span><span class="o">&lt;</span><span class="nv">'py</span><span class="o">&gt;</span><span class="p">(</span><span class="n">py</span><span class="p">:</span> <span class="n">Python</span><span class="o">&lt;</span><span class="nv">'py</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">PyReadonlyArray2</span><span class="o">&lt;</span><span class="nv">'_</span><span class="p">,</span> <span class="nb">i64</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="o">&amp;</span><span class="nv">'py</span> <span class="n">PyArray1</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">start</span> <span class="o">=</span> <span class="nn">SystemTime</span><span class="p">::</span><span class="nf">now</span><span class="p">();</span>
        <span class="k">let</span> <span class="n">result</span> <span class="o">=</span> <span class="nf">find_max</span><span class="p">(</span><span class="n">x</span><span class="nf">.as_array</span><span class="p">())</span><span class="nf">.into_pyarray</span><span class="p">(</span><span class="n">py</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">end</span> <span class="o">=</span> <span class="nn">SystemTime</span><span class="p">::</span><span class="nf">now</span><span class="p">();</span>
        <span class="k">let</span> <span class="n">duration</span> <span class="o">=</span> <span class="n">end</span><span class="nf">.duration_since</span><span class="p">(</span><span class="n">start</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>
        <span class="nd">println!</span><span class="p">(</span><span class="s">"rustpy took {} milliseconds"</span><span class="p">,</span> <span class="n">duration</span><span class="nf">.as_millis</span><span class="p">());</span>
        <span class="n">result</span>
    <span class="p">}</span>

    <span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Looking at the above closely, the <code class="language-plaintext highlighter-rouge">find_max_py</code> function is the bridge between Python and Rust; it ultimately dispatches to the <code class="language-plaintext highlighter-rouge">find_max</code> function. That function accepts a two-dimensional view of an array and returns a newly created one-dimensional array of 64-bit integers. Within the function body you can see the creation of the return value, as well as iteration by column. While the semantics vary, this follows the same general outline as our Cython implementation.</p>

<p>With this in place, run <code class="language-plaintext highlighter-rouge">maturin develop --release</code> from the project root. This builds the crate with optimizations and installs it as a Python package into the current environment.</p>
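
<p>As an aside, the column-wise loop inside <code class="language-plaintext highlighter-rouge">find_max</code> can be pictured without any of the ndarray machinery. The following is only a minimal sketch using the standard library: the helper name <code class="language-plaintext highlighter-rouge">column_max</code> and the flat, row-major buffer are assumptions made for illustration and are not part of the rustpy crate.</p>

```rust
/// Hypothetical helper (not part of rustpy): column-wise maximum of a
/// matrix stored row by row in a flat slice.
fn column_max(data: &[i64], nrows: usize, ncols: usize) -> Vec<i64> {
    assert_eq!(data.len(), nrows * ncols, "shape does not match buffer length");

    // One output slot per column, seeded with the smallest possible value.
    let mut out = vec![i64::MIN; ncols];

    for (i, slot) in out.iter_mut().enumerate() {
        // Walk down column `i`: in a row-major buffer its elements
        // sit `ncols` positions apart.
        for x in data.iter().skip(i).step_by(ncols) {
            if *slot < *x {
                *slot = *x;
            }
        }
    }

    out
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">column_max(&amp;[1, 5, 3, 4, 2, 6], 2, 3)</code> treats the slice as two rows of three values and reduces each column separately.</p>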

<h2 id="comparing-results">Comparing Results</h2>

<p>Both implementations above include not-very-scientific timers to give us an idea of general performance. Let’s set up with the following code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">100_000</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="mi">1_000_000</span><span class="p">))</span>
</code></pre></div></div>

<p>Let’s check our cypy performance:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">cypy</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">result1</span> <span class="o">=</span> <span class="n">cypy</span><span class="p">.</span><span class="n">find_max</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="n">cypy</span> <span class="n">took</span> <span class="mf">273.319301</span> <span class="n">milliseconds</span>
</code></pre></div></div>

<p>Versus the same function implemented in Rust:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">rustpy</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">result2</span> <span class="o">=</span> <span class="n">rustpy</span><span class="p">.</span><span class="n">find_max</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="n">rustpy</span> <span class="n">took</span> <span class="mi">116</span> <span class="n">milliseconds</span>
<span class="o">&gt;&gt;&gt;</span> <span class="p">(</span><span class="n">result1</span> <span class="o">==</span> <span class="n">result2</span><span class="p">).</span><span class="nb">all</span><span class="p">()</span>
<span class="bp">True</span>
</code></pre></div></div>

<p>The Rust implementation only took ~42% of the time - not bad!</p>
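
<p>The Rust timers above use <code class="language-plaintext highlighter-rouge">SystemTime</code>, which follows the wall clock. Because that clock can be adjusted while the program runs, <code class="language-plaintext highlighter-rouge">duration_since</code> returns a <code class="language-plaintext highlighter-rouge">Result</code> that has to be unwrapped. A minimal sketch of the same measurement with the monotonic <code class="language-plaintext highlighter-rouge">std::time::Instant</code> follows; the helper name <code class="language-plaintext highlighter-rouge">timed</code> is an assumption for illustration and is not part of rustpy.</p>

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not part of rustpy): run `f` once and return its
/// result together with the elapsed time on a monotonic clock.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = f();
    // `elapsed` cannot fail, unlike `SystemTime::duration_since`.
    (result, start.elapsed())
}
```

<p>This is still a single-shot measurement, so it remains just as unscientific as the timers in the article; it only removes the fallible clock arithmetic.</p>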

<h2 id="parallelization">Parallelization</h2>

<p>Another area where Rust extensions can really shine is in parallelization, due to the aforementioned language guarantees of thread safety. Cython offers <a href="https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html">parallelization</a> using OpenMP, but as <a href="https://github.com/pandas-dev/pandas/pull/53149">I recently discovered</a> there are quite a few downsides to that when it comes to packaging, usability and cross-platform behavior.</p>

<p>Since Rust handles this more natively, let’s see how it would tackle the above code but in a parallel way. For this purpose we are going to use the <a href="https://docs.rs/rayon/latest/rayon/">rayon</a> feature that comes bundled with the <a href="https://docs.rs/ndarray/latest/ndarray/">Rust ndarray crate</a>. To enable that, go ahead and add <code class="language-plaintext highlighter-rouge">ndarray = {version = "0.15", features=["rayon"]}</code> to your dependencies in Cargo.toml.</p>

<p>Afterwards we are going to add 2 new functions to our rustpy library - one to handle the internals and the other to serve as the bridge to Python. For starters, let us update the imports at the top of our module:</p>

<div class="language-rs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">ndarray</span><span class="p">::</span><span class="nn">parallel</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="o">*</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">numpy</span><span class="p">::</span><span class="nn">ndarray</span><span class="p">::{</span><span class="n">Array1</span><span class="p">,</span> <span class="n">ArrayView2</span><span class="p">,</span> <span class="n">Axis</span><span class="p">,</span> <span class="n">Zip</span><span class="p">};</span>
<span class="k">use</span> <span class="nn">numpy</span><span class="p">::{</span><span class="n">IntoPyArray</span><span class="p">,</span> <span class="n">PyArray1</span><span class="p">,</span> <span class="n">PyReadonlyArray2</span><span class="p">};</span>
<span class="k">use</span> <span class="nn">pyo3</span><span class="p">::{</span><span class="n">pymodule</span><span class="p">,</span> <span class="nn">types</span><span class="p">::</span><span class="n">PyModule</span><span class="p">,</span> <span class="n">PyResult</span><span class="p">,</span> <span class="n">Python</span><span class="p">};</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::{</span><span class="nb">Arc</span><span class="p">,</span> <span class="n">Mutex</span><span class="p">};</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">time</span><span class="p">::</span><span class="n">SystemTime</span><span class="p">;</span>
</code></pre></div></div>

<p>Then go ahead and add the following code below the <code class="language-plaintext highlighter-rouge">find_max_py</code> function.</p>

<div class="language-rs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">find_max_parallel</span><span class="p">(</span><span class="n">arr</span><span class="p">:</span> <span class="n">ArrayView2</span><span class="o">&lt;</span><span class="nv">'_</span><span class="p">,</span> <span class="nb">i64</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">Array1</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">mutex</span> <span class="o">=</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">Mutex</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">Array1</span><span class="p">::</span><span class="nf">default</span><span class="p">(</span><span class="n">arr</span><span class="nf">.ncols</span><span class="p">())));</span>

    <span class="c1">// parallel iterator is not implemented, so some hacks</span>
    <span class="c1">// https://github.com/rust-ndarray/ndarray/issues/1043</span>
    <span class="c1">// https://github.com/rust-ndarray/ndarray/issues/1093</span>
    <span class="nn">Zip</span><span class="p">::</span><span class="nf">indexed</span><span class="p">(</span><span class="n">arr</span><span class="nf">.axis_iter</span><span class="p">(</span><span class="nf">Axis</span><span class="p">(</span><span class="mi">1</span><span class="p">)))</span>
        <span class="nf">.into_par_iter</span><span class="p">()</span>
        <span class="nf">.for_each</span><span class="p">(|(</span><span class="n">i</span><span class="p">,</span> <span class="n">col</span><span class="p">)|</span> <span class="p">{</span>
            <span class="k">let</span> <span class="k">mut</span> <span class="n">val</span> <span class="o">=</span> <span class="nn">i64</span><span class="p">::</span><span class="n">MIN</span><span class="p">;</span>
            <span class="k">for</span> <span class="n">x</span> <span class="k">in</span> <span class="n">col</span> <span class="p">{</span>
                <span class="k">if</span> <span class="n">val</span> <span class="o">&lt;</span> <span class="o">*</span><span class="n">x</span> <span class="p">{</span>
                    <span class="n">val</span> <span class="o">=</span> <span class="o">*</span><span class="n">x</span><span class="p">;</span>
                <span class="p">}</span>
            <span class="p">}</span>

            <span class="k">let</span> <span class="k">mut</span> <span class="n">guard</span> <span class="o">=</span> <span class="n">mutex</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
            <span class="n">guard</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span>
        <span class="p">});</span>

    <span class="c1">// https://stackoverflow.com/questions/29177449/how-to-take-ownership-of-t-from-arcmutext</span>
    <span class="k">let</span> <span class="n">lock</span> <span class="o">=</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">try_unwrap</span><span class="p">(</span><span class="n">mutex</span><span class="p">)</span><span class="nf">.expect</span><span class="p">(</span><span class="s">"Lock still has multiple owners"</span><span class="p">);</span>
    <span class="n">lock</span><span class="nf">.into_inner</span><span class="p">()</span><span class="nf">.expect</span><span class="p">(</span><span class="s">"Mutex cannot be locked"</span><span class="p">)</span>
<span class="p">}</span>

<span class="c1">// wrapper of `find_max_parallel`</span>
<span class="nd">#[pyfn(m)]</span>
<span class="nd">#[pyo3(name</span> <span class="nd">=</span> <span class="s">"find_max_parallel"</span><span class="nd">)]</span>
<span class="k">fn</span> <span class="n">find_max_py_parallel</span><span class="o">&lt;</span><span class="nv">'py</span><span class="o">&gt;</span><span class="p">(</span>
    <span class="n">py</span><span class="p">:</span> <span class="n">Python</span><span class="o">&lt;</span><span class="nv">'py</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">x</span><span class="p">:</span> <span class="n">PyReadonlyArray2</span><span class="o">&lt;</span><span class="nv">'_</span><span class="p">,</span> <span class="nb">i64</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">)</span> <span class="k">-&gt;</span> <span class="o">&amp;</span><span class="nv">'py</span> <span class="n">PyArray1</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">start</span> <span class="o">=</span> <span class="nn">SystemTime</span><span class="p">::</span><span class="nf">now</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">result</span> <span class="o">=</span> <span class="nf">find_max_parallel</span><span class="p">(</span><span class="n">x</span><span class="nf">.as_array</span><span class="p">())</span><span class="nf">.into_pyarray</span><span class="p">(</span><span class="n">py</span><span class="p">);</span>
    <span class="k">let</span> <span class="n">end</span> <span class="o">=</span> <span class="nn">SystemTime</span><span class="p">::</span><span class="nf">now</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">duration</span> <span class="o">=</span> <span class="n">end</span><span class="nf">.duration_since</span><span class="p">(</span><span class="n">start</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>
    <span class="nd">println!</span><span class="p">(</span><span class="s">"rustpy parallel took {} milliseconds"</span><span class="p">,</span> <span class="n">duration</span><span class="nf">.as_millis</span><span class="p">());</span>
    <span class="n">result</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Within the comments I’ve linked some GitHub issues and a StackOverflow question that you may find of interest. At a high level, now that we want to execute things in parallel we need a <a href="https://doc.rust-lang.org/std/sync/struct.Mutex.html">Mutex</a> to serialize writes to the shared output array and prevent data races. We also wrap it in a thread-safe reference counter, <a href="https://doc.rust-lang.org/std/sync/struct.Arc.html">Arc</a>; using these in tandem is a common pattern in Rust.</p>
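
<p>For readers who have not seen the <code class="language-plaintext highlighter-rouge">Arc</code> plus <code class="language-plaintext highlighter-rouge">Mutex</code> pattern before, here is a minimal sketch of it using only standard library threads. The function name <code class="language-plaintext highlighter-rouge">column_max_locked</code> and the choice of one OS thread per column are assumptions made to keep the example small; this is not how rustpy is implemented.</p>

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical example (not part of rustpy): each thread computes the max
/// of one column and stores it in a shared, mutex-protected output vector.
fn column_max_locked(columns: Vec<Vec<i64>>) -> Vec<i64> {
    let out = Arc::new(Mutex::new(vec![i64::MIN; columns.len()]));

    let handles: Vec<_> = columns
        .into_iter()
        .enumerate()
        .map(|(i, col)| {
            // Every thread owns its own handle to the same underlying Mutex.
            let out = Arc::clone(&out);
            thread::spawn(move || {
                let val = col.into_iter().max().unwrap_or(i64::MIN);
                // The lock is held only for the single write.
                let mut guard = out.lock().unwrap();
                guard[i] = val;
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    // The threads have finished and dropped their handles, so the Arc has a
    // single owner again and the inner vector can be taken back out.
    Arc::try_unwrap(out)
        .expect("Arc still has multiple owners")
        .into_inner()
        .expect("Mutex was poisoned")
}
```

<p>The shape is the same as in <code class="language-plaintext highlighter-rouge">find_max_parallel</code>: share the output behind a lock, write one slot per unit of work, then unwrap the container once all workers are done.</p>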

<p>So how does this compare performance-wise to our examples above?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">rustpy</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">result3</span> <span class="o">=</span> <span class="n">rustpy</span><span class="p">.</span><span class="n">find_max_parallel</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="n">rustpy</span> <span class="n">parallel</span> <span class="n">took</span> <span class="mi">234</span> <span class="n">milliseconds</span>
<span class="o">&gt;&gt;&gt;</span> <span class="p">(</span><span class="n">result2</span> <span class="o">==</span> <span class="n">result3</span><span class="p">).</span><span class="nb">all</span><span class="p">()</span>
<span class="bp">True</span>
</code></pre></div></div>

<p>We get the same results, which is great, but compared to the non-parallel implementation we are now slower - roughly twice as slow. What gives?!?</p>

<p>Without peering into every detail, there is “no such thing as a free lunch”. Using the mutex to synchronize parallel code above is no exception, and the cost of that synchronization likely far exceeds the benefit of running in parallel. Keep in mind that we are dealing with an array of 100 x 1_000_000 and handing rayon one task per column. Rayon runs those tasks on a fixed pool of worker threads rather than spawning a thread per column, but that is still 1_000_000 tiny tasks, each scanning a column of only 100 records before contending for the same lock!</p>

<p>What happens if we transpose the array?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">arr2</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">T</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">arr2</span><span class="p">.</span><span class="n">shape</span>
<span class="p">(</span><span class="mi">1000000</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">rustpy</span><span class="p">.</span><span class="n">find_max</span><span class="p">(</span><span class="n">arr2</span><span class="p">)</span>
<span class="n">rustpy</span> <span class="n">took</span> <span class="mi">67</span> <span class="n">milliseconds</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">rustpy</span><span class="p">.</span><span class="n">find_max_parallel</span><span class="p">(</span><span class="n">arr2</span><span class="p">)</span>
<span class="n">rustpy</span> <span class="n">parallel</span> <span class="n">took</span> <span class="mi">38</span> <span class="n">milliseconds</span>
</code></pre></div></div>

<p>That’s more like it! Whereas before we split the work into 1_000_000 tasks that each operated on 100 records, now we have 100 tasks that each operate on 1_000_000 records. The relative cost of scheduling that work and synchronizing access via the mutex is in this case far lower than the performance gain we get from allowing threads to operate on large arrays in parallel. The serial version likely benefits as well, since each column of the transposed array is contiguous in memory.</p>

<h2 id="even-faster-parallelization">Even Faster Parallelization</h2>

<p><a href="https://github.com/Dr-Irv">Irv Lustig</a> had an idea that we could do away with the mutex, which would remove the overhead of synchronizing access to the <code class="language-plaintext highlighter-rouge">out</code> variable. Internally the array manages its data in a contiguous block of memory, and an indexing expression like <code class="language-plaintext highlighter-rouge">out[i]</code> just points to a location in memory that is <code class="language-plaintext highlighter-rouge">i</code> elements away from the start of that block. Because each unit of parallel work has its own value of <code class="language-plaintext highlighter-rouge">i</code>, each one also writes to a unique memory location without any overlap. As long as that holds, the synchronization is unnecessary.</p>
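
<p>The underlying idea, that writes to non-overlapping slots need no lock, can be sketched in safe Rust when the output is handed out in contiguous, disjoint pieces. The example below is an illustration only: the function name <code class="language-plaintext highlighter-rouge">column_max_chunked</code>, the <code class="language-plaintext highlighter-rouge">n_threads</code> parameter and the use of <code class="language-plaintext highlighter-rouge">std::thread::scope</code> are assumptions, and it is not the approach taken in the rest of this article.</p>

```rust
use std::thread;

/// Hypothetical example (not part of rustpy): store the max of `columns[i]`
/// in `out[i]`, giving each thread its own disjoint chunk of the output so
/// that no lock is required.
fn column_max_chunked(columns: &[Vec<i64>], n_threads: usize) -> Vec<i64> {
    let mut out = vec![i64::MIN; columns.len()];

    // Ceiling division, clamped so the chunk size is never zero.
    let n = n_threads.max(1);
    let chunk = ((columns.len() + n - 1) / n).max(1);

    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping mutable sub-slices, which is
        // what lets the borrow checker accept unsynchronized writes.
        for (out_chunk, col_chunk) in out.chunks_mut(chunk).zip(columns.chunks(chunk)) {
            s.spawn(move || {
                for (slot, col) in out_chunk.iter_mut().zip(col_chunk) {
                    *slot = col.iter().copied().max().unwrap_or(i64::MIN);
                }
            });
        }
    });

    out
}
```

<p>Here the compiler can see that the pieces never overlap. Writing to an arbitrary index <code class="language-plaintext highlighter-rouge">out[i]</code> from inside a parallel closure gives it no such proof, which is where the trouble starts.</p>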

<p>Rust by default is skeptical of this, so we have to jump through a few hoops to make it work. The first step is to get rid of the Mutex. However, Rust will reject the following code:</p>

<div class="language-rs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">out</span> <span class="o">=</span> <span class="nn">Array1</span><span class="p">::</span><span class="nf">default</span><span class="p">(</span><span class="n">arr</span><span class="nf">.ncols</span><span class="p">());</span>

<span class="nn">Zip</span><span class="p">::</span><span class="nf">indexed</span><span class="p">(</span><span class="n">arr</span><span class="nf">.axis_iter</span><span class="p">(</span><span class="nf">Axis</span><span class="p">(</span><span class="mi">1</span><span class="p">)))</span>
    <span class="nf">.into_par_iter</span><span class="p">()</span>
    <span class="nf">.for_each</span><span class="p">(|(</span><span class="n">i</span><span class="p">,</span> <span class="n">col</span><span class="p">)|</span> <span class="p">{</span>
        <span class="k">let</span> <span class="k">mut</span> <span class="n">val</span> <span class="o">=</span> <span class="nn">i64</span><span class="p">::</span><span class="n">MIN</span><span class="p">;</span>
        <span class="k">for</span> <span class="n">x</span> <span class="k">in</span> <span class="n">col</span> <span class="p">{</span>
            <span class="k">if</span> <span class="n">val</span> <span class="o">&lt;</span> <span class="o">*</span><span class="n">x</span> <span class="p">{</span>
                <span class="n">val</span> <span class="o">=</span> <span class="o">*</span><span class="n">x</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>

        <span class="n">out</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span>
    <span class="p">});</span>
<span class="n">out</span>
</code></pre></div></div>

<p>It fails with the following error:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>error[E0596]: cannot borrow <span class="sb">`</span>out<span class="sb">`</span> as mutable, as it is a captured variable <span class="k">in </span>a <span class="sb">`</span>Fn<span class="sb">`</span> closure
</code></pre></div></div>
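
<p>The same class of error can be reproduced without rayon or threads. In the sketch below, <code class="language-plaintext highlighter-rouge">for_each_index</code> is a hypothetical stand-in for any API that, like rayon’s <code class="language-plaintext highlighter-rouge">for_each</code>, only accepts <code class="language-plaintext highlighter-rouge">Fn</code> closures; <code class="language-plaintext highlighter-rouge">squares</code> is likewise made up for illustration.</p>

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for an API that only accepts `Fn` closures.
fn for_each_index<F: Fn(usize)>(n: usize, f: F) {
    for i in 0..n {
        f(i);
    }
}

fn squares(n: usize) -> Vec<usize> {
    // Declaring `let mut out = vec![0; n]` and writing `out[i] = i * i`
    // inside the closure is rejected with E0596: mutating a captured
    // variable makes the closure `FnMut`, not `Fn`.
    let out = RefCell::new(vec![0; n]);

    // Interior mutability moves the mutation behind a shared reference,
    // so the closure remains `Fn`.
    for_each_index(n, |i| {
        let mut guard = out.borrow_mut();
        guard[i] = i * i;
    });

    out.into_inner()
}
```

<p><code class="language-plaintext highlighter-rouge">RefCell</code> only works on a single thread, so it does not solve the parallel case; it simply shows that the fix has to come from some form of interior mutability.</p>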

<p>As explained in <a href="https://users.rust-lang.org/t/cannot-borrow-write-as-mutable-as-it-is-a-captured-variable-in-a-fn-closure/78560">this link</a>, the parallel <code class="language-plaintext highlighter-rouge">for_each</code> requires an <code class="language-plaintext highlighter-rouge">Fn</code> closure, and such a closure cannot mutate a variable (here <code class="language-plaintext highlighter-rouge">out</code>) captured from outside of its scope. To make this possible we use the <a href="https://doc.rust-lang.org/std/cell/struct.UnsafeCell.html">UnsafeCell</a> primitive. Our first attempt to do so could look something like this:</p>

<div class="language-rs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">out</span> <span class="o">=</span> <span class="nn">Array1</span><span class="p">::</span><span class="nf">default</span><span class="p">(</span><span class="n">arr</span><span class="nf">.ncols</span><span class="p">());</span>
<span class="k">let</span> <span class="n">uout</span> <span class="o">=</span> <span class="nn">UnsafeCell</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">out</span><span class="p">);</span>

<span class="o">...</span>
<span class="c1">// Let's assume we are within the closure</span>
   <span class="p">(</span><span class="o">*</span><span class="n">uout</span><span class="nf">.get</span><span class="p">())[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span>
<span class="p">});</span>

<span class="n">out</span>
</code></pre></div></div>

<p>Alas, things aren’t so simple. This in turn yields another error:</p>

<div class="language-rs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">error</span><span class="p">[</span><span class="n">E0277</span><span class="p">]:</span> <span class="err">`</span><span class="n">UnsafeCell</span><span class="o">&lt;&amp;</span><span class="k">mut</span> <span class="n">ArrayBase</span><span class="o">&lt;</span><span class="n">OwnedRepr</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&lt;</span><span class="p">[</span><span class="nb">usize</span><span class="p">;</span> <span class="mi">1</span><span class="p">]</span><span class="o">&gt;&gt;&gt;</span><span class="err">`</span> <span class="n">cannot</span> <span class="n">be</span> <span class="n">shared</span> <span class="n">between</span> <span class="n">threads</span> <span class="n">safely</span>

<span class="o">...</span>

 <span class="o">=</span> <span class="n">help</span><span class="p">:</span> <span class="n">within</span> <span class="err">`</span><span class="p">[</span><span class="n">closure</span><span class="o">@</span><span class="n">src</span><span class="o">/</span><span class="n">lib</span><span class="py">.rs</span><span class="p">:</span><span class="mi">56</span><span class="p">:</span><span class="mi">23</span><span class="p">:</span> <span class="mi">56</span><span class="p">:</span><span class="mi">33</span><span class="p">]</span><span class="err">`</span><span class="p">,</span> <span class="n">the</span> <span class="k">trait</span> <span class="err">`</span><span class="nb">Sync</span><span class="err">`</span> <span class="n">is</span> <span class="n">not</span> <span class="n">implemented</span> <span class="k">for</span> <span class="err">`</span><span class="n">UnsafeCell</span><span class="o">&lt;&amp;</span><span class="k">mut</span> <span class="n">ArrayBase</span><span class="o">&lt;</span><span class="n">OwnedRepr</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&lt;</span><span class="p">[</span><span class="nb">usize</span><span class="p">;</span> <span class="mi">1</span><span class="p">]</span><span class="o">&gt;&gt;&gt;</span><span class="err">`</span>
</code></pre></div></div>

<p>The note that the trait <code class="language-plaintext highlighter-rouge">Sync is not implemented...</code> means Rust will not let us share that object across threads, because its type does not implement the <code class="language-plaintext highlighter-rouge">Sync</code> trait. Some research will take us to the <a href="https://doc.rust-lang.org/std/cell/struct.SyncUnsafeCell.html">SyncUnsafeCell</a>. This object implements the <code class="language-plaintext highlighter-rouge">Sync</code> trait, but as of writing is only available in nightly builds. While it is something to track, it does not help us today.</p>
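
<p>One way to get a feel for which types the compiler considers shareable is a small compile-time probe. This is only a sketch; <code class="language-plaintext highlighter-rouge">assert_sync</code> and <code class="language-plaintext highlighter-rouge">check_sync_bounds</code> are hypothetical helper names, not functions from the standard library or from rustpy.</p>

```rust
use std::sync::atomic::AtomicI64;
use std::sync::Mutex;

/// Hypothetical helper: a call to this only compiles when `T` is `Sync`.
fn assert_sync<T: Sync>() {}

fn check_sync_bounds() {
    // Both types guard their interior mutability, so they are `Sync`.
    assert_sync::<Mutex<Vec<i64>>>();
    assert_sync::<AtomicI64>();

    // Uncommenting the next line fails with error[E0277], because
    // `UnsafeCell` offers interior mutability without any synchronization:
    // assert_sync::<std::cell::UnsafeCell<Vec<i64>>>();
}
```

<p>Nothing happens at run time here; the value of the probe is that the commented-out line refuses to compile for the same reason as the error above.</p>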

<p>To work around this, user <a href="https://stackoverflow.com/users/1704411/alice-ryhl">Alice Ryhl</a> over at StackOverflow came up with <a href="https://stackoverflow.com/a/65182786/621736">this nifty solution</a>. Alice’s code works generically for slices; the implementation we have specializes only to <code class="language-plaintext highlighter-rouge">Array1&lt;i64&gt;</code> types, but keeps the same structure in place.</p>

<p>At a high level, instead of using the <code class="language-plaintext highlighter-rouge">UnsafeCell</code> directly, we create our own structure that holds the <code class="language-plaintext highlighter-rouge">UnsafeCell</code> as a field. The custom structure provides empty <code class="language-plaintext highlighter-rouge">unsafe</code> trait implementations for <code class="language-plaintext highlighter-rouge">Send</code> and <code class="language-plaintext highlighter-rouge">Sync</code>, which is our promise to the compiler that sharing it across threads is sound. With that in place, we can call the <code class="language-plaintext highlighter-rouge">write</code> member function from within our threads.</p>

<div class="language-rs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// https://stackoverflow.com/questions/65178245/how-do-i-write-to-a-mutable-slice-from-multiple-threads-at-arbitrary-indexes-wit</span>
<span class="nd">#[derive(Copy,</span> <span class="nd">Clone)]</span>
<span class="k">struct</span> <span class="n">UnsafeArray1</span><span class="o">&lt;</span><span class="nv">'a</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="n">array</span><span class="p">:</span> <span class="o">&amp;</span><span class="nv">'a</span> <span class="n">UnsafeCell</span><span class="o">&lt;</span><span class="n">Array1</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;&gt;</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">unsafe</span> <span class="k">impl</span><span class="o">&lt;</span><span class="nv">'a</span><span class="o">&gt;</span> <span class="nb">Send</span> <span class="k">for</span> <span class="n">UnsafeArray1</span><span class="o">&lt;</span><span class="nv">'a</span><span class="o">&gt;</span> <span class="p">{}</span>
<span class="k">unsafe</span> <span class="k">impl</span><span class="o">&lt;</span><span class="nv">'a</span><span class="o">&gt;</span> <span class="nb">Sync</span> <span class="k">for</span> <span class="n">UnsafeArray1</span><span class="o">&lt;</span><span class="nv">'a</span><span class="o">&gt;</span> <span class="p">{}</span>

<span class="k">impl</span><span class="o">&lt;</span><span class="nv">'a</span><span class="o">&gt;</span> <span class="n">UnsafeArray1</span><span class="o">&lt;</span><span class="nv">'a</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="k">fn</span> <span class="nf">new</span><span class="p">(</span><span class="n">array</span><span class="p">:</span> <span class="o">&amp;</span><span class="nv">'a</span> <span class="k">mut</span> <span class="n">Array1</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="k">Self</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">ptr</span> <span class="o">=</span> <span class="n">array</span> <span class="k">as</span> <span class="o">*</span><span class="k">mut</span> <span class="n">Array1</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;</span> <span class="k">as</span> <span class="o">*</span><span class="k">const</span> <span class="n">UnsafeCell</span><span class="o">&lt;</span><span class="n">Array1</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;&gt;</span><span class="p">;</span>
        <span class="k">Self</span> <span class="p">{</span>
            <span class="n">array</span><span class="p">:</span> <span class="k">unsafe</span> <span class="p">{</span> <span class="o">&amp;*</span><span class="n">ptr</span> <span class="p">},</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="cd">/// SAFETY: It is UB if two threads write to the same index without</span>
    <span class="cd">/// synchronization.</span>
    <span class="k">pub</span> <span class="k">unsafe</span> <span class="k">fn</span> <span class="nf">write</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">i</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="n">value</span><span class="p">:</span> <span class="nb">i64</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">ptr</span> <span class="o">=</span> <span class="k">self</span><span class="py">.array</span><span class="nf">.get</span><span class="p">();</span>
        <span class="p">(</span><span class="o">*</span><span class="n">ptr</span><span class="p">)[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">find_max_unsafe</span><span class="p">(</span><span class="n">arr</span><span class="p">:</span> <span class="n">ArrayView2</span><span class="o">&lt;</span><span class="nv">'_</span><span class="p">,</span> <span class="nb">i64</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">Array1</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">out</span> <span class="o">=</span> <span class="nn">Array1</span><span class="p">::</span><span class="nf">default</span><span class="p">(</span><span class="n">arr</span><span class="nf">.ncols</span><span class="p">());</span>
    <span class="k">let</span> <span class="n">uout</span> <span class="o">=</span> <span class="nn">UnsafeArray1</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">out</span><span class="p">);</span>

    <span class="nn">Zip</span><span class="p">::</span><span class="nf">indexed</span><span class="p">(</span><span class="n">arr</span><span class="nf">.axis_iter</span><span class="p">(</span><span class="nf">Axis</span><span class="p">(</span><span class="mi">1</span><span class="p">)))</span>
        <span class="nf">.into_par_iter</span><span class="p">()</span>
        <span class="nf">.for_each</span><span class="p">(|(</span><span class="n">i</span><span class="p">,</span> <span class="n">col</span><span class="p">)|</span> <span class="p">{</span>
            <span class="k">let</span> <span class="k">mut</span> <span class="n">val</span> <span class="o">=</span> <span class="nn">i64</span><span class="p">::</span><span class="n">MIN</span><span class="p">;</span>
            <span class="k">for</span> <span class="n">x</span> <span class="k">in</span> <span class="n">col</span> <span class="p">{</span>
                <span class="k">if</span> <span class="n">val</span> <span class="o">&lt;</span> <span class="o">*</span><span class="n">x</span> <span class="p">{</span>
                    <span class="n">val</span> <span class="o">=</span> <span class="o">*</span><span class="n">x</span><span class="p">;</span>
                <span class="p">}</span>
            <span class="p">}</span>

            <span class="k">unsafe</span> <span class="p">{</span> <span class="n">uout</span><span class="nf">.write</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="p">};</span>
        <span class="p">});</span>

    <span class="n">out</span>
<span class="p">}</span>

<span class="nd">#[pyfn(m)]</span>
<span class="nd">#[pyo3(name</span> <span class="nd">=</span> <span class="s">"find_max_unsafe"</span><span class="nd">)]</span>
<span class="k">fn</span> <span class="n">find_max_py_unsafe</span><span class="o">&lt;</span><span class="nv">'py</span><span class="o">&gt;</span><span class="p">(</span><span class="n">py</span><span class="p">:</span> <span class="n">Python</span><span class="o">&lt;</span><span class="nv">'py</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">PyReadonlyArray2</span><span class="o">&lt;</span><span class="nv">'_</span><span class="p">,</span> <span class="nb">i64</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="o">&amp;</span><span class="nv">'py</span> <span class="n">PyArray1</span><span class="o">&lt;</span><span class="nb">i64</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">start</span> <span class="o">=</span> <span class="nn">SystemTime</span><span class="p">::</span><span class="nf">now</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">result</span> <span class="o">=</span> <span class="nf">find_max_unsafe</span><span class="p">(</span><span class="n">x</span><span class="nf">.as_array</span><span class="p">())</span><span class="nf">.into_pyarray</span><span class="p">(</span><span class="n">py</span><span class="p">);</span>
    <span class="k">let</span> <span class="n">end</span> <span class="o">=</span> <span class="nn">SystemTime</span><span class="p">::</span><span class="nf">now</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">duration</span> <span class="o">=</span> <span class="n">end</span><span class="nf">.duration_since</span><span class="p">(</span><span class="n">start</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>
    <span class="nd">println!</span><span class="p">(</span><span class="s">"rustpy unsafe took {} milliseconds"</span><span class="p">,</span> <span class="n">duration</span><span class="nf">.as_millis</span><span class="p">());</span>
    <span class="n">result</span>
<span class="p">}</span>
</code></pre></div></div>
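
<p>For readers who want to try the pattern without pulling in <code class="language-plaintext highlighter-rouge">ndarray</code> or <code class="language-plaintext highlighter-rouge">rayon</code>, here is a minimal standard-library sketch of the slice-based form of the same idea. It is an illustration under assumptions, not the code that was benchmarked in this post: <code class="language-plaintext highlighter-rouge">UnsafeSlice</code> follows the shape of the StackOverflow answer linked above, while <code class="language-plaintext highlighter-rouge">column_max</code>, the row-major <code class="language-plaintext highlighter-rouge">&amp;[i64]</code> layout and the choice of one scoped thread per column are invented for the example.</p>

<div class="language-rs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::cell::UnsafeCell;
use std::thread;

// Shared view over a mutable slice that allows writes from several threads.
#[derive(Copy, Clone)]
struct UnsafeSlice&lt;'a, T&gt; {
    slice: &amp;'a [UnsafeCell&lt;T&gt;],
}

// SAFETY: sound only while callers of `write` keep their indexes disjoint.
unsafe impl&lt;'a, T: Send + Sync&gt; Send for UnsafeSlice&lt;'a, T&gt; {}
unsafe impl&lt;'a, T: Send + Sync&gt; Sync for UnsafeSlice&lt;'a, T&gt; {}

impl&lt;'a, T&gt; UnsafeSlice&lt;'a, T&gt; {
    fn new(slice: &amp;'a mut [T]) -&gt; Self {
        // UnsafeCell&lt;T&gt; has the same memory layout as T.
        let ptr = slice as *mut [T] as *const [UnsafeCell&lt;T&gt;];
        Self { slice: unsafe { &amp;*ptr } }
    }

    /// SAFETY: it is UB if two threads write to the same index without
    /// synchronization.
    unsafe fn write(&amp;self, i: usize, value: T) {
        unsafe { *self.slice[i].get() = value };
    }
}

// Hypothetical helper: column-wise maximum of a row-major matrix,
// computed with one scoped thread per column.
fn column_max(data: &amp;[i64], nrows: usize, ncols: usize) -&gt; Vec&lt;i64&gt; {
    let mut out = vec![i64::MIN; ncols];
    let uout = UnsafeSlice::new(&amp;mut out);

    thread::scope(|s| {
        for col in 0..ncols {
            s.spawn(move || {
                let max = (0..nrows)
                    .map(|row| data[row * ncols + col])
                    .max()
                    .unwrap_or(i64::MIN);
                // SAFETY: every thread writes only to its own index `col`.
                unsafe { uout.write(col, max) };
            });
        }
    });

    out
}

fn main() {
    // 2 rows x 3 columns, stored row by row.
    let data = [1, 9, 3, 7, 2, 8];
    println!("{:?}", column_max(&amp;data, 2, 3)); // prints [7, 9, 8]
}
</code></pre></div></div>

<p>Spawning a thread per column is only reasonable for a toy input like this one; a thread pool such as the one rayon provides is what keeps the approach practical for wide arrays.</p>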

<h2 id="turning-off-bounds-checking">Turning off bounds checking</h2>

<p>Since we are running <code class="language-plaintext highlighter-rouge">unsafe</code> code blocks, we also have the ability to disable bounds checking on our arrays. In Cython you would typically do this with the <code class="language-plaintext highlighter-rouge">@cython.boundscheck(False)</code> decorator. With the <a href="https://docs.rs/ndarray/latest/ndarray/">ndarray Rust crate</a> you would replace the index operator <code class="language-plaintext highlighter-rouge">[]</code> with <a href="https://docs.rs/ndarray/latest/ndarray/struct.ArrayBase.html#method.uget">uget</a> or <a href="https://docs.rs/ndarray/latest/ndarray/struct.ArrayBase.html#method.uget_mut">uget_mut</a>. For us, this means changing the <code class="language-plaintext highlighter-rouge">write</code> implementation of our <code class="language-plaintext highlighter-rouge">UnsafeArray1</code> struct to:</p>

<div class="language-rs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">unsafe</span> <span class="k">fn</span> <span class="nf">write</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">i</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="n">value</span><span class="p">:</span> <span class="nb">i64</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">ptr</span> <span class="o">=</span> <span class="k">self</span><span class="py">.array</span><span class="nf">.get</span><span class="p">();</span>
    <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">ptr</span><span class="p">)</span><span class="nf">.uget_mut</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
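
<p>The same trade-off exists on plain Rust slices, which may be an easier place to experiment. The sketch below contrasts the checked index operator with the standard library’s <code class="language-plaintext highlighter-rouge">get_unchecked_mut</code>. The helper names <code class="language-plaintext highlighter-rouge">write_checked</code> and <code class="language-plaintext highlighter-rouge">write_unchecked</code> are hypothetical and exist only for this illustration.</p>

<div class="language-rs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Checked write: panics if `i` is out of bounds.
fn write_checked(buf: &amp;mut [i64], i: usize, value: i64) {
    buf[i] = value;
}

/// SAFETY: the caller must guarantee that `i &lt; buf.len()`; an out-of-bounds
/// index is undefined behavior rather than a panic.
unsafe fn write_unchecked(buf: &amp;mut [i64], i: usize, value: i64) {
    unsafe { *buf.get_unchecked_mut(i) = value };
}

fn main() {
    let mut buf = vec![0_i64; 4];
    write_checked(&amp;mut buf, 1, 10);
    // SAFETY: 3 is smaller than buf.len(), which is 4.
    unsafe { write_unchecked(&amp;mut buf, 3, 30) };
    println!("{:?}", buf); // prints [0, 10, 0, 30]
}
</code></pre></div></div>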

<p>So how does this compare performance-wise?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">res1</span> <span class="o">=</span> <span class="n">cypy</span><span class="p">.</span><span class="n">find_max</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="n">cypy</span> <span class="n">took</span> <span class="mf">284.153331</span> <span class="n">milliseconds</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">res2</span> <span class="o">=</span> <span class="n">rustpy</span><span class="p">.</span><span class="n">find_max</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="n">rustpy</span> <span class="n">took</span> <span class="mi">113</span> <span class="n">milliseconds</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">res3</span> <span class="o">=</span> <span class="n">rustpy</span><span class="p">.</span><span class="n">find_max_parallel</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="n">rustpy</span> <span class="n">parallel</span> <span class="n">took</span> <span class="mi">223</span> <span class="n">milliseconds</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">res4</span> <span class="o">=</span> <span class="n">rustpy</span><span class="p">.</span><span class="n">find_max_unsafe</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="n">rustpy</span> <span class="n">unsafe</span> <span class="n">took</span> <span class="mi">47</span> <span class="n">milliseconds</span>
<span class="o">&gt;&gt;&gt;</span> <span class="p">((</span><span class="n">res1</span> <span class="o">==</span> <span class="n">res2</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">res1</span> <span class="o">==</span> <span class="n">res3</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">res1</span> <span class="o">==</span> <span class="n">res4</span><span class="p">)).</span><span class="nb">all</span><span class="p">()</span>
<span class="bp">True</span>
</code></pre></div></div>

<p>Compared to our initial Cython implementation, our unsafe threaded implementation takes about 16.5% of the runtime (47 milliseconds versus roughly 284). Not bad.</p>

<p>The benchmarks above were recorded on a Lemur Pro laptop with a 12th Gen Intel(R) Core(TM) i7-1255U processor and 12 logical cores. Results will vary depending on your hardware and OS. If you want more control over the degree of parallelization than you get out of the box, note that this all dispatches to <a href="https://docs.rs/rayon/latest/rayon/">rayon</a> under the hood. Rayon uses <a href="https://github.com/rayon-rs/rayon/blob/master/FAQ.md#how-many-threads-will-rayon-spawn">one thread per CPU</a> by default. You could accept an argument into your extension function that limits the number of threads being spawned at one time, or alternatively you can set the <code class="language-plaintext highlighter-rouge">RAYON_NUM_THREADS</code> environment variable.</p>

<p>On my machine, if I run <code class="language-plaintext highlighter-rouge">RAYON_NUM_THREADS=2 python</code> and within the interpreter execute <code class="language-plaintext highlighter-rouge">rustpy.find_max_parallel(arr)</code>, I get the response that <code class="language-plaintext highlighter-rouge">rustpy parallel took 71 milliseconds</code>. This is an improvement over the default parallel implementation, which as we noted in the previous section introduced a lot of overhead with thread synchronization when arrays had a large number of columns and a relatively small number of rows.</p>
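
<p>To make the idea of accepting a thread limit as an argument concrete, here is a rough standard-library sketch. It is a stand-in rather than how rayon manages its pool, and it is not the code benchmarked above: the function name <code class="language-plaintext highlighter-rouge">column_max_limited</code>, the <code class="language-plaintext highlighter-rouge">max_threads</code> parameter and the row-major <code class="language-plaintext highlighter-rouge">&amp;[i64]</code> layout are all assumptions made for the example. Each worker receives its own contiguous chunk of the output, so this variant needs no <code class="language-plaintext highlighter-rouge">unsafe</code> at all.</p>

<div class="language-rs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::thread;

// Hypothetical helper: column-wise maximum of a row-major matrix using at
// most `max_threads` workers.
fn column_max_limited(data: &amp;[i64], nrows: usize, ncols: usize, max_threads: usize) -&gt; Vec&lt;i64&gt; {
    let mut out = vec![i64::MIN; ncols];
    if ncols == 0 {
        return out;
    }
    // Round up so that the number of chunks never exceeds the thread limit.
    let workers = max_threads.max(1);
    let chunk_len = (ncols + workers - 1) / workers;

    thread::scope(|s| {
        // Each worker owns one contiguous chunk of the output columns.
        for (chunk_idx, chunk) in out.chunks_mut(chunk_len).enumerate() {
            s.spawn(move || {
                for (offset, slot) in chunk.iter_mut().enumerate() {
                    let col = chunk_idx * chunk_len + offset;
                    for row in 0..nrows {
                        *slot = (*slot).max(data[row * ncols + col]);
                    }
                }
            });
        }
    });

    out
}

fn main() {
    // 2 rows x 3 columns, stored row by row, processed by at most 2 threads.
    let data = [1, 9, 3, 7, 2, 8];
    println!("{:?}", column_max_limited(&amp;data, 2, 3, 2)); // prints [7, 9, 8]
}
</code></pre></div></div>

<p>Inside a rayon-based extension the equivalent knob would be a dedicated thread pool built with <code class="language-plaintext highlighter-rouge">rayon::ThreadPoolBuilder</code>, which lets the limit be chosen per call instead of once per process through the environment variable.</p>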

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>From my initial trials I was very surprised by how good Rust was for building extensions. The language itself is pretty natural in a way that I think could be useful to higher-level programmers, while offering great performance at the same time. Not pictured in the above analysis are the many mistakes I made while trying to get code parallelized in Rust. In C/C++ I likely would have made a very buggy program; the Rust compiler prevented me from doing so here. In all, I think Rust can creep into the same realm that Cython occupies today and become a serious competitor for easy extension authoring.</p>

<p>I also want to mention <a href="https://github.com/Dr-Irv">Irv Lustig</a>, <a href="https://github.com/jbrockmendel">Brock Mendel</a>, <a href="https://github.com/datapythonista">Marc Garcia</a> and <a href="https://github.com/ngoldbaum">Nathan Goldbaum</a> for their help in implementing and improving this article. Thanks all for your help and support!</p>]]></content><author><name>Will Ayd</name></author><category term="performance" /><category term="python" /><category term="rust" /><summary type="html"><![CDATA[This blog post compares the process of creating a Python extension in Rust versus Cython.]]></summary></entry><entry><title type="html">Fundamental Python Debugging Part 3 - Cython Extensions</title><link href="https://willayd.com/fundamental-python-debugging-part-3-cython-extensions.html" rel="alternate" type="text/html" title="Fundamental Python Debugging Part 3 - Cython Extensions" /><published>2023-03-10T00:00:00+00:00</published><updated>2023-03-10T00:00:00+00:00</updated><id>https://willayd.com/fundamental-python-debugging-part-3-cython-extensions</id><content type="html" xml:base="https://willayd.com/fundamental-python-debugging-part-3-cython-extensions.html"><![CDATA[<p>For the unaware, Cython is a transpiler from a Python-like syntax into C files. This gets you close to C performance while writing files that aren’t <em>that</em> dissimilar from Python. It is used extensively in the scientific Python community to generate high-performance extensions. A common approach to optimize Python libraries is to make sure you are as efficient as possible in pure Python, before building your code in Cython, and commonly as a last resort writing your C/C++ extensions by hand.</p>

<p>In spite of this pattern we are introducing Cython as the third part of the debugging series, after already having debugged C extensions. Why is that? Well, it turns out that the Cython debugger is in fact a <a href="https://sourceware.org/gdb/onlinedocs/gdb/Python.html#Python">gdb python extension</a>, which we saw CPython also leverage in the last chapter. We aren’t doing anything novel in this chapter but just walking through some of the conveniences the <code class="language-plaintext highlighter-rouge">cygdb</code> extension provides (interested users can find the source code <a href="https://github.com/cython/cython/blob/master/Cython/Debugger/Cygdb.py">here</a>).</p>

<p>If you haven’t read the <a href="/fundamental-python-debugging-part-2-python-extensions.html">previous article on debugging Python extensions with gdb</a>, I highly recommend that you do so before continuing here. Although writing Cython can be thought of as a stepping stone to writing C/C++ extensions, the inverse is true when it comes to debugging.</p>

<h2 id="setting-up-our-environment">Setting up our environment</h2>

<p>For this chapter we will leverage the same image as in the last, so start with:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker pull willayd/cpython-debugging
</code></pre></div></div>

<p>In addition to the items outlined in the previous chapter, this image also includes Cython as a pip-installed package. If you don’t care to use the docker image you can also follow the instructions in the <a href="https://cython.readthedocs.io/en/latest/src/userguide/debugging.html">Debugging your Cython program documentation</a>, but be aware that some of the interactions between Cython, gdb and Python aren’t very intuitive, especially if using Python installed in a virtual environment.</p>

<p>If using the docker image above, be sure to run it as a container and mount a local directory for development into the container at <code class="language-plaintext highlighter-rouge">/data</code>. As in the previous section, I will be putting my work in a directory called <code class="language-plaintext highlighter-rouge">~/code-demos</code>.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>willayd@willayd:~<span class="nv">$ </span>docker run <span class="nt">--rm</span> <span class="nt">-it</span> <span class="nt">-w</span> /data <span class="nt">-v</span> <span class="k">${</span><span class="nv">HOME</span><span class="k">}</span>/code-demos:/data willayd/cpython-debugging
</code></pre></div></div>

<h2 id="build-our-first-cython-extension">Build our first Cython extension</h2>

<p>We are going to start with the same extension we created in the previous chapter. Let’s create a file named <code class="language-plaintext highlighter-rouge">debugging_cython.pyx</code> in the folder on your computer that you mounted into docker and insert these contents:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">say_hello_and_return_none</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Hello from the Cython extension"</span><span class="p">)</span>
</code></pre></div></div>

<p>That’s it! From here we now have two steps we need to follow to get this converted into an importable extension:</p>

<ol>
  <li>Transpile the Cython file into a C module</li>
  <li>Build a shared library from the C module</li>
</ol>

<p>The <code class="language-plaintext highlighter-rouge">cython</code> command can help us with Step 1; Step 2 builds on a lot of knowledge from the previous chapter. Here are the commands:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@f241800d6a12:/data# cython <span class="nt">--gdb</span> debugging_cython.pyx
root@f241800d6a12:/data# gcc <span class="nt">-g3</span> <span class="nt">-Wall</span> <span class="nt">-Werror</span> <span class="nt">-std</span><span class="o">=</span>c17 <span class="nt">-shared</span> <span class="nt">-fPIC</span> <span class="nt">-I</span>/usr/local/include/python3.10d debugging_cython.c <span class="nt">-o</span> debugging_cython.so
</code></pre></div></div>

<p>With the extension built, you can import the module and call the function.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@f241800d6a12:/data# python3
<span class="o">&gt;&gt;&gt;</span> import debugging_cython
<span class="o">&gt;&gt;&gt;</span> debugging_cython.say_hello_and_return_none<span class="o">()</span>
Hello from the Cython extension
</code></pre></div></div>

<h2 id="using-cygdb">Using cygdb</h2>

<p>If you inspect the <code class="language-plaintext highlighter-rouge">debugging_cython.c</code> file that was generated in the previous section, you will see that you could debug it using <code class="language-plaintext highlighter-rouge">gdb</code> as if it were a normal C module, because it is. It certainly doesn’t look like anything that you would have written by hand, but there isn’t any real magic to what is happening here; Cython takes Python-like code and transpiles a C file out of it. The rest of the tooling that we’ve seen in the previous chapter can pick things up from there. However, because the file was auto-generated you lose a lot of the abstractions that you get from writing Python-like code, and end up stepping through a tangled web of variables you aren’t familiar with in gdb. <code class="language-plaintext highlighter-rouge">pdb</code> cannot debug Cython files for us, so we need to use <code class="language-plaintext highlighter-rouge">cygdb</code>. We can then set a breakpoint at our function using the <code class="language-plaintext highlighter-rouge">cy break</code> command and open up a Python interpreter with <code class="language-plaintext highlighter-rouge">cy run</code>.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@fad66408f996:/data# cygdb
<span class="o">(</span>gdb<span class="o">)</span> cy <span class="nb">break </span>say_hello_and_return_none
Function <span class="s2">"__pyx_pw_16debugging_cython_1say_hello_and_return_none"</span> not defined.
Breakpoint 1 <span class="o">(</span>__pyx_pw_16debugging_cython_1say_hello_and_return_none<span class="o">)</span> pending.
<span class="o">(</span>gdb<span class="o">)</span> cy run
Python 3.10.10+ <span class="o">(</span>heads/3.10:bac3fe7, Feb 22 2023, 05:56:35<span class="o">)</span> <span class="o">[</span>GCC 11.3.0] on linux
Type <span class="s2">"help"</span>, <span class="s2">"copyright"</span>, <span class="s2">"credits"</span> or <span class="s2">"license"</span> <span class="k">for </span>more information.
<span class="o">&gt;&gt;&gt;</span>
</code></pre></div></div>

<p>With the Python interpreter running let us import and execute our function.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> import debugging_cython
<span class="o">&gt;&gt;&gt;</span> debugging_cython.say_hello_and_return_none<span class="o">()</span>

Breakpoint 1, __pyx_pw_16debugging_cython_1say_hello_and_return_none <span class="o">(</span><span class="nv">__pyx_self</span><span class="o">=</span>0x0, <span class="nv">unused</span><span class="o">=</span>0x0<span class="o">)</span> at debugging_cython.c:1202
1202   PyObject <span class="k">*</span>__pyx_r <span class="o">=</span> 0<span class="p">;</span>
1    def say_hello_and_return_none<span class="o">()</span>:
</code></pre></div></div>

<p>We’ve hit a breakpoint at line 1202 of the generated <code class="language-plaintext highlighter-rouge">debugging_cython.c</code> file. The commands the Cython debugger exposes are not really that different from what we saw with <code class="language-plaintext highlighter-rouge">gdb</code> in the previous chapter. The difference is that the <code class="language-plaintext highlighter-rouge">gdb</code> built-in commands will work as if you are debugging <code class="language-plaintext highlighter-rouge">debugging_cython.c</code>, whereas the <code class="language-plaintext highlighter-rouge">cygdb</code> commands will work as if you are debugging <code class="language-plaintext highlighter-rouge">debugging_cython.pyx</code>. Inputting <code class="language-plaintext highlighter-rouge">list</code> and then <code class="language-plaintext highlighter-rouge">cy list</code> will help us see this in action:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> list
         1    def say_hello_and_return_none<span class="o">()</span>:1197
1198 /<span class="k">*</span> Python wrapper <span class="k">*</span>/
1199 static PyObject <span class="k">*</span>__pyx_pw_16debugging_cython_1say_hello_and_return_none<span class="o">(</span>PyObject <span class="k">*</span>__pyx_self, CYTHON_UNUSED PyObject <span class="k">*</span>unused<span class="o">)</span><span class="p">;</span> /<span class="k">*</span>proto<span class="k">*</span>/
1200 static PyMethodDef __pyx_mdef_16debugging_cython_1say_hello_and_return_none <span class="o">=</span> <span class="o">{</span><span class="s2">"say_hello_and_return_none"</span>, <span class="o">(</span>PyCFunction<span class="o">)</span>__pyx_pw_16debugging_cython_1say_hello_and_return_none, METH_NOARGS, 0<span class="o">}</span><span class="p">;</span>
1201 static PyObject <span class="k">*</span>__pyx_pw_16debugging_cython_1say_hello_and_return_none<span class="o">(</span>PyObject <span class="k">*</span>__pyx_self, CYTHON_UNUSED PyObject <span class="k">*</span>unused<span class="o">)</span> <span class="o">{</span>
1202   PyObject <span class="k">*</span>__pyx_r <span class="o">=</span> 0<span class="p">;</span>
1203   __Pyx_RefNannyDeclarations
1204   __Pyx_RefNannySetupContext<span class="o">(</span><span class="s2">"say_hello_and_return_none (wrapper)"</span>, 0<span class="o">)</span><span class="p">;</span>
1205   __pyx_r <span class="o">=</span> __pyx_pf_16debugging_cython_say_hello_and_return_none<span class="o">(</span>__pyx_self<span class="o">)</span><span class="p">;</span>
1206
<span class="o">(</span>gdb<span class="o">)</span> cy list
<span class="o">&gt;</span>    1    def say_hello_and_return_none<span class="o">()</span>:
     2        print<span class="o">(</span><span class="s2">"Hello from the Cython extension"</span><span class="o">)</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">help cy</code> gives a nice overview within <code class="language-plaintext highlighter-rouge">gdb</code> of the available commands. It is a much smaller set of commands than what <code class="language-plaintext highlighter-rouge">gdb</code> offers, but should cover the majority of needs in normal development.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> <span class="nb">help </span>cy

    Invoke a Cython command. Available commands are:

        cy import
        cy <span class="nb">break
        </span>cy step
        cy next
        cy run
        cy cont
        cy finish
        cy up
        cy down
        cy <span class="k">select
        </span>cy bt / cy backtrace
        cy list
        cy print
        cy <span class="nb">set
        </span>cy locals
        cy globals
        cy <span class="nb">exec</span>

...
Type <span class="s2">"help cy"</span> followed by cy subcommand name <span class="k">for </span>full documentation.
Type <span class="s2">"apropos word"</span> to search <span class="k">for </span>commands related to <span class="s2">"word"</span><span class="nb">.</span>
Type <span class="s2">"apropos -v word"</span> <span class="k">for </span>full documentation of commands related to <span class="s2">"word"</span><span class="nb">.</span>
Command name abbreviations are allowed <span class="k">if </span>unambiguous.
</code></pre></div></div>

<h2 id="cpdef-functions">cpdef functions</h2>

<p>Our previous program leveraged a <code class="language-plaintext highlighter-rouge">def</code> function, which Cython makes callable from the Python interpreter. Cython also offers <code class="language-plaintext highlighter-rouge">cdef</code> functions (not callable from Python) and <code class="language-plaintext highlighter-rouge">cpdef</code> functions, which essentially generate a <code class="language-plaintext highlighter-rouge">def</code> and a <code class="language-plaintext highlighter-rouge">cdef</code> for you. A detailed explanation of why you would choose those is outside the scope of this article; if you need a primer be sure to check out the wonderful <a href="https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html">Cython language basics documentation</a>.</p>

<p>For debugging purposes, let’s create <code class="language-plaintext highlighter-rouge">debugging_cython2.pyx</code> and change our function from <code class="language-plaintext highlighter-rouge">def</code> to <code class="language-plaintext highlighter-rouge">cpdef</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cpdef</span> <span class="n">say_hello_from_cpdef</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Hello from the cpdef function"</span><span class="p">)</span>
</code></pre></div></div>

<p>If you are still running <code class="language-plaintext highlighter-rouge">cygdb</code> from the previous section, go ahead and <code class="language-plaintext highlighter-rouge">exit</code> to get back to your standard terminal. From there, we want to transpile and create our new shared library:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@f241800d6a12:/data# cython <span class="nt">--gdb</span> debugging_cython2.pyx
root@f241800d6a12:/data# gcc <span class="nt">-g3</span> <span class="nt">-Wall</span> <span class="nt">-Werror</span> <span class="nt">-std</span><span class="o">=</span>c17 <span class="nt">-shared</span> <span class="nt">-fPIC</span> <span class="nt">-I</span>/usr/local/include/python3.10d debugging_cython2.c <span class="nt">-o</span> debugging_cython2.so
</code></pre></div></div>

<p>Fire up <code class="language-plaintext highlighter-rouge">cygdb</code> again and set another breakpoint on that function:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> cy <span class="nb">break </span>say_hello_from_cpdef
Function <span class="s2">"__pyx_f_17debugging_cython2_say_hello_from_cpdef"</span> not defined.
Breakpoint 1 <span class="o">(</span>__pyx_f_17debugging_cython2_say_hello_from_cpdef<span class="o">)</span> pending.
Function <span class="s2">"__pyx_pw_17debugging_cython2_1say_hello_from_cpdef"</span> not defined.
Breakpoint 2 <span class="o">(</span>__pyx_pw_17debugging_cython2_1say_hello_from_cpdef<span class="o">)</span> pending.
</code></pre></div></div>

<p>What is interesting here is that we now have 2 breakpoints! The reason for this again is that <code class="language-plaintext highlighter-rouge">cpdef</code> generates two functions for us - one purely accessible from C and one accessible from Python. Go ahead and <code class="language-plaintext highlighter-rouge">cy run</code> to get the Python interpreter started; we will then run <code class="language-plaintext highlighter-rouge">cy cont</code> to continue past each breakpoint.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> cy run
Python 3.10.10+ <span class="o">(</span>heads/3.10:bac3fe7, Feb 22 2023, 05:56:35<span class="o">)</span> <span class="o">[</span>GCC 11.3.0] on linux
Type <span class="s2">"help"</span>, <span class="s2">"copyright"</span>, <span class="s2">"credits"</span> or <span class="s2">"license"</span> <span class="k">for </span>more information.
<span class="o">&gt;&gt;&gt;</span> import debugging_cython2
<span class="o">&gt;&gt;&gt;</span> debugging_cython2.say_hello_from_cpdef<span class="o">()</span>

Breakpoint 2, __pyx_pw_17debugging_cython2_1say_hello_from_cpdef <span class="o">(</span><span class="nv">__pyx_self</span><span class="o">=</span>&lt;module at remote 0x7f1da030d6d0&gt;, <span class="nv">unused</span><span class="o">=</span>0x0<span class="o">)</span> at debugging_cython2.c:1227
1227   PyObject <span class="k">*</span>__pyx_r <span class="o">=</span> 0<span class="p">;</span>
<span class="o">(</span>gdb<span class="o">)</span> cy list
  1222    <span class="o">}</span>
  1223
  1224    /<span class="k">*</span> Python wrapper <span class="k">*</span>/
  1225    static PyObject <span class="k">*</span>__pyx_pw_17debugging_cython2_1say_hello_from_cpdef<span class="o">(</span>PyObject <span class="k">*</span>__pyx_self, CYTHON_UNUSED PyObject <span class="k">*</span>unused<span class="o">)</span><span class="p">;</span> /<span class="k">*</span>proto<span class="k">*</span>/
  1226    static PyObject <span class="k">*</span>__pyx_pw_17debugging_cython2_1say_hello_from_cpdef<span class="o">(</span>PyObject <span class="k">*</span>__pyx_self, CYTHON_UNUSED PyObject <span class="k">*</span>unused<span class="o">)</span> <span class="o">{</span>
<span class="o">&gt;</span> 1227      PyObject <span class="k">*</span>__pyx_r <span class="o">=</span> 0<span class="p">;</span>
  1228      __Pyx_RefNannyDeclarations
  1229      __Pyx_RefNannySetupContext<span class="o">(</span><span class="s2">"say_hello_from_cpdef (wrapper)"</span>, 0<span class="o">)</span><span class="p">;</span>
  1230      __pyx_r <span class="o">=</span> __pyx_pf_17debugging_cython2_say_hello_from_cpdef<span class="o">(</span>__pyx_self<span class="o">)</span><span class="p">;</span>
  1231
<span class="o">(</span>gdb<span class="o">)</span> cy cont

Breakpoint 1, __pyx_f_17debugging_cython2_say_hello_from_cpdef <span class="o">(</span><span class="nv">__pyx_skip_dispatch</span><span class="o">=</span>0<span class="o">)</span> at debugging_cython2.c:1194
1194   PyObject <span class="k">*</span>__pyx_r <span class="o">=</span> NULL<span class="p">;</span>
1    cpdef say_hello_from_cpdef<span class="o">()</span>:
<span class="o">(</span>gdb<span class="o">)</span> cy list
<span class="o">&gt;</span>    1    cpdef say_hello_from_cpdef<span class="o">()</span>:
     2        print<span class="o">(</span><span class="s2">"Hello from the cpdef function"</span><span class="o">)</span>
<span class="o">(</span>gdb<span class="o">)</span> cy cont
Hello from the cpdef <span class="k">function</span>
<span class="o">&gt;&gt;&gt;</span> quit<span class="o">()</span>
<span class="o">[</span>Inferior 1 <span class="o">(</span>process 105<span class="o">)</span> exited normally]
</code></pre></div></div>

<p>Note that the <code class="language-plaintext highlighter-rouge">cy list</code> in the first breakpoint lists C source code, whereas the second <code class="language-plaintext highlighter-rouge">cy list</code> shows the Cython source code. Given the purpose of <code class="language-plaintext highlighter-rouge">cpdef</code> this may not be too surprising, but it may be confusing to new users.</p>
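<p>If you are wondering where the two symbol names reported by <code class="language-plaintext highlighter-rouge">cy break</code> come from, here is a minimal sketch that rebuilds them in plain Python. The pattern is only inferred from the gdb output in this section, so treat it as an assumption rather than Cython’s actual name-mangling rules. The helper <code class="language-plaintext highlighter-rouge">cpdef_symbols</code> and its <code class="language-plaintext highlighter-rouge">wrapper_index</code> parameter are hypothetical names made up for illustration.</p>

```python
def cpdef_symbols(module: str, func: str, wrapper_index: int = 1) -> dict:
    """Rebuild the two C symbol names seen in the gdb session above.

    The pattern is inferred from that output only; it is not Cython's
    actual name-mangling code, and wrapper_index is a guess at what the
    digit before the function name represents.
    """
    # The module name appears prefixed by its own length (17 here)
    prefix = f"{len(module)}{module}"
    return {
        # C-level entry point, the first breakpoint cy break created
        "c_function": f"__pyx_f_{prefix}_{func}",
        # Python wrapper, the breakpoint hit first when calling from Python
        "python_wrapper": f"__pyx_pw_{prefix}_{wrapper_index}{func}",
    }


symbols = cpdef_symbols("debugging_cython2", "say_hello_from_cpdef")
for role, name in symbols.items():
    print(f"{role}: {name}")
```

<p>Reading the names this way makes it easier to tell at a glance whether a breakpoint sits in the C-level function or in the Python wrapper.</p>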

<h2 id="managing-cy-break-breakpoints">Managing cy break breakpoints</h2>

<p>While <code class="language-plaintext highlighter-rouge">cy break</code> lets you create breakpoints, it does not give you any tools to delete, enable, or disable them. However, you can work around this by using the native <code class="language-plaintext highlighter-rouge">gdb</code> commands for managing breakpoints, which we covered in the previous debugging article. Continuing with our example above, an <code class="language-plaintext highlighter-rouge">info break</code> yields the following:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> info <span class="nb">break
</span>Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00007f1da010581d <span class="k">in </span>__pyx_f_17debugging_cython2_say_hello_from_cpdef at debugging_cython2.c:1194
     breakpoint already hit 2 <span class="nb">times
</span>2       breakpoint     keep y   0x00007f1da01058c6 <span class="k">in </span>__pyx_pw_17debugging_cython2_1say_hello_from_cpdef at debugging_cython2.c:1227
     breakpoint already hit 3 <span class="nb">times</span>
</code></pre></div></div>

<p>If you didn’t want the first breakpoint to be hit from Cython, you can <code class="language-plaintext highlighter-rouge">delete 1</code> or <code class="language-plaintext highlighter-rouge">disable 1</code>.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> disable 1
<span class="o">(</span>gdb<span class="o">)</span> cy run
Python 3.10.10+ <span class="o">(</span>heads/3.10:bac3fe7, Feb 22 2023, 05:56:35<span class="o">)</span> <span class="o">[</span>GCC 11.3.0] on linux
Type <span class="s2">"help"</span>, <span class="s2">"copyright"</span>, <span class="s2">"credits"</span> or <span class="s2">"license"</span> <span class="k">for </span>more information.
<span class="o">&gt;&gt;&gt;</span> import debugging_cython2
<span class="o">&gt;&gt;&gt;</span> debugging_cython2.say_hello_from_cpdef<span class="o">()</span>

Breakpoint 2, __pyx_pw_17debugging_cython2_1say_hello_from_cpdef <span class="o">(</span><span class="nv">__pyx_self</span><span class="o">=</span>&lt;module at remote 0x7f8825188650&gt;, <span class="nv">unused</span><span class="o">=</span>0x0<span class="o">)</span> at debugging_cython2.c:1227
1227   PyObject <span class="k">*</span>__pyx_r <span class="o">=</span> 0<span class="p">;</span>
1227      PyObject <span class="k">*</span>__pyx_r <span class="o">=</span> 0<span class="p">;</span>
<span class="o">(</span>gdb<span class="o">)</span> cy cont
Hello from the cpdef <span class="k">function</span>
<span class="o">&gt;&gt;&gt;</span> debugging_cython2.say_hello_from_cpdef<span class="o">()</span>

Breakpoint 2, __pyx_pw_17debugging_cython2_1say_hello_from_cpdef <span class="o">(</span><span class="nv">__pyx_self</span><span class="o">=</span>&lt;module at remote 0x7f8825188650&gt;, <span class="nv">unused</span><span class="o">=</span>0x0<span class="o">)</span> at debugging_cython2.c:1227
1227   PyObject <span class="k">*</span>__pyx_r <span class="o">=</span> 0<span class="p">;</span>
1227      PyObject <span class="k">*</span>__pyx_r <span class="o">=</span> 0<span class="p">;</span>
<span class="o">(</span>gdb<span class="o">)</span> cy cont
Hello from the cpdef <span class="k">function</span>
<span class="o">&gt;&gt;&gt;</span> quit<span class="o">()</span>
<span class="o">[</span>Inferior 1 <span class="o">(</span>process 105<span class="o">)</span> exited normally]
</code></pre></div></div>

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>If you’ve made it this far - congratulations! Debugging as we’ve done in this three-part series is not going to be the flashiest thing you do as a developer. However, I can guarantee that working with these tools at such a level will give you a critical foundation upon which you can build. Whether you are a Python developer looking to go <em>lower level</em> for performance reasons, or you are a C/C++ developer looking to go <em>higher level</em> to work with good abstractions, having these debuggers at your disposal will let you move up and down your computing stack with relative ease. Now go forth and have fun!</p>]]></content><author><name>Will Ayd</name></author><category term="debugging" /><category term="python" /><category term="cython" /><summary type="html"><![CDATA[This blog post teaches you how to debug Cython extensions to Python. It is part 3 of a 3 part series.]]></summary></entry><entry><title type="html">Fundamental Python Debugging Part 2 - Python Extensions</title><link href="https://willayd.com/fundamental-python-debugging-part-2-python-extensions.html" rel="alternate" type="text/html" title="Fundamental Python Debugging Part 2 - Python Extensions" /><published>2023-02-22T00:00:00+00:00</published><updated>2023-02-22T00:00:00+00:00</updated><id>https://willayd.com/fundamental-python-debugging-part-2-python-extensions</id><content type="html" xml:base="https://willayd.com/fundamental-python-debugging-part-2-python-extensions.html"><![CDATA[<p><a href="https://docs.python.org/3/extending/index.html">Python extensions</a> are a key component in making Python libraries fast. With an extension, you have the ability to write code in a lower-level language like C or C++ but still interact with that code via the Python runtime. 
Many high-performance scientific Python libraries use this type of architecture, whether by hand-writing C/C++ extensions, generating them with a Python to C/C++ <em>transpiler</em> like <a href="https://cython.org/">Cython</a>, or both.</p>

<p>This has tradeoffs for a library author. While Python is an interpreted language, extensions are typically written in languages that need to be compiled. Extensions also cannot be debugged with pdb. However, as you’ll see in the following sections, pdb is heavily influenced by a lot of the tooling used for extension debugging, so if you worked through <a href="/fundamental-python-debugging-part-1-python.html">the first article in this debugging series</a> you should have a solid foundation to build off of.</p>

<h2 id="setting-up-our-environment">Setting up our environment</h2>

<p>A challenge we didn’t face in the previous article was cross-platform tooling. pdb works regardless of your OS and architecture, but as we move further down into the stack we have to use tools more tailored to our environment.</p>

<p>Writing installation and usage instructions for all platforms would be quite the task. To abstract away those nuances and make this guide easier to follow, I assume you will be using the docker image hosted at <a href="https://hub.docker.com/r/willayd/cpython-debugging">willayd/cpython-debugging</a>. This docker image contains the following items:</p>

<ul>
  <li><a href="https://gcc.gnu.org/">gcc</a>, which we use to build extensions</li>
  <li><a href="https://github.com/python/cpython">CPython</a> source code located in /clones/cpython</li>
  <li>A development build of Python pre-installed</li>
  <li>A custom build of <code class="language-plaintext highlighter-rouge">gdb</code> which knows about the development Python installation</li>
</ul>

<p>Not all of these elements are required, but they all make debugging easier.</p>

<p>To get started with the image, be sure to first install the <a href="https://docs.docker.com/engine/install">docker engine</a>, at which point you can then:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker pull willayd/cpython-debugging
</code></pre></div></div>

<p>A quick <code class="language-plaintext highlighter-rouge">docker images</code> should show that same image on your local machine. Once you have the image, choose a location on your host computer to mount into the container you will run from that image, so something like:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run <span class="nt">--rm</span> <span class="nt">-it</span> <span class="nt">-w</span> /data <span class="nt">-v</span> &lt;PATH_TO_YOUR_WORK&gt;:/data willayd/cpython-debugging
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-v</code> flag takes two paths separated by a <code class="language-plaintext highlighter-rouge">:</code>. The part preceding the <code class="language-plaintext highlighter-rouge">:</code> is a location on your host computer; docker mounts that location at the path specified after the <code class="language-plaintext highlighter-rouge">:</code> within the container, which we’ve chosen above as <code class="language-plaintext highlighter-rouge">/data</code>. Note that you can use shell expansion of environment variables like <code class="language-plaintext highlighter-rouge">-v ${HOME}/code:/data</code> if you have your work locally in a <code class="language-plaintext highlighter-rouge">code</code> subdirectory of your home directory. Even simpler, you could do <code class="language-plaintext highlighter-rouge">-v ${PWD}:/data</code> if your shell is already within the directory you want to mount.</p>
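<p>To make the two halves of the <code class="language-plaintext highlighter-rouge">-v</code> argument concrete, here is a small Python sketch of how I read such a string. It is not how docker parses the flag internally, and <code class="language-plaintext highlighter-rouge">split_volume_spec</code> and <code class="language-plaintext highlighter-rouge">DEMO_HOME</code> are hypothetical names used only for this illustration.</p>

```python
import os


def split_volume_spec(spec: str) -> tuple:
    """Split a HOST:CONTAINER string into its two halves.

    Simplified for illustration: environment variables are expanded
    first, the way a shell would do before docker sees the argument.
    The optional third field that docker also accepts (such as :ro)
    is not handled here.
    """
    host, container = os.path.expandvars(spec).split(":", 1)
    return host, container


os.environ["DEMO_HOME"] = "/home/me"  # hypothetical variable for the demo
host_path, container_path = split_volume_spec("${DEMO_HOME}/code:/data")
print(f"host {host_path} is mounted at {container_path} in the container")
```

<p>The point to take away is that the variable is already expanded by the time the path is used, so the container only ever sees an absolute host path on the left and its own mount point on the right.</p>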

<h2 id="building-our-first-extension">Building our first extension</h2>

<p>Let’s start with the following code in a file named <code class="language-plaintext highlighter-rouge">debugging_demo.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PY_SSIZE_T_CLEAN
#include</span> <span class="cpf">&lt;Python.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="n">PyObject</span> <span class="o">*</span>
<span class="nf">say_hello_and_return_none</span> <span class="p">(</span><span class="n">PyObject</span> <span class="o">*</span><span class="n">self</span><span class="p">,</span> <span class="n">PyObject</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">printf</span> <span class="p">(</span><span class="s">"Hello from the extension</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
  <span class="n">Py_RETURN_NONE</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="n">PyMethodDef</span> <span class="n">debugging_demo_methods</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
  <span class="p">{</span><span class="s">"say_hello_and_return_none"</span><span class="p">,</span> <span class="n">say_hello_and_return_none</span><span class="p">,</span> <span class="n">METH_VARARGS</span><span class="p">,</span>
   <span class="s">"Says hello and returns none."</span><span class="p">},</span>
  <span class="p">{</span><span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">}</span>  <span class="cm">/* Sentinel */</span>
<span class="p">};</span>

<span class="k">static</span> <span class="k">struct</span> <span class="n">PyModuleDef</span> <span class="n">debugging_demo_module</span> <span class="o">=</span> <span class="p">{</span>
  <span class="n">PyModuleDef_HEAD_INIT</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_name</span> <span class="o">=</span> <span class="s">"debugging_demo"</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_doc</span> <span class="o">=</span> <span class="s">"A simple extension to showcase debugging"</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_size</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_methods</span> <span class="o">=</span> <span class="n">debugging_demo_methods</span>
<span class="p">};</span>

<span class="n">PyMODINIT_FUNC</span> <span class="nf">PyInit_debugging_demo</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">return</span> <span class="n">PyModuleDef_Init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">debugging_demo_module</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’ve saved this locally under <code class="language-plaintext highlighter-rouge">~/code-demos</code>, so I’m going to launch my docker container with <code class="language-plaintext highlighter-rouge">docker run --rm -it -w /data -v ${HOME}/code-demos:/data willayd/cpython-debugging</code>. A quick <code class="language-plaintext highlighter-rouge">ls</code> should confirm you have mounted everything properly:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>willayd@willayd:~<span class="nv">$ </span>docker run <span class="nt">--rm</span> <span class="nt">-it</span> <span class="nt">-w</span> /data <span class="nt">-v</span> <span class="k">${</span><span class="nv">HOME</span><span class="k">}</span>/code-demos:/data willayd/cpython-debugging
root@4a6161a82673:/data# <span class="nb">ls
</span>debugging_demo.c
root@4a6161a82673:/data#
</code></pre></div></div>

<p>We can build our C module into a shared library, after which we will be able to load it into Python.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@12a481d4fa4c:/data# gcc <span class="nt">-g3</span> <span class="nt">-Wall</span> <span class="nt">-Werror</span> <span class="nt">-std</span><span class="o">=</span>c17 <span class="nt">-shared</span> <span class="nt">-fPIC</span> <span class="nt">-I</span>/usr/local/include/python3.10d debugging_demo.c <span class="nt">-o</span> debugging_demo.so
root@12a481d4fa4c:/data# <span class="nb">ls
</span>debugging_demo.c  debugging_demo.so
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">gcc</code> is our tool for building the code, and all of the flags we provide here are documented in the <a href="https://gcc.gnu.org/onlinedocs/gcc/Invoking-GCC.html">gcc Command Options</a>.</p>

<p><code class="language-plaintext highlighter-rouge">-g3</code> instructs gcc to insert debugging information into the target, including macros. Without this, you may not be able to debug your application properly, may be unable to inspect source code, and may see things like <code class="language-plaintext highlighter-rouge">optimized out</code> when inspecting variables in gdb.</p>

<p><code class="language-plaintext highlighter-rouge">-Wall</code> turns on a lot of warnings (not all) and pairs well with <code class="language-plaintext highlighter-rouge">-Werror</code>. For new C developers, I always suggest using these two. Coming from higher level languages like Python you may be used to ignoring warnings, but in C most warnings you get as a new developer really are critical coding errors.</p>

<p><code class="language-plaintext highlighter-rouge">-shared</code> and <code class="language-plaintext highlighter-rouge">-fPIC</code> are both required for building a shared library, and <code class="language-plaintext highlighter-rouge">-I/usr/local/include/python3.10d</code> allows gcc to find our <code class="language-plaintext highlighter-rouge">Python.h</code> header file. All of these are necessary to make our extension loadable from Python.</p>

<p><code class="language-plaintext highlighter-rouge">-o debugging_demo.so</code> creates our shared library with an <code class="language-plaintext highlighter-rouge">.so</code> extension, which is common on GNU/Linux platforms. On macOS you may see a similar concept with a <code class="language-plaintext highlighter-rouge">.dylib</code> extension, whereas Windows has <code class="language-plaintext highlighter-rouge">.dll</code>.</p>
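<p>If you are following along outside of the docker image, the include path and the accepted file extension will likely differ from what I show above. As a rough sketch, you can ask the interpreter itself for both using the standard library. The exact values printed depend on your platform and build, so I am not showing any expected output here.</p>

```python
import importlib.machinery
import sysconfig

# Directory containing Python.h for the interpreter running this script;
# this is the value to pass to gcc after -I.
include_dir = sysconfig.get_paths()["include"]
print("include dir:", include_dir)

# 1 on a debug build such as the one in the docker image; other builds
# typically report 0 or None.
print("debug build:", sysconfig.get_config_var("Py_DEBUG"))

# File suffixes the import system accepts for extension modules.
suffixes = importlib.machinery.EXTENSION_SUFFIXES
print("extension suffixes:", suffixes)
```

<p>On a GNU/Linux system I would expect a plain <code class="language-plaintext highlighter-rouge">.so</code> to show up among the suffixes, which is why a file named <code class="language-plaintext highlighter-rouge">debugging_demo.so</code> can be imported without any further naming conventions.</p>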

<p>Now that this shared library is available, it can be loaded, inspected and executed from the Python interpreter.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@12a481d4fa4c:/data# python3
Python 3.10.10+ <span class="o">(</span>heads/3.10:bac3fe7, Feb 22 2023, 05:56:35<span class="o">)</span> <span class="o">[</span>GCC 11.3.0] on linux
Type <span class="s2">"help"</span>, <span class="s2">"copyright"</span>, <span class="s2">"credits"</span> or <span class="s2">"license"</span> <span class="k">for </span>more information.
<span class="o">&gt;&gt;&gt;</span> import debugging_demo
<span class="o">&gt;&gt;&gt;</span> debugging_demo.__doc__
<span class="s1">'A simple extension to showcase debugging'</span>
<span class="o">&gt;&gt;&gt;</span> <span class="nb">dir</span><span class="o">(</span>debugging_demo<span class="o">)</span>
<span class="o">[</span><span class="s1">'__doc__'</span>, <span class="s1">'__file__'</span>, <span class="s1">'__loader__'</span>, <span class="s1">'__name__'</span>, <span class="s1">'__package__'</span>, <span class="s1">'__spec__'</span>, <span class="s1">'say_hello_and_return_none'</span><span class="o">]</span>
<span class="o">&gt;&gt;&gt;</span> debugging_demo.say_hello_and_return_none.__doc__
<span class="s1">'Says hello and returns none.'</span>
<span class="o">&gt;&gt;&gt;</span> debugging_demo.say_hello_and_return_none<span class="o">()</span>
Hello from the extension
</code></pre></div></div>

<h2 id="inspecting-things-with-gdb">Inspecting things with gdb</h2>

<p>If we want to look at the intermediate state of things, we can pause execution and move around the stack like we did with <code class="language-plaintext highlighter-rouge">pdb</code> in the previous article, but this time we will be using <code class="language-plaintext highlighter-rouge">gdb</code>. To get started, simply run <code class="language-plaintext highlighter-rouge">gdb python3</code>. Thereafter, <code class="language-plaintext highlighter-rouge">help</code> is a good place to start.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> <span class="nb">help
</span>List of classes of commands:

aliases <span class="nt">--</span> User-defined aliases of other commands.
breakpoints <span class="nt">--</span> Making program stop at certain points.
data <span class="nt">--</span> Examining data.
files <span class="nt">--</span> Specifying and examining files.
internals <span class="nt">--</span> Maintenance commands.
obscure <span class="nt">--</span> Obscure features.
running <span class="nt">--</span> Running the program.
stack <span class="nt">--</span> Examining the stack.
status <span class="nt">--</span> Status inquiries.
support <span class="nt">--</span> Support facilities.
text-user-interface <span class="nt">--</span> TUI is the GDB text based interface.
tracepoints <span class="nt">--</span> Tracing of program execution without stopping the program.
user-defined <span class="nt">--</span> User-defined commands.

Type <span class="s2">"help"</span> followed by a class name <span class="k">for </span>a list of commands <span class="k">in </span>that class.
Type <span class="s2">"help all"</span> <span class="k">for </span>the list of all commands.
Type <span class="s2">"help"</span> followed by <span class="nb">command </span>name <span class="k">for </span>full documentation.
Type <span class="s2">"apropos word"</span> to search <span class="k">for </span>commands related to <span class="s2">"word"</span><span class="nb">.</span>
Type <span class="s2">"apropos -v word"</span> <span class="k">for </span>full documentation of commands related to <span class="s2">"word"</span><span class="nb">.</span>
Command name abbreviations are allowed <span class="k">if </span>unambiguous.
<span class="o">(</span>gdb<span class="o">)</span>
</code></pre></div></div>

<p>Compared to <code class="language-plaintext highlighter-rouge">pdb</code>, there are way more features within <code class="language-plaintext highlighter-rouge">gdb</code> to sift through. <code class="language-plaintext highlighter-rouge">apropos</code> or going through <code class="language-plaintext highlighter-rouge">help all</code> may be a good place to start. The help menu by default uses a very simple pager; instead you may find it useful to open the help in something like <code class="language-plaintext highlighter-rouge">less</code> using a pipe, i.e. <code class="language-plaintext highlighter-rouge">pipe help all | less</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">help status</code> subsection introduces us to the <code class="language-plaintext highlighter-rouge">info</code> command. <code class="language-plaintext highlighter-rouge">info breakpoint</code> always lists your breakpoints, of which we have none right now.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> info breakpoint
No breakpoints or watchpoints.
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">help break</code> gives great details on how to set a breakpoint. For now, let’s go ahead and enter <code class="language-plaintext highlighter-rouge">break say_hello_and_return_none</code> to enter the debugger when our function starts to execute.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> <span class="nb">break </span>say_hello_and_return_none
Function <span class="s2">"say_hello_and_return_none"</span> not defined.
Make breakpoint pending on future shared library load? <span class="o">(</span>y or <span class="o">[</span>n]<span class="o">)</span> y
Breakpoint 1 <span class="o">(</span>say_hello_and_return_none<span class="o">)</span> pending.
</code></pre></div></div>

<p>Python has not yet loaded our shared library, so gdb isn’t sure yet that this function exists. It will become available when we start running our program, so you can enter <code class="language-plaintext highlighter-rouge">y</code> when prompted above.</p>

<p>At this point go ahead and <code class="language-plaintext highlighter-rouge">run</code> to start the Python interpreter that gdb attached to. You can then import the module and execute the function, at which point gdb will come back to the forefront:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> run
Starting program: /usr/local/bin/python3
warning: Error disabling address space randomization: Operation not permitted
<span class="o">[</span>Thread debugging using libthread_db enabled]
Using host libthread_db library <span class="s2">"/lib/x86_64-linux-gnu/libthread_db.so.1"</span><span class="nb">.</span>
Python 3.10.10+ <span class="o">(</span>heads/3.10:bac3fe7, Feb 22 2023, 05:56:35<span class="o">)</span> <span class="o">[</span>GCC 11.3.0] on linux
Type <span class="s2">"help"</span>, <span class="s2">"copyright"</span>, <span class="s2">"credits"</span> or <span class="s2">"license"</span> <span class="k">for </span>more information.
<span class="o">&gt;&gt;&gt;</span> import debugging_demo
<span class="o">&gt;&gt;&gt;</span> debugging_demo.say_hello_and_return_none<span class="o">()</span>

Breakpoint 1, say_hello_and_return_none <span class="o">(</span><span class="nv">self</span><span class="o">=</span>0x7f4de7360230, <span class="nv">args</span><span class="o">=</span>0x7f4de7674250<span class="o">)</span> at debugging_demo.c:7
7      <span class="nb">printf</span> <span class="o">(</span><span class="s2">"Hello from the extension</span><span class="se">\n</span><span class="s2">"</span><span class="o">)</span><span class="p">;</span>
</code></pre></div></div>

<p>Similar to <code class="language-plaintext highlighter-rouge">pdb</code> we have a <code class="language-plaintext highlighter-rouge">backtrace</code> command (or <code class="language-plaintext highlighter-rouge">bt</code> shortcut) to inspect the call stack. Unlike pdb, this shows the call sequence tracing from the bottom up rather than the top down.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> bt
<span class="c">#0  say_hello_and_return_none (self=0x7f4de7360230, args=0x7f4de7674250) at debugging_demo.c:7</span>
<span class="c">#1  0x0000558c96fed0cb in cfunction_call (func=0x7f4de73603b0, args=&lt;optimized out&gt;, kwargs=&lt;optimized out&gt;)</span>
    at Objects/methodobject.c:552
<span class="c">#2  0x0000558c96e0a1c3 in _PyObject_MakeTpCall (tstate=tstate@entry=0x558c97f030b0,</span>
    <span class="nv">callable</span><span class="o">=</span>callable@entry<span class="o">=</span>0x7f4de73603b0, <span class="nv">args</span><span class="o">=</span>args@entry<span class="o">=</span>0x7f4de76ea7c0, <span class="nv">nargs</span><span class="o">=</span>&lt;optimized out&gt;,
    <span class="nv">keywords</span><span class="o">=</span>keywords@entry<span class="o">=</span>0x0<span class="o">)</span> at Objects/call.c:215
<span class="c">#3  0x0000558c96ec1baa in _PyObject_VectorcallTstate (tstate=0x558c97f030b0, callable=0x7f4de73603b0, args=0x7f4de76ea7c0,</span>
    <span class="nv">nargsf</span><span class="o">=</span>&lt;optimized out&gt;, <span class="nv">kwnames</span><span class="o">=</span>0x0<span class="o">)</span> at ./Include/cpython/abstract.h:112
<span class="c">#4  0x0000558c96ec6185 in PyObject_Vectorcall (kwnames=0x0, nargsf=9223372036854775808, args=0x7f4de76ea7c0,</span>
    <span class="nv">callable</span><span class="o">=</span>0x7f4de73603b0<span class="o">)</span> at ./Include/cpython/abstract.h:123
<span class="c">#5  call_function (tstate=tstate@entry=0x558c97f030b0, trace_info=trace_info@entry=0x7fffe84db900,</span>
    <span class="nv">pp_stack</span><span class="o">=</span>pp_stack@entry<span class="o">=</span>0x7fffe84db8d0, <span class="nv">oparg</span><span class="o">=</span>oparg@entry<span class="o">=</span>0, <span class="nv">kwnames</span><span class="o">=</span>kwnames@entry<span class="o">=</span>0x0<span class="o">)</span> at Python/ceval.c:5893
<span class="c">#6  0x0000558c96ed355e in _PyEval_EvalFrameDefault (tstate=0x558c97f030b0, f=0x7f4de76ea650, throwflag=&lt;optimized out&gt;)</span>
    at Python/ceval.c:4181
</code></pre></div></div>

<p>Each frame is numbered on the left-hand side, starting from 0 (the most recent frame). You can use <code class="language-plaintext highlighter-rouge">up</code> and <code class="language-plaintext highlighter-rouge">down</code> to navigate the call stack, or you can use the <code class="language-plaintext highlighter-rouge">frame</code> command / <code class="language-plaintext highlighter-rouge">f</code> shortcut to jump to any particular frame.</p>

<p>Let us go ahead and jump to frame number 2, which is in the CPython source code at <code class="language-plaintext highlighter-rouge">Objects/call.c</code> on line 215. We can then use the <code class="language-plaintext highlighter-rouge">list</code> / <code class="language-plaintext highlighter-rouge">l</code> commands, which pdb also has, to look at that code.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> f 2
<span class="c">#2  0x0000558c96e0a1c3 in _PyObject_MakeTpCall (tstate=tstate@entry=0x558c97f030b0,</span>
    <span class="nv">callable</span><span class="o">=</span>callable@entry<span class="o">=</span>0x7f4de73603b0, <span class="nv">args</span><span class="o">=</span>args@entry<span class="o">=</span>0x7f4de76ea7c0, <span class="nv">nargs</span><span class="o">=</span>&lt;optimized out&gt;,
    <span class="nv">keywords</span><span class="o">=</span>keywords@entry<span class="o">=</span>0x0<span class="o">)</span> at Objects/call.c:215
215          result <span class="o">=</span> call<span class="o">(</span>callable, argstuple, kwdict<span class="o">)</span><span class="p">;</span>
<span class="o">(</span>gdb<span class="o">)</span> l
210      <span class="o">}</span>
211
212      PyObject <span class="k">*</span>result <span class="o">=</span> NULL<span class="p">;</span>
213      <span class="k">if</span> <span class="o">(</span>_Py_EnterRecursiveCall<span class="o">(</span>tstate, <span class="s2">" while calling a Python object"</span><span class="o">)</span> <span class="o">==</span> 0<span class="o">)</span>
214      <span class="o">{</span>
215          result <span class="o">=</span> call<span class="o">(</span>callable, argstuple, kwdict<span class="o">)</span><span class="p">;</span>
216          _Py_LeaveRecursiveCall<span class="o">(</span>tstate<span class="o">)</span><span class="p">;</span>
217      <span class="o">}</span>
218
219      Py_DECREF<span class="o">(</span>argstuple<span class="o">)</span><span class="p">;</span>
</code></pre></div></div>

<p>Let’s do <code class="language-plaintext highlighter-rouge">f 0</code> to get back to the most current frame. There you can use <code class="language-plaintext highlighter-rouge">next</code> / <code class="language-plaintext highlighter-rouge">n</code> to advance to the next line, and then <code class="language-plaintext highlighter-rouge">continue</code> / <code class="language-plaintext highlighter-rouge">c</code> to let the program continue.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> f 0
<span class="c">#0  say_hello_and_return_none (self=0x7f4de7360230, args=0x7f4de7674250) at debugging_demo.c:7</span>
7      <span class="nb">printf</span> <span class="o">(</span><span class="s2">"Hello from the extension</span><span class="se">\n</span><span class="s2">"</span><span class="o">)</span><span class="p">;</span>
<span class="o">(</span>gdb<span class="o">)</span> n
Hello from the extension
8      Py_RETURN_NONE<span class="p">;</span>
<span class="o">(</span>gdb<span class="o">)</span> c
Continuing.
<span class="o">&gt;&gt;&gt;</span>
</code></pre></div></div>

<p>At the very end we get back to our Python interpreter. You can <code class="language-plaintext highlighter-rouge">quit()</code> out of this to get back to gdb, and then <code class="language-plaintext highlighter-rouge">exit</code> gdb to get back to the shell.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> quit<span class="o">()</span>
<span class="o">[</span>Inferior 1 <span class="o">(</span>process 57<span class="o">)</span> exited normally]
<span class="o">(</span>gdb<span class="o">)</span> <span class="nb">exit
</span>root@ba83cd50f6ec:/data#
</code></pre></div></div>

<h2 id="debugging-segmentation-faults">Debugging Segmentation Faults</h2>

<p>Let’s introduce an off-by-one programming error into our source code. We can create a new <code class="language-plaintext highlighter-rouge">debugging_demo2.c</code> file with similar but updated content:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PY_SSIZE_T_CLEAN
#include</span> <span class="cpf">&lt;Python.h&gt;</span><span class="cp">
</span>
<span class="cp">#define NUM_WORDS 4
</span>
<span class="k">static</span> <span class="n">PyObject</span> <span class="o">*</span>
<span class="nf">say_hello_and_return_none</span> <span class="p">(</span><span class="n">PyObject</span> <span class="o">*</span><span class="n">self</span><span class="p">,</span> <span class="n">PyObject</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">words</span><span class="p">[</span><span class="n">NUM_WORDS</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"Hello"</span><span class="p">,</span>
    <span class="s">"from"</span><span class="p">,</span>
    <span class="s">"the"</span><span class="p">,</span>
    <span class="s">"extension"</span>
  <span class="p">};</span>

  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;=</span> <span class="n">NUM_WORDS</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">printf</span> <span class="p">(</span><span class="s">"%s "</span><span class="p">,</span> <span class="n">words</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
  <span class="p">}</span>

  <span class="n">printf</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
  <span class="n">Py_RETURN_NONE</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="n">PyMethodDef</span> <span class="n">debugging_demo2_methods</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
  <span class="p">{</span><span class="s">"say_hello_and_return_none"</span><span class="p">,</span> <span class="n">say_hello_and_return_none</span><span class="p">,</span> <span class="n">METH_VARARGS</span><span class="p">,</span>
   <span class="s">"Says hello and returns none."</span><span class="p">},</span>
  <span class="p">{</span><span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">}</span>  <span class="cm">/* Sentinel */</span>
<span class="p">};</span>

<span class="k">static</span> <span class="k">struct</span> <span class="n">PyModuleDef</span> <span class="n">debugging_demo2_module</span> <span class="o">=</span> <span class="p">{</span>
  <span class="n">PyModuleDef_HEAD_INIT</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_name</span> <span class="o">=</span> <span class="s">"debugging_demo2"</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_doc</span> <span class="o">=</span> <span class="s">"A simple extension to showcase debugging"</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_size</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_methods</span> <span class="o">=</span> <span class="n">debugging_demo2_methods</span>
<span class="p">};</span>

<span class="n">PyMODINIT_FUNC</span> <span class="nf">PyInit_debugging_demo2</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">return</span> <span class="n">PyModuleDef_Init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">debugging_demo2_module</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compile with <code class="language-plaintext highlighter-rouge">gcc -g3 -Wall -Werror -std=c17 -shared -fPIC -I/usr/local/include/python3.10d debugging_demo2.c -o debugging_demo2.so</code>. A quick <code class="language-plaintext highlighter-rouge">python3 -c "import debugging_demo2; debugging_demo2.say_hello_and_return_none()"</code> this time will likely give you a segmentation fault, with no real error message.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ba83cd50f6ec:/data# python3 <span class="nt">-c</span> <span class="s2">"import debugging_demo2; debugging_demo2.say_hello_and_return_none()"</span>
Segmentation fault <span class="o">(</span>core dumped<span class="o">)</span>
</code></pre></div></div>

<p>Fortunately, gdb will automatically stop execution on a segfault and give you the ability to inspect your program. Let’s start this program using the <code class="language-plaintext highlighter-rouge">--args</code> argument to gdb, which will allow us to forward arguments like <code class="language-plaintext highlighter-rouge">-c "..."</code> to the program gdb attaches to (here python3):</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ba83cd50f6ec:/data# gdb <span class="nt">--args</span> python3 <span class="nt">-c</span> <span class="s2">"import debugging_demo2; debugging_demo2.say_hello_and_return_none()"</span>
GNU gdb <span class="o">(</span>GDB<span class="o">)</span> 12.1
Copyright <span class="o">(</span>C<span class="o">)</span> 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later &lt;http://gnu.org/licenses/gpl.html&gt;
...
For <span class="nb">help</span>, <span class="nb">type</span> <span class="s2">"help"</span><span class="nb">.</span>
Type <span class="s2">"apropos word"</span> to search <span class="k">for </span>commands related to <span class="s2">"word"</span>...
Reading symbols from python3...
<span class="o">(</span>gdb<span class="o">)</span>
</code></pre></div></div>

<p>Enter <code class="language-plaintext highlighter-rouge">run</code> and things will pause when the segmentation fault occurs:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> run
Starting program: /usr/local/bin/python3 <span class="nt">-c</span> import<span class="se">\ </span>debugging_demo2<span class="se">\;\ </span>debugging_demo2.say_hello_and_return_none<span class="se">\(\)</span>
warning: Error disabling address space randomization: Operation not permitted
<span class="o">[</span>Thread debugging using libthread_db enabled]
Using host libthread_db library <span class="s2">"/lib/x86_64-linux-gnu/libthread_db.so.1"</span><span class="nb">.</span>

Program received signal SIGSEGV, Segmentation fault.
0x00007ffb409dc97d <span class="k">in</span> ?? <span class="o">()</span> from /lib/x86_64-linux-gnu/libc.so.6
<span class="o">(</span>gdb<span class="o">)</span>
</code></pre></div></div>

<p>If we inspect the backtrace, we will see that the first three frames are from <code class="language-plaintext highlighter-rouge">/lib/x86_64-linux-gnu/libc.so.6</code>, which is part of the C standard library on GNU/Linux.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> bt
<span class="c">#0  0x00007ffb409dc97d in ?? () from /lib/x86_64-linux-gnu/libc.so.6</span>
<span class="c">#1  0x00007ffb408b5db1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6</span>
<span class="c">#2  0x00007ffb4089f81f in printf () from /lib/x86_64-linux-gnu/libc.so.6</span>
<span class="c">#3  0x00007ffb405b5245 in say_hello_and_return_none (self=0x7ffb4065dc10, args=0x7ffb406e4250) at debugging_demo.c:15</span>
<span class="c">#4  0x000055bed6af30cb in cfunction_call (func=0x7ffb406a2b10, args=&lt;optimized out&gt;, kwargs=&lt;optimized out&gt;)</span>
    at Objects/methodobject.c:552
</code></pre></div></div>

<p>In contrast to the last two frames, these libc frames carry barely any function information. This is because the library is heavily optimized and was built without debugging symbols (remember the <code class="language-plaintext highlighter-rouge">-g3</code> flag we used when compiling our own code), so gdb can’t do much besides tell us the memory address of each call. If you ever try to debug a program and can’t see the symbols you are looking for, keep this in mind.</p>

<p>In any case, we are going to assume there is no bug in the standard library and jump to frame 3 with <code class="language-plaintext highlighter-rouge">f 3</code> to inspect our own code. There, a quick <code class="language-plaintext highlighter-rouge">info locals</code> will tell us about the local variables.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> f 3
<span class="c">#3  0x00007ffb405b5245 in say_hello_and_return_none (self=0x7ffb4065dc10, args=0x7ffb406e4250) at debugging_demo.c:15</span>
15       <span class="nb">printf</span> <span class="o">(</span><span class="s2">"%s "</span>, words[i]<span class="o">)</span><span class="p">;</span>
<span class="o">(</span>gdb<span class="o">)</span> info locals
i <span class="o">=</span> 4
words <span class="o">=</span> <span class="o">{</span>0x7ffb405b6000 <span class="s2">"Hello"</span>, 0x7ffb405b6006 <span class="s2">"from"</span>, 0x7ffb405b600b <span class="s2">"the"</span>, 0x7ffb405b600f <span class="s2">"extension"</span><span class="o">}</span>
</code></pre></div></div>

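<p>Since C is a 0-indexed language, the valid indices of our 4-element array are 0 through 3. With <code class="language-plaintext highlighter-rouge">i = 4</code>, the expression <code class="language-plaintext highlighter-rouge">words[i]</code> reads past the end of the array, which is the root cause of our segmentation fault:</p>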

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> p words[3]
<span class="nv">$1</span> <span class="o">=</span> 0x7ffb405b600f <span class="s2">"extension"</span>
<span class="o">(</span>gdb<span class="o">)</span> p words[i]
<span class="nv">$2</span> <span class="o">=</span> 0x2e &lt;error: Cannot access memory at address 0x2e&gt;
</code></pre></div></div>

<p>A quick <code class="language-plaintext highlighter-rouge">l</code> lists the source code around the current line.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> l
12       <span class="s2">"the"</span>,
13       <span class="s2">"extension"</span>
14     <span class="o">}</span><span class="p">;</span>
15
16     <span class="k">for</span> <span class="o">(</span>int i <span class="o">=</span> 0<span class="p">;</span> i &lt;<span class="o">=</span> NUM_WORDS<span class="p">;</span> i++<span class="o">)</span> <span class="o">{</span>
17       <span class="nb">printf</span> <span class="o">(</span><span class="s2">"%s "</span>, words[i]<span class="o">)</span><span class="p">;</span>
18     <span class="o">}</span>
19
20     <span class="nb">printf</span><span class="o">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="o">)</span><span class="p">;</span>
21     Py_RETURN_NONE<span class="p">;</span>
</code></pre></div></div>

<p>The error is in the loop condition on line 16 of the listing above. To have this program execute properly we would need to change <code class="language-plaintext highlighter-rouge">for (int i = 0; i &lt;= NUM_WORDS; i++)</code> to <code class="language-plaintext highlighter-rouge">for (int i = 0; i &lt; NUM_WORDS; i++)</code>, keeping our array access in bounds.</p>
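
<p>As a minimal sketch of one way to make this class of bug harder to write (an assumption on my part, not something the demo extension does), you can derive the element count from the array itself and loop with a strict <code class="language-plaintext highlighter-rouge">&lt;</code>. The <code class="language-plaintext highlighter-rouge">ARRAY_LEN</code> macro and the <code class="language-plaintext highlighter-rouge">join_words</code> function below are hypothetical names used only for illustration; <code class="language-plaintext highlighter-rouge">join_words</code> writes into a caller-supplied buffer instead of calling <code class="language-plaintext highlighter-rouge">printf</code> so that the result is easy to check.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

/* Hypothetical helper: element count of a true array (not a pointer). */
#define ARRAY_LEN(arr) (sizeof (arr) / sizeof ((arr)[0]))

/* Hypothetical helper: writes each word followed by a space into buf and
   returns how many words fit. */
size_t
join_words (char *buf, size_t buf_size)
{
  const char *words[] = { "Hello", "from", "the", "extension" };
  size_t used = 0;
  size_t i;

  if (buf_size == 0) {
    return 0;
  }
  buf[0] = '\0';

  /* Strict less-than: the valid indices are 0 through ARRAY_LEN (words) - 1. */
  for (i = 0; i &lt; ARRAY_LEN (words); i++) {
    int n = snprintf (buf + used, buf_size - used, "%s ", words[i]);
    if (n &lt; 0 || (size_t) n &gt;= buf_size - used) {
      buf[used] = '\0';  /* drop the partially written word and stop */
      break;
    }
    used += (size_t) n;
  }

  return i;
}
</code></pre></div></div>

<p>With a large enough buffer this should produce <code class="language-plaintext highlighter-rouge">"Hello from the extension "</code> and return 4. Because the loop bound comes from the array itself, adding a fifth word would not require touching the loop condition.</p>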

<p>As an aside, if we had turned on optimization when compiling this via the <code class="language-plaintext highlighter-rouge">-O2</code> flag, gcc would have warned and then errored (as long as you use <code class="language-plaintext highlighter-rouge">-Werror</code>) up front. But that would have made debugging less fun.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ba83cd50f6ec:/data# gcc <span class="nt">-g3</span> <span class="nt">-O2</span> <span class="nt">-Wall</span> <span class="nt">-Werror</span> <span class="nt">-std</span><span class="o">=</span>c17 <span class="nt">-shared</span> <span class="nt">-fPIC</span> <span class="nt">-I</span>/usr/local/include/python3.10d debugging_demo2.c <span class="nt">-o</span> debugging_demo2.so
debugging_demo2.c: In <span class="k">function</span> <span class="s1">'say_hello_and_return_none'</span>:
debugging_demo2.c:17:5: error: iteration 4 invokes undefined behavior <span class="o">[</span><span class="nt">-Werror</span><span class="o">=</span>aggressive-loop-optimizations]
   17 |     <span class="nb">printf</span> <span class="o">(</span><span class="s2">"%s "</span>, words[i]<span class="o">)</span><span class="p">;</span>
      |     ^~~~~~~~~~~~~~~~~~~~~~~~
debugging_demo2.c:16:21: note: within this loop
   16 |   <span class="k">for</span> <span class="o">(</span>int i <span class="o">=</span> 0<span class="p">;</span> i &lt;<span class="o">=</span> NUM_WORDS<span class="p">;</span> i++<span class="o">)</span> <span class="o">{</span>
      |                     ^
cc1: all warnings being treated as errors
</code></pre></div></div>

<h2 id="debugging-python-c-data-exchange">Debugging Python-&gt;C data exchange</h2>

<p>CPython distributes a <a href="https://sourceware.org/gdb/onlinedocs/gdb/Python.html#Python">gdb python extension</a> that bridges the gap between what you as a Python developer see at runtime versus what gdb knows about the objects it sees at a lower level. This extension is housed in the <a href="https://github.com/python/cpython/blob/main/Tools/gdb/libpython.py">CPython source code</a>, which we also have hanging around at <code class="language-plaintext highlighter-rouge">/clones</code> in our Docker image.</p>

<p>Let’s continue expanding on our previous extension, this time naming it <code class="language-plaintext highlighter-rouge">debugging_demo3.c</code>. Rather than being self-contained, the new extension will print whatever name you pass to it through the Python interpreter. Our initial structure looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PY_SSIZE_T_CLEAN
#include</span> <span class="cpf">&lt;Python.h&gt;</span><span class="cp">
</span>
<span class="cp">#define NUM_WORDS 4
</span>
<span class="k">static</span> <span class="n">PyObject</span> <span class="o">*</span>
<span class="nf">say_hello_and_return_none</span> <span class="p">(</span><span class="n">PyObject</span> <span class="o">*</span><span class="n">self</span><span class="p">,</span> <span class="n">PyObject</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">PyObject</span> <span class="o">*</span><span class="n">name</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">PyArg_ParseTuple</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="s">"O"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">name</span><span class="p">))</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">str</span> <span class="o">=</span> <span class="n">PyUnicode_AsUTF8</span><span class="p">(</span><span class="n">name</span><span class="p">);</span>
  <span class="n">printf</span><span class="p">(</span><span class="s">"Hello, %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">str</span><span class="p">);</span>
  <span class="n">Py_RETURN_NONE</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="n">PyMethodDef</span> <span class="n">debugging_demo3_methods</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
  <span class="p">{</span><span class="s">"say_hello_and_return_none"</span><span class="p">,</span> <span class="n">say_hello_and_return_none</span><span class="p">,</span> <span class="n">METH_VARARGS</span><span class="p">,</span>
   <span class="s">"Says hello and returns none."</span><span class="p">},</span>
  <span class="p">{</span><span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">}</span>  <span class="cm">/* Sentinel */</span>
<span class="p">};</span>

<span class="k">static</span> <span class="k">struct</span> <span class="n">PyModuleDef</span> <span class="n">debugging_demo3_module</span> <span class="o">=</span> <span class="p">{</span>
  <span class="n">PyModuleDef_HEAD_INIT</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_name</span> <span class="o">=</span> <span class="s">"debugging_demo3"</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_doc</span> <span class="o">=</span> <span class="s">"A simple extension to showcase debugging"</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_size</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_methods</span> <span class="o">=</span> <span class="n">debugging_demo3_methods</span>
<span class="p">};</span>

<span class="n">PyMODINIT_FUNC</span> <span class="nf">PyInit_debugging_demo3</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">return</span> <span class="n">PyModuleDef_Init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">debugging_demo3_module</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We need to build this extension just as we have done before, this time using <code class="language-plaintext highlighter-rouge">gcc -g3 -Wall -Werror -std=c17 -shared -fPIC -I/usr/local/include/python3.10d debugging_demo3.c -o debugging_demo3.so</code>.</p>

<p>If you look closely at the source code above we have introduced <a href="https://docs.python.org/3/c-api/arg.html">PyArg_ParseTuple</a>, which handles unpacking function arguments into local variables. Our function takes 1 and only 1 argument in its current form; attempting to call it with anything else will set the global Python error indicator, hit the <code class="language-plaintext highlighter-rouge">return NULL;</code> statement, and propagate the error back to the Python runtime. That’s a lot of power packed into a few lines of code.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ba83cd50f6ec:/data# python3
Python 3.10.10+ <span class="o">(</span>heads/3.10:bac3fe7, Feb 22 2023, 05:56:35<span class="o">)</span> <span class="o">[</span>GCC 11.3.0] on linux
Type <span class="s2">"help"</span>, <span class="s2">"copyright"</span>, <span class="s2">"credits"</span> or <span class="s2">"license"</span> <span class="k">for </span>more information.
<span class="o">&gt;&gt;&gt;</span> import debugging_demo3
<span class="o">&gt;&gt;&gt;</span> debugging_demo3.say_hello_and_return_none<span class="o">(</span><span class="s2">"Will"</span><span class="o">)</span>
Hello, Will
<span class="o">&gt;&gt;&gt;</span> debugging_demo3.say_hello_and_return_none<span class="o">()</span>
Traceback <span class="o">(</span>most recent call last<span class="o">)</span>:
  File <span class="s2">"&lt;stdin&gt;"</span>, line 1, <span class="k">in</span> &lt;module&gt;
TypeError: <span class="k">function </span>takes exactly 1 argument <span class="o">(</span>0 given<span class="o">)</span>
<span class="o">&gt;&gt;&gt;</span> debugging_demo3.say_hello_and_return_none<span class="o">(</span><span class="s2">"Will"</span>, <span class="s2">"Ayd"</span><span class="o">)</span>
Traceback <span class="o">(</span>most recent call last<span class="o">)</span>:
  File <span class="s2">"&lt;stdin&gt;"</span>, line 1, <span class="k">in</span> &lt;module&gt;
TypeError: <span class="k">function </span>takes exactly 1 argument <span class="o">(</span>2 given<span class="o">)</span>
</code></pre></div></div>

<p>Things work great until you pass in something that is not a Unicode (<code class="language-plaintext highlighter-rouge">str</code>) object.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> debugging_demo3.say_hello_and_return_none<span class="o">(</span>555<span class="o">)</span>
Hello, <span class="o">(</span>null<span class="o">)</span>
Fatal Python error: _Py_CheckFunctionResult: a <span class="k">function </span>returned a result with an exception <span class="nb">set
</span>Python runtime state: initialized
TypeError: bad argument <span class="nb">type </span><span class="k">for </span>built-in operation

The above exception was the direct cause of the following exception:

SystemError: &lt;built-in <span class="k">function </span>say_hello_and_return_none&gt; returned a result with an exception <span class="nb">set

</span>Current thread 0x00007f5dd4cbb740 <span class="o">(</span>most recent call first<span class="o">)</span>:
  File <span class="s2">"&lt;stdin&gt;"</span>, line 1 <span class="k">in</span> &lt;module&gt;

Extension modules: debugging_demo3 <span class="o">(</span>total: 1<span class="o">)</span>
Aborted <span class="o">(</span>core dumped<span class="o">)</span>
</code></pre></div></div>

<p>This time the program aborted instead of having a segmentation fault. That said, gdb will still allow you to jump in and inspect the state of things prior to termination.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ba83cd50f6ec:/data# gdb <span class="nt">--args</span> python3 <span class="nt">-c</span> <span class="s2">"import debugging_demo3; debugging_demo3.say_hello_and_return_none(555)"</span>
Reading symbols from python3...
<span class="o">(</span>gdb<span class="o">)</span> run
Starting program: /usr/local/bin/python3 <span class="nt">-c</span> import<span class="se">\ </span>debugging_demo3<span class="se">\;\ </span>debugging_demo3.say_hello_and_return_none<span class="se">\(</span>555<span class="se">\)</span>
warning: Error disabling address space randomization: Operation not permitted
<span class="o">[</span>Thread debugging using libthread_db enabled]
Using host libthread_db library <span class="s2">"/lib/x86_64-linux-gnu/libthread_db.so.1"</span><span class="nb">.</span>
Hello, <span class="o">(</span>null<span class="o">)</span>
Fatal Python error: _Py_CheckFunctionResult: a <span class="k">function </span>returned a result with an exception <span class="nb">set
</span>Python runtime state: initialized
TypeError: bad argument <span class="nb">type </span><span class="k">for </span>built-in operation

The above exception was the direct cause of the following exception:

SystemError: &lt;built-in <span class="k">function </span>say_hello_and_return_none&gt; returned a result with an exception <span class="nb">set

</span>Current thread 0x00007f9b27e91740 <span class="o">(</span>most recent call first<span class="o">)</span>:
  File <span class="s2">"&lt;string&gt;"</span>, line 1 <span class="k">in</span> &lt;module&gt;

Extension modules: debugging_demo3 <span class="o">(</span>total: 1<span class="o">)</span>

Program received signal SIGABRT, Aborted.
0x00007f9b27f2aa7c <span class="k">in </span>pthread_kill <span class="o">()</span> from /lib/x86_64-linux-gnu/libc.so.6
</code></pre></div></div>

<p>When you look at the backtrace here, you won’t see any of our user code:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> bt
<span class="c">#0  0x00007f9b27f2aa7c in pthread_kill () from /lib/x86_64-linux-gnu/libc.so.6</span>
<span class="c">#1  0x00007f9b27ed6476 in raise () from /lib/x86_64-linux-gnu/libc.so.6</span>
<span class="c">#2  0x00007f9b27ebc7f3 in abort () from /lib/x86_64-linux-gnu/libc.so.6</span>
<span class="c">#3  0x0000555c45a8505b in fatal_error_exit (status=&lt;optimized out&gt;) at Python/pylifecycle.c:2553</span>
<span class="c">#4  0x0000555c45a895c7 in fatal_error (fd=2, header=header@entry=1,</span>
    <span class="nv">prefix</span><span class="o">=</span>prefix@entry<span class="o">=</span>0x555c45c08a60 &lt;__func__.18&gt; <span class="s2">"_Py_CheckFunctionResult"</span>,
    <span class="nv">msg</span><span class="o">=</span>msg@entry<span class="o">=</span>0x555c45c08528 <span class="s2">"a function returned a result with an exception set"</span>, <span class="nv">status</span><span class="o">=</span>status@entry<span class="o">=</span><span class="nt">-1</span><span class="o">)</span>
    at Python/pylifecycle.c:2734
<span class="c">#5  0x0000555c45a89630 in _Py_FatalErrorFunc (func=func@entry=0x555c45c08a60 &lt;__func__.18&gt; "_Py_CheckFunctionResult",</span>
</code></pre></div></div>

<p>This is a bit unfortunate because we can’t directly trace back to our function. With that said, the message <code class="language-plaintext highlighter-rouge">a function returned a result with an exception set</code> clues us in on where we need to look. CPython keeps a single error indicator (maintained per thread), which you can query via <a href="https://docs.python.org/3/c-api/exceptions.html">PyErr_Occurred()</a>.</p>
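
<p>The rule behind that message is that a C function called by the interpreter should return <code class="language-plaintext highlighter-rouge">NULL</code> exactly when the error indicator is set. Here is a minimal sketch of that convention, using a hypothetical <code class="language-plaintext highlighter-rouge">is_positive</code> function that is not part of our demo modules. <code class="language-plaintext highlighter-rouge">PyLong_AsLong</code> returns -1 on failure, which is also a perfectly valid result, so <code class="language-plaintext highlighter-rouge">PyErr_Occurred()</code> is used to tell the two cases apart.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define PY_SSIZE_T_CLEAN
#include &lt;Python.h&gt;

/* Hypothetical example function, not part of debugging_demo3. */
static PyObject *
is_positive (PyObject *self, PyObject *args)
{
  PyObject *obj;
  if (!PyArg_ParseTuple(args, "O", &amp;obj)) {
    return NULL;  /* error indicator already set by PyArg_ParseTuple */
  }

  long value = PyLong_AsLong(obj);
  if (value == -1 &amp;&amp; PyErr_Occurred()) {
    return NULL;  /* conversion failed: propagate the error, do not continue */
  }

  /* Success path: return a new reference and leave the indicator unset. */
  return PyBool_FromLong(value &gt; 0);
}
</code></pre></div></div>

<p>Registered in a method table like the ones above, calling this with something that is not an integer should raise an ordinary <code class="language-plaintext highlighter-rouge">TypeError</code> in Python rather than aborting the interpreter.</p>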

<p>To find out where the error indicator gets set, let’s set a <code class="language-plaintext highlighter-rouge">break say_hello_and_return_none</code> to pause execution when we enter our function. Then <code class="language-plaintext highlighter-rouge">run</code> to get to that point and add a <code class="language-plaintext highlighter-rouge">watch PyErr_Occurred()</code> to the mix.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> <span class="nb">break </span>say_hello_and_return_none
Breakpoint 1 at 0x7f0a8fbf5200: file debugging_demo3.c, line 8.
<span class="o">(</span>gdb<span class="o">)</span> run
The program being debugged has been started already.
Start it from the beginning? <span class="o">(</span>y or n<span class="o">)</span> y
Starting program: /usr/local/bin/python3 <span class="nt">-c</span> import<span class="se">\ </span>debugging_demo3<span class="se">\;\ </span>debugging_demo3.say_hello_and_return_none<span class="se">\(</span>555<span class="se">\)</span>
warning: Error disabling address space randomization: Operation not permitted
<span class="o">[</span>Thread debugging using libthread_db enabled]
Using host libthread_db library <span class="s2">"/lib/x86_64-linux-gnu/libthread_db.so.1"</span><span class="nb">.</span>

Breakpoint 1, say_hello_and_return_none <span class="o">(</span><span class="nv">self</span><span class="o">=</span>0x7f58e60305f0, <span class="nv">args</span><span class="o">=</span>0x7f58e5fe98b0<span class="o">)</span> at debugging_demo3.c:8
8    <span class="o">{</span>
<span class="o">(</span>gdb<span class="o">)</span> watch PyErr_Occurred<span class="o">()</span>
Watchpoint 2: PyErr_Occurred<span class="o">()</span>
</code></pre></div></div>

<p>At this point <code class="language-plaintext highlighter-rouge">info break</code> should show us the two conditions under which gdb will pause, either on <code class="language-plaintext highlighter-rouge">say_hello_and_return_none</code> entry or when the <code class="language-plaintext highlighter-rouge">PyErr_Occurred()</code> value changes.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> info <span class="nb">break
</span>Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00007f58e5f87200 <span class="k">in </span>say_hello_and_return_none at debugging_demo3.c:8
     breakpoint already hit 1 <span class="nb">time
</span>2       watchpoint     keep y                      PyErr_Occurred<span class="o">()</span>
</code></pre></div></div>

<p>Type <code class="language-plaintext highlighter-rouge">c</code> to continue along and you will see that the watchpoint gets hit:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> c
Continuing.

Watchpoint 2: PyErr_Occurred<span class="o">()</span>

Old value <span class="o">=</span> <span class="o">(</span>PyObject <span class="k">*</span><span class="o">)</span> 0x0
New value <span class="o">=</span> <span class="o">(</span>PyObject <span class="k">*</span><span class="o">)</span> 0x55ccfb73cc20 &lt;_PyExc_TypeError&gt;
_PyErr_Restore <span class="o">(</span><span class="nv">tstate</span><span class="o">=</span>tstate@entry<span class="o">=</span>0x55ccfcb82930, <span class="nb">type</span><span class="o">=</span><span class="nb">type</span>@entry<span class="o">=</span>0x55ccfb73cc20 &lt;_PyExc_TypeError&gt;, <span class="nv">value</span><span class="o">=</span>value@entry<span class="o">=</span>0x7f58e6057640, <span class="nv">traceback</span><span class="o">=</span>0x0<span class="o">)</span> at Python/errors.c:60
60       tstate-&gt;curexc_value <span class="o">=</span> value<span class="p">;</span>
</code></pre></div></div>

<p>The watchpoint wasn’t hit within our code, but internal to CPython. No matter - we can inspect the backtrace and see which point in our code base led to it.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> bt
<span class="c">#0  _PyErr_Restore (tstate=tstate@entry=0x55ccfcb82930, type=type@entry=0x55ccfb73cc20 &lt;_PyExc_TypeError&gt;,</span>
    <span class="nv">value</span><span class="o">=</span>value@entry<span class="o">=</span>0x7f58e6057640, <span class="nv">traceback</span><span class="o">=</span>0x0<span class="o">)</span> at Python/errors.c:60
<span class="c">#1  0x000055ccfb455d60 in _PyErr_SetObject (tstate=tstate@entry=0x55ccfcb82930,</span>
    <span class="nv">exception</span><span class="o">=</span>exception@entry<span class="o">=</span>0x55ccfb73cc20 &lt;_PyExc_TypeError&gt;, <span class="nv">value</span><span class="o">=</span>value@entry<span class="o">=</span>0x7f58e6057640<span class="o">)</span> at Python/errors.c:189
<span class="c">#2  0x000055ccfb455f59 in _PyErr_SetString (tstate=0x55ccfcb82930, exception=0x55ccfb73cc20 &lt;_PyExc_TypeError&gt;,</span>
    <span class="nv">string</span><span class="o">=</span>string@entry<span class="o">=</span>0x55ccfb645698 <span class="s2">"bad argument type for built-in operation"</span><span class="o">)</span> at Python/errors.c:235
<span class="c">#3  0x000055ccfb455fdd in PyErr_BadArgument () at Python/errors.c:667</span>
<span class="c">#4  0x000055ccfb402060 in PyUnicode_AsUTF8AndSize (unicode=&lt;optimized out&gt;, psize=psize@entry=0x0)</span>
    at Objects/unicodeobject.c:4245
<span class="c">#5  0x000055ccfb402195 in PyUnicode_AsUTF8 (unicode=&lt;optimized out&gt;) at Objects/unicodeobject.c:4265</span>
<span class="c">#6  0x00007f58e5f87245 in say_hello_and_return_none (self=0x7f58e60305f0, args=0x7f58e5fe98b0) at debugging_demo3.c:14</span>
</code></pre></div></div>

<p>Frame 6 is <code class="language-plaintext highlighter-rouge">say_hello_and_return_none</code>, specifically on line 14. You can jump back to that and see the line being called.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> f 6
<span class="c">#6  0x00007f58e5f87245 in say_hello_and_return_none (self=0x7f58e60305f0, args=0x7f58e5fe98b0) at debugging_demo3.c:14</span>
14     const char <span class="k">*</span>str <span class="o">=</span> PyUnicode_AsUTF8<span class="o">(</span>name<span class="o">)</span><span class="p">;</span>
</code></pre></div></div>

<p>We know from our function invocation that we passed the value <code class="language-plaintext highlighter-rouge">555</code> as an argument to the function call. However, you wouldn’t know this by trying to print the object in gdb.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> p name
<span class="nv">$1</span> <span class="o">=</span> <span class="o">(</span>PyObject <span class="k">*</span><span class="o">)</span> 0x7f58e6013bc0
<span class="o">(</span>gdb<span class="o">)</span> p <span class="k">*</span>name
<span class="nv">$2</span> <span class="o">=</span> <span class="o">{</span>ob_refcnt <span class="o">=</span> 4, ob_type <span class="o">=</span> 0x55ccfb73f180 &lt;PyLong_Type&gt;<span class="o">}</span>
</code></pre></div></div>

<p>We get <em>some</em> information when dereferencing this object about the basic <code class="language-plaintext highlighter-rouge">PyObject</code> struct members. But we’d have to muck around a bit more to see the members that are relevant to longs, or whatever object type it is we inspect.</p>

<p>This is where the gdb extension becomes a really powerful abstraction tool. First, we need to load the extension into gdb. This can be done at runtime with the <code class="language-plaintext highlighter-rouge">source</code> command pointing to the extension file. In our docker image, this would mean</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> <span class="nb">source</span> /clones/cpython/Tools/gdb/libpython.py
</code></pre></div></div>

<p>Once you have loaded the extension, the default printing mechanism becomes a lot more familiar to Python users.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> p name
<span class="nv">$3</span> <span class="o">=</span> 555
</code></pre></div></div>

<p>This confirms that the object we have on this line is the same one we provided to the function call, so nothing way out of the ordinary is going on. Since the error indicator reported by <code class="language-plaintext highlighter-rouge">PyErr_Occurred()</code> was set, we can use <code class="language-plaintext highlighter-rouge">PyErr_PrintEx()</code>, the function behind <code class="language-plaintext highlighter-rouge">PyErr_Print()</code>, to get information from the Python runtime about what went wrong. Note that calling this clears the error indicator.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> call PyErr_PrintEx<span class="o">(</span>0<span class="o">)</span>
TypeError
<span class="o">(</span>gdb<span class="o">)</span> p PyErr_Occurred<span class="o">()</span>
<span class="nv">$4</span> <span class="o">=</span> 0x0
</code></pre></div></div>

<p>We called <code class="language-plaintext highlighter-rouge">PyUnicode_AsUTF8</code> with a <code class="language-plaintext highlighter-rouge">PyLong</code> object even though it expected <code class="language-plaintext highlighter-rouge">PyUnicode</code>. In the Python runtime this would automatically trigger an exception and stop things immediately. C doesn’t have built-in exception handling, so execution continues unless you explicitly check for the failure.</p>
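<p>If you want to see the Python-side behaviour without recompiling anything, the same C API call can be made through <code class="language-plaintext highlighter-rouge">ctypes.pythonapi</code>. This is only a sketch: the <code class="language-plaintext highlighter-rouge">argtypes</code> / <code class="language-plaintext highlighter-rouge">restype</code> declarations and the <code class="language-plaintext highlighter-rouge">try_as_utf8</code> helper are my own assumptions for illustration and are not part of the extension. The point is that <code class="language-plaintext highlighter-rouge">ctypes.pythonapi</code> checks the error indicator after each call and raises the pending exception, which is exactly the check our C function skipped.</p>

```python
import ctypes

# PyUnicode_AsUTF8 is the same C API function called in debugging_demo3.c.
# The signature below is declared by this sketch, not by the extension.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_char_p


def try_as_utf8(obj):
    """Return the UTF-8 bytes for obj, or the name of the exception raised."""
    try:
        return as_utf8(obj)
    except Exception as exc:
        # ctypes.pythonapi found the error indicator set and raised it for us
        return type(exc).__name__


print(try_as_utf8("world"))
print(try_as_utf8(555))
```

<p>Passing a <code class="language-plaintext highlighter-rouge">str</code> gives back its bytes, while passing <code class="language-plaintext highlighter-rouge">555</code> surfaces as an ordinary Python exception rather than a crash.</p>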

<p>Following the pattern of <a href="https://docs.python.org/3/c-api/exceptions.html">CPython Exception Handling</a>, we are going to slightly modify our source code to look like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PY_SSIZE_T_CLEAN
#include</span> <span class="cpf">&lt;Python.h&gt;</span><span class="cp">
</span>
<span class="cp">#define NUM_WORDS 4
</span>
<span class="k">static</span> <span class="n">PyObject</span> <span class="o">*</span>
<span class="nf">say_hello_and_return_none</span> <span class="p">(</span><span class="n">PyObject</span> <span class="o">*</span><span class="n">self</span><span class="p">,</span> <span class="n">PyObject</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">PyObject</span> <span class="o">*</span><span class="n">name</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">PyArg_ParseTuple</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="s">"O"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">name</span><span class="p">))</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">str</span> <span class="o">=</span> <span class="n">PyUnicode_AsUTF8</span><span class="p">(</span><span class="n">name</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">str</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="n">printf</span><span class="p">(</span><span class="s">"Hello, %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">str</span><span class="p">);</span>
  <span class="n">Py_RETURN_NONE</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="n">PyMethodDef</span> <span class="n">debugging_demo3_methods</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
  <span class="p">{</span><span class="s">"say_hello_and_return_none"</span><span class="p">,</span> <span class="n">say_hello_and_return_none</span><span class="p">,</span> <span class="n">METH_VARARGS</span><span class="p">,</span>
   <span class="s">"Says hello and returns none."</span><span class="p">},</span>
  <span class="p">{</span><span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">}</span>  <span class="cm">/* Sentinel */</span>
<span class="p">};</span>

<span class="k">static</span> <span class="k">struct</span> <span class="n">PyModuleDef</span> <span class="n">debugging_demo3_module</span> <span class="o">=</span> <span class="p">{</span>
  <span class="n">PyModuleDef_HEAD_INIT</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_name</span> <span class="o">=</span> <span class="s">"debugging_demo3"</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_doc</span> <span class="o">=</span> <span class="s">"A simple extension to showcase debugging"</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_size</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
  <span class="p">.</span><span class="n">m_methods</span> <span class="o">=</span> <span class="n">debugging_demo3_methods</span>
<span class="p">};</span>

<span class="n">PyMODINIT_FUNC</span> <span class="nf">PyInit_debugging_demo3</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">return</span> <span class="n">PyModuleDef_Init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">debugging_demo3_module</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">if (str == NULL)</code> is our way of handling a failed <code class="language-plaintext highlighter-rouge">PyUnicode_AsUTF8</code> call. By propagating that <code class="language-plaintext highlighter-rouge">NULL</code> value up the call stack, CPython will gracefully handle the error for us when we get back to the Python runtime. To confirm, recompile and try passing the same argument to the function.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>gdb<span class="o">)</span> <span class="nb">exit
</span>A debugging session is active.

     Inferior 1 <span class="o">[</span>process 515] will be killed.

Quit anyway? <span class="o">(</span>y or n<span class="o">)</span> y
root@ba83cd50f6ec:/data# gcc <span class="nt">-g3</span> <span class="nt">-Wall</span> <span class="nt">-Werror</span> <span class="nt">-std</span><span class="o">=</span>c17 <span class="nt">-shared</span> <span class="nt">-fPIC</span> <span class="nt">-I</span>/usr/local/include/python3.10d debugging_demo3.c <span class="nt">-o</span> debugging_demo3.so
root@ba83cd50f6ec:/data# python3 <span class="nt">-c</span> <span class="s2">"import debugging_demo3; debugging_demo3.say_hello_and_return_none(555)"</span>
Traceback <span class="o">(</span>most recent call last<span class="o">)</span>:
  File <span class="s2">"&lt;string&gt;"</span>, line 1, <span class="k">in</span> &lt;module&gt;
TypeError: bad argument <span class="nb">type </span><span class="k">for </span>built-in operation
</code></pre></div></div>

<p>We still have an error, but the error is the built-in <code class="language-plaintext highlighter-rouge">TypeError</code> that we can handle in our Python code if we wanted, instead of the <code class="language-plaintext highlighter-rouge">SIGABRT</code> that shut down the application previously.</p>
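<p>To make the difference between the two outcomes concrete, here is a rough sketch that compares how a process ends in each case. It assumes a POSIX system, and it uses <code class="language-plaintext highlighter-rouge">os.abort()</code> as a stand-in for the fatal error from earlier rather than loading the extension itself; the <code class="language-plaintext highlighter-rouge">exit_status</code> helper is hypothetical.</p>

```python
import signal
import subprocess
import sys


def exit_status(code):
    """Run code in a fresh interpreter and return the process exit status."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# An uncaught exception ends the interpreter in an orderly way (exit status 1).
uncaught = exit_status("raise TypeError('bad argument type')")

# os.abort() stands in for the fatal error: the process dies from SIGABRT,
# which subprocess reports on POSIX as the negative signal number.
aborted = exit_status("import os; os.abort()")

print(uncaught, aborted == -signal.SIGABRT)
```

<p>The first case is one that a <code class="language-plaintext highlighter-rouge">try</code> / <code class="language-plaintext highlighter-rouge">except TypeError</code> in the calling code could have intercepted; the second gives Python code no chance to react.</p>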

<p>While not in scope for this article, there are many ways you can improve the above function. You could change the <a href="https://docs.python.org/3/c-api/arg.html#strings-and-buffers">format string</a> provided to <code class="language-plaintext highlighter-rouge">PyArg_ParseTuple</code> to map to something other than a <code class="language-plaintext highlighter-rouge">PyObject *</code>, or alternatively add a call to <code class="language-plaintext highlighter-rouge">PyObject_Str</code> to coerce any object to a unicode object prior to the <code class="language-plaintext highlighter-rouge">PyUnicode_AsUTF8</code> call.</p>
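<p>The coercion idea can be tried out from Python before touching the C source. The sketch below chains the same two C API calls through <code class="language-plaintext highlighter-rouge">ctypes.pythonapi</code>; the signature declarations and the <code class="language-plaintext highlighter-rouge">coerce_to_utf8</code> name are assumptions made for this example only.</p>

```python
import ctypes

# Both functions are part of the CPython C API; the ctypes signatures below
# are this sketch's own declarations.
object_str = ctypes.pythonapi.PyObject_Str
object_str.argtypes = [ctypes.py_object]
object_str.restype = ctypes.py_object

as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_char_p


def coerce_to_utf8(obj):
    """Mirror PyObject_Str followed by PyUnicode_AsUTF8."""
    text = object_str(obj)  # same result as str(obj)
    return as_utf8(text)    # works because text is always a str


print(coerce_to_utf8(555))
print(coerce_to_utf8("world"))
```

<p>In the C version you would still need to check the result of <code class="language-plaintext highlighter-rouge">PyObject_Str</code> for <code class="language-plaintext highlighter-rouge">NULL</code> and release the new reference once you are done with it; ctypes takes care of both here.</p>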

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>Understanding how C and Python interacted was something I struggled with for years. Once unlocked, I found knowledge of how to interact at the lower levels using gdb to be invaluable. I can only hope that this article lays a good foundation for you to build upon.</p>

<p>The only other advice I can offer is to be patient! I’ve been at this for years and still find myself learning something new every day. Therein lies the true art of programming.</p>

<p>My next article will focus on using the <a href="https://cython.readthedocs.io/en/latest/src/userguide/debugging.html">Cython debugger</a>, which is implemented as a gdb extension. The knowledge in this article is a hugely important stepping stone towards that. If you can understand how to control and debug all of these components, you are in a very good spot when it comes to Python development.</p>]]></content><author><name>Will Ayd</name></author><category term="debugging" /><category term="python" /><category term="c" /><summary type="html"><![CDATA[This blog post teaches you how to debug C extensions to Python. It is part 2 of a 3 part series.]]></summary></entry><entry><title type="html">Fundamental Python Debugging Part 1 - Python</title><link href="https://willayd.com/fundamental-python-debugging-part-1-python.html" rel="alternate" type="text/html" title="Fundamental Python Debugging Part 1 - Python" /><published>2023-02-08T00:00:00+00:00</published><updated>2023-02-08T00:00:00+00:00</updated><id>https://willayd.com/fundamental-python-debugging-part-1-python</id><content type="html" xml:base="https://willayd.com/fundamental-python-debugging-part-1-python.html"><![CDATA[<p>The topic of debugging Python is well-covered. Regardless of whether you want to use your IDE interactively or work from a console with <a href="https://docs.python.org/3/library/pdb.html">pdb</a>, chances are this is not the first article you have read on the topic.</p>

<p>In spite of the wealth of content, I’ve found that most articles on debugging Python are singularly focused on debugging Python. That may not seem like such a bad thing at face value, but developing Python at an advanced level requires not only knowledge of the language itself, but also of lower level languages like C/C++. Being an expert in all of these languages at one time is near impossible, so knowing how to debug them effectively is critical.</p>

<p>Luckily, when viewed through the proper lens, there is a lot of overlap in the debugging tooling for these languages. The built-in Python pdb debugger borrows much of its utility from <a href="https://sourceware.org/gdb/">gdb</a>,  which will help you debug C/C++/Rust/Fortran, etc… gdb itself is extendable <a href="https://sourceware.org/gdb/onlinedocs/gdb/Python.html#Python">using Python</a>, and this extensibility is the reason why things like the <a href="https://cython.readthedocs.io/en/latest/src/userguide/debugging.html">Cython debugger</a> exist.</p>

<p>Few if any other articles on debugging Python applications touch on these synchronicities. This and my next few blog posts attempt to highlight this for you and help you seamlessly transition across the aforementioned tools.</p>

<h2 id="setting-up-your-example">Setting up your example</h2>

<p>Let’s start with a buggy script. This code isn’t pythonic and you may be able to troubleshoot without even using a debugger, but that isn’t important for this exercise. Go ahead and save the below snippet as <code class="language-plaintext highlighter-rouge">buggy_program.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">buggy_loop</span><span class="p">():</span>
    <span class="n">animals</span> <span class="o">=</span> <span class="p">[</span><span class="s">"dog"</span><span class="p">,</span> <span class="s">"cat"</span><span class="p">,</span> <span class="s">"turtle"</span><span class="p">]</span>
    <span class="n">index</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="k">while</span> <span class="n">index</span> <span class="o">&lt;=</span> <span class="nb">len</span><span class="p">(</span><span class="n">animals</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"The animal at index </span><span class="si">{</span><span class="n">index</span><span class="si">}</span><span class="s"> is </span><span class="si">{</span><span class="n">animals</span><span class="p">[</span><span class="n">index</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">buggy_loop</span><span class="p">()</span>
</code></pre></div></div>

<p>Executing this program with <code class="language-plaintext highlighter-rouge">python buggy_program.py</code> should yield:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The animal at index 0 is dog
The animal at index 1 is <span class="nb">cat
</span>The animal at index 2 is turtle
Traceback <span class="o">(</span>most recent call last<span class="o">)</span>:
  File <span class="s2">"buggy_program.py"</span>, line 10, <span class="k">in</span> &lt;module&gt;
    buggy_loop<span class="o">()</span>
  File <span class="s2">"buggy_program.py"</span>, line 6, <span class="k">in </span>buggy_loop
    print<span class="o">(</span>f<span class="s2">"The animal at index {index} is {animals[index]}"</span><span class="o">)</span>
IndexError: list index out of range
</code></pre></div></div>

<h2 id="part-1-debugging-exceptions">Part 1: Debugging exceptions</h2>

<p>Changing our command from <code class="language-plaintext highlighter-rouge">python buggy_program.py</code> to <code class="language-plaintext highlighter-rouge">python -m pdb buggy_program.py</code> will launch pdb and load the script. pdb will not immediately execute anything, but instead wait for your input. We assume we don’t know any commands yet, so typing <code class="language-plaintext highlighter-rouge">help</code> is the best thing for us to start with.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> /home/willayd/buggy_program.py<span class="o">(</span>1<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
-&gt; def buggy_loop<span class="o">()</span>:
<span class="o">(</span>Pdb<span class="o">)</span> <span class="nb">help

</span>Documented commands <span class="o">(</span><span class="nb">type help</span> &lt;topic&gt;<span class="o">)</span>:
<span class="o">========================================</span>
EOF    c          d        h         list      q        rv       undisplay
a      cl         debug    <span class="nb">help      </span>ll        quit     s        unt
<span class="nb">alias  </span>clear      disable  ignore    longlist  r        <span class="nb">source   </span><span class="k">until
</span>args   commands   display  interact  n         restart  step     up
b      condition  down     j         next      <span class="k">return   </span>tbreak   w
<span class="nb">break  </span>cont       <span class="nb">enable   </span>jump      p         retval   u        whatis
bt     <span class="k">continue   </span><span class="nb">exit     </span>l         pp        run      <span class="nb">unalias  </span>where

Miscellaneous <span class="nb">help </span>topics:
<span class="o">==========================</span>
<span class="nb">exec  </span>pdb
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">help &lt;topic&gt;</code> allows you to navigate any of the items listed above. We can even input <code class="language-plaintext highlighter-rouge">help help</code> as a meta-command.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> <span class="nb">help help
</span>h<span class="o">(</span>elp<span class="o">)</span>
        Without argument, print the list of available commands.
        With a <span class="nb">command </span>name as argument, print <span class="nb">help </span>about that command.
        <span class="s2">"help pdb"</span> shows the full pdb documentation.
        <span class="s2">"help exec"</span> gives <span class="nb">help </span>on the <span class="o">!</span> command.
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">help</code> we have input so far is a pdb command and not the built-in <code class="language-plaintext highlighter-rouge">help</code> function that Python provides. If you wanted to execute the latter, you should prefix your input with <code class="language-plaintext highlighter-rouge">!</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> <span class="o">!</span><span class="nb">help</span><span class="o">()</span>

Welcome to Python 3.8<span class="s1">'s help utility!

If this is your first time using Python, you should definitely check out
the tutorial on the Internet at https://docs.python.org/3.8/tutorial/.

Enter the name of any module, keyword, or topic to get help on writing
Python programs and using Python modules.  To quit this help utility and
return to the interpreter, just type "quit".

To get a list of available modules, keywords, symbols, or topics, type
"modules", "keywords", "symbols", or "topics".  Each module also comes
with a one-line summary of what it does; to list the modules whose name
or summary contain a given string such as "spam", type "modules spam".
</span></code></pre></div></div>

<p>If you executed the above <code class="language-plaintext highlighter-rouge">!help()</code> command be sure to input q and hit enter to quit the Python interactive help.</p>

<p>To actually get code executing we want to <code class="language-plaintext highlighter-rouge">continue</code>. <code class="language-plaintext highlighter-rouge">help c</code> (or equivalently <code class="language-plaintext highlighter-rouge">help continue</code>) shows us more about this command.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> <span class="nb">help </span>c
c<span class="o">(</span>ont<span class="o">(</span>inue<span class="o">))</span>
        Continue execution, only stop when a breakpoint is encountered.
</code></pre></div></div>

<p>So <code class="language-plaintext highlighter-rouge">c</code>, <code class="language-plaintext highlighter-rouge">cont</code>, and <code class="language-plaintext highlighter-rouge">continue</code> would all do the same thing for us. For now input <code class="language-plaintext highlighter-rouge">c</code> and press enter:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> c
The animal at index 0 is dog
The animal at index 1 is <span class="nb">cat
</span>The animal at index 2 is turtle
Traceback <span class="o">(</span>most recent call last<span class="o">)</span>:
  File <span class="s2">"/usr/lib/python3.10/pdb.py"</span>, line 1726, <span class="k">in </span>main
    pdb._runscript<span class="o">(</span>mainpyfile<span class="o">)</span>
  File <span class="s2">"/usr/lib/python3.10/pdb.py"</span>, line 1586, <span class="k">in </span>_runscript
    self.run<span class="o">(</span>statement<span class="o">)</span>
  File <span class="s2">"/usr/lib/python3.10/bdb.py"</span>, line 597, <span class="k">in </span>run
    <span class="nb">exec</span><span class="o">(</span>cmd, globals, locals<span class="o">)</span>
  File <span class="s2">"&lt;string&gt;"</span>, line 1, <span class="k">in</span> &lt;module&gt;
  File <span class="s2">"/home/willayd/buggy_program.py"</span>, line 10, <span class="k">in</span> &lt;module&gt;
    buggy_loop<span class="o">()</span>
  File <span class="s2">"/home/willayd/buggy_program.py"</span>, line 6, <span class="k">in </span>buggy_loop
    print<span class="o">(</span>f<span class="s2">"The animal at index {index} is {animals[index]}"</span><span class="o">)</span>
IndexError: list index out of range
Uncaught exception. Entering post mortem debugging
Running <span class="s1">'cont'</span> or <span class="s1">'step'</span> will restart the program
<span class="o">&gt;</span> /home/willayd/buggy_program.py<span class="o">(</span>6<span class="o">)</span>buggy_loop<span class="o">()</span>
-&gt; print<span class="o">(</span>f<span class="s2">"The animal at index {index} is {animals[index]}"</span><span class="o">)</span>
<span class="o">(</span>Pdb<span class="o">)</span>
</code></pre></div></div>

<p>The program has executed and printed the same traceback we saw without using pdb. However, since we are running our script under pdb, execution halts after the error occurs and allows us to inspect the state of the program.</p>
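<p>It can help to see roughly what “the state of the program” means here. The sketch below, which is my own illustration and not how pdb is implemented, walks an exception’s traceback to the frame where it was raised and reads that frame’s local variables; <code class="language-plaintext highlighter-rouge">innermost_frame_state</code> is a hypothetical helper name.</p>

```python
def buggy_loop():
    animals = ["dog", "cat", "turtle"]
    index = 0

    while index <= len(animals):
        print(f"The animal at index {index} is {animals[index]}")
        index += 1


def innermost_frame_state(exc):
    """Return (line number, locals) for the frame that raised exc."""
    tb = exc.__traceback__
    while tb.tb_next is not None:  # walk from the outermost frame inwards
        tb = tb.tb_next
    return tb.tb_lineno, dict(tb.tb_frame.f_locals)


try:
    buggy_loop()
except IndexError as exc:
    lineno, frame_locals = innermost_frame_state(exc)
    print(lineno, frame_locals)
    # pdb.post_mortem(exc.__traceback__) would open an interactive prompt here
```

<p>The frame is still holding <code class="language-plaintext highlighter-rouge">animals</code> and <code class="language-plaintext highlighter-rouge">index</code> exactly as they were when the exception was raised, which is what pdb lets us poke at interactively.</p>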

<p><code class="language-plaintext highlighter-rouge">l</code> (short for list) shows us where the execution halted (see -&gt; below) and a few lines around that.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> l
  1          def buggy_loop<span class="o">()</span>:
  2              animals <span class="o">=</span> <span class="o">[</span><span class="s2">"dog"</span>, <span class="s2">"cat"</span>, <span class="s2">"turtle"</span><span class="o">]</span>
  3              index <span class="o">=</span> 0
  4
  5              <span class="k">while </span>index &lt;<span class="o">=</span> len<span class="o">(</span>animals<span class="o">)</span>:
  6  -&gt;              print<span class="o">(</span>f<span class="s2">"The animal at index {index} is {animals[index]}"</span><span class="o">)</span>
  7                  index +<span class="o">=</span> 1
  8
  9          <span class="k">if </span>__name__ <span class="o">==</span> <span class="s2">"__main__"</span>:
 10              buggy_loop<span class="o">()</span>
<span class="o">[</span>EOF]
</code></pre></div></div>

<p>Typing <code class="language-plaintext highlighter-rouge">l</code> again interestingly does not give us the same result:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> l
<span class="o">[</span>EOF]
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">list</code> automatically iterates through the code every time the command is entered, and because our script is small we just reach the end-of-file. To continually display where execution halted you can enter <code class="language-plaintext highlighter-rouge">l .</code></p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> l <span class="nb">.</span>
  1          def buggy_loop<span class="o">()</span>:
  2              animals <span class="o">=</span> <span class="o">[</span><span class="s2">"dog"</span>, <span class="s2">"cat"</span>, <span class="s2">"turtle"</span><span class="o">]</span>
  3              index <span class="o">=</span> 0
  4
  5              <span class="k">while </span>index &lt;<span class="o">=</span> len<span class="o">(</span>animals<span class="o">)</span>:
  6  -&gt;              print<span class="o">(</span>f<span class="s2">"The animal at index {index} is {animals[index]}"</span><span class="o">)</span>
  7                  index +<span class="o">=</span> 1
  8
  9          <span class="k">if </span>__name__ <span class="o">==</span> <span class="s2">"__main__"</span>:
 10              buggy_loop<span class="o">()</span>
<span class="o">[</span>EOF]
</code></pre></div></div>

<p>Another nice feature of pdb is that you can enter expressions and see the result printed back. For instance, we know we have a variable named <code class="language-plaintext highlighter-rouge">index</code> in the function we are debugging, so entering that into pdb will print the value of <code class="language-plaintext highlighter-rouge">index</code>.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> index
3
</code></pre></div></div>

<p>If you are debugging a longer function with a lot of variables, you may also be interested in the <code class="language-plaintext highlighter-rouge">dir()</code> or <code class="language-plaintext highlighter-rouge">locals()</code> functions. The former shows the names of all variables in the current scope; the latter gives you the names and values.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> <span class="nb">dir</span><span class="o">()</span>
<span class="o">[</span><span class="s1">'animals'</span>, <span class="s1">'index'</span><span class="o">]</span>
<span class="o">(</span>Pdb<span class="o">)</span> locals<span class="o">()</span>
<span class="o">{</span><span class="s1">'animals'</span>: <span class="o">[</span><span class="s1">'dog'</span>, <span class="s1">'cat'</span>, <span class="s1">'turtle'</span><span class="o">]</span>, <span class="s1">'index'</span>: 3<span class="o">}</span>
</code></pre></div></div>
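
<p>The same two built-ins behave identically outside of the debugger, which makes the difference easy to try out. The snippet below is a minimal sketch; <code class="language-plaintext highlighter-rouge">inspect_scope</code> is a hypothetical function that just mirrors the variables from our buggy script.</p>

```python
def inspect_scope():
    animals = ["dog", "cat", "turtle"]
    index = 3
    # dir() with no arguments lists the names in the current local scope;
    # locals() maps those same names to their current values
    return dir(), dict(locals())


names, values = inspect_scope()
print(names)   # ['animals', 'index']
print(values)  # {'animals': ['dog', 'cat', 'turtle'], 'index': 3}
```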

<p>Let us step back now and talk about the problem we are trying to solve. The traceback tells us we have an <code class="language-plaintext highlighter-rouge">IndexError: list index out of range</code> on line 6, and the debugger paused us at that same line. Upon inspecting the <code class="language-plaintext highlighter-rouge">index</code> variable in the debugger we note it has a value of 3.</p>

<p>Line 6 attempts to do <code class="language-plaintext highlighter-rouge">animals[index]</code>, which fails because Python lists are 0-indexed: a list of three items has valid indices 0 through 2, so an index of 3 is out of range. One fix is for us to change line 5 from</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while </span>index &lt;<span class="o">=</span> len<span class="o">(</span>animals<span class="o">)</span>:
</code></pre></div></div>

<p>to</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while </span>index &lt; len<span class="o">(</span>animals<span class="o">)</span>:
</code></pre></div></div>

<p>If you make that change to the source code you can enter <code class="language-plaintext highlighter-rouge">restart</code> into pdb to start over with the updated script logic. From there input <code class="language-plaintext highlighter-rouge">c</code> and you will note the script executes without issue.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> restart
Restarting /home/willayd/buggy_program.py with arguments:

<span class="o">&gt;</span> /home/willayd/buggy_program.py<span class="o">(</span>1<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
-&gt; def buggy_loop<span class="o">()</span>:
<span class="o">(</span>Pdb<span class="o">)</span> c
Post mortem debugger finished. The /home/willayd/buggy_program.py will be restarted
<span class="o">&gt;</span> /home/willayd/buggy_program.py<span class="o">(</span>1<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
-&gt; def buggy_loop<span class="o">()</span>:
<span class="o">(</span>Pdb<span class="o">)</span> c
The animal at index 0 is dog
The animal at index 1 is <span class="nb">cat
</span>The animal at index 2 is turtle
The program finished and will be restarted
<span class="o">&gt;</span> /home/willayd/buggy_program.py<span class="o">(</span>1<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
-&gt; def buggy_loop<span class="o">()</span>:
</code></pre></div></div>

<p>Since things are good to go now, you can type <code class="language-plaintext highlighter-rouge">quit()</code> into the debugger to close things out.</p>
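
<p>As an aside, this class of off-by-one bug can often be avoided by letting Python handle the indexing. The sketch below is an alternative to the <code class="language-plaintext highlighter-rouge">while</code> loop, not what the original script does; <code class="language-plaintext highlighter-rouge">fixed_loop</code> is a hypothetical name, and it returns the lines instead of printing them directly so they are easy to check.</p>

```python
def fixed_loop():
    animals = ["dog", "cat", "turtle"]
    lines = []
    # enumerate yields (index, item) pairs and stops at the end of the list,
    # so there is no manual index bookkeeping to get wrong
    for index, animal in enumerate(animals):
        lines.append(f"The animal at index {index} is {animal}")
    return lines


for line in fixed_loop():
    print(line)
```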

<h2 id="part-2-debugging-logical-errors">Part 2: Debugging logical errors</h2>

<p>Getting an exception in Python is a clear indicator that things are wrong, but not every bug shows up as an error. The code below is inspired by pandas bug <a href="https://github.com/pandas-dev/pandas/issues/49861">#49861</a>. The code as originally written used a recursive function call that was roughly equivalent to:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">normalize_json</span><span class="p">(</span>
    <span class="n">data</span><span class="p">,</span>
    <span class="n">key_string</span><span class="p">,</span>
    <span class="n">normalized_dict</span><span class="p">,</span>
    <span class="n">separator</span>
<span class="p">):</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">data</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">new_key</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">key_string</span><span class="si">}{</span><span class="n">separator</span><span class="si">}{</span><span class="n">key</span><span class="si">}</span><span class="s">"</span>
            <span class="n">normalize_json</span><span class="p">(</span>
                <span class="n">data</span><span class="o">=</span><span class="n">value</span><span class="p">,</span>
                <span class="c1"># to avoid adding the separator to the start of every key
</span>                <span class="n">key_string</span><span class="o">=</span><span class="n">new_key</span>
                <span class="k">if</span> <span class="n">new_key</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">separator</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="n">separator</span>
                <span class="k">else</span> <span class="n">new_key</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">separator</span><span class="p">)</span> <span class="p">:],</span>
                <span class="n">normalized_dict</span><span class="o">=</span><span class="n">normalized_dict</span><span class="p">,</span>
                <span class="n">separator</span><span class="o">=</span><span class="n">separator</span><span class="p">,</span>
            <span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">normalized_dict</span><span class="p">[</span><span class="n">key_string</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span>
    <span class="k">return</span> <span class="n">normalized_dict</span>
</code></pre></div></div>

<p>This function aims to take the keys of deeply nested dictionaries and combine them into one key with a separator. Note below how hierarchies like <code class="language-plaintext highlighter-rouge">a -&gt; b -&gt; c</code> get folded into one <code class="language-plaintext highlighter-rouge">a.b.c</code> key.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">normalize_json</span><span class="p">({</span><span class="s">"a"</span><span class="p">:</span> <span class="p">{</span><span class="s">"b"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]}},</span> <span class="s">""</span><span class="p">,</span> <span class="p">{},</span> <span class="s">"."</span><span class="p">)</span>
<span class="p">{</span><span class="s">'a.b'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]}</span>

<span class="o">&gt;&gt;&gt;</span> <span class="n">normalize_json</span><span class="p">({</span><span class="s">"a"</span><span class="p">:</span> <span class="p">{</span><span class="s">"b"</span><span class="p">:</span> <span class="p">{</span><span class="s">"c"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]}}},</span> <span class="s">""</span><span class="p">,</span> <span class="p">{},</span> <span class="s">"."</span><span class="p">)</span>
<span class="p">{</span><span class="s">'a.b.c'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]}</span>
</code></pre></div></div>

<p>The OP of the pandas issue noticed that the function would incorrectly remove the start of the string at the top of the dictionary hierarchy <em>if</em> that key began with the <code class="language-plaintext highlighter-rouge">separator</code> argument. For instance, if you had a key at the top of the dictionary that began with an underscore and you used an underscore separator, the very first key would get mangled. This is visible below as the normalized key is shown as <code class="language-plaintext highlighter-rouge">a_b</code> when it should be <code class="language-plaintext highlighter-rouge">_a_b</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">normalize_json</span><span class="p">({</span><span class="s">"_a"</span><span class="p">:</span> <span class="p">{</span><span class="s">"b"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]}},</span> <span class="s">""</span><span class="p">,</span> <span class="p">{},</span> <span class="s">"_"</span><span class="p">)</span>
<span class="p">{</span><span class="s">'a_b'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]}</span>
</code></pre></div></div>

<p>To diagnose, go ahead and save the following code as <code class="language-plaintext highlighter-rouge">buggy_script2.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">normalize_json</span><span class="p">(</span>
    <span class="n">data</span><span class="p">,</span>
    <span class="n">key_string</span><span class="p">,</span>
    <span class="n">normalized_dict</span><span class="p">,</span>
    <span class="n">separator</span>
<span class="p">):</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">data</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">new_key</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">key_string</span><span class="si">}{</span><span class="n">separator</span><span class="si">}{</span><span class="n">key</span><span class="si">}</span><span class="s">"</span>
            <span class="n">normalize_json</span><span class="p">(</span>
                <span class="n">data</span><span class="o">=</span><span class="n">value</span><span class="p">,</span>
                <span class="c1"># to avoid adding the separator to the start of every key
</span>                <span class="n">key_string</span><span class="o">=</span><span class="n">new_key</span>
                <span class="k">if</span> <span class="n">new_key</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">separator</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="n">separator</span>
                <span class="k">else</span> <span class="n">new_key</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">separator</span><span class="p">)</span> <span class="p">:],</span>
                <span class="n">normalized_dict</span><span class="o">=</span><span class="n">normalized_dict</span><span class="p">,</span>
                <span class="n">separator</span><span class="o">=</span><span class="n">separator</span><span class="p">,</span>
            <span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">normalized_dict</span><span class="p">[</span><span class="n">key_string</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span>
    <span class="k">return</span> <span class="n">normalized_dict</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">normalize_json</span><span class="p">({</span><span class="s">"_a"</span><span class="p">:</span> <span class="p">{</span><span class="s">"b"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]}},</span> <span class="s">""</span><span class="p">,</span> <span class="p">{},</span> <span class="s">"_"</span><span class="p">))</span>
</code></pre></div></div>

<p>We can start the debugger and load the script using <code class="language-plaintext highlighter-rouge">python -m pdb buggy_script2.py</code>. However, since no exception is raised this time, the code will not stop unless we explicitly set a breakpoint. <code class="language-plaintext highlighter-rouge">help break</code> gives you some ideas on how to do this; for now start with <code class="language-plaintext highlighter-rouge">break normalize_json</code>.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>1<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
-&gt; def normalize_json<span class="o">(</span>
<span class="o">(</span>Pdb<span class="o">)</span> <span class="nb">break </span>normalize_json
Breakpoint 1 at /home/willayd/buggy_script2.py:1
<span class="o">(</span>Pdb<span class="o">)</span> <span class="nb">break
</span>Num Type         Disp Enb   Where
1   breakpoint   keep <span class="nb">yes   </span>at /home/willayd/buggy_script2.py:1
</code></pre></div></div>

<p>Continue along by hitting <code class="language-plaintext highlighter-rouge">c</code> then <code class="language-plaintext highlighter-rouge">l</code> to list where execution paused, and you will see it is the first line of the <code class="language-plaintext highlighter-rouge">normalize_json</code> function.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> c
<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>7<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; <span class="k">if </span>isinstance<span class="o">(</span>data, dict<span class="o">)</span>:
<span class="o">(</span>Pdb<span class="o">)</span> l
  2              data,
  3              key_string,
  4              normalized_dict,
  5              separator
  6          <span class="o">)</span>:
  7  -&gt;          <span class="k">if </span>isinstance<span class="o">(</span>data, dict<span class="o">)</span>:
  8                  <span class="k">for </span>key, value <span class="k">in </span>data.items<span class="o">()</span>:
  9                      new_key <span class="o">=</span> f<span class="s2">"{key_string}{separator}{key}"</span>
 10                      normalize_json<span class="o">(</span>
 11                          <span class="nv">data</span><span class="o">=</span>value,
 12                          <span class="c"># to avoid adding the separator to the start of every key</span>
</code></pre></div></div>

<p>Another command worth introducing here is <code class="language-plaintext highlighter-rouge">backtrace</code>, or <code class="language-plaintext highlighter-rouge">bt</code> for short. Python keeps track of active function calls on a <a href="https://en.wikipedia.org/wiki/Call_stack">call stack</a>, so backtrace tells you the sequence of calls that led up to the breakpoint.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> bt
  /usr/lib/python3.10/bdb.py<span class="o">(</span>597<span class="o">)</span>run<span class="o">()</span>
-&gt; <span class="nb">exec</span><span class="o">(</span>cmd, globals, locals<span class="o">)</span>
  &lt;string&gt;<span class="o">(</span>1<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
  /home/willayd/buggy_script2.py<span class="o">(</span>25<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
-&gt; print<span class="o">(</span>normalize_json<span class="o">({</span><span class="s2">"_a"</span>: <span class="o">{</span><span class="s2">"b"</span>: <span class="o">[</span>1, 2, 3]<span class="o">}}</span>, <span class="s2">""</span>, <span class="o">{}</span>, <span class="s2">"_"</span><span class="o">))</span>
<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>7<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; <span class="k">if </span>isinstance<span class="o">(</span>data, dict<span class="o">)</span>:
</code></pre></div></div>

<p>Within pdb the most recent call appears on the bottom (other debuggers may reverse this), so reading from the bottom up we are at <code class="language-plaintext highlighter-rouge">normalize_json</code> line 7 which was called by our <code class="language-plaintext highlighter-rouge">buggy_script2.py</code> script on line 25. The calls preceding that are internal to Python. Hit <code class="language-plaintext highlighter-rouge">c</code> again and another <code class="language-plaintext highlighter-rouge">bt</code> to see what happens next:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> c
<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>7<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; <span class="k">if </span>isinstance<span class="o">(</span>data, dict<span class="o">)</span>:
<span class="o">(</span>Pdb<span class="o">)</span> bt
  /usr/lib/python3.10/bdb.py<span class="o">(</span>597<span class="o">)</span>run<span class="o">()</span>
-&gt; <span class="nb">exec</span><span class="o">(</span>cmd, globals, locals<span class="o">)</span>
  &lt;string&gt;<span class="o">(</span>1<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
  /home/willayd/buggy_script2.py<span class="o">(</span>25<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
-&gt; print<span class="o">(</span>normalize_json<span class="o">({</span><span class="s2">"_a"</span>: <span class="o">{</span><span class="s2">"b"</span>: <span class="o">[</span>1, 2, 3]<span class="o">}}</span>, <span class="s2">""</span>, <span class="o">{}</span>, <span class="s2">"_"</span><span class="o">))</span>
  /home/willayd/buggy_script2.py<span class="o">(</span>10<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; normalize_json<span class="o">(</span>
<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>7<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; <span class="k">if </span>isinstance<span class="o">(</span>data, dict<span class="o">)</span>:
</code></pre></div></div>

<p>We are in a recursive function call, so we see that <code class="language-plaintext highlighter-rouge">normalize_json</code> is at the bottom of our backtrace twice. The backtrace gains another <code class="language-plaintext highlighter-rouge">normalize_json</code> entry for every additional level of nesting we continue into.</p>
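
<p>If you want to see that growth outside of pdb, the standard library <code class="language-plaintext highlighter-rouge">inspect</code> module can report the current call stack. This is only a rough sketch; <code class="language-plaintext highlighter-rouge">countdown</code> is a hypothetical recursive function standing in for <code class="language-plaintext highlighter-rouge">normalize_json</code>.</p>

```python
import inspect


def countdown(n, depths):
    # Count how many frames on the current call stack belong to this function
    own_frames = [f for f in inspect.stack() if f.function == "countdown"]
    depths.append(len(own_frames))
    if n > 0:
        countdown(n - 1, depths)
    return depths


print(countdown(2, []))  # [1, 2, 3]
```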

<p>pdb lets you move up and down the call stack. We know we are 2 <code class="language-plaintext highlighter-rouge">normalize_json</code> calls deep. The <code class="language-plaintext highlighter-rouge">up</code> and <code class="language-plaintext highlighter-rouge">down</code> commands move you toward older and newer calls respectively, giving you the power to inspect each <em>frame</em>.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> locals<span class="o">()</span>
<span class="o">{</span><span class="s1">'data'</span>: <span class="o">{</span><span class="s1">'b'</span>: <span class="o">[</span>1, 2, 3]<span class="o">}</span>, <span class="s1">'key_string'</span>: <span class="s1">'_a'</span>, <span class="s1">'normalized_dict'</span>: <span class="o">{}</span>, <span class="s1">'separator'</span>: <span class="s1">'_'</span><span class="o">}</span>
<span class="o">(</span>Pdb<span class="o">)</span> up
<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>10<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; normalize_json<span class="o">(</span>
<span class="o">(</span>Pdb<span class="o">)</span> locals<span class="o">()</span>
<span class="o">{</span><span class="s1">'data'</span>: <span class="o">{</span><span class="s1">'_a'</span>: <span class="o">{</span><span class="s1">'b'</span>: <span class="o">[</span>1, 2, 3]<span class="o">}}</span>, <span class="s1">'key_string'</span>: <span class="s1">''</span>, <span class="s1">'normalized_dict'</span>: <span class="o">{}</span>, <span class="s1">'separator'</span>: <span class="s1">'_'</span>, <span class="s1">'key'</span>: <span class="s1">'_a'</span>, <span class="s1">'value'</span>: <span class="o">{</span><span class="s1">'b'</span>: <span class="o">[</span>1, 2, 3]<span class="o">}</span>, <span class="s1">'new_key'</span>: <span class="s1">'__a'</span><span class="o">}</span>
<span class="o">(</span>Pdb<span class="o">)</span> down
<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>7<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; <span class="k">if </span>isinstance<span class="o">(</span>data, dict<span class="o">)</span>:
<span class="o">(</span>Pdb<span class="o">)</span> locals<span class="o">()</span>
<span class="o">{</span><span class="s1">'data'</span>: <span class="o">{</span><span class="s1">'b'</span>: <span class="o">[</span>1, 2, 3]<span class="o">}</span>, <span class="s1">'key_string'</span>: <span class="s1">'_a'</span>, <span class="s1">'normalized_dict'</span>: <span class="o">{}</span>, <span class="s1">'separator'</span>: <span class="s1">'_'</span><span class="o">}</span>
</code></pre></div></div>

<p>The first time we called <code class="language-plaintext highlighter-rouge">locals()</code> we were at the most recent <code class="language-plaintext highlighter-rouge">normalize_json</code> call. The <code class="language-plaintext highlighter-rouge">up</code> command moved us back one frame; <code class="language-plaintext highlighter-rouge">down</code> takes us back to the current frame.</p>
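
<p>Frames are ordinary Python objects, so the idea behind <code class="language-plaintext highlighter-rouge">up</code> can be approximated in plain code. The sketch below assumes CPython (where <code class="language-plaintext highlighter-rouge">inspect.currentframe()</code> is available), and <code class="language-plaintext highlighter-rouge">outer</code> and <code class="language-plaintext highlighter-rouge">inner</code> are hypothetical functions that reuse the values we saw in the debugger.</p>

```python
import inspect


def outer():
    new_key = "__a"  # only visible in this (older) frame
    return inner(key_string="_a")


def inner(key_string):
    frame = inspect.currentframe()
    # f_back is the frame that called us, roughly where pdb's `up` would land
    caller_locals = frame.f_back.f_locals
    return caller_locals["new_key"], key_string


print(outer())  # ('__a', '_a')
```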

<p>Since our input data isn’t too deeply nested, we could keep continuing and moving up and down the stack to try and find where the issue appears, but this could be impractical with many layers of recursion. Fortunately we can be more intelligent with where and when we choose to pause code execution.</p>

<p>To do that let’s <code class="language-plaintext highlighter-rouge">restart</code> our code execution and <code class="language-plaintext highlighter-rouge">clear</code> our existing breakpoint(s).</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> restart
Restarting /home/willayd/buggy_script2.py with arguments:

<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>1<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
-&gt; def normalize_json<span class="o">(</span>
<span class="o">(</span>Pdb<span class="o">)</span> clear
Clear all breaks? y
Deleted breakpoint 1 at /home/willayd/buggy_script2.py:1
</code></pre></div></div>

<p>If you inspected the <code class="language-plaintext highlighter-rouge">help break</code> output earlier, you might have noticed that <code class="language-plaintext highlighter-rouge">break</code> takes an optional condition argument. This is an expression that must evaluate to true for the breakpoint to pause execution.</p>

<p>We know from our bug report and from inspecting some of the <code class="language-plaintext highlighter-rouge">locals()</code> outputs earlier that the bug likely happens when a variable named <code class="language-plaintext highlighter-rouge">key_string</code> has the value of <code class="language-plaintext highlighter-rouge">a_b</code>, so we can pause execution only when that condition is met.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> <span class="nb">break </span>normalize_json, key_string <span class="o">==</span> <span class="s2">"a_b"</span>
Breakpoint 2 at /home/willayd/buggy_script2.py:1
<span class="o">(</span>Pdb<span class="o">)</span> c
<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>7<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; <span class="k">if </span>isinstance<span class="o">(</span>data, dict<span class="o">)</span>:
<span class="o">(</span>Pdb<span class="o">)</span> bt
  /usr/lib/python3.10/bdb.py<span class="o">(</span>597<span class="o">)</span>run<span class="o">()</span>
-&gt; <span class="nb">exec</span><span class="o">(</span>cmd, globals, locals<span class="o">)</span>
  &lt;string&gt;<span class="o">(</span>1<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
  /home/willayd/buggy_script2.py<span class="o">(</span>25<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
-&gt; print<span class="o">(</span>normalize_json<span class="o">({</span><span class="s2">"_a"</span>: <span class="o">{</span><span class="s2">"b"</span>: <span class="o">[</span>1, 2, 3]<span class="o">}}</span>, <span class="s2">""</span>, <span class="o">{}</span>, <span class="s2">"_"</span><span class="o">))</span>
  /home/willayd/buggy_script2.py<span class="o">(</span>10<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; normalize_json<span class="o">(</span>
  /home/willayd/buggy_script2.py<span class="o">(</span>10<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; normalize_json<span class="o">(</span>
<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>7<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; <span class="k">if </span>isinstance<span class="o">(</span>data, dict<span class="o">)</span>:
</code></pre></div></div>

<p>The above backtrace shows us pausing code execution within the third call of <code class="language-plaintext highlighter-rouge">normalize_json</code>. Even though our breakpoint was on the <code class="language-plaintext highlighter-rouge">normalize_json</code> function, the expression <code class="language-plaintext highlighter-rouge">key_string == "a_b"</code> did not evaluate to true for the first two function calls.</p>
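
<p>To convince yourself of that without the debugger, you can record the <code class="language-plaintext highlighter-rouge">key_string</code> that each call receives. The sketch below copies the function from above and adds a hypothetical <code class="language-plaintext highlighter-rouge">calls</code> argument purely for bookkeeping; it is not part of the original code.</p>

```python
def normalize_json(data, key_string, normalized_dict, separator, calls):
    # `calls` is a hypothetical extra argument used only to record each call
    calls.append(key_string)
    if isinstance(data, dict):
        for key, value in data.items():
            new_key = f"{key_string}{separator}{key}"
            normalize_json(
                data=value,
                key_string=new_key
                if new_key[len(separator) - 1] != separator
                else new_key[len(separator):],
                normalized_dict=normalized_dict,
                separator=separator,
                calls=calls,
            )
    else:
        normalized_dict[key_string] = data
    return normalized_dict


calls = []
normalize_json({"_a": {"b": [1, 2, 3]}}, "", {}, "_", calls)
# Which calls (counting from 1) would satisfy the breakpoint condition?
matches = [n for n, key in enumerate(calls, start=1) if key == "a_b"]
print(calls)    # ['', '_a', 'a_b']
print(matches)  # [3]
```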

<p>Where our execution paused, <code class="language-plaintext highlighter-rouge">key_string</code> is not modified locally, but rather received as an argument. This means the bug likely originates one call up in the backtrace, so move up and inspect the code:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> u
<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>10<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; normalize_json<span class="o">(</span>
<span class="o">(</span>Pdb<span class="o">)</span> ll
  1 B        def normalize_json<span class="o">(</span>
  2              data,
  3              key_string,
  4              normalized_dict,
  5              separator
  6          <span class="o">)</span>:
  7              <span class="k">if </span>isinstance<span class="o">(</span>data, dict<span class="o">)</span>:
  8                  <span class="k">for </span>key, value <span class="k">in </span>data.items<span class="o">()</span>:
  9                      new_key <span class="o">=</span> f<span class="s2">"{key_string}{separator}{key}"</span>
 10  -&gt;                  normalize_json<span class="o">(</span>
 11                          <span class="nv">data</span><span class="o">=</span>value,
 12                          <span class="c"># to avoid adding the separator to the start of every key</span>
 13                          <span class="nv">key_string</span><span class="o">=</span>new_key
 14                          <span class="k">if </span>new_key[len<span class="o">(</span>separator<span class="o">)</span> - 1] <span class="o">!=</span> separator
 15                          <span class="k">else </span>new_key[len<span class="o">(</span>separator<span class="o">)</span> :],
 16                          <span class="nv">normalized_dict</span><span class="o">=</span>normalized_dict,
 17                          <span class="nv">separator</span><span class="o">=</span>separator,
 18                      <span class="o">)</span>
 19              <span class="k">else</span>:
 20                  normalized_dict[key_string] <span class="o">=</span> data
 21              <span class="k">return </span>normalized_dict
<span class="o">(</span>Pdb<span class="o">)</span> new_key
<span class="s1">'_a_b'</span>
</code></pre></div></div>

<p>Our code execution paused on line 10. On line 9 <code class="language-plaintext highlighter-rouge">new_key</code> was assigned a value of <code class="language-plaintext highlighter-rouge">_a_b</code>, which is what we want to see in the end result.</p>

<p>Look closely at lines 13 through 15, however, and you will note that we aren’t just forwarding <code class="language-plaintext highlighter-rouge">new_key</code> as an argument to the next <code class="language-plaintext highlighter-rouge">normalize_json</code> call. Instead we have an <code class="language-plaintext highlighter-rouge">if...else</code> expression that determines which value gets forwarded along. We can evaluate the pieces of that conditional to get an idea of what is going on:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> new_key[len<span class="o">(</span>separator<span class="o">)</span> - 1]
<span class="s1">'_'</span>
<span class="o">(</span>Pdb<span class="o">)</span> new_key[len<span class="o">(</span>separator<span class="o">)</span>:]
<span class="s1">'a_b'</span>
<span class="o">(</span>Pdb<span class="o">)</span> new_key
<span class="s1">'_a_b'</span>
</code></pre></div></div>
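<p>To see why that conditional misfires, here is a minimal sketch that isolates it. The helper name <code class="language-plaintext highlighter-rouge">forwarded_key</code> is hypothetical and not part of the script; it only mimics the expression on lines 13 through 15 of the listing above.</p>

```python
def forwarded_key(key_string, key, separator):
    """Hypothetical helper mimicking the conditional from the buggy script."""
    new_key = f"{key_string}{separator}{key}"
    return (
        new_key
        if new_key[len(separator) - 1] != separator
        else new_key[len(separator):]
    )


# First call: key_string is empty, so the leading "_" is the separator we just added.
print(forwarded_key("", "_a", "_"))   # _a
# Nested call: the leading "_" now belongs to the user's key, yet it is stripped anyway.
print(forwarded_key("_a", "b", "_"))  # a_b
```

<p>The check only looks at the first character of the joined key, so it cannot tell a separator that was just prepended apart from one that is part of the original key.</p>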

<p>Our first instinct might be to simplify the function call and make the argument <code class="language-plaintext highlighter-rouge">key_string=new_key</code>, making our buggy_script2.py script now look like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">normalize_json</span><span class="p">(</span>
    <span class="n">data</span><span class="p">,</span>
    <span class="n">key_string</span><span class="p">,</span>
    <span class="n">normalized_dict</span><span class="p">,</span>
    <span class="n">separator</span>
<span class="p">):</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">data</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">new_key</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">key_string</span><span class="si">}{</span><span class="n">separator</span><span class="si">}{</span><span class="n">key</span><span class="si">}</span><span class="s">"</span>
            <span class="n">normalize_json</span><span class="p">(</span>
                <span class="n">data</span><span class="o">=</span><span class="n">value</span><span class="p">,</span>
                <span class="n">key_string</span><span class="o">=</span><span class="n">new_key</span><span class="p">,</span>
                <span class="n">normalized_dict</span><span class="o">=</span><span class="n">normalized_dict</span><span class="p">,</span>
                <span class="n">separator</span><span class="o">=</span><span class="n">separator</span><span class="p">,</span>
            <span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">normalized_dict</span><span class="p">[</span><span class="n">key_string</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span>
    <span class="k">return</span> <span class="n">normalized_dict</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">normalize_json</span><span class="p">({</span><span class="s">"_a"</span><span class="p">:</span> <span class="p">{</span><span class="s">"b"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]}},</span> <span class="s">""</span><span class="p">,</span> <span class="p">{},</span> <span class="s">"_"</span><span class="p">))</span>
</code></pre></div></div>

<p>This reads nicer, but we have fixed one thing by breaking another. Doing a <code class="language-plaintext highlighter-rouge">restart</code> and <code class="language-plaintext highlighter-rouge">continue</code> in pdb no longer hits our breakpoint, but the script now prints out <code class="language-plaintext highlighter-rouge">{'__a_b': [1, 2, 3]}</code>. We want <code class="language-plaintext highlighter-rouge">_a_b</code> as the key, not <code class="language-plaintext highlighter-rouge">__a_b</code>.</p>

<p>So back to the drawing board. In pdb, enter <code class="language-plaintext highlighter-rouge">restart</code> and then <code class="language-plaintext highlighter-rouge">clear</code> to remove the breakpoint we have set so far. Then enter <code class="language-plaintext highlighter-rouge">break normalize_json</code> so that we stop again on every function call.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> clear
Clear all breaks? y
Deleted breakpoint 2 at /home/willayd/buggy_script2.py:1
<span class="o">(</span>Pdb<span class="o">)</span> <span class="nb">break </span>normalize_json
Breakpoint 3 at /home/willayd/buggy_script2.py:1
<span class="o">(</span>Pdb<span class="o">)</span>
</code></pre></div></div>

<p>Now step through a few function calls, inspect locals and see what might be happening:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> locals<span class="o">()</span>
<span class="o">{</span><span class="s1">'data'</span>: <span class="o">{</span><span class="s1">'_a'</span>: <span class="o">{</span><span class="s1">'b'</span>: <span class="o">[</span>1, 2, 3]<span class="o">}}</span>, <span class="s1">'key_string'</span>: <span class="s1">''</span>, <span class="s1">'normalized_dict'</span>: <span class="o">{}</span>, <span class="s1">'separator'</span>: <span class="s1">'_'</span><span class="o">}</span>
<span class="o">(</span>Pdb<span class="o">)</span> c
<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>7<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; <span class="k">if </span>isinstance<span class="o">(</span>data, dict<span class="o">)</span>:
<span class="o">(</span>Pdb<span class="o">)</span> locals<span class="o">()</span>
<span class="o">{</span><span class="s1">'data'</span>: <span class="o">{</span><span class="s1">'b'</span>: <span class="o">[</span>1, 2, 3]<span class="o">}</span>, <span class="s1">'key_string'</span>: <span class="s1">'__a'</span>, <span class="s1">'normalized_dict'</span>: <span class="o">{}</span>, <span class="s1">'separator'</span>: <span class="s1">'_'</span><span class="o">}</span>
<span class="o">(</span>Pdb<span class="o">)</span> c
<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>7<span class="o">)</span>normalize_json<span class="o">()</span>
-&gt; <span class="k">if </span>isinstance<span class="o">(</span>data, dict<span class="o">)</span>:
<span class="o">(</span>Pdb<span class="o">)</span> locals<span class="o">()</span>
<span class="o">{</span><span class="s1">'data'</span>: <span class="o">[</span>1, 2, 3], <span class="s1">'key_string'</span>: <span class="s1">'__a_b'</span>, <span class="s1">'normalized_dict'</span>: <span class="o">{}</span>, <span class="s1">'separator'</span>: <span class="s1">'_'</span><span class="o">}</span>
<span class="o">(</span>Pdb<span class="o">)</span>
</code></pre></div></div>

<p>If you look closely, you will notice that the <code class="language-plaintext highlighter-rouge">key_string</code> variable is already wrong on the second call to the <code class="language-plaintext highlighter-rouge">normalize_json</code> function: it is <code class="language-plaintext highlighter-rouge">__a</code> when we want <code class="language-plaintext highlighter-rouge">_a</code>. The join in the call thereafter, which adds a single separator between the two keys, appears correct.</p>
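<p>The same sequence of <code class="language-plaintext highlighter-rouge">key_string</code> values can be reproduced outside the debugger. The sketch below uses a hypothetical helper, <code class="language-plaintext highlighter-rouge">trace_key_strings</code>, that is not part of the script; it assumes the same unconditional join as the simplified version and records the argument on every recursive call.</p>

```python
def trace_key_strings(data, key_string="", separator="_", seen=None):
    """Hypothetical helper: record key_string on every recursive call."""
    seen = [] if seen is None else seen
    seen.append(key_string)
    if isinstance(data, dict):
        for key, value in data.items():
            # Same unconditional join as the simplified script.
            trace_key_strings(value, f"{key_string}{separator}{key}", separator, seen)
    return seen


print(trace_key_strings({"_a": {"b": [1, 2, 3]}}))
# ['', '__a', '__a_b']
```

<p>The extra separator shows up as soon as the empty <code class="language-plaintext highlighter-rouge">key_string</code> from the first call is joined with the first key.</p>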

<p>A simplistic solution is to have some mechanism within our <code class="language-plaintext highlighter-rouge">normalize_json</code> call to know if it is the first time the function is being called or not, and special-case the handling of the first call. Inspecting <code class="language-plaintext highlighter-rouge">locals()</code> across the different function calls, we notice in the first call that <code class="language-plaintext highlighter-rouge">key_string</code> is an empty string but has a value in all subsequent calls. Knowing this we can set up a condition to strip the leading separator only when we are in the first function call, since that is the only call where the join places a separator at the very front of the key.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">normalize_json</span><span class="p">(</span>
    <span class="n">data</span><span class="p">,</span>
    <span class="n">key_string</span><span class="p">,</span>
    <span class="n">normalized_dict</span><span class="p">,</span>
    <span class="n">separator</span>
<span class="p">):</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">data</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">new_key</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">key_string</span><span class="si">}{</span><span class="n">separator</span><span class="si">}{</span><span class="n">key</span><span class="si">}</span><span class="s">"</span>
            <span class="c1"># on the first call key_string is empty, so drop the separator we just prepended
</span>            <span class="k">if</span> <span class="ow">not</span> <span class="n">key_string</span><span class="p">:</span>
                <span class="n">new_key</span> <span class="o">=</span> <span class="n">new_key</span><span class="p">.</span><span class="n">removeprefix</span><span class="p">(</span><span class="n">separator</span><span class="p">)</span>
            <span class="n">normalize_json</span><span class="p">(</span>
                <span class="n">data</span><span class="o">=</span><span class="n">value</span><span class="p">,</span>
                <span class="n">key_string</span><span class="o">=</span><span class="n">new_key</span><span class="p">,</span>
                <span class="n">normalized_dict</span><span class="o">=</span><span class="n">normalized_dict</span><span class="p">,</span>
                <span class="n">separator</span><span class="o">=</span><span class="n">separator</span><span class="p">,</span>
            <span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">normalized_dict</span><span class="p">[</span><span class="n">key_string</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span>
    <span class="k">return</span> <span class="n">normalized_dict</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">normalize_json</span><span class="p">({</span><span class="s">"_a"</span><span class="p">:</span> <span class="p">{</span><span class="s">"b"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]}},</span> <span class="s">""</span><span class="p">,</span> <span class="p">{},</span> <span class="s">"_"</span><span class="p">))</span>
</code></pre></div></div>

<p>To verify this now works, <code class="language-plaintext highlighter-rouge">restart</code> the program, <code class="language-plaintext highlighter-rouge">clear</code> any breakpoint(s) and <code class="language-plaintext highlighter-rouge">continue</code> to let things run. You should now get the right answer.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>Pdb<span class="o">)</span> restart
Restarting /home/willayd/buggy_script2.py with arguments:

<span class="o">&gt;</span> /home/willayd/buggy_script2.py<span class="o">(</span>1<span class="o">)</span>&lt;module&gt;<span class="o">()</span>
-&gt; def normalize_json<span class="o">(</span>
<span class="o">(</span>Pdb<span class="o">)</span> clear
Clear all breaks? y
Deleted breakpoint 3 at /home/willayd/buggy_script2.py:1
<span class="o">(</span>Pdb<span class="o">)</span> c
<span class="o">{</span><span class="s1">'_a_b'</span>: <span class="o">[</span>1, 2, 3]<span class="o">}</span>
The program finished and will be restarted
</code></pre></div></div>
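<p>Beyond eyeballing the printed output, a few assertions can guard against the bug coming back. This is only a sketch: the function body is copied from the fixed script above in condensed form, and the extra inputs are made-up examples rather than cases from the original bug report.</p>

```python
def normalize_json(data, key_string, normalized_dict, separator):
    if isinstance(data, dict):
        for key, value in data.items():
            new_key = f"{key_string}{separator}{key}"
            # Only the first call has an empty key_string.
            if not key_string:
                new_key = new_key.removeprefix(separator)
            normalize_json(value, new_key, normalized_dict, separator)
    else:
        normalized_dict[key_string] = data
    return normalized_dict


# The case we debugged: a top-level key that itself starts with the separator.
assert normalize_json({"_a": {"b": [1, 2, 3]}}, "", {}, "_") == {"_a_b": [1, 2, 3]}
# Made-up extra cases: deeper nesting and a different separator.
assert normalize_json({"a": {"b": 1, "c": {"d": 2}}}, "", {}, "_") == {"a_b": 1, "a_c_d": 2}
assert normalize_json({"a": {"b": 1}}, "", {}, ".") == {"a.b": 1}
print("all checks passed")
```

<p>Note that <code class="language-plaintext highlighter-rouge">str.removeprefix</code> requires Python 3.9 or newer. Using an empty <code class="language-plaintext highlighter-rouge">key_string</code> to detect the first call also means a top-level key that is itself an empty string would be treated like the first call again one level down, which may or may not matter for your data.</p>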

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>If you have made it this far, congratulations! With modern visual debuggers integrated into IDEs, the way of debugging illustrated above may not be the most commonplace. However, through liberal use of the <code class="language-plaintext highlighter-rouge">help</code> command you may find that <code class="language-plaintext highlighter-rouge">pdb</code> has many features that are either not implemented or not obvious to use in higher-level debuggers. Barring some differences, you’ll also find that this method of using <code class="language-plaintext highlighter-rouge">pdb</code> translates well into using <code class="language-plaintext highlighter-rouge">gdb</code> and extensions like the Cython debugger, which will be covered in future blog posts.</p>]]></content><author><name>Will Ayd</name></author><category term="debugging" /><category term="python" /><summary type="html"><![CDATA[This blog post teaches you how to navigate pdb, the Python debugger. It is part 1 of a 3 part series.]]></summary></entry></feed>