Will Ayd | Personal Blog

15 Years of Data - From Closed to Open Source

2025-01-08T00:00:00+00:00

It feels almost surreal to take a step back and recognize that I have now spent 15 years of my life working professionally in the field of data.

Over this time, I have experienced a monumental shift in how organizations configure their reporting platforms. What was once a field dominated by add-ons provided by corporate B2B titans like SAP, Oracle, and IBM has evolved into a field where open source solutions provide far superior options for organizations to utilize.

In this post I’ll share some of my experiences that have coincided with that shift, while providing anecdotes of how open source tools have changed the landscape for the better. I’ll also add in my thoughts on where open source tools are going to take us over the next few years.

Industry experience in the early 2010s

My first professional job was as a “Business Intelligence Analyst” at the company Under Armour, which at the time was growing rather rapidly. Under Armour had a huge partnership with SAP to provide their technological solutions, which included not just their flagship ERP solution, but a proliferation of analytical tools as well.

I didn’t realize it at the time, but the SAP analytical tools were downright…awful. To load data, we were forced to use SAP’s own proprietary programming language ABAP, which was very poorly documented and understood. It is highly likely that we wrote poor ABAP, but given its closed nature and lack of community, there was no way to really tell.

The data extraction jobs that we wrote in this language exhibited awful performance and were horribly unstable, even by 2010 standards. The vast majority of our business critical data loads from the ERP system ran anywhere from 4-8 hours in batch every night. I would estimate that on 60-70% of the nights we had a job failure that required the on call person to wake up at 3 AM and log into the system to restart it.

If we were lucky enough to have data loaded, the only interface through which we could then access it was a tool called BEx Analyzer. I’ll let you google images of this tool, but needless to say, it was a glorified pivot table. SAP’s solution to visualization was implemented in another product they acquired called BusinessObjects. The tools in this suite were fine at best, but, particularly for interactive visualizations, they lagged far behind a tool like Tableau, which at the time was considered best-in-class for drag and drop visualization.

Another “selling point” to using these vendor-provided tools was that we had an official support contract. Unfortunately, a support contract with a company like SAP is just a game of cat and mouse, with the ultimate goal of discouraging you as a customer from using the contract in the future. There were many instances where we would open a high priority ticket that impacted business operations. In turn, we would get connected with support “experts” who had very little knowledge of the inner workings of their tool.

The fact that the first few layers of support did not have much knowledge of the tools they supported is not an indictment of the people; rather, it is a rebuke of the closed source model where, even within an organization, only a select few are allowed to see the inner workings of a tool. The only thing initial lines of support possessed was a private collection of internal notes for common issues. Think of a site like StackOverflow, but instead of being freely accessible, you pay for someone else to have exclusive access to it and they just tell you what they see.

Of course, the notes that were collected did not cover many of the issues we would face. There were many times where we would be engaging support for multiple hours, only for the support team to say “sorry our workday has finished, we will share our findings with associates in the next time zone that will help you.” Rarely ever were findings shared, so we ended up in the support Twilight Zone until something by chance resolved itself. In extreme cases, this would take days, and really no one learned anything from it - everyone was just relieved that they could close the ticket and move on until the next one.

In all fairness to SAP, I have had similar experiences with Microsoft and their Power Platform support contracts. SAP is not unique in offering poor value in exchange for these contracts; generally, closed-source B2B enterprises can make a large amount of profit off of these contracts while offering customers very poor ROI in return. In the early 2010s, this was common practice.

The cherry on top of poor ROI would manifest itself every few years when it was time to upgrade systems. In the early 2010s many customers had to manage their stack and avoid obsolescence (cloud, PaaS, and SaaS were not yet as common as they are today).

For the proprietary reporting platform provider like SAP, this presented a great opportunity to double dip on profits. By providing customers with subpar tools and not having accountability to fix issues, the solution to many unresolved issues would be “hey, you just need to upgrade.” Upgraded software came with a licensing fee, would be bundled with resold hardware, and would require the hiring of consultants to just get you onto a more recent version of the tool. Needless to say, this process was really expensive. At times, you would have to pay millions if not more to simply avoid obsolescence. With no alternatives, many customers had no choice but to go down this path.

Open Source Tools to the Rescue

My first serious exposure to open source data tools was back in 2017. It was not something that was promoted by the industry I worked in, but rather a curiosity of mine that led to some awesome discoveries.

Speaking briefly to my first use case, Under Armour had no standardization as to how their vendors were packing certain types of products into boxes. That might not sound like a big deal, but it has measurable costs when boxes break, or your warehouse needs to repack goods.

The dataset I was working with was a few million records of vendor shipments, and the SAP-provided tools we had were of no use with data even at this scale. Out of curiosity, I fired up pandas and used the seaborn library to generate a violin plot of this data. If you don’t know what a violin plot is, here’s one that I created in my recent book, the Pandas Cookbook, Third Edition:

This plot shows the distribution of IMDB scores across different decades, and you can rather easily see trends not only in terms of averages but with the distributions of data as well. Imagine replacing the years on the y-axis with different product types and the x-axis values with the number of units packed into a carton, and you have an idea of what I was able to visualize in Python. The fact that I could do this with freely available tools in a matter of seconds, and in a way that was highly auditable and reproducible was downright…amazing!

Shortly after I started working with open source tools in corporate settings, I moved into the world of consulting, first at a small practice and then independently. During that time, I was able to contribute back to the open source ecosystem, and in doing so felt like I stumbled upon a gold mine. The first project I contributed to was pandas (of which I am a maintainer today), and that unlocked interactions with users in scikit-learn, SciPy, NumPy, and many other great tools. I also followed a lot of the work that Wes McKinney did with his move from pandas to Apache Arrow, a project on which I also became a committer in 2024.

The work I put into open source at this time didn’t pay the bills, but it offered me a huge network and ecosystem to work with in my consulting engagements. It also taught me best practices in software engineering, which can be translated over into data engineering to drive real scalability.

Instead of using reporting tools provided by the SAPs and Oracles of the world, the stacks I build at clients today typically use some combination of:

Terraform, for Infrastructure as Code
Apache Airflow, for Job Orchestration
Python as a glue language for ETL
dbt-core for transformations and testing
git for version control
GitHub Actions for CI/CD

…and more. At a minimum these tools are entirely free to use, but if you wanted to go more of a hosted solution route you can find many third party PaaS and SaaS providers for them. The costs of these services are very affordable compared to the reporting platforms of the early 2010s.

Perhaps the most important thing to building a stack like this is that you limit vendor lock-in. Gone are the days where a B2B provider can provide a subpar solution and charge you more money to upgrade it - you ultimately have control over how your data is managed, maintained, and evolved.

Granted, you need some degree of technical inclination to maintain a stack like this, but keep in mind that these tools are being taught and used at universities, and can be used at home free-of-cost by those willing to learn. That alone is a huge advantage compared to traditional closed-source reporting architectures. Also keep in mind that as LLMs continue to become more powerful, the amount of open information for these tools is a huge asset. You can ask popular chatbots questions about these and get much higher quality information than if you were to ask about an SAP, Oracle, or IBM closed-source tool.

Open platforms have saved companies significant amounts of money. I have not seen anywhere near the amount of money invested to implement, upgrade, and maintain open source reporting platforms as compared to traditional providers. Large project implementations are reduced from tens to hundreds of millions of dollars down to less than a million.

Of course, there are still some gaps in the open source space. You may have noticed that I didn’t list a visualization platform in my tools above. With the companies I’ve partnered with, Power BI and Tableau tend to dominate that space, with Looker not far behind. Hex is a tool that I will be following closely as well, given it integrates well into the open source ecosystem, but generally there is not an open source visualization tool I know of today that can compete in this space. Open source markets itself to technical audiences, but visualization tools are a bridge for non-technical audiences into the technical world. To build a big community in that space around an open source tool might be challenging, but who knows - maybe one day it will happen.

Where are We Headed

Throughout this blog post I’ve highlighted my experiences moving from closed to open source reporting tools, and I think that has been largely reflective of the industry as a whole. However, open source reporting architecture still needs to capture more market share. From both a technical and economic perspective, I see no reason why this won’t continue, but it takes time to shift industries. If you happen to work at a company that still has not yet adopted open source tools, shoot me an email at will_ayd@innobi.io - I’d love to chat about how the right platform will drive down your operational costs and improve the value of your data.

For companies that have already adopted an open source architecture, there is still a lot that the open source community itself can do to improve interoperability. If you have followed the dataframe space for the past few years, you have seen tools like Polars, Apache DataFusion, and DuckDB work their way into a space that was once dominated by tools like pandas or R. As a maintainer of pandas, I ultimately think this choice is a good thing. Tools should be a means to an end, not the end goal of expression. Find whatever works best for you and let the open source community optimize it.

The idea that you can plug and play different open source tools into your stack is a core tenet of the “Composable Data System.” Wes McKinney and many other titans of the industry have espoused this idea in their own writing; see Wes’ own 15 year reflection and The Composable Data Management System Manifesto.

To think about what that means non-technically, just imagine that you are a company that uses SAP’s ERP solution and Microsoft Power BI for reporting. Those are both closed source tools that implement data storage in their own ways, so every time you move data from one to the other you have to pay some kind of cost. That cost can manifest itself in compute time, data loss, or in the troubleshooting of job failures.

In the open source space, the Apache Arrow project gives all of the tools the ability to “speak the same language.” So if you have a dataframe library like pandas and want to create visualizations in a tool like Vega, there is no additional cost to using those two together. Pandas can store data in a way that Vega understands; the two communicate and operate on the same data in memory, so your ETL costs go down to zero, you have no data loss, and there are no job failures to troubleshoot.

Apache Arrow will continue to help a lot in terms of exchanging data. If you’ve used the Apache Parquet format for exchanging data, you have already seen this. That format in particular enables highly efficient, lossless storage between dataframe libraries, databases, and visualization tools. It drastically improved a space that struggled mightily when the de facto method of data exchange was through CSV files.
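To make the CSV problem concrete, here is a minimal sketch using only the Python standard library. The shipment records, field names, and values are invented for illustration. It writes typed records to CSV and reads them back, showing that every value returns as a plain string. A format like Parquet stores a schema alongside the data, so the reader does not have to guess types again.

```python
import csv
import io
from datetime import date
from decimal import Decimal

# Hypothetical shipment records with typed values (invented for illustration).
rows = [
    {"sku": "00123", "units": 12, "unit_cost": Decimal("4.50"), "shipped": date(2017, 3, 1)},
    {"sku": "00456", "units": 24, "unit_cost": Decimal("3.10"), "shipped": date(2017, 3, 2)},
]


def csv_round_trip(records):
    """Write records to CSV text, then read them back with the standard library."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


restored = csv_round_trip(rows)

# Compare the type of each field before and after the round trip.
original_types = {key: type(value).__name__ for key, value in rows[0].items()}
restored_types = {key: type(value).__name__ for key, value in restored[0].items()}
print(original_types)
print(restored_types)

# A naive "looks numeric, so make it an int" rule would drop the SKU's leading zeros.
guessed_sku = int(restored[0]["sku"])
print(guessed_sku)
```

Every consumer of the CSV file has to re-implement the type guessing in the last step, and each one may guess differently.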

As successful as that has been, there are other methods of data exchange that can still improve, with database communication being a prime example. To that end, the Apache Arrow project has developed the Arrow Database Connectivity (ADBC) standard. While a relatively young project, it has made it immensely more efficient to exchange data between dataframes and databases like PostgreSQL, SQLite, Snowflake, and BigQuery. If you haven’t yet seen my PyData 2023 talk on ADBC - check it out. Simply put, ADBC makes it much faster, cheaper, and safer to exchange data with databases. I hope to see adoption of ADBC continue to grow with more databases in the next few years.

There’s also the ability to exchange Arrow data over HTTP, which can solve a rather significant throughput problem. In today’s world, companies often partner with a multitude of SaaS solutions that refuse to provide lower level access to their database, instead giving customers a REST API through which they can access the data. This checks the box for saying “hey we can share your data with you,” but different consumers need data shared in different ways. Usually, the data exchange mechanisms provided by SaaS solutions have been sufficient for creating reactive webhooks, but struggle to exchange data in bulk. Reporting platforms often need the latter, so I’m hoping to see more adoption of this practice in the next few years to solve real scalability problems that companies face today.

The 5 Data Mistakes Every Apparel Company Makes

2024-11-19T00:00:00+00:00

Over the course of 15 years, I have had the pleasure of working at and partnering with some amazing companies in the Apparel space. While I am not qualified to design product, I’ve been fortunate to work with many of the different business areas that make up an Apparel organization. As a data practitioner, this has been immensely helpful when thinking about crafting data strategies that work for the entire organization.

Interestingly, I found that the companies I worked at repeated the same data mistakes over and over again. In this article, I highlight five major mistakes to avoid; doing so will pay immense dividends in terms of how much your organization has to spend to manage data, and in turn how effectively your organization can use data.

PLM System Overreach

The first major mistake I see Apparel companies make is to try and solve all of their product management issues with a PLM system.

Companies have a real need to categorize products, plan, merchandise, and manage changes throughout the product’s lifecycle. While in theory a PLM system could help you manage all of these, in practice I think PLM systems become far less useful the further downstream you get from managing the technical aspects of a product.

For example, I’ve seen companies try to implement planning solutions inside of their PLM. The logic that leads you to this solution usually goes:

We have a lot of plans that we manage in Excel. We need to store them somewhere else
We do a lot of data entry into our PLM system already
Product line managers might want to know how well a product is planned to craft their line

Sure, this is all well intentioned, but assuming your company has gone down this path, the question you must ask yourself after putting planning data into your PLM is “now what?” I’ve yet to come across a PLM system that offers any type of robust planning solution (ironically, Excel has more to offer). By going down this path, your company has done nothing but “shuffle sand.” It may feel good that you got your data into a “real system,” but that system offers no capabilities to augment the existence of that data, and you’ve actually created more busy work for people to store, transmit, and analyze plans.

Another common culprit for abuse of the PLM system is Master Data Management (MDM). Once again, there’s some pretty logical thinking that deceives you into thinking your PLM system is the right place to try and manage this. Here’s the reasoning:

To achieve good MDM, we need to fix problems at the source
Our product line is developed in the PLM system
Product line managers should own the quality of their data

There are at least two major issues with this logic. First, the odds that your PLM system will completely encompass your product line offering are low. In the direct-to-consumer line of business, it is common to partner with 3rd parties on licensing agreements that allow you to cross-sell products. If you try to build these 3rd party products into your PLM system, you’ll likely violate all of the assumptions your system has in place about how products should be managed. Although very few PLM implementations actually enforce automated rules, just these few extra styles will weaken the overall way that you use your PLM to manage products that your company actually designs.

The other main issue is that product line managers often do not have complete control over their product. Sure, they manage a lot, but teams downstream from them almost assuredly will have their own custom product categorizations. To illustrate, let’s assume that your supply planning team wants to codify how a product should be stocked. If your goal is to have this information in your PLM system, then either your supply planning teams or product line management teams have to enter it somehow.

In the case that your supply planning teammates maintain this data, you’d have to train them up on a system to leverage only a very small portion of it. This is a waste of your supply planning teams’ time, and many PLM systems have per-user licensing agreements that make this unnecessarily expensive. If, on the other hand, you ask the product line manager to maintain this, your data quality and maintenance are surely going to suffer. You’ve asked your product line manager to own data that has nothing to do with their core function, which is a recipe for disaster.

ERP System Overreach

Now that we’ve cast some doubts on the flexibility offered by PLM systems, let’s move our focus onto the next biggest systems culprit - the ERP system.

Towards the beginning part of this millennium, the concept of partnering with an ERP system provider was strongly rooted at many Apparel companies. Not only did ERP providers want to help you manage and track inventory, they wanted to sell you a reporting platform and anything under the sun that your IT organization wanted.

While still strongly rooted, I think this model has started to scale back in recent years. Instead of relying on the Oracles and SAPs of the world for reporting solutions, many organizations have started investing in specialized reporting platform providers like Snowflake and Databricks to handle their data needs. Others may be perfectly content to build their own reporting solution, using a variety of cloud-based offerings and open source tools that have come into vogue within the past 10 years.

However, old habits die hard, and there is a large consulting business within the Apparel space that will still push you towards leveraging ERP systems and the tools that their providers offer for any customizations. ERP systems are inherently inflexible, and their paired reporting solutions are extremely subpar. If you go this route, you are going to use proprietary, poorly understood ERP-provided tools (does anyone still write ABAP?!?). This will make the Oracle and SAP consultants of the world very happy (and rich!), but you’ll continue along with an inflexible, non-specialized solution.

But what if you’ve seen the writing on the wall and avoided buying more than an ERP from ERP vendors for the past decade? Well done for being ahead of the curve, but you still need to be careful when customizing your ERP system to add more data than it is designed to manage.

I usually see this customization manifest itself through custom SQL scripts that modify the underlying database. In rarer cases, this has the downside of locking you into the database underlying your ERP. Unless you strictly write ANSI SQL, you’ve just added to your workload for upgrading or replacing your database.

Another issue is that ERP providers are very strict about what they support when it comes to customization. If you make a mistake and modify the wrong table or schema, you’ve opened yourself up to the possibility that the ERP provider will claim their expensive support agreement as having been violated.

Finally, it’s worth noting that this is a pretty archaic way of managing systems. There are very few other modern day applications where database-level customization is considered a best practice; more commonly, you’d have some type of middleware that affords you safety and flexibility when augmenting your system. Directly modifying the underlying application database is so 90s.
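One way to picture the middleware idea is a thin layer that reads from the ERP and stores your custom attributes in a separate database that you own. The following is a minimal sketch, not a recommended architecture. It uses in-memory SQLite databases as stand-ins for both systems, and every table, column, and ID in it is hypothetical.

```python
import sqlite3

# Stand-in for the vendor-owned ERP database; table and column names are hypothetical.
erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE materials (material_id TEXT PRIMARY KEY, description TEXT)")
erp.executemany(
    "INSERT INTO materials VALUES (?, ?)",
    [("M-100", "Awesome Women's T"), ("M-200", "Awesome Men's T")],
)

# Separate store owned by your team; the ERP schema is never altered.
extensions = sqlite3.connect(":memory:")
extensions.execute(
    "CREATE TABLE material_attributes (material_id TEXT PRIMARY KEY, stocking_class TEXT)"
)
extensions.execute("INSERT INTO material_attributes VALUES (?, ?)", ("M-100", "A"))


def materials_with_attributes(erp_conn, ext_conn):
    """Read materials from the ERP, then enrich each row from the extension store."""
    attrs = dict(
        ext_conn.execute("SELECT material_id, stocking_class FROM material_attributes")
    )
    rows = erp_conn.execute(
        "SELECT material_id, description FROM materials ORDER BY material_id"
    )
    return [
        {"material_id": mid, "description": desc, "stocking_class": attrs.get(mid)}
        for mid, desc in rows
    ]


enriched = materials_with_attributes(erp, extensions)
for row in enriched:
    print(row)

# The ERP table still has only the columns it was created with.
erp_columns = [info[1] for info in erp.execute("PRAGMA table_info(materials)")]
print(erp_columns)
```

Because the custom data lives outside the ERP, upgrading or replacing the ERP database does not require migrating hand-written schema changes, and nothing in the vendor's tables has been touched.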

To be clear, I don’t want this section to read as saying that there is no value to ERP systems. Smaller organizations tend to forgo traditional ERP systems for SaaS solutions to manage production, shipping and inventory. These solutions work up to a certain extent, but they become very difficult to scale. Having a centralized ERP system alongside many standard integrations (vendor management portals, warehouse management modules, point-of-sale systems, etc…) can be of immense value to an organization.

However, you should keep in mind that the benefits of an ERP are mainly relevant to tracking and creating a transactional record of events. When it comes to data and analytics, open-source tooling has completely disrupted the field that ERP vendors tried to dominate in the early 2000s. Open-source tooling is better, and hiring associates with that skill-set is significantly easier. The odds of a young data professional coming out of college knowing how to use SAP or Oracle are pretty low, yet the odds of them knowing how to use tools like Python are high.

Ceding Data to 3rd Parties

As a data professional, this one hurts to see. Once again, let’s talk about the logical steps that get you to this point:

Your marketing team uses a 3rd party tool that is great for
The 3rd party tool promises you they can build you a “single view of the customer”
You start copying all of your data to the third party tool, so you don’t have to manage it yourself

There are a few issues with this. For starters, I’ve yet to come across a 3rd party SaaS provider that effectively manages an organization’s data. Sure, they can likely produce some attractive reports on top of the data that they created, but they simply do not have the capabilities to cover your entire organization. Managing data quality, ensuring systems talk to one another in meaningful ways, and aligning systems with business processes is an insanely complex task. If you think your email marketing platform that you pay for on a subscription basis is going to solve that for you…well I’ve got some snake oil to sell you too!

If you go this route, you need to be aware that you are giving away one of the greatest assets that your organization has. Within the past decade, we’ve continued to see data become more valuable by the year. Whether you are using data for analytical purposes, or you are collecting data that one day may help you augment an AI model for your organization, you should own that.

Of course a 3rd party SaaS provider would love to take this from you. Storage is exceptionally cheap, and, save having some very low-level integration with your third party, the amount of data that you send to them is going to cost peanuts. This will make your finance team happy, but it won’t take long to find that not only have you ceded control of your invaluable treasure chest of data, you’ve locked yourself into a 3rd party provider.

You should also consider that third party SaaS providers are being bought, sold, and acquired at a profound pace; any of these events can greatly change the priorities of that provider. Even large players like Google have had to make sweeping architectural changes to how they solution their products (Universal Analytics -> GA4, anyone?).

Your organization needs your data and they need it to be comprehensive, accurate, and insightful. Before you give your data away to 3rd parties, ask yourself if you truly think that they are going to forever architect a comprehensive solution for your organization, with limited interaction with your business and at the cost of a subscription. Odds are low, so that’s a huge risk to take on.

No Processes for Data Quality Control

This definitely falls under the purview of MDM, but is critical enough to report here as we talk about data in more general terms. Please DO NOT allow your teams to categorize their data differently across business units. This is a short-term win for the business unit that feels like it needs more flexibility to manage its view of products and consumers, but you are proverbially “robbing Peter to pay Paul” when you do this.

This is a top-down failure within your data organization. Each analyst may be happy to have this freedom, but undoubtedly your data and communications will need to cross multiple parts of your organization. Can you imagine trying to cobble together a spreadsheet from three different business units that each refer to your product as:

Awesome Women’s T
Awesome W’s T
Awesome Womens T

As humans, we easily recognize these as the same thing. Computers are pretty stupid though (yes, even in the age of AI) so you’ve introduced work for someone somewhere to try and clean up this mess manually.
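A small sketch shows what this looks like to a computer. The unit counts and the product ID below are invented for illustration. Grouping on the raw text splits one product into three, and the only fix is a hand-maintained mapping from every known spelling to a single identifier.

```python
from collections import defaultdict

# Hypothetical unit sales reported by three business units (numbers invented).
sales = [
    ("Awesome Women's T", 120),
    ("Awesome W's T", 80),
    ("Awesome Womens T", 50),
]


def total_by(rows, key_func):
    """Sum units per key, where key_func decides what counts as the same product."""
    totals = defaultdict(int)
    for name, units in rows:
        totals[key_func(name)] += units
    return dict(totals)


# Grouping on the raw text treats every spelling as a separate product.
raw_totals = total_by(sales, lambda name: name)

# A hand-maintained mapping from each known spelling to one (hypothetical) product ID.
PRODUCT_IDS = {
    "Awesome Women's T": "STYLE-001",
    "Awesome W's T": "STYLE-001",
    "Awesome Womens T": "STYLE-001",
}
clean_totals = total_by(sales, lambda name: PRODUCT_IDS[name])

print(raw_totals)
print(clean_totals)
```

The `PRODUCT_IDS` dictionary is exactly the manual cleanup work described here: someone has to build it, and it breaks the moment a fourth spelling appears.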

What a huge waste of time. I can’t stress this enough. Your poor data quality analyst is going to have to go through, reclassify these, communicate with people to try and establish a best practice, update a multitude of Excel files, etc…all to ultimately produce a non-reproducible report that leaves some ambiguity as to how well this product is being managed.

You may have less control over this on the consumer side and have to pay third party services to help cleanse your data (addresses come to mind), but those services tend to be relatively affordable. On the product side, setting standards up front and measuring adherence to those standards is well within your control. Please do this and save your organization from all of the non-value added data cleansing activities downstream.

Undervaluing Manufacturing Data

Very few large brands in Apparel do their own manufacturing at scale. Historically, large brands in the U.S. have partnered with overseas factories to produce goods at very low prices. As the world has changed and the geo-political climate evolves, it is hard to say how the future of this will shape out, but I sincerely doubt that there will be a radical, overnight shift away from this setup.

With that being the case, manufacturing partners are rarely ever managed within a comprehensive data platform. At a minimum, I’ve seen Excel be the system of choice to transmit data from the manufacturing company to the Apparel brand. If you wanted to get fancy, you might have a quality control system and an ERP integration with your manufacturer, but these aren’t typically technologically savvy integrations.

Is that the best we can do with manufacturing data? Of course not! Think about the potential for automation - wouldn’t it be cool to have more AI trained to automate tasks like folding and sewing? You can find any number of conceptual inventions on that front - here’s one as an example:

https://blogs.nvidia.com/blog/hugging-face-lerobot-open-source-robotics/

Sure, that’s a very crude folding method and it probably won’t change the labor landscape in the coming months. But how fast can we evolve that space? I would imagine that 99.9% of manufacturing data points are simply lost to time because we don’t track them, and we don’t have the incentives to do so. Maybe the next big thing in Apparel comes from having enough video data to train the robotics to perform these tasks at scale and efficiently?

Outside of theoretical future applications, there’s still so much more that can be done in this space right now. If Apparel companies can apply more technology to manufacturing, they could do things like:

Deeply analyze flaws with the production process design
Get near real time updates into supply chain bottlenecks
Send customers ultra-detailed tracking information

If you are vertically integrated, data collection in your manufacturing process may be a killer feature for your organization. Even if you aren’t vertically integrated, an investment now in partners that are inclined down this path may pay dividends later. Like manufacturing in many other spaces, automation is the future. Don’t be naive and think that Apparel will always be made the same way it is today!

How can you fix these problems?

Lean into open source software

When I first started my career, the idea of using open source software to run an organization’s data stack was tantamount to heresy. Thankfully, the perception of open-source software has changed alongside the evolution of cloud offerings; these two things pair well together.

If done correctly, you can create an extremely robust, resilient, and highly performant data architecture that you own. You are no longer beholden to how a SaaS or ERP provider thinks you should run your business, and, quite frankly, open source tooling is light years ahead of any one-stop shop that I’ve found.

For instance, here’s a stack that I’ve found personal success with:

Infrastructure Management through Terraform
Job orchestration through Airflow
Data Modeling through dbt
Data Quality / Testing through dbt
Python as a glue language

I’d be lying if I said that this is all a “one-click” deployment, but I don’t think the bar to implement part or all of this stack is all that high either.

The trend of open source software is something that affects more than just the Apparel industry as well; essentially, deploying a stack like the above just keeps you in line with larger shifts in the data space. Just recently, some of the greatest minds in data teamed up to write The Composable Data Management System Manifesto (https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf). For those with a keen interest in data systems, this is a must read.

Treat data quality as a shared goal

The problem with data quality is that traditional corporate stacks have done a poor job of automation. QA and MDM were often both treated as manual exercises for people to manage, which quickly become unwieldy at large organizations. Fortunately, and thanks to the ever-increasing role of open-source software within organizations, there are very capable tools that can help you manage data quality much more effectively than ever before.

With the tool of your choice, you should be able to orchestrate automated jobs that immediately flag and capture data quality issues. When you find them, you should immediately forward them to the appropriate parties to manage. This affords your organization the flexibility to have different people weave together parts of data that make up the larger picture. Done well, you’ll find far less “busy work” is needed to achieve this goal.

Also be sure to organizationally assign ownership to different pieces of data. If data quality issues arise with fields A, B, and C in your data warehouse, the organization should know who is responsible for maintaining those fields.
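As a rough illustration of the two ideas above (automated checks plus field ownership), here is a minimal sketch in plain Python. The field names, the owner addresses, and the find_issues helper are all hypothetical; in a real stack a tool such as dbt tests would likely fill this role.

```python
from collections import defaultdict

# Hypothetical ownership map: which team answers for which warehouse field.
FIELD_OWNERS = {
    "sku": "merchandising@example.com",
    "unit_cost": "finance@example.com",
    "factory_id": "sourcing@example.com",
}


def find_issues(rows):
    """Group missing-value problems by the owner of the offending field."""
    issues = defaultdict(list)
    for row_number, row in enumerate(rows):
        for field, owner in FIELD_OWNERS.items():
            if row.get(field) is None:
                issues[owner].append((row_number, field))
    return dict(issues)


# Made-up sample data, purely for illustration.
rows = [
    {"sku": "A-100", "unit_cost": 4.25, "factory_id": 7},
    {"sku": "A-101", "unit_cost": None, "factory_id": 7},
    {"sku": None, "unit_cost": 3.10, "factory_id": None},
]

# Each owner only sees the problems in the fields they are responsible for.
for owner, problems in find_issues(rows).items():
    print(owner, problems)
```

The point of the design is that the check and the routing live in the same place, so nobody has to triage a shared inbox of failures by hand.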

Overall the process for improving this is not complicated, but rather historically neglected. With modern tooling, you can turn that history on its head and reap higher quality, more trustworthy reporting.

Embrace Technology in Manufacturing

This last solution is a bit more open-ended because it is highly dependent upon how your company is organized and how it manages any potential manufacturing partnerships. So without offering a blanket solution, I’ll bet your company can be doing more to become tech-forward with its development. If you are of the belief that Apparel will be manufactured the same way in years’ time, I think you’ll find many others in the industry looking to disrupt that thought. As AI becomes more accessible and storage costs go down, the barriers to solving the automation problem are breaking down as well.

So please, do whatever you can on this front to not fall behind. If you are vertically integrated, make sure you have robust production tracking software. If you think your company wants to automate more, start taking videos of your production process to use as training data for AI models. On the flip side, if you rely on third-party manufacturers, make sure you value their technological investments as part of your sourcing strategy. Sure, it may be difficult for them to compete on price with a manufacturer that has no technological investments today, but your supply chain and data management are going to pay those costs in the long run.

Leveraging the Arrow C Data Interface

2024-02-20T00:00:00+00:00

The Arrow C Data Interface is an amazing tool, and while it documents its own potential use cases I wanted to dedicate a blog post to my personal experience using it.

Problem Statement

Transferring data across systems and libraries is difficult and time-consuming. This statement applies not only to compute time but perhaps more importantly to developer time as well.

I first ran into this issue over 5 years ago when I started a library called pantab. At the time, I had just become a core developer of pandas, and through consulting work had been dealing a lot with Tableau. Tableau had just released their Hyper API, which is a way to exchange data to/from their proprietary Hyper database.

Great…, I said to myself, I know a lot of pandas internals and I think writing a DataFrame to a Hyper database will be easier than any other option. Hence, pantab was created.

As you may or may not already be aware, most high-performance Python libraries in the analytics space get their performance from implementing parts of their code base in lower-level languages like C/C++/Rust. So with pantab I set out to do the same thing.

The problem, however, is that pandas did NOT expose any of its internal data structures to other libraries. pantab was forced to hack a lot of things to make this integration “work”, but in a way that was very fragile across pandas releases.

Late in 2023 I decided that pantab was due for a rewrite. Hacking into the pandas internals was not going to work any more, especially as the number of data types that pandas supported started to grow. What pantab needed was an agreement with a library like pandas as to how to exchange low-level data at an extremely high level of performance.

Fortunately, I wasn’t the only person with that idea. Data interchange libraries that weren’t even a thought when pantab started were now a reality, so it was time to test those out.

Status Quo

pantab initially used pandas.DataFrame.itertuples to loop over every row and every element within a DataFrame before writing it out to a Hyper file. While this worked and was faster than what most users would write by hand, it still really wasn’t that fast.

Here is a high level overview of that process, with heavy Python runtime interactions highlighted in red:

A later version of pantab which required a minimum of pandas 1.3 ended up hacking into the internals of pandas, calling something like df._mgr.column_arrays to get a NumPy array for each column in the DataFrame. Combined with the NumPy Array Iterator API, pantab could iterate over raw NumPy arrays instead of doing a loop in Python.

This helped a lot with performance, and while the NumPy Array Iterator API was solid, the pandas internals would change across releases, so it took a lot of developer time to maintain.

The images and comments above assume we are writing a DataFrame to a Hyper file. Going the other way around, pantab would create a Python list of PyObjects and convert to more appropriate data types after everything was read. If we were to graph that process, it would be even more red - not good!

Initial Redesign Attempt - Python DataFrame Interchange Protocol

Before I ever considered the Arrow C Data Interface, my first try at getting high performance and easy data exchange from pandas to Hyper was through the Python DataFrame interchange protocol. While initially promising, this soon became problematic.

For starters, Memory ownership and lifetime is listed as something in scope of the protocol, but is not actually defined. Implementers are free to choose how long a particular buffer should last, and it is up to the client to just know this. After many unexpected segfaults, I started to grow weary of this solution.

Another major issue for the interchange protocol is that Non-Python API standardization (e.g., C/C++ APIs) is explicitly a non-goal. With pantab being a consumer of raw data, this meant I had to know how to manage those raw buffers for every type I wished to consume. While that may not be a huge deal for simple primitive types like sized integers, it leaves much to be desired when you try to work with more complex types like decimals.

Next topic - nullability! Here is the enumeration the protocol specified:

class ColumnNullType(enum.IntEnum):
    """
    Integer enum for null type representation.

    Attributes
    ----------
    NON_NULLABLE : int
        Non-nullable column.
    USE_NAN : int
        Use explicit float NaN value.
    USE_SENTINEL : int
        Sentinel value besides NaN.
    USE_BITMASK : int
        The bit is set/unset representing a null on a certain position.
    USE_BYTEMASK : int
        The byte is set/unset representing a null on a certain position.
    """

    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4

The way the DataFrame Interchange Protocol decided to handle nullability is an area where trying to be inclusive of many different strategies ended up as a detriment to all. Requiring developers to integrate all of these methods across any type they may consume is a lot of effort (particularly for USE_SENTINEL).
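To make that burden concrete, here is a minimal sketch of the branching a consumer ends up writing just to answer "is element i null?". The is_null helper and its parameters are hypothetical, and the least-significant-bit-first ordering in the bitmask branch is an assumption borrowed from Arrow rather than something taken from pantab.

```python
import enum
import math


class ColumnNullType(enum.IntEnum):
    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4


def is_null(kind, values, i, *, null_value=None, mask=None):
    """Hypothetical consumer helper: is element i of a column null?

    null_value is the sentinel (USE_SENTINEL) or the mask value, 0 or 1,
    that marks a null (USE_BITMASK / USE_BYTEMASK); mask is the mask buffer.
    """
    if kind == ColumnNullType.NON_NULLABLE:
        return False
    if kind == ColumnNullType.USE_NAN:
        return math.isnan(values[i])
    if kind == ColumnNullType.USE_SENTINEL:
        return values[i] == null_value
    if kind == ColumnNullType.USE_BYTEMASK:
        return mask[i] == null_value
    if kind == ColumnNullType.USE_BITMASK:
        # Assumes bits are numbered least-significant first within each byte.
        bit = (mask[i // 8] >> (i % 8)) & 1
        return bit == null_value
    raise ValueError(f"unknown null type: {kind}")


print(is_null(ColumnNullType.USE_NAN, [1.0, float("nan")], 1))
print(is_null(ColumnNullType.USE_SENTINEL, [5, -1], 1, null_value=-1))
print(is_null(ColumnNullType.USE_BITMASK, [5, 6, 7], 1, null_value=0, mask=bytes([0b101])))
```

Every one of these branches then has to be repeated, in a lower-level language, for each data type the consumer wants to support.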

Another limitation with the DataFrame Interchange Protocol is the fact that it only talks about how to consume data, but offers no guidance on how to produce it. If starting from your extension, you have no tools or library to manually build buffers. Much like the status quo, this meant reading from a Hyper database to a pandas DataFrame would likely be going through Python objects.

Finally, and related to all of the issues above, the pandas implementation of the DataFrame Interchange Protocol left a lot to be desired. While started with good intentions, it never got the attention needed to make it really effective. I already mentioned the lifetime issues across various data types, but nullability handling was all over the place across types. Metadata was often passed along incorrectly from pandas down through the interface…essentially making it a very high effort for consumers to try and use it.

Arrow C Data Interface to the Rescue

After stumbling around the DataFrame Interchange Protocol for a few weeks, Joris Van den Bossche asked me why I didn’t look at the Arrow C Data Interface. The answer of course was that I was just not very familiar with it. Joris knows a ton about pandas and Arrow, so I figured it best to take his word for it and try it out.

Almost immediately my issues went away. To wit:

Memory ownership and lifetime - well defined at low levels
Non-Python API - for this there is nanoarrow
Nullability handling - uses Arrow bitmasks
Producing buffers - can create (not just read) data
pandas implementation - it just works via PyCapsules

With well defined memory semantics, a low-level API and clean nullability handling, the amount of extension code I had to write was drastically reduced. I felt more confident in the implementation and had to deal with less memory corruption / crashes than before. And, perhaps most importantly, I saved a lot of time.
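For readers who have not seen an Arrow validity bitmap before, the following is a small pure-Python sketch of the idea: one bit per element, least-significant bit first, where 1 means the slot holds a value and 0 means null. The pack_validity and is_valid helpers are hypothetical names for illustration; real code would lean on nanoarrow or pyarrow instead.

```python
def pack_validity(values):
    """Build an Arrow-style validity bitmap: bit i is 1 when values[i] is not None."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant bit first
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Read bit i of the bitmap; True means the slot holds a real value."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


column = [10, None, 30, 40, None, 60, 70, 80, 90]
bitmap = pack_validity(column)
print([is_valid(bitmap, i) for i in range(len(column))])
```

Because there is only this one representation, a consumer writes the null check once and reuses it for every data type.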

See the image below for a high level overview of the process. Note the lack of any red compared to the status quo - this has a very limited interaction with the Python runtime:

Without going too deep in the benchmarks game, the Arrow C Data Interface implementation yielded a 25% performance improvement for me when writing strings. When reading data, it was more like a 500% improvement over what had been previously implemented. Not bad…

My code is no longer tied to the potentially fragile internals of pandas, and with the stability of the Arrow C Data Interface things are far less likely to break when new versions are released.

Bonus Feature - Bring Your Own Library

While it wasn’t my goal at the outset, implementing the Arrow C Data Interface had the benefit of decoupling a dependency on pandas. pandas was the de facto library when pantab was first written, but since then many high quality Arrow-based libraries have popped up.

With the Arrow C Data Interface, pantab now has a bring your own DataFrame library mentality.

>>> import pantab as pt
>>> import pandas as pd
>>> df = pd.DataFrame({"col": [1, 2, 3]})
>>> pt.frame_to_hyper(df, "example.hyper", table="test")

>>> import polars as pl
>>> df = pl.DataFrame({"col": [1, 2, 3]})
>>> pt.frame_to_hyper(df, "example.hyper", table="test")

>>> import pyarrow as pa
>>> tbl = pa.Table.from_pydict({"col": [1, 2, 3]})
>>> pt.frame_to_hyper(tbl, "example.hyper", table="test")

These all produce the same results, and as the author of pantab I did not have to do anything extra to accommodate the various libraries - everything just works.

Closing Thoughts

The Arrow specification is simply put…awesome. While initiatives like the Python DataFrame Protocol have tried to solve the issue of interchange, I don’t believe that goal was ever achieved…until now. The Arrow C Data Interface is the tool developers have always needed to make analytics integrations easy.

pantab is not the first library to take advantage of these features. The Arrow ADBC drivers I previously blogged about are also huge users of nanoarrow / the Arrow C Data Interface, and heavily influenced the design of pantab. The Powered By Apache Arrow project page is the best resource to find others as they get developed in the future.

I, for one, am excited to see Arrow-based tooling grow and make open-source data integrations more powerful than ever before.

Leveraging the ADBC driver in Analytics Workflows

2023-06-16T00:00:00+00:00

The ADBC: Arrow Database Connectivity client API standard is a new standard introduced in January 2023. Sparing some technical details, traditional formats like ODBC/JDBC have operated on data in a row-oriented manner. This made sense at the time those standards were created (in the 1990s) as the databases they targeted were predominantly row-oriented as well. The past decade of analytics has shown a strong inclination towards column-oriented database storage, so using ODBC/JDBC to transfer data means you at a minimum always have to spend resources to translate to/from row- and column-oriented formats.
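To picture the translation step described above, here is a minimal sketch with made-up data. A row-oriented driver hands back one tuple per record, and a columnar consumer has to regroup every value into per-column containers (and the reverse on the way back). The helper names are hypothetical.

```python
# Row-oriented: one tuple per record, as a row-based driver would return.
rows = [
    (1, "shirt", 19.99),
    (2, "shorts", 24.50),
    (3, "hoodie", 54.00),
]


def rows_to_columns(rows):
    """Regroup row tuples into one list per column; touches every value once."""
    return [list(column) for column in zip(*rows)]


def columns_to_rows(columns):
    """The reverse trip, needed when writing columnar data through a row API."""
    return list(zip(*columns))


columns = rows_to_columns(rows)
print(columns)
```

The work grows with the number of values transferred, which is why skipping it entirely, as a columnar interface can, tends to matter more as datasets get larger.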

Many column databases solve the row->column transposition issue by ingesting or exporting columnar file formats like Apache Parquet. This can be an indispensable tool for achieving high throughput, but in going this route you often sacrifice the ecosystem benefits of standard tooling like ODBC/JDBC. Using pandas as an example, I can very easily read/write from almost any database using pd.read_sql and pd.DataFrame.to_sql. This works well for smallish datasets, but when you run into scalability issues you often end up exporting/importing via CSV/parquet, adding more potential points of breakage to your pipelines.

It’s worth noting that even if your source/target database is not columnar, ADBC has an advantage of being implemented at a low level. ADBC is tightly integrated with the Arrow Columnar Format, which essentially means that ADBC can optimally work with the data using its primitive layout in memory. Pandas by contrast does NOT have this, so all of the to_sql and read_sql calls you make in pandas have to do a lot of extra work at runtime to have database communications fit into the pandas data model. This is by no means free and is one of the reasons why SQL interaction in pandas is slow, not to mention all the extra hoops pandas has to jump through to (oftentimes unsuccessfully) manage data types.

To see how much ADBC could help my workflows I decided to test things out against the Python ADBC Postgres Driver and compare it to the functional equivalent in pandas. As of writing the ADBC Postgres driver is still considered experimental, but I encourage you to install it on your own and try it out!

Performance Benchmarking

The following code serves as a crude benchmark for performance. If you’d like to run this on your end, simply tweak PG_URI to match your database configuration.

import functools
import time
from collections.abc import Callable

import numpy as np
import pandas as pd
import pyarrow as pa
import sqlalchemy as sa
from adbc_driver_postgresql import dbapi


PG_URI = "postgresql://"


def print_runtime(func: Callable):

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        runtime = end - start
        print(f"function {func.__name__} took {runtime}")

        return result

    return wrapper


@print_runtime
def write_pandas(df: pd.DataFrame):
    table_name = "pandas_data"
    engine = sa.create_engine(PG_URI)
    df.to_sql(table_name, engine, if_exists="replace", method="multi", index=False)


@print_runtime
def write_arrow(tbl: pa.Table):
    table_name = "arrow_data"
    with dbapi.connect(PG_URI) as conn:
        with conn.cursor() as cur:
            cur.execute(f"DROP TABLE IF EXISTS {table_name}")

        with conn.cursor() as cur:
            cur.adbc_ingest(table_name, tbl)

        conn.commit()

@print_runtime
def read_pandas() -> pd.DataFrame:
    table_name = "pandas_data"
    engine = sa.create_engine(PG_URI)
    return pd.read_sql(f"SELECT * FROM {table_name}", engine)


@print_runtime
def read_arrow() -> pa.Table:
    table_name = "arrow_data"
    with dbapi.connect(PG_URI) as conn:
        with conn.cursor() as cur:
            cur.execute(f"SELECT * FROM {table_name}")
            return cur.fetch_arrow_table()


if __name__ == "__main__":
    np.random.seed(42)
    df = pd.DataFrame(np.random.randint(10_000, size=(100_000, 10)), columns=list("abcdefghij"))
    tbl = pa.Table.from_pandas(df)

    write_pandas(df)
    write_arrow(tbl)

    df_new = read_pandas()
    tbl_new = read_arrow()

Executing this very unscientific benchmark yields the following results on my machine:

function write_pandas took 11.065816879272461
function write_arrow took 1.1672894954681396
function read_pandas took 0.2586965560913086
function read_arrow took 0.0703287124633789

From this we can see the ADBC connector is significantly faster on both read and write. Keep in mind that Postgres is a row-oriented database; my expectation is that the performance benefits would be even bigger for a column-oriented database!
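To put rough numbers on "significantly faster", the ratios can be computed directly from the timings printed above. This is only arithmetic on a single unscientific run, so treat the result as indicative at best.

```python
# Timings in seconds, copied from the benchmark output above.
timings = {
    "write_pandas": 11.065816879272461,
    "write_arrow": 1.1672894954681396,
    "read_pandas": 0.2586965560913086,
    "read_arrow": 0.0703287124633789,
}

write_speedup = timings["write_pandas"] / timings["write_arrow"]
read_speedup = timings["read_pandas"] / timings["read_arrow"]
print(f"write: {write_speedup:.1f}x faster, read: {read_speedup:.1f}x faster")
```

On this run that works out to roughly a 9x speedup on writes and between 3x and 4x on reads.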

Better Data Types

If you’ve worked with pandas in an ETL workflow, chances are high that you’ve had to do some post-processing on numeric data. This happens often with nullable integral data (which the NumPy backend to pandas physically cannot store), but can also happen for many other reasons that differ across databases / driver implementations. For the sake of illustration, let’s append a row of NULL values to our tables.

INSERT INTO arrow_data VALUES (NULL);
INSERT INTO pandas_data VALUES (NULL);

This has no impact on the arrow code we wrote previously

>>> tbl_new = read_arrow()
>>> tbl.schema == tbl_new.schema
True

But will impact the pandas code

>>> df_new = read_pandas()
>>> (df.dtypes == df_new.dtypes).all()
False

Even though nothing changed with the data type in the database, we’ve gone from using integral data in pandas / postgres to now introducing float data in pandas, solely due to the introduction of NULL values in postgres. This can come up unexpectedly and be very surprising. To prevent this on the pandas side, you will see things like:

>>> df_new = df_new.astype("Int32")

>>> table_name = "pandas_data"
>>> engine = sa.create_engine(PG_URI)
>>> df_new = pd.read_sql_query(f"SELECT * FROM {table_name}", engine, dtype="Int32")

OR (starting in pandas 2.0)

>>> table_name = "pandas_data"
>>> engine = sa.create_engine(PG_URI)
>>> df_new = pd.read_sql(f"SELECT * FROM {table_name}", engine,
...   dtype_backend="pyarrow")

These are 3 different ways to solve the problem, each introducing their own subsequent nuance. If you already knew about the issues with nullable integral data and the NumPy backend in pandas then maybe this isn’t surprising, but not every user has or needs to have that low-level of an understanding of pandas. This was also a controlled example; in the real world you either need to be overly defensive or open to surprise when minor changes in your database data change your pandas data types and subsequent workflows. With the ADBC driver you do not have this issue; the data type you read is simply inferred from the database metadata.

Closing Thoughts

I for one am really excited to see how ADBC continues to evolve. Moving data from one database to another takes up a significant amount of my time as a data engineer, and the ability to do that faster with cleaner data types will be powerful. As more databases (particularly columnar ones) implement Arrow Flight SQL or at least provide ADBC clients I expect a lot of ETL tools to start leveraging ADBC drivers in turn.

Comparing Cython to Rust - Evaluating Python Extensions

2023-05-17T00:00:00+00:00

Rust as a language has had tremendous growth in recent years. With no intention of starting a language war, Rust has a much stronger type checking system than a language like C, and arguably feels more approachable than a language like C++. It also includes thread safety as part of the language, which can be immensely useful for those looking to optimize their system.

Rust is also growing in usage as an extension language for Python. PyO3 makes writing extensions relatively easy, especially when compared to the same toolchain(s) for C/C++ extensions. While not as “pythonic” as Cython, you can argue that Rust is more approachable to Python developers than C/C++ are as languages. To see it in action, let’s compare a Cython-written extension to a Rust-written extension.

For demonstration purposes we are taking a trivial example of a custom-implemented max function along the columns of a NumPy array. The example is admittedly naive (NumPy natively can handle this), but as a developer you may find yourself following a similar pattern for custom algorithms.

The source code for these exercises is available on my GitHub.
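Before looking at the compiled versions, it may help to see the algorithm in plain Python. This reference sketch is not part of the original benchmark and would be far too slow for the array sizes used later; the name find_max_reference is hypothetical, and it works on a non-empty list of equal-length rows rather than a NumPy array.

```python
def find_max_reference(values):
    """Return the largest value in each column of a non-empty list of rows."""
    n_cols = len(values[0])
    out = []
    for colnum in range(n_cols):
        val = values[0][colnum]  # seed with the first row's value
        for row in values[1:]:
            if val < row[colnum]:
                val = row[colnum]
        out.append(val)
    return out


grid = [
    [3, 9, -1],
    [7, 2, -5],
    [4, 8, -3],
]
print(find_max_reference(grid))  # [7, 9, -1]
```

Both extensions below follow the same outer loop over columns and inner loop over rows; they differ in seeding the running maximum with the smallest representable integer instead of the first row.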

Coding the example in Cython

Here is our find_max function with a relatively optimized Cython implementation. Within a cdef function, we determine the bounds of a 2D int64 array, loop over the columns / rows and evaluate each member of the array, looking for the largest value in each column.

cimport cython
from libc.limits cimport LLONG_MIN
import numpy as np
from numpy cimport ndarray, int64_t
import time

@cython.boundscheck(False)
@cython.wraparound(False)
cdef ndarray[int64_t, ndim=1] _find_max(ndarray[int64_t, ndim=2] values):
    cdef:
        ndarray[int64_t, ndim=1] out
        int64_t val, colnum, rownum, new_val
        Py_ssize_t N, K

    N, K = (<object>values).shape
    out = np.zeros(K, dtype=np.int64)
    for colnum in range(K):
        val = LLONG_MIN  # imperfect assumption, but no INT64_T_MIN from numpy
        for rownum in range(N):
            new_val = values[rownum, colnum]
            if val < new_val:
                val = new_val

        out[colnum] = val

    return out


def find_max(ndarray[int64_t, ndim=2] values):
    cdef ndarray[int64_t, ndim=1] result
    start = time.time_ns()
    result = _find_max(values)
    end = time.time_ns()
    duration = (end - start) / 1_000_000
    print(f"cypy took {duration} milliseconds")
    return result

For brevity I won’t be listing out the instructions to cythonize and build a shared library, but if you need you can follow similar instructions from the previous article on debugging Cython extensions with gdb. For this article, assume that this gets built to a shared library named cypy.

Building the same in Rust

PyO3 will be our tool for setting up Rust <> Python interoperability. Per their documentation on building modules we could choose to build manually or use maturin. For ease of demonstration we will use the latter.

$ maturin new rustpy
$ cd rustpy

Within our newly created project, add numpy = "0.18" to the dependencies section of Cargo.toml. This will let us use the rust-numpy crate to pass NumPy arrays between Python and Rust. Afterwards, open lib.rs and insert the following code:

use numpy::ndarray::{Array1, ArrayView2, Axis};
use numpy::{IntoPyArray, PyArray1, PyReadonlyArray2};
use pyo3::{pymodule, types::PyModule, PyResult, Python};
use std::time::SystemTime;

#[pymodule]
#[pyo3(name = "rustpy")]
fn rust_ext(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    fn find_max(arr: ArrayView2<'_, i64>) -> Array1<i64> {
        let mut out = Array1::default(arr.ncols());

        for (i, col) in arr.axis_iter(Axis(1)).enumerate() {
            let mut val = i64::MIN;
            for x in col {
                if val < *x {
                    val = *x;
                }
            }

            out[i] = val;
        }

        out
    }

    #[pyfn(m)]
    #[pyo3(name = "find_max")]
    fn find_max_py<'py>(py: Python<'py>, x: PyReadonlyArray2<'_, i64>) -> &'py PyArray1<i64> {
        let start = SystemTime::now();
        let result = find_max(x.as_array()).into_pyarray(py);
        let end = SystemTime::now();
        let duration = end.duration_since(start).unwrap();
        println!("rustpy took {} milliseconds", duration.as_millis());
        result
    }

    Ok(())
}

Studying the above closely, the find_max_py function is the bridge between Rust and Python, and it ultimately dispatches to the find_max function. That function accepts a 2 dimensional view of an array, and returns a newly created 1D array full of 64 bit integers. Within the function body, you see the dynamic creation of the return value, as well as iteration by column. While the semantics vary, you should see that this follows the same general outline as our Cython implementation.

With this in place, run maturin develop --release from the project root. This will take care of installing the local source code into a Python package with optimizations.

Comparing Results

Both implementations above include not-very-scientific timers to give us an idea of general performance. Let’s set up with the following code:

>>> import numpy as np
>>> np.random.seed(42)
>>> arr = np.random.randint(100_000, size=(100, 1_000_000))

Let’s check our cypy performance:

>>> import cypy
>>> result1 = cypy.find_max(arr)
cypy took 273.319301 milliseconds

Versus the same function implemented in Rust:

>>> import rustpy
>>> result2 = rustpy.find_max(arr)
rustpy took 116 milliseconds
>>> (result1 == result2).all()
True

The Rust implementation took only ~42% of the time - not bad!

Parallelization

Another area where Rust extensions can really shine is in parallelization, due to the aforementioned language guarantees of thread safety. Cython offers parallelization using OpenMP, but as I recently discovered there are quite a few downsides to that when it comes to packaging, usability and cross-platform behavior.

Since Rust handles this more natively, let’s see how it would tackle the above code but in a parallel way. For this purpose we are going to use the rayon feature that comes bundled with the Rust ndarray crate. To enable that, go ahead and add ndarray = {version = "0.15", features=["rayon"]} to your dependencies in Cargo.toml.

Afterwards we are going to add 2 new functions to our rustpy library - one to handle the internals and the other to serve as the bridge to Python. For starters, let us update the imports at the top of our module:

use ndarray::parallel::prelude::*;
use numpy::ndarray::{Array1, ArrayView2, Axis, Zip};
use numpy::{IntoPyArray, PyArray1, PyReadonlyArray2};
use pyo3::{pymodule, types::PyModule, PyResult, Python};
use std::sync::{Arc, Mutex};
use std::time::SystemTime;

Then go ahead and add the following code below the find_max_py function.

fn find_max_parallel(arr: ArrayView2<'_, i64>) -> Array1<i64> {
    let mutex = Arc::new(Mutex::new(Array1::default(arr.ncols())));

    // parallel iterator is not implemented, so some hacks
    // https://github.com/rust-ndarray/ndarray/issues/1043
    // https://github.com/rust-ndarray/ndarray/issues/1093
    Zip::indexed(arr.axis_iter(Axis(1)))
        .into_par_iter()
        .for_each(|(i, col)| {
            let mut val = i64::MIN;
            for x in col {
                if val < *x {
                    val = *x;
                }
            }

            let mut guard = mutex.lock().unwrap();
            guard[i] = val;
        });

    // https://stackoverflow.com/questions/29177449/how-to-take-ownership-of-t-from-arcmutext
    let lock = Arc::try_unwrap(mutex).expect("Lock still have multiple owners");
    lock.into_inner().expect("Mutex cannot be locked")
}

// wrapper of `find_max_parallel`
#[pyfn(m)]
#[pyo3(name = "find_max_parallel")]
fn find_max_py_parallel<'py>(
    py: Python<'py>,
    x: PyReadonlyArray2<'_, i64>,
) -> &'py PyArray1<i64> {
    let start = SystemTime::now();
    let result = find_max_parallel(x.as_array()).into_pyarray(py);
    let end = SystemTime::now();
    let duration = end.duration_since(start).unwrap();
    println!("rustpy parallel took {} milliseconds", duration.as_millis());
    result
}

Within the comments I’ve linked some GitHub issues and a StackOverflow question that you may find of interest. At a high level, now that we want to execute things in parallel we need to implement a Mutex to prevent data races. We also use a thread-safe reference counter Arc; using these in tandem is a common pattern in Rust.

So how does this compare performance-wise to our examples above?

>>> import rustpy
>>> result3 = rustpy.find_max_parallel(arr)
rustpy parallel took 234 milliseconds
>>> (result2 == result3).all()
True

We get the same results which is great, but compared to the non-parallel implementation we are now slower - almost twice as slow. What gives?!?

Without peering into every detail, it goes without saying that there is “no such thing as a free lunch”. Using the mutex to synchronize parallel code above is no exception, and likely the cost of that synchronization far exceeds the benefit of it. Keep in mind that we are dealing with an array of 100 x 1_000_000 and attempting to synchronize a thread per column. That’s a lot of threads to operate on rows of 100 records!

What happens if we transpose the array?

>>> arr2 = arr.T
>>> arr2.shape
(1000000, 100)
>>> rustpy.find_max(arr2)
rustpy took 67 milliseconds
>>> rustpy.find_max_parallel(arr2)
rustpy parallel took 38 milliseconds

That’s more like it! Whereas before we created 1_000_000 threads to operate on arrays of 100 records, now we use 100 threads to operate on arrays of 1_000_000 records. The relative cost of starting / stopping threads and synchronizing access via the mutex in this case is far lower than the relative performance gain we get from allowing threads to operate on large arrays in parallel.
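A back-of-the-envelope way to see the difference is to count units of work rather than time them. The sketch below is a hypothetical helper that only restates the shapes used above: find_max_parallel runs its closure, and takes the mutex, once per column, so the column count drives the overhead while the row count drives the useful work between lock acquisitions.

```python
def work_profile(shape):
    """Describe the per-column work for an (n_rows, n_cols) array."""
    n_rows, n_cols = shape
    return {
        "work_units": n_cols,         # one closure call per column
        "lock_acquisitions": n_cols,  # the mutex is taken once per column
        "elements_per_unit": n_rows,  # values scanned between lock acquisitions
    }


original = work_profile((100, 1_000_000))
transposed = work_profile((1_000_000, 100))
print(original)
print(transposed)
```

Both layouts scan the same 100 million values in total, but the transposed one pays the per-unit overhead 10,000 times less often.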

Even Faster Parallelization

Irv Lustig had an idea that we could do away with the mutex, which would reduce the parallelization overhead of synchronizing access to the out variable. Internally the NumPy array manages its data in a contiguous array of memory, and indexing methods like out[i] just point to a location in memory that is i steps away from the start of that array. Because each thread manages its own value of i, each thread also writes to a unique memory location without any overlap. Careful attention paid to this fact makes the synchronization unnecessary.
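The indexing scheme itself is easy to sketch in Python with the standard threading module. One loud caveat: Python's global interpreter lock means this sketch says nothing about data races or performance; it only illustrates the assumption that each worker owns exactly one output slot. The column_max function is a hypothetical name, not part of rustpy.

```python
import threading


def column_max(columns):
    """Compute max(column) per column, with each thread writing only out[i]."""
    out = [None] * len(columns)  # preallocated, one slot per column

    def worker(i, column):
        out[i] = max(column)  # no lock: slot i belongs to this worker alone

    threads = [
        threading.Thread(target=worker, args=(i, column))
        for i, column in enumerate(columns)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return out


print(column_max([[1, 5, 2], [9, 3], [-4, -2]]))  # [5, 9, -2]
```

The correctness argument rests entirely on the indices being distinct, which is exactly the guarantee the Rust compiler cannot verify on its own.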

Rust by default is skeptical of this, so we have to jump through a few hoops to make it work. Stepwise the first thing we wanted to do was get rid of the Mutex. However, Rust will reject the following code:

let mut out = Array1::default(arr.ncols());

Zip::indexed(arr.axis_iter(Axis(1)))
    .into_par_iter()
    .for_each(|(i, col)| {
        let mut val = i64::MIN;
        for x in col {
            if val < *x {
                val = *x;
            }
        }

        out[i] = val;
    });
out

With the following error

error[E0596]: cannot borrow `out` as mutable, as it is a captured variable in a `Fn` closure

As explained in this link the closure cannot use a mutable reference (here the out variable) defined outside of its scope. To make this possible we use the UnsafeCell primitive. Our first attempt to do so could look something like this:

let mut out = Array1::default(arr.ncols());
let uout = UnsafeCell::new(&mut out);

...
// Let's assume we are within the closure
   (*uout.get())[i] = val;
});

out

Alas things aren’t so simple. This will in turn yield another error

error[E0277]: `UnsafeCell<&mut ArrayBase<OwnedRepr<i64>, Dim<[usize; 1]>>>` cannot be shared between threads safely

...

 = help: within `[closure@src/lib.rs:56:23: 56:33]`, the trait `Sync` is not implemented for `UnsafeCell<&mut ArrayBase<OwnedRepr<i64>, Dim<[usize; 1]>>>`

If you look carefully, the note that the trait Sync is not implemented... means Rust isn’t happy that we are trying to share that object across threads without the Sync trait being implemented on it. Some research will take us to the SyncUnsafeCell. This object implements the Sync trait, but as of writing is only available in nightly builds. While it is something to track, it does not help us today.

To work around this, user Alice Ryhl over at StackOverflow came up with this nifty solution. Alice’s code works generically for slices; the implementation we have specializes only to Array1 types, but keeps the same structure in place.

At a high level, instead of using the UnsafeCell directly, we create our own structure that uses the UnsafeCell as a field member. The custom structure provides blank trait implementations for Send and Sync so the compiler is happy to let it work across threads. With that in place, we can call the write member function from within our threads.

// https://stackoverflow.com/questions/65178245/how-do-i-write-to-a-mutable-slice-from-multiple-threads-at-arbitrary-indexes-wit
#[derive(Copy, Clone)]
struct UnsafeArray1<'a> {
    array: &'a UnsafeCell<Array1<i64>>,
}

unsafe impl<'a> Send for UnsafeArray1<'a> {}
unsafe impl<'a> Sync for UnsafeArray1<'a> {}

impl<'a> UnsafeArray1<'a> {
    pub fn new(array: &'a mut Array1<i64>) -> Self {
        let ptr = array as *mut Array1<i64> as *const UnsafeCell<Array1<i64>>;
        Self {
            array: unsafe { &*ptr },
        }
    }

    /// SAFETY: It is UB if two threads write to the same index without
    /// synchronization.
    pub unsafe fn write(&self, i: usize, value: i64) {
        let ptr = self.array.get();
        (*ptr)[i] = value;
    }
}

fn find_max_unsafe(arr: ArrayView2<'_, i64>) -> Array1<i64> {
    let mut out = Array1::default(arr.ncols());
    let uout = UnsafeArray1::new(&mut out);

    Zip::indexed(arr.axis_iter(Axis(1)))
        .into_par_iter()
        .for_each(|(i, col)| {
            let mut val = i64::MIN;
            for x in col {
                if val < *x {
                    val = *x;
                }
            }

            unsafe { uout.write(i, val) };
        });

    out
}

#[pyfn(m)]
#[pyo3(name = "find_max_unsafe")]
fn find_max_py_unsafe<'py>(py: Python<'py>, x: PyReadonlyArray2<'_, i64>) -> &'py PyArray1<i64> {
    let start = SystemTime::now();
    let result = find_max_unsafe(x.as_array()).into_pyarray(py);
    let end = SystemTime::now();
    let duration = end.duration_since(start).unwrap();
    println!("rustpy unsafe took {} milliseconds", duration.as_millis());
    result
}

Turning off bounds checking

Since we are running unsafe code blocks, we also have the ability to disable bounds checking on our arrays. In Cython you would typically do this with the @cython.boundscheck(False) decorator. With the ndarray Rust crate you would replace the index operator [] with uget or uget_mut. For us, this means changing the write implementation of the UnsafeArray1 struct to:

pub unsafe fn write(&self, i: usize, value: i64) {
    let ptr = self.array.get();
    *(*ptr).uget_mut(i) = value;
}

So how does this compare function wise?

>>> res1 = cypy.find_max(arr)
cypy took 284.153331 milliseconds
>>> res2 = rustpy.find_max(arr)
rustpy took 113 milliseconds
>>> res3 = rustpy.find_max_parallel(arr)
rustpy parallel took 223 milliseconds
>>> res4 = rustpy.find_max_unsafe(arr)
rustpy unsafe took 47 milliseconds
>>> ((res1 == res2) & (res1 == res3) & (res1 == res4)).all()
True

Compared to our initial Cython implementation, our unsafe threaded implementation takes about 16.5% of the runtime. Not bad.
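The percentage quoted here is just the ratio of two timings from the session above. As a quick check, the sketch below recomputes each implementation's runtime relative to the Cython baseline. The dictionary keys are labels chosen for this example, and the numbers are copied from the printed output.

```python
# Timings in milliseconds, copied from the interpreter session above.
timings_ms = {
    "cypy": 284.153331,
    "rustpy": 113,
    "rustpy parallel": 223,
    "rustpy unsafe": 47,
}

baseline = timings_ms["cypy"]

# Express every timing as a percentage of the Cython baseline.
relative = {name: 100 * ms / baseline for name, ms in timings_ms.items()}

for name, pct in relative.items():
    print(f"{name}: {pct:.1f}% of the Cython runtime")
```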

The benchmarks above were recorded on a Lemur Pro laptop with a 12th Gen Intel(R) Core(TM) i7-1255U processor and 12 logical cores. Results will vary depending on your hardware and OS. If you want more control over the degree of parallelization than that which comes out of the box, be advised that this all dispatches to rayon under the hood. Rayon uses one thread per CPU by default. You could accept an argument into your extension function that limits the number of threads being spawned at one time, or alternately you can set the RAYON_NUM_THREADS environment variable.

On my machine, if I run RAYON_NUM_THREADS=2 python and within the interpreter execute rustpy.find_max_parallel(arr), I get the response that rustpy parallel took 71 milliseconds. This is an improvement over the default parallel implementation, which as we noted in the previous section introduced a lot of overhead with thread synchronization when arrays had a large number of columns and a relatively small number of rows.
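If you would rather set the variable from a Python script than from the shell, one option is to put it in the environment of the process that will load the extension. The sketch below only demonstrates the mechanics, using a child interpreter that echoes the variable back. It assumes that rayon reads RAYON_NUM_THREADS when its global thread pool is first created, and env_with_thread_limit is a helper name made up for this example.

```python
import os
import subprocess
import sys


def env_with_thread_limit(num_threads: int) -> dict:
    """Return a copy of the current environment with RAYON_NUM_THREADS set."""
    env = dict(os.environ)
    env["RAYON_NUM_THREADS"] = str(num_threads)
    return env


# Stand-in for a script that would import the extension; it only echoes the variable.
child_code = "import os; print(os.environ['RAYON_NUM_THREADS'])"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env_with_thread_limit(2),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

This is equivalent to prefixing the command with RAYON_NUM_THREADS=2 in the shell; the child process sees the limit before any extension code runs.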

Closing Thoughts

From my initial trials I was very surprised by how good Rust was for building extensions. The language itself is pretty natural in a way that I think could be useful to higher-level programmers, while offering great performance at the same time. Not pictured in the above analysis were a ton of mistakes in trying to get code parallelized via Rust. In C/C++ I likely would have made a very buggy program; the Rust compiler prevented me from doing so here. In all, I think Rust can creep into the same realm that Cython occupies today and become a serious competitor for easy extension authoring.

I also want to mention Irv Lustig, Brock Mendel, Marc Garcia and Nathan Goldblum for their help in implementing and improving this article. Thanks all for your help and support!

Fundamental Python Debugging Part 3 - Cython Extensions

2023-03-10T00:00:00+00:00

For the unaware, Cython is a transpiler from a Python-like syntax into C files. This gets you close to C performance while writing files that aren’t that dissimilar from Python. It is used extensively in the scientific Python community to generate high-performance extensions. A common approach to optimize Python libraries is to make sure you are as efficient as possible in pure Python, before building your code in Cython, and commonly as a last resort writing your C/C++ extensions by hand.

In spite of this pattern we are introducing Cython as the third part of the debugging series, after already having debugged C extensions. Why is that? Well, it turns out that the Cython debugger is in fact a gdb python extension, which we saw CPython also leverage in the last chapter. We aren’t doing anything novel in this chapter but just walking through some of the conveniences the cygdb extension provides (interested users can find the source code here).

If you haven’t read the previous article on debugging Python extensions with gdb, I highly recommend that you do so before continuing here. Although writing Cython can be thought of as a stepping stone to writing C/C++ extensions, the inverse is true when it comes to debugging.

Setting up our environment

For this chapter we will leverage the same image as in the last, so start with:

docker pull willayd/cpython-debugging

In addition to the items outlined in the previous chapter, this image also includes Cython as a pip-installed package. If you don’t care to use the docker image you can also follow the instructions in the Debugging your Cython program documentation, but be aware that some of the interactions between Cython, gdb and Python aren’t very intuitive, especially if using Python installed in a virtual environment.

If using the docker image above, be sure to run it as a container and mount a local directory for development into the container at /data. As in the previous section, I will be putting my work in a directory called ~/code-demos.

willayd@willayd:~$ docker run --rm -it -w /data -v ${HOME}/code-demos:/data willayd/cpython-debugging

Build our first Cython extension

We are going to start with the same extension we created in the previous chapter. Let’s create a file named debugging_cython.pyx in the folder on your computer that you mounted into docker and insert these contents:

def say_hello_and_return_none():
    print("Hello from the Cython extension")

That’s it! From here we now have two steps we need to follow to get this converted into an importable extension:

Transpile the Cython file into a C module
Build a shared library from the C module

The cython command can help us with Step 1; Step 2 builds on a lot of knowledge from the previous chapter. Here are the commands:

root@f241800d6a12:/data# cython --gdb debugging_cython.pyx
root@f241800d6a12:/data# gcc -g3 -Wall -Werror -std=c17 -shared -fPIC -I/usr/local/include/python3.10d debugging_cython.c -o debugging_cython.so

With the extension built, you can import the module and call the function.

root@f241800d6a12:/data# python3
>>> import debugging_cython
>>> debugging_cython.say_hello_and_return_none()
Hello from the Cython extension

Using cygdb

If you inspect the output of debugging_cython.c which was generated in the previous section, you could debug it using gdb as if it were a normal C module, because it is. It certainly doesn’t look like anything that you would have written by hand, but there isn’t any real magic to what is happening here; Cython takes Python-like code and transpiles a C file out of it. The rest of the tooling that we’ve seen in the previous chapter can pick things up from there. However, because the file was auto-generated you lose a lot of the abstractions that you get from writing Python-like code, and end up stepping through a tangled web of variables you aren’t familiar with in gdb. pdb cannot debug Cython files for us, so we need to use cygdb. We can then set a breakpoint at our function using the cy break command and open up a Python interpreter with cy run.

root@fad66408f996:/data# cygdb
(gdb) cy break say_hello_and_return_none
Function "__pyx_pw_16debugging_cython_1say_hello_and_return_none" not defined.
Breakpoint 1 (__pyx_pw_16debugging_cython_1say_hello_and_return_none) pending.
(gdb) cy run
Python 3.10.10+ (heads/3.10:bac3fe7, Feb 22 2023, 05:56:35) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

With the Python interpreter running let us import and execute our function.

>>> import debugging_cython
>>> debugging_cython.say_hello_and_return_none()

Breakpoint 1, __pyx_pw_16debugging_cython_1say_hello_and_return_none (__pyx_self=0x0, unused=0x0) at debugging_cython.c:1202
1202   PyObject *__pyx_r = 0;
1    def say_hello_and_return_none():

We’ve hit a breakpoint at line 1202 of the generated debugging_cython.c file. The commands the Cython debugger exposes are not really that different from what we saw with gdb in the previous chapter. The difference is that the gdb built-in commands will work as if you are debugging debugging_cython.c, whereas the cygdb commands will work as if you are debugging debugging_cython.pyx. Inputting list and then cy list will help us see this in action:

(gdb) list
1197
1198 /* Python wrapper */
1199 static PyObject *__pyx_pw_16debugging_cython_1say_hello_and_return_none(PyObject *__pyx_self, CYTHON_UNUSED PyObject *unused); /*proto*/
1200 static PyMethodDef __pyx_mdef_16debugging_cython_1say_hello_and_return_none = {"say_hello_and_return_none", (PyCFunction)__pyx_pw_16debugging_cython_1say_hello_and_return_none, METH_NOARGS, 0};
1201 static PyObject *__pyx_pw_16debugging_cython_1say_hello_and_return_none(PyObject *__pyx_self, CYTHON_UNUSED PyObject *unused) {
1202   PyObject *__pyx_r = 0;
1203   __Pyx_RefNannyDeclarations
1204   __Pyx_RefNannySetupContext("say_hello_and_return_none (wrapper)", 0);
1205   __pyx_r = __pyx_pf_16debugging_cython_say_hello_and_return_none(__pyx_self);
1206
(gdb) cy list
>    1    def say_hello_and_return_none():
     2        print("Hello from the Cython extension")

help cy gives a nice overview within gdb of the available commands. It is a much smaller set of commands than what gdb offers, but should cover the majority of needs in normal development.

(gdb) help cy

    Invoke a Cython command. Available commands are:

        cy import
        cy break
        cy step
        cy next
        cy run
        cy cont
        cy finish
        cy up
        cy down
        cy select
        cy bt / cy backtrace
        cy list
        cy print
        cy set
        cy locals
        cy globals
        cy exec

...
Type "help cy" followed by cy subcommand name for full documentation.
Type "apropos word" to search for commands related to "word".
Type "apropos -v word" for full documentation of commands related to "word".
Command name abbreviations are allowed if unambiguous.

cpdef functions

Our previous program leveraged a def function, which Cython makes callable from the Python interpreter. Cython also offers cdef functions (not callable from Python) and cpdef functions, which essentially generate a def and a cdef for you. A detailed explanation of why you would choose those is outside the scope of this article; if you need a primer be sure to check out the wonderful Cython language basics documentation.

For debugging purposes, let’s create debugging_cython2.pyx and change our function from def to cpdef.

cpdef say_hello_from_cpdef():
    print("Hello from the cpdef function")

If you are still running cygdb from the previous section, go ahead and exit to get back to your standard terminal. From there, we want to transpile and create our new shared library:

root@f241800d6a12:/data# cython --gdb debugging_cython2.pyx
root@f241800d6a12:/data# gcc -g3 -Wall -Werror -std=c17 -shared -fPIC -I/usr/local/include/python3.10d debugging_cython2.c -o debugging_cython2.so

Fire up cygdb again and set another breakpoint on that function:

(gdb) cy break say_hello_from_cpdef
Function "__pyx_f_17debugging_cython2_say_hello_from_cpdef" not defined.
Breakpoint 1 (__pyx_f_17debugging_cython2_say_hello_from_cpdef) pending.
Function "__pyx_pw_17debugging_cython2_1say_hello_from_cpdef" not defined.
Breakpoint 2 (__pyx_pw_17debugging_cython2_1say_hello_from_cpdef) pending.

What is interesting here is that we now have 2 breakpoints! The reason for this again is that cpdef generates two functions for us - one purely accessible from C and one accessible from Python. Go ahead and cy run to get the Python interpreter started; we will then run cy cont to continue past each breakpoint.

(gdb) cy run
Python 3.10.10+ (heads/3.10:bac3fe7, Feb 22 2023, 05:56:35) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import debugging_cython2
>>> debugging_cython2.say_hello_from_cpdef()

Breakpoint 2, __pyx_pw_17debugging_cython2_1say_hello_from_cpdef (__pyx_self=, unused=0x0) at debugging_cython2.c:1227
1227   PyObject *__pyx_r = 0;
(gdb) cy list
  1222    }
  1223
  1224    /* Python wrapper */
  1225    static PyObject *__pyx_pw_17debugging_cython2_1say_hello_from_cpdef(PyObject *__pyx_self, CYTHON_UNUSED PyObject *unused); /*proto*/
  1226    static PyObject *__pyx_pw_17debugging_cython2_1say_hello_from_cpdef(PyObject *__pyx_self, CYTHON_UNUSED PyObject *unused) {
> 1227      PyObject *__pyx_r = 0;
  1228      __Pyx_RefNannyDeclarations
  1229      __Pyx_RefNannySetupContext("say_hello_from_cpdef (wrapper)", 0);
  1230      __pyx_r = __pyx_pf_17debugging_cython2_say_hello_from_cpdef(__pyx_self);
  1231
(gdb) cy cont

Breakpoint 1, __pyx_f_17debugging_cython2_say_hello_from_cpdef (__pyx_skip_dispatch=0) at debugging_cython2.c:1194
1194   PyObject *__pyx_r = NULL;
1    cpdef say_hello_from_cpdef():
(gdb) cy list
>    1    cpdef say_hello_from_cpdef():
     2        print("Hello from the cpdef function")
(gdb) cy cont
Hello from the cpdef function
>>> quit()
[Inferior 1 (process 105) exited normally]

Note that the cy list in the first breakpoint lists C source code, whereas the second cy list shows the Cython source code. Given the purpose of cpdef this may not be too surprising, but it may be confusing to new users.
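The symbol names in the breakpoint messages follow a visible pattern: a __pyx_ prefix, a kind tag (pw for the Python wrapper, f for the C-level function), the length of the module name, the module name itself, and then the function name, with the wrapper carrying an extra number in front of it. The helper below simply reproduces the names seen in the output above. It is inferred from that output alone, is not a documented Cython interface, and the function and parameter names are made up for illustration.

```python
from typing import Optional


def guess_pyx_symbol(module: str, func: str, kind: str, index: Optional[int] = None) -> str:
    """Rebuild a generated C symbol name following the pattern seen in the gdb output."""
    # The module name is prefixed with its own length, e.g. 17debugging_cython2.
    mangled_module = f"{len(module)}{module}"
    # The Python wrappers ("pw") carried a small integer before the function name.
    mangled_func = func if index is None else f"{index}{func}"
    return f"__pyx_{kind}_{mangled_module}_{mangled_func}"


# The two symbols that cy break reported for the cpdef function.
print(guess_pyx_symbol("debugging_cython2", "say_hello_from_cpdef", "f"))
print(guess_pyx_symbol("debugging_cython2", "say_hello_from_cpdef", "pw", 1))
```

Knowing the two names can help when you want a plain gdb breakpoint on only one of the generated functions, though the exact mangling may differ between Cython versions.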

Managing cy break breakpoints

While cy break lets you create breakpoints, it does not give you any tools to delete, enable, or disable them. However, you can work around this issue by using gdb's native commands for managing breakpoints, which we detailed in the previous debugging article. Continuing with our example above, an info break yields the following:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00007f1da010581d in __pyx_f_17debugging_cython2_say_hello_from_cpdef at debugging_cython2.c:1194
     breakpoint already hit 2 times
2       breakpoint     keep y   0x00007f1da01058c6 in __pyx_pw_17debugging_cython2_1say_hello_from_cpdef at debugging_cython2.c:1227
     breakpoint already hit 3 times

If you didn’t want the first breakpoint to be hit from Cython, you could delete 1 or disable 1.

(gdb) disable 1
(gdb) cy run
Python 3.10.10+ (heads/3.10:bac3fe7, Feb 22 2023, 05:56:35) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import debugging_cython2
>>> debugging_cython2.say_hello_from_cpdef()

Breakpoint 2, __pyx_pw_17debugging_cython2_1say_hello_from_cpdef (__pyx_self=, unused=0x0) at debugging_cython2.c:1227
1227   PyObject *__pyx_r = 0;
1227      PyObject *__pyx_r = 0;
(gdb) cy cont
Hello from the cpdef function
>>> debugging_cython2.say_hello_from_cpdef()

Breakpoint 2, __pyx_pw_17debugging_cython2_1say_hello_from_cpdef (__pyx_self=, unused=0x0) at debugging_cython2.c:1227
1227   PyObject *__pyx_r = 0;
1227      PyObject *__pyx_r = 0;
(gdb) cy cont
Hello from the cpdef function
>>> quit()
[Inferior 1 (process 105) exited normally]

Closing Thoughts

If you’ve made it this far - congratulations! Debugging as we’ve done in this three part series is not going to be the flashiest thing you do as a developer. However, I can guarantee that working with these tools at such a level will give you a critical foundation upon which you can build. Whether you are a Python developer looking to go lower level for performance reasons, or you are a C/C++ developer looking to go higher level to work with good abstractions, having these debuggers at your disposal will let you move up and down your computing stack with relative ease. Now go forth and have fun!

Fundamental Python Debugging Part 2 - Python Extensions

2023-02-22T00:00:00+00:00

Python extensions are a key component in making Python libraries fast. With an extension, you have the ability to write code in a lower-level language like C or C++ but still interact with that code via the Python runtime. Many high-performance scientific Python libraries use this type of architecture, whether by hand-writing C/C++ extensions, by generating them with a Python to C/C++ transpiler like Cython, or both.

This has tradeoffs for a library author. While Python is an interpreted language, extensions are typically written in languages that need to be compiled. Extensions also cannot be debugged with pdb. However, as you’ll see in the following sections, pdb is heavily influenced by a lot of the tooling used for extension debugging, so if you worked through the first article in this debugging series you should have a solid foundation to build off of.

Setting up our environment

A challenge we didn’t face in the previous article was cross-platform tooling. pdb works regardless of your OS and architecture, but as we move further down into the stack we have to use tools more tailored to our environment.

Writing installation and usage instructions for all platforms would be quite the task. To abstract all of the nuances and make following through this guide easier, this guide assumes you will be using the docker image hosted at willayd/cpython-debugging. This docker image contains the following items:

gcc, which we use to build extensions
CPython source code located in /clones/cpython
A development build of Python pre-installed
A custom build of gdb which knows about the development Python installation

Not all of these elements are required, but they all make debugging easier.

To get started with the image, be sure to first install the docker engine, at which point you can then:

docker pull willayd/cpython-debugging

A quick docker images should show that same image on your local machine. Once you have the image installed, you will want to choose a location on your host computer to mount into the container you will run based off of that image, so something like:

docker run --rm -it -w /data -v /path/on/host:/data willayd/cpython-debugging

The -v flag takes an argument of the form host_path:container_path. The part preceding the : is a location on your host computer, which docker mounts at the path given after the : within the container; we’ve chosen /data above. Note that you can use shell expansion of environment variables like -v ${HOME}/code:/data if you have your work locally in a code subdirectory of your home directory. Even simpler, you could do -v ${PWD}:/data if your shell is already within the directory you want to mount.
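To make the host:container split concrete, here is a minimal sketch that assembles the same docker run command from a host directory. The docker_run_command helper and its parameter names are hypothetical; the flags and image name are the ones used in this article.

```python
import os
import shlex
from pathlib import Path


def docker_run_command(host_dir: str, container_dir: str = "/data",
                       image: str = "willayd/cpython-debugging") -> list:
    """Build the docker run invocation that mounts host_dir at container_dir."""
    # docker expects an absolute host path on the left-hand side of the colon.
    host_path = Path(os.path.expanduser(host_dir)).resolve()
    mount = f"{host_path}:{container_dir}"
    return ["docker", "run", "--rm", "-it", "-w", container_dir, "-v", mount, image]


command = docker_run_command("~/code-demos")
print(shlex.join(command))
```

Passing the list to subprocess.run would start the container; printing it first is a cheap way to check what will be mounted where.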

Building our first extension

Let’s start with the following code in a file named debugging_demo.c:

#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *
say_hello_and_return_none (PyObject *self, PyObject *args)
{
  printf ("Hello from the extension\n");
  Py_RETURN_NONE;
}

static PyMethodDef debugging_demo_methods[] = {
  {"say_hello_and_return_none", say_hello_and_return_none, METH_VARARGS,
   "Says hello and returns none."},
  {NULL, NULL, 0, NULL}  /* Sentinel */
};

static struct PyModuleDef debugging_demo_module = {
  PyModuleDef_HEAD_INIT,
  .m_name = "debugging_demo",
  .m_doc = "A simple extension to showcase debugging",
  .m_size = 0,
  .m_methods = debugging_demo_methods
};

PyMODINIT_FUNC PyInit_debugging_demo(void)
{
  return PyModuleDef_Init(&debugging_demo_module);
}

I’ve saved this locally under ~/code-demos, so I’m going to launch my docker container with docker run --rm -it -w /data -v ${HOME}/code-demos:/data willayd/cpython-debugging. A quick ls should confirm you have mounted everything properly:

willayd@willayd:~$ docker run --rm -it -w /data -v ${HOME}/code-demos:/data willayd/cpython-debugging
root@4a6161a82673:/data# ls
debugging_demo.c
root@4a6161a82673:/data#

We can build our C module into a shared library, after which we will be able to load it into Python.

root@12a481d4fa4c:/data# gcc -g3 -Wall -Werror -std=c17 -shared -fPIC -I/usr/local/include/python3.10d debugging_demo.c -o debugging_demo.so
root@12a481d4fa4c:/data# ls
debugging_demo.c  debugging_demo.so

gcc is our tool for building the code, and all of the flags we provide here are documented in the gcc Command Options.

-g3 instructs gcc to insert debugging information into the target, including macros. Without this, you may not properly be able to debug your application, may be unable to inspect source code, and may see things like optimized out when inspecting variables in gdb.

-Wall turns on a lot of warnings (not all) and pairs well with -Werror. For new C developers, I always suggest using these two. Coming from higher level languages like Python you may be used to ignoring warnings, but in C most warnings you get as a new developer really are critical coding errors.

-shared and -fPIC are both required for building a shared library, and -I/usr/local/include/python3.10d allows gcc to find our Python.h header file. All of these are necessary to make our extension loadable from Python.

-o debugging_demo.so creates our shared library with an .so extension, which is common on GNU/Linux platforms. On macOS you may see a similar concept with a .dylib extension, whereas Windows has .dll.
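The include path above is hard-coded for the Python build inside the docker image. If you are following along with a different installation, a sketch like the following asks the running interpreter where its headers live and assembles the same gcc command. The build_command helper is a name invented for this example; sysconfig.get_paths is part of the standard library.

```python
import shlex
import sysconfig


def build_command(source: str, output: str) -> list:
    """Assemble the gcc invocation from this article for the current interpreter."""
    # Directory that contains Python.h for the interpreter running this script.
    include_dir = sysconfig.get_paths()["include"]
    return [
        "gcc", "-g3", "-Wall", "-Werror", "-std=c17", "-shared", "-fPIC",
        f"-I{include_dir}", source, "-o", output,
    ]


command = build_command("debugging_demo.c", "debugging_demo.so")
print(shlex.join(command))
```

The printed command can be pasted into the shell, which keeps the build step visible rather than hiding it behind a build tool.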

Now that this shared library is available, it can be loaded, inspected and executed from the Python interpreter.

root@12a481d4fa4c:/data# python3
Python 3.10.10+ (heads/3.10:bac3fe7, Feb 22 2023, 05:56:35) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import debugging_demo
>>> debugging_demo.__doc__
'A simple extension to showcase debugging'
>>> dir(debugging_demo)
['__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'say_hello_and_return_none']
>>> debugging_demo.say_hello_and_return_none.__doc__
'Says hello and returns none.'
>>> debugging_demo.say_hello_and_return_none()
Hello from the extension
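Python picked up debugging_demo.so because the import system recognizes a platform-specific list of file suffixes for extension modules. The sketch below prints that list for your interpreter and checks a filename against it; looks_importable_as_extension is a helper name made up for this example.

```python
from importlib.machinery import EXTENSION_SUFFIXES


def looks_importable_as_extension(filename: str) -> bool:
    """Check whether a filename ends with a suffix the import system treats as an extension."""
    return any(filename.endswith(suffix) for suffix in EXTENSION_SUFFIXES)


# The exact contents depend on your platform and Python build.
print(EXTENSION_SUFFIXES)
print(looks_importable_as_extension("debugging_demo" + EXTENSION_SUFFIXES[-1]))
print(looks_importable_as_extension("debugging_demo.c"))
```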

Inspecting things with gdb

If we wanted to look at the intermediate state of things, we can pause execution and move around the stack like we did with pdb in the previous article, but this time we will be using gdb. To get started, simply run gdb python3. Thereafter, help is a good place to start.

(gdb) help
List of classes of commands:

aliases -- User-defined aliases of other commands.
breakpoints -- Making program stop at certain points.
data -- Examining data.
files -- Specifying and examining files.
internals -- Maintenance commands.
obscure -- Obscure features.
running -- Running the program.
stack -- Examining the stack.
status -- Status inquiries.
support -- Support facilities.
text-user-interface -- TUI is the GDB text based interface.
tracepoints -- Tracing of program execution without stopping the program.
user-defined -- User-defined commands.

Type "help" followed by a class name for a list of commands in that class.
Type "help all" for the list of all commands.
Type "help" followed by command name for full documentation.
Type "apropos word" to search for commands related to "word".
Type "apropos -v word" for full documentation of commands related to "word".
Command name abbreviations are allowed if unambiguous.
(gdb)

Compared to pdb, there are way more features within gdb to sift through. apropos or going through help all may be a good place to start. The help menu by default uses a very simple pager; instead you may find it useful to open the help in something like less using a pipe, i.e. pipe help all | less.

The help status subsection introduces us to the info command. info breakpoint always lists your breakpoints, of which we have none right now.

(gdb) info breakpoint
No breakpoints or watchpoints.

help break gives great details on how to set a breakpoint. For now, let’s go ahead and enter break say_hello_and_return_none to enter the debugger when our function starts to execute.

(gdb) break say_hello_and_return_none
Function "say_hello_and_return_none" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (say_hello_and_return_none) pending.

Python has not yet loaded our shared library, so gdb isn’t sure yet that this function exists. It will become available when we start running our program, so you can enter y when prompted above.

At this point go ahead and run to start the Python interpreter that gdb attached to. You can then import the module and execute the function, at which point gdb will come back to the forefront:

(gdb) run
Starting program: /usr/local/bin/python3
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Python 3.10.10+ (heads/3.10:bac3fe7, Feb 22 2023, 05:56:35) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import debugging_demo
>>> debugging_demo.say_hello_and_return_none()

Breakpoint 1, say_hello_and_return_none (self=0x7f4de7360230, args=0x7f4de7674250) at debugging_demo.c:7
7      printf ("Hello from the extension\n");

Similar to pdb we have a backtrace command (or bt shortcut) to inspect the call stack. Unlike pdb, this shows the call sequence tracing from the bottom up rather than the top down.

(gdb) bt
#0  say_hello_and_return_none (self=0x7f4de7360230, args=0x7f4de7674250) at debugging_demo.c:7
#1  0x0000558c96fed0cb in cfunction_call (func=0x7f4de73603b0, args=, kwargs=)
    at Objects/methodobject.c:552
#2  0x0000558c96e0a1c3 in _PyObject_MakeTpCall (tstate=tstate@entry=0x558c97f030b0,
    callable=callable@entry=0x7f4de73603b0, args=args@entry=0x7f4de76ea7c0, nargs=,
    keywords=keywords@entry=0x0) at Objects/call.c:215
#3  0x0000558c96ec1baa in _PyObject_VectorcallTstate (tstate=0x558c97f030b0, callable=0x7f4de73603b0, args=0x7f4de76ea7c0,
    nargsf=, kwnames=0x0) at ./Include/cpython/abstract.h:112
#4  0x0000558c96ec6185 in PyObject_Vectorcall (kwnames=0x0, nargsf=9223372036854775808, args=0x7f4de76ea7c0,
    callable=0x7f4de73603b0) at ./Include/cpython/abstract.h:123
#5  call_function (tstate=tstate@entry=0x558c97f030b0, trace_info=trace_info@entry=0x7fffe84db900,
    pp_stack=pp_stack@entry=0x7fffe84db8d0, oparg=oparg@entry=0, kwnames=kwnames@entry=0x0) at Python/ceval.c:5893
#6  0x0000558c96ed355e in _PyEval_EvalFrameDefault (tstate=0x558c97f030b0, f=0x7f4de76ea650, throwflag=)
    at Python/ceval.c:4181

Each frame is numbered on the left hand side from 0 (most-recent frame). You can use up and down to navigate the call stack, or you can use the frame command / f shortcut to jump to any particular frame.
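The difference in ordering is easy to see from the Python side. In the sketch below, traceback.extract_stack lists frames oldest first, so the most recent call comes last; reversing that list gives the gdb-style view where index 0 is the most recent frame. The inner and outer functions are throwaway names for this example.

```python
import traceback


def inner() -> list:
    # Collect the names of the functions currently on the Python call stack.
    return [frame.name for frame in traceback.extract_stack()]


def outer() -> list:
    return inner()


python_order = outer()                   # oldest frame first, most recent call last
gdb_order = list(reversed(python_order))  # most recent frame first, like gdb's #0

print(python_order[-2:])
print(gdb_order[:2])
```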

Let us go ahead and jump to frame number 2, which is in the cpython source code at Objects/call.c on line 215. We can then use the list / l commands that pdb also has to look at that code.

(gdb) f 2
#2  0x0000558c96e0a1c3 in _PyObject_MakeTpCall (tstate=tstate@entry=0x558c97f030b0,
    callable=callable@entry=0x7f4de73603b0, args=args@entry=0x7f4de76ea7c0, nargs=,
    keywords=keywords@entry=0x0) at Objects/call.c:215
215          result = call(callable, argstuple, kwdict);
(gdb) l
210      }
211
212      PyObject *result = NULL;
213      if (_Py_EnterRecursiveCall(tstate, " while calling a Python object") == 0)
214      {
215          result = call(callable, argstuple, kwdict);
216          _Py_LeaveRecursiveCall(tstate);
217      }
218
219      Py_DECREF(argstuple);

Let’s do f 0 to get back to the most current frame. There you can use next / n to advance to the next line, and then continue / c to let the program continue.

(gdb) f 0
#0  say_hello_and_return_none (self=0x7f4de7360230, args=0x7f4de7674250) at debugging_demo.c:7
7      printf ("Hello from the extension\n");
(gdb) n
Hello from the extension
8      Py_RETURN_NONE;
(gdb) c
Continuing.
>>>

At the very end we get back to our Python interpreter. You can quit() out of this to get back to gdb, and then exit gdb to get back to the shell.

>>> quit()
[Inferior 1 (process 57) exited normally]
(gdb) exit
root@ba83cd50f6ec:/data#

Debugging Segmentation Faults

Let’s introduce an off-by-one programming error into our source code. We can create a new debugging_demo2.c file with similar but updated content:

#define PY_SSIZE_T_CLEAN
#include <Python.h>

#define NUM_WORDS 4

static PyObject *
say_hello_and_return_none (PyObject *self, PyObject *args)
{
  const char* words[NUM_WORDS] = {
    "Hello",
    "from",
    "the",
    "extension"
  };

  for (int i = 0; i <= NUM_WORDS; i++) {
    printf ("%s ", words[i]);
  }

  printf("\n");
  Py_RETURN_NONE;
}

static PyMethodDef debugging_demo2_methods[] = {
  {"say_hello_and_return_none", say_hello_and_return_none, METH_VARARGS,
   "Says hello and returns none."},
  {NULL, NULL, 0, NULL}  /* Sentinel */
};

static struct PyModuleDef debugging_demo2_module = {
  PyModuleDef_HEAD_INIT,
  .m_name = "debugging_demo2",
  .m_doc = "A simple extension to showcase debugging",
  .m_size = 0,
  .m_methods = debugging_demo2_methods
};

PyMODINIT_FUNC PyInit_debugging_demo2(void)
{
  return PyModuleDef_Init(&debugging_demo2_module);
}

Compile with gcc -g3 -Wall -Werror -std=c17 -shared -fPIC -I/usr/local/include/python3.10d debugging_demo2.c -o debugging_demo2.so. A quick python3 -c "import debugging_demo2; debugging_demo2.say_hello_and_return_none()" this time will likely give you a segmentation fault, with no real error message.

root@ba83cd50f6ec:/data# python3 -c "import debugging_demo2; debugging_demo2.say_hello_and_return_none()"
Segmentation fault (core dumped)
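As a first step before reaching for a debugger, Python's standard faulthandler module can at least report which Python line was executing when the process died. The sketch below is not tied to our extension: it assumes a Linux-like platform where ctypes is available, and it provokes a crash in a child interpreter by reading from address 0 so that the parent can show what faulthandler writes to stderr.

```python
import subprocess
import sys

# Child program: enable faulthandler, then deliberately read from a null pointer.
CHILD_CODE = "import faulthandler, ctypes; faulthandler.enable(); ctypes.string_at(0)"


def run_crashing_child() -> subprocess.CompletedProcess:
    """Run the crashing snippet in a separate interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
    )


result = run_crashing_child()
print("return code:", result.returncode)
print(result.stderr)
```

The same report is available for our extension by running python3 -X faulthandler -c "...". It only tells you where the Python side was, which is why a native debugger is still needed to see inside the C code.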

Fortunately, gdb will automatically stop execution on a segfault and give you the ability to inspect your program. Let’s start this program using the --args argument to gdb, which will allow us to forward arguments like -c "..." to the program gdb attaches to (here python3):

root@ba83cd50f6ec:/data# gdb --args python3 -c "import debugging_demo2; debugging_demo2.say_hello_and_return_none()"
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
...
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(gdb)

Enter run and things will pause when the segmentation fault occurs:

(gdb) run
Starting program: /usr/local/bin/python3 -c import\ debugging_demo2\;\ debugging_demo2.say_hello_and_return_none\(\)
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffb409dc97d in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb)

If we inspect the backtrace, we will see that the first three frames are from /lib/x86_64-linux-gnu/libc.so.6, which is part of the C standard library on GNU/Linux.

(gdb) bt
#0  0x00007ffb409dc97d in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffb408b5db1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffb4089f81f in printf () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ffb405b5245 in say_hello_and_return_none (self=0x7ffb4065dc10, args=0x7ffb406e4250) at debugging_demo.c:15
#4  0x000055bed6af30cb in cfunction_call (func=0x7ffb406a2b10, args=<optimized out>, kwargs=<optimized out>)
    at Objects/methodobject.c:552

In contrast to the last 2 frames, these three carry barely any function information. This is because libc is heavily optimized and ships without debugging symbols (remember the -g3 flag we used during compilation), so gdb can’t do much besides tell us the memory location of the calls. If you ever try to debug a program and can’t see the symbols you are looking for, keep this in mind.

In any case, we are going to assume there is no bug in the standard library and jump back to f 3 to inspect our code. There, a quick info locals tells us about the local variables.

(gdb) f 3
#3  0x00007ffb405b5245 in say_hello_and_return_none (self=0x7ffb4065dc10, args=0x7ffb406e4250) at debugging_demo.c:15
15       printf ("%s ", words[i]);
(gdb) info locals
i = 4
words = {0x7ffb405b6000 "Hello", 0x7ffb405b6006 "from", 0x7ffb405b600b "the", 0x7ffb405b600f "extension"}

Since C is a 0-indexed language, a 4-element array only has valid indices 0 through 3. With i = 4, the expression words[i] tries to access memory that is out of bounds, which is the root cause of our segmentation fault:

(gdb) p words[3]
$1 = 0x7ffb405b600f "extension"
(gdb) p words[i]
$2 = 0x2e 

A quick l lists the code surrounding this function.

(gdb) l
     "the",
     "extension"
   };
15
   for (int i = 0; i <= NUM_WORDS; i++) {
     printf ("%s ", words[i]);
   }
19
   printf("\n");
   Py_RETURN_NONE;

The error is on line 14. To have this program execute properly we need to change for (int i = 0; i <= NUM_WORDS; i++) to for (int i = 0; i < NUM_WORDS; i++), keeping our array access in bounds.

As an aside, if we had turned on optimization when compiling this via the -O2 flag, gcc would have warned and then errored (as long as you use -Werror) up front. But that would have made debugging less fun.

root@ba83cd50f6ec:/data# gcc -g3 -O2 -Wall -Werror -std=c17 -shared -fPIC -I/usr/local/include/python3.10d debugging_demo2.c -o debugging_demo2.so
debugging_demo2.c: In function 'say_hello_and_return_none':
debugging_demo2.c:17:5: error: iteration 4 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
   17 |     printf ("%s ", words[i]);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~
debugging_demo2.c:16:21: note: within this loop
   16 |   for (int i = 0; i <= NUM_WORDS; i++) {
      |                     ^
cc1: all warnings being treated as errors

Debugging Python->C data exchange

CPython distributes a gdb Python extension that bridges the gap between what you as a Python developer see at runtime and what gdb knows about the objects it sees at a lower level. This extension is housed in the CPython source code, which we also have hanging around at /clones in our Docker image.

Let’s continue expanding on our previous extension, this time naming it debugging_demo3.c. Rather than being self contained, the new extension will print whatever name you pass to it through the Python interpreter. Our initial structure looks like this:

#define PY_SSIZE_T_CLEAN
#include <Python.h>

#define NUM_WORDS 4

static PyObject *
say_hello_and_return_none (PyObject *self, PyObject *args)
{
  PyObject *name;
  if (!PyArg_ParseTuple(args, "O", &name)) {
    return NULL;
  }

  const char *str = PyUnicode_AsUTF8(name);
  printf("Hello, %s\n", str);
  Py_RETURN_NONE;
}

static PyMethodDef debugging_demo3_methods[] = {
  {"say_hello_and_return_none", say_hello_and_return_none, METH_VARARGS,
   "Says hello and returns none."},
  {NULL, NULL, 0, NULL}  /* Sentinel */
};

static struct PyModuleDef debugging_demo3_module = {
  PyModuleDef_HEAD_INIT,
  .m_name = "debugging_demo3",
  .m_doc = "A simple extension to showcase debugging",
  .m_size = 0,
  .m_methods = debugging_demo3_methods
};

PyMODINIT_FUNC PyInit_debugging_demo3(void)
{
  return PyModuleDef_Init(&debugging_demo3_module);
}

We need to build this extension just as we have done before, this time using gcc -g3 -Wall -Werror -std=c17 -shared -fPIC -I/usr/local/include/python3.10d debugging_demo3.c -o debugging_demo3.so.

If you look closely at the source code above we have introduced PyArg_ParseTuple, which handles unpacking function arguments into local variables. Our function takes 1 and only 1 argument in its current form; attempting to call it with anything else will set the global Python error indicator, hit the return NULL; statement, and propagate the error back to the Python runtime. That’s a lot of power packed into a few lines of code.

root@ba83cd50f6ec:/data# python3
Python 3.10.10+ (heads/3.10:bac3fe7, Feb 22 2023, 05:56:35) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import debugging_demo3
>>> debugging_demo3.say_hello_and_return_none("Will")
Hello, Will
>>> debugging_demo3.say_hello_and_return_none()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: function takes exactly 1 argument (0 given)
>>> debugging_demo3.say_hello_and_return_none("Will", "Ayd")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: function takes exactly 1 argument (2 given)

Things work great until you try passing through something that is not a unicode object.

>>> debugging_demo3.say_hello_and_return_none(555)
Hello, (null)
Fatal Python error: _Py_CheckFunctionResult: a function returned a result with an exception set
Python runtime state: initialized
TypeError: bad argument type for built-in operation

The above exception was the direct cause of the following exception:

SystemError: <built-in function say_hello_and_return_none> returned a result with an exception set

Current thread 0x00007f5dd4cbb740 (most recent call first):
  File "<stdin>", line 1 in <module>

Extension modules: debugging_demo3 (total: 1)
Aborted (core dumped)

This time the program aborted instead of having a segmentation fault. That said, gdb will still allow you to jump in and inspect the state of things prior to termination.

root@ba83cd50f6ec:/data# gdb --args python3 -c "import debugging_demo3; debugging_demo3.say_hello_and_return_none(555)"
Reading symbols from python3...
(gdb) run
Starting program: /usr/local/bin/python3 -c import\ debugging_demo3\;\ debugging_demo3.say_hello_and_return_none\(555\)
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Hello, (null)
Fatal Python error: _Py_CheckFunctionResult: a function returned a result with an exception set
Python runtime state: initialized
TypeError: bad argument type for built-in operation

The above exception was the direct cause of the following exception:

SystemError: <built-in function say_hello_and_return_none> returned a result with an exception set

Current thread 0x00007f9b27e91740 (most recent call first):
  File "<string>", line 1 in <module>

Extension modules: debugging_demo3 (total: 1)

Program received signal SIGABRT, Aborted.
0x00007f9b27f2aa7c in pthread_kill () from /lib/x86_64-linux-gnu/libc.so.6

When you look at the backtrace here, you won’t see any of our user code:

(gdb) bt
#0  0x00007f9b27f2aa7c in pthread_kill () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f9b27ed6476 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f9b27ebc7f3 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x0000555c45a8505b in fatal_error_exit (status=<optimized out>) at Python/pylifecycle.c:2553
#4  0x0000555c45a895c7 in fatal_error (fd=2, header=header@entry=1,
    prefix=prefix@entry=0x555c45c08a60 <__func__.18> "_Py_CheckFunctionResult",
    msg=msg@entry=0x555c45c08528 "a function returned a result with an exception set", status=status@entry=-1)
    at Python/pylifecycle.c:2734
#5  0x0000555c45a89630 in _Py_FatalErrorFunc (func=func@entry=0x555c45c08a60 <__func__.18> "_Py_CheckFunctionResult",

This is a bit unfortunate because we can’t directly trace back to our function. With that said, the message a function returned a result with an exception set clues us in on where we need to look. CPython manages one global error indicator queryable via PyErr_Occurred().

To find out where that indicator gets set, let’s set a break say_hello_and_return_none to pause execution when we enter our function. Then run to get to that point and add a watch PyErr_Occurred() to the mix.

(gdb) break say_hello_and_return_none
Breakpoint 1 at 0x7f0a8fbf5200: file debugging_demo3.c, line 8.
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/local/bin/python3 -c import\ debugging_demo3\;\ debugging_demo3.say_hello_and_return_none\(555\)
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, say_hello_and_return_none (self=0x7f58e60305f0, args=0x7f58e5fe98b0) at debugging_demo3.c:8
8    {
(gdb) watch PyErr_Occurred()
Watchpoint 2: PyErr_Occurred()

At this point info break should show us the two conditions under which gdb will pause, either on say_hello_and_return_none entry or when the PyErr_Occurred() value changes.

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00007f58e5f87200 in say_hello_and_return_none at debugging_demo3.c:8
     breakpoint already hit 1 time
2       watchpoint     keep y                      PyErr_Occurred()

Type c to continue along and you will see that the watchpoint gets hit:

(gdb) c
Continuing.

Watchpoint 2: PyErr_Occurred()

Old value = (PyObject *) 0x0
New value = (PyObject *) 0x55ccfb73cc20 <_PyExc_TypeError>
_PyErr_Restore (tstate=tstate@entry=0x55ccfcb82930, type=type@entry=0x55ccfb73cc20 <_PyExc_TypeError>, value=value@entry=0x7f58e6057640, traceback=0x0) at Python/errors.c:60
60       tstate->curexc_value = value;

The watchpoint wasn’t hit within our code, but internal to CPython. No matter - we can inspect the backtrace and see what point in our code base this might happen at.

(gdb) bt
#0  _PyErr_Restore (tstate=tstate@entry=0x55ccfcb82930, type=type@entry=0x55ccfb73cc20 <_PyExc_TypeError>,
    value=value@entry=0x7f58e6057640, traceback=0x0) at Python/errors.c:60
#1  0x000055ccfb455d60 in _PyErr_SetObject (tstate=tstate@entry=0x55ccfcb82930,
    exception=exception@entry=0x55ccfb73cc20 <_PyExc_TypeError>, value=value@entry=0x7f58e6057640) at Python/errors.c:189
#2  0x000055ccfb455f59 in _PyErr_SetString (tstate=0x55ccfcb82930, exception=0x55ccfb73cc20 <_PyExc_TypeError>,
    string=string@entry=0x55ccfb645698 "bad argument type for built-in operation") at Python/errors.c:235
#3  0x000055ccfb455fdd in PyErr_BadArgument () at Python/errors.c:667
#4  0x000055ccfb402060 in PyUnicode_AsUTF8AndSize (unicode=<optimized out>, psize=psize@entry=0x0)
    at Objects/unicodeobject.c:4245
#5  0x000055ccfb402195 in PyUnicode_AsUTF8 (unicode=<optimized out>) at Objects/unicodeobject.c:4265
#6  0x00007f58e5f87245 in say_hello_and_return_none (self=0x7f58e60305f0, args=0x7f58e5fe98b0) at debugging_demo3.c:14

Frame 6 is say_hello_and_return_none, specifically on line 14. You can jump back to that and see the line being called.

(gdb) f 6
#6  0x00007f58e5f87245 in say_hello_and_return_none (self=0x7f58e60305f0, args=0x7f58e5fe98b0) at debugging_demo3.c:14
14     const char *str = PyUnicode_AsUTF8(name);

We know from our function invocation that we passed the value 555 as an argument to the function call. However, you wouldn’t know this by trying to print the object in gdb.

(gdb) p name
$1 = (PyObject *) 0x7f58e6013bc0
(gdb) p *name
$2 = {ob_refcnt = 4, ob_type = 0x55ccfb73f180 }

We get some information when dereferencing this object about the basic PyObject struct members. But we’d have to muck around a bit more to see the members that are relevant to longs, or whatever object type it is we inspect.

This is where the gdb extension becomes a really powerful abstraction tool. First, we need to load the extension into gdb. This can be done at runtime with the source command pointing to the extension file. In our Docker image, this would mean:

(gdb) source /clones/cpython/Tools/gdb/libpython.py

Once you have loaded the extension, the default printing mechanism becomes a lot more familiar to Python users.

(gdb) p name
$3 = 555

This confirms that the object we have on this line is the same one we provided to the function call, so nothing way out of the ordinary is going on. Since the global PyErr_Occurred() indicator was set, we can call PyErr_PrintEx(0) (the function behind PyErr_Print()) to get information from the Python runtime about what went wrong. Note that calling this clears the error indicator.

(gdb) call PyErr_PrintEx(0)
TypeError
(gdb) p PyErr_Occurred()
$4 = 0x0

We called PyUnicode_AsUTF8 with a PyLong object even though it expected PyUnicode. In the Python runtime this would automatically trigger an exception and stop things immediately. C doesn’t have built-in error handling, so things continue unless you explicitly handle the issue.

Following the pattern of CPython Exception Handling, we are going to slightly modify our source code to look like this:

#define PY_SSIZE_T_CLEAN
#include <Python.h>

#define NUM_WORDS 4

static PyObject *
say_hello_and_return_none (PyObject *self, PyObject *args)
{
  PyObject *name;
  if (!PyArg_ParseTuple(args, "O", &name)) {
    return NULL;
  }

  const char *str = PyUnicode_AsUTF8(name);
  if (str == NULL) {
    return NULL;
  }

  printf("Hello, %s\n", str);
  Py_RETURN_NONE;
}

static PyMethodDef debugging_demo3_methods[] = {
  {"say_hello_and_return_none", say_hello_and_return_none, METH_VARARGS,
   "Says hello and returns none."},
  {NULL, NULL, 0, NULL}  /* Sentinel */
};

static struct PyModuleDef debugging_demo3_module = {
  PyModuleDef_HEAD_INIT,
  .m_name = "debugging_demo3",
  .m_doc = "A simple extension to showcase debugging",
  .m_size = 0,
  .m_methods = debugging_demo3_methods
};

PyMODINIT_FUNC PyInit_debugging_demo3(void)
{
  return PyModuleDef_Init(&debugging_demo3_module);
}

The if (str == NULL) check is our way of handling a failed PyUnicode_AsUTF8 call. By propagating that NULL value up the call stack, CPython will gracefully handle the error for us when we get back to the Python runtime. To confirm, recompile and try passing the same argument to the function.

(gdb) exit
A debugging session is active.

     Inferior 1 [process 515] will be killed.

Quit anyway? (y or n) y
root@ba83cd50f6ec:/data# gcc -g3 -Wall -Werror -std=c17 -shared -fPIC -I/usr/local/include/python3.10d debugging_demo3.c -o debugging_demo3.so
root@ba83cd50f6ec:/data# python3 -c "import debugging_demo3; debugging_demo3.say_hello_and_return_none(555)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: bad argument type for built-in operation

We still have an error, but the error is the built-in TypeError that we can handle in our Python code if we wanted, instead of the SIGABRT that shut down the application previously.
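
The same contract (return NULL, leave the error indicator set, let the runtime raise) can be observed from Python itself. The sketch below assumes a standard CPython interpreter, where ctypes.pythonapi exposes the C API and checks the error indicator after every call. The as_utf8 name is only a local alias chosen for illustration and is not part of the extension above.

```python
import ctypes

# Local alias for the C API function used in the extension (CPython only).
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]  # pass the PyObject* through unchanged
as_utf8.restype = ctypes.c_char_p      # the const char* comes back as bytes

print(as_utf8("Will"))

caught = None
try:
    # An int is not a unicode object, so the C function returns NULL with the
    # error indicator set; ctypes.pythonapi turns that into a Python exception.
    as_utf8(555)
except TypeError as exc:
    caught = exc
    print(f"caught {type(exc).__name__}")
```

Because the error arrives as an ordinary exception, the caller can decide what to do with it, which is exactly what the fixed extension now allows.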

While not in scope for this article, there are many ways you can improve the above function. You could either change the format string provided to PyArg_ParseTuple to map to something else besides a PyObject*, or alternatively mix in a call to PyObject_Str to coerce any object to a unicode object prior to the PyUnicode_AsUTF8 call.

Closing Thoughts

Understanding how C and Python interacted was something I struggled with for years. Once unlocked, I found knowledge of how to interact at the lower levels using gdb to be invaluable. I can only hope that this article lays a good foundation for you to build upon.

The only other advice I can offer is to be patient! I’ve been at this for years and still find myself learning something new every day. Therein lies the true art of programming.

My next article will focus on using the Cython debugger, which is implemented as a gdb extension. The knowledge in this article is a hugely important stepping stone towards that. If you can understand how to control and debug all of these components, you are in a very good spot when it comes to Python development.

Fundamental Python Debugging Part 1 - Python

2023-02-08T00:00:00+00:00

The topic of debugging Python is well-covered. Regardless of whether you want to use your IDE interactively or work from a console with pdb, chances are this is not the first article you have read on the topic.

In spite of the wealth of content, I’ve found that most articles on debugging Python are singularly focused on debugging Python. That may not seem like such a bad thing at face value, but developing Python at an advanced level requires not only knowledge of the language itself, but also of lower level languages like C/C++. Being an expert in all of these languages at one time is near impossible, so knowing how to debug them effectively is critical.

Luckily, when viewed through the proper lens, there is a lot of overlap in the debugging tooling for these languages. The built-in Python pdb debugger borrows much of its utility from gdb, which will help you debug C, C++, Rust, Fortran, and more. gdb itself is extensible using Python, and this extensibility is the reason why things like the Cython debugger exist.

Few if any other articles on debugging Python applications touch on these synchronicities. This and my next few blog posts attempt to highlight this for you and help you seamlessly transition across the aforementioned tools.

Setting up your example

Let’s start with a buggy script. This code isn’t pythonic and you may be able to troubleshoot without even using a debugger, but that isn’t important for this exercise. Go ahead and save the below snippet as buggy_program.py:

def buggy_loop():
    animals = ["dog", "cat", "turtle"]
    index = 0

    while index <= len(animals):
        print(f"The animal at index {index} is {animals[index]}")
        index += 1

if __name__ == "__main__":
    buggy_loop()

Executing this program with python buggy_program.py should yield:

The animal at index 0 is dog
The animal at index 1 is cat
The animal at index 2 is turtle
Traceback (most recent call last):
  File "buggy_program.py", line 10, in <module>
    buggy_loop()
  File "buggy_program.py", line 6, in buggy_loop
    print(f"The animal at index {index} is {animals[index]}")
IndexError: list index out of range

Part 1: Debugging exceptions

Changing our command from python buggy_program.py to python -m pdb buggy_program.py will launch pdb and load the script. pdb will not immediately execute anything, but instead wait for your input. We assume we don’t know any commands yet, so typing help is the best thing for us to start with.

> /home/willayd/buggy_program.py(1)<module>()
-> def buggy_loop():
(Pdb) help

Documented commands (type help <topic>):
========================================
EOF    c          d        h         list      q        rv       undisplay
a      cl         debug    help      ll        quit     s        unt
alias  clear      disable  ignore    longlist  r        source   until
args   commands   display  interact  n         restart  step     up
b      condition  down     j         next      return   tbreak   w
break  cont       enable   jump      p         retval   u        whatis
bt     continue   exit     l         pp        run      unalias  where

Miscellaneous help topics:
==========================
exec  pdb

help allows you to navigate any of the items listed above. We can even input help help as a meta-command.

(Pdb) help help
h(elp)
        Without argument, print the list of available commands.
        With a command name as argument, print help about that command.
        "help pdb" shows the full pdb documentation.
        "help exec" gives help on the ! command.

The help we have input so far is a pdb command and not the built-in help function that Python provides. If you wanted to execute the latter, you should prefix your input with !:

(Pdb) !help()

Welcome to Python 3.8's help utility!

If this is your first time using Python, you should definitely check out
the tutorial on the Internet at https://docs.python.org/3.8/tutorial/.

Enter the name of any module, keyword, or topic to get help on writing
Python programs and using Python modules.  To quit this help utility and
return to the interpreter, just type "quit".

To get a list of available modules, keywords, symbols, or topics, type
"modules", "keywords", "symbols", or "topics".  Each module also comes
with a one-line summary of what it does; to list the modules whose name
or summary contain a given string such as "spam", type "modules spam".

If you executed the above !help() command be sure to input q and hit enter to quit the Python interactive help.

To actually get code executing we want to continue. help continue shows us more about this command.

(Pdb) help c
c(ont(inue))
        Continue execution, only stop when a breakpoint is encountered.

So c, cont, and continue would all do the same things for us. For now input c and press enter:

(Pdb) c
The animal at index 0 is dog
The animal at index 1 is cat
The animal at index 2 is turtle
Traceback (most recent call last):
  File "/usr/lib/python3.10/pdb.py", line 1726, in main
    pdb._runscript(mainpyfile)
  File "/usr/lib/python3.10/pdb.py", line 1586, in _runscript
    self.run(statement)
  File "/usr/lib/python3.10/bdb.py", line 597, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/home/willayd/buggy_program.py", line 10, in <module>
    buggy_loop()
  File "/home/willayd/buggy_program.py", line 6, in buggy_loop
    print(f"The animal at index {index} is {animals[index]}")
IndexError: list index out of range
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /home/willayd/buggy_program.py(6)buggy_loop()
-> print(f"The animal at index {index} is {animals[index]}")
(Pdb)

The program has executed and printed the same traceback we saw without using pdb. However, since we are running our script under pdb, execution halts after the error occurs and we can inspect the state of the program.

l (short for list) shows us where the execution halted (see -> below) and a few lines around that.

(Pdb) l
  1          def buggy_loop():
  2              animals = ["dog", "cat", "turtle"]
  3              index = 0
  4
  5              while index <= len(animals):
  6  ->              print(f"The animal at index {index} is {animals[index]}")
  7                  index += 1
  8
  9          if __name__ == "__main__":
 10              buggy_loop()
[EOF]

Typing l again interestingly does not give us the same result:

(Pdb) l
[EOF]

list automatically iterates through the code every time the command is entered, and because our script is small we just reach the end-of-file. To continually display where execution halted you can enter l .

(Pdb) l .
  1          def buggy_loop():
  2              animals = ["dog", "cat", "turtle"]
  3              index = 0
  4
  5              while index <= len(animals):
  6  ->              print(f"The animal at index {index} is {animals[index]}")
  7                  index += 1
  8
  9          if __name__ == "__main__":
 10              buggy_loop()
[EOF]

Another nice feature of pdb is that you can enter expressions and see the result printed back. For instance, we know we have a variable named index in the function we are debugging, so entering that into pdb will print the value of index.

(Pdb) index
3

If you are debugging a longer function with a lot of variables, you may also be interested in the dir() or locals() functions. The former shows the names of all variables in the current scope; the latter gives you the names and values.

(Pdb) dir()
['animals', 'index']
(Pdb) locals()
{'animals': ['dog', 'cat', 'turtle'], 'index': 3}
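
What pdb shows in post mortem mode comes from the frame in which the exception was raised. As a rough sketch of that idea without the debugger, the snippet below walks an exception's traceback to its innermost frame and reads the locals there. The innermost_frame helper is a name made up for this illustration, and the loop body is trimmed to the failing lookup.

```python
def buggy_loop():
    animals = ["dog", "cat", "turtle"]
    index = 0
    while index <= len(animals):
        animals[index]  # raises IndexError once index reaches len(animals)
        index += 1


def innermost_frame(tb):
    """Follow the traceback chain to the frame where the exception was raised."""
    while tb.tb_next is not None:
        tb = tb.tb_next
    return tb.tb_frame


try:
    buggy_loop()
except IndexError as exc:
    frame = innermost_frame(exc.__traceback__)
    crash_locals = dict(frame.f_locals)  # snapshot, similar to locals() in pdb

print(crash_locals)
```

The snapshot contains the same animals and index values that locals() reported inside pdb.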

Let us step back now and talk about the problem we are trying to solve. The traceback tells us we have an IndexError: list index out of range on line 6, and the debugger paused us at that same line. Upon inspecting the index variable in the debugger we note it has a value of 3.

Line 6 attempts to do animals[index], which fails because Python uses 0-based indexing: a list of three items only has valid indices 0 through 2. One fix is for us to change line 5 from

while index <= len(animals):

to

while index < len(animals):

If you make that change to the source code you can enter restart into pdb to start over with the updated script logic. From there input c and you will note the script executes without issue.

(Pdb) restart
Restarting /home/willayd/buggy_program.py with arguments:

> /home/willayd/buggy_program.py(1)<module>()
-> def buggy_loop():
(Pdb) c
Post mortem debugger finished. The /home/willayd/buggy_program.py will be restarted
> /home/willayd/buggy_program.py(1)<module>()
-> def buggy_loop():
(Pdb) c
The animal at index 0 is dog
The animal at index 1 is cat
The animal at index 2 is turtle
The program finished and will be restarted
> /home/willayd/buggy_program.py(1)<module>()
-> def buggy_loop():

Since things are good to go now, you can type quit() into the debugger to close things out.
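
For reference, here is a sketch of the corrected loop next to a more idiomatic variant. Both functions return their lines instead of printing directly, purely so they are easy to compare. The names fixed_loop and idiomatic_loop are chosen for this illustration only.

```python
def fixed_loop():
    animals = ["dog", "cat", "turtle"]
    index = 0
    lines = []
    while index < len(animals):  # strict comparison keeps index in bounds
        lines.append(f"The animal at index {index} is {animals[index]}")
        index += 1
    return lines


def idiomatic_loop():
    animals = ["dog", "cat", "turtle"]
    # enumerate yields (index, item) pairs, so there is no counter to get wrong
    return [f"The animal at index {i} is {animal}" for i, animal in enumerate(animals)]


for line in fixed_loop():
    print(line)
```

The enumerate version removes the manual index bookkeeping altogether, which is where the off-by-one error came from in the first place.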

Part 2: Debugging logical errors

Getting an exception in Python is a clear indicator that things are wrong, but not every bug shows up as an error. The code below is inspired by pandas bug #49861. The code as originally written used a recursive function call that was roughly equivalent to:

def normalize_json(
    data,
    key_string,
    normalized_dict,
    separator
):
    if isinstance(data, dict):
        for key, value in data.items():
            new_key = f"{key_string}{separator}{key}"
            normalize_json(
                data=value,
                # to avoid adding the separator to the start of every key
                key_string=new_key
                if new_key[len(separator) - 1] != separator
                else new_key[len(separator) :],
                normalized_dict=normalized_dict,
                separator=separator,
            )
    else:
        normalized_dict[key_string] = data
    return normalized_dict

This function aims to take the keys of deeply nested dictionaries and combine them into one key with a separator. Note below how hierarchies like a -> b -> c get folded into one a.b.c key.

>>> normalize_json({"a": {"b": [1, 2, 3]}}, "", {}, ".")
{'a.b': [1, 2, 3]}

>>> normalize_json({"a": {"b": {"c": [1, 2, 3]}}}, "", {}, ".")
{'a.b.c': [1, 2, 3]}

The OP of the pandas issue noticed that the function would incorrectly remove the start of the string at the top of the dictionary hierarchy if that key began with the separator argument. For instance, if you had a key at the top of the dictionary that began with an underscore and you used an underscore separator, the very first key would get mangled. This is visible below as the normalized key is shown as a_b when it should be _a_b.

>>> normalize_json({"_a": {"b": [1, 2, 3]}}, "", {}, "_")
{'a_b': [1, 2, 3]}

To diagnose, go ahead and save the following code as buggy_script2.py:

def normalize_json(
    data,
    key_string,
    normalized_dict,
    separator
):
    if isinstance(data, dict):
        for key, value in data.items():
            new_key = f"{key_string}{separator}{key}"
            normalize_json(
                data=value,
                # to avoid adding the separator to the start of every key
                key_string=new_key
                if new_key[len(separator) - 1] != separator
                else new_key[len(separator) :],
                normalized_dict=normalized_dict,
                separator=separator,
            )
    else:
        normalized_dict[key_string] = data
    return normalized_dict


if __name__ == "__main__":
    print(normalize_json({"_a": {"b": [1, 2, 3]}}, "", {}, "_"))

We can start the debugger and load the script using python -m pdb buggy_script2.py. However, since no exception is raised this time, the code will not stop unless we explicitly set a breakpoint. help break gives you some ideas on how to do this; for now start with break normalize_json:

> /home/willayd/buggy_script2.py(1)<module>()
-> def normalize_json(
(Pdb) break normalize_json
Breakpoint 1 at /home/willayd/buggy_script2.py:1
(Pdb) break
Num Type         Disp Enb   Where
1   breakpoint   keep yes   at /home/willayd/buggy_script2.py:1

Continue along by hitting c then l to list where execution paused, and you will see it is the first executable line of the normalize_json function body.

(Pdb) c
> /home/willayd/buggy_script2.py(7)normalize_json()
-> if isinstance(data, dict):
(Pdb) l
            data,
            key_string,
            normalized_dict,
            separator
        ):
->          if isinstance(data, dict):
                for key, value in data.items():
                    new_key = f"{key_string}{separator}{key}"
                    normalize_json(
                        data=value,
                        # to avoid adding the separator to the start of every key

Another command worth introducing here is backtrace, or bt for short. Python function calls form a call stack, so backtrace tells you the sequence of calls that led up to the breakpoint.

(Pdb) bt
  /usr/lib/python3.10/bdb.py(597)run()
-> exec(cmd, globals, locals)
  <string>(1)<module>()
  /home/willayd/buggy_script2.py(25)<module>()
-> normalize_json({"_a": {"b": [1, 2, 3]}}, "", {}, "_")
> /home/willayd/buggy_script2.py(7)normalize_json()
-> if isinstance(data, dict):

Within pdb the most recent call appears on the bottom (other debuggers may reverse this), so reading from the bottom up we are at normalize_json line 7 which was called by our buggy_script2.py script on line 25. The calls preceding that are internal to Python. Hit c again and another bt to see what happens next:

(Pdb) c
> /home/willayd/buggy_script2.py(7)normalize_json()
-> if isinstance(data, dict):
(Pdb) bt
  /usr/lib/python3.10/bdb.py(597)run()
-> exec(cmd, globals, locals)
  <string>(1)<module>()
  /home/willayd/buggy_script2.py(25)<module>()
-> normalize_json({"_a": {"b": [1, 2, 3]}}, "", {}, "_")
  /home/willayd/buggy_script2.py(10)normalize_json()
-> normalize_json(
> /home/willayd/buggy_script2.py(7)normalize_json()
-> if isinstance(data, dict):

We are in a recursive function call, so we see that normalize_json is at the bottom of our backtrace twice. This pattern would continue every time we continue script execution.
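
The same ordering can be reproduced in plain Python with the standard library's traceback.extract_stack(), which also lists the oldest call first and the most recent call last. This is only a sketch: descend is a made-up recursive function standing in for normalize_json.

```python
import traceback


def descend(depth):
    if depth == 0:
        # extract_stack() lists the oldest call first and the most recent last
        return [frame.name for frame in traceback.extract_stack()]
    return descend(depth - 1)


names = descend(1)
print(names[-2:])
```

With one level of recursion the function name shows up twice at the end of the list, mirroring the two normalize_json entries at the bottom of the pdb backtrace.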

pdb lets you move up and down the stack trace. We know we are 2 normalize_json calls deep. The up and down commands, not surprisingly, move up and down the call stack, giving you the power to inspect each frame.

(Pdb) locals()
{'data': {'b': [1, 2, 3]}, 'key_string': '_a', 'normalized_dict': {}, 'separator': '_'}
(Pdb) up
> /home/willayd/buggy_script2.py(10)normalize_json()
-> normalize_json(
(Pdb) locals()
{'data': {'_a': {'b': [1, 2, 3]}}, 'key_string': '', 'normalized_dict': {}, 'separator': '_', 'key': '_a', 'value': {'b': [1, 2, 3]}, 'new_key': '__a'}
(Pdb) down
> /home/willayd/buggy_script2.py(7)normalize_json()
-> if isinstance(data, dict):
(Pdb) locals()
{'data': {'b': [1, 2, 3]}, 'key_string': '_a', 'normalized_dict': {}, 'separator': '_'}

The first time we called locals() we were at the most recent normalize_json call. The up command moved us back one frame; down takes us back to the current frame.
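
Under the hood, each entry in the stack is a frame object that links to its caller. A minimal sketch of what up selects, assuming CPython (sys._getframe is an implementation detail) and using two made-up functions, inner and outer:

```python
import sys


def inner():
    current = sys._getframe()  # the frame pdb shows before any `up`
    caller = current.f_back    # the frame `up` would move to
    return (
        current.f_code.co_name,
        caller.f_code.co_name,
        caller.f_locals["key_string"],  # a local that lives in the caller
    )


def outer():
    key_string = "_a"
    return inner()


print(outer())
```

Reading caller.f_locals here is the programmatic counterpart of typing up followed by locals() in pdb.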

Since our input data isn’t too deeply nested, we could keep continuing and moving up and down the stack to try and find where the issue appears, but this could be impractical with many layers of recursion. Fortunately we can be more intelligent with where and when we choose to pause code execution.

To do that let’s restart our code execution and clear our existing breakpoint(s).

(Pdb) restart
Restarting /home/willayd/buggy_script2.py with arguments:

> /home/willayd/buggy_script2.py(1)<module>()
-> def normalize_json(
(Pdb) clear
Clear all breaks? y
Deleted breakpoint 1 at /home/willayd/buggy_script2.py:1

If you inspected the help break output earlier, you might have noticed that break takes an optional condition argument. This is an expression that must evaluate to True for the breakpoint to pause execution.

We know from our bug report and from inspecting some of the locals() outputs earlier that the bug likely happens when a variable named key_string has the value of a_b, so we can pause execution only when that condition is met.

(Pdb) break normalize_json, key_string == "a_b"
Breakpoint 2 at /home/willayd/buggy_script2.py:1
(Pdb) c
> /home/willayd/buggy_script2.py(7)normalize_json()
-> if isinstance(data, dict):
(Pdb) bt
  /usr/lib/python3.10/bdb.py(597)run()
-> exec(cmd, globals, locals)
  <string>(1)<module>()
  /home/willayd/buggy_script2.py(25)<module>()
-> print(normalize_json({"_a": {"b": [1, 2, 3]}}, "", {}, "_"))
  /home/willayd/buggy_script2.py(10)normalize_json()
-> normalize_json(
  /home/willayd/buggy_script2.py(10)normalize_json()
-> normalize_json(
> /home/willayd/buggy_script2.py(7)normalize_json()
-> if isinstance(data, dict):

The above backtrace shows us pausing code execution within the third call of normalize_json. Even though our breakpoint was on the normalize_json function, the expression key_string == "a_b" did not evaluate to true for the first two function calls.
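
If you prefer to keep the condition in the source file, a rough in-code equivalent is to guard a breakpoint() call with the same expression. The sketch below is an assumption-laden illustration rather than part of the original script: it swaps sys.breakpointhook for a hypothetical record_pause function so that it runs without opening an interactive prompt.

```python
import sys


def normalize_json(data, key_string, normalized_dict, separator):
    # In-source stand-in for: break normalize_json, key_string == "a_b"
    if key_string == "a_b":
        breakpoint()
    if isinstance(data, dict):
        for key, value in data.items():
            new_key = f"{key_string}{separator}{key}"
            normalize_json(
                data=value,
                key_string=new_key
                if new_key[len(separator) - 1] != separator
                else new_key[len(separator):],
                normalized_dict=normalized_dict,
                separator=separator,
            )
    else:
        normalized_dict[key_string] = data
    return normalized_dict


pauses = []


def record_pause(*args, **kwargs):
    # Hypothetical hook: note the pause instead of opening pdb
    pauses.append("paused")


# Remove this assignment to get a real pdb prompt at the breakpoint
sys.breakpointhook = record_pause
normalize_json({"_a": {"b": [1, 2, 3]}}, "", {}, "_")
print(len(pauses))  # 1
```

As in the pdb session, the condition is only met once, in the third call.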

Where our execution paused, key_string is not modified locally but rather received as an argument. This means the bug likely originates one call up in the backtrace, so move up and inspect the code:

(Pdb) u
> /home/willayd/buggy_script2.py(10)normalize_json()
-> normalize_json(
(Pdb) ll
B        def normalize_json(
            data,
            key_string,
            normalized_dict,
            separator
        ):
            if isinstance(data, dict):
                for key, value in data.items():
                    new_key = f"{key_string}{separator}{key}"
->                  normalize_json(
                        data=value,
                        # to avoid adding the separator to the start of every key
                        key_string=new_key
                        if new_key[len(separator) - 1] != separator
                        else new_key[len(separator) :],
                        normalized_dict=normalized_dict,
                        separator=separator,
                    )
            else:
                normalized_dict[key_string] = data
            return normalized_dict
(Pdb) new_key
'_a_b'

Our code execution paused on line 10. On line 9 new_key was assigned a value of _a_b, which is what we want to see in the end result.

Look closely at line 13, however, and you will note that we aren’t just forwarding new_key as an argument to the next normalize_json call. Instead, a conditional expression (if...else) determines which value gets forwarded. We can evaluate the condition and the else branch to get an idea of what is going on:

(Pdb) new_key[len(separator) - 1]
'_'
(Pdb) new_key[len(separator):]
'a_b'
(Pdb) new_key
'_a_b'
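
The same check can be reproduced outside the debugger. The helper below, forwarded_key, is a hypothetical name that simply isolates the conditional expression from the script so we can feed it a few inputs. It suggests the check always looks at the first character of new_key, so it cannot tell a separator that was just prepended apart from one that belongs to the key itself.

```python
def forwarded_key(key_string, key, separator="_"):
    # Same expression as in buggy_script2.py, pulled out for experimentation
    new_key = f"{key_string}{separator}{key}"
    if new_key[len(separator) - 1] != separator:
        return new_key
    return new_key[len(separator):]


print(forwarded_key("", "_a"))   # _a   (first call: the prepended separator is stripped)
print(forwarded_key("_a", "b"))  # a_b  (second call: the key's own leading "_" is lost)
print(forwarded_key("a", "b"))   # a_b  (no leading separator, so nothing is stripped)
```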

Our first instinct might be to simplify the function call and make the argument key_string=new_key, making our buggy_script2.py script now look like:

def normalize_json(
    data,
    key_string,
    normalized_dict,
    separator
):
    if isinstance(data, dict):
        for key, value in data.items():
            new_key = f"{key_string}{separator}{key}"
            normalize_json(
                data=value,
                # to avoid adding the separator to the start of every key
                key_string=new_key,
                normalized_dict=normalized_dict,
                separator=separator,
            )
    else:
        normalized_dict[key_string] = data
    return normalized_dict


if __name__ == "__main__":
    print(normalize_json({"_a": {"b": [1, 2, 3]}}, "", {}, "_"))

This reads nicer, but we have fixed one thing by breaking another. Doing a restart and continue no longer hits our breakpoint, but the script now prints out {'__a_b': [1, 2, 3]}. We want _a_b as the key, not __a_b.

So back to the drawing board… In pdb, enter restart and then clear to remove the breakpoint we have set so far. Then enter break normalize_json so we can stop again on every function call.

(Pdb) clear
Clear all breaks? y
Deleted breakpoint 2 at /home/willayd/buggy_script2.py:1
(Pdb) break normalize_json
Breakpoint 3 at /home/willayd/buggy_script2.py:1
(Pdb)

Now step through a few function calls, inspect locals and see what might be happening:

(Pdb) locals()
{'data': {'_a': {'b': [1, 2, 3]}}, 'key_string': '', 'normalized_dict': {}, 'separator': '_'}
(Pdb) c
> /home/willayd/buggy_script2.py(7)normalize_json()
-> if isinstance(data, dict):
(Pdb) locals()
{'data': {'b': [1, 2, 3]}, 'key_string': '__a', 'normalized_dict': {}, 'separator': '_'}
(Pdb) c
> /home/willayd/buggy_script2.py(7)normalize_json()
-> if isinstance(data, dict):
(Pdb) locals()
{'data': [1, 2, 3], 'key_string': '__a_b', 'normalized_dict': {}, 'separator': '_'}
(Pdb)

If you look closely, you will notice that the key_string variable is already wrong on the second call to the normalize_json function. But the pattern of joining that key with one separator appears correct in the call thereafter.

A simple solution is to give normalize_json a way to know whether it is being called for the first time, and to special-case the handling of that first call. Inspecting locals() across the different function calls, we notice that key_string is an empty string in the first call but has a value in all subsequent calls. Knowing this, we can set up a condition to strip the leading separator only when we ARE in the first function call.

def normalize_json(
    data,
    key_string,
    normalized_dict,
    separator
):
    if isinstance(data, dict):
        for key, value in data.items():
            new_key = f"{key_string}{separator}{key}"
            # strip the separator that would otherwise lead the top-level key
            if not key_string:
                new_key = new_key.removeprefix(separator)
            normalize_json(
                data=value,
                key_string=new_key,
                normalized_dict=normalized_dict,
                separator=separator,
            )
    else:
        normalized_dict[key_string] = data
    return normalized_dict


if __name__ == "__main__":
    print(normalize_json({"_a": {"b": [1, 2, 3]}}, "", {}, "_"))

To verify this now works, restart the program, clear any breakpoint(s) and continue to let things run. You should now get the right answer.

(Pdb) restart
Restarting /home/willayd/buggy_script2.py with arguments:

> /home/willayd/buggy_script2.py(1)<module>()
-> def normalize_json(
(Pdb) clear
Clear all breaks? y
Deleted breakpoint 3 at /home/willayd/buggy_script2.py:1
(Pdb) c
{'_a_b': [1, 2, 3]}
The program finished and will be restarted
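
Rather than relying on a single printed result, it can help to pin the behavior down with a few assertions. The sketch below repeats the fixed function from above and checks it against the original bug report input plus two extra inputs that are assumptions chosen for illustration. Note that str.removeprefix requires Python 3.9 or newer.

```python
def normalize_json(data, key_string, normalized_dict, separator):
    if isinstance(data, dict):
        for key, value in data.items():
            new_key = f"{key_string}{separator}{key}"
            # Only the top-level call has an empty key_string
            if not key_string:
                new_key = new_key.removeprefix(separator)
            normalize_json(
                data=value,
                key_string=new_key,
                normalized_dict=normalized_dict,
                separator=separator,
            )
    else:
        normalized_dict[key_string] = data
    return normalized_dict


# The input from the bug report keeps its leading underscore
assert normalize_json({"_a": {"b": [1, 2, 3]}}, "", {}, "_") == {"_a_b": [1, 2, 3]}
# Keys without a leading separator behave as before
assert normalize_json({"a": {"b": 1}}, "", {}, "_") == {"a_b": 1}
# Flat and nested values can be mixed at the top level
assert normalize_json({"a": 1, "b": {"c": 2}}, "", {}, "_") == {"a": 1, "b_c": 2}
print("all checks passed")
```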

Closing Thoughts

If you have made it this far, congratulations! With modern visual debuggers integrated into IDEs, the way of debugging illustrated above may not be the most commonplace. However, through liberal use of the help command, you may find that pdb has many features that are not implemented, or not obvious to use, in higher-level debuggers. Barring some differences, you’ll also find that this method of using pdb translates well to gdb and extensions like the Cython debugger, which will be covered in future blog posts.