Mar 11

Python Async and Await Basics

Mindli Team

AI-Generated Content


Modern data science workflows often involve fetching data from multiple APIs, reading from distributed databases, or scraping vast numbers of web pages. These are I/O-bound tasks, meaning the program spends most of its time waiting for external systems to respond, not using the CPU. Traditional synchronous code executes these tasks one after another, wasting precious time idling. Asynchronous programming with Python's asyncio library allows you to handle hundreds of such waiting periods concurrently, dramatically speeding up data collection pipelines. Mastering async and await transforms you from a data practitioner who waits into one who orchestrates.

Understanding the Async Paradigm: Coroutines and the Event Loop

At the heart of asynchronous programming is the coroutine. A coroutine is a special kind of function that can pause its execution and yield control back to the system, resuming later from where it left off. You define a coroutine using async def instead of a regular def. This simple change declares that the function is eligible for asynchronous execution, but calling a coroutine function like my_coroutine() doesn't run its code immediately; it returns a coroutine object, which must be awaited or scheduled before its body executes.

The orchestra conductor for all these pausable functions is the event loop. It manages and schedules the execution of coroutines, efficiently switching between them whenever one hits a point where it must wait (like a network request). When the awaited operation completes, the event loop wakes up the coroutine and lets it continue. You rarely interact with the loop directly; instead, you use asyncio.run() as the main entry point to execute your top-level coroutine. This function creates the event loop, runs your coroutine, and closes the loop.

import asyncio

async def fetch_data():
    print("Start fetching...")
    await asyncio.sleep(2)  # Simulate a network delay
    print("Done fetching!")
    return {"data": 123}

# asyncio.run() is the gateway to your async program
result = asyncio.run(fetch_data())
print(result)

The Await Keyword: The Point of Suspension

The await keyword is the signal within a coroutine that says, "This operation might take a while. Pause me here and go work on something else until it's done." You can only use await inside a coroutine (a function defined with async def). What you await is typically another coroutine, a task, or a Future (a lower-level object representing an eventual result).

Crucially, await is not a "timeout" or a "delay" in the synchronous sense. It is a yield point. When execution hits await something(), the current coroutine is suspended, and the event loop is free to execute other ready coroutines. This is the non-blocking behavior that makes async programming so powerful for I/O. The asyncio.sleep(2) in the example above is the async replacement for time.sleep(2), as it yields control instead of blocking the entire thread.
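One subtlety worth internalizing: awaiting coroutines one after another is still sequential. The sketch below, using only the standard library, times two back-to-back awaits; each one suspends main until its sleep finishes, so the total is the sum of the delays (the 0.2-second figure is purely illustrative):

```python
import asyncio
import time

async def step(name, delay):
    await asyncio.sleep(delay)  # yield point: the loop could run other work here
    return name

async def main():
    start = time.perf_counter()
    # Awaiting one coroutine, then the next, is still sequential:
    first = await step("first", 0.2)
    second = await step("second", 0.2)
    elapsed = time.perf_counter() - start
    print(first, second, f"{elapsed:.1f}s")  # about 0.4s: the delays add up
    return elapsed

elapsed = asyncio.run(main())
```

This is exactly the limitation that asyncio.gather(), covered next, removes.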

Concurrent Execution with asyncio.gather()

Running coroutines one by one with await is still sequential. For true concurrency, you need to schedule multiple coroutines to run at the same time. This is where asyncio.gather() becomes essential. It takes multiple awaitables (typically coroutine objects), schedules them to run concurrently on the event loop, and, once awaited, produces a list of their results in the order the awaitables were passed.

For a data scientist, this is the tool for making parallel API calls. Instead of waiting for call A to finish before starting call B, you gather them. The total execution time approximates the duration of the longest individual call, not the sum of all calls.

import asyncio
import aiohttp  # An async HTTP library

async def fetch_url(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ['http://api.example.com/data1', 'http://api.example.com/data2']
    async with aiohttp.ClientSession() as session:
        # Run both fetch_url coroutines concurrently
        html_contents = await asyncio.gather(
            *(fetch_url(session, url) for url in urls)
        )
        # Process all results now available
        for content in html_contents:
            print(f"Got {len(content)} bytes")

asyncio.run(main())
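The aiohttp example requires a third-party library, but the timing benefit of gather() is visible with the standard library alone. This sketch simulates five network calls with asyncio.sleep (the 0.2-second latency is illustrative); run concurrently, they take roughly as long as one call, not five:

```python
import asyncio
import time

async def fake_fetch(i):
    await asyncio.sleep(0.2)  # stand-in for network latency
    return i * 10

async def main():
    start = time.perf_counter()
    # Five simulated calls scheduled concurrently on one event loop
    results = await asyncio.gather(*(fake_fetch(i) for i in range(5)))
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.1f}s")  # ~0.2s total, not 1.0s
    return results, elapsed

results, elapsed = asyncio.run(main())
```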

Managing Resources with Async Context Managers

Just like you use with open() as f: to reliably manage files in synchronous code, you need safe resource management in async code. An async context manager is an object whose class defines async def __aenter__(self) and async def __aexit__(self, exc_type, exc, tb) methods. You use it with the async with statement.

This pattern is critical for managing connections to databases (like async SQLAlchemy or Motor for MongoDB) or HTTP sessions (like aiohttp.ClientSession). It ensures resources like network connections are properly acquired and released, even if an error occurs during your data fetching operation. The aiohttp.ClientSession() in the previous example is an async context manager.
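To make the protocol concrete, here is a minimal sketch of a hypothetical AsyncResource class (the name and its open flag are invented for illustration) implementing __aenter__ and __aexit__:

```python
import asyncio

class AsyncResource:
    """Hypothetical resource showing the async context manager protocol."""

    async def __aenter__(self):
        await asyncio.sleep(0)  # e.g. asynchronously open a connection
        self.open = True
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await asyncio.sleep(0)  # e.g. asynchronously close the connection
        self.open = False

async def main():
    async with AsyncResource() as res:
        assert res.open  # acquired inside the block
    return res.open      # released on exit, even if an error had occurred

still_open = asyncio.run(main())
print(still_open)
```

Real libraries like aiohttp.ClientSession follow the same pattern; you rarely write __aenter__/__aexit__ yourself unless you are wrapping a resource of your own.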

When to Use Async in Data Science (And When Not To)

Asynchronous programming is beneficial for I/O-bound data collection tasks. This includes calling REST APIs, querying remote databases, web scraping, reading/writing files from cloud storage, and downloading datasets. If your workflow is a series of HTTP requests to fetch training data, converting it to async can reduce runtime from minutes to seconds.

However, async provides no benefit for CPU-bound tasks. If your code is spending most of its time performing complex mathematical calculations, training a machine learning model with NumPy/pandas/scikit-learn, or doing heavy data transformation, the event loop will be blocked. These tasks should remain in synchronous functions or be offloaded to separate processes using the multiprocessing library. The ideal async data science pipeline fetches and prepares data concurrently, then passes batches to synchronous CPU-intensive functions for computation.
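One lightweight way to keep the event loop responsive while CPU work runs is asyncio.to_thread() (Python 3.9+), sketched below. Note that threads do not escape the GIL for pure-Python math; for genuinely heavy computation, a ProcessPoolExecutor passed to loop.run_in_executor() is the process-based route the paragraph above describes:

```python
import asyncio

def crunch(n):
    # CPU-bound work: calling this directly in a coroutine would block the loop
    return sum(i * i for i in range(n))

async def main():
    # to_thread runs the blocking call off the loop's thread, so other
    # coroutines (e.g. in-flight downloads) keep making progress.
    # For GIL-bound numeric work, prefer a ProcessPoolExecutor via
    # loop.run_in_executor() instead.
    return await asyncio.to_thread(crunch, 1000)

total = asyncio.run(main())
print(total)
```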

Common Pitfalls

  1. Forgetting the await keyword: Simply calling an async function like fetch_data() returns a coroutine object but doesn't execute it. You must await it to get the result. This often manifests as unexpected output like <coroutine object fetch_data at 0x...>.
  • Correction: Ensure every call to an async function is preceded by await (unless you are passing the coroutine object to a function like asyncio.gather() or asyncio.create_task()).
  2. Blocking the event loop with synchronous code: Using a blocking library like requests for HTTP calls or time.sleep() inside a coroutine halts the entire event loop, negating all async benefits.
  • Correction: Use dedicated async libraries (aiohttp instead of requests, asyncpg instead of psycopg2) and await asyncio.sleep() instead of time.sleep().
  3. Trying to call async code from synchronous code: You cannot use await in a regular function, and you cannot directly call an async function from a synchronous script and get its result.
  • Correction: The synchronous code must either use asyncio.run() to enter the async world or delegate the async operation to a task running in an existing event loop. Structure your application so that the async "boundary" is clearly defined.
  4. Assuming asyncio.gather() guarantees parallelism: It schedules tasks concurrently, but asyncio runs everything on a single thread; coroutines interleave at await points rather than executing simultaneously. For I/O-bound work this is sufficient, because the waiting happens outside Python.
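The first pitfall is easy to reproduce. In this sketch, calling the coroutine function without await produces only a coroutine object; nothing runs until it is awaited:

```python
import asyncio
import inspect

async def fetch_data():
    return {"data": 123}

async def main():
    pending = fetch_data()               # no await: nothing has run yet
    print(inspect.iscoroutine(pending))  # just a coroutine object so far
    value = await pending                # awaiting it executes the body
    return value

result = asyncio.run(main())
print(result)
```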

Summary

  • Use async def to define a coroutine, a pausable function that forms the basis of async programming.
  • The await keyword suspends a coroutine's execution until its operation completes, allowing the event loop to run other tasks during the wait.
  • Use asyncio.run() as the main entry point to execute your top-level async function.
  • For running multiple I/O operations simultaneously, schedule them with asyncio.gather() to achieve concurrent execution and drastically reduce total wait time.
  • Manage async resources like HTTP sessions or database connections safely using async with and async context managers.
  • Apply async patterns primarily to I/O-bound workflows like data collection from APIs and avoid using them for heavy, CPU-bound computations.
