Build Custom Financial Datasets with AI Agents (without Web Scraping or Finance APIs)
OpenAI Agents SDK with OpenAI or Perplexity Sonar Models
This tutorial shows you how to build a custom AI-driven workflow that automatically collects and structures financial datasets without relying on traditional web scraping or expensive financial APIs. You'll learn how to create AI agents with OpenAI's Agents SDK Python library, output structured data with the Pydantic library so it is ready for additional LLM or machine learning analysis, and leverage web-connected models from OpenAI (GPT-4o-mini) and Perplexity (Sonar Pro) to gather unique, high-value data points such as CEO turnover, board composition, employee sentiment, and job postings, to name just a few of the variables you could collect.
Why is this important? Leveraging publicly available but unstructured information gives you a powerful competitive edge in quantitative finance and machine learning, particularly because these strategies are not yet widely priced into the market. With minimal coding, you can create tailored datasets that precisely reflect your unique investment thesis or predictive analytics strategy.
By the end of the tutorial, you'll have the practical skills to automate and customize your data collection, enabling quicker, more informed decision-making. Now is the perfect moment to explore this technology—before it becomes standard practice and the market fully prices it in.
Want to learn how to build AI automations with Python? Check out the AI Automation Crash Courses linked here: http://crashcourseai.com/
Env Setup
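Before running the code below, install the OpenAI Agents SDK along with Pydantic and pandas. (Package names are assumed from the imports used in this tutorial; the Agents SDK is published on PyPI as `openai-agents`.)

```shell
# Install the OpenAI Agents SDK plus the libraries used in this tutorial
pip install openai-agents pydantic pandas
```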
Code
####
## 1. OpenAI Models
# import os
# os.environ["OPENAI_API_KEY"] = "OPEN AI API KEY GOES HERE"
from agents import Agent, Runner, WebSearchTool
from typing import List
from pydantic import BaseModel
import pandas as pd
class CompanyInfo(BaseModel):
    company_name: str
    ticker: str
    sector: str
    founding_year: int
    number_of_employees: int
    ceo_tenure_years: float
    ceo_count_since_2010: int
    average_glassdoor_rating: float
    institutional_ownership_pct: float
    board_member_count: int
    job_positions_open: int
# 1) Instantiate the search tool
web_search = WebSearchTool()
# 2) Create the Agent
agent = Agent(
    name="CompanyInfoAgent",
    instructions="""
    For a given U.S.-listed company ticker, use the WebSearchTool to find:
    - Full company name
    - Ticker symbol
    - Sector/industry
    - Year the company was founded
    - Current total number of employees
    - Current CEO's tenure in years
    - Number of different CEOs the company has had since January 1, 2010
    - Average employee rating on Glassdoor
    - Percentage of shares held by institutional investors
    - Total number of board members
    - Current number of open job positions (globally)
    Then return exactly the JSON matching the CompanyInfo schema.
    """,
    tools=[web_search],
    output_type=CompanyInfo,
    model="gpt-4o-mini",
)
# 3) Loop over a list of tickers
tickers = [
    "AAPL",
    "MSFT",
    "GOOGL",
    "AMZN",
    "TSLA",
]
all_company_data = []
for ticker in tickers:
    # Top-level await works in a notebook; in a script, wrap this loop in an
    # async function and call it with asyncio.run()
    info = await Runner.run(agent, ticker)
    print(info.final_output)
    all_company_data.append(info.final_output.model_dump())
# 4) Create a Pandas DataFrame from the collected data
df = pd.DataFrame(all_company_data)
df
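With the DataFrame assembled, you can persist it for downstream LLM or machine learning analysis. Below is a minimal sketch using a hand-written sample row shaped like the CompanyInfo schema (the values are illustrative placeholders, not live agent output, so the snippet runs offline):

```python
import pandas as pd

# One sample record matching the CompanyInfo schema (illustrative values only)
sample_rows = [{
    "company_name": "Example Corp",
    "ticker": "EXMP",
    "sector": "Technology",
    "founding_year": 1999,
    "number_of_employees": 1200,
    "ceo_tenure_years": 4.5,
    "ceo_count_since_2010": 2,
    "average_glassdoor_rating": 4.1,
    "institutional_ownership_pct": 72.3,
    "board_member_count": 9,
    "job_positions_open": 35,
}]

df_sample = pd.DataFrame(sample_rows)

# Serialize to CSV text (or pass a filename to write to disk)
csv_text = df_sample.to_csv(index=False)
print(csv_text)
```

In the tutorial's workflow you would call `df.to_csv("company_info.csv", index=False)` on the real `df` instead.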
####
## 2. Perplexity Sonar Models
PPLX_API_KEY = "PERPLEXITY SONAR API KEY GOES HERE"
from agents import Agent, Runner, AsyncOpenAI, OpenAIChatCompletionsModel
from typing import List
from pydantic import BaseModel
import pandas as pd
class CompanyInfo(BaseModel):
    company_name: str
    ticker: str
    sector: str
    founding_year: int
    number_of_employees: int
    ceo_tenure_years: float
    ceo_count_since_2010: int
    average_glassdoor_rating: float
    institutional_ownership_pct: float
    board_member_count: int
    job_positions_open: int
# 1) Setup the perplexity client
perplexity_client = AsyncOpenAI(base_url="https://api.perplexity.ai", api_key=PPLX_API_KEY)
# 2) Create the Agent
perplexity_agent = Agent(
    name="CompanyInfoAgent_pplx",
    instructions="""
    For a given U.S.-listed company ticker, search the web to find:
    - Full company name
    - Ticker symbol
    - Sector/industry
    - Year the company was founded
    - Current total number of employees
    - Current CEO's tenure in years
    - Number of different CEOs the company has had since January 1, 2010
    - Average employee rating on Glassdoor
    - Percentage of shares held by institutional investors
    - Total number of board members
    - Current number of open job positions (globally)
    Then return exactly the JSON matching the CompanyInfo schema.
    """,
    output_type=CompanyInfo,
    model=OpenAIChatCompletionsModel(
        model="sonar-pro",
        openai_client=perplexity_client,  # Perplexity client goes here
    ),
)
# 3) Loop over a list of tickers
tickers = [
    "AAPL",
    "MSFT",
    "GOOGL",
    "AMZN",
    "TSLA",
]
all_company_data = []
for ticker in tickers:
    info = await Runner.run(perplexity_agent, ticker)
    print(info.final_output)
    all_company_data.append(info.final_output.model_dump())
# 4) Create a Pandas DataFrame from the collected data
df_pplx = pd.DataFrame(all_company_data)
print("Perplexity Sonar Pro")
df_pplx
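Because both runs share the same CompanyInfo schema, it can be useful to compare the two models' outputs side by side and flag fields where they disagree. Here is a minimal sketch using hand-made stand-in frames (the merge logic is the point; the values are illustrative, not real agent output):

```python
import pandas as pd

# Stand-in outputs for the two agents (illustrative values only)
df_openai = pd.DataFrame(
    [{"ticker": "AAPL", "number_of_employees": 161000, "board_member_count": 8}]
)
df_sonar = pd.DataFrame(
    [{"ticker": "AAPL", "number_of_employees": 164000, "board_member_count": 8}]
)

# Align the two sources on ticker; overlapping columns get model suffixes
merged = df_openai.merge(df_sonar, on="ticker", suffixes=("_openai", "_sonar"))

# Flag field-by-field agreement between the two models
for col in ["number_of_employees", "board_member_count"]:
    merged[f"{col}_match"] = merged[f"{col}_openai"] == merged[f"{col}_sonar"]

print(merged[["ticker", "number_of_employees_match", "board_member_count_match"]])
```

In the tutorial's workflow you would merge the real `df` and `df_pplx` the same way. Fields where the models disagree are good candidates for manual spot-checks before feeding the data into a model.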
Subscribe to the Deep Charts YouTube Channel for more informative AI and Machine Learning Tutorials.