Rejoining a data frame after a scrape on index.

This page summarizes the projects mentioned and recommended in the original post on /r/learnpython.

  • tqdm

    :zap: A Fast, Extensible Progress Bar for Python and CLI

  • # https://old.reddit.com/r/learnpython/comments/ql7m0c/rejoining_a_data_frame_after_a_scrape_on_index/
    # ifreeski420.py
    import pandas as pd
    # https://tqdm.github.io/
    from tqdm import tqdm

    def get_bio(url, index):
        # ...code to scrape profile bio...
        # Some of the URL rows are empty
        # and I think it de-couples from the index
        # when trying to merge everything back together.
        s = f"get_bio({url}, index)" if url != "url_2" else "bio not found"
        df = pd.DataFrame([s], columns=["bio"])
        print(df)
        return df
        #                      bio
        # 0  get_bio(url_0, index)
        #                      bio
        # 0  get_bio(url_1, index)
        #              bio
        # 0  bio not found
        #                      bio
        # 0  get_bio(url_3, index)
        #                      bio
        # 0  get_bio(url_4, index)

    df_list = []
    df = pd.DataFrame({'player_profile': [f"url_{i}" for i in range(5)]})
    print("\nInitial df")
    print(df)
    # Initial df
    #   player_profile
    # 0          url_0
    # 1          url_1
    # 2          url_2
    # 3          url_3
    # 4          url_4

    # iterrows() yields (index, row) tuples, so unpack them directly;
    # athlete_row.index on the tuple would return the tuple's index() method.
    # for index, athlete_row in tqdm(df.iterrows()):
    for index, athlete_row in df.iterrows():
        url = athlete_row['player_profile']
        data = get_bio(url, index)
        ## VERY SUSPICIOUS!
        ## data is undefined when get_bio() raises error
        # try:
        #     data = get_bio(url, index)
        # except:
        #     continue
        df_list.append(data)

    # reset_index(drop=True) renumbers the bios 0..n-1 in append order
    final_bio_frame = pd.concat(df_list).reset_index(drop=True)
    print("\nfinal_bio_frame")
    print(final_bio_frame)
    # final_bio_frame
    #                      bio
    # 0  get_bio(url_0, index)
    # 1  get_bio(url_1, index)
    # 2          bio not found
    # 3  get_bio(url_3, index)
    # 4  get_bio(url_4, index)

    final = pd.merge(df, final_bio_frame, how='left', left_index=True, right_index=True)
    print("\nfinal")
    print(final)
    # final
    #   player_profile                    bio
    # 0          url_0  get_bio(url_0, index)
    # 1          url_1  get_bio(url_1, index)
    # 2          url_2          bio not found
    # 3          url_3  get_bio(url_3, index)
    # 4          url_4  get_bio(url_4, index)
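
The snippet above only lines up because every row produces a bio. If a failed row is skipped with `continue`, `reset_index(drop=True)` renumbers the remaining bios from 0, so every bio after the gap is merged onto the wrong row. A minimal sketch of one way around this is to key each result by its original index label and let pandas align on it. Here `fetch_bio` is a hypothetical stand-in for the real scraper, and the sample URLs and the `ValueError` it raises are assumptions made for illustration only.

```python
import pandas as pd


def fetch_bio(url):
    """Hypothetical stand-in for the real scraper; raises on an empty URL."""
    if not url:
        raise ValueError("empty url")
    return f"bio for {url}"


# Assumed sample data: the third row has an empty URL.
df = pd.DataFrame({"player_profile": ["url_0", "url_1", "", "url_3"]})

bios = {}
for index, url in df["player_profile"].items():
    try:
        bios[index] = fetch_bio(url)
    except ValueError:
        # Skip the failed row; its index label is simply absent from bios.
        continue

# Building the Series from a dict keeps each bio attached to its original label.
bio_series = pd.Series(bios, name="bio")

# join aligns on the index, so the skipped row gets NaN instead of shifting
# the later bios up by one position.
final = df.join(bio_series)
print(final)
```

Because the bios are never renumbered, the row whose scrape failed shows up as a missing value in the `bio` column, and the rows after it keep their own bios.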

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
