Shaping the Future of AI with Cloud and Data
The Duhart Group has partnered with Google and Kaggle to train Gemma 2 as part of an initiative to make AI more inclusive. In collaboration with Harvard University Dataverse, the University of California, Berkeley Library, and the Stanford University Open Policing Project, we are providing diverse datasets to help ensure AI serves marginalized groups equitably.
Our mission is to ensure that AI and machine learning technologies better understand and respect the unique cultural and social experiences of marginalized groups.
By training Gemma 2 with these datasets, we aim to create AI systems that are more equitable, empathetic, and effective in addressing the needs of marginalized communities.
💡 Stay tuned for more updates on this journey towards AI inclusion and diversity!
In the world of machine learning and AI, **data quality is key**. Before training any AI model, it is critical to cleanse data to remove inconsistencies, missing values, and outliers that could negatively affect performance. This website walks you through how Systems Enterprises builds AI solutions by leveraging multi-cloud platforms and Google Gemma 2, a powerful AI language model.
Many industries today rely on data-driven decision-making. Companies like Netflix, Airbnb, and Spotify use clean data to power AI models that drive recommendation systems, customer experiences, and more. By combining high-quality data with AI models like Google Gemma 2, businesses can enhance customer engagement, improve decision-making, and gain competitive advantage.
AWS Glue is an ETL service that simplifies the process of data preparation. It helps clean, transform, and load data from various sources. With its serverless capabilities, Glue provides efficient data cleansing tools that integrate well with other AWS services.
Azure Data Factory is a robust platform for building ETL pipelines. It offers numerous transformation activities like handling missing data, aggregations, and format conversions to ensure data is properly cleansed before use in AI training.
Powered by Trifacta, Google Cloud Dataprep offers intuitive data wrangling for data cleaning and preparation. Its ability to automatically detect issues and recommend fixes makes it an indispensable tool in the data cleansing pipeline for AI applications.
IBM DataStage provides enterprise-grade ETL tools for data integration and cleansing. With its support for structured and unstructured data, DataStage ensures that high-quality data flows through the AI pipelines.
You can pull datasets from Harvard Dataverse programmatically through the Dataverse native API, using the pyDataverse client:
```python
from pyDataverse.api import NativeApi

# NativeApi (pyDataverse >= 0.3) replaces the older Api class.
api = NativeApi('https://dataverse.harvard.edu', 'YOUR_API_TOKEN')
dataset = api.get_dataset('doi:10.7910/DVN/XXXXXX')  # returns a requests.Response
files = dataset.json()['data']['latestVersion']['files']
```
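To download the actual file contents, pyDataverse provides a separate DataAccessApi. A minimal sketch, assuming the standard Dataverse JSON layout for the `id` and `filename` keys:

```python
from pyDataverse.api import DataAccessApi

data_api = DataAccessApi('https://dataverse.harvard.edu', 'YOUR_API_TOKEN')
for f in files:
    datafile = f['dataFile']
    response = data_api.get_datafile(datafile['id'])  # raw file download
    with open(datafile['filename'], 'wb') as out:
        out.write(response.content)
```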
Example script for cleansing data with AWS Glue:
```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import DropNullFields

glueContext = GlueContext(SparkContext.getOrCreate())
data = glueContext.create_dynamic_frame.from_catalog(database="harvard_db", table_name="dataset")
# DropNullFields removes all-null columns; for row-level null handling,
# convert with data.toDF().dropna() instead.
clean_data = DropNullFields.apply(frame=data)
```
Use Azure Data Factory to cleanse data through pipelines, including transformations for null handling, aggregation, and deduplication.
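The transformations themselves live in the pipeline definition in ADF. Assuming a cleansing pipeline already exists in your factory, a run can be triggered from Python with the azure-mgmt-datafactory SDK; in this sketch the resource group, factory, and pipeline names are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), 'your-subscription-id')
# Kick off an existing pipeline that performs null handling, aggregation, and deduplication.
run = client.pipelines.create_run(
    resource_group_name='your-rg',
    factory_name='your-factory',
    pipeline_name='cleanse-dataset',
)
print(run.run_id)  # use this ID to poll pipeline_runs.get() for status
```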
Dataprep allows you to cleanse and transform datasets using automated suggestions. It is exposed through a Trifacta-powered REST API rather than a google-cloud client library, so a cleansing job is triggered with a plain HTTP call. A minimal sketch, where the dataset ID and access token are placeholders:
```python
import requests

# POST /v4/jobGroups runs the recipe attached to a wrangled dataset;
# input and output locations are configured in the Dataprep flow itself.
resp = requests.post(
    'https://api.clouddataprep.com/v4/jobGroups',
    headers={'Authorization': 'Bearer YOUR_ACCESS_TOKEN'},
    json={'wrangledDataset': {'id': 12345}},  # placeholder dataset ID
)
resp.raise_for_status()
```
IBM DataStage offers enterprise-level cleansing, including null handling, deduplication, and transformations. You can write the cleansed data back to IBM Cloud Object Storage.
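Once DataStage has produced the cleansed output, writing it to IBM Cloud Object Storage can be done with the ibm-cos-sdk, which mirrors the boto3 interface. A sketch, with the endpoint, credentials, and bucket name as placeholders:

```python
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    's3',
    ibm_api_key_id='YOUR_API_KEY',
    ibm_service_instance_id='YOUR_INSTANCE_CRN',
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud',
)
# Upload the cleansed dataset produced by the DataStage job.
cos.upload_file('cleansed_data.csv', 'your-bucket', 'cleansed_data.csv')
```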
After cleansing, compare the datasets from different platforms for consistency. Tools like Google Cloud Dataprep or AWS Glue can perform the comparison step.
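For a quick local sanity check, the cleansed exports can also be compared directly with pandas, assuming both platforms wrote CSVs with the same schema (the file names here are hypothetical):

```python
import pandas as pd

glue_df = pd.read_csv('cleansed_glue.csv')
dataprep_df = pd.read_csv('cleansed_dataprep.csv')

# Row counts should match, and compare() highlights any cell-level differences
# (it requires identical shape and index; sort first if row order may differ).
assert len(glue_df) == len(dataprep_df), 'row counts diverge'
print(glue_df.compare(dataprep_df))
```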
With the cleansed dataset validated, you can launch Gemma 2 training as a Vertex AI custom job:

```python
from google.cloud import aiplatform

aiplatform.init(project='your-project', location='us-central1')
job = aiplatform.CustomTrainingJob(
    display_name='gemma-training', script_path='train_gemma.py',
    container_uri='gcr.io/cloud-ml-algos/gemma2:latest', requirements=['tensorflow', 'numpy'])
# run() has no dataset_uri parameter; pass the data location as a script
# argument that train_gemma.py parses itself.
job.run(args=['--data_uri=gs://your_bucket/cleansed_data.csv'], model_display_name='gemma2-model',
        replica_count=1, machine_type='n1-standard-4',
        accelerator_type='NVIDIA_TESLA_T4', accelerator_count=1)  # K80s are retired on Google Cloud
```
Automate the data ingestion, cleansing, comparison, and training process using Apache Airflow or AWS Step Functions to create Directed Acyclic Graphs (DAGs).
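As an illustration, an Airflow DAG wiring the four stages together might look like the sketch below; the `pipeline_steps` module and its callables are hypothetical wrappers around the snippets shown earlier on this page:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical wrappers around the ingestion, cleansing, comparison,
# and training steps shown above.
from pipeline_steps import ingest, cleanse, compare, train

with DAG('gemma2_pipeline', start_date=datetime(2024, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    t_ingest = PythonOperator(task_id='ingest', python_callable=ingest)
    t_cleanse = PythonOperator(task_id='cleanse', python_callable=cleanse)
    t_compare = PythonOperator(task_id='compare', python_callable=compare)
    t_train = PythonOperator(task_id='train', python_callable=train)
    t_ingest >> t_cleanse >> t_compare >> t_train  # enforce step ordering
```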
If you’d like to work together or learn more about the Gemma 2 project, feel free to contact Daryl Duhart.
Email: Daryl@Duharts.com
Website: Duharts.com
Direct: 650-265-1546
Menlo Park, CA