Systems Enterprises

Shaping the Future of AI with Cloud and Data

Thrilled to Share Our Partnership for AI Inclusivity

The Duhart Group has partnered with Google and Kaggle on an initiative to train Gemma 2 with a focus on AI inclusivity. In collaboration with Harvard University Dataverse, the University of California, Berkeley Library, and the Stanford University Open Policing Project, we are providing diverse datasets to help ensure equitable AI for marginalized groups.

Our mission is to ensure that AI and machine learning technologies better understand and respect the unique cultural and social experiences of marginalized groups.

By training Gemma 2 on these datasets, we aim to create AI systems that are more equitable, empathetic, and effective in addressing the needs of marginalized communities.

💡 Stay tuned for more updates on this journey towards AI inclusion and diversity!

Why Data Cleansing and AI Are Important

In the world of machine learning and AI, data quality is key. Before training any AI model, it is critical to cleanse the data to remove inconsistencies, missing values, and outliers that could degrade performance. This page walks you through how Systems Enterprises builds AI solutions by leveraging multi-cloud platforms and Google Gemma 2, a powerful AI language model.
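
As a toy illustration of the cleansing step, the short pandas sketch below drops missing values and filters out-of-range outliers (the column name and bounds are purely illustrative):


import pandas as pd

# Toy dataset with a missing value and an implausible outlier
raw = pd.DataFrame({'age': [25, 31, None, 29, 240]})

clean = raw.dropna()                           # remove rows with missing values
clean = clean[clean['age'].between(0, 120)]    # drop out-of-range outliers
print(clean)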

Real-World Impact of Data Cleansing and AI Models

Many industries today rely on data-driven decision-making. Companies like Netflix, Airbnb, and Spotify use clean data to power AI models that drive recommendation systems, customer experiences, and more. By combining high-quality data with AI models like Google Gemma 2, businesses can enhance customer engagement, improve decision-making, and gain a competitive advantage.

High-Level Technical Architecture

Cloud Providers and Their Data Cleansing Tools

AWS Glue

AWS Glue is an ETL service that simplifies the process of data preparation. It helps clean, transform, and load data from various sources. With its serverless capabilities, Glue provides efficient data cleansing tools that integrate well with other AWS services.

Azure Data Factory

Azure Data Factory is a robust platform for building ETL pipelines. It offers numerous transformation activities like handling missing data, aggregations, and format conversions to ensure data is properly cleansed before use in AI training.

Google Cloud Dataprep

Powered by Trifacta, Google Cloud Dataprep offers an intuitive data-wrangling interface for cleaning and preparing data. Its ability to automatically detect issues and recommend fixes makes it an indispensable tool in the data cleansing pipeline for AI applications.

IBM DataStage

IBM DataStage provides enterprise-grade ETL tools for data integration and cleansing. With its support for structured and unstructured data, DataStage ensures that high-quality data flows through the AI pipelines.

Steps to Build the Workflow

1. Ingest Data from Harvard Dataverse

You can pull datasets from Harvard Dataverse programmatically using the pyDataverse wrapper for the Dataverse native API:


from pyDataverse.api import NativeApi

# Connect to Harvard Dataverse and fetch a dataset record by its DOI
api = NativeApi('https://dataverse.harvard.edu', 'YOUR_API_TOKEN')
dataset = api.get_dataset('doi:10.7910/DVN/XXXXXX')
files = dataset.json()['data']['latestVersion']['files']

2. Data Cleansing on Each Platform

A. AWS Glue

Example script for cleansing data with AWS Glue:


from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import DropNullFields

# Initialize the Glue context on top of Spark
glueContext = GlueContext(SparkContext())
# Load the catalogued dataset and drop fields that are entirely null;
# for row-level null handling, convert to a Spark DataFrame and use dropna()
data = glueContext.create_dynamic_frame.from_catalog(
    database="harvard_db", table_name="dataset")
clean_data = DropNullFields.apply(frame=data)

B. Azure Data Factory

Use Azure Data Factory to cleanse data through pipelines, including transformations for null handling, aggregation, and deduplication.
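
Data Factory pipelines are typically authored in the Azure portal; runs can then be triggered programmatically. A minimal sketch using the azure-mgmt-datafactory SDK, assuming a cleansing pipeline named cleanse-harvard-data already exists (all resource names are placeholders):


from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Connect to the Data Factory management plane
adf_client = DataFactoryManagementClient(
    DefaultAzureCredential(), 'your-subscription-id')

# Trigger a run of an existing cleansing pipeline
run = adf_client.pipelines.create_run(
    resource_group_name='your-resource-group',
    factory_name='your-data-factory',
    pipeline_name='cleanse-harvard-data')
print(run.run_id)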

C. Google Cloud Dataprep

Dataprep cleanses and transforms datasets using automated suggestions. It does not ship a dedicated Python client; flows are built in the Dataprep UI, and job runs can then be triggered through its REST API. A minimal sketch using the requests library, assuming a Dataprep recipe (wrangled dataset) already exists and its numeric id is known:


import requests

# Trigger a run of an existing recipe via the Dataprep v4 REST API;
# the access token and wrangledDataset id are placeholders
resp = requests.post(
    'https://api.clouddataprep.com/v4/jobGroups',
    headers={'Authorization': 'Bearer YOUR_ACCESS_TOKEN'},
    json={'wrangledDataset': {'id': 12345}})
resp.raise_for_status()
print(resp.json()['id'])  # id of the launched job group

D. IBM DataStage

IBM DataStage offers enterprise-level cleansing, including null handling, deduplication, and transformations. You can write the cleansed data back to IBM Cloud Object Storage.
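
DataStage jobs themselves are designed and run in the DataStage tooling; the write-back to IBM Cloud Object Storage can be scripted. A minimal sketch using the ibm-cos-sdk package, with placeholder credentials, endpoint, and bucket name:


import ibm_boto3
from ibm_botocore.client import Config

# Connect to IBM Cloud Object Storage (credentials are placeholders)
cos = ibm_boto3.client(
    's3',
    ibm_api_key_id='YOUR_API_KEY',
    ibm_service_instance_id='YOUR_SERVICE_INSTANCE_CRN',
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud')

# Upload the cleansed dataset produced by the DataStage job
cos.upload_file('cleansed_data.csv', 'your-bucket', 'cleansed_data.csv')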

3. Compare Cleansed Data

After cleansing, compare the datasets from the different platforms for consistency, checking row counts, schemas, and values. The comparison can be scripted within tools like Google Cloud Dataprep or AWS Glue, or done locally as sketched below.
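
A lightweight local check with pandas, assuming both platforms wrote CSVs with the same columns and row order (paths are placeholders):


import pandas as pd

# Load the cleansed outputs from two platforms
glue_df = pd.read_csv('cleansed_glue.csv')
adf_df = pd.read_csv('cleansed_adf.csv')

# Consistency checks: shape, schema, then cell-level differences
print(glue_df.shape == adf_df.shape)
print(sorted(glue_df.columns) == sorted(adf_df.columns))
print(glue_df.compare(adf_df))  # an empty frame means identical values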

4. Train Google Gemma 2 on Google Cloud


With the cleansed data in Cloud Storage, launch a custom training job through the Vertex AI SDK. The training container image below is a placeholder for an image that bundles the Gemma 2 training code:


from google.cloud import aiplatform

# Initialize the Vertex AI SDK for the target project and region
aiplatform.init(project='your-project', location='us-central1')
job = aiplatform.CustomTrainingJob(
    display_name='gemma-training', script_path='train_gemma.py',
    container_uri='gcr.io/cloud-ml-algos/gemma2:latest',  # placeholder image
    requirements=['tensorflow', 'numpy'])
# Pass the cleansed dataset path to the training script as an argument
job.run(args=['--data', 'gs://your_bucket/cleansed_data.csv'],
        model_display_name='gemma2-model', replica_count=1,
        machine_type='n1-standard-4',
        accelerator_type='NVIDIA_TESLA_K80', accelerator_count=1)

5. Automate the Entire Workflow

Automate the data ingestion, cleansing, comparison, and training process using Apache Airflow, which expresses the workflow as a Directed Acyclic Graph (DAG), or AWS Step Functions, which models it as a state machine.
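
A minimal Airflow 2 sketch of the four-stage DAG; the task callables are placeholders to be wired to the platform-specific code from the previous steps:


from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; fill in with the ingestion, cleansing,
# comparison, and training logic shown above
def ingest(): ...
def cleanse(): ...
def compare(): ...
def train(): ...

with DAG('gemma2_pipeline', start_date=datetime(2024, 1, 1),
         schedule='@weekly', catchup=False) as dag:
    t1 = PythonOperator(task_id='ingest', python_callable=ingest)
    t2 = PythonOperator(task_id='cleanse', python_callable=cleanse)
    t3 = PythonOperator(task_id='compare', python_callable=compare)
    t4 = PythonOperator(task_id='train', python_callable=train)
    t1 >> t2 >> t3 >> t4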

Contact Us

If you’d like to work together or learn more about the Gemma 2 project, feel free to contact Daryl Duhart.

Email: Daryl@Duharts.com

Website: Duharts.com

Direct: 650-265-1546

Menlo Park, CA