Suppose a startup-stage company just getting started on its AI/ML journey has budget for 3 FTEs. The company already has traditional ETL/BI expertise and a DWH. Who would you hire (data scientist, ML engineer, data engineer), and how would you divide the responsibilities?
1. Depending on your application, the end users may come from different target markets/backgrounds. If that's the case for your app, list the top X markets and create a specific landing page that "speaks" the language of each target market.
2. Cold outreach: find your ideal target customers on LinkedIn/Twitter, Google them, message/email them on social media (lead-finding tools help), and ask for help. Be willing to offer to pay them for 10-15 minutes of their time. At least a few will help without asking for money.
3. Assuming what you are selling is described on a landing page (it doesn't have to be), you can run a user test by asking consumers questions with survey tools. The goal is to find out whether users understand what you are trying to sell (clarity of message, trustworthiness). You can use tools like SurveyMonkey, Google Surveys, or even Facebook ads.
Here are a couple of examples from a purpose-built feedback tool called ninjafeedback:
@legg0myegg0 thanks for bringing up Dremio. Can Dremio connect to different data sources, e.g. RDBMSs, Kafka, and data-lake file formats like ORC? Or is it limited to certain data stores? And is the primary use case a unified query engine (so that the code remains consistent across DB engines), or is it query acceleration?
@gwittel, appreciate you sharing your insights. Could you elaborate on "RDBMS will have natural limitations"? Can you provide a specific example?
Presto gets most of its speed from parallelizing work and taking advantage of columnar formats when it can.
In the case of an RDBMS, can you get performance gains by parallelizing a query across many clients? It depends on the DB adapter and the query. In the general case, slicing a query into N shards won't necessarily make it faster: it's still the same database underneath, bound by the same hardware performance limits.
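To make the "slice a query into N shards" idea concrete, here is a minimal sketch that turns one range scan into N WHERE predicates a client could issue in parallel. The helper name, table, and column are hypothetical; the point is that all N shards still land on the same database and the same hardware.

```python
def shard_predicates(column, lo, hi, n):
    """Split the half-open id range [lo, hi) into up to n WHERE clauses."""
    step = (hi - lo + n - 1) // n  # ceiling division so the shards cover [lo, hi)
    preds = []
    for i in range(n):
        start = lo + i * step
        end = min(start + step, hi)
        if start >= hi:
            break
        preds.append(f"{column} >= {start} AND {column} < {end}")
    return preds

# Hypothetical example: four shards over ids [0, 100)
for pred in shard_predicates("id", 0, 100, 4):
    print(f"SELECT * FROM orders WHERE {pred}")
```

Each printed query could run from a separate client, but since every shard hits the same disks and CPUs, N shards rarely translates to an N-times speedup on a single RDBMS.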
Yeah this is a common misconception. Trino and Presto were aimed to replace and speed up the Hive engine.
As you say gwittel, adding Trino in front of an RDBMS won't by itself speed things up. However, if you have operational data sitting in that RDBMS and other data sitting in a data lake somewhere like S3, then you can quickly join those datasets together.
Trino does its best to take advantage of any existing indexes in the RDBMS by pushing predicates down, but it won't return that data any faster than the underlying database could. It's the ability to join with datasets from other data sources that makes the RDBMS connector worthwhile.
If you have a 1GB customer dataset in MySQL and a 100TB dataset of all your orders on S3, Trino will first run a quick query against your MySQL database, get the list of customer ids that match the query, and then use that list to filter the orders.
SELECT *
FROM mysql.db_name.customer AS c
JOIN s3.db_name.orders AS o ON c.id = o.customer_id
WHERE c.credit_card_num = 123456789;
• There are many databases and tools but where is the data platform?
• There are only three roles in data
• Data visualization color guide
• The Future of the (Modern) Data Stack
• Indexes in Postgres