Masthash

#DataEngineering

Nick Rankovic 🍉
8 hours ago

Shocked that #pyspark 's `to_timestamp` doesn't handle milliseconds out of the box but I have to use workarounds 🤯 #dataengineering

brozu ▪️
5 days ago

#Nvidia releases **RAPIDS cuDF**, which uses your GPU to speed up #Pandas operations by up to 150x!

🔗 https://github.com/rapidsai/cudf

#DataScience #DataAnalysis #DataEngineering #GPUAcceleration #rapid #cuDF #RapidscuDF

Epimorphics
5 days ago

This week we’ve completed a number of interesting client project technical reports, continued the development work on an emerging product, started the design phase of a new client project and dug into the details of our carbon footprint data for our financial year.

A packed week for the team, thanks to everyone we've chatted to and worked with this week. Have a lovely weekend.

#ThisWeek #ConnectedData #DataMaturity #DataStandards #DataEngineering

Image: a collection of Epimorphics technical report cover sheets in bright colours, stacked and slightly offset. Alongside this decorative image is the text "for more about our projects see our project pages or our blog" with the link www.epimorphics.com/blog, and under this the Epimorphics connecting data logo.

Jan Smeddinck
5 days ago

Together with our partner institution Salzburg Research we still have an amazing #job opportunity open for a #PhD Student (f|m|non-binary) – Data Engineer in Digital Health. #DataEngineering #DataScience #MachineLearning #ArtificialIntelligence #DevOps #PersonalizedHealth

The job advertisement is aimed at candidates with a strong technical background (esp. programming skills) e.g. in computer science, data science, AI, machine learning, HCI, or related fields.

Epimorphics
1 week ago

Unlock the full potential of your data by applying web principles to data integration. We empower valuable insights, underpin data engineering and data science, and enable innovative data solutions.

https://www.epimorphics.com

#ConnectedData #LinkedData #DataIntegration #DataInsights #FAIRdata #DataEngineering #DataScience

Decorative image with 4 large coloured circle backgrounds overlaid with some data, web, environment, food, housing and technology themed sketch-style icons in black. In the centre, the text for the link: www.epimorphics.com in dark blue.

brozu ▪️
1 week ago

There can be a tendency to think of #data as strictly numbers and math. The reality is that data consists of people-driven inputs.

#datascience #dataviz #dataengineering #DataAnalysis

Epimorphics
2 weeks ago

This week we released an update to our SAPI-NT #API Library. We’ve also been in an early design sprint for some client service updates, pulling data together for our annual carbon footprint assessment, contributing to some IcebreakerOne Stream workshops, and progressing one of our data technology projects.

https://www.epimorphics.com

#ThisWeek #ConnectedData #LinkedData #DataEngineering #GovTech #OpenData #DataStandards #Stream #DataTechnology

Decorative image with a large mid-yellow circle background overlaid with some data, web and technology themed sketch-style icons in black. Underneath, the text for the link: www.epimorphics.com in black.

rmoff 🏃🏻 🍺 🥓
2 weeks ago

My learning #apacheflink blog series continues, with a poke around options for connecting to Flink with JDBC: https://rmoff.net/2023/11/16/learning-apache-flink-s01e06-the-flink-jdbc-driver/

#dataEngineering #openSource #streamProcessing #JDBC

Epimorphics
2 weeks ago

Last week we highlighted some of our #TechTalks over the last year - have a look at recent posts

Yesterday we released an update to our SAPI-NT API Library - for more about our API libraries see: https://www.epimorphics.com/api-libraries/

#ConnectedData #LinkedData #DataEngineering #SAPI-NT #API

Image: photo of a yellow arrow LED sign in an atmospheric dark urban setting, with the Epimorphics connecting data logo.

Data Dave
2 weeks ago

Are you using dbt?

I'm looking to learn more about the scale of dbt projects and the review process for modeling changes.

Can you answer 9 questions about your dbt usage?
(they're all multiple choice - yay!)

https://docs.google.com/forms/d/1O3korvArsxViiZbfatSw0RS9ocG5eJ4dh2D49zr5YSM

#dbt #DataEngineer #DataEngineering #AnalyticsEngineering #DataModeling #DataOps

Gordon Inggs
2 weeks ago

Alas, losing one of our talented data engineers to the lure of overseas! Anyone interested in mid-to-senior #DataEngineering, #CivicTech work at the best* city in the world, feel free to get a hold of me - <my first name>.<my surname>@capetown.gov.za

Believers (in a better world) need apply! Please boost.

* Come on, I grew up in #CapeTown, I'm pretty biased.

rmoff 🏃🏻 🍺 🥓
3 weeks ago

It might be 45 sleeps until Christmas, but it's only FOUR DAYS until @gunnarmorling and I launch our monthly roundup of interesting things happening in the streaming and data space. Stay tuned to the Decodable blog next week :)

👉 🗞️ https://decodable.co/blog 🎁

#openSource #dataEngineering #streamProcessing #databases #changeDataCapture

DALL·E 3 pixel art image of a modern computer stream processing system in a style reminiscent of an early 1990s video game.

bart6114 :starfleet:
3 weeks ago

I'm super excited to share that Sam Debruyn, recently crowned with the #dbt Community Award 👑, now has a new dbt adapter: dbt-timescaledb!

Integrating TimescaleDB with dbt workflows means more streamlined processes, quicker insights, and a more economical approach to data analysis.

🔗 https://github.com/sdebruyn/dbt-timescaledb 👀
⭐️ YAS. This deserves some stars.

#DataTransformation #dbt #TimescaleDB #DataAnalytics #ELT #DataEngineering dbt Labs

Luke 🎄
3 weeks ago

Once again, SQLite to the fucking rescue.

Have to analyze a ~450MB CSV file of merchant data.

sqlite> .mode csv
sqlite> .import 2023_september_CAID.csv merchants

Bam! Now I can query the merchants table with SQL rather than hacking together scripts to do funky CSV things.

#DataEngineering
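
For anyone who would rather script the same import than type it into the sqlite3 shell, here is a minimal sketch using only Python's standard library. It assumes the CSV has a header row and plain column names, and it stores every column as text; the `import_csv` helper and the tiny demo file with its `merchant_id` and `city` columns are made up for illustration and are not from the post.

```python
import csv
import os
import sqlite3
import tempfile


def import_csv(conn, csv_path, table):
    """Load a CSV file with a header row into a new SQLite table.

    Assumption: column names contain no double quotes. All values are
    stored as text, much like the shell's .import does for a new table.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        # executemany streams the remaining rows straight from the reader.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


# Tiny stand-in for the real file (hypothetical columns).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "merchants.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(
            [["merchant_id", "city"], ["1", "Leeds"], ["2", "York"], ["3", "Leeds"]]
        )
    conn = sqlite3.connect(":memory:")
    import_csv(conn, path, "merchants")
    rows = conn.execute(
        "SELECT city, COUNT(*) FROM merchants GROUP BY city ORDER BY city"
    ).fetchall()

print(rows)
```

For a one-off look at a file, the two shell commands in the post are still the shorter route; a script like this mostly pays off when the import has to be repeated.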

Epimorphics
3 weeks ago

We support organisations developing their data strategies and designing their data architecture, helping to drive innovation.

Our data engineering experts provide advice and consultancy to IT teams, service companies and others.

Trust us to help you unlock the full potential of your data.
https://www.epimorphics.com/consultancy

#DataStrategy #DataArchitecture #DataEngineering #ConnectedData

Image: photo of the London skyline as it gets dark, with an orange and red sky, a decorative angled accent to the right in pink, the Epimorphics swish logo and the link www.epimorphics.com. In the centre of the photo, in large bold white text: consultancy, advice & support.

Epimorphics
3 weeks ago

We’ve posted recently about some of our open data and other data and tech projects (see: https://www.epimorphics.com/blog ), covering themes including data standards, real time data, data APIs, data maturity, data management, ChatGPT and data engineering. Our current projects will expand on this soon.

#ODISummit2023 #ConnectedData #FAIRdata #AI #DataScience #LinkedData #DataStandards #RealTimeData #DataAPIs #DataMaturity #DataManagement #ChatGPT #DataEngineering

Image of a MacBook with the Epimorphics website blog page on the screen. Below the MacBook, the text "Our Blog" followed by the link: www.epimorphics.com/blog

Epimorphics
3 weeks ago

In the age of AI and data science, linked data acts as a unifying force: it empowers organisations to use their data assets to the full, fosters collaboration, accelerates research, and enhances the reusability of data.

#ODISummit2023 #ConnectedData #FAIRdata #AI #DataScience #LinkedData #DataStandards #RealTimeData #DataAPIs #DataMaturity #DataManagement #ChatGPT #DataEngineering

rmoff 🏃🏻 🍺 🥓
3 weeks ago

📝A nice collection of real-world production uses of @ApacheKafka and #ApacheFlink, collected and curated by Thanh Tung Dao 👏:
🌟 Flink: https://github.com/dttung2905/flink-at-scale
🌟 Kafka: https://github.com/dttung2905/kafka-in-production

#dataEngineering #openSource #streamProcessing

Code The City
3 weeks ago

This Tuesday evening we team up with The Data Lab on the latest Aberdeen Data Meetup at ONE Tech Hub.

The theme for this one is "How to get started in a career in data". Join us to network over pizza and drinks. Hear talks from Charlotte McLean and Lesley-Anne Kelly.

Thanks to ScotlandIS and CodeBase Aberdeen for supporting our sessions this year. 

Book now https://ti.to/code-the-city/adm-nov-2023 

#datascience #dataengineering #datavisualisation #datajournalism #aberdeen 

Tsutsuku (he/they)
4 weeks ago

I went all the way into the office on a Friday because I have an important modification to make ASAP. I get to the office and find out that the entire server that feeds my UI is down for up to 24 hours and not even the prod instance is working.
#programming #dataengineering

Code The City
1 month ago

Next Tuesday we team up with The Data Lab on the latest Aberdeen Data Meetup at ONE Tech Hub.

The theme for this one is "How to get started in a career in data". 

Join us to network over pizza and drinks. Hear talks from Charlotte McLean and Lesley-Anne Kelly.

Thanks to ScotlandIS and CodeBase Aberdeen for supporting our sessions this year. 

Book now https://ti.to/code-the-city/adm-nov-2023 

#datascience #dataengineering #datavisualisation #datajournalism #aberdeen 

Epimorphics
1 month ago

We've been posting #TechTalk and other hopefully interesting posts on our blog https://www.epimorphics.com/blog/ but for those who prefer to use Medium see:
https://epimorphics.medium.com/

#ConnectedData #DataEngineering #LinkedData #DataStories #FAIRdata #Blog #Medium

Photo looking down at a desk with an open laptop showing a decorative graphic of some icons and links with the #Data and #GovTech hashtags, and some notebooks, one open with the text: Projects and Case studies, Food Standards Agency, HM Land Registry, UK CEH, Environment Agency, Natural Resources Wales, MetOffice, etc… Over the image, an accented decorative graphic in black with the Epimorphics logo and www.epimorphics.com link.

rmoff 🏃🏻 🍺 🥓
1 month ago

☁️Y'know the whole snarky thing about "it's not the cloud, it's just someone else's computer"? 🙄

Well what if the Kafka you wanted to connect to was literally on… just someone else's computer? 😁

✍️Blogged: Using @ApacheKafka with #ngrok https://rmoff.net/2023/11/01/using-apache-kafka-with-ngrok/

#openSource #dataEngineering

Jan :rust: :ferris:
1 month ago

#CsvDiff has finally reached v0.1.0, its first ever non-alpha/-beta release! 🎉

New features like getting at the headers from the diffresult have been needed for the following PR in qsv (which is in final review):
https://github.com/jqnatividad/qsv/pull/1395

When merged, you'll be able to decide whether the diffresult should output headers or not (see examples in the PR). :awesome:

Check out csv-diff's Changelog for the full details:
https://gitlab.com/janriemer/csv-diff/-/blob/main/CHANGELOG.md?ref_type=heads#010-30-october-2023

#CSV #qsv #CLI #DataScience #DataEngineering

rmoff 🏃🏻 🍺 🥓
1 month ago

I'm impressed by the pace and innovation of the #debezium project—TIL it now ships with a JDBC *sink* for Kafka Connect: https://debezium.io/documentation/reference/stable/connectors/jdbc.html

#dataEngineering #openSource #apacheKafka

Code The City
1 month ago

Just a week until we team up with The Data Lab - Innovation Centre on the latest Aberdeen Data Meetup at ONE Tech Hub.

The theme for this one is "How to get started in a career in data". 

Join us to network over pizza and drinks. Hear talks from Charlotte McLean and Lesley-Anne Kelly.

Thanks to ScotlandIS and CodeBase Aberdeen for supporting our sessions this year. 

Book now https://ti.to/code-the-city/adm-nov-2023 

#datascience #dataengineering #datavisualisation #datajournalism #aberdeen

Gordon Inggs
1 month ago

Many of our pipeline steps follow the pattern of

```python
read_data_from_datastore()

# do exciting and ill-advised stuff here

write_data_back_to_datastore()
```

And frequently when debugging and/or deving, I'll comment out the write step out of paranoia. And, as you'd guess, I frequently commit the commented-out line. I suppose it's a soft failure, but it's very hard to debug.

Pipelines behave as they should, dependencies are called, 99% of the log lines you expect are there, but...

#DataEngineering

Thanks for all the input! So it looks like in #datascience, #dataengineering and related areas, #python seems to be superseding #rstats as a preferred skill set. As I teach data skills in the #LifeSciences with the key learning objective being an ability to process, visualise and analyse data in a reproducible and scalable way, we're not fixated on what methods we use, so going forwards, it's something to consider...

Mike Spillane
1 month ago

A great video by Simon Whiteley from Advancing Analytics on Data Lake architecture and how Advancing Analytics build lakes and the different stages of processing they go through.

A great takeaway is not getting hung up on the medallion architecture and confining yourself to the three steps of Bronze, Silver and Gold to process your data.

https://www.youtube.com/watch?v=fz4tax6nKZM

#dataArchitecture #dataLake #dataEngineering

Epimorphics
1 month ago

In case you missed it last week, we posted our most recent #TechTalk blog post exploring a topic that has come up in discussions: URLs and URIs for #LinkedData

Take a look at: https://www.epimorphics.com/uris-for-linked-data/

#TechTalk #URLs #URIs #LinkedData #DataEngineering #ConnectedData #Identifiers

Decorative image: bold white text #TECHTALK over a stylised photo of a partially opened bright laptop screen, with the subtitle "URIs for Linked data" and a corner accent in bright cyan with the Epimorphics connecting data logo and www.epimorphics.com link in white.

Epimorphics
1 month ago

Unlock the full potential of your data by applying web principles to data integration. We empower valuable insights, underpin data engineering and data science, and enable innovative data solutions.

https://www.epimorphics.com/

#ConnectedData #LinkedData #DataIntegration #DataInsights #DataEngineering & #DataScience

Decorative image with 4 large coloured circle backgrounds overlaid with some data, web, environment, food, housing and technology themed sketch-style icons in black. Underneath, the text for the link: www.epimorphics.com in black.

Epimorphics
1 month ago

In our newest tech talk we explore an old topic, but one that keeps coming up in discussions: URLs and URIs for Linked Data - the why, the what and the how of URIs.

https://www.epimorphics.com/uris-for-linked-data/

#TechTalk #URLs #URIs #Identifiers #LinkedData #DataEngineering #ConnectedData

Decorative image: bold white text #TECHTALK over a stylised photo of a partially opened bright laptop screen, with the subtitle "URIs for Linked data" and a corner accent in bright cyan with the Epimorphics connecting data logo and www.epimorphics.com link in white.

Catalyst Cooperative
1 month ago

Ooof, it took a year to close this issue, completing the migration of our ETL and analysis assets into @dagster but we're finally done.

We still need to rename all the tables so there's some semblance of a clear naming convention, but you no longer have to install all the software required to run the ETL if you just want to access all of our outputs.

https://github.com/catalyst-cooperative/pudl/issues/1973

#OpenData #DataEngineering #EnergyMastodon

Sal Rahman
2 months ago

So if I understand it correctly, the reason why Twitter went with the fan-out-write approach was because you had a tradeoff between writing to an indexed table, or an unindexed table.

In an unindexed table, writes are fast; in an indexed table, writes are slow.

So they decided to write to an unindexed table, but fan out the write to an in-memory cache of each user's timelines.

Does that sound right?

#SoftwareArchitecture #DataArchitecture #DataEngineering
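
To make the pattern in the question concrete, here is a toy sketch of fan-out-on-write: each post is appended once to a plain log, and its id is pushed into an in-memory timeline for every follower, so reading a timeline needs no index lookup. This is only an illustration of the idea as described in the post, not how Twitter actually implemented it; every name here, and the `TIMELINE_LIMIT` cap, is an assumption.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 800  # assumed cap on cached entries per user

followers = defaultdict(set)  # author -> ids of users following them
tweets = []                   # append-only log: cheap writes, no secondary index
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))  # user -> newest-first tweet ids


def follow(follower, author):
    followers[author].add(follower)


def post(author, text):
    """Append the tweet once, then fan its id out to every follower's cache."""
    tweet_id = len(tweets)
    tweets.append((author, text))
    for user in followers[author]:
        timelines[user].appendleft(tweet_id)
    return tweet_id


def read_timeline(user, n=10):
    """Reads touch only the precomputed cache, newest first."""
    return [tweets[i] for i in list(timelines[user])[:n]]


follow("bob", "alice")
follow("carol", "alice")
post("alice", "hello")
post("alice", "hello again")
print(read_timeline("bob"))
```

The cost moves to write time: one post by an author with many followers turns into that many cache updates, which is the usual argument against using pure fan-out-on-write for very popular accounts.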

Sal Rahman
2 months ago

Does Mastodon use fan-out write?

#Mastodon #DataEngineering #SoftwareArchitecture

Joe Wood
2 months ago

Useful summary of the #dataengineering stack. Missing for me are the extract, load tools like #airbyte, #meltano. Would have liked to see some governance tooling to support data mesh, catalog support too.

https://thenewstack.io/the-architects-guide-to-the-modern-data-stack/

#databuildtool

Link preview: NewStack article on modern data stack

Catalyst Cooperative
2 months ago

A great interview on the Data Engineering Podcast with the founder of CoverageCat, which liberates insurance coverage and pricing data from intentionally obfuscated data sources -- often published as scanned PDFs with constantly shifting structures.

Kindred spirits! And their mascot is also a cat!

https://www.coveragecat.com/

https://www.dataengineeringpodcast.com/coveragecat-insurance-industry-data-engineering-episode-395

#insurance #DataScience #DataEngineering #podcast

0️⃣0️⃣ - Chapter 00 - A New Frontier
🔗 https://voltrondata.com/codex/a-new-frontier

There are 100s of database systems available, and yet, companies with lots of data don't find one for their niche, and instead develop their own. It is a very expensive endeavor 💸 and all these systems are not that different. 🧵 2/...

#ComposableCodex #data #DataEngineering

We have companies that have a lot of money writing the same database system software over and over.

- Andy Pavlo

Mike Blake ☮️
2 months ago

I'm looking for a role leading an engineering team. I have several years of experience both managing teams and working as an individual contributor. Strong recent #BigData and #DataEngineering experience.

https://github.com/AppTrain
https://www.linkedin.com/in/apptrain

#FediHire #HireMe #FediHired #jobs #python #sql #javascript #nodejs #dataMesh #APIs #MasterDataManagement #Spark #dbt #starRocks

INNOQ
2 months ago

On June 30th, we celebrated women+ in #dataengineering #datascience #machinelearning and #mlops at the Women+ in Data and AI summer festival. Let’s look back and reflect on how we bootstrapped a tech conference with a 100% women+ speaker lineup. Full story 👉 https://www.innoq.com/en/blog/2023/09/women-plus-in-data-and-ai-summer-festival-2023/

Pascal Laliberté
2 months ago

Do I know any #Rails shops using dbt-core for data engineering?

#ruby #rubyonrails #dataengineering #dbt

Code The City
3 months ago

On Wednesday at ONE Tech Hub, Robert McWilliam will explain how using the async library will speed up your code execution. We'll have pizza and drinks, then his talk, then you get a chance to try the code yourself. See how fast you can get your code running!

Book tickets now! https://ti.to/code-the-city/apug-sep-2023

#Python #coding #programming #pizza #networking #KnowledgeExchange #datascience #dataengineering #webdev

InfoQ
3 months ago

Dive into our annual #InfoQ #TrendsReport where we dissect the latest in #AI, #ML & #DataEngineering.

Discover key trends & insights that every software engineer, architect, or data scientist should keep an eye on: https://bit.ly/3Z9gadO

💪 Knowledge is power! #StayAhead of the curve!

Enormous kudos to Roland Meertens, Srini Penchikala, Sherin Thomas, Daniel Dominguez & Anthony Alford for their role in this report.

Code The City
3 months ago

Last chance to book!

Tomorrow evening Aberdeen Data Meetup returns - with the postponed talk from Harris Assad on "Why Data Engineering is the Imperative Foundation for Data Science". 

Join us for networking over pizza and drinks, followed by Harris' talk.

Book now https://dashboard.tito.io/code-the-city/adm-sep-2023

#datascience #dataengineering #aberdeen #networking 

Code The City
3 months ago

This Tuesday Aberdeen Data Meetup is back - with the postponed talk from Harris Assad on "Why Data Engineering is the Imperative Foundation for Data Science". 

Join us for networking over pizza and drinks, followed by Harris' talk. 

Book now https://dashboard.tito.io/code-the-city/adm-sep-2023

#datascience #dataengineering #aberdeen #networking 

Code The City
3 months ago

Whether you are a data scientist, web developer, astrophysicist, or pharmacological researcher, you need fast code execution. On 13th September Robert McWilliam will explain how using the async library speeds up code execution. We'll have pizza and drinks, then his talk, then you get a chance to try the code yourself. See how fast you can get your code running!

Book tickets now! https://ti.to/code-the-city/apug-sep-2023

#Python #coding #programming #KnowledgeExchange #datascience #dataengineering #webdev

Casper Weiss Bang
3 months ago

An #introduction.

My name is Casper. I am a software engineer, currently working with #dataengineering.

Currently busy with #renovation of a 1911 house with my lovely wife and 3-year-old daughter. Oh, and I have a new obsession with improving #biodiversity in our small garden.🌻

I live in the city center, and identify (a bit too) strongly with being #carfree (but #cargobike owner)

Regarding #softwaredevelopment, I am inspired by #ddd , #xp #tdd , and other abbreviations. 👩‍💻

Steven Kell
3 months ago

PyData Global 2023
December 6 - 8
CFP closes on October 1st

We welcome attendees with various experiences, expertise, and backgrounds to join us virtually. Users, contributors, and newcomers can share experiences and learn from one another to solve challenging problems and grow a stronger open-source community.

https://lnkd.in/gD9-z5aT

#pydata #numfocus #datascience #ai #artificialintelligence #conference #machinelearning #dataengineering

Code The City
3 months ago

Next Tuesday evening Aberdeen Data Meetup is back - with the postponed talk from Harris Assad on "Why Data Engineering is the Imperative Foundation for Data Science". 

Join us for networking over pizza and drinks, followed by Harris' talk. 

Book now https://dashboard.tito.io/code-the-city/adm-sep-2023

#datascience #dataengineering #aberdeen #networking 

PipeRider
3 months ago

Want to see how easy it is to sign up for PipeRider Cloud and share your first report?

In this video we take an existing dbt project and:

- Sign up for PipeRider Cloud
- Generate a Data Impact Report
- Upload the report and share it

All in a matter of minutes:

https://youtu.be/tex8fLQLzDo

#PipeRider #DataQuality #DataEngineering #AnalyticsEngineering #DataOps #DataViz #DataTeam #DataLineage

Code The City
3 months ago

One week until Aberdeen Data Meetup returns with the postponed talk from Harris Assad on "Why Data Engineering is the Imperative Foundation for Data Science". 

Join us for networking over pizza and drinks, followed by Harris' talk. 

Book now https://dashboard.tito.io/code-the-city/adm-sep-2023

#datascience #dataengineering #aberdeen #networking 

Code The City
3 months ago

Less than four weeks until our 30th Hack weekend; and it is all about Union Street and the City Centre. We've multiple challenges. We need you - no matter what your skills and experience. We can all play a part! 

More details of the challenges and how to book: https://codethecity.org/what-we-do/hack-weekends/ctc30/ 

#Aberdeen #history #mapping #photography #planning #osm #datascience #dataengineering #heritage

Code The City
3 months ago

Less than two weeks until Aberdeen Data Meetup is back - with the postponed talk from Harris Assad on "Why Data Engineering is the Imperative Foundation for Data Science". 

Book now https://dashboard.tito.io/code-the-city/adm-sep-2023

#datascience #dataengineering #aberdeen #networking 

Code The City
3 months ago

Just over a month until our 30th Hack weekend; and it is all about Union Street and the City Centre. We've multiple challenges. We need you - no matter what your skills and experience. We can all play a part! More details of the challenges and how to book: https://codethecity.org/what-we-do/hack-weekends/ctc30/ #Aberdeen #history #mapping #photography #planning #osm #datascience #dataengineering #heritage

Code The City
3 months ago

Just two weeks until Aberdeen Data Meetup is back - with the postponed talk from Harris Assad on "Why Data Engineering is the Imperative Foundation for Data Science". 

Join us for networking over pizza and drinks, followed by Harris' talk. 

Book now https://dashboard.tito.io/code-the-city/adm-sep-2023

#datascience #dataengineering #aberdeen #networking 

Catalyst Cooperative
4 months ago

Hey #EnergyMastodon, we finally got set up here so I guess we should do an #introduction

Catalyst is an all-remote worker #Cooperative focused on open #DataEngineering, analysis, and open source software. We mostly work with public US data related to the energy system, pulling together information from #FERC, #EIA, #EPA, and other agencies and trying to turn it into analysis-ready database tables so advocates, policymakers, & researchers can focus on their own novel work instead.

Kathe Todd-Brown
4 months ago

Wonderfully insightful links between farming, city/country planning, and informatics. Bit of a weak ending but overall I really enjoyed "Seeing like a state" (1998) by James Scott and would recommend it to anyone working in #informatics, #DataScience, or #DataEngineering related to public service https://yalebooks.yale.edu/book/9780300078152/seeing-like-a-state/ #miniBookReview

James C :python:
6 months ago

Hey #pandas #dataengineering people - is anyone *happily* using PyArrow `decimal128` types or similar for fixed point arithmetic in DataFrames?

All boosts appreciated for visibility 🔁 🙏

After multiple hours of digging and testing, I think I'm going to recommend our work project avoids them and uses plain Python `Decimal` in DF, even though there's no vectorization 😞

More thoughts in my latest comment in this GH issue thread: https://github.com/intake/pandas-decimal/issues/3#issuecomment-1590957624

#datadon

Data Dave
6 months ago

What does your testing strategy look like for dbt?

On the one hand I'd like to make sure all models are covered by tests but, on the other hand, maintaining tests and test rot is a big concern.

Do you test all models?
Do you care if some models are not covered by tests?

Any input on strategy welcome 🙏

#dbt #test #DataEngineering

Data Folks! I need your help!

If I gave you a date in the format of 1220923, would you be able to make sense of it? I desperately need human-readable dates, and I'm losing my mind.

#data #DataScience #DataEngineering
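
One possible reading, offered as a guess: 1220923 fits the seven-digit CYYMMDD layout used on some IBM i systems, where the leading digit is a century flag (0 for 19xx, 1 for 20xx). Under that assumption the value is 23 September 2022. The post does not say where the data comes from, so the format itself is an assumption, and the `parse_cyymmdd` helper below is hypothetical.

```python
from datetime import date


def parse_cyymmdd(value):
    """Parse a CYYMMDD value, where C is the century offset from 1900.

    Assumption: C=0 means 19xx and C=1 means 20xx. Values stored as
    integers may have lost a leading zero, so pad back to 7 digits.
    """
    s = str(value).zfill(7)
    if len(s) != 7 or not s.isdigit():
        raise ValueError(f"expected up to 7 digits, got {value!r}")
    century, yy, mm, dd = int(s[0]), int(s[1:3]), int(s[3:5]), int(s[5:7])
    # date() raises ValueError for impossible months or days, which is a
    # useful signal that the column is not CYYMMDD after all.
    return date(1900 + 100 * century + yy, mm, dd)


print(parse_cyymmdd(1220923).isoformat())  # 2022-09-23
```

Running the parser over the whole column is a quick sanity check: if a large share of values raise `ValueError`, the format guess is probably wrong.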

rmoff 🏃🏻 🍺 🥓
9 months ago

The essential component of any data toolkit: Duck[DB] Tape!

#datadon #duckdb #dataengineering #analyticsengineering

Ianhopkinson
9 months ago

I have a bunch of utilities in a Python package which I use across multiple Python projects. So far I have used the utilities package as an editable install, or sometimes just copied the source for the required utilities into a project.

This approach is breaking down as I start using CI/CD pipelines.

What is the best way forward?

I feel like the correct answer is "Install from private PyPi" but I'm interested to hear your experience

#python #datascience #dataengineering

kcarruthers
9 months ago

I will be giving the keynote presentation at the Connected Enterprise Forum in Sydney on 23 March. Come join us!

Would be great to see you there!

RSVP now. Free to attend.
https://streamsets.com/forum/sydney/

#ai #future #DataEngineering #BigData #DataAnalytics #ML #generativeAI

Luca Hammer
9 months ago

I never bothered with optimizing the parsing of #jsonl #ndjson files because in most cases it was a one-off task before I put the data into a database or parquet file. But the files got bigger and waiting 20 minutes for the data to load made me reconsider my decision. So I tried some different approaches.

Tested with a 1.7 GB file of 300 k Tweets.

jsonlines.reader: 17.2 seconds 100%
orjson: 6.49 seconds 37%
msgspec: 3.06 seconds 17%

I like how orjson cuts the time by two thirds without the need to change anything else. Just use it as a drop in replacement and you are good.

msgspec is twice as fast as orjson or six times as fast as jsonlines if you define the schema of the data that you want. For Tweets that's okay, as I can reuse the schema many times. With data that is used only once, I prefer orjson.

Memory usage was nearly identical across the different solutions. Probably because they all parse the data per line. I restarted the kernel each time to get comparable numbers.

Load time for all 23 million Tweets in the dataset was reduced from 25 to 4 minutes.

This blog post was useful to me: https://pythonspeed.com/articles/faster-python-json-parsing/ #Python #DataEngineering
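
A rough sketch of the per-line parsing described above, with orjson swapped in for the standard library parser when it happens to be installed. The `read_jsonl` helper and the two sample records are made up for illustration, and the timings in the post are not reproduced here.

```python
import io
import json

try:
    import orjson  # third-party and optional; pip install orjson
    loads = orjson.loads
except ImportError:
    loads = json.loads  # standard library fallback, same call shape


def read_jsonl(lines):
    """Yield one parsed object per non-empty line.

    Works on any iterable of str or bytes lines, so an open file handle
    can be passed directly and only one line is held in memory at a time.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield loads(line)


# Hypothetical two-record sample standing in for a real .jsonl file.
sample = io.StringIO('{"id": 1, "text": "hello"}\n\n{"id": 2, "text": "world"}\n')
records = list(read_jsonl(sample))
print(records)
```

With a real file, something like `with open(path, "rb") as f: for record in read_jsonl(f): ...` should work with either parser, since both accept bytes. Going further with msgspec means defining the schema up front, which this sketch deliberately leaves out.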

Ale Segura
10 months ago

I am considering a subscription to #medium What do people think? Is it worth it? I use their #software and #dataEngineering articles when suggested by my browser. So far I’ve found them useful but those articles could be a sub sample of the good ones that make it into Google’s recommendation. So far if I find a good one and I cannot access it anymore (you can only access a few per week (?) without subscription), I just open it in a private tab. Is the subscription worth this tiny extra effort?

Marcos Ortiz
10 months ago

Today I got my first paid subscriber on my Substack newsletter https://interestingdatagigs.substack.com/subscribe

All my content is free.

So: I’m incredibly grateful that someone still wants to support my work.

Today was a good day. 🙏🏾🙏🏾🙏🏾

#contentcreator #dataengineering

Mikiko
10 months ago

Data is complicated, esp. at scale & w/ lots of ppl in the mixture. A data platform needs to meet so many diff needs, esp. for enterprises - From @adipolak #datadaytexas #ddtx2023 #dataops #dataengineering

Joanna Denni
10 months ago

@ike At my previous company (~1,000 employees), our #DataEngineering team (~50 engineers) used Airflow and Astronomer (and Spark) to design and run ETL and model training DAGs. In many cases, we needed to do a lot of processing of flat files provided by third parties into our data warehouse. We also had some pipelines to export data to vendors that we used. Airflow helped us manage dependencies between pipelines, as well as monitor and debug pipelines. Disclaimer: IANADataEngineer

Mikiko
10 months ago

If the answer is similar to:
1️⃣ ASAP
2️⃣ Minimal
3️⃣ Divorced
4️⃣ We can't
5️⃣ Less than 5

Then your first step shouldn't be building an ML platform, it should be developing models or ML-driven product features using the simplest, tried & true patterns possible.

#mlops #mlplatform #datascience #mlengineering #platformengineering #dataengineering #ai #mlinproduction

Henry
10 months ago

Now I’ve been here a few weeks, found the fire exits, amenities and snacks, I should probably add an #introduction post.

Hello 👋 I’m Henry, and professionally I do #DataScience and #DataEngineering in the #Aviation industry, mostly with #python and #pyspark

Unprofessionally I spend my time #parenting two children along with my wife and occasionally manage to find time to play #guitar 🎸 improve my use of #emacs and hack at too many half forgotten #maker projects.

Marcos Ortiz
11 months ago

🦆🦆🦆 New year, a new "This Month in the DuckDB ecosystem" issue on MotherDuck https://motherduck.com/blog/duckdb-ecosystem-newsletter-two/

Highlights of the month: 🧵

✅ 2 amazing members of the community you will love to read about: @matsonj & @markhneedham

✅ A big milestone was reached for the DuckDB Python package https://twitter.com/peterabcz/status/1610684170350596110

✅ An amazing video tutorial from the one and only Marc Lamberti about how to start with DuckDB from scratch https://www.youtube.com/watch?v=AjsB6lM2-zw

#duckdb #dataengineering #serverless

Chis-R 🐟
11 months ago

So I'm hiring for a role in my team and got my first batch of (thankfully shortlisted) CVs from our HR team today. I don't mind the recruitment process - I love speaking to new people and talking about their stories & experience, it's really exciting. But hoo boy, working my way through a list of mostly badly written PDFs is a grind. 😪

As a hiring manager, a token piece of advice for any job hunters out there - please make your CV skimmable! I'm mainly looking for critical keywords related to the role. ⌨️ Try using dot points that explain what you did, with the relevant words bolded. Remove ALL of the waffly sentences; leave (some of) the waffle for the interview. Less is more.

All that said, if you have a bit of #DataEngineering experience, you're in Australia and looking for a junior role, DM me! 😃 I'll definitely enjoy a conversation on Mastodon more than skimming through a sanitised resume.

#hiring #recruitment #jobs #data #datadon