#dataengineering
Got a massive DAG?
Use this 3D visualizer specially made for dbt projects with 100s or 1000s of models!
It even works on GitLab's dbt project:
#DataEngineering #DataViz #DataLineage #dbt #DataOps #AnalyticsEngineering #DAG

Shocked that #pyspark 's `to_timestamp` doesn't handle milliseconds out of the box but I have to use workarounds 🤯 #dataengineering
#Tech #Technology #DevOps #Automation #BigData #DataAnalytics #Data #DataEngineering #AI #ML Here's a New Plan to Rein In the Gilded Tech Bros https://www.wired.com/story/new-plan-tom-wheeler-book-rein-in-the-gilded-tech-bros/?utm_source=dlvr.it&utm_medium=mastodon #ArtificialIntelligence #MachineLearning #DataScience

#Tech #Technology #DevOps #Automation #BigData #DataAnalytics #Data #DataEngineering #AI #ML AI system self-organizes to develop features of brains of complex organisms https://www.sciencedaily.com/releases/2023/11/231120124246.htm?utm_source=dlvr.it&utm_medium=mastodon #ArtificialIntelligence #MachineLearning #DataScience

#Tech #Technology #DevOps #Automation #BigData #DataAnalytics #Data #DataEngineering #AI #ML TikTok Streamers Are Staging ‘Israel vs. Palestine’ Live Matches to Cash In on Virtual Gifts https://www.wired.com/story/tiktok-live-matches-israel-hamas-war/?utm_source=dlvr.it&utm_medium=mastodon #ArtificialIntelligence #MachineLearning #DataScience

#Tech #AI #ML #Cloud #DataEngineering Spotify Open-Sources Voyager Nearest-Neighbor Search Library https://www.infoq.com/news/2023/11/spotify-ann-voyager/?utm_campaign=infoq_content&utm_source=dlvr.it&utm_medium=mastodon&utm_term=AI%2C%20ML%20%26%20Data%20Engineering-news #ArtificialIntelligence #MachineLearning #DataScience

#Tech #Technology #DevOps #Automation #BigData #DataAnalytics #Data #DataEngineering #AI #ML The UK’s Controversial Online Safety Act Is Now Law https://www.wired.com/story/the-uks-controversial-online-safety-act-is-now-law/?utm_source=dlvr.it&utm_medium=mastodon #ArtificialIntelligence #MachineLearning #DataScience


#Nvidia releases **RAPIDS cuDF**, which uses your GPU to speed up #Pandas operations by up to 150x!
🔗 https://github.com/rapidsai/cudf
#DataScience #DataAnalysis #DataEngineering #GPUAcceleration #rapid #cuDF #RapidscuDF

This week we’ve completed a number of interesting client project technical reports, continued the development work on an emerging product, started the design phase of a new client project and dug into the details of our carbon footprint data for our financial year.
A packed week for the team, thanks to everyone we've chatted to and worked with this week. Have a lovely weekend
#ThisWeek #ConnectedData #DataMaturity #DataStandards #DataEngineering

Together with our partner institution Salzburg Research we still have an amazing #job opportunity open for a #PhD Student (f|m|non-binary) – Data Engineer in Digital Health. #DataEngineering #DataScience #MachineLearning #ArtificialIntelligence #DevOps #PersonalizedHealth
The job advertisement is aimed at candidates with a strong technical background (esp. programming skills) e.g. in computer science, data science, AI, machine learning, HCI, or related fields.

#Tech #Technology #DevOps #Automation #BigData #DataAnalytics #Data #DataEngineering #AI #ML The US Has Failed to Pass AI Regulation. New York City Is Stepping Up https://www.wired.com/story/us-failed-to-pass-ai-regulation-new-york-city-stepping-up/?utm_source=dlvr.it&utm_medium=mastodon #ArtificialIntelligence #MachineLearning #DataScience


#Technology #Tech #Infrastructure #DataArchitecture #DataDriven #DataEngineering Reddit Unveils REV2: Modernised Rule-Execution with Kubernetes, Kafka, and Flink Stateful Functions https://www.infoq.com/news/2023/10/reddit-rev2/?utm_campaign=infoq_content&utm_source=dlvr.it&utm_medium=mastodon&utm_term=Architecture%20%26%20Design #DataIntelligence #DataArchitect #DataEngineer

#Tech #Technology #DevOps #Automation #BigData #DataAnalytics #Data #DataEngineering #AI #ML Neuromorphic computing will be great... if hardware can handle the workload https://www.sciencedaily.com/releases/2023/11/231106202950.htm?utm_source=dlvr.it&utm_medium=mastodon #ArtificialIntelligence #MachineLearning #DataScience

Unlock the full potential of your data. By applying web principles to data integration, we empower valuable insights, underpin data engineering and data science, and enable innovative data solutions.
#ConnectedData #LinkedData #DataIntegration #DataInsights #FAIRdata #DataEngineering #DataScience

#Tech #AI #ML #Cloud #DataEngineering OpenAI Launches GPTs to Enable Creating No-Code, Custom Versions of ChatGPT https://www.infoq.com/news/2023/11/openai-gpts-custom-chatgpt/?utm_campaign=infoq_content&utm_source=dlvr.it&utm_medium=mastodon&utm_term=AI%2C%20ML%20%26%20Data%20Engineering-news #ArtificialIntelligence #MachineLearning #DataScience

There can be a tendency to think of #data as strictly numbers and math. The reality is that data consists of people-driven inputs.


#Tech #Technology #DevOps #Automation #BigData #DataAnalytics #Data #DataEngineering #AI #ML How China’s EV Boom Caught Western Car Companies Asleep at the Wheel https://www.wired.com/story/how-chinas-ev-boom-caught-western-car-companies-asleep-at-the-wheel/?utm_source=dlvr.it&utm_medium=mastodon #ArtificialIntelligence #MachineLearning #DataScience




This week we released an update to our SAPI-NT #API Library, we’ve been in an early design sprint for some client service updates, we’ve been pulling data together for our annual carbon footprint assessment, we’ve been contributing to some IcebreakerOne Stream workshops, and we’ve been progressing one of our data technology projects.
#ThisWeek #ConnectedData #LinkedData #DataEngineering #GovTech #OpenData #DataStandards #Stream #DataTechnology




My learning #apacheflink blog series continues, with a poke around options for connecting to Flink with JDBC: https://rmoff.net/2023/11/16/learning-apache-flink-s01e06-the-flink-jdbc-driver/

#Tech #Technology #DevOps #Automation #BigData #DataAnalytics #Data #DataEngineering #AI #ML How the FTX Thieves Have Tried to Launder Their $400 Million Haul https://www.wired.com/story/ftx-hack-400-million-crypto-laundering/?utm_source=dlvr.it&utm_medium=mastodon #ArtificialIntelligence #MachineLearning #DataScience

#Tech #Technology #DevOps #Automation #BigData #DataAnalytics #Data #DataEngineering #AI #ML A Graphic Hamas Video Donald Trump Jr. Shared on X Is Actually Real, Research Confirms https://www.wired.com/story/x-community-notes-failures/?utm_source=dlvr.it&utm_medium=mastodon #ArtificialIntelligence #MachineLearning #DataScience

#Tech #AI #ML #Cloud #DataEngineering AWS Unveils Gemini, a Distributed Training System for Swift Failure Recovery in Large Model Training https://www.infoq.com/news/2023/11/aws-gemini-deep-learning/?utm_campaign=infoq_content&utm_source=dlvr.it&utm_medium=mastodon&utm_term=AI%2C%20ML%20%26%20Data%20Engineering-news #ArtificialIntelligence #MachineLearning #DataScience

Last week we highlighted some of our #TechTalks over the last year - have a look at recent posts
Yesterday we released an update to our SAPI-NT API Library - for more about our API libraries see: https://www.epimorphics.com/api-libraries/

Are you using dbt?
I'm looking to learn more about the scale of dbt projects and the review process for modeling changes.
Can you answer 9 questions about your dbt usage?
(they're all multiple choice - yay!)
https://docs.google.com/forms/d/1O3korvArsxViiZbfatSw0RS9ocG5eJ4dh2D49zr5YSM
#dbt #DataEngineer #DataEngineering #AnalyticsEngineering #DataModeling #DataOps
#Tech #Technology #DevOps #Automation #BigData #DataAnalytics #Data #DataEngineering #AI #ML Live Updates: The Trial of FTX Founder Sam Bankman-Fried https://www.wired.com/live/sam-bankman-fried-sbf-ftx-trial-live-blog-week-2/?utm_source=dlvr.it&utm_medium=mastodon #ArtificialIntelligence #MachineLearning #DataScience

Alas, losing one of our talented data engineers to the lure of overseas! Anyone interested in mid-to-senior #DataEngineering, #CivicTech work at the best* city in the world, feel free to get a hold of me - <my first name>.<my surname>@capetown.gov.za
Believers (in a better world) need apply! Please boost.
* Come on, I grew up in #CapeTown, I'm pretty biased.
It might be 45 sleeps until Christmas, but it's only FOUR DAYS until @gunnarmorling and I launch our monthly roundup of interesting things happening in the streaming and data space. Stay tuned to the Decodable blog next week :)
👉 🗞️ https://decodable.co/blog 🎁
#openSource #dataEngineering #streamProcessing #databases #changeDataCapture
I'm super excited to share that Sam Debruyn, recently crowned with the #dbt Community Award 👑, now has a new dbt adapter: dbt-timescaledb!
Integrating TimescaleDB with dbt workflows means more streamlined processes, quicker insights, and a more economical approach to data analysis.
🔗 https://github.com/sdebruyn/dbt-timescaledb 👀
⭐️ YAS. This deserves some stars.
#DataTransformation #dbt #TimescaleDB #DataAnalytics #ELT #DataEngineering dbt Labs
Over the last year we’ve published a range of #TechTalks on topics around:
#DevOps, #AWS, #Monitoring, #Security, #SPARQL, #Kubernetes, #Ontologies, #ChatGPT, #DataMaturity, #GraphQL, #Hydrology #API, #URIs for #LinkedData #DataEngineering #ConnectedData #Identifiers and #OpenData
Take a look at our blog https://www.epimorphics.com/blog

Once again, SQLite to the fucking rescue.
Have to analyze a ~450MB CSV file of merchant data.
sqlite> .mode csv
sqlite> .import 2023_september_CAID.csv merchants
Bam! Now I can query the merchants table with SQL rather than hacking together scripts to do funky CSV things.
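The same import can be sketched from Python with the standard-library `sqlite3` and `csv` modules, which may help when the sqlite3 shell isn't to hand. The sample rows, column names, and in-memory database below are made up for illustration; only the `merchants` table name comes from the post.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the real CSV file; the columns are invented.
SAMPLE_CSV = """merchant_id,name,city
1,Acme Coffee,Austin
2,Bolt Hardware,Denver
3,Cedar Books,Austin
"""

def import_csv(conn, table, csv_file):
    """Load a CSV with a header row into a new table of TEXT columns."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    # The reader is consumed row by row, so the file is never fully in memory.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()

conn = sqlite3.connect(":memory:")
import_csv(conn, "merchants", io.StringIO(SAMPLE_CSV))

rows_per_city = conn.execute(
    "SELECT city, COUNT(*) FROM merchants GROUP BY city ORDER BY city"
).fetchall()
print(rows_per_city)
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` sample and connect to a file-backed database rather than `:memory:`.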
We support organisations developing their data strategies and designing their data architecture, helping to drive innovation.
Our data engineering experts provide advice and consultancy to IT teams, service companies and others.
Trust us to help you unlock the full potential of your data.
https://www.epimorphics.com/consultancy
#DataStrategy #DataArchitecture #DataEngineering #ConnectedData

We’ve posted recently about some of our open data and other data and tech projects
(see: https://www.epimorphics.com/blog), covering themes including data standards, real time data, data APIs, data maturity, data management, ChatGPT and data engineering. Our current projects will expand on this soon.
#ODISummit2023 #ConnectedData #FAIRdata #AI #DataScience #LinkedData #DataStandards #RealTimeData #DataAPIs #DataMaturity #DataManagement #ChatGPT #DataEngineering

In the age of AI and data science, linked data acts as a unifying force: it empowers organisations to use data assets to their full extent, fosters collaboration, accelerates research, and enhances reusability of data.
#ODISummit2023 #ConnectedData #FAIRdata #AI #DataScience #LinkedData #DataStandards #RealTimeData #DataAPIs #DataMaturity #DataManagement #ChatGPT #DataEngineering
📝A nice collection of real-world production uses of @ApacheKafka and #ApacheFlink, collected and curated by Thanh Tung Dao 👏:
🌟 Flink: https://github.com/dttung2905/flink-at-scale
🌟 Kafka: https://github.com/dttung2905/kafka-in-production
This Tuesday evening we team up with The Data Lab on the latest Aberdeen Data Meetup at ONE Tech Hub .
The theme for this one is "How to get started in a career in data". Join us to network over pizza and drinks. Hear talks from Charlotte McLean and Lesley-Anne Kelly.
Thanks to ScotlandIS and CodeBase Aberdeen for supporting our sessions this year.
Book now https://ti.to/code-the-city/adm-nov-2023
#datascience #dataengineering #datavisualisation #datajournalism #aberdeen
#Tech #Technology #DevOps #Automation #BigData #DataAnalytics #Data #DataEngineering #AI #ML Chatbot Hallucinations Are Poisoning Web Search https://www.wired.com/story/fast-forward-chatbot-hallucinations-are-poisoning-web-search/?utm_source=dlvr.it&utm_medium=mastodon #ArtificialIntelligence #MachineLearning #DataScience
I went all the way into the office on a Friday because I have an important modification to make ASAP. I get to the office and find out that the entire server that feeds my UI is down for up to 24 hours and not even the prod instance is working.
#programming #dataengineering
Next Tuesday we team up with The Data Lab on the latest Aberdeen Data Meetup at ONE Tech Hub .
The theme for this one is "How to get started in a career in data".
Join us to network over pizza and drinks. Hear talks from Charlotte McLean and Lesley-Anne Kelly.
Thanks to ScotlandIS and CodeBase Aberdeen for supporting our sessions this year.
Book now https://ti.to/code-the-city/adm-nov-2023
#datascience #dataengineering #datavisualisation #datajournalism #aberdeen
We've been posting #TechTalk and other hopefully interesting posts on our blog https://www.epimorphics.com/blog/ but for those who prefer to use Medium see:
https://epimorphics.medium.com/
#ConnectedData #DataEngineering #LinkedData #DataStories #FAIRdata #Blog #Medium

☁️Y'know the whole snarky thing about "it's not the cloud, it's just someone else's computer"? 🙄
Well what if the Kafka you wanted to connect to was literally on… just someone else's computer? 😁
✍️Blogged: Using @ApacheKafka with #ngrok https://rmoff.net/2023/11/01/using-apache-kafka-with-ngrok/


#CsvDiff has finally reached v0.1.0, its first ever non-alpha/-beta release! 🎉
New features like getting at the headers from the diffresult have been needed for the following PR in qsv (which is in final review):
https://github.com/jqnatividad/qsv/pull/1395
When merged, you'll be able to decide whether the diffresult should output headers or not (see examples in the PR). :awesome:
Check out csv-diff's Changelog for the full details:
https://gitlab.com/janriemer/csv-diff/-/blob/main/CHANGELOG.md?ref_type=heads#010-30-october-2023
I'm impressed by the pace and innovation of the #debezium project—TIL it now ships with a JDBC *sink* for Kafka Connect: https://debezium.io/documentation/reference/stable/connectors/jdbc.html
Just a week until we team up with The Data Lab - Innovation Centre on the latest Aberdeen Data Meetup at ONE Tech Hub .
The theme for this one is "How to get started in a career in data".
Join us to network over pizza and drinks. Hear talks from Charlotte McLean and Lesley-Anne Kelly.
Thanks to ScotlandIS and CodeBase Aberdeen for supporting our sessions this year.
Book now https://ti.to/code-the-city/adm-nov-2023
#datascience #dataengineering #datavisualisation #datajournalism #aberdeen
Many of our pipeline steps follow the pattern of
```python
read_data_from_datastore()
# do exciting and ill-advised stuff here
write_data_back_to_datastore()
```
And frequently when debugging and/or deving, I'll comment out the write step for paranoia. And, as you would guess, I frequently commit the commented-out line. I suppose it's a soft failure, but very hard to debug.
Pipelines behave as they should, dependencies are called, 99% of the log lines you expect are there, but...
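One way to avoid committing a commented-out write is to make skipping it an explicit, logged option rather than a code edit. A minimal sketch, assuming a hypothetical `DRY_RUN` environment variable and in-memory stand-ins for the datastore functions named above:

```python
import logging
import os

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

# In-memory stand-in for the real datastore (hypothetical).
DATASTORE = {"input": [1, 2, 3], "output": None}

def read_data_from_datastore():
    return list(DATASTORE["input"])

def write_data_back_to_datastore(data):
    DATASTORE["output"] = data

def run_step(dry_run=None):
    """Run read -> transform -> write; skip the write loudly when dry_run is set."""
    if dry_run is None:
        # DRY_RUN is an assumed variable name, not an existing convention.
        dry_run = os.environ.get("DRY_RUN", "0") == "1"
    data = read_data_from_datastore()
    result = [x * 2 for x in data]  # placeholder for the exciting stuff
    if dry_run:
        log.warning("DRY_RUN set: skipping write of %d rows", len(result))
    else:
        write_data_back_to_datastore(result)
        log.info("wrote %d rows", len(result))
    return result

run_step(dry_run=True)
output_after_dry_run = DATASTORE["output"]
run_step(dry_run=False)
output_after_real_run = DATASTORE["output"]
```

Because the skipped write leaves a warning in the logs, a missing output is easier to trace than a silently commented-out call.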
🎉 #ApacheFlink 1.18 release notes: https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/ with tons of new goodies, including …
🌟 Time Travel! 🤯https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/table/sql/queries/time-travel/
🌟 JDBC driver for the SQL gateway https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/#introduce-flink-jdbc-driver-for-sql-gateway
🌟 Stored Procedure support for catalogs https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/table/procedures/ (already implemented in #ApachePaimon)
🌟 Flink SQL performance improvements https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/#performance-improvements--tpc-ds-benchmark
🌟 SQL Client "Quality of Life" improvements (DX) https://cwiki.apache.org/confluence/display/FLINK/FLIP-189%3A+SQL+Client+Usability+Improvements
#dataEngineering #openSource #streamingSQL #streamProcessing
Thanks for all the input! So it looks like in the #datascience and #dataengineering and related areas, #python seems to be superseding #rstats as a preferred skill set. As I teach data skills in the #LifeSciences with the key learning objective being an ability to process, visualise and analyse data in a reproducible and scalable way, we're not fixated on what methods we use, so going forwards, it's something to consider...
A great video by Simon Whiteley from Advancing Analytics on Data Lake architecture and how Advancing Analytics build lakes and the different stages of processing they go through.
A great takeaway is not getting hung up on the medallion architecture and confining yourself to the three steps of Bronze, Silver and Gold to process your data.
Ooof, it took a year to close this issue, completing the migration of our ETL and analysis assets into @dagster but we're finally done.
We still need to rename all the tables so there's some semblance of a clear naming convention, but you no longer have to install all the software required to run the ETL if you just want to access all of our outputs.
For a guy who already knows a bit of #sql, which book would you choose to start from the beginning and review concepts already forgotten?
Does Mastodon use fan-out write?
A great interview on the Data Engineering Podcast with the founder of CoverageCat, which liberates insurance coverage and pricing data from intentionally obfuscated data sources -- often published as scanned PDFs with constantly shifting structures.
Kindred spirits! And their mascot is also a cat!
https://www.dataengineeringpodcast.com/coveragecat-insurance-industry-data-engineering-episode-395
0️⃣0️⃣ - Chapter 00 - A New Frontier
🔗 https://voltrondata.com/codex/a-new-frontier
There are 100s of database systems available, and yet, companies with lots of data don't find one for their niche, and instead develop their own. It is a very expensive endeavor 💸 and all these systems are not that different. 🧵 2/...
#data cleaning techniques. 👇💡
Interested in getting into #DataEngineering? Come work with me!
I'm looking for a role leading an engineering team. I've got several years' experience both managing teams and as an individual contributor. Strong recent #BigData and #DataEngineering experience.
https://github.com/AppTrain
https://www.linkedin.com/in/apptrain
#FediHire #HireMe #FediHired #jobs #python #sql #javascript #nodejs, #dataMesh #APIs #MasterDataManagement #Spark #dbt #starRocks
On June 30th, we celebrated women+ in #dataengineering #datascience #machinelearning and #mlops at the Women+ in Data and AI summer festival. Let’s look back and reflect on how we bootstrapped a tech conference with a 100% women+ speaker lineup. Full story 👉 https://www.innoq.com/en/blog/2023/09/women-plus-in-data-and-ai-summer-festival-2023/
Do I know any #Rails shops using dbt-core for data engineering?
On Wednesday at ONE Tech Hub Robert McWilliam will explain how using the async library will speed up your code execution. We'll have pizza and drinks, then his talk, then you get a chance to try the code yourself. See how fast you can get your code running !
Book tickets now! https://ti.to/code-the-city/apug-sep-2023
#Python #coding #programming #pizza #networking #KnowledgeExchange #datascience #dataengineering #webdev
Dive into our annual #InfoQ #TrendsReport where we dissect the latest in #AI, #ML & #DataEngineering.
Discover key trends & insights that every software engineer, architect, or data scientist should keep an eye on: https://bit.ly/3Z9gadO
💪 Knowledge is power! #StayAhead of the curve!
Enormous kudos to Roland Meertens, Srini Penchikala, Sherin Thomas, Daniel Dominguez & Anthony Alford for their role in this report.
Last chance to book!
Tomorrow evening Aberdeen Data Meetup returns - with the postponed talk from Harris Assad on "Why Data Engineering is the Imperative Foundation for Data Science".
Join us for networking over pizza and drinks, followed by Harris' talk.
Book now https://dashboard.tito.io/code-the-city/adm-sep-2023

This Tuesday Aberdeen Data Meetup is back - with the postponed talk from Harris Assad on "Why Data Engineering is the Imperative Foundation for Data Science".
Join us for networking over pizza and drinks, followed by Harris' talk.
Book now https://dashboard.tito.io/code-the-city/adm-sep-2023

Whether you are a data scientist, web developer, astrophysicist, or pharmacological researcher, you need fast code execution. On 13th September Robert McWilliam will explain how using the async library speeds up code execution. We'll have pizza and drinks, then his talk, then you get a chance to try the code yourself. See how fast you can get your code running !
Book tickets now! https://ti.to/code-the-city/apug-sep-2023
#Python #coding #programming #KnowledgeExchange #datascience #dataengineering #webdev
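As a rough illustration of the kind of speed-up the talk describes, here is a minimal sketch that assumes "the async library" means Python's standard `asyncio`. The `fetch` coroutine and its delays are invented; it only simulates I/O by sleeping.

```python
import asyncio
import time

async def fetch(name, delay):
    """Pretend to do I/O (for example an HTTP call) by sleeping without blocking."""
    await asyncio.sleep(delay)
    return name

async def main():
    # The three waits overlap, so total time is close to the longest delay,
    # not the sum of all delays.
    return await asyncio.gather(
        fetch("a", 0.2), fetch("b", 0.2), fetch("c", 0.2)
    )

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

The gain comes from overlapping waits, so it applies to I/O-bound work; CPU-bound code generally needs other tools.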
An #introduction.
My name is Casper. I am a software engineer, currently working with #dataengineering.
Currently busy with #renovation of a 1911 house with my lovely wife and 3 year old daughter. Oh I have a new obsession, with improving #biodiversity in our small garden.🌻
I live in the city center, and identify (a bit too) strongly with being #carfree (but #cargobike owner)
Regarding #softwaredevelopment, I am inspired by #ddd , #xp #tdd , and other abbreviations. 👩💻
PyData Global 2023
December 6 - 8
CFP closes on October 1st
We welcome attendees with various experiences, expertise, and backgrounds to join us virtually. Users, contributors, and newcomers can share experiences and learn from one another to solve challenging problems and grow a stronger open-source community.
#pydata #numfocus #datascience #ai #artificialintelligence #conference #machinelearning #dataengineering
Next Tuesday evening Aberdeen Data Meetup is back - with the postponed talk from Harris Assad on "Why Data Engineering is the Imperative Foundation for Data Science".
Join us for networking over pizza and drinks, followed by Harris' talk.
Book now https://dashboard.tito.io/code-the-city/adm-sep-2023
Want to see how easy it is to sign up for PipeRider Cloud and share your first report?
In this video we take an existing dbt project and:
- Sign up for PipeRider Cloud
- Generate a Data Impact Report
- Upload the report and share it
All in a matter of minutes:
#PipeRider #DataQuality #DataEngineering #AnalyticsEngineering #DataOps #DataViz #DataTeam #DataLineage
One week until Aberdeen Data Meetup returns with the postponed talk from Harris Assad on "Why Data Engineering is the Imperative Foundation for Data Science".
Join us for networking over pizza and drinks, followed by Harris' talk.
Book now https://dashboard.tito.io/code-the-city/adm-sep-2023
Less than four weeks until our 30th Hack weekend; and it is all about Union Street and the City Centre. We've multiple challenges. We need you - no matter what your skills and experience. We can all play a part!
More details of the challenges and how to book: https://codethecity.org/what-we-do/hack-weekends/ctc30/
#Aberdeen #history #mapping #photography #planning #osm #datascience #dataengineering #heritage
Less than two weeks until Aberdeen Data Meetup is back - with the postponed talk from Harris Assad on "Why Data Engineering is the Imperative Foundation for Data Science".
Book now https://dashboard.tito.io/code-the-city/adm-sep-2023
Just over a month until our 30th Hack weekend; and it is all about Union Street and the City Centre. We've multiple challenges. We need you - no matter what your skills and experience. We can all play a part! More details of the challenges and how to book: https://codethecity.org/what-we-do/hack-weekends/ctc30/ #Aberdeen #history #mapping #photography #planning #osm #datascience #dataengineering #heritage

Hey #EnergyMastodon, we finally got set up here so I guess we should do an #introduction
Catalyst is an all-remote worker #Cooperative focused on open #DataEngineering, analysis, and open source software. We mostly work with public US data related to the energy system, pulling together information from #FERC, #EIA, #EPA, and other agencies and trying to turn it into analysis-ready database tables so advocates, policymakers, & researchers can focus on their own novel work instead.
Hey #pandas #dataengineering people - is anyone *happily* using PyArrow `decimal128` types or similar for fixed point arithmetic in DataFrames?
All boosts appreciated for visibility 🔁 🙏
After multiple hours of digging and testing, I think I'm going to recommend our work project avoids them and uses plain Python `Decimal` in DF, even though there's no vectorization 😞
More thoughts in my latest comment in this GH issue thread: https://github.com/intake/pandas-decimal/issues/3#issuecomment-1590957624
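To make the trade-off concrete, here is a small sketch using only the standard-library `decimal` module (no pandas or PyArrow), with made-up values: `Decimal` keeps fixed-point sums exact where binary floats drift, at the cost of per-element Python arithmetic.

```python
from decimal import Decimal

# Made-up amounts for illustration.
amounts = ["0.10", "0.20", "0.30"]

float_total = sum(float(a) for a in amounts)
decimal_total = sum(Decimal(a) for a in amounts)

print(float_total == 0.6)                # False: binary floats accumulate error
print(decimal_total == Decimal("0.60"))  # True: fixed-point sum is exact
```

In a DataFrame, `Decimal` values live in an object-dtype column, which is why operations on them run element by element rather than vectorized.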
What does your testing strategy look like for dbt?
On the one hand I'd like to make sure all models are covered by tests but, on the other hand, maintaining tests and test rot is a big concern.
Do you test all models?
Do you care if some models are not covered by tests?
Any input on strategy welcome 🙏
🔎 🗺️ The State of Data Engineering 2023 - a nice round up and review from Einat Orr of #dataengineering in 2023
https://lakefs.io/blog/the-state-of-data-engineering-2023/?utm_campaign=Social%20media%20activity&utm_source=mastodon&utm_medium=social&utm_content=blog_rm-SoDE23_230524
#datadon #analyticsengineering #opensource #data
Data Folks! I need your help!
If I gave you a date in the format of 1220923, would you be able to make sense of it? I desperately need human-readable dates, and I'm losing my mind.
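One guess, and it is only a guess, is that this is the CYYMMDD layout used by some IBM i (AS/400) systems, where the leading digit is a century flag (0 for 19xx, 1 for 20xx). Under that assumption, 1220923 would read as 2022-09-23. A sketch of a parser, with `parse_cyymmdd` as a made-up helper name:

```python
from datetime import date

def parse_cyymmdd(value):
    """Parse a 7-digit CYYMMDD value, where C is a century flag (0 = 19xx, 1 = 20xx)."""
    # Values from the 1900s may arrive as 6 digits once the leading 0 is dropped.
    text = str(value).zfill(7)
    if len(text) != 7 or not text.isdigit():
        raise ValueError(f"expected 7 digits, got {value!r}")
    century = 1900 + 100 * int(text[0])
    year = century + int(text[1:3])
    return date(year, int(text[3:5]), int(text[5:7]))

parsed = parse_cyymmdd(1220923)
print(parsed.isoformat())
```

If the source system uses a different layout, `date(...)` will raise `ValueError` on impossible months or days, which is a useful early warning that the guess is wrong.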
The essential component of any data toolkit: Duck[DB] Tape!
I have a bunch of utilities in a Python package which I use across multiple Python projects. So far I have used the utilities package as an editable install, or sometimes just copied the source for the required utilities into a project.
This approach is breaking down as I start using CI/CD pipelines.
What is the best way forward?
I feel like the correct answer is "Install from private PyPi" but I'm interested to hear your experience
I will be giving the keynote presentation at the Connected Enterprise Forum in Sydney on 23 March. Come join us!
Would be great to see you there!
RSVP now. Free to attend.
https://streamsets.com/forum/sydney/
#ai #future #DataEngineering #BigData #DataAnalytics #ML #generativeAI
I never bothered with optimizing the parsing of #jsonl #ndjson files because in most cases it was a one-off task before I put the data into a database or parquet file. But the files got bigger and waiting 20 minutes for the data to load made me reconsider my decision. So I tried some different approaches.
Tested with a 1.7 GB file of 300 k Tweets.
jsonlines.reader: 17.2 seconds 100%
orjson: 6.49 seconds 37%
msgspec: 3.06 seconds 17%
I like how orjson cuts the time by two thirds without the need to change anything else. Just use it as a drop in replacement and you are good.
msgspec is twice as fast as orjson or six times as fast as jsonlines if you define the schema of the data that you want. For Tweets that's okay, as I can reuse the schema many times. With data that is used only once, I prefer orjson.
Memory usage was nearly identical across the different solutions. Probably because they all parse the data per line. I restarted the kernel each time to get comparable numbers.
Load time for all 23 million Tweets in the dataset was reduced from 25 to 4 minutes.
This blog post was useful to me: https://pythonspeed.com/articles/faster-python-json-parsing/ #Python #DataEngineering
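The drop-in swap described above can be sketched roughly like this. The sample records are made up, and the code falls back to the standard-library `json` module when the third-party `orjson` package is not installed, so it is not a reproduction of the benchmark itself.

```python
import io
import json

try:
    import orjson  # third-party; optional
    loads = orjson.loads
except ImportError:
    loads = json.loads  # standard-library fallback

def read_jsonl(lines):
    """Yield one parsed object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield loads(line)

# Made-up sample records standing in for a real .jsonl file.
sample = io.StringIO('{"id": 1, "text": "hello"}\n\n{"id": 2, "text": "world"}\n')
records = list(read_jsonl(sample))
print(records)
```

Parsing line by line keeps only one record in memory at a time, which fits the observation that memory usage was similar across parsers.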
I am considering a subscription to #medium What do people think? Is it worth it? I use their #software and #dataEngineering articles when suggested by my browser. So far I’ve found them useful but those articles could be a sub sample of the good ones that make it into Google’s recommendation. So far if I find a good one and I cannot access it anymore (you can only access a few per week (?) without subscription), I just open it in a private tab. Is the subscription worth this tiny extra effort?
Today I got my first paid subscriber on my Substack newsletter https://interestingdatagigs.substack.com/subscribe
All my content is free.
So: I’m incredibly grateful that someone still wants to support my work.
Today was a good day. 🙏🏾🙏🏾🙏🏾
Data is complicated, esp. at scale & w/ lots of ppl in the mixture. A data platform needs to meet so many diff needs, esp. for enterprises - From @adipolak #datadaytexas #ddtx2023 #dataops #dataengineering
@ike At my previous company (~1,000 employees), our #DataEngineering team (~50 engineers) used Airflow and Astronomer (and Spark) to design and run ETL and model training DAGs. In many cases, we needed to do a lot of processing of flat files provided by third parties into our data warehouse. We also had some pipelines to export data to vendors that we used. Airflow helped us manage dependencies between pipelines, as well as monitor and debug pipelines. Disclaimer: IANADataEngineer
If the answer is similar to:
1️⃣ ASAP
2️⃣ Minimal
3️⃣ Divorced
4️⃣ We can't
5️⃣ Less than 5
Then your first step shouldn't be building an ML platform, it should be developing models or ML-driven product features using the simplest, tried & true patterns possible.
#mlops #mlplatform #datascience #mlengineering #platformengineering #dataengineering #ai #mlinproduction
Now I’ve been here a few weeks, found the fire exits, amenities and snacks, I should probably add an #introduction post.
Hello 👋 I’m Henry, and professionally I do #DataScience and #DataEngineering in the #Aviation industry, mostly with #python and #pyspark
Unprofessionally I spend my time #parenting two children along with my wife and occasionally manage to find time to play #guitar 🎸 improve my use of #emacs and hack at too many half forgotten #maker projects.
🦆🦆🦆 New year, a new "This Month in the DuckDB ecosystem" issue on MotherDuck https://motherduck.com/blog/duckdb-ecosystem-newsletter-two/
Highlights of the month: 🧵
✅ 2 amazing members of the community you will love to read about: @matsonj & @markhneedham
✅ A big milestone was reached for the DuckDB Python package https://twitter.com/peterabcz/status/1610684170350596110
✅ An amazing video tutorial from the one and only Marc Lamberti about how to start with DuckDB
from scratch https://www.youtube.com/watch?v=AjsB6lM2-zw
So I'm hiring for a role in my team and got my first batch of (thankfully shortlisted) CVs from our HR team today. I don't mind the recruitment process - I love speaking to new people and talking about their stories & experience, it's really exciting. But hoo boy, working my way through a list of mostly badly written PDFs is a grind. 😪
As a hiring manager, a token piece of advice for any job hunters out there - please make your CV skimmable! I'm mainly looking for critical keywords related to the role. ⌨️ Try using dot points, with the relevant words bolded, that explain what you did. Remove ALL of the waffly sentences; leave (some of) the waffle for the interview. Less is more.
All that said, if you have a bit of #DataEngineering experience, you're in Australia and looking for a junior role, DM me! 😃 I'll definitely enjoy a conversation on Mastodon more than skimming through a sanitised resume.